The first step of retrospective harmonization is finding the relevant concepts, operationalized in questions that need to be harmonized among two or more surveys.
examples_dir <- system.file("examples", package = "retroharmonize")
survey_files <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]
survey_files
#> [1] "ZA5913.rds" "ZA6863.rds" "ZA7576.rds"
With smaller data frames representing your surveys, the most efficient way to work with the information is to read them into a list of surveys.
Read the surveys into a list object in memory:
survey_paths <- file.path(examples_dir, survey_files)
example_surveys <- read_surveys(survey_paths, .f = "read_rds")
#> Warning: Unknown or uninitialised column: `rowid`.
#> Unknown or uninitialised column: `rowid`.
#> Unknown or uninitialised column: `rowid`.
If you might run out of memory, you can work with the files instead. The advantage of keeping the surveys in memory is that it will be much faster to continue working with them later, but from the metadata point of view, the returned object is the same either way.
#not evaluated
example_metadata <- metadata_create (survey_paths = survey_paths, .f = "read_rds")
#> Warning: Unknown or uninitialised column: `rowid`.
#> Read: /tmp/Rtmp8U8wWC/Rinst11e862c69a74/retroharmonize/examples/ZA5913.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> Read: /tmp/Rtmp8U8wWC/Rinst11e862c69a74/retroharmonize/examples/ZA6863.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> Read: /tmp/Rtmp8U8wWC/Rinst11e862c69a74/retroharmonize/examples/ZA7576.rds
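The file-based route can be illustrated with a minimal base R sketch. It is not how metadata_create() is implemented; describe_file() is a hypothetical helper, and the two toy files stand in for real survey files. The point is only that each file is read, summarised, and released before the next one is opened.

```r
# Hypothetical helper: summarise one survey file; the survey itself goes out
# of scope when the function returns, so only one file is in memory at a time.
describe_file <- function(path) {
  survey <- readRDS(path)
  data.frame(
    filename      = basename(path),
    var_name_orig = names(survey),
    class_orig    = vapply(survey, function(x) class(x)[1], character(1)),
    row.names     = NULL,
    stringsAsFactors = FALSE
  )
}

# Two toy "surveys" saved to temporary files (assumed data, not the ZA files).
tmp <- tempfile(c("survey_a_", "survey_b_"), fileext = ".rds")
saveRDS(data.frame(uniqid = 1:3, d8 = c(18L, 21L, 99L)), tmp[1])
saveRDS(data.frame(uniqid = c("a", "b"), w1 = c(0.9, 1.1),
                   stringsAsFactors = FALSE), tmp[2])

# One metadata row per variable, collected file by file.
file_metadata <- do.call(rbind, lapply(tmp, describe_file))
file_metadata
```

The resulting table has the same shape whether the surveys were held in a list or read sequentially, which is why the choice only affects speed and memory use.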
Let’s work in memory now. Map the metadata contents of the files:
set.seed(2022)
metadata_create(survey_list = example_surveys) %>%
dplyr::sample_n(12)
#> filename id var_name_orig class_orig
#> 1 ZA6863.rds ZA6863 qa8a_3 haven_labelled_spss
#> 2 ZA6863.rds ZA6863 qd7.8 haven_labelled
#> 3 ZA5913.rds ZA5913 p3 haven_labelled_spss
#> 4 ZA7576.rds ZA7576 qd6.3 haven_labelled_spss
#> 5 ZA5913.rds ZA5913 qa10_2 haven_labelled_spss
#> 6 ZA5913.rds ZA5913 p4 haven_labelled
#> 7 ZA7576.rds ZA7576 p2 haven_labelled
#> 8 ZA7576.rds ZA7576 qa6a_8 haven_labelled_spss
#> 9 ZA5913.rds ZA5913 doi character
#> 10 ZA6863.rds ZA6863 d60 haven_labelled
#> 11 ZA5913.rds ZA5913 qd3_12 haven_labelled
#> 12 ZA7576.rds ZA7576 d8 haven_labelled
#> var_label_orig labels
#> 1 trust_in_institutions_army 1, 2, 3, 9
#> 2 important_values_pers_solidarity 0, 1
#> 3 duration_of_interview_minutes 2, 225, 999
#> 4 important_values_pers_human_rights 0, 1, 9
#> 5 european_commission_trust 1, 2, 3
#> 6 n_of_persons_present_during_interview 1, 2, 3, 4
#> 7 time_of_interview 1, 2, 3, 4, 5, 6, 8
#> 8 trust_in_institutions_national_government 1, 2, 3, 9
#> 9 digital_object_identifier NA
#> 10 difficulties_paying_bills_last_year 1, 2, 3, 7
#> 11 important_values_pers_respect_for_cultures 0, 1
#> 12 age_education 0, 2, 89, 97, 98, 99
#> valid_labels na_labels na_range n_labels n_valid_labels n_na_labels
#> 1 1, 2, 3 9 NA 4 3 1
#> 2 0, 1 NA 2 2 0
#> 3 2, 225 999 NA 3 2 1
#> 4 0, 1 9 NA 3 2 1
#> 5 1, 2 3 NA 3 2 1
#> 6 1, 2, 3, 4 NA 4 4 0
#> 7 1, 2, 3, 4, 5, 6, 8 NA 7 7 0
#> 8 1, 2, 3 9 NA 4 3 1
#> 9 NA NA NA 0 0 0
#> 10 1, 2, 3, 7 NA 4 4 0
#> 11 0, 1 NA 2 2 0
#> 12 0, 2, 89, 97, 98, 99 NA 6 6 0
The current retroharmonize uses the metadata_create() function to restore the encoded metadata into a tidy table that can be the start of further steps. This function should be revised after much use, brought to a simpler format, and renamed, preferably with a term from the DDI Glossary. (Ingest? Or just mapping? Should not contain any tidyverse verbs.)
C2: With the variables selected from the metadata table (which needs a better name), we subset the surveys either in memory or, in the case of many files, sequentially from file. This is the task of the subset_survey() function. It will need a thorough upgrade to correctly retain the attributes of the new, datacube-inherited survey class, but it functions well.
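The in-memory subsetting step can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: subset_one() is a hypothetical helper, the surveys are toy data frames, and the "id" attribute is an assumed example of survey-level metadata that has to survive the subsetting.

```r
# Toy stand-ins for surveys held in memory.
survey_1 <- data.frame(uniqid = 1:2, d8 = c(18L, 99L), w1 = c(0.9, 1.1))
survey_2 <- data.frame(uniqid = 3:4, d8 = c(21L, 25L), p2 = c(1L, 3L))
attr(survey_1, "id") <- "survey_1"
attr(survey_2, "id") <- "survey_2"
toy_surveys <- list(survey_1, survey_2)

# Hypothetical helper: keep the selected variables that a survey actually has,
# then set the survey-level attribute explicitly instead of relying on `[`.
subset_one <- function(survey, keep) {
  out <- survey[, intersect(names(survey), keep), drop = FALSE]
  attr(out, "id") <- attr(survey, "id")
  out
}

selected_vars <- c("uniqid", "d8", "w1")
subsetted <- lapply(toy_surveys, subset_one, keep = selected_vars)
lapply(subsetted, names)
```

Setting the attribute explicitly is the design choice worth noting: a richer survey class would need the same care for every attribute it carries.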
This stage should be harmonized with the DDI Codebook. One problem, as I see it, is that DDI uses the term “codebook” differently than we do. DDI uses the term codebook at the level of the file (survey), and we use it at the level of individual observations.
Codebook: A document that provides information on the structure, contents, and layout of a data file. Source: DDI Glossary.
Here is a DDI Codebook example in PDF.
Because we normally want to use standardized codes, and we have started to harmonize with the SDMX statistical metadata standard, a good resolution seems to be to differentiate between a Codebook (DDI term) and a Codelist (SDMX term, but I am sure it has a more general RDF definition).
We roughly have a DDI Codebook regarding the concepts and question items, but the numbers of Valid and Invalid responses were not collected at ingestion:
library(dplyr)
set.seed(12)
# Bare column names replace .data$ inside select(), which tidyselect deprecated
example_metadata %>%
  select(Filename = filename,
         Name = var_name_orig,
         Label = var_label_orig,
         Type = class_orig,
         Format = labels) %>%
  mutate(Valid = NA_real_,
         Invalid = NA_real_,
         Question = NA_character_) %>%
  sample_n(12)
#> Filename Name Label
#> 1 ZA7576.rds serialid serial_case_id_appointed_by_kantar
#> 2 ZA6863.rds qd7.13 important_values_pers_none_spont
#> 3 ZA7576.rds isocntry country_code_iso_3166
#> 4 ZA6863.rds qd7.2 important_values_pers_respect_human_life
#> 5 ZA5913.rds qd3_14 important_values_pers_dk
#> 6 ZA7576.rds w3 weight_germany
#> 7 ZA6863.rds w1 weight_result_from_target_redressment
#> 8 ZA7576.rds qd6.5 important_values_pers_democracy
#> 9 ZA7576.rds qd6.1 important_values_pers_rule_of_law
#> 10 ZA6863.rds doi digital_object_identifier
#> 11 ZA7576.rds qa6b_2 trust_in_institutions_european_union_tcc
#> 12 ZA7576.rds qa6a_5 trust_in_institutions_army
#> Type Format Valid Invalid Question
#> 1 numeric NA NA NA <NA>
#> 2 haven_labelled 0, 1 NA NA <NA>
#> 3 character NA NA NA <NA>
#> 4 haven_labelled 0, 1 NA NA <NA>
#> 5 haven_labelled 0, 1 NA NA <NA>
#> 6 numeric NA NA NA <NA>
#> 7 numeric NA NA NA <NA>
#> 8 haven_labelled_spss 0, 1, 9 NA NA <NA>
#> 9 haven_labelled_spss 0, 1, 9 NA NA <NA>
#> 10 character NA NA NA <NA>
#> 11 haven_labelled_spss 1, 2, 3, 9 NA NA <NA>
#> 12 haven_labelled_spss 1, 2, 3, 9 NA NA <NA>
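The empty Valid and Invalid columns could be filled at ingestion by counting responses against the declared missing codes. The sketch below assumes a hypothetical helper, count_responses(), and toy data; the codes 1, 2, 3 (valid) and 9 (declared missing) follow the pattern of the trust items in the metadata table.

```r
# Hypothetical helper: count valid and invalid responses of one variable.
# `na_codes` are the codes that the metadata lists under na_labels.
count_responses <- function(x, na_codes = integer(0)) {
  invalid <- is.na(x) | x %in% na_codes
  c(Valid = sum(!invalid), Invalid = sum(invalid))
}

# Toy trust item: 1, 2, 3 are substantive answers, 9 is a declared missing code.
trust_army <- c(1, 2, 2, 9, 1, NA, 9, 3)
response_counts <- count_responses(trust_army, na_codes = 9)
response_counts
```

System missing values (NA) and declared missing codes are both counted as invalid here; a real implementation might want to report them separately.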
The DDI Codebook is, however, a lot more, because it contains survey-level metadata that we have not used in retroharmonize so far. We assumed that the user (researcher) did a comparison of sampling methods, collection modes, etc., which are all part of the DDI Codebook standard.
It would be very easy to write a codebook_create() function that would create a partial DDI codebook as a component of a future DDI Codebook function codebook_create_ddi() and keep working with this.
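A rough idea of what such a function could look like, in base R and without tidyverse verbs, is sketched below. The name codebook_create_sketch() and the helper format_labels() are hypothetical, and the input is assumed to be a metadata table with the columns shown earlier (filename, var_name_orig, var_label_orig, class_orig, labels); the two toy rows reuse values from the ZA7576 output above.

```r
# Hypothetical helper: collapse one set of value codes into a single string.
format_labels <- function(x) {
  if (all(is.na(x))) NA_character_ else paste(x, collapse = ", ")
}

# Sketch of a partial, DDI-style codebook built from a metadata table.
codebook_create_sketch <- function(metadata) {
  data.frame(
    Filename = metadata$filename,
    Name     = metadata$var_name_orig,
    Label    = metadata$var_label_orig,
    Type     = metadata$class_orig,
    Format   = vapply(metadata$labels, format_labels, character(1)),
    Valid    = NA_real_,       # not collected at ingestion yet
    Invalid  = NA_real_,       # not collected at ingestion yet
    Question = NA_character_,  # would come from a question bank
    stringsAsFactors = FALSE
  )
}

toy_metadata <- data.frame(
  filename       = "ZA7576.rds",
  var_name_orig  = c("qa6a_5", "isocntry"),
  var_label_orig = c("trust_in_institutions_army", "country_code_iso_3166"),
  class_orig     = c("haven_labelled_spss", "character"),
  stringsAsFactors = FALSE
)
toy_metadata$labels <- list(c(1, 2, 3, 9), NA)

toy_codebook <- codebook_create_sketch(toy_metadata)
toy_codebook
```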
However, we have a problem: the current, released retroharmonize has a more complex create_codebook() function. This should be deprecated.
set.seed(12)
my_codebook <- create_codebook (
survey = read_rds (
system.file("examples", "ZA7576.rds",
package = "retroharmonize")
)
)
#> Warning: Unknown or uninitialised column: `rowid`.
sample_n(my_codebook, 12)
#> # A tibble: 12 × 12
#> entry id filename var_name_orig var_label_orig val_code_orig
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 29 ZA7576 ZA7576.rds qa6a_4 trust_in_institutions_po… 1
#> 2 11 ZA7576 ZA7576.rds nuts region_nuts_codes TR82
#> 3 11 ZA7576 ZA7576.rds nuts region_nuts_codes TR41
#> 4 11 ZA7576 ZA7576.rds nuts region_nuts_codes LV005
#> 5 11 ZA7576 ZA7576.rds nuts region_nuts_codes FR23
#> 6 29 ZA7576 ZA7576.rds qa6a_4 trust_in_institutions_po… 9
#> 7 11 ZA7576 ZA7576.rds nuts region_nuts_codes AL033
#> 8 35 ZA7576 ZA7576.rds qa6b_3 trust_in_institutions_un… 3
#> 9 11 ZA7576 ZA7576.rds nuts region_nuts_codes EL13
#> 10 11 ZA7576 ZA7576.rds nuts region_nuts_codes BE35
#> 11 11 ZA7576 ZA7576.rds nuts region_nuts_codes BE21
#> 12 12 ZA7576 ZA7576.rds d7 marital_status 8
#> # ℹ 6 more variables: val_label_orig <chr>, label_range <chr>,
#> # na_range <named list>, n_labels <dbl>, n_valid_labels <dbl>,
#> # n_na_labels <dbl>
The tasks that we perform with this information are variable name and variable label harmonization.
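metadata <- metadata_create(survey_list = example_surveys)
# The original variable names are stored in the var_name_orig column
metadata$var_name_suggested <- label_normalize(metadata$var_name_orig)
# Note: the label column is var_label_orig; metadata$label_orig matches no rows,
# so this assignment leaves the names printed below unchanged
metadata$var_name_suggested[metadata$label_orig == "age_education"] <- "age_education"
harmonized_example_surveys <- harmonize_var_names(survey_list = example_surveys,
                                                  metadata = metadata)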
lapply(harmonized_example_surveys, names)
#> [[1]]
#> [1] "doi" "version" "uniqid" "isocntry" "p1" "p3"
#> [7] "p4" "nuts" "d7" "d8" "d25" "d60"
#> [13] "qa10_3" "qa10_2" "qa10_1" "qa7_4" "qa7_2" "qa7_3"
#> [19] "qa7_1" "qa7_5" "qd3_1" "qd3_2" "qd3_3" "qd3_4"
#> [25] "qd3_5" "qd3_6" "qd3_7" "qd3_8" "qd3_9" "qd3_10"
#> [31] "qd3_11" "qd3_12" "qd3_13" "qd3_14" "w1" "w3"
#> [37] "rowid"
#>
#> [[2]]
#> [1] "doi" "version" "uniqid" "serialid" "isocntry" "p1"
#> [7] "p2" "p3" "p4" "nuts" "d7" "d8"
#> [13] "d25" "d60" "qa14_3" "qa14_2" "qa14_1" "qa8a_3"
#> [19] "qa8a_9" "qa8b_2" "qa8a_1" "qa8a_7" "qa8a_8" "qa8a_2"
#> [25] "qa8a_5" "qa8b_1" "qa8a_4" "qa8a_6" "qa8a_10" "qa8b_3"
#> [31] "qd7 1" "qd7 2" "qd7 3" "qd7 4" "qd7 5" "qd7 6"
#> [37] "qd7 7" "qd7 8" "qd7 9" "qd7 10" "qd7 11" "qd7 12"
#> [43] "qd7 13" "qd7 14" "w1" "w3" "wex" "rowid"
#>
#> [[3]]
#> [1] "doi" "version" "uniqid" "caseid" "serialid" "isocntry"
#> [7] "p1" "p2" "p3" "p4" "nuts" "d7"
#> [13] "d8" "d25" "d60" "qa14_5" "qa14_3" "qa14_2"
#> [19] "qa14_4" "qa14_1" "qa6a_5" "qa6a_10" "qa6b_2" "qa6a_3"
#> [25] "qa6a_1" "qa6b_4" "qa6a_8" "qa6a_9" "qa6a_4" "qa6a_2"
#> [31] "qa6b_1" "qa6a_6" "qa6a_7" "qa6a_11" "qa6b_3" "qd6 1"
#> [37] "qd6 2" "qd6 3" "qd6 4" "qd6 5" "qd6 6" "qd6 7"
#> [43] "qd6 8" "qd6 9" "qd6 10" "qd6 11" "qd6 12" "qd6 13"
#> [49] "qd6 14" "qg1b" "qg8" "w1" "w3" "wex"
#> [55] "rowid"
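The logic of variable name harmonization can also be shown with a small crosswalk table in base R. This is a sketch, not the internals of harmonize_var_names(): the helper rename_with_crosswalk(), the toy surveys, and the target names are all assumptions made for illustration.

```r
# Toy surveys in which the same concept appears under different names.
survey_a <- data.frame(uniqid = 1:2, qa10_2 = c(1L, 2L), d8 = c(18L, 21L))
survey_b <- data.frame(uniqid = 3:4, qa14_2 = c(2L, 1L), d8 = c(25L, 30L))

# Assumed crosswalk: one row per survey variable that should be renamed.
crosswalk <- data.frame(
  id              = c("survey_a", "survey_a", "survey_b", "survey_b"),
  var_name_orig   = c("qa10_2", "d8", "qa14_2", "d8"),
  var_name_target = c("trust_ec", "age_education", "trust_ec", "age_education"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename the variables of one survey that appear in the
# crosswalk and leave all other names untouched.
rename_with_crosswalk <- function(survey, survey_id, crosswalk) {
  rows <- crosswalk[crosswalk$id == survey_id, ]
  hit  <- match(names(survey), rows$var_name_orig)
  names(survey)[!is.na(hit)] <- rows$var_name_target[hit[!is.na(hit)]]
  survey
}

harmonized <- list(
  survey_a = rename_with_crosswalk(survey_a, "survey_a", crosswalk),
  survey_b = rename_with_crosswalk(survey_b, "survey_b", crosswalk)
)
lapply(harmonized, names)
```

After renaming, both toy surveys share the same variable names, which is the precondition for binding them together later.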
There is, however, an important extra step, which the DDI Codebook calls Type and Format matching. This depends on the software or programming language, but our codebook could easily accommodate it by containing the generic DDI Codebook Type and Format next to the R type:
data.frame (
Type = rep("discrete", 3),
Format = c("numeric-1.0", "numeric-2.0", "numeric-6.0"),
r_type = rep("integer",3),
range = c("0..9", "10..99", "100000..999999" )
) %>% knitr::kable()
Type | Format | r_type | range |
---|---|---|---|
discrete | numeric-1.0 | integer | 0..9 |
discrete | numeric-2.0 | integer | 10..99 |
discrete | numeric-6.0 | integer | 100000..999999 |
These variables can be mapped either to our labelled_spss_survey class or Adrian Dusa’s declared.
Considerations:
- The labelled_spss_survey or declared class is necessary because R does not have a missing case identifier that can distinguish declined answers or answers that were not collected.
- There must be a clear coercion (without “lazy” and ambiguous coercion) to at least R integer, numeric, character or factor classes for further use in R’s statistical functions or visualization functions.
- Integers can easily be coerced into characters, but this is not necessarily a good idea, because some functions want a numeric input anyway, and characters require a lot more space to be stored in memory or in a file.
We can assume that we only use integer representation for coded questionnaire items, but we may still have open text responses or observation identifiers that are character vectors. It is likely that the use of character-represented identifiers is a better idea in later stages. So we must work with a class that can be converted (coerced) into both integer (numeric) and character formats.
The choice has profound consequences for variable label harmonization and the harmonization of codelists, but not at the level of concepts, questions and codebooks.
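What an unambiguous coercion could mean in practice is sketched below in base R. It does not use the labelled_spss_survey or declared classes; the three helper functions are hypothetical, and the codes and labels (1 = Male, 2 = Female, 9 = DK as a declared missing code) are toy values.

```r
# A coded questionnaire item with its codelist and declared missing code.
gender_codes <- c(1L, 2L, 9L, 2L)
code_labels  <- c(Male = 1L, Female = 2L, DK = 9L)
na_codes     <- 9L

# To integer: declared missing codes become NA, valid codes are kept.
as_integer_valid <- function(x, na_codes) {
  x[x %in% na_codes] <- NA_integer_
  x
}

# To character: every code, including the missing code, becomes its label.
as_character_labelled <- function(x, labels) {
  names(labels)[match(x, labels)]
}

# To factor: only valid codes become levels; missing codes become NA.
as_factor_valid <- function(x, labels, na_codes) {
  valid <- labels[!labels %in% na_codes]
  factor(x, levels = valid, labels = names(valid))
}

as_integer_valid(gender_codes, na_codes)
as_character_labelled(gender_codes, code_labels)
as_factor_valid(gender_codes, code_labels, na_codes)
```

Each target class makes a different decision about the declared missing code, which is why the coercion rules have to be explicit rather than left to R's defaults.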
data.frame (
Type = rep("discrete", 3),
Format = c("numeric-1.0", "numeric-2.0", "numeric-6.0"),
r_type = rep("declared",3),
range = c("Male|Female|DK", "10..99", "100000..999999" )
) %>% knitr::kable()
Type | Format | r_type | range |
---|---|---|---|
discrete | numeric-1.0 | declared | Male|Female|DK |
discrete | numeric-2.0 | declared | 10..99 |
discrete | numeric-6.0 | declared | 100000..999999 |
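The Format column in the tables above could be derived from the data instead of being typed by hand. The sketch below assumes that "numeric-w.0" stands for an integer code that is w digits wide with no decimals, which is how the example rows read; ddi_format() is a hypothetical helper, not a retroharmonize or DDI function.

```r
# Hypothetical helper: derive a "numeric-w.0" format string from integer codes,
# where w is the number of digits of the widest code (assumed convention).
ddi_format <- function(x) {
  x <- as.integer(x[!is.na(x)])
  width <- max(nchar(as.character(abs(x))))
  paste0("numeric-", width, ".0")
}

ddi_format(c(0L, 5L, 9L))
ddi_format(c(10L, 99L))
ddi_format(c(100000L, 999999L))
```

The three calls correspond to the three ranges in the tables (0..9, 10..99 and 100000..999999).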
Question banks contain information about questions asked about the same concepts in different surveys.
“Using DDI as a foundation for a question bank enables you to reuse metadata and to find identical and similar questions and or response sets across surveys for purposes of data comparison, harmonization, or new questionnaire development.”
Social Science Variables Database: (located at ICPSR) Search over 4 million variables. Also able to compare questions across studies and series. http://www.icpsr.umich.edu/icpsrweb/ICPSR/ssvd/index.jsp
UK Data Service Variable and Question Bank: Search hundreds of surveys. http://discover.ukdataservice.ac.uk/variables
Survey Data Netherlands: Over 36,000 questions to search. http://surveydata.nl
Obviously, we should facilitate the use of existing question banks, and create question banks that are interoperable with existing ones.
Let’s take a look at concerts in the Eurobarometer series. https://www.icpsr.umich.edu/web/ICPSR/series/26/variables?q=concert Here is the variable that we use in our use case: https://www.icpsr.umich.edu/web/ICPSR/studies/35505/datasets/0001/variables/QB1_4?archive=icpsr
A short caveat: the questionnaire item may or may not be copyright protected. The reuse of the questionnaire requires further research.
Here we have Values (1…5) and their labels (Not in the last 12 months, 1-2 times, etc.).
The question bank information already contains information for the next step, the harmonization of value labels and codelists, which is not covered in this vignette.
We should follow the rOpenSci Packages: Development, Maintenance, and Peer Review for future changes. In designing and deprecating functions, the relevant parts are:
- create_codebook() will be deprecated; luckily, it does not meet the rOpenSci object_verb suggestion.
- codebook_create() will create a DDI-Codebook compatible, partial codebook, only covering tasks that are relevant for retroharmonize. The core of the codebook will be compatible with DDI-Codebook, but further information about the R specific implementation of the codebook will be added.
- codebook_export_ddi() will add further data (whatever we have but do not use) to make a more complete, but not necessarily complete DDI Codebook object. [Not a high priority now.]