The retroharmonize package ships with small subsamples of three Eurobarometer surveys, each containing a few variables and a limited set of responses. They are not as interesting as the full datasets – they serve testing and illustration purposes.
Survey data, i.e., data derived from questionnaires or systematic data collection, such as inspecting objects in nature or recording prices at shops, are usually stored in databases and converted to complex files that retain at least the coding and labelling metadata together with the data. These files must be imported into R so that the appropriate harmonization tasks can be carried out with the appropriate R types.
Survey harmonization almost always requires working with several source files. Harmonizing their contents is important because if the contents of these files do not match, they cannot be joined, integrated, or bound together.
Our importing functions, read_csv, read_rds, read_spss and read_dta, slightly modify the read.csv, readRDS, haven::read_spss and haven::read_dta importing functions. Instead of importing into a data.frame or a tibble, they import into an inherited data frame called survey. The survey class works as a data frame, but tries to retain as much metadata as possible for future harmonization steps and resource planning, for example, the original source file names.
You can find the package illustration files with system.file().
examples_dir <- system.file("examples", package = "retroharmonize")
survey_files <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]
survey_files
#> [1] "ZA5913.rds" "ZA6863.rds" "ZA7576.rds"
The read_surveys() function calls the appropriate importing function (based on the file extension of the survey files) and reads the surveys into a list in memory. If you work with many files and want to keep working with the survey files sequentially, it is a good idea to convert them to R objects first; this is how you would handle large SPSS or Stata files.
Our example surveys are small and easily fit into memory.
example_surveys <- read_surveys(
survey_paths = file.path( examples_dir, survey_files),
export_path = NULL)
#> Warning: Unknown or uninitialised column: `rowid`.
#> Unknown or uninitialised column: `rowid`.
#> Unknown or uninitialised column: `rowid`.
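If you wanted to save the imported surveys to disk as you go rather than keep them all in memory, a sketch, assuming that export_path accepts a target directory for the saved objects:
read_surveys(
  survey_paths = file.path(examples_dir, survey_files),
  export_path = tempdir()) # hypothetical target directory for the saved surveys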
ZA5913_survey <- example_surveys[[1]]
# A small subset of this survey
head(ZA5913_survey[, c(1,4,5,34)])
#> Unknown A (????). "Untitled Dataset."
#> doi isocntry p1 qd3_14
#> <chr> <chr> <dbl+lbl> <dbl+lbl>
#> 1 doi:10.4232/1.12884 NL 8 [Tuesday 18th March 2014] 0 [Not mentioned]
#> 2 doi:10.4232/1.12884 NL 8 [Tuesday 18th March 2014] 0 [Not mentioned]
#> 3 doi:10.4232/1.12884 NL 10 [Thursday 20th March 2014] 0 [Not mentioned]
#> 4 doi:10.4232/1.12884 NL 14 [Monday 24th March 2014] 0 [Not mentioned]
#> 5 doi:10.4232/1.12884 NL 10 [Thursday 20th March 2014] 0 [Not mentioned]
#> 6 doi:10.4232/1.12884 NL 8 [Tuesday 18th March 2014] 0 [Not mentioned]
If you look at the metadata attributes of ZA5913_survey, you find more information than in the case of a data.frame or its modernized version, the tibble. Crucially, it records the source file and creates a unique table identifier. A further addition is that the first column of the data frame is a truly unique observation identifier, rowid. The rowid is not only unique within this survey, but across all surveys that you import in one workflow. If the original surveys simply used an integer id, such as uniqid running from 1 to 1000, you would run into problems after joining several surveys.
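Before inspecting those attributes, a quick check (an illustrative sketch, not part of the original workflow) confirms that the rowid values are indeed unique across all three imported surveys:
# Collect the rowid values from every imported survey and look for duplicates
all_rowids <- unlist(lapply(example_surveys, function(x) as.character(x$rowid)))
anyDuplicated(all_rowids) == 0 # TRUE if every rowid is unique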
attributes(ZA5913_survey)
#> $names
#> [1] "doi" "version" "uniqid" "isocntry" "p1" "p3"
#> [7] "p4" "nuts" "d7" "d8" "d25" "d60"
#> [13] "qa10_3" "qa10_2" "qa10_1" "qa7_4" "qa7_2" "qa7_3"
#> [19] "qa7_1" "qa7_5" "qd3_1" "qd3_2" "qd3_3" "qd3_4"
#> [25] "qd3_5" "qd3_6" "qd3_7" "qd3_8" "qd3_9" "qd3_10"
#> [31] "qd3_11" "qd3_12" "qd3_13" "qd3_14" "w1" "w3"
#> [37] "rowid"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35
#>
#> $dataset_bibentry
#> Unknown A (????). "Untitled Dataset."
#>
#> $prov
#> [1] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."
#> [2] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [3] "<http://example.com/creation> <http://www.w3.org/ns/prov#startedAtTime> \"\"2024-12-29T11:19:14Z\"^^<xs:dateTime>\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "<http://example.com/creation> <http://www.w3.org/ns/prov#endedAtTime> \"\"2024-12-29T11:19:14Z\"^^<xs:dateTime>\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [5] "<https://doi.org/10.5281/zenodo.14537352> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#>
#> $subject
#> $term
#> [1] "data sets"
#>
#> $subjectScheme
#> [1] "Library of Congress Subject Headings (LCSH)"
#>
#> $schemeURI
#> [1] "https://id.loc.gov/authorities/subjects.html"
#>
#> $valueURI
#> [1] "http://id.loc.gov/authorities/subjects/sh2018002256"
#>
#> $classificationCode
#> NULL
#>
#> $prefix
#> [1] ""
#>
#> attr(,"class")
#> [1] "subject" "list"
#>
#> $class
#> [1] "survey" "dataset_df" "tbl_df" "tbl" "data.frame"
#>
#> $id
#> [1] "ZA5913"
#>
#> $filename
#> [1] "ZA5913.rds"
#>
#> $doi
#> [1] "doi:10.4232/1.12884"
#>
#> $object_size
#> [1] 112608
#>
#> $source_file_size
#> [1] 6507
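Individual attributes can be retrieved the usual way; for example, the persistent identifier of the source dataset:
attr(ZA5913_survey, "doi")
#> [1] "doi:10.4232/1.12884"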
Our example files are lightweight, because they come installed with the R package. If you work with real-life survey data, and many such files, you will likely run out of memory soon. Therefore, the critical functions of retroharmonize are versatile: they work either with a list of surveys or with a vector of file paths. Of course, subsetting or renaming works much faster in memory, so if your resources are sufficient, you should work with the survey_list format, as in the importing example above. Otherwise, you can work through the files sequentially, which is a far slower procedure.
First, let us check our inventory of surveys.
document_surveys(survey_paths = file.path(examples_dir, survey_files))
#> 1/1 ZA5913.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> 1/2 ZA6863.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> 1/3 ZA7576.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> # A tibble: 3 × 8
#> id filename ncol nrow object_size file_size accessed last_modified
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 ZA5913 ZA5913.rds 37 35 113784 6507 2024-12-29 … 2024-12-29 1…
#> 2 ZA6863 ZA6863.rds 48 50 147360 8738 2024-12-29 … 2024-12-29 1…
#> 3 ZA7576 ZA7576.rds 55 45 168608 9312 2024-12-29 … 2024-12-29 1…
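The same inventory can be created from the in-memory list (a sketch, assuming document_surveys() also accepts a survey_list argument, mirroring its file-based interface):
document_surveys(survey_list = example_surveys)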
These will easily fit into memory, so let us explore a bit further.
metadata_create(example_surveys) %>% head()
#> filename id var_name_orig class_orig
#> 1 ZA5913.rds ZA5913 doi character
#> 2 ZA5913.rds ZA5913 version character
#> 3 ZA5913.rds ZA5913 uniqid numeric
#> 4 ZA5913.rds ZA5913 isocntry character
#> 5 ZA5913.rds ZA5913 p1 haven_labelled
#> 6 ZA5913.rds ZA5913 p3 haven_labelled_spss
#> var_label_orig
#> 1 digital_object_identifier
#> 2 gesis_archive_version_and_date
#> 3 unique_respondent_id_caseid_by_tns_country_code
#> 4 country_code_iso_3166
#> 5 date_of_interview
#> 6 duration_of_interview_minutes
#> labels
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
#> 6 2, 225, 999
#> valid_labels na_labels na_range n_labels
#> 1 NA NA NA 0
#> 2 NA NA NA 0
#> 3 NA NA NA 0
#> 4 NA NA NA 0
#> 5 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 NA 14
#> 6 2, 225 999 NA 3
#> n_valid_labels n_na_labels
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0
#> 5 14 0
#> 6 2 1
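The resulting metadata map is an ordinary data frame, so you can query it with standard tools. For example (an illustrative sketch using dplyr, which the pipe above already implies), you can count how many of the three surveys contain each original variable name:
metadata <- metadata_create(example_surveys)
# Variables present in all three surveys are the easiest to harmonize
metadata %>%
  dplyr::count(var_name_orig, sort = TRUE) %>%
  head()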