Getting Started

The retroharmonize package ships with small subsamples of three Eurobarometer surveys, each with a few variables and a limited set of responses. They are not as interesting as the full datasets – they serve testing and illustration purposes.

library(retroharmonize)

Importing data

Survey data, i.e., data derived from questionnaires or other systematic data collection, such as inspecting objects in nature or recording prices at shops, are usually stored in databases and then converted to complex files that retain at least the coding and labelling metadata together with the data. These files must be imported into R so that the appropriate harmonization tasks can be carried out with the appropriate R types.

Survey harmonization almost always requires working with several source files. Harmonizing their contents is important because when the contents of these files do not match, they cannot be joined, integrated, or bound together.

Our importing functions, read_csv, read_rda, read_spss and read_dta, slightly modify the read.csv, readRDS, haven::read_spss and haven::read_dta importing functions. Instead of importing into a data.frame or a tibble, they import into an inherited data frame called survey. The survey class works like a data frame, but tries to retain as much metadata as possible for future harmonization steps and resource planning, for example, the original source file names.
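
As a minimal sketch (assuming read_rda() accepts a single file path, and using one of the bundled example files), you can verify that an imported object carries the survey class on top of an ordinary data frame:

```r
library(retroharmonize)

# One of the small example files shipped with the package
path <- system.file("examples", "ZA5913.rds", package = "retroharmonize")

zs <- read_rda(path)        # the package's wrapper around readRDS
inherits(zs, "survey")      # the added survey class ...
inherits(zs, "data.frame")  # ... still behaves as a data frame
attr(zs, "filename")        # the original source file name is retained
```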

You can find the package illustration files with system.file().

examples_dir <- system.file("examples", package = "retroharmonize")
survey_files  <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]
survey_files
#> [1] "ZA5913.rds" "ZA6863.rds" "ZA7576.rds"

The read_surveys() function calls the appropriate importing function (based on the file extension of the survey files) and reads the surveys into a list in memory. If you work with many files and you want to keep working sequentially with the survey files, it is a good idea to convert them to R objects first. This is how you would do it with large SPSS or STATA files:

example_surveys <- read_surveys(
  file.path( examples_dir, survey_files), 
  export_path = tempdir())

Our example surveys are small and easily fit into memory.

example_surveys <- read_surveys(
  survey_paths = file.path( examples_dir, survey_files), 
  export_path = NULL)
#> Warning: Unknown or uninitialised column: `rowid`.
#> Unknown or uninitialised column: `rowid`.
#> Unknown or uninitialised column: `rowid`.
ZA5913_survey <- example_surveys[[1]]
# A small subset of this survey
head(ZA5913_survey[, c(1,4,5,34)])
#> Unknown A (????). "Untitled Dataset."
#>   doi                 isocntry p1                            qd3_14           
#>   <chr>               <chr>    <dbl+lbl>                     <dbl+lbl>        
#> 1 doi:10.4232/1.12884 NL        8 [Tuesday 18th March 2014]  0 [Not mentioned]
#> 2 doi:10.4232/1.12884 NL        8 [Tuesday 18th March 2014]  0 [Not mentioned]
#> 3 doi:10.4232/1.12884 NL       10 [Thursday 20th March 2014] 0 [Not mentioned]
#> 4 doi:10.4232/1.12884 NL       14 [Monday 24th March 2014]   0 [Not mentioned]
#> 5 doi:10.4232/1.12884 NL       10 [Thursday 20th March 2014] 0 [Not mentioned]
#> 6 doi:10.4232/1.12884 NL        8 [Tuesday 18th March 2014]  0 [Not mentioned]

If you look at the metadata attributes of ZA5913_survey, you find more information than in the case of a data.frame or its modernized version, the tibble. Crucially, it records the source file and creates a unique table identifier. A further addition is that the first column of the data frame, rowid, is a truly unique observation identifier. The rowid is not only unique within this survey, but across all surveys that you import in one workflow. For example, if the original surveys simply used an integer id, like uniqid 1….1000, you would run into problems after joining several surveys.

attributes(ZA5913_survey)
#> $names
#>  [1] "doi"      "version"  "uniqid"   "isocntry" "p1"       "p3"      
#>  [7] "p4"       "nuts"     "d7"       "d8"       "d25"      "d60"     
#> [13] "qa10_3"   "qa10_2"   "qa10_1"   "qa7_4"    "qa7_2"    "qa7_3"   
#> [19] "qa7_1"    "qa7_5"    "qd3_1"    "qd3_2"    "qd3_3"    "qd3_4"   
#> [25] "qd3_5"    "qd3_6"    "qd3_7"    "qd3_8"    "qd3_9"    "qd3_10"  
#> [31] "qd3_11"   "qd3_12"   "qd3_13"   "qd3_14"   "w1"       "w3"      
#> [37] "rowid"   
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35
#> 
#> $dataset_bibentry
#> Unknown A (????). "Untitled Dataset."
#> 
#> $prov
#> [1] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."                                    
#> [2] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."                                          
#> [3] "<http://example.com/creation> <http://www.w3.org/ns/prov#startedAtTime> \"\"2024-12-29T11:19:14Z\"^^<xs:dateTime>\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [4] "<http://example.com/creation> <http://www.w3.org/ns/prov#endedAtTime> \"\"2024-12-29T11:19:14Z\"^^<xs:dateTime>\"^^<http://www.w3.org/2001/XMLSchema#string> ."  
#> [5] "<https://doi.org/10.5281/zenodo.14537352> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."                         
#> 
#> $subject
#> $term
#> [1] "data sets"
#> 
#> $subjectScheme
#> [1] "Library of Congress Subject Headings (LCSH)"
#> 
#> $schemeURI
#> [1] "https://id.loc.gov/authorities/subjects.html"
#> 
#> $valueURI
#> [1] "http://id.loc.gov/authorities/subjects/sh2018002256"
#> 
#> $classificationCode
#> NULL
#> 
#> $prefix
#> [1] ""
#> 
#> attr(,"class")
#> [1] "subject" "list"   
#> 
#> $class
#> [1] "survey"     "dataset_df" "tbl_df"     "tbl"        "data.frame"
#> 
#> $id
#> [1] "ZA5913"
#> 
#> $filename
#> [1] "ZA5913.rds"
#> 
#> $doi
#> [1] "doi:10.4232/1.12884"
#> 
#> $object_size
#> [1] 112608
#> 
#> $source_file_size
#> [1] 6507
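
The cross-survey uniqueness of rowid can be checked with base R alone; this sketch relies only on the example_surveys list created above:

```r
# Collect the rowid columns of all three imported surveys
all_rowids <- unlist(lapply(example_surveys, function(x) as.character(x$rowid)))

# anyDuplicated() returns 0 when no rowid occurs twice across the surveys
anyDuplicated(all_rowids)
```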

Our example files are lightweight, because they come installed with the R package. If you work with real-life survey data, and many such files, you will likely run out of memory soon. Therefore, the critical functions of retroharmonize are versatile: they work either with a list of surveys or with a vector of file paths. Of course, subsetting or renaming works much faster in memory, so if your resources are sufficient, you should work with the survey_list format, as in this importing example. Otherwise, you can work sequentially with the files, which is a far slower procedure.
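
For example, document_surveys(), which the next section calls with file paths, can equally be called on the in-memory list; this sketch assumes the survey_list argument name:

```r
# In-memory variant: document the surveys already imported above
document_surveys(survey_list = example_surveys)
```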

Mapping information, harmonizing concepts

First, let us check our inventory of surveys.

document_surveys(survey_paths = file.path(examples_dir, survey_files))
#> 1/1 ZA5913.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> 1/2 ZA6863.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> 1/3 ZA7576.rds
#> Warning: Unknown or uninitialised column: `rowid`.
#> # A tibble: 3 × 8
#>   id     filename    ncol  nrow object_size file_size accessed     last_modified
#>   <chr>  <chr>      <dbl> <dbl>       <dbl>     <dbl> <chr>        <chr>        
#> 1 ZA5913 ZA5913.rds    37    35      113784      6507 2024-12-29 … 2024-12-29 1…
#> 2 ZA6863 ZA6863.rds    48    50      147360      8738 2024-12-29 … 2024-12-29 1…
#> 3 ZA7576 ZA7576.rds    55    45      168608      9312 2024-12-29 … 2024-12-29 1…

This will easily fit into memory, so let us explore a bit further.

metadata_create(example_surveys) %>% head()
#>     filename     id var_name_orig          class_orig
#> 1 ZA5913.rds ZA5913           doi           character
#> 2 ZA5913.rds ZA5913       version           character
#> 3 ZA5913.rds ZA5913        uniqid             numeric
#> 4 ZA5913.rds ZA5913      isocntry           character
#> 5 ZA5913.rds ZA5913            p1      haven_labelled
#> 6 ZA5913.rds ZA5913            p3 haven_labelled_spss
#>                                    var_label_orig
#> 1                       digital_object_identifier
#> 2                  gesis_archive_version_and_date
#> 3 unique_respondent_id_caseid_by_tns_country_code
#> 4                           country_code_iso_3166
#> 5                               date_of_interview
#> 6                   duration_of_interview_minutes
#>                                          labels
#> 1                                            NA
#> 2                                            NA
#> 3                                            NA
#> 4                                            NA
#> 5 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
#> 6                                   2, 225, 999
#>                                    valid_labels na_labels na_range n_labels
#> 1                                            NA        NA       NA        0
#> 2                                            NA        NA       NA        0
#> 3                                            NA        NA       NA        0
#> 4                                            NA        NA       NA        0
#> 5 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14                 NA       14
#> 6                                        2, 225       999       NA        3
#>   n_valid_labels n_na_labels
#> 1              0           0
#> 2              0           0
#> 3              0           0
#> 4              0           0
#> 5             14           0
#> 6              2           1
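
Building on the metadata table above, a short dplyr sketch (using the column names as printed above) keeps only the variables that carry value labels:

```r
library(dplyr)

metadata_create(example_surveys) %>%
  filter(n_labels > 0) %>%                    # keep labelled variables only
  select(id, var_name_orig, var_label_orig,   # identify the variable ...
         n_valid_labels, n_na_labels)         # ... and its label counts
```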

Crosswalk table