--- title: "Getting Started" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The retroharmonize package arrives with small subsamples of three Eurobarometer surveys with a few variables and a limited set of responses. They are not as interesting as the full datasets – they serve testing and illustration purposes. ```{r setup} library(retroharmonize) ``` ## Importing data Survey data, i.e., data derived from questionnaires or systematic data collection, such as inspecting objects in nature, recording prices at shops are usually stored databases, and converted to complex files retaining at least coding, labelling metadata together with the data. This must be imported to R so that the appropriate harmonization tasks can be carried out with the appropriate R types. The survey harmonization almost always requires the work of several source files. The harmonization of their contents is important because there the contents of these files do not match, they cannot be joined, integrated, binded together. Our importing functions, read_csv, read_rda, read_spss, read_dta are slightly modify the read.csv, readRDS, and the haven::read_spss, haven::read_data importing functions. Instead of importing into a data.frame or a tibble, they import to an inherited data frame called survey. The survey class works as a data frame, but tries to retain as much metadata as possible for future harmonization steps and resources planning---for example, original source file names. You can find the package illustration files with system.file(). ```{r systemfiles} examples_dir <- system.file("examples", package = "retroharmonize") survey_files <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))] survey_files ``` The read_survey() function calls the appropriate importing function (based on the file extension of the survey files) and reads the surveys into list (in memory.) If you work with many files, and you want to keep working sequentially with survey files, it is a good idea to convert them to R objects. This is how you would do it with large SPSS or STATA files. ```{r eval=FALSE} example_surveys <- read_surveys( file.path( examples_dir, survey_files), export_path = tempdir()) ``` Our example surveys are small and easily fit into the memory. ```{r} example_surveys <- read_surveys( survey_paths = file.path( examples_dir, survey_files), export_path = NULL) ``` ```{r} ZA5913_survey <- example_surveys[[1]] # A small subset of this survey head(ZA5913_survey[, c(1,4,5,34)]) ``` If you look at the metadata attributes of the ZA5913_survey, you find more information than in the case of a data.frame or its modernized version, the tibble. Crucially, it records the source file and creates a unique table identifier. A further addition is that the first column of the data.frame is a truly unique observation identifier, rowid. The rowid is not only unique in this survey, but it is unique in all surveys that you import in one workflow. For example, if the original surveys were just simply using an integer id, like uniqid 1....1000, you will run into problems after joining several surveys. ```{r} attributes(ZA5913_survey) ``` Our example files are lightweight, because they come installed with the R package. If you work with real-life survey data, and many of them, you will likely run out of memory soon. Therefore, the critical functions or retroharmonize are versatile: they either work with a list of surveys, or with a vector of files. Of course, subsetting or renaming work much faster in memory, so if your resources are sufficient, you should work with the survey_list format, like in this importing example. Otherwise, you can work sequentially with the files, which is a far slower procedure. ## Mapping information, harmonizing concepts First, let us check our inventory of surveys. ```{r} document_surveys(survey_paths = file.path(examples_dir, survey_files)) ``` This will easily fit into the memory, so let us explore a bit further. ```{r metadata} metadata_create(example_surveys) %>% head() ``` ## Crosswalk table