Title: | Ex Post Survey Data Harmonization |
---|---|
Description: | Assists in the reproducible retrospective (ex post) harmonization of data, particularly individual-level survey data, by providing tools for organizing metadata, standardizing the coding of variables, variable names, and value labels (including missing values), and documenting the data transformations, with the help of comprehensive S3 classes. |
Authors: | Daniel Antal [aut, cre] , Marta Kolczynska [ctb] |
Maintainer: | Daniel Antal <[email protected]> |
License: | GPL-3 |
Version: | 0.2.5.003 |
Built: | 2024-10-26 03:25:59 UTC |
Source: | https://github.com/rOpenGov/retroharmonize |
Convert a labelled_spss_survey vector to a type of factor. Keeps only the levels and class attributes.
as_factor(x, levels = "default", ordered = FALSE)
x |
Object to coerce to a factor. |
levels |
How to create the levels of the generated factor:
|
ordered |
If |
as_factor is imported from haven::as_factor.
Labelled to labelled_spss_survey
as_labelled_spss_survey(x, id)
x |
A vector of class haven_labelled or haven_labelled_spss. |
id |
The survey identifier. |
A vector of labelled_spss_survey
Other type conversion functions: labelled_spss_survey()
This is a function candidate which will partly replace the current create_codebook function.
codebook_create()
This future function should create a DDI-Codebook compatible, partial codebook on survey level.
This function will return
# This is a new function candidate and is not written yet.
codebook_create()
This is a function candidate.
codelist_create()
codelist_create() should be a new function that creates a codelist which considers https://sdmx.org/?page_id=4345 and any guidance from DDI on Question banks and the DDI Question Construct. It partly replaces the current create_codebook function.
This function will return a data frame (tbl_df) with a codelist.
# This is a new function candidate and is not written yet.
codelist_create()
Collect labels from metadata file
collect_val_labels(metadata)
collect_na_labels(metadata)
metadata |
A metadata data frame created by
|
The unique valid labels or the user-defined missing labels found in all the files analyzed in metadata.
Other harmonization functions: crosswalk_surveys(), crosswalk_table_create(), harmonize_na_values(), harmonize_survey_values(), harmonize_values(), harmonize_var_names(), label_normalize()
test_survey <- retroharmonize::read_rds(
  file = system.file("examples", "ZA7576.rds", package = "retroharmonize"),
  id = "test"
)
example_metadata <- metadata_create(test_survey)

collect_val_labels(metadata = example_metadata)
collect_na_labels(metadata = example_metadata)
Concatenate haven_labelled_spss vectors
concatenate(x, y)
x |
A haven_labelled_spss vector. |
y |
A haven_labelled_spss vector. |
A concatenated haven_labelled_spss vector. Returns an error if the attributes do not match. Gives a warning when only the variable labels do not match.
v1 <- labelled::labelled(
  c(3, 4, 4, 3, 8, 9),
  c(YES = 3, NO = 4, `WRONG LABEL` = 8, REFUSED = 9)
)
v2 <- labelled::labelled(
  c(4, 3, 3, 9),
  c(YES = 3, NO = 4, `WRONG LABEL` = 8, REFUSED = 9)
)

s1 <- haven::labelled_spss(
  x = unclass(v1),                    # remove labels from earlier defined
  labels = labelled::val_labels(v1),  # use the labels from earlier defined
  na_values = NULL,
  na_range = 8:9,
  label = "Variable Example"
)
s2 <- haven::labelled_spss(
  x = unclass(v2),                    # remove labels from earlier defined
  labels = labelled::val_labels(v2),  # use the labels from earlier defined
  na_values = NULL,
  na_range = 8:9,
  label = "Variable Example"
)

concatenate(s1, s2)
Create a codebook from one or more survey data files.
create_codebook(metadata = NULL, survey = NULL)
codebook_waves_create(waves)
codebook_surveys_create(survey_list)
metadata |
A metadata table created by |
survey |
A survey data frame, defaults to |
waves |
A list of surveys. |
survey_list |
A list containing surveys of class survey. |
For a list of survey waves, use codebook_waves_create.
The returned codebook contains only labelled variables, i.e., numeric and
character types are not included, because they do not require coding.
A codebook for the survey as a data frame, including the metadata, and all found SPSS-type valid or missing labels.
Other metadata functions: crosswalk_table_create(), metadata_create(), metadata_survey_create()
create_codebook(
  survey = read_rds(
    system.file("examples", "ZA7576.rds", package = "retroharmonize")
  )
)

examples_dir <- system.file("examples", package = "retroharmonize")
survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

example_surveys <- read_surveys(
  file.path(examples_dir, survey_list),
  save_to_rds = FALSE
)

codebook_surveys_create(example_surveys)
Harmonize surveys with crosswalk tables.
crosswalk_surveys(
  crosswalk_table,
  survey_list = NULL,
  survey_paths = NULL,
  import_path = NULL,
  na_values = NULL
)
crosswalk(survey_list, crosswalk_table, na_values = NULL)
crosswalk_table |
A table created with |
survey_list |
A list of surveys imported with |
survey_paths |
A vector of full file paths to the surveys to subset. |
na_values |
A named vector of |
Harmonize a survey or a list of surveys with the help of a crosswalk table.
You can create the crosswalk table with crosswalk_table_create, or manually create a crosswalk table as a data frame including at least the following columns: id for identifying a survey, var_name_orig for the original variable name, and var_name_target for the new (target) variable name. Optionally, you can harmonize the value labels, the numeric codes, and the special missing labels, too.
crosswalk returns a data frame, and crosswalk_surveys returns a list of data frames, in which the variable names and, optionally, the variable labels and the missing value range are harmonized (the same names, labels, and codes are used).
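As an illustration of the minimum structure described above, the following base R sketch builds a crosswalk table by hand with the three required columns and shows what harmonized variable names mean for two toy data frames. The survey identifiers, the target name freedom, and the helper rename_by_crosswalk are hypothetical and exist only for this sketch; the toy inputs are plain data frames rather than objects of class survey, and the helper is not the code used by crosswalk or crosswalk_surveys.

```r
# Minimal manual crosswalk table with the three required columns.
crosswalk_table <- data.frame(
  id              = c("survey1", "survey1", "survey2", "survey2"),
  var_name_orig   = c("rowid", "qd3_4", "rowid", "qd7.4"),
  var_name_target = c("rowid", "freedom", "rowid", "freedom"),
  stringsAsFactors = FALSE
)

# Toy surveys (plain data frames, not objects of class survey).
toy_surveys <- list(
  survey1 = data.frame(rowid = 1:2, qd3_4 = c(1, 0)),
  survey2 = data.frame(rowid = 1:2, qd7.4 = c(0, 1))
)

# Hypothetical helper: keep the columns listed in the crosswalk rows of one
# survey and rename them to their target names.
rename_by_crosswalk <- function(df, survey_id, ctable) {
  rows <- ctable[ctable$id == survey_id, ]
  idx  <- match(names(df), rows$var_name_orig)
  keep <- !is.na(idx)
  df   <- df[, keep, drop = FALSE]
  names(df) <- rows$var_name_target[idx[keep]]
  df
}

renamed <- Map(
  function(df, survey_id) rename_by_crosswalk(df, survey_id, crosswalk_table),
  toy_surveys, names(toy_surveys)
)

# Both toy surveys now share the same variable names.
lapply(renamed, names)
```

With real survey objects, the same table would instead be passed as the crosswalk_table argument of crosswalk or crosswalk_surveys.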
Other harmonization functions: collect_val_labels(), crosswalk_table_create(), harmonize_na_values(), harmonize_survey_values(), harmonize_values(), harmonize_var_names(), label_normalize()
examples_dir <- system.file("examples", package = "retroharmonize")
survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

example_surveys <- read_surveys(
  file.path(examples_dir, survey_list),
  save_to_rds = FALSE
)

## Compare with documentation:
documented_surveys <- metadata_surveys_create(example_surveys)
documented_surveys <- documented_surveys[
  documented_surveys$var_name_orig %in% c(
    "rowid", "isocntry", "w1", "qd3_4", "qd3_8",
    "qd7.4", "qd7.8", "qd6.4", "qd6.8"
  ),
]

crosswalk_table <- crosswalk_table_create(metadata = documented_surveys)
Create a crosswalk table with the source variable names and variable labels.
crosswalk_table_create(metadata)
is.crosswalk_table(ctable)
metadata |
A metadata table created by |
ctable |
A table to validate if it is a crosswalk table. |
The table contains a var_name_target and a val_label_target column, but these values need to be set by further manual or reproducible harmonization steps.
A tibble with a raw crosswalk table. It contains all harmonization tasks, but the target values need to be set by further manipulations.
Other harmonization functions: collect_val_labels(), crosswalk_surveys(), harmonize_na_values(), harmonize_survey_values(), harmonize_values(), harmonize_var_names(), label_normalize()
Other metadata functions: create_codebook(), metadata_create(), metadata_survey_create()
Document the current and historic coding and labelling of the variable.
document_survey_item(x)
x |
A labelled_spss_survey vector from a single survey or concatenated from several surveys. |
Returns a list of the current and historic coding, labelling of the valid range and missing values or range, the history of the variable names and the history of the survey IDs.
Other documentation functions: document_surveys()
var1 <- labelled::labelled_spss(
  x = c(1, 0, 1, 1, 0, 8, 9),
  labels = c("TRUST" = 1, "NOT TRUST" = 0, "DON'T KNOW" = 8, "INAP. HERE" = 9),
  na_values = c(8, 9)
)

var2 <- labelled::labelled_spss(
  x = c(2, 2, 8, 9, 1, 1),
  labels = c("Tend to trust" = 1, "Tend not to trust" = 2, "DK" = 8, "Inap" = 9),
  na_values = c(8, 9)
)

h1 <- harmonize_values(
  x = var1,
  harmonize_label = "Do you trust the European Union?",
  harmonize_labels = list(
    from = c("^tend\\sto|^trust", "^tend\\snot|not\\strust", "^dk|^don", "^inap"),
    to = c("trust", "not_trust", "do_not_know", "inap"),
    numeric_values = c(1, 0, 99997, 99999)
  ),
  na_values = c("do_not_know" = 99997, "inap" = 99999),
  id = "survey1"
)

h2 <- harmonize_values(
  x = var2,
  harmonize_label = "Do you trust the European Union?",
  harmonize_labels = list(
    from = c("^tend\\sto|^trust", "^tend\\snot|not\\strust", "^dk|^don", "^inap"),
    to = c("trust", "not_trust", "do_not_know", "inap"),
    numeric_values = c(1, 0, 99997, 99999)
  ),
  na_values = c("do_not_know" = 99997, "inap" = 99999),
  id = "survey2"
)

h3 <- concatenate(h1, h2)
document_survey_item(h3)
Document the key attributes of the surveys in a survey list.
document_surveys(survey_list = NULL, survey_paths = NULL, .f = NULL)
document_waves(waves)
survey_list |
A list of |
survey_paths |
A vector of full file paths to the surveys to subset, defaults to
|
.f |
A function to import the surveys with.
Defaults to |
waves |
A list of |
The function has two alternative input parameters. If survey_list is the input, it returns the name of the original source data file, the number of rows and columns, and the size of the object as stored in memory. If survey_paths contains the source data files, it sequentially reads those files and adds the file size, the last access time, and the last modified time attributes.
The earlier form, document_waves, is deprecated; the function is currently called document_surveys.
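The attributes listed above can be gathered in base R with nrow(), ncol(), object.size() and file.info(). The sketch below does this for a single temporary rds file. It is only an illustration under stated assumptions: the column names of survey_doc are hypothetical, and this is not the code used by document_surveys.

```r
# Save a toy data frame to a temporary rds file.
toy_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(rowid = 1:3, w1 = c(0.9, 1.1, 1.0)), toy_path)

# Read it back and query the file system for the file attributes.
toy_survey <- readRDS(toy_path)
info <- file.info(toy_path)

# Hypothetical layout of one documentation row.
survey_doc <- data.frame(
  filename      = basename(toy_path),
  nrow          = nrow(toy_survey),
  ncol          = ncol(toy_survey),
  object_size   = as.numeric(object.size(toy_survey)),  # bytes in memory
  file_size     = info$size,                            # bytes on disk
  last_accessed = info$atime,
  last_modified = info$mtime
)
survey_doc
```

The in-memory attributes (rows, columns, object size) are available for any survey already loaded, while the file size and time stamps require access to the source file, which is why they are only reported when file paths are given.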
Returns a data frame with the key attributes of the surveys in a survey list: the name of the data file, the number of rows and columns, and the size of the object as stored in memory.
Other documentation functions: document_survey_item()
examples_dir <- system.file("examples", package = "retroharmonize")
my_rds_files <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

example_surveys <- read_surveys(file.path(examples_dir, my_rds_files))

documented <- document_surveys(example_surveys)
attr(documented, "original_list")
documented

document_surveys(survey_paths = file.path(examples_dir, my_rds_files))
Harmonize na_values in haven_labelled_spss
harmonize_na_values(df)
df |
A data frame that contains haven_labelled_spss vectors. |
A tibble where the na_values are consistent
Other harmonization functions: collect_val_labels(), crosswalk_surveys(), crosswalk_table_create(), harmonize_survey_values(), harmonize_values(), harmonize_var_names(), label_normalize()
examples_dir <- system.file("examples", package = "retroharmonize")

test_read <- read_rds(
  file.path(examples_dir, "ZA7576.rds"),
  id = "ZA7576",
  doi = "test_doi"
)

harmonize_na_values(test_read)
Harmonize the value codes and value labels across multiple surveys.
harmonize_survey_values(survey_list, .f, status_message = FALSE)
harmonize_waves(waves, .f, status_message = FALSE)
survey_list |
A list of surveys. In the deprecated form the parameter was called
|
.f |
A function to apply for the harmonization. |
status_message |
Defaults to |
waves |
A list of surveys. Deprecated. |
The function binds together variables that are all present in the surveys, and applies a harmonization function .f to them. Until retroharmonize 0.2.0 it was called harmonize_waves.
The earlier form, harmonize_waves, is deprecated. The function is currently called harmonize_survey_values.
A natural full join of all surveys in a single data frame.
Other harmonization functions: collect_val_labels(), crosswalk_surveys(), crosswalk_table_create(), harmonize_na_values(), harmonize_values(), harmonize_var_names(), label_normalize()
examples_dir <- system.file("examples", package = "retroharmonize")
survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

example_surveys <- read_surveys(
  file.path(examples_dir, survey_list),
  save_to_rds = FALSE
)

metadata <- lapply(X = example_surveys, FUN = metadata_create)
metadata <- do.call(rbind, metadata)

require(dplyr)

to_harmonize <- metadata %>%
  filter(var_name_orig %in% c("rowid", "w1") |
           grepl("^trust", var_label_orig)) %>%
  mutate(var_label = var_label_normalize(var_label_orig)) %>%
  mutate(var_name_target = val_label_normalize(var_label_orig)) %>%
  mutate(var_name_target = ifelse(
    .data$var_name_orig %in% c("rowid", "w1", "wex"),
    .data$var_name_orig,
    .data$var_name_target
  ))

# The harmonization function is applied to each variable, so it calls
# harmonize_values(), which accepts harmonize_labels and na_values.
harmonize_eb_trust <- function(x) {
  label_list <- list(
    from = c("^tend\\snot", "^cannot", "^tend\\sto", "^can\\srely",
             "^dk", "^inap", "na"),
    to = c("not_trust", "not_trust", "trust", "trust",
           "do_not_know", "inap", "inap"),
    numeric_values = c(0, 0, 1, 1, 99997, 99999, 99999)
  )

  harmonize_values(
    x,
    harmonize_labels = label_list,
    na_values = c("do_not_know" = 99997, "declined" = 99998, "inap" = 99999)
  )
}

merged_surveys <- merge_surveys(
  example_surveys,
  var_harmonization = to_harmonize
)

harmonized <- harmonize_survey_values(
  survey_list = merged_surveys,
  .f = harmonize_eb_trust,
  status_message = FALSE
)

# For details see Afrobarometer and Eurobarometer Case Study vignettes.
Similar to subset_surveys, but it not only removes the variables that cannot be harmonized, it also renames the variables that are kept.
harmonize_survey_variables(
  crosswalk_table,
  subset_name = "subset",
  survey_list = NULL,
  survey_paths = NULL,
  import_path = NULL,
  export_path = NULL
)
crosswalk_table |
A crosswalk table created by |
subset_name |
An identifier for the survey subset. |
survey_list |
A list of surveys imported with |
survey_paths |
A vector of full file paths to the surveys to subset. |
A list of surveys, or individual rds files saved to the export_path.
{
  examples_dir <- system.file("examples", package = "retroharmonize")
  survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

  example_surveys <- read_surveys(
    file.path(examples_dir, survey_list),
    export_path = NULL
  )

  documented_surveys <- metadata_create(example_surveys)
  documented_surveys <- documented_surveys[
    documented_surveys$var_name_orig %in% c(
      "rowid", "isocntry", "w1", "qd3_4", "qd3_8",
      "qd7.4", "qd7.8", "qd6.4", "qd6.8"
    ),
  ]

  crosswalk_table <- crosswalk_table_create(metadata = documented_surveys)

  freedom_table <- crosswalk_table[
    which(crosswalk_table$var_name_target %in% c("rowid", "freedom")),
  ]

  harmonize_survey_variables(
    crosswalk_table = freedom_table,
    subset_name = "freedom",
    survey_list = example_surveys
  )
}
Create a labelled vector with harmonized numeric coding and value labels.
harmonize_values(
  x,
  harmonize_label = NULL,
  harmonize_labels = NULL,
  na_values = c(do_not_know = 99997, declined = 99998, inap = 99999),
  na_range = NULL,
  id = "survey_id",
  name_orig = NULL,
  remove = NULL,
  perl = FALSE
)
x |
A labelled vector |
harmonize_label |
A character vector of length 1 containing the new, harmonized variable label. Defaults to |
harmonize_labels |
A list of harmonization values |
na_values |
A named vector of |
na_range |
A min, max range of |
id |
A survey ID, defaults to |
name_orig |
The original name of the variable. If left |
remove |
Defaults to |
perl |
Use perl-like regex? Defaults to FALSE. |
Create a labelled vector that contains in its metadata attributes the original labelling, the original numeric coding and the current labelling, with the numerical values representing the harmonized coding.
A labelled vector that contains in its metadata attributes the original labelling, the original numeric coding and the current labelling, with the numerical values representing the harmonized coding.
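The harmonize_labels argument pairs regular expressions (from) with harmonized labels (to) and harmonized codes (numeric_values). The base R sketch below illustrates that pairing on the value labels alone. It is a conceptual illustration, not the implementation of harmonize_values: the helper map_labels is hypothetical, and lower-casing the labels before matching is an assumption made so that the lower-case patterns match the upper-case labels.

```r
# Hypothetical helper: map original value labels to harmonized labels and
# codes. The first matching pattern wins; unmatched labels stay NA.
map_labels <- function(labels, from, to, numeric_values) {
  normalized <- tolower(trimws(labels))  # assumption: matching ignores case
  new_label <- rep(NA_character_, length(labels))
  new_code  <- rep(NA_real_, length(labels))
  for (i in seq_along(from)) {
    hit <- is.na(new_label) & grepl(from[i], normalized)
    new_label[hit] <- to[i]
    new_code[hit]  <- numeric_values[i]
  }
  data.frame(
    label_orig   = labels,
    label_target = new_label,
    code_target  = new_code,
    stringsAsFactors = FALSE
  )
}

result <- map_labels(
  labels = c("TRUST", "NOT TRUST", "DON'T KNOW", "INAP. HERE"),
  from = c("^tend\\sto|^trust", "^tend\\snot|not\\strust", "^dk|^don", "^inap"),
  to = c("trust", "not_trust", "do_not_know", "inap"),
  numeric_values = c(1, 0, 99997, 99999)
)
result
```

Listing the patterns in the same order as to and numeric_values keeps the three vectors aligned, so each pattern carries exactly one target label and one target code.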
Other harmonization functions: collect_val_labels(), crosswalk_surveys(), crosswalk_table_create(), harmonize_na_values(), harmonize_survey_values(), harmonize_var_names(), label_normalize()
var1 <- labelled::labelled_spss(
  x = c(1, 0, 1, 1, 0, 8, 9),
  labels = c("TRUST" = 1, "NOT TRUST" = 0, "DON'T KNOW" = 8, "INAP. HERE" = 9),
  na_values = c(8, 9)
)

harmonize_values(
  var1,
  harmonize_labels = list(
    from = c("^tend\\sto|^trust", "^tend\\snot|not\\strust", "^dk|^don", "^inap"),
    to = c("trust", "not_trust", "do_not_know", "inap"),
    numeric_values = c(1, 0, 99997, 99999)
  ),
  na_values = c("do_not_know" = 99997, "inap" = 99999),
  id = "survey_id"
)
The function harmonizes the variable names of surveys (of class survey) that are imported from an external file as a wave.
harmonize_var_names(
  survey_list,
  metadata,
  old = "var_name_orig",
  new = "var_name_suggested",
  rowids = TRUE
)
survey_list |
A list of surveys imported with |
metadata |
A metadata table created by |
old |
The column name in |
new |
The column name in |
rowids |
Rename var labels of original vars |
If the metadata that contains subsetting information is subsetted, then the function will subset the surveys in survey_list.
The list of surveys with harmonized variable names.
crosswalk
Other harmonization functions: collect_val_labels(), crosswalk_surveys(), crosswalk_table_create(), harmonize_na_values(), harmonize_survey_values(), harmonize_values(), label_normalize()
examples_dir <- system.file("examples", package = "retroharmonize")
survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

example_surveys <- read_surveys(
  file.path(examples_dir, survey_list)
)

metadata <- metadata_create(example_surveys)
metadata$var_name_suggested <- label_normalize(metadata$var_name)
metadata$var_name_suggested[metadata$label_orig == "age_education"] <- "age_education"

harmonize_var_names(
  survey_list = example_surveys,
  metadata = metadata
)
label_normalize removes special characters, whitespace, and other typical typing errors.
label_normalize(x)
var_label_normalize(x)
val_label_normalize(x)
x |
A character vector of labels to be normalized. |
var_label_normalize and val_label_normalize remove possible chunks from question identifiers.
The functions var_label_normalize and val_label_normalize may be implemented differently for various survey series.
Returns a suggested, normalized label without special characters. The functions var_label_normalize and val_label_normalize return them in snake_case for programmatic use.
Other variable label harmonization functions: na_range_to_values()
Other harmonization functions: collect_val_labels(), crosswalk_surveys(), crosswalk_table_create(), harmonize_na_values(), harmonize_survey_values(), harmonize_values(), harmonize_var_names()
label_normalize(
  c("Don't know", " TRUST", "DO NOT TRUST",
    "inap in Q.3", "Not 100%", "TRUST < 50%",
    "TRUST >=90%", "Verify & Check", "TRUST 99%+")
)

var_label_normalize(
  c("Q1_Do you trust the national government?",
    " Do you trust the European Commission")
)

val_label_normalize(
  c("Q1_Do you trust the national government?",
    " Do you trust the European Commission")
)
This class amends haven::labelled_spss with a unique object identifier id to make later binding or joining reproducible and well-documented.
labelled_spss_survey(
  x = double(),
  labels = NULL,
  na_values = NULL,
  na_range = NULL,
  label = NULL,
  id = NULL,
  name_orig = NULL
)
as_character(x)
is.labelled_spss_survey(x)
as_numeric(x)
x |
A vector to label. Must be either numeric (integer or double) or character. |
labels |
A named vector or |
na_values |
A vector of values that should also be considered as missing. |
na_range |
A numeric vector of length two giving the (inclusive) extents
of the range. Use |
label |
A short, human-readable description of the vector. |
id |
Survey ID |
name_orig |
The original name of the variable. If left |
It inherits many methods from labelled, but uses stricter coercion and validation rules.
as_factor
Other type conversion functions: as_labelled_spss_survey()
x1 <- labelled_spss_survey(
  1:10, c(Good = 1, Bad = 8),
  na_values = c(9, 10),
  id = "survey1"
)
is.na(x1)

# Print data and metadata
print(x1)

x2 <- labelled_spss_survey(
  1:10,
  labels = c(Good = 1, Bad = 8),
  na_range = c(9, Inf),
  label = "Quality rating",
  id = "survey1"
)
is.na(x2)

# Print data and metadata
x2
Merge a list of surveys into a list with harmonized variable names, variable labels and survey identifiers.
merge_surveys(survey_list, var_harmonization)
merge_waves(waves, var_harmonization)
survey_list |
A list of surveys |
var_harmonization |
Metadata of surveys, including at least
|
waves |
Deprecated. |
Until version 0.2.0, the function was called merge_waves(), which reflects the vocabulary of Eurobarometer surveys.
A list of surveys with harmonized names and variable labels.
survey
examples_dir <- system.file("examples", package = "retroharmonize")
survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]

example_surveys <- read_surveys(
  file.path(examples_dir, survey_list),
  save_to_rds = FALSE
)

metadata <- metadata_surveys_create(example_surveys)

require(dplyr)

to_harmonize <- metadata %>%
  filter(var_name_orig %in% c("rowid", "w1") |
           grepl("^trust", label_orig)) %>%
  mutate(var_label = var_label_normalize(label_orig)) %>%
  mutate(var_name_target = val_label_normalize(var_label)) %>%
  mutate(var_name_target = ifelse(
    .data$var_name_orig %in% c("rowid", "w1", "wex"),
    .data$var_name_orig,
    .data$var_name_target
  ))

merge_surveys(example_surveys, to_harmonize)
Create a metadata table from several surveys
metadata_create(survey_list = NULL, survey_paths = NULL, .f = NULL)
metadata_waves_create(survey_list)
inheritParams |
read_surveys |
The form metadata_waves_create is deprecated.
Other metadata functions: create_codebook(), crosswalk_table_create(), metadata_survey_create()
examples_dir <- system.file( "examples", package = "retroharmonize") my_rds_files <- dir( examples_dir)[grepl(".rds", dir(examples_dir))] example_surveys <- read_surveys(file.path(examples_dir, my_rds_files)) metadata_create (example_surveys)
Create a metadata table from the survey data files.
metadata_survey_create(survey)
survey |
A survey data frame. You receive a survey object with any importing function, i.e.
|
A data-frame-like tibble object is returned.
If you are working with several surveys, pass a list of surveys or a vector of
file names containing the full paths to the surveys to metadata_create, which is
a wrapper around a series of metadata_survey_create calls.
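The wrapper pattern described above can be sketched in base R. This is only an illustration, not the package's implementation: describe_survey and describe_surveys are hypothetical stand-ins, and the two columns they return are a small subset chosen for the example.

```r
# Hypothetical stand-in for a per-survey metadata function
# (illustration only, not the code of metadata_survey_create).
describe_survey <- function(survey) {
  data.frame(
    var_name_orig = names(survey),
    class_orig = unname(vapply(survey, function(x) class(x)[1], character(1))),
    stringsAsFactors = FALSE
  )
}

# The wrapper applies the per-survey function to every element of the list
# and binds the resulting rows into one metadata table.
describe_surveys <- function(survey_list) {
  do.call(rbind, lapply(survey_list, describe_survey))
}

surveys <- list(
  data.frame(rowid = 1:2, trust = c(1, 0)),
  data.frame(rowid = 1:3, w1 = c(0.9, 1.1, 1.0))
)
metadata <- describe_surveys(surveys)
metadata
```

The result has one row per variable per survey, which is the general shape of the metadata table described in this section.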
The structure of the returned tibble:
The original file name, if present; missing if a non-survey data frame is used as input survey.
The ID of the survey, if present; missing if a non-survey data frame is used as input survey.
The original variable name in SPSS.
The original variable class after importing with read_spss.
The original variable label in SPSS.
A list of the value labels.
A list of the value labels that are not marked as missing values.
A list of the value labels that refer to user-defined missing values.
An optional range of a continuous missing range, if present in the vector.
Number of categories or unique levels, which may be different from the sum of missing and category labels.
Number of categories in the non-missing range.
Number of categories of the variable, should be the sum of the former two.
A list of the user-defined missing values.
A nested data frame with metadata and the range of labels, na_values and the na_range itself.
Other metadata functions:
create_codebook()
,
crosswalk_table_create()
,
metadata_create()
metadata_create(
  survey_list = read_rds(
    system.file("examples", "ZA7576.rds", package = "retroharmonize")
  )
)
Harmonize the na_values
attribute with
na_range
, if the latter is present.
na_range_to_values(x)

is.na_range_to_values(x)
x |
A labelled_spss or labelled_spss_survey vector |
is.na_range_to_values() tests whether na_range_to_values() needs to be
called for na_values harmonization. The na_range is often missing and is
less likely to cause logical problems when joining survey answers.
The vector x with harmonized na_values and na_range attributes.
If min(na_values) or max(na_values) falls outside the left- and
right-hand values of na_range, the function gives a warning and adjusts
the original na_range.
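To make the relation between na_range and na_values concrete, the following base R sketch derives the missing-value codes that actually occur inside a given range. It works on a plain numeric vector and is not the package's implementation. The data mirror the example for this function.

```r
# Observed answers and an SPSS-style missing range (plain numeric vector,
# not a labelled_spss_survey object).
x <- c(1, 0, 1, 1, 0, 8, 9)
na_range <- c(8, 12)

# Values that fall inside the closed interval [8, 12] count as missing.
in_range <- x >= na_range[1] & x <= na_range[2]

# The distinct codes found in the range are candidates for na_values.
na_values <- sort(unique(x[in_range]))
na_values

# Consistency check: all na_values lie within the na_range.
all(na_values >= na_range[1] & na_values <= na_range[2])
```

Here the codes 8 and 9 are the only observed values in the range, so they are the values that a harmonized na_values attribute would be expected to list.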
Other variable label harmonization functions: label_normalize()
var1 <- labelled::labelled_spss(
  x = c(1, 0, 1, 1, 0, 8, 9),
  labels = c("TRUST" = 1, "NOT TRUST" = 0,
             "DON'T KNOW" = 8, "INAP. HERE" = 9),
  na_range = c(8, 12)
)

na_range_to_values(var1)
as_numeric(na_range_to_values(var1))
as_character(na_range_to_values(var1))
Pull a survey by survey code or id.
pull_survey(survey_list, id = NULL, filename = NULL)
survey_list |
A list of surveys |
id |
The id of the requested survey. If |
filename |
The filename of the requested survey. |
A single survey identified by id
or filename
.
Other import functions: read_csv(), read_dta(), read_rds(), read_spss(), read_surveys()
examples_dir <- system.file("examples", package = "retroharmonize")
# Escape the dot and anchor the pattern so only files ending in .rds match
my_rds_files <- dir(examples_dir)[grepl("\\.rds$", dir(examples_dir))]
example_surveys <- read_surveys(file.path(examples_dir, my_rds_files))

pull_survey(example_surveys, id = "ZA5913")
Import a survey from a csv file.
read_csv(
  file,
  id = NULL,
  doi = NULL,
  header = FALSE,
  sep = "",
  quote = "\"'",
  dec = ".",
  numerals = c("allow.loss", "warn.loss", "no.loss"),
  na.strings = "NA",
  skip = 0,
  check.names = TRUE,
  strip.white = FALSE,
  blank.lines.skip = TRUE,
  stringsAsFactors = FALSE,
  fileEncoding = "",
  encoding = "unknown"
)
file |
A path to a file to import. |
id |
An identifier of the tibble, if omitted, defaults to the file name without its extension. |
doi |
An optional document object identifier. |
A tibble, data frame variant with survey attributes.
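The id argument is documented as defaulting to the file name without its extension. One way to derive such an identifier in base R is sketched below. The path is hypothetical, and the package may compute the default differently.

```r
# Hypothetical file path, used only for illustration.
path <- file.path("surveys", "ZA7576.csv")

# Drop the directory part, then the extension, to get a default identifier.
default_id <- tools::file_path_sans_ext(basename(path))
default_id
```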
Other import functions: pull_survey(), read_dta(), read_rds(), read_spss(), read_surveys()
path <- system.file("examples", "ZA7576.rds", package = "retroharmonize")
read_survey <- read_rds(path)

attr(read_survey, "id")
attr(read_survey, "filename")
attr(read_survey, "doi")
This is a wrapper around haven::read_dta
with some exception handling.
read_dta(file, id = NULL, doi = NULL, .name_repair = "unique")
file |
A STATA file. |
id |
An identifier of the tibble, if omitted, defaults to the file name without its extension. |
doi |
An optional document object identifier. |
.name_repair |
Defaults to |
'read_dta()' reads '.dta' files.
The function is not yet tested.
A tibble.
Variable labels are stored in the "label" attribute of each variable. It is not printed on the console, but the RStudio viewer will show it.
'write_sav()' returns the input 'data' invisibly.
Other import functions: pull_survey(), read_csv(), read_rds(), read_spss(), read_surveys()
path <- system.file("examples", "iris.dta", package = "haven")
read_dta(path)
Import a survey from an rds file.
read_rds(file, id = NULL, doi = NULL)
file |
A path to a file to import. |
id |
An identifier of the tibble, if omitted, defaults to the file name without its extension. |
doi |
An optional document object identifier. |
A tibble, data frame variant with survey attributes.
Other import functions: pull_survey(), read_csv(), read_dta(), read_spss(), read_surveys()
path <- system.file("examples", "ZA7576.rds", package = "retroharmonize")
read_survey <- read_rds(path)

attr(read_survey, "id")
attr(read_survey, "filename")
attr(read_survey, "doi")
This is a wrapper around haven::read_spss
with some exception handling.
read_spss(file, user_na = TRUE, id = NULL, doi = NULL, .name_repair = "unique")
file |
An SPSS file. |
user_na |
Should user-defined na_values be imported? Defaults
to |
id |
An identifier of the tibble, if omitted, defaults to the file name without its extension. |
doi |
An optional document object identifier. |
.name_repair |
Defaults to |
'read_sav()' reads both '.sav' and '.zsav' files; 'write_sav()' creates '.zsav' files when 'compress = TRUE'. 'read_por()' reads '.por' files. 'read_spss()' uses either 'read_por()' or 'read_sav()' based on the file extension.
When the SPSS file has columns which are of class labelled, but have no labels, they are read as numeric or character vectors.
A tibble:
Variable labels are stored in the "label" attribute of each variable. It is not printed on the console, but the RStudio viewer will show it.
'write_sav()' returns the input 'data' invisibly.
Other import functions: pull_survey(), read_csv(), read_dta(), read_rds(), read_surveys()
path <- system.file("examples", "iris.sav", package = "haven")
haven::read_sav(path)

tmp <- tempfile(fileext = ".sav")
haven::write_sav(mtcars, tmp)
haven::read_sav(tmp)
Import surveys into a list or several .rds
files.
read_surveys(survey_paths, .f = NULL, export_path = NULL)

read_survey(file_path, .f = NULL, export_path = NULL)
survey_paths |
A vector of (full) file paths that contain the surveys to import. |
.f |
A function to import the surveys with.
Defaults to |
export_path |
Defaults to |
Use read_survey for a single survey and read_surveys for several surveys
in a loop. The function handles exceptions caused by wrong file names and
unreadable files. If a file cannot be read, a message is printed, and an empty
survey is added to the list in the place of this file.
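The error-handling behaviour described above (print a message, keep an empty placeholder in the list) can be sketched with base R's tryCatch. This is only an illustration of the pattern, not the package's code: safe_read is a hypothetical helper, base readRDS stands in for the importing function, and an empty data.frame stands in for the empty survey.

```r
# Hypothetical helper: try to read one file and fall back to an empty
# data frame with a message if reading fails.
safe_read <- function(path) {
  tryCatch(
    suppressWarnings(readRDS(path)),
    error = function(e) {
      message("Could not read ", path, ": ", conditionMessage(e))
      data.frame()
    }
  )
}

# One readable file and one path that does not exist.
good <- tempfile(fileext = ".rds")
saveRDS(data.frame(rowid = 1:3), good)
bad <- file.path(tempdir(), "does_not_exist.rds")

# The loop keeps going after the failure, so the list keeps its positions.
surveys <- lapply(c(good, bad), safe_read)
vapply(surveys, nrow, integer(1))
```

Keeping a placeholder at the failed position means that the returned list stays aligned with the input vector of paths.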
A list of the surveys or a vector of the saved file names.
Each element of the list is a data-frame-like survey object in which some
metadata, such as the original file name, the doi identifier if present, and
other information, is recorded for a reproducible workflow.
survey
Other import functions: pull_survey(), read_csv(), read_dta(), read_rds(), read_spss()
file1 <- system.file("examples", "ZA7576.rds", package = "retroharmonize")
file2 <- system.file("examples", "ZA5913.rds", package = "retroharmonize")

read_surveys(c(file1, file2), .f = 'read_rds')
The goal of retroharmonize
is to facilitate retrospective (ex-post)
harmonization of data, particularly survey data, in a reproducible manner.
The package provides tools for organizing the metadata, standardizing the
coding of variables, variable names and value labels, including missing
values, and for documenting all transformations, with the help of
comprehensive S3 classes.
Read data stored in formats with rich metadata, such as SPSS (.sav) files,
and make them usable in a programmatic context.
read_spss: read an SPSS file and record metadata for reproducibility.
read_rds: read an rds file and record metadata for reproducibility.
read_surveys: programmatically read a list of surveys.
pull_survey: pull a single survey from a survey list.
subset_surveys: remove variables from surveys that cannot be harmonized.
harmonize_survey_variables: Create a list of surveys with harmonized variable names.
codebook_create: A not yet working function.
codelist_create: A not yet working function.
Create consistent coding and labelling.
harmonize_values: Harmonize the label list across surveys.
harmonize_survey_values: Create a list of surveys with harmonized value labels.
na_range_to_values: Make the na_range attributes, as imported from SPSS, consistent with the na_values attributes.
label_normalize: removes special characters, whitespace, and other typical typing errors, and helps to make labels and variable names uniform.
merge_surveys: Create a list of surveys with harmonized names and variable labels.
crosswalk_surveys: Create a list of surveys with harmonized variable names, harmonized value labels and harmonized R classes.
crosswalk: Create a joined data frame of surveys with harmonized variable names, harmonized value labels and harmonized R classes.
metadata_create: Create a metadata data frame from one or more surveys.
metadata_survey_create: Create a joined metadata data frame from one survey.
create_codebook and codebook_waves_create
crosswalk_table_create: Create an initial crosswalk table from a metadata data frame.
Make the workflow reproducible by recording the harmonization process.
document_survey_item: Returns a list of the current and historic coding, labelling of the valid range and missing values or range, the history of the variable names and the history of the survey IDs.
document_surveys: Document the key attributes of the surveys in a survey list.
Consistently treat labels and SPSS-style user-defined missing values in the R language.
survey helps construct a valid survey data frame, and labelled_spss_survey helps create a vector for a questionnaire item.
as_numeric: convert to numeric values.
as_factor: convert labels to factor levels.
as_character: convert labels to characters.
as_labelled_spss_survey: convert labelled and labelled_spss vectors to labelled_spss_survey vectors.
This is a wrapper function for various procedures to reduce the size of surveys by removing variables that are not harmonized.
subset_surveys(
  survey_list,
  survey_paths = NULL,
  rowid = "rowid",
  subset_name = "subset",
  subset_vars = NULL,
  crosswalk_table = NULL,
  import_path = NULL,
  export_path = NULL
)

subset_waves(waves, subset_vars = NULL)

subset_save_surveys(
  crosswalk_table,
  subset_name = "subset",
  survey_list = NULL,
  subset_vars = NULL,
  survey_paths = NULL,
  import_path = NULL,
  export_path = NULL
)
survey_list |
A list of surveys imported with |
survey_paths |
A vector of full file paths to the surveys to subset. |
rowid |
The unique row (observation) identifier in the files. Defaults to
|
subset_name |
An identifier for the survey subset. |
subset_vars |
The names of the variables that should be kept from all surveys in the list that contains the
wave of surveys. Defaults to |
crosswalk_table |
A crosswalk table created by |
waves |
A list of surveys imported with |
This function allows several workflows.
Subsetting can be based on a vector of variable names given by subset_vars, or on a crosswalk_table.
subset_save_surveys can be called directly.
subset_surveys will also harmonize the variable names if var_name_target is optionally defined in the crosswalk_table input.
harmonize_survey_variables is a wrapper and requires that the new (target) variable names are present in a valid crosswalk_table.
A list of surveys, or individual rds files saved to the export_path.
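The idea of subsetting by a vector of variable names can be illustrated in base R. The sketch below uses plain data frames with made-up values and is not the package's implementation. It keeps, in each survey, only those requested variables that the survey actually contains.

```r
# Two toy surveys with partly different variables (made-up data).
surveys <- list(
  wave1 = data.frame(rowid = 1:2, isocntry = c("NL", "PL"),
                     qa10_1 = c(1, 2), extra = c(9, 9)),
  wave2 = data.frame(rowid = 1:2, isocntry = c("HU", "SK"),
                     qa14_1 = c(2, 1))
)

# Variables to keep across all surveys.
keep <- c("rowid", "isocntry", "qa10_1", "qa14_1")

# Keep only the requested variables that exist in each survey;
# drop = FALSE guarantees that the result stays a data frame.
subsetted <- lapply(surveys, function(s) {
  s[, intersect(names(s), keep), drop = FALSE]
})
lapply(subsetted, names)
```

Using intersect avoids an error when a requested variable is absent from one of the surveys, which is common when waves use different questionnaire items.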
examples_dir <- system.file("examples", package = "retroharmonize")
survey_list <- dir(examples_dir)[grepl("\\.rds", dir(examples_dir))]
example_surveys <- read_surveys(file.path(examples_dir, survey_list))

subset_surveys(
  survey_list = example_surveys,
  subset_vars = c("rowid", "isocntry", "qa10_1", "qa14_1"),
  subset_name = "subset_example"
)
Store the data of a survey in a tibble (data frame) with a unique survey identifier, import filename, and optional document object identifier.
survey(
  object = data.frame(),
  id = character(),
  filename = character(),
  doi = character()
)

is.survey(object)

## S3 method for class 'survey'
summary(object, ...)
object |
A tibble or data frame that contains the survey data. |
id |
A mandatory identifier for the survey. |
filename |
The import file name. |
doi |
Optional document object identifier (doi), can be omitted. |
... |
Arguments passed to summary method. |
Whilst you can create a survey object with this helper function, it is most likely that
you will receive it with an importing function, i.e. read_rds, read_spss, read_dta,
read_csv or their common wrapper read_survey.
A tibble with id
, filename
, doi
metadata information.
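The examples of the importing functions read this metadata with attr(). How such metadata can travel with a data frame is sketched below in base R. This is an assumption-laden illustration and not the survey constructor itself, which may also set a class and validate its inputs.

```r
# A plain data frame standing in for survey data (made-up values).
df <- data.frame(rowid = 1:6, observations = runif(6))

# Attach the identifying metadata as attributes.
attr(df, "id") <- "example"
attr(df, "filename") <- "no_file"
attr(df, "doi") <- NA_character_  # optional, here left missing

# The metadata can be read back without touching the data columns.
attr(df, "id")
attr(df, "filename")
names(df)
```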
example_survey <- survey(
  object = data.frame(rowid = 1:6, observations = runif(6)),
  id = "example",
  filename = "no_file"
)