Survey data harmonization refers to procedures that improve the comparability or the inferential capacity of data from multiple surveys. Ex ante survey harmonization refers to planning and design steps that ensure that questionnaires not yet answered can be better compared, or that the data derived from them can be joined and integrated. Such procedures include the harmonization of the questionnaire, the harmonization of the sample design, and other aspects of carrying out multiple surveys. Ex post, or retrospective, harmonization refers to procedures applied to data that has already been derived from surveys, i.e., from surveys that have been carried out.
Naturally, better ex ante harmonization makes eventual data integration or data comparison easier; yet we can often still retrospectively harmonize survey data that was not carefully pre-harmonized before respondents answered the questionnaire items.
Our aim with the retroharmonize R package is to provide assistance to a reproducible research workflow in carrying out important computational aspects of retrospective survey harmonization.
Let’s start with a very simple example.
library(labelled)
survey_1 <- data.frame(
sex = labelled(c(1,1,0, NA_real_), c(Male = 1, Female = 0))
)
attr(survey_1, "id") <- "Survey 1"
survey_2 <- data.frame(
gender = labelled(c(1,3,9,1,2), c(male = 1, female = 2, other = 3, declined = 9))
)
attr(survey_2, "id") <- "Survey 2"
library(dplyr, quietly = TRUE)
survey_1 %>%
mutate ( sex_numeric = as_numeric(.data$sex),
sex_factor = as_factor(.data$sex))
#> sex sex_numeric sex_factor
#> 1 1 1 Male
#> 2 1 1 Male
#> 3 0 0 Female
#> 4 NA NA <NA>
The ordering of the survey harmonization workflow is flexible, and it is likely that even the same researcher would choose a different workflow in the case of smaller, simpler harmonization tasks and more complex harmonization tasks.
The data science aspect of a successful survey harmonization task is the creation of a consistent data frame that contains harmonized information from multiple surveys. In practice, this means that questionnaire items are mapped into variables with a consistent numerical coding, consistent descriptive metadata (variable and value labels), and a consistent handling of missing and special values. This can be a very laborious task when surveys are conducted in different years and saved in different file formats with different metadata structures, when missing and special values are handled differently, and when the metadata contains different natural language descriptions or spellings.
Survey 1 labels the sex of respondents as Male and Female, and has cases that are neither Male nor Female, but we do not know why.
survey_2 %>%
mutate ( gender_numeric = as_numeric(.data$gender),
gender_factor = as_factor(.data$gender))
#> gender gender_numeric gender_factor
#> 1 1 1 male
#> 2 3 3 other
#> 3 9 9 declined
#> 4 1 1 male
#> 5 2 2 female
Survey 2 records gender, which contains the same information as sex in Survey 1 (Male and Female), but allows people to identify as Other, and labels cases when people decline to identify with any of these three categories.
In practice, you want to end up with the following joined representation of your surveys:
survey_joined <- data.frame(
id = c(1,2,3,4,1,2,3,4,5),
survey = c(rep(1,4), rep(2, 5)),
gender = labelled(c(1,1,0,9, 1,3,9,1,0), c(male = 1, female = 0, other = 3, declined = 9))
)
survey_joined %>%
mutate ( id = paste0("survey_", .data$survey, "_", .data$id),
gender_numeric = c(1,1,0,NA_real_, 1,3,NA_real_,1,0),
gender_factor = as_factor(.data$gender),
is_female = ifelse (.data$gender_numeric == 0, 1, 0))
#> id survey gender gender_numeric gender_factor is_female
#> 1 survey_1_1 1 1 1 male 0
#> 2 survey_1_2 1 1 1 male 0
#> 3 survey_1_3 1 0 0 female 1
#> 4 survey_1_4 1 9 NA declined NA
#> 5 survey_2_1 2 1 1 male 0
#> 6 survey_2_2 2 3 3 other 0
#> 7 survey_2_3 2 9 NA declined NA
#> 8 survey_2_4 2 1 1 male 0
#> 9 survey_2_5 2 0 0 female 1
We want to join survey_1 with survey_2, or, put differently, we want to concatenate survey_1$sex with survey_2$gender.
survey_1$sex may come with a variable label such as SEX OF RESPONDENT, and survey_2$gender may be labelled as GENDER IDENTIFICATION. This label should be harmonized to Sex or gender of the respondent.
survey_2$gender coded with a numeric 2 must be changed to a numeric 0.
survey_1$sex Female respondents and survey_2$gender female respondents will be consistently labelled as female.
survey_1$sex and survey_2$gender can be technically concatenated, but before harmonization this will create logical errors, because females will be coded either with 0 or with 2. The as_numeric() and as_factor() methods of our labelled_spss_survey class handle consistency issues.
data.frame(). It contains various descriptive metadata about the survey among attributes.
Joining the unharmonized datasets results in the following data frame.
library(dplyr)
survey_1 %>%
mutate ( survey = 1,
sex_numeric = as_numeric(.data$sex),
sex_factor = as_factor(.data$sex)) %>%
full_join(
survey_2 %>%
mutate ( survey = 2,
gender_numeric = as_numeric(.data$gender),
gender_factor = as_factor(.data$gender))
)
#> Joining with `by = join_by(survey)`
#> sex survey sex_numeric sex_factor gender gender_numeric gender_factor
#> 1 1 1 1 Male NA NA <NA>
#> 2 1 1 1 Male NA NA <NA>
#> 3 0 1 0 Female NA NA <NA>
#> 4 NA 1 NA <NA> NA NA <NA>
#> 5 NA 2 NA <NA> 1 1 male
#> 6 NA 2 NA <NA> 3 3 other
#> 7 NA 2 NA <NA> 9 9 declined
#> 8 NA 2 NA <NA> 1 1 male
#> 9 NA 2 NA <NA> 2 2 female
Performing only variable name harmonization yields a data frame that has the correct dimensions, but is not usable for statistical analysis.
library(dplyr)
survey_var_harmonized <- survey_1 %>%
rename ( gender = "sex" ) %>%
mutate ( survey = 1,
gender_numeric = as_numeric(.data$gender),
gender_factor = as_factor(.data$gender)) %>%
full_join(
survey_2 %>%
mutate ( survey = 2,
gender_numeric = as_numeric(.data$gender),
gender_factor = as_factor(.data$gender)),
by = c("gender", "survey", "gender_numeric", "gender_factor")
)
#> Warning: `gender` and `gender` have conflicting value labels.
#> ℹ Labels for these values will be taken from `gender`.
#> ✖ Values: 1
Apart from the simple, descriptive survey identification variable, none of the descriptive statistics are meaningful.
summary(survey_var_harmonized)
#> gender survey gender_numeric gender_factor
#> Min. :0.00 Min. :1.000 Min. :0.00 Female :1
#> 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.00 Male :2
#> Median :1.00 Median :2.000 Median :1.00 male :2
#> Mean :2.25 Mean :1.556 Mean :2.25 female :1
#> 3rd Qu.:2.25 3rd Qu.:2.000 3rd Qu.:2.25 other :1
#> Max. :9.00 Max. :2.000 Max. :9.00 declined:1
#> NA's :1 NA's :1 NA's :1
The value labels must be harmonized for a successful factor representation. The numerical coding must be harmonized, and the missing cases must be handled consistently, to achieve any useful numerical representation.
survey_joined %>%
mutate ( id = paste0("survey_", .data$survey, "_", .data$id),
gender_numeric = c(1,1,0,NA_real_, 1,3,NA_real_,1,0),
gender_factor = as_factor(.data$gender),
female_ratio = ifelse (.data$gender_numeric == 0, 1, 0)) %>%
summary()
#> id survey gender gender_numeric
#> Length:9 Min. :1.000 Min. :0.000 Min. :0.0
#> Class :character 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.5
#> Mode :character Median :2.000 Median :1.000 Median :1.0
#> Mean :1.556 Mean :2.778 Mean :1.0
#> 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.0
#> Max. :2.000 Max. :9.000 Max. :3.0
#> NA's :2
#> gender_factor female_ratio
#> female :2 Min. :0.0000
#> male :4 1st Qu.:0.0000
#> other :1 Median :0.0000
#> declined:2 Mean :0.2857
#> 3rd Qu.:0.5000
#> Max. :1.0000
#> NA's :2
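The recoding steps behind such a joined representation can be sketched in base R. This is only a minimal illustration and not the retroharmonize implementation: it works on plain numeric vectors instead of labelled classes, the object names (harmonized_labels, codes_1, codes_2, harmonized) are made up for the example, and it assumes, as survey_joined does, that the unexplained missing case in Survey 1 is treated as declined.

```r
# Raw codes as they appear in the two example surveys
sex_1    <- c(1, 1, 0, NA)     # Survey 1: Male = 1, Female = 0
gender_2 <- c(1, 3, 9, 1, 2)   # Survey 2: male = 1, female = 2, other = 3, declined = 9

# Target code book shared by both surveys (hypothetical name)
harmonized_labels <- c(female = 0, male = 1, other = 3, declined = 9)

# Survey 1: assumption, following survey_joined: missing is coded as declined (9)
codes_1 <- ifelse(is.na(sex_1), 9, sex_1)
# Survey 2: female is recoded from 2 to 0
codes_2 <- ifelse(gender_2 == 2, 0, gender_2)

# Concatenate only after the codes agree
harmonized <- data.frame(
  survey = rep(c(1, 2), times = c(length(codes_1), length(codes_2))),
  gender = c(codes_1, codes_2)
)

# Factor representation with one consistent set of value labels
harmonized$gender_factor <- factor(
  harmonized$gender,
  levels = unname(harmonized_labels),
  labels = names(harmonized_labels)
)

# Numeric representation: declined is a special value, so it becomes NA
harmonized$gender_numeric <- ifelse(harmonized$gender == 9, NA, harmonized$gender)
harmonized$is_female <- ifelse(harmonized$gender_numeric == 0, 1, 0)

table(harmonized$gender_factor)
mean(harmonized$is_female, na.rm = TRUE)
```

The factor counts and the share of female respondents among the valid answers (2 out of 7) agree with the summary above, which is the point of harmonizing codes, labels and missing values before concatenation.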
The data importing functions make sure that survey data and metadata are carefully translated to R data classes and variable types.
The metadata functions help the analysis, normalization and joining of the metadata aspects (variable and value labels, original variable names, unique identifiers) across surveys.
Harmonization functions help the harmonization of responses to questionnaire items, i.e. making sure that coded values, the labelling of values, and missing data are handled consistently across multiple surveys.
Our package was tested on multiple international, harmonized surveys, particularly the Eurobarometer, the Afrobarometer and the Arab Barometer survey programs. Different users and different tasks call for different workflows, so we created a number of helper functions to assist various workflows.
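To give a feel for what comparing metadata across surveys involves, the following base R sketch collects the value labels of the two example surveys into one table, normalizes their spelling, and flags labels that are coded differently between surveys. It is an illustration under stated assumptions, not one of the package's helper functions: code_books, label_table and conflicting are hypothetical names, and the code books are typed in by hand rather than read from survey files.

```r
# Hypothetical code books: value labels as named numeric vectors
code_books <- list(
  survey_1 = c(Male = 1, Female = 0),
  survey_2 = c(male = 1, female = 2, other = 3, declined = 9)
)

# One row per survey and value label
label_table <- do.call(rbind, lapply(names(code_books), function(id) {
  labels <- code_books[[id]]
  data.frame(
    survey     = id,
    label_orig = names(labels),
    label_norm = tolower(trimws(names(labels))),  # normalize case and whitespace
    code       = unname(labels),
    stringsAsFactors = FALSE
  )
}))

# Count the distinct codes used for each normalized label
codes_per_label <- tapply(
  label_table$code, label_table$label_norm,
  function(x) length(unique(x))
)

# Labels that need recoding before the surveys can be concatenated
conflicting <- names(codes_per_label)[codes_per_label > 1]
conflicting
#> [1] "female"
```

Here only female is flagged, because it is coded 0 in Survey 1 and 2 in Survey 2; male shares the code 1 once the spelling is normalized.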