Large international survey programs that provide access to their data, such as the aforementioned Eurobarometer, Afrobarometer and Arab Barometer, the European Values Survey, or Latinobarómetro, usually distribute the data as SPSS files. In some cases, Stata files are available, too. Our examples always start with SPSS files.
In R, the haven package provides functions to import and export data to and from IBM’s proprietary SPSS files. SPSS files contain the survey data in a coded form, with the coding (labelling) metadata optionally attached to each variable. We found that while haven::read_spss works perfectly for importing the data and metadata of a single survey, it is not always suitable for multiple survey files, for two important reasons.
First, variables imported with inconsistent labelling cannot always be concatenated. For example, an unlabelled age variable is imported as an R numeric vector, whereas if the age 18 is labelled 18 years or 18 éves, the variable is imported as a labelled class.
Second, SPSS variables do not handle the various missing cases in a complete and unambiguous form. In an age variable, 998 and 999 may be labelled as not asked and declined to answer, or the whole numeric range from 120 to 999 may simply be declared as representing missing cases.
One practical problem in the Eurobarometer surveys, which target Europeans aged 15 and over, is that in some standard questions, such as the age of finishing full-time education, 10 represents a special missing value, whilst in other surveys it may be a perfectly valid numerical answer. The real coding problem is that SPSS users can freely choose between labelling missing cases explicitly, declaring a numerical range for missing cases, or providing no missing case metadata at all.
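The two problems described above can be reproduced on toy vectors with haven alone. The following is a minimal sketch, assuming the haven package is installed; the variable names (age_a, age_b, age_c) and all codes are made up for illustration and do not come from any real survey file.

```r
library(haven)

# Survey A: age stored as plain numbers, without value labels.
age_a <- c(18, 34, 71)

# Survey B: the same question, but one value carries a label and
# 998/999 are declared as user-defined missing values.
age_b <- labelled_spss(
  x = c(18, 52, 998, 999),
  labels = c("18 years" = 18, "not asked" = 998, "declined to answer" = 999),
  na_values = c(998, 999)
)

# Survey C: no explicit codes, only a range of values declared as missing.
age_c <- labelled_spss(
  x = c(25, 40, 130, 999),
  na_range = c(120, 999)
)

class(age_a)   # a base numeric vector
class(age_b)   # a haven labelled class

# is.na() reports user-defined missing values, however they were declared.
is.na(age_b)
is.na(age_c)
```

The three vectors describe the same kind of variable, yet they arrive in R with different classes and different missing value metadata, which is what makes naive concatenation unreliable.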
Our importing functions rely on two new S3 classes. A single survey is imported into a survey() class, which inherits all the properties of a modern data frame, i.e., it is a tibble or tbl_df from tibble in the tidyverse, but it includes as much metadata from the original SPSS file as possible. These metadata attributes are handled so that they facilitate proper documentation and a reproducible workflow. Furthermore, the import converts labelled variables into the retroharmonize_labelled_spss_survey class, which inherits from haven::labelled_spss and handles missing value ranges and labels more consistently. See ?labelled_spss_survey.
Working with our new retroharmonize_labelled_spss_survey class can be cumbersome, particularly for simple harmonization tasks. When harmonizing a single question from two surveys, the full class may not be practical, and a simple crosswalk table can help with spotting and correcting inconsistent codes.
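A crosswalk table of the kind mentioned above can be sketched in base R. The survey identifiers, codes and column names below are hypothetical and only illustrate the idea; they are not part of the package.

```r
# Hypothetical codings of the same yes/no question:
# survey1 uses 1/0/8, survey2 uses 1/2/9.
crosswalk <- data.frame(
  survey     = c("survey1", "survey1", "survey1", "survey2", "survey2", "survey2"),
  code       = c(1, 0, 8, 1, 2, 9),
  harmonized = c("yes", "no", "declined", "yes", "no", "declined"),
  stringsAsFactors = FALSE
)

# Answers pooled from the two surveys, still in their original codes.
answers <- data.frame(
  survey = c("survey1", "survey1", "survey2", "survey2"),
  code   = c(1, 8, 2, 9),
  stringsAsFactors = FALSE
)

# Look up each (survey, code) pair in the crosswalk table.
key_answers   <- paste(answers$survey, answers$code)
key_crosswalk <- paste(crosswalk$survey, crosswalk$code)
answers$harmonized <- crosswalk$harmonized[match(key_answers, key_crosswalk)]

answers
```

Any code that has no row in the crosswalk table comes back as NA in the harmonized column, which makes inconsistent codings easy to spot.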
Our survey objects or retroharmonize_labelled_spss_survey vectors can be converted to base R classes with the as_numeric(), as_factor() or as_character() methods. When computing a numerical average, the special age value of 10 is converted to NA_real_ in the numeric representation. In other statistical applications, missing and special values are best represented as categories, which calls for the factor representation. The character representation is often more useful for visualizing the data than the factor representation.
Use the labelled_spss_survey() helper function to create vectors of class retroharmonize_labelled_spss_survey.
sl1 <- labelled_spss_survey(
  x = c(1, 1, 0, 8, 8, 8),
  labels = c("yes" = 1,
             "no" = 0,
             "declined" = 8),
  label = "Do you agree?",
  na_values = 8,
  id = "survey1")
print(sl1)
#> [1] 1 1 0 8 8 8
#> attr(,"labels")
#> yes no declined
#> 1 0 8
#> attr(,"label")
#> [1] "Do you agree?"
#> attr(,"na_values")
#> [1] 8
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
#> attr(,"survey1_name")
#> [1] "c(1, 1, 0, 8, 8, 8)"
#> attr(,"survey1_values")
#> 0 1 8
#> 0 1 8
#> attr(,"survey1_label")
#> [1] "Do you agree?"
#> attr(,"survey1_labels")
#> yes no declined
#> 1 0 8
#> attr(,"survey1_na_values")
#> [1] 8
#> attr(,"id")
#> [1] "survey1"
You can check the type:
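One way to do this is with base R class inspection. This is a minimal sketch, assuming the retroharmonize package is installed; it relies only on the class names printed in the output above and recreates sl1 so that it runs on its own.

```r
library(retroharmonize)

# The same example vector as above.
sl1 <- labelled_spss_survey(
  x = c(1, 1, 0, 8, 8, 8),
  labels = c("yes" = 1, "no" = 0, "declined" = 8),
  label = "Do you agree?",
  na_values = 8,
  id = "survey1"
)

# Is it one of our survey vectors?
inherits(sl1, "retroharmonize_labelled_spss_survey")

# Does it still inherit from the haven class?
inherits(sl1, "haven_labelled_spss")

# The full class vector.
class(sl1)
```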
The labelled_spss_survey() class inherits some properties from haven::labelled(), which can be manipulated by the labelled package (see particularly the vignette Introduction to labelled by Joseph Larmarange).
It can also be subsetted:
sl1[3:4]
#> [1] 0 8
#> attr(,"labels")
#> yes no declined
#> 1 0 8
#> attr(,"label")
#> [1] "Do you agree?"
#> attr(,"na_values")
#> [1] 8
#> attr(,"class")
#> [1] "retroharmonize_labelled_spss_survey" "haven_labelled_spss"
#> [3] "haven_labelled"
#> attr(,"survey1_name")
#> [1] "c(1, 1, 0, 8, 8, 8)"
#> attr(,"survey1_values")
#> 0 1 8
#> 0 1 8
#> attr(,"survey1_label")
#> [1] "Do you agree?"
#> attr(,"survey1_labels")
#> yes no declined
#> 1 0 8
#> attr(,"survey1_na_values")
#> [1] 8
#> attr(,"id")
#> [1] "survey1"
When used within the modernized version of data.frame, tibble::tibble(), the summary of the variable content prints in an informative way.
df <- tibble::tibble (v1 = sl1)
## Use tibble instead of data.frame(v1=sl1) ...
print(df)
#> # A tibble: 6 × 1
#> v1
#> <retroh_dbl>
#> 1 1 [yes]
#> 2 1 [yes]
#> 3 0 [no]
#> 4 8 (NA) [declined]
#> 5 8 (NA) [declined]
#> 6 8 (NA) [declined]
## ... which inherits the methods of a data.frame
subset(df, v1 == 1)
#> # A tibble: 2 × 1
#> v1
#> <retroh_dbl>
#> 1 1 [yes]
#> 2 1 [yes]
To avoid any confusion with mis-labelled surveys, combining with double or integer vectors results in a plain double or integer vector. The use of vctrs::vec_c is generally safer than base R c().
#double
c(sl1, 1/7)
#> [1] 1.0000000 1.0000000 0.0000000 8.0000000 8.0000000 8.0000000 0.1428571
vctrs::vec_c(sl1, 1/7)
#> [1] 1.0000000 1.0000000 0.0000000 8.0000000 8.0000000 8.0000000 0.1428571
Conversion to character works as expected:
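A minimal sketch of the character conversion, assuming the retroharmonize package is installed and that as_character() behaves as described in this section; sl1 is recreated so that the snippet runs on its own, and chr1 is just an illustrative name.

```r
library(retroharmonize)

# The same example vector as above.
sl1 <- labelled_spss_survey(
  x = c(1, 1, 0, 8, 8, 8),
  labels = c("yes" = 1, "no" = 0, "declined" = 8),
  label = "Do you agree?",
  na_values = 8,
  id = "survey1"
)

# Convert the coded answers to their character representation.
chr1 <- as_character(sl1)

is.character(chr1)
chr1
```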
The base as.factor converts to integer and uses the integers as levels, because base R factors are integers with a levels attribute.
Conversion to factor with as_factor converts the value labels to factor levels:
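The difference between the two factor conversions can be compared side by side. This is a sketch under the assumption that the retroharmonize package is installed and that both conversions behave as described in the text; f_base and f_lab are illustrative names.

```r
library(retroharmonize)

# The same example vector as above.
sl1 <- labelled_spss_survey(
  x = c(1, 1, 0, 8, 8, 8),
  labels = c("yes" = 1, "no" = 0, "declined" = 8),
  label = "Do you agree?",
  na_values = 8,
  id = "survey1"
)

# Base R conversion: the levels are built from the numeric codes.
f_base <- as.factor(sl1)
levels(f_base)

# Label-aware conversion: the levels are built from the value labels.
f_lab <- as_factor(sl1)
levels(f_lab)
```

Comparing the two levels() results shows whether the categories kept their meaning (yes, no, declined) or were reduced to bare codes.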
Similarly, when converting to numeric types, we have to convert the user-defined missing values to the NA values used in the R language. For numerical analysis, convert with as_numeric.
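A minimal sketch of the numeric conversion, assuming the retroharmonize package is installed and that as_numeric() replaces user-defined missing values with NA as described above; sl1 is recreated so that the snippet runs on its own, and num1 is an illustrative name.

```r
library(retroharmonize)

# The same example vector as above; 8 is a user-defined missing value.
sl1 <- labelled_spss_survey(
  x = c(1, 1, 0, 8, 8, 8),
  labels = c("yes" = 1, "no" = 0, "declined" = 8),
  label = "Do you agree?",
  na_values = 8,
  id = "survey1"
)

# Label-aware conversion: the code 8 should become NA.
num1 <- as_numeric(sl1)
num1

# Summary statistics then need to skip the NA values explicitly.
median(num1, na.rm = TRUE)
mean(num1, na.rm = TRUE)

# For comparison, base as.numeric() keeps the raw code 8.
as.numeric(sl1)
```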
The median value is correctly displayed, because user-defined missing values are removed from the calculation. Only a few arithmetic methods are implemented, such as weighted.mean():
weights1 <- runif (n = 6, min = 0, max = 1)
weighted.mean(as.numeric(sl1), weights1)
#> [1] 5.057185
weighted.mean(sl1, weights1)
#> [1] 0.4838951
The result of the conversion to numeric can be used with other mathematical or statistical functions.