---
title: "Case Study: Working With Afrobarometer surveys"
output: rmarkdown::html_vignette
resource_files:
  - vignettes/ab_plot1.png
vignette: >
  %\VignetteIndexEntry{Case Study: Working With Afrobarometer surveys}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, include=FALSE}
## https://github.com/tidyverse/rvest/blob/master/vignettes/selectorgadget.Rmd
requireNamespace("png", quietly = TRUE)
embed_png <- function(path, dpi = NULL) {
  meta <- attr(png::readPNG(path, native = TRUE, info = TRUE), "info")
  if (!is.null(dpi)) meta$dpi <- rep(dpi, 2)
  knitr::asis_output(paste0(
    ""
  ))
}
```

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(retroharmonize)
library(dplyr)
library(tidyr)
load(file = system.file(
  "afrob", "afrob_vignette.rda",
  package = "retroharmonize"))
```

The goal of this case study is to explore the variation in trust in various state institutions among African societies, as well as changes in trust over time. To do this, we use data from [Afrobarometer](https://afrobarometer.org/data/), a cross-national survey project measuring attitudes and opinions about democracy, the economy and society in African countries based on general population samples. Currently, seven rounds of the Afrobarometer are available, covering the period between 1999 and 2019.

`retroharmonize` is not affiliated with Afrobarometer. To fully reproduce this example, you must acquire the data files from them, which is free of charge.

Afrobarometer data is protected by copyright. Authors of any published work based on Afrobarometer data or papers are required to acknowledge the source, including, where applicable, citations to data sets posted on this website.
Please acknowledge the copyright holders in all publications resulting from its use by means of bibliographic citation in this form:

In this vignette, we harmonize data from three rounds: [Afrobarometer Data Round 5 (all countries)](https://afrobarometer.org/data/merged-round-5-data-34-countries-2011-2013-last-update-july-2015), [Afrobarometer Data Round 6 (all countries)](https://afrobarometer.org/data/merged-round-6-data-36-countries-2016) and [Afrobarometer Data Round 7 (all countries)](https://afrobarometer.org/data/merged-round-7-data-34-countries-2019).

Some elements of the vignette are not “live”, because we want to avoid re-publishing the original microdata files from Afrobarometer, but you can access the data directly from the [www.afrobarometer.org website](https://afrobarometer.org/).

File names (to identify file versions):

- Round 5: `merged-round-5-data-34-countries-2011-2013-last-update-july-2015.sav`
- Round 6: `merged_r6_data_2016_36countries2.sav`
- Round 7: `r7_merged_data_34ctry.release.sav`

For reproducibility, we store only a small subsample of the files and their metadata. We assume that you have extracted and copied all `.sav` files into a single folder, which this vignette calls `afrobarometer_dir`. Define `afrobarometer_dir` with `file.path()` on your own system.

## Importing Afrobarometer Files

We start by reading in the three rounds of the Afrobarometer.
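
The import step expects an `afrobarometer_dir` object to exist. A minimal sketch of defining it, assuming a hypothetical `data-raw/afrobarometer` folder under your working directory (the folder name is an assumption for illustration, not something the package requires):

```r
# Hypothetical location of the downloaded .sav files; adjust to your system.
afrobarometer_dir <- file.path("data-raw", "afrobarometer")

# List the SPSS files that the import step would pick up.
# dir() returns character(0) if the folder is missing or empty.
sav_files <- dir(afrobarometer_dir, pattern = "sav$")
length(sav_files)
```

If `length(sav_files)` is not 3, check the folder before calling `read_surveys()`.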
```{r setup, eval=FALSE}
library(retroharmonize)
library(dplyr)
library(tidyr)
```

```{r import, eval=FALSE}
### use your own directory here
ab <- dir ( afrobarometer_dir, pattern = "sav$" )
afrobarometer_rounds <- file.path(afrobarometer_dir, ab)

ab_waves <- read_surveys(afrobarometer_rounds, .f='read_spss')
```

Let's assign more meaningful identifiers than the file names:

```{r set-id, eval=FALSE}
# assumes the files were listed in round order (5, 6, 7)
attr(ab_waves[[1]], "id") <- "Afrobarometer_R5"
attr(ab_waves[[2]], "id") <- "Afrobarometer_R6"
attr(ab_waves[[3]], "id") <- "Afrobarometer_R7"
```

We can check that the main descriptive metadata is present with `document_surveys()`.

```{r document-ab, eval=FALSE}
documented_ab_waves <- document_surveys(ab_waves)
```

```{r}
print(documented_ab_waves)
```

Next, we create a metadata table, or data map, that contains information about the variable names and labels, and in which each row corresponds to one variable in the survey data file. We do this with the function `metadata_survey_create()`, which extracts the metadata from the survey data files, normalizes variable labels, and identifies ranges of substantive responses and missing value codes. Its wrapper, `metadata_create()`, loops the function over a list of surveys, or over a set of files containing surveys.

```{r create-metadata, eval=FALSE}
ab_metadata <- metadata_create ( survey_list = ab_waves )
```

## Working with metadata

From the metadata file we select only those rows that correspond to the variables that we're interested in: `rowid`, the unique case identifier; `DATEINTR`, the interview date; `COUNTRY`, the country where the interview was conducted; `REGION`, the region (sub-national unit) where the interview was conducted; and `withinwt`, the weighting factor.

Further, we select all variables that have the word "trust" in the normalized variable label.

For these variables, we create normalized variable names (`var_name`) and labels (`var_label`).
```{r, selection, eval=FALSE}
library(dplyr)
to_harmonize <- ab_metadata %>%
  filter ( var_name_orig %in%
             c("rowid", "DATEINTR", "COUNTRY", "REGION", "withinwt") |
             grepl("trust ", label_orig ) ) %>%
  mutate ( var_label = var_label_normalize(label_orig)) %>%
  mutate ( var_label = case_when (
    grepl("^unique identifier", var_label) ~ "unique_id",
    TRUE ~ var_label)) %>%
  mutate ( var_name = val_label_normalize(var_label))
```

To further reduce the size of the analysis, we keep only a few variables, selected by their normalized variable names (`var_name`).

```{r}
to_harmonize <- to_harmonize %>%
  filter (
    grepl ( "president|parliament|religious|traditional|unique_id|weight|country|date_of_int", var_name)
    )
```

The resulting table with information about the variables selected for harmonization looks as follows:

```{r}
head(to_harmonize %>%
       select ( all_of(c("id", "var_name", "var_label"))), 10)
```

The `merge_surveys()` function harmonizes the variable names, the variable labels and the survey identifiers, and returns a list of surveys (of class `survey()`). The parameter `var_harmonization` must be a list or a data frame that contains at least the original file name (`filename`), the original variable names (`var_name_orig`), the new variable names (`var_name`) and their labels (`var_label`), so that the program knows which variables to take from which files, and how to name and label them after transformation.

```{r merge, eval=FALSE}
merged_ab <- merge_surveys ( survey_list = ab_waves,
                             var_harmonization = to_harmonize )

# country will be a character variable, and doesn't need a label
merged_ab <- lapply ( merged_ab,
                      FUN = function(x) x %>%
                        mutate( country = as_character(country)))
```

Review the most important metadata with `document_surveys()`:

```{r, eval=FALSE}
documenteded_merged_ab <- document_surveys(merged_ab)
```

```{r}
print(documenteded_merged_ab)
```

## Harmonization

In the Afrobarometer, trust is measured with four-point ordinal rating scales.
Such data are best analyzed with ordinal models, which do not assume that the points are equidistant. However, to get a quick idea of what the data look like, we will assign the numbers 0-3, with 0 corresponding to the least trust and 3 corresponding to the most trust, and for the time being analyze the data as if they were metric.

To review the harmonization on a single survey, use `pull_survey()`. Here we select Afrobarometer Round 6.

```{r check, eval=FALSE}
R6 <- pull_survey ( merged_ab, id = "Afrobarometer_R6" )
```

```{r pulled-attributes}
attributes(R6$trust_president[1:20])
```

The `document_survey_item()` function shows the metadata of a single variable.

```{r document-item}
document_survey_item(R6$trust_president)
```

Afrobarometer's SPSS files do not mark the missing values, so we have to be careful. The following labels are identified as missing:

```{r}
collect_na_labels( to_harmonize )
```

The valid category labels and missing value labels are as follows:

```{r}
collect_val_labels (to_harmonize %>%
                      filter ( grepl( "trust", var_name) ))
```

We create a harmonization function from the `harmonize_values()` prototype function. In fact, this just resets the default values of the original function. This makes it easier to reference in pipelines, and it can be applied to a single question block only, in this case the variables selected with `starts_with("trust")`.

```{r specify}
harmonize_ab_trust <- function(x) {
  # regular expressions (from), harmonized labels (to) and their numeric codes
  label_list <- list(
    from = c("^not", "^just", "^somewhat",
             "^a", "^don", "^ref", "^miss", "^not", "^inap"),
    to = c("not_at_all", "little", "somewhat",
           "a_lot", "do_not_know", "declined", "inap", "inap",
           "inap"),
    numeric_values = c(0,1,2,3, 99997, 99998, 99999,99999, 99999)
  )

  harmonize_values(
    x,
    harmonize_labels = label_list,
    na_values = c("do_not_know"=99997,
                  "declined"=99998,
                  "inap"=99999)
  )
}
```

We apply this function to the trust variables. The `harmonize_surveys()` function binds all variables that are present in all surveys.
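
The `from` and `numeric_values` entries in `harmonize_ab_trust()` pair a regular expression with a numeric code. The matching idea can be sketched in base R on made-up labels. This illustrates the principle only; it is not how `harmonize_values()` is implemented internally, and the first-match rule used here is an assumption of the sketch.

```r
# Made-up response labels, as they might appear in a survey file.
labels <- c("Not at all", "Just a little", "Somewhat", "A lot",
            "Don't know", "Refused")

# Pattern -> code pairs, mirroring part of the mapping in the vignette.
from  <- c("^not", "^just", "^somewhat", "^a", "^don", "^ref")
codes <- c(0, 1, 2, 3, 99997, 99998)

# For each label, return the code of the first pattern that matches
# (case-insensitive); labels that match no pattern become NA.
recode_label <- function(label) {
  hit <- which(vapply(from, grepl, logical(1), x = label, ignore.case = TRUE))
  if (length(hit) == 0) NA_real_ else codes[hit[1]]
}

harmonized <- vapply(labels, recode_label, numeric(1), USE.NAMES = FALSE)
harmonized
```

Anchoring each pattern with `^` keeps, for example, `"^a"` from matching the "a" inside "Just a little".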
```{r harmonize, eval=FALSE}
harmonized_ab_waves <- harmonize_surveys (
  survey_list = merged_ab,
  .f = harmonize_ab_trust )
```

Let's look at the attributes of `harmonized_ab_waves`.

```{r, eval=FALSE}
h_ab_structure <- attributes(harmonized_ab_waves)
```

```{r}
h_ab_structure$row.names <- NULL # We have over 100K row names
h_ab_structure
```

Let's add the year of the interview:

```{r year, eval=FALSE}
harmonized_ab_waves <- harmonized_ab_waves %>%
  mutate ( year = as.integer(substr(as.character(
    date_of_interview),1,4)))
```

To keep our example manageable, we subset the datasets to include only five countries.

```{r, eval=FALSE}
harmonized_ab_waves <- harmonized_ab_waves %>%
  filter ( country %in% c("Niger", "Nigeria", "Algeria",
                          "South Africa", "Madagascar"))
```

## Analyzing the harmonized data

The harmonized data can be exported and analyzed in another statistical program. The labelled survey data is stored in `labelled_spss_survey()` vectors, a complex class that retains metadata for reproducibility. Most statistical R packages do not support this class. The data should be presented either as numeric data with `as_numeric()` or as categorical data with `as_factor()`. (For the reasons why you should not fall back on the more generic `as.factor()` or `as.numeric()` methods, see [The labelled_spss_survey class vignette.](https://retroharmonize.dataobservatory.eu/articles/labelled_spss_survey.html))

Please note that the numeric form of these trust variables is not directly comparable with the numeric averages of the Eurobarometer trust variables, because the middle of the range is at `r mean(0:3)` and not `r mean(0:1)`.
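
Why missing-value codes must be handled before averaging can be seen in a base R sketch with invented values. It uses plain numeric vectors rather than the package's `labelled_spss_survey` class; only the codes 99997, 99998 and 99999 are taken from this vignette.

```r
# Invented trust scores on the 0-3 scale, with two coded missing values.
trust_raw <- c(0, 3, 2, 99997, 1, 99999, 3)
na_codes  <- c(99997, 99998, 99999)

# A naive mean treats the missing codes as real measurements.
naive_mean <- mean(trust_raw)

# Recode the missing codes to NA, then average the substantive answers only.
trust_clean <- ifelse(trust_raw %in% na_codes, NA_real_, trust_raw)
clean_mean  <- mean(trust_clean, na.rm = TRUE)  # 1.8

c(naive = naive_mean, clean = clean_mean)
```

The naive mean falls far outside the 0-3 range, while the cleaned mean stays within it.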
```{r numeric}
harmonized_ab_waves %>%
  mutate_at ( vars(starts_with("trust")),
              ~as_numeric(.)*within_country_weighting_factor) %>%
  select ( -all_of("within_country_weighting_factor") ) %>%
  group_by ( country, year ) %>%
  summarize_if ( is.numeric, mean, na.rm=TRUE )
```

And the factor representation, without weighting:

```{r factor}
library(tidyr)  ## tidyr::pivot_longer()
harmonized_ab_waves %>%
  select ( -all_of("within_country_weighting_factor") ) %>%
  mutate_if ( is.labelled_spss_survey, as_factor) %>%
  pivot_longer ( starts_with("trust"),
                 names_to  = "institution",
                 values_to = "category") %>%
  mutate ( institution = gsub("^trust_", "", institution) ) %>%
  group_by ( country, year, institution, category ) %>%
  summarize ( n = n() )
```

We subset the datasets to keep the vignette within its size limits. If you follow our code without reducing the number of countries, you get the following results:

```{r, out.width='95%', echo=FALSE}
knitr::include_graphics('ab_plot1.png')
```
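
The numeric summary in this section multiplies each answer by its weight and then takes the plain mean. An alternative, not used in this vignette, is a normalized weighted mean that divides by the sum of the weights of the non-missing answers. A minimal base R sketch on invented data (the data frame and its column names are assumptions for illustration):

```r
# Invented toy data: trust scores (0-3) and within-country weights.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  trust   = c(0, 3, NA, 2, 1),
  weight  = c(0.5, 1.5, 1.0, 1.0, 1.0)
)

# Weighted mean per country, dropping rows with a missing score
# so that their weights do not enter the denominator.
weighted_trust <- sapply(split(toy, toy$country), function(d) {
  ok <- !is.na(d$trust)
  weighted.mean(d$trust[ok], d$weight[ok])
})
weighted_trust
```

The two approaches agree when the weights of the non-missing answers average to 1 within each group, and can differ otherwise.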