--- title: "Working With Regional, Sub-National Statistical Products" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Working With Regional, Sub-National Statistical Products} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r backrgoundsetup, message=FALSE, warning=FALSE, echo=FALSE} devtools::load_all() library(dplyr) ``` ```{r setup, message=FALSE, eval=FALSE} library(regions) library(dplyr) ``` Most economic, social, and environmental situations and developments have a specific territorial dimension. When a statistic is computed to describe a population parameter, the computation is often carried out for a population within the boundaries of this territory. Probably the most widely used and cited statistics are defined within the scope of national boundaries. The population of France relates to the counting of people living within the boundaries of France (unless we specifically count French citizens living outside these boundaries.) Most national boundaries are stable, often throughout centuries. The number of national boundaries is approaching 200, which means that global datasets have boundary changes almost every year _somewhere_. Just the current European Union has witnessed several such changes in the last decades with the unification of Germany, the breakup of current member states Czechia and Slovakia, and the breakup of the former Yugoslavia to several current and prospective member states. When we are working with long-term data panels, obviously we have to make sure that we are comparing Czechia, Slovakia correctly with Czechoslovakia. National boundary changes require complicated diplomatic negotiations and they are meant to be difficult, but changes of boundaries within a national boundary, i.e. the changes of sub-national or regional boundaries is very frequent. In fact, it is so frequent that there is no consistent way of describing them globally. Probably the most comprehensive international system, the NUTS of the European Union contains dozens of changes every three years. It describes currently 1845 territorial units, but far more are being used within the 27 member states of the EU. [The NUTS is de facto used in a number of non-EU member states, too, and much of these uses will be part of the NUTS2021 definition.] Unless we work with shapefiles that contain the exact boundaries of territorial units, and we have access to microdata to replicate the statistical computation over varying boundaries, for example, before and after a regional border change, we cannot be sure that we are comparing exactly the same thing. The population of a region may have increased because more people were born or moved there than those who moved away or died, or simply because the boundaries were extended and new people who were earlier assigned to a different region became part of the count. To make sure that we are working with the same territorial units, i.e. countries, provinces, (member or federal) states, administrative units, we must unambiguously attribute observation of our statistics to a territorial unit. This requires the use of controlled vocabularies as geographical identifiers. Geographical names are not particularly useful identifiers in international datasets, because the names are subject to many changes in term of spelling, renaming, character codes or natural language that describes them and so on. The geographical names of the European Union use various extensions of the Latin alphabet, furthermore Cyril and Greek characters. In global datasets, we must deal with a lot more characters. A machine-readable, controlled vocabulary, the geocode is used to describe the geographical territories with Latin alphanumerical combinations that are usually easy to use programmatically, too, as variable names. The ISO standard for geographical territories is the `ISO-3166-1` standard for national boundaries and `ISO-3166-2` standard for sub-national divisions, such a province in the Netherlands, states in the USA or Australia, or Land in Germany. The `ISO-3166-2` is rather often used for statistical purposes; however, it is mainly a public administration metadata that is very impractical for statistical use. Its greatest disadvantage is that it is not well versioned. Because of the very frequent changes before in sub-national boundaries, we can almost never be sure if a certain `ISO-3166-2` refers to the same boundary. Usually, it refers to a same legal entity, for example, a county with its own local regulations, budgets and contracts; municipalities however switch country borders. For legal purposes, the `ISO-3166-2` is good. For one-time, cross sectional comparison of data, the `ISO-3166-2` is usable. For any statistical computation that is periodically repeated, it is almost useless. The European Union uses its own nomenclature, the `NUTS`, whose geocodes, and geographical names are controlled vocabularies with about new editions every three years. When boundaries or names are changing, usually the geocode is also changing for disambiguation. Furthermore, the NUTS is a hierarchical system, which is very helpful for missing data imputation. It is a good system to join statistical data tables across time, for example, to compare the regional employment data of the European Union in 2010 and 2020. Yet it is surprisingly difficult to use. A typical workflow is this: 1. Validate the geographical metadata (for which periods do you have data where you know which the territory was used for the statistical computation?) 2. Filter unchanged regions – these can be joined from data tables. 3. Recode regions that whose boundaries did not change to one single controlled vocabulary, such as the current, 2016 edition of `NUTS`; failing to do so will result in smaller data tables, and loss of observations that were recoded. 4. Aggregate and disaggregate unambiguous changes to the units in the `NUTS2016` – in these cases all the statistical information is present in different form, and any other imputation method would result in loss of existing data. 5. Impute or estimate other regions, using the structural information about the remaining data. This will result in far more superior data than imputation techniques designed for non-aggregated, individual data points; usually a large part of the information is present in statistical tables that can be used for proximation. When you are working with individual data, you are omitting steps 2-4. When working with national data, especially with small datasets, we can often join data tables by a country name or country code without the problems of 2-4, but even in these cases, imputation rules like using the median value for missing cases is obviously wrong. If the missing value represents a very small member state like Malta, which has a total population smaller than most capital cities in Europe, then using a median or mean country value will very much overestimate a count data for Malta. There are two main international efforts are taking place to standardize sub-national statistics: on the level of the European Union and on the level of the OECD. Our programmatic solution aims to work with both international data sources, and as much as possible, with any country's data that uses the `ISO-3166-2` standard. However, at this stage we have only included the typology used by the European Union in helper functions and metadata (typological vocabulary) files. ## Using a Consistent Typology The European Union harmonises the national statistical systems of 27 members states, 4 EEA countries and several other potential member states. Its regional typology, the hierarchical NUTS system is the most complex international system, and the regions package is mainly developed to help working with European regional data. However, the tasks related to such regional typologies are not unique the European Union. They are present in almost any national statistical system, and they can be similarly complex in federal states like the United States or Australia. The [Methodological manual on territorial typologies 2018 edition](https://op.europa.eu/en/publication-detail/-/publication/d96ef933-1ec5-11e9-8d04-01aa75ed71a1/language-en/format-PDF/source-106441306) introduces three types of typologies. * Regional typologies on NUTS1 level (states of Germany, provinces, macroregions), `NUTS2` level (regions) and `NUTS3` level (small regions.) Similar typologies exist in the United States or Australia. * Local Administrative Units (LAUs) which exist in almost all nation states that are sufficiently large, and have a territorially divided public administration and statistical system. * Grid typologies which are beyond the scope of `regions`. The availability of the sub-national statistical products is lower in terms of data table completeness or timeliness. This is often due to the fact that the creation of sub-national statistical products poses significant methodological challenges. The changes of national or provincial boundaries are rare – within a single polity. However, when we work with global data panels, or data panels of many sub-national territorial units, boundary changes are rather frequent. On a national level, just within Europe we have witnessed in the last three decades the dissolution of federal states like the Soviet Union, Yugoslavia and Czechoslovakia and the unification of Germany. Especially the break-up of the former Yugoslavia took a long time, meaning that annual national statistics contain different configurations over the years for Serbia, Montenegro and Kosovo. On sub-national levels, within Europe regional boundary changes are so frequent that they happen on average every three years. Of course, at one time, this procedure leaves most regional boundaries intact, and change only a relatively small number of boundaries. Whenever we work with European regional data, we still have to adjust panels of statistical aggregates every three years; similarly, we must validate any regional time series every three years. In either case, producers of statistical products do not necessarily re-cast historical statistical data for the new national or sub-national boundaries. For example, when the regional boundary definition within Central Hungary (`HU1`) changed in 2016, Eurostat did not require the Hungarian national statistical authorities to re-cast Hungary's `NUTS2` level data according to the new boundary definition, i.e. creating separate observations for `HU10` (Budapest) and `HU12` (surrounding _Pest county_.) Earlier `HU101` (*Budapest*) on `NUTS3` level became a `NUTS2` region under the _same name_. Furthermore, the former `NUTS2` region _Közép-Magyarország_ or _Central Hungary_ still exists on `NUTS1` level under the code `HU1`. Although you will not find a history for the `HU10` code before 2017, in most cases, the very same data about Budapest can be found under the code `HU101` in `NUTS3` datasets, or from national statistical sources that follow Hungary's `ISO-3166-2` borders. To put it another way, the HU-BP `ISO-3166-2` code, Budapest, did not change, it was just reassigned from the `NUTS3` typology to the `NUTS2` typology, which is reflecting the size of Budapest better. Budapest can be better compared with `NUTS2` regions than `NUTS3` small regions. ### Validation A very important and basic task is the validation of our data: which data is valid in a certain typology? For example, if we have a data source from 2014 that describes data within Europe in 2014 and we currently download data from the Eurostat website in 2020, we may find a discrepancy in how the regions are described. The dataset from 2014 is likely to represent sub-national territories according to their `NUTS2013` boundary definition, and the new dataset according to the currently valid `NUTS2016` boundary definition. We may not be able to correctly join the two datasets. The `validate` function family helps in finding out to conformity of your regional coding with various typologies. In the first CRAN release, only Eurostat typologies are coded, but some of the functions can be used with other typologies, for example, with the `ISO-3166-1` and `ISO-3166-2` typologies. OECD uses the NUTS typologies for the EU member states, so these typologies are the most frequently used in international statistics. Eurostat offers so-called correspondence tables to follow boundary changes, recoding and relabelling for all NUTS changes since the formalization of the NUTS typology. Unfortunately, these Excel tables do not conform with the requirements of tidy data, and their vocabulary for is not standardized, either. For example, recoding changes are often labelled as recoding, recoding and renaming, code change, Code change, etc. The `data-raw` library contains these Excel tables and very long data wrangling code that unifies the relevant vocabulary of these Excel files and brings the tables into a single, tidy format , starting with the definition `NUTS1999`. The resulting data file `nuts_changes` is included in the `regions` package. It already contains the changes that will come into force in 2021. The vignette [Validating Your Typology](https://regions.dataobservatory.eu/articles/validation.html) gives a real-life example to the use of these functions. Validations are the starting points for [Recoding and re-labelling](#recoding), [Aggregation And Re-Aggregation ](#aggregation) and [Disaggregation And Projection](#projection) operations. ### Recoding and re-labelling {#recoding} In the European Union, the most frequent regional statistical change is `recoding` and `relabelling`. Such changes do not affect the statistical data, only the statistical metadata. For example, the overseas department and region of France, _Réunion_, was officially renamed _La Réunion_ in 2010: this is `relabelling`. In earlier sources, you may find data about this region under the identifier _Réunion_, and later you may find it extended with the French article *La* *Réunion*. But the data from _Réunion_ in 2009 and *La* *Réunion* in 2012 refers to the same territory. Furthermore, when France changed its administrative boundaries, it changed their official code abbreviations. Although the typology of _La Réunion_ did not change, to indicate that a dataset is comparable across French departments according to the new administrative borders (which did change in many other regions) the geolocational code of this region changed from `FR94` to `FRA4` in 2013 and to `FRY4` in 2016. This is `recoding`. The geocodes of Eurostat are elements of a `controlled metadata vocabulary`. To avoid confusion, when a part of the vocabulary changes, for example, when France is changing some elements in its NUTS regions and their coding, it often changes the complete vocabulary, including for unchanged elements, for clarity. This is why `FR94` becomes `FRA4` even though it is describing the same overseas _department_ and _region_, _La Réunion_. If you want to work with French or international data that include any statistical information about the overseas departments, you must make sure that observations coded with `FR94`, `FRA4`, `FRY4` are treated as a time-series of the _La Réunion_ and not as different entities. Names are often dubious identifiers. _La Réunion_ remains in the `ISO-3166-1` and `ISO-3166-2` libraries _Réunion_, which is no longer used as name in `NUTS`. What makes working with `ISO` and `NUTS` vocabularies is that while NUTS is "versioned", ISO is not, so a correspondence for a given year, for example, for the year 2013 is almost impossible to create. This is why the use of Eurostats NUTS geo codes, which are usually unique in a given year, should be preferred over ISO. Many of Eurostat’s regional datasets contain recoded geographical IDs that violate the definition of tidy data, because under the same `geo` column they may contain observations for `FR94`, `FRA4`, `FRY4`. However, this is not a tidy description of the territory. Like in our `nuts_changes` data, they should be reported in three different columns, `code_2010`, `code_2013` and `code_2016`. The descriptive metadata code, i.e. row identifier `FR94` is compatible with `FR71` for example, but not with `FRY4`. If you are joining data from different tables, you must not join by their `geo` code, but by the explicit `code_2013` geo codes for the observation described or aggregated according to the `NUTS2013` regional boundaries and `code_2016` for the `NUTS2016`. In other words, you must make sure that you join by valid elements in a given version of NUTS. Another important note: if you join with the raw `geo` data, you will often have **seemingly missing data**. You may have data available for `FR93`, `FR94`, `FRY3`, `FRY4`, with missingness for the first two in 2016 and missingness for the latter two in 2010, but in fact, these observations or related to two, unchanged territorial units (*Guyane* and _La Réunion_.) You must choose one consistent coding, either using only `FRY4` for _La Réunion_ or `FR93` for _La Réunion_. You can also identify `FR93`, `FRY3` explicitly with the name *Guyane*. However, you must choose between the use of _La Réunion_ or _Réunion_. ```{r fr} data(nuts_changes) overseas <- nuts_changes[ which(nuts_changes$geo_name_2016 %in% c("La Réunion", "Guyane")), ] knitr::kable( overseas[, c("code_2010", "code_2013", "code_2016", "geo_name_2010", "geo_name_2013", "geo_name_2016")] ) ``` This is in fact a simpler recoding than the earlier example of _Budapest_. _La Réunion_ did change its name and code, so you find its data under different labels, but it is still within the `NUTS2` typology. Its comparators are likely to be found in the same statistical table. In the case of _Budapest_, the name did not change, but the change of the geo label reflects the fact that Budapest was moved from the `NUTS3` typology to the `NUTS2` typology, because it can be much better compared with `NUTS2` regions. For data concerning the time period before 2017, you need to look for data about Budapest in datasets that follow the `NUTS3` typology, or in national data sources, but you can fill in these historical data to pre-2017 cells of your `NUTS2` regional data. The size of Budapest did not change that much - it should have been compared with `NUTS2` regions in 2014, like in 2020, and not with smaller `NUTS3` units. See in more detail the vignette [Recoding & Relabelling](https://regions.dataobservatory.eu/articles/recode.html) for our programmatic solution. ## Imputation R has many packages that implement imputation algorithms designed for non-aggregated, individual data. In almost all the cases, these methods are not adequate for territorially aggregated data. With individual data imputation, we usually take a typical individual data point (such as the median value) or hypothesize a probability distribution of unobserved data and pick a hypothetically likely observation. We must never use such methods when data among observations is differently aggregated. Furthermore, very often some elements of the information, or all the information is present in the dataset, and a far less biased estimate can be given to the missing data. ### Disaggregation And Projection {#projection} Often, we have missing data about smaller territorial units, and we would like to impute their wider region’s representative data to a higher density dataset. For example, we do not have data for _Schwabia_, which is a region of the German federal state of _Bavaria_, and therefore we would like to impute the Bavarian (`NUTS1`) level data to this `NUTS2` level missing observation for Schwabia. This is a valid imputation technique for non-aggregating descriptive statistics, such as mean or median values derived from surveys. A typical problem of missingness is that we are comparing `NUTS1` or `NUTS2` level regions of Europe, and we do not find information about the small members states of Cyprus, Estonia, Luxembourg, Malta, because, due to their size, these countries do not have such divisions. Because of their size and territorial integrity, they can be well compared with Schwabia, for example, but the data is missing in a `NUTS2` data table. In this case, we may want to impute the technical `NUTS0` (country) level data into the `NUTS2` data table, creating the technical `NUTS2` data of `CY00`, `EE00`, `LU00` or `MT00`, which is identical to the national `CY`, `EE`, `LU`, `MT` observations. In this case, the imputation is valid to any type of data, even to count data like population, because `EE` `=` (`NUTS0`) `EE0` (`NUTS1`) `=` `EE00` (`NUTS2`). The function `impute_down` creates a generic solution to the problem when we want to project data from a larger territorial unit to a smaller one. For example, some state level data is missing for certain Australian or German federal states, and we want to use the national (average) values for imputation. The function `impute_down_nuts` gives and EU-specific solution, with a single data frame as input. The `mixed_nuts_example` data table shows a mixture of potential problems in a small, fictional dataset with random numbers. For simplicity, we take the fictional, country-level data from Malta, and input it down to all potential sub-divisions. Malta is a very transparent example, because the country has only 2 subdivisions on the lowest, `NUTS3` level, i.e. `MT001` and `MT002`. You can find Malta among countries (technical `NUTS0` level) as `MT`, among `NUTS1` regions as `MT0` (because there are no subdivisions on this level), and among `NUTS2` regions again as `MT00`. For any data that cannot be assigned for a geographical location or area, the technical codes ending with `Z` are used. So, what are our possibilities if we have only country-level `MT`data, but we would like to use it in a `NUTS3` database and impute this variable to both `MT001` and `MT002`. (You can safely do this with average values taken from a survey, but you should not do this with additive values such as the population.) ```{r} data("mixed_nuts_example") mixed_nuts_example %>% dplyr::filter(substr(geo,1,2) == "MT") %>% knitr::kable() ``` Malta offers two technical imputation slots on `NUTS1` level, `MT0` for territorially attributed data, and `MTZ` for non-territorial exceptions. On NUTS2 level, `MT00` and `MTZZ`. And at last, what is a real imputation, you can impute the `MT` = `MT0` = `MT00` values to `MT001`, should you have only data for `MT002`, etc. (In this simple example, we impute the `MT` values to both subdivisions, `MT001` and `MT002`.) ```{r} data("mixed_nuts_example") impute_down_nuts(mixed_nuts_example) %>% dplyr::filter(substr(geo,1,2) == "MT") %>% knitr::kable () ``` ### Aggregation And Re-Aggregation {#aggregation} The European Union sometimes publishes data on `NUTS2` level regions, but not on the larger NUTS1 level macroregions or provinces. In many cases, this information is fully present in the `NUTS2` datasets, and a simple aggregation can produce a NUTS1 level dataset, because each `NUTS2` region is strictly assigned to one broader NUTS1 macroregion. For count-type data, we can simply aggregate the constituent `NUTS2` level data to `NUTS1` level. A frequent problem is the re-assignment of a smaller territorial units, such as lower level administrative unit or a lower NUTS unit to another larger unit, or a pre-existing larger unit. For example, `LT02`, i.e. _Vidurio ir vakarų Lietuvos regionas_ in Lithuania consists of the former `LT00` (undivided) Lithuania minus `LT00A`, i.e. _Vilnius_. `LT01`, or _Sostinės regionas_, i.e. the metropolitan area of _Vilnius_ is a separate region from 2016. If you have national and `NUTS3` or Vilnius municipality level data from Lithuania, you can present it according to the undivided `NUTS2013` definition or the more granular `NUTS2016` definition. The _Southern_ region of Ireland, `IE05` was created in the same year from the earlier `NUTS3` level regions `IE023`, `IE024`, and `IE025`. You can fill in the missing data for earlier years with the simple additive formula `IE05=IE023+IE024+IE025` for count/additive variables, or you can create a re-weighted average for average values. This information can be found in the `change_2016` of the `nuts_changes` dataset. ```{r southern} data(nuts_changes) southern <- nuts_changes[ which(nuts_changes$geo_name_2016 == "Southern"), ] knitr::kable( southern[, c("code_2013", "code_2016", "geo_name_2016", "change_2016")] ) ``` ### Boundary Changes With No Solution In some cases the boundary change cuts across all territorial typologies that re-configuration is impossible from aggregated data. Unless we have access to the individual level data (which is not within scope of the tasks helped by this package), it is not possible to create time-wise comparable data for these units. We must treat the observations on these units logically missing, which may or may not be imputed with any techniques. It is usually helpful to mark territorial observations in panels or time series that have such a structural change. ### Time-series Because of the complexity of sub-national data, there are likely to be missing elements in time-wise comparisons. For example, we may find that some data is available for many years on a German federal level, and for most states, but a particular annual data is missing for Bavaria. It would be tempting to use a simple interpolation of the available data, but often a naïve interpolation is not the best method, because, as we have seen, often the data is available at a different aggregation level or under a different metadata label. It is imperative to first make sure that we exploit all information that is present in a different territorial structure for any given year, and then start inter-temporal imputation with interpolation, or carry forward/backward, forecasting and backcasting algorithms. At last, we may go back to the cross-sectional data, and may fill in some cross-sectional data gaps with interpolated or extrapolated data. Aggregating an estimate for Bavaria from previous or next-years constituent lower level (`NUTS2` or `NUTS3`) data is very likely to be a far better estimate than any imputation results, for example, a simple substitution of a median value from all German or EU NUTS1 units. The territorial aggregation structure gives plenty of exploitable information. ## Boundary definitions Our package uses a number of boundary definitions that are easily expanded. Our functions are generic, i.e. they treat boundary information as an input variable. Well formatted metadata about such typologies can be used with any country. The Online Browsing Platform (OBP) of the International Organization for Standardization contains all coding information, and metadata about recent changes for each country. (See, for example: [ISO 3166 — Codes for the representation of names of countries and their subdivisions: NL - Netherlands (the)](https://www.iso.org/obp/ui/#iso:code:3166:NL)) It would be tempting to use `ISO-3166-2` definitions for sub-national boundaries, however, the `ISO-3166-2` itself is changing relatively often, and in time series or panel data we can only refer to a certain version of the `ISO-3166-2`. All the boundary definitions are valid for a certain time period and certain boundaries. We currently incorporated only European metadata for the last 20 years, and we are planning to continue with OECD metadata in the forthcoming releases. However, we are welcoming any pull requests for other countries or country-groups. # Citations and related work ### Citing the data sources Eurostat data: cite [Eurostat](https://ec.europa.eu/eurostat/). Administrative boundaries: cite [EuroGeographics](https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units). ### Citing the eurostat R package For main developers and contributors, see the [package homepage](https://ropengov.github.io/eurostat/). This work can be freely used, modified and distributed under the BSD-2-clause (modified FreeBSD) license: ```{r citation-eurostat, message=FALSE, eval=TRUE, echo=TRUE} citation("eurostat") ``` ### Citing the regions R package For main developer and contributors, see the [package homepage](https://regions.dataobservatory.eu/). This work can be freely used, modified and distributed under the GPL-3 license: ```{r citation-regions, message=FALSE, eval=TRUE, echo=TRUE} citation("regions") ``` ### Contact For contact information, see the package homepages.