Most economic, social, and environmental situations and developments have a specific territorial dimension. When a statistic is computed to describe a population parameter, the computation is often carried out for a population within the boundaries of this territory.
Probably the most widely used and cited statistics are defined within the scope of national boundaries. The population of France relates to the counting of people living within the boundaries of France (unless we specifically count French citizens living outside these boundaries.)
Most national boundaries are stable, often throughout centuries. The number of national boundaries is approaching 200, which means that global datasets have boundary changes almost every year somewhere. Just the current European Union has witnessed several such changes in the last decades with the unification of Germany, the breakup of current member states Czechia and Slovakia, and the breakup of the former Yugoslavia to several current and prospective member states. When we are working with long-term data panels, obviously we have to make sure that we are comparing Czechia, Slovakia correctly with Czechoslovakia.
National boundary changes require complicated diplomatic negotiations and they are meant to be difficult, but changes of boundaries within a national boundary, i.e. the changes of sub-national or regional boundaries is very frequent. In fact, it is so frequent that there is no consistent way of describing them globally. Probably the most comprehensive international system, the NUTS of the European Union contains dozens of changes every three years. It describes currently 1845 territorial units, but far more are being used within the 27 member states of the EU. [The NUTS is de facto used in a number of non-EU member states, too, and much of these uses will be part of the NUTS2021 definition.]
Unless we work with shapefiles that contain the exact boundaries of territorial units, and we have access to microdata to replicate the statistical computation over varying boundaries, for example, before and after a regional border change, we cannot be sure that we are comparing exactly the same thing. The population of a region may have increased because more people were born or moved there than those who moved away or died, or simply because the boundaries were extended and new people who were earlier assigned to a different region became part of the count.
To make sure that we are working with the same territorial units, i.e. countries, provinces, (member or federal) states, administrative units, we must unambiguously attribute observation of our statistics to a territorial unit. This requires the use of controlled vocabularies as geographical identifiers.
Geographical names are not particularly useful identifiers in international datasets, because the names are subject to many changes in term of spelling, renaming, character codes or natural language that describes them and so on. The geographical names of the European Union use various extensions of the Latin alphabet, furthermore Cyril and Greek characters. In global datasets, we must deal with a lot more characters.
A machine-readable, controlled vocabulary, the geocode is used to describe the geographical territories with Latin alphanumerical combinations that are usually easy to use programmatically, too, as variable names.
The ISO standard for geographical territories is the
ISO-3166-1
standard for national boundaries and
ISO-3166-2
standard for sub-national divisions, such a
province in the Netherlands, states in the USA or Australia, or Land in
Germany. The ISO-3166-2
is rather often used for
statistical purposes; however, it is mainly a public administration
metadata that is very impractical for statistical use. Its greatest
disadvantage is that it is not well versioned. Because of the very
frequent changes before in sub-national boundaries, we can almost never
be sure if a certain ISO-3166-2
refers to the same
boundary. Usually, it refers to a same legal entity, for example, a
county with its own local regulations, budgets and contracts;
municipalities however switch country borders. For legal purposes, the
ISO-3166-2
is good. For one-time, cross sectional
comparison of data, the ISO-3166-2
is usable. For any
statistical computation that is periodically repeated, it is almost
useless.
The European Union uses its own nomenclature, the NUTS
,
whose geocodes, and geographical names are controlled vocabularies with
about new editions every three years. When boundaries or names are
changing, usually the geocode is also changing for disambiguation.
Furthermore, the NUTS is a hierarchical system, which is very helpful
for missing data imputation. It is a good system to join statistical
data tables across time, for example, to compare the regional employment
data of the European Union in 2010 and 2020. Yet it is surprisingly
difficult to use.
A typical workflow is this:
Validate the geographical metadata (for which periods do you have data where you know which the territory was used for the statistical computation?)
Filter unchanged regions – these can be joined from data tables.
Recode regions that whose boundaries did not change to one single
controlled vocabulary, such as the current, 2016 edition of
NUTS
; failing to do so will result in smaller data tables,
and loss of observations that were recoded.
Aggregate and disaggregate unambiguous changes to the units in
the NUTS2016
– in these cases all the statistical
information is present in different form, and any other imputation
method would result in loss of existing data.
Impute or estimate other regions, using the structural information about the remaining data. This will result in far more superior data than imputation techniques designed for non-aggregated, individual data points; usually a large part of the information is present in statistical tables that can be used for proximation.
When you are working with individual data, you are omitting steps 2-4. When working with national data, especially with small datasets, we can often join data tables by a country name or country code without the problems of 2-4, but even in these cases, imputation rules like using the median value for missing cases is obviously wrong. If the missing value represents a very small member state like Malta, which has a total population smaller than most capital cities in Europe, then using a median or mean country value will very much overestimate a count data for Malta.
There are two main international efforts are taking place to
standardize sub-national statistics: on the level of the European Union
and on the level of the OECD. Our programmatic solution aims to work
with both international data sources, and as much as possible, with any
country’s data that uses the ISO-3166-2
standard. However,
at this stage we have only included the typology used by the European
Union in helper functions and metadata (typological vocabulary)
files.
The European Union harmonises the national statistical systems of 27 members states, 4 EEA countries and several other potential member states. Its regional typology, the hierarchical NUTS system is the most complex international system, and the regions package is mainly developed to help working with European regional data. However, the tasks related to such regional typologies are not unique the European Union. They are present in almost any national statistical system, and they can be similarly complex in federal states like the United States or Australia.
The Methodological manual on territorial typologies 2018 edition introduces three types of typologies.
Regional typologies on NUTS1 level (states of Germany, provinces,
macroregions), NUTS2
level (regions) and NUTS3
level (small regions.) Similar typologies exist in the United States or
Australia.
Local Administrative Units (LAUs) which exist in almost all nation states that are sufficiently large, and have a territorially divided public administration and statistical system.
Grid typologies which are beyond the scope of
regions
.
The availability of the sub-national statistical products is lower in terms of data table completeness or timeliness. This is often due to the fact that the creation of sub-national statistical products poses significant methodological challenges.
The changes of national or provincial boundaries are rare – within a single polity. However, when we work with global data panels, or data panels of many sub-national territorial units, boundary changes are rather frequent. On a national level, just within Europe we have witnessed in the last three decades the dissolution of federal states like the Soviet Union, Yugoslavia and Czechoslovakia and the unification of Germany. Especially the break-up of the former Yugoslavia took a long time, meaning that annual national statistics contain different configurations over the years for Serbia, Montenegro and Kosovo.
On sub-national levels, within Europe regional boundary changes are so frequent that they happen on average every three years. Of course, at one time, this procedure leaves most regional boundaries intact, and change only a relatively small number of boundaries. Whenever we work with European regional data, we still have to adjust panels of statistical aggregates every three years; similarly, we must validate any regional time series every three years.
In either case, producers of statistical products do not necessarily
re-cast historical statistical data for the new national or sub-national
boundaries. For example, when the regional boundary definition within
Central Hungary (HU1
) changed in 2016, Eurostat did not
require the Hungarian national statistical authorities to re-cast
Hungary’s NUTS2
level data according to the new boundary
definition, i.e. creating separate observations for HU10
(Budapest) and HU12
(surrounding Pest county.)
Earlier HU101
(Budapest) on NUTS3
level became a NUTS2
region under the same name.
Furthermore, the former NUTS2
region
Közép-Magyarország or Central Hungary still exists on
NUTS1
level under the code HU1
. Although you
will not find a history for the HU10
code before 2017, in
most cases, the very same data about Budapest can be found under the
code HU101
in NUTS3
datasets, or from national
statistical sources that follow Hungary’s ISO-3166-2
borders.
To put it another way, the HU-BP ISO-3166-2
code,
Budapest, did not change, it was just reassigned from the
NUTS3
typology to the NUTS2
typology, which is
reflecting the size of Budapest better. Budapest can be better compared
with NUTS2
regions than NUTS3
small
regions.
A very important and basic task is the validation of our data: which
data is valid in a certain typology? For example, if we have a data
source from 2014 that describes data within Europe in 2014 and we
currently download data from the Eurostat website in 2020, we may find a
discrepancy in how the regions are described. The dataset from 2014 is
likely to represent sub-national territories according to their
NUTS2013
boundary definition, and the new dataset according
to the currently valid NUTS2016
boundary definition. We may
not be able to correctly join the two datasets.
The validate
function family helps in finding out to
conformity of your regional coding with various typologies. In the first
CRAN release, only Eurostat typologies are coded, but some of the
functions can be used with other typologies, for example, with the
ISO-3166-1
and ISO-3166-2
typologies. OECD
uses the NUTS typologies for the EU member states, so these typologies
are the most frequently used in international statistics.
Eurostat offers so-called correspondence tables to follow boundary changes, recoding and relabelling for all NUTS changes since the formalization of the NUTS typology. Unfortunately, these Excel tables do not conform with the requirements of tidy data, and their vocabulary for is not standardized, either. For example, recoding changes are often labelled as recoding, recoding and renaming, code change, Code change, etc.
The data-raw
library contains these Excel tables and
very long data wrangling code that unifies the relevant vocabulary of
these Excel files and brings the tables into a single, tidy format ,
starting with the definition NUTS1999
. The resulting data
file nuts_changes
is included in the regions
package. It already contains the changes that will come into force in
2021.
The vignette Validating Your Typology gives a real-life example to the use of these functions. Validations are the starting points for Recoding and re-labelling, Aggregation And Re-Aggregation and Disaggregation And Projection operations.
In the European Union, the most frequent regional statistical change
is recoding
and relabelling
. Such changes do
not affect the statistical data, only the statistical metadata.
For example, the overseas department and region of France,
Réunion, was officially renamed La Réunion in 2010:
this is relabelling
. In earlier sources, you may find data
about this region under the identifier Réunion, and later you
may find it extended with the French article La
Réunion. But the data from Réunion in 2009 and
La Réunion in 2012 refers to the same territory.
Furthermore, when France changed its administrative boundaries, it
changed their official code abbreviations. Although the typology of
La Réunion did not change, to indicate that a dataset is
comparable across French departments according to the new administrative
borders (which did change in many other regions) the geolocational code
of this region changed from FR94
to FRA4
in
2013 and to FRY4
in 2016. This is
recoding
.
The geocodes of Eurostat are elements of a
controlled metadata vocabulary
. To avoid confusion, when a
part of the vocabulary changes, for example, when France is changing
some elements in its NUTS regions and their coding, it often changes the
complete vocabulary, including for unchanged elements, for clarity. This
is why FR94
becomes FRA4
even though it is
describing the same overseas department and region,
La Réunion.
If you want to work with French or international data that include
any statistical information about the overseas departments, you must
make sure that observations coded with FR94
,
FRA4
, FRY4
are treated as a time-series of the
La Réunion and not as different entities.
Names are often dubious identifiers. La Réunion remains in
the ISO-3166-1
and ISO-3166-2
libraries
Réunion, which is no longer used as name in NUTS
.
What makes working with ISO
and NUTS
vocabularies is that while NUTS is “versioned”, ISO is not, so a
correspondence for a given year, for example, for the year 2013 is
almost impossible to create. This is why the use of Eurostats NUTS geo
codes, which are usually unique in a given year, should be preferred
over ISO.
Many of Eurostat’s regional datasets contain recoded geographical IDs
that violate the definition of tidy data, because under the same
geo
column they may contain observations for
FR94
, FRA4
, FRY4
. However, this
is not a tidy description of the territory. Like in our
nuts_changes
data, they should be reported in three
different columns, code_2010
, code_2013
and
code_2016
. The descriptive metadata code, i.e. row
identifier FR94
is compatible with FR71
for
example, but not with FRY4
.
If you are joining data from different tables, you must not join by
their geo
code, but by the explicit code_2013
geo codes for the observation described or aggregated according to the
NUTS2013
regional boundaries and code_2016
for
the NUTS2016
. In other words, you must make sure that you
join by valid elements in a given version of NUTS.
Another important note: if you join with the raw geo
data, you will often have seemingly missing data. You
may have data available for FR93
, FR94
,
FRY3
, FRY4
, with missingness for the first two
in 2016 and missingness for the latter two in 2010, but in fact, these
observations or related to two, unchanged territorial units
(Guyane and La Réunion.) You must choose one
consistent coding, either using only FRY4
for La
Réunion or FR93
for La Réunion. You can also
identify FR93
, FRY3
explicitly with the name
Guyane. However, you must choose between the use of La
Réunion or Réunion.
data(nuts_changes)
overseas <- nuts_changes[
which(nuts_changes$geo_name_2016 %in% c("La Réunion", "Guyane")),
]
knitr::kable(
overseas[, c("code_2010", "code_2013", "code_2016",
"geo_name_2010", "geo_name_2013", "geo_name_2016")]
)
code_2010 | code_2013 | code_2016 | geo_name_2010 | geo_name_2013 | geo_name_2016 |
---|---|---|---|---|---|
FR93 | FRA3 | FRY3 | Guyane | Guyane | Guyane |
FR94 | FRA4 | FRY4 | Réunion | La Réunion | La Réunion |
FR930 | FRA30 | FRY30 | Guyane | Guyane | Guyane |
FR940 | FRA40 | FRY40 | Réunion | La Réunion | La Réunion |
This is in fact a simpler recoding than the earlier example of
Budapest. La Réunion did change its name and code, so
you find its data under different labels, but it is still within the
NUTS2
typology. Its comparators are likely to be found in
the same statistical table. In the case of Budapest, the name
did not change, but the change of the geo label reflects the fact that
Budapest was moved from the NUTS3
typology to the
NUTS2
typology, because it can be much better compared with
NUTS2
regions. For data concerning the time period before
2017, you need to look for data about Budapest in datasets that follow
the NUTS3
typology, or in national data sources, but you
can fill in these historical data to pre-2017 cells of your
NUTS2
regional data. The size of Budapest did not change
that much - it should have been compared with NUTS2
regions
in 2014, like in 2020, and not with smaller NUTS3
units.
See in more detail the vignette Recoding & Relabelling for our programmatic solution.
R has many packages that implement imputation algorithms designed for non-aggregated, individual data. In almost all the cases, these methods are not adequate for territorially aggregated data. With individual data imputation, we usually take a typical individual data point (such as the median value) or hypothesize a probability distribution of unobserved data and pick a hypothetically likely observation. We must never use such methods when data among observations is differently aggregated. Furthermore, very often some elements of the information, or all the information is present in the dataset, and a far less biased estimate can be given to the missing data.
Often, we have missing data about smaller territorial units, and we
would like to impute their wider region’s representative data to a
higher density dataset. For example, we do not have data for
Schwabia, which is a region of the German federal state of
Bavaria, and therefore we would like to impute the Bavarian
(NUTS1
) level data to this NUTS2
level missing
observation for Schwabia. This is a valid imputation technique for
non-aggregating descriptive statistics, such as mean or median values
derived from surveys.
A typical problem of missingness is that we are comparing
NUTS1
or NUTS2
level regions of Europe, and we
do not find information about the small members states of Cyprus,
Estonia, Luxembourg, Malta, because, due to their size, these countries
do not have such divisions. Because of their size and territorial
integrity, they can be well compared with Schwabia, for example, but the
data is missing in a NUTS2
data table. In this case, we may
want to impute the technical NUTS0
(country) level data
into the NUTS2
data table, creating the technical
NUTS2
data of CY00
, EE00
,
LU00
or MT00
, which is identical to the
national CY
, EE
, LU
,
MT
observations. In this case, the imputation is valid to
any type of data, even to count data like population, because
EE
=
(NUTS0
) EE0
(NUTS1
) =
EE00
(NUTS2
).
The function impute_down
creates a generic solution to
the problem when we want to project data from a larger territorial unit
to a smaller one. For example, some state level data is missing for
certain Australian or German federal states, and we want to use the
national (average) values for imputation.
The function impute_down_nuts
gives and EU-specific
solution, with a single data frame as input. The
mixed_nuts_example
data table shows a mixture of potential
problems in a small, fictional dataset with random numbers. For
simplicity, we take the fictional, country-level data from Malta, and
input it down to all potential sub-divisions.
Malta is a very transparent example, because the country has only 2
subdivisions on the lowest, NUTS3
level,
i.e. MT001
and MT002
. You can find Malta among
countries (technical NUTS0
level) as MT
, among
NUTS1
regions as MT0
(because there are no
subdivisions on this level), and among NUTS2
regions again
as MT00
. For any data that cannot be assigned for a
geographical location or area, the technical codes ending with
Z
are used.
So, what are our possibilities if we have only country-level
MT
data, but we would like to use it in a NUTS3
database and impute this variable to both MT001
and
MT002
. (You can safely do this with average values taken
from a survey, but you should not do this with additive values such as
the population.)
data("mixed_nuts_example")
mixed_nuts_example %>%
dplyr::filter(substr(geo,1,2) == "MT") %>%
knitr::kable()
geo | values | method |
---|---|---|
MT | 93.3514 | actual |
Malta offers two technical imputation slots on NUTS1
level, MT0
for territorially attributed data, and
MTZ
for non-territorial exceptions. On NUTS2 level,
MT00
and MTZZ
. And at last, what is a real
imputation, you can impute the MT
= MT0
=
MT00
values to MT001
, should you have only
data for MT002
, etc. (In this simple example, we impute the
MT
values to both subdivisions, MT001
and
MT002
.)
data("mixed_nuts_example")
impute_down_nuts(mixed_nuts_example) %>%
dplyr::filter(substr(geo,1,2) == "MT") %>%
knitr::kable ()
geo | values | method | typology | valid_2016 |
---|---|---|---|---|
MT | 93.3514 | country | TRUE | |
MT0 | 93.3514 | imputed from country MT | nuts_level_1 | TRUE |
MT00 | 93.3514 | imputed from country MT | nuts_level_2 | TRUE |
MT001 | 93.3514 | imputed from country MT | nuts_level_3 | TRUE |
MT002 | 93.3514 | imputed from country MT | nuts_level_3 | TRUE |
MTZ | 93.3514 | imputed from country MT | nuts_level_1 | TRUE |
MTZZ | 93.3514 | imputed from country MT | nuts_level_2 | TRUE |
MTZZZ | 93.3514 | imputed from country MT | nuts_level_3 | TRUE |
The European Union sometimes publishes data on NUTS2
level regions, but not on the larger NUTS1 level macroregions or
provinces. In many cases, this information is fully present in the
NUTS2
datasets, and a simple aggregation can produce a
NUTS1 level dataset, because each NUTS2
region is strictly
assigned to one broader NUTS1 macroregion. For count-type data, we can
simply aggregate the constituent NUTS2
level data to
NUTS1
level.
A frequent problem is the re-assignment of a smaller territorial units, such as lower level administrative unit or a lower NUTS unit to another larger unit, or a pre-existing larger unit.
For example, LT02
, i.e. Vidurio ir vakarų Lietuvos
regionas in Lithuania consists of the former LT00
(undivided) Lithuania minus LT00A
, i.e. Vilnius.
LT01
, or Sostinės regionas, i.e. the metropolitan
area of Vilnius is a separate region from 2016. If you have
national and NUTS3
or Vilnius municipality level data from
Lithuania, you can present it according to the undivided
NUTS2013
definition or the more granular
NUTS2016
definition.
The Southern region of Ireland, IE05
was
created in the same year from the earlier NUTS3
level
regions IE023
, IE024
, and IE025
.
You can fill in the missing data for earlier years with the simple
additive formula IE05=IE023+IE024+IE025
for count/additive
variables, or you can create a re-weighted average for average
values.
This information can be found in the change_2016
of the
nuts_changes
dataset.
data(nuts_changes)
southern <- nuts_changes[
which(nuts_changes$geo_name_2016 == "Southern"), ]
knitr::kable(
southern[, c("code_2013", "code_2016", "geo_name_2016", "change_2016")]
)
code_2013 | code_2016 | geo_name_2016 | change_2016 |
---|---|---|---|
NA | IE05 | Southern | new region, made from ex-ie023, ie024 and ie025 |
In some cases the boundary change cuts across all territorial typologies that re-configuration is impossible from aggregated data. Unless we have access to the individual level data (which is not within scope of the tasks helped by this package), it is not possible to create time-wise comparable data for these units. We must treat the observations on these units logically missing, which may or may not be imputed with any techniques. It is usually helpful to mark territorial observations in panels or time series that have such a structural change.
Because of the complexity of sub-national data, there are likely to be missing elements in time-wise comparisons. For example, we may find that some data is available for many years on a German federal level, and for most states, but a particular annual data is missing for Bavaria. It would be tempting to use a simple interpolation of the available data, but often a naïve interpolation is not the best method, because, as we have seen, often the data is available at a different aggregation level or under a different metadata label. It is imperative to first make sure that we exploit all information that is present in a different territorial structure for any given year, and then start inter-temporal imputation with interpolation, or carry forward/backward, forecasting and backcasting algorithms.
At last, we may go back to the cross-sectional data, and may fill in
some cross-sectional data gaps with interpolated or extrapolated data.
Aggregating an estimate for Bavaria from previous or next-years
constituent lower level (NUTS2
or NUTS3
) data
is very likely to be a far better estimate than any imputation results,
for example, a simple substitution of a median value from all German or
EU NUTS1 units. The territorial aggregation structure gives plenty of
exploitable information.
Our package uses a number of boundary definitions that are easily expanded. Our functions are generic, i.e. they treat boundary information as an input variable. Well formatted metadata about such typologies can be used with any country.
The Online Browsing Platform (OBP) of the International Organization for Standardization contains all coding information, and metadata about recent changes for each country. (See, for example: ISO 3166 — Codes for the representation of names of countries and their subdivisions: NL - Netherlands (the))
It would be tempting to use ISO-3166-2
definitions for
sub-national boundaries, however, the ISO-3166-2
itself is
changing relatively often, and in time series or panel data we can only
refer to a certain version of the ISO-3166-2
. All the
boundary definitions are valid for a certain time period and certain
boundaries. We currently incorporated only European metadata for the
last 20 years, and we are planning to continue with OECD metadata in the
forthcoming releases. However, we are welcoming any pull requests for
other countries or country-groups.