--- title: "Finnish personal ID number data toolkit for R (hetu)" author: "Pyry Kantanen, Jussi Paananen, Mans Magnusson, Leo Lahti" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true rmarkdown::md_vignette: toc: true vignette: > %\VignetteIndexEntry{Finnish personal ID number data toolkit for R (hetu)} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The **hetu R package** provides tools to work with Finnish personal identity numbers (hetu, short for the Finnish term "henkilötunnus"). Some functions can also be used with Finnish Business ID numbers (y-tunnus). Where possible, we have unified the syntax with [sweidnumbr](https://github.com/rOpenGov/sweidnumbr). ## Installation Install the current devel version in R: ```{r install, eval=FALSE} devtools::install_github("ropengov/hetu") ``` Test the installation by loading the library: ```{r test, message=FALSE, warning=FALSE, eval=TRUE} library(hetu) ``` We also recommend setting the UTF-8 encoding: ```{r locale, eval=FALSE} Sys.setlocale(locale="UTF-8") ``` ## Introduction Finnish personal identification numbers (Finnish: henkilötunnus, hetu in short), are used to identify citizens. Hetu PIN consists of eleven characters: DDMMYYCZZZQ, where DDMMYY is the day, month and year of birth, C is the century marker, ZZZ is the individual number and Q is the control character. Males have odd and females have even individual number. The control character is determined by dividing DDMMYYZZZ by 31 and using the remainder (modulo 31) to pick up the corresponding character from the string "0123456789ABCDEFHJKLMNPRSTUVWXY". For example, if the remainder is 0, the control character is 0 and if the remainder is 12, the control character is C. A valid individual number is between 002-899. Individual numbers 900-999 are not in normal use and are used only for temporary or artificial PINs. These temporary PINs are sometimes used in different organizations, such as insurance companies or hospitals, if the individual is not a Finnish citizen, a permanent resident or if the exact identity of the individual cannot be determined at the time. Artificial or temporary PINs are not intended for continuous, long term use and they are not usually accepted by PIN validity checking algorithms. Temporary PINs provide similar information about individual's birth date or sex as regular PINs. Temporary PINs can also be safely used for testing purposes, as such a number cannot be linked to any real person. ## Personal identification numbers (HETU) The basic hetu function can be used to view information included in a Finnish personal identification number. The data is outputted as a data frame. ```{r hetu_example1, message=FALSE} example_pin <- "111111-111C" hetu(example_pin) ``` The output can be made prettier, for example by using knitr: ```{r hetu_example2, message=FALSE} knitr::kable(hetu(example_pin)) ``` The hetu function also accepts vectors with several identification numbers as input: ```{r hetu_example3, message=FALSE} example_pins <- c("010101-0101", "111111-111C") knitr::kable(hetu(example_pins)) ``` The hetu function does not print warning messages to the user if input vector contains invalid PINs. Validity of specific PINs can be determined by looking at the valid.pin column. ```{r hetu_example4, error = TRUE, purl = FALSE} hetu(c("010101-0102", "111311-111C", "010101-0101")) ``` ### Extracting specific information Information contained in the PIN can be extracted with a generic extract parameter. Valid values for extraction are *hetu*, *sex*, *personal.number*, *ctrl.char*, *date*, *day*, *month*, *year*, *century*, *valid.pin* and *is.temp*. *is.temp* can be extracted only if allow.temp is set to TRUE. If allow.temp is set to FALSE (default), temporary PINs are filtered from the output and information provided by *is.temp* would be meaningless. ```{r hetuextract1, message=FALSE} hetu(example_pins, extract = "sex") hetu(example_pins, extract = "ctrl.char") ``` Some fields can be extracted with specialized functions. Extracting sex with hetu_sex function: ```{r hetuextract2, message=FALSE, eval=TRUE} hetu_sex(example_pins) ``` Extracting age at current date and at a given date with hetu_age function: ```{r hetuextract3, message=TRUE, eval=TRUE} hetu_age(example_pins) hetu_age(example_pins, date = "2012-01-01") hetu_age(example_pins, timespan = "months") ``` Dates (birth dates) also have their own function, hetu_date. ```{r hetuextract4, message=FALSE, eval=TRUE} hetu_date(example_pins) ``` ### Validity checking The basic hetu function output includes information on the validity of each pin, which can be extracted by using hetu-function with *valid.pin* as extract parameter. The validity of the PINs can also be determined by using the hetu_ctrl function, which produces a vector: ```{r hetu_ctrl_example, fig.message=FALSE} hetu_ctrl(c("010101-0101", "111111-111C")) # TRUE TRUE hetu_ctrl("010101-1010") # FALSE ``` ### Artificial and temporary personal identification numbers The package functions can be made to accept artificial or temporary personal identification numbers. Artificial and temporary PINs can be used normally by allowing them through allow.temp parameter. ```{r hetu_temp1, message=FALSE} example_temp_pin <- "010101A900R" knitr::kable(hetu(example_temp_pin, allow.temp = TRUE)) ``` A vector with regular and temporary PINs mixed together prints only regular PINs, if allow.temp is not set to TRUE. Automatic omitting of temporary PINs does not produce a visible error message and therefore users need to be cautious if they want to use temporary PINs. If temporary PINs are not explicitly allowed and the input vector consists of temporary PINs only, the function will return an error. ```{r hetu_temp2, error = TRUE, purl = FALSE} example_temp_pins <- c("010101A900R", "010101-0101") hetu_ctrl("010101A900R", allow.temp = FALSE) knitr::kable(hetu(example_temp_pins)) ``` When allow.temp is set to TRUE, all PINs are handled as if they were regular PINs. ```{r hetu_temp3, message=FALSE} knitr::kable(hetu(example_temp_pins, allow.temp = TRUE)) hetu_ctrl("010101A900R", allow.temp = TRUE) ``` Validation function hetu_ctrl produces a FALSE for every artificial / temporary PIN, if they are not explicitly allowed. ```{r hetu_temp4, error = TRUE, purl = FALSE} knitr::kable(hetu(example_temp_pins)) #FALSE TRUE knitr::kable(hetu(example_temp_pins, allow.temp = TRUE)) #TRUE TRUE ``` ### Generating random PINs Random PINs can be generated by using the rpin function. ```{r rpin, message=FALSE} rhetu(n = 4) rhetu(n = 4, start.date = "1990-01-01", end.date = "2005-01-01") ``` The number of males in the generated sample can be changed with parameter p.male. Default is 0.4. ```{r rpin2, message=FALSE} random_sample <- rhetu(n = 4, p.male = 0.8) table(random_sample) ``` The default proportion of artificial / temporary PINs is 0.0, meaning that no artificial / temporary PINs are generated by default. ```{r rpin3, message=FALSE} temp_sample <- rhetu(n = 4, p.temp = 0.5) table(hetu(temp_sample, allow.temp = TRUE, extract = "is.temp")) ``` ### Diagnostics In addition to information mentioned in the section [Extracting specific information](#extracting-specific-information), the user can choose to print additional columns containing information about checks done on PINs. The diagnostic checks produce a TRUE or FALSE for the following categories: *valid.p.num*, *valid.checksum*, *correct.checksum*, *valid.date*, *valid.day*, *valid.month*, *valid.year*, *valid.length* and *valid.century*, FALSE meaning that hetu is somehow incorrect. ```{r example_diagnostics1, error = TRUE, purl = FALSE, warning = FALSE} diagnosis_example <- c("010101-0102", "111111-111Q", "010101B0101", "320101-0101", "011301-0101", "010101-01010", "010101-0011") head(hetu(diagnosis_example, diagnostic = TRUE), 3) ``` Diagnostic information can be examined more closely by using subset or by using a separate hetu_diagnostics function. The user can print all diagnostic information for all PINs in the dataset: ```{r example_diagnostics2, error = TRUE, purl = FALSE, warning = FALSE} tail(hetu_diagnostic(diagnosis_example), 3) ``` By using extract parameter, the user can choose which columns will be printed in the output table. Valid extract values are listed in the function's help file. ```{r example_diagnostics3, error = TRUE, purl = FALSE, warning = FALSE} hetu_diagnostic(diagnosis_example, extract = c("valid.century", "correct.checksum")) ``` Because of the way PINs are handled in inside hetu-function, the diagnostics-function can show unexpected warning messages or introduce NAs by coercion if the date-part of the PIN is too long. This may result in inability to handle the PIN at all! ```{r example_diagnostics5, error = TRUE, purl = FALSE, warning = FALSE, eval = FALSE} # Faulty example hetu_diagnostic(c("01011901-01010")) ``` ## Business Identity Codes (BID) The package has also the ability to generate Finnish Business ID codes (y-tunnus) and check their validity. Unlike with personal identification numbers, no additional information can be extracted from Business IDs. ### Generating random BIDs Similar to hetu PINs, random Finnish Business IDs (y-tunnus) can be generated by using rbid function. ```{r rbid_example, message=FALSE} bid_sample <- rbid(3) bid_sample ``` ### BID validity checking The validity of Finnish Business Identity Codes can be checked with a similar function to hetu_ctrl: bid_ctrl. ```{r bid_ctrl_example, fig.message=FALSE} bid_ctrl(c("0737546-2", "1572860-0")) # TRUE TRUE bid_ctrl("0737546-1") # FALSE ``` ## Various examples Data frames generated by hetu function work well with tidyverse/dplyr workflows as well. ```{r hetu_tibbles, eval = FALSE} library(hetu) library(tidyverse) library(dplyr) # Generate data for this example hdat<-tibble(pin=rhetu(n = 4, start_date = "1990-01-01", end_date = "2005-01-01")) # Extract all the hetu information to tibble format hdat<-hdat %>% mutate(result=map(.x=pin,.f=hetu::hetu)) %>% unnest(cols=c(result)) hdat ``` ## Licensing and Citations This work can be freely used, modified and distributed under the open license specified in the [DESCRIPTION file](https://github.com/rOpenGov/hetu/blob/master/DESCRIPTION). Kindly cite the work as follows ```{r citation, message=FALSE, eval=TRUE} citation("hetu") ``` ## References - [The personal identity code](https://dvv.fi/en/personal-identity-code). Digital and population data services agency. - [Valtioneuvoston asetus väestötietojärjestelmästä (128/2010)](https://www.finlex.fi/fi/laki/ajantasa/2010/20100128) (In Finnish). Valtiovarainministeriö. - [HETU-uudistuksen loppuraportti](http://urn.fi/URN:ISBN:978-952-367-296-3) (In Finnish). Valtiovarainministeriön julkaisuja 2020:20. - [The Business Information System (BIS)](https://www.prh.fi/en/kaupparekisteri/rekisterointipalvelut/ytj.html). Finnish Patent and Registration Office. ## Session info This vignette was created with ```{r sessioninfo, message=FALSE, warning=FALSE} sessionInfo() ```