|
1 | 1 | --- |
2 | | -title: "Data errors, how to find them?" |
3 | | -subtitle: "Errorlocate: find and replace erroneous fields in data using validation rules" |
| 2 | +title: "Open Data from CBS using R" |
| 3 | +subtitle: "cbsodataR: find and use public open statistics on NL" |
4 | 4 | author: "Edwin de Jonge" |
5 | | -date: "Statistics Netherlands / uRos 2018" |
| 5 | +date: "Statistics Netherlands / NSC-R 2022" |
6 | 6 | output: |
7 | 7 | beamer_presentation: |
8 | 8 | includes: |
9 | 9 | in_header: tex/header.tex |
10 | 10 | keep_tex: yes |
| 11 | +editor_options: |
| 12 | + markdown: |
| 13 | + wrap: 72 |
11 | 14 | --- |
12 | 15 |
|
13 | 16 | ```{r setup, include=FALSE} |
14 | | -knitr::opts_chunk$set(eval = FALSE) |
15 | | -library(errorlocate) |
16 | | -library(magrittr) |
| 17 | +knitr::opts_chunk$set(eval = TRUE) |
17 | 18 | ``` |
18 | 19 |
|
19 | 20 | ## Who am I? |
20 | 21 |
|
21 | | -- Data scientist / Methodologist at Statistics Netherlands (aka CBS). |
22 | | -- Author of several R-packages, including `whisker`, `validate`, `errorlocate`, `docopt`, `daff`, `tableplot`, `ffbase`, `chunked`, ... |
23 | | -- Co-author of _Statistical Data Cleaning with applications in R (2018)_ (together with @markvdloo) |
| 22 | +- Data scientist / Methodologist at Statistics Netherlands (aka CBS). |
| 23 | +- Author of several R-packages, including `whisker`, `validate`, |
| 24 | + `errorlocate`, `docopt`, `daff`, `chunked`, *`cbsodataR`*, ... |
| 25 | +- Co-author of *Statistical Data Cleaning with applications in |
| 26 | + R (2018)* (together with @markvdloo) |
| 27 | +- Background in theoretical/computational physics, but for many |
| 28 | + years now a "statistician". |
24 | 29 |
|
25 | | -## Data cleaning... |
| 30 | +## Daily work: |
26 | 31 |
|
27 | | -A large part of your job is spent in data-cleaning: |
| 32 | +- Expertise: statistical computing (incl. R and Python), network |
| 33 | + analysis, data visualisation. |
| 34 | +- Complexity Science Research / Population and Enterprise networks. |
| 35 | +- Internal statistical consultant |
28 | 36 |
|
29 | | -- getting your data in the right shape (e.g. `tidyverse`, `dplyr`) |
| 37 | +## CBS |
30 | 38 |
|
31 | | -- assessing missing data (e.g. `VIM`, `datamaid`) |
| 39 | +- Centraal Bureau voor de Statistiek = Statistics Netherlands |
32 | 40 |
|
33 | | -- checking validity (e.g. `validate`) |
| 41 | +- Founded in 1899 |
34 | 42 |
|
35 | | -- locating and removing errors: **`errorlocate`**! |
| 43 | +- Governmental Agency ("ZBO" under EZK) |
36 | 44 |
|
37 | | -- impute values for missing or erroneous data (e.g. `simputation`, `VIM`, `recipes`) |
| 45 | +- publishes official statistics on various topics: |
38 | 46 |
|
39 | | -## Statistical Value Chain |
| 47 | + - demographics |
| 48 | + - economy, GDP (economic growth) |
| 49 | + - education |
| 50 | + - agriculture |
| 51 | + - environment |
| 52 | + - Sustainable Development Goals (SDGs) |
| 53 | + - **crime** |
40 | 54 |
|
41 | | -\begin{center} |
42 | | - \includegraphics[width=\textwidth]{img/valuechain.pdf} |
43 | | -\end{center} |
| 55 | +## CBS open data? |
44 | 56 |
|
45 | | -## {.plain} |
| 57 | +### StatLine, CBS (Statistics Netherlands) |
46 | 58 |
|
47 | | -\begin{center} |
48 | | - \includegraphics[height=1\paperheight]{img/keep-calm-and-validate} |
49 | | -\end{center} |
| 59 | +- Since 1995, Statistics Netherlands has had StatLine, its output |
| 60 | + database: all (tabular) data produced by CBS. |
50 | 61 |
|
| 62 | +- search, select, view and download data |
51 | 63 |
|
52 | | -## Validation rules? |
| 64 | +- StatLine: open data *avant la lettre*: <https://opendata.cbs.nl> |
53 | 65 |
|
54 | | -Package `validate` allows to: |
| 66 | +## What is open data? |
55 | 67 |
|
56 | | -- formulate explicit data rule that data must conform to: |
| 68 | +Open as in: free to download and use (with attribution). |
57 | 69 |
|
58 | | -```{r} |
59 | | -library(validate) |
60 | | -check_that( data.frame(age=160, driver_license=TRUE), |
61 | | - age >= 0, |
62 | | - age < 150, |
63 | | - if (driver_license == TRUE) age >= 16 |
64 | | -) |
65 | | -``` |
| 70 | +But also "machine readable": it should be possible to retrieve data |
| 71 | +and metadata "automagically". This is implemented with a Web API |
| 72 | +(application programming interface). |
66 | 73 |
|
67 | | -## Explicit validation rules: |
| 74 | +## CBS open data: |
68 | 75 |
|
69 | | -- Give a clear overview what the data must conform to. |
70 | | -- Can be used to reason about. |
71 | | -- Can be used to fix/correct data! |
72 | | -- Find error, and when found correct it. |
| 76 | +- Implements an *OData 3* Web API for all StatLine tables (>3500) |
| 77 | + and their metadata. |
| 78 | +- Allows data to be retrieved at any time. |
73 | 79 |
|
74 | | -### Note: |
| 80 | +### What is `cbsodataR`? |
75 | 81 |
|
76 | | -- Manual fix is error prone, not reproducible and not feasible for large data sets. |
77 | | -- Large rule set have (very) complex behavior, e.g. entangled rules: adjusting one value may |
78 | | -invalidate other rules. |
| 82 | +- R wrapper for the Web API to facilitate downloading data. |
| 83 | +- Makes it easy to add metadata |
| 84 | +- Allows for finding and searching data |
| 85 | +- Allows for filtering data. |
79 | 86 |
|
80 | | -## Error localization |
| 87 | +Used by: |
81 | 88 |
|
82 | | -> Error localization is a procedure that points out fields in a data set |
83 | | -that can be altered or imputed in such a way that all validation rules |
84 | | -can be satisfied. |
| 89 | +- government agencies: e.g. CPB, SCP, RIVM, etc. |
| 90 | +- data journalists |
85 | 91 |
|
86 | | -## Find the error: |
| 92 | +## cbsodataR Functionality |
87 | 93 |
|
88 | | -```{r} |
89 | | -library(validate) |
90 | | -check_that( data.frame(age=160, driver_license=TRUE), |
91 | | - age >= 0, |
92 | | - age < 150, |
93 | | - if (driver_license == TRUE) age >= 16 |
94 | | -) |
95 | | -``` |
| 94 | +- Retrieve a list of datasets |
| 95 | +- Find a specific data set |
| 96 | +- Download data |
| 97 | +- Add metadata |
| 98 | +- Filter data |
96 | 99 |
|
97 | | -It is clear that `age` has an erroneous value, but for more complex rule sets |
98 | | -it is less clear. |
| 100 | +## `cbs_get_datasets()` |
99 | 101 |
|
100 | | -## Multivariate example: |
| 102 | +- Retrieve a list of all open data tables and their publication |
| 103 | + metadata \tiny |
101 | 104 |
|
102 | 105 | ```{r} |
103 | | -check_that( data.frame( age = 3 |
104 | | - , married = TRUE |
105 | | - , attends = "kindergarten" |
106 | | - ) |
107 | | - , if (married == TRUE) age >= 16 |
108 | | - , if (attends == "kindergarten") age <= 6 |
109 | | - ) |
| 106 | +library(cbsodataR) |
| 107 | +ds <- cbs_get_datasets() |
| 108 | +ds[1:4, 1:3] |
110 | 109 | ``` |
111 | | -Ok, clear that this is a faulty record, but what is the error? |
112 | | - |
113 | | -## Feligi Holt formalism: |
114 | | - |
115 | | -> Find the minimal (weighted) number of variables that cause the invalidation of the data rules. |
116 | | -
|
117 | | -Makes sense! (But there are exceptions...) |
118 | | - |
119 | | -Implemented in `errorlocate` (second generation of `editrules`). |
120 | | - |
121 | | -## Formal description (1) |
122 | | - |
123 | | -### Rule $r_i(x)$ |
124 | | - |
125 | | -A rule a disjunction of atomic clauses: |
126 | | - |
127 | | -$$ |
128 | | -r_i(\la{x}) = \bigvee_j C_i^j(\la{x}) |
129 | | -$$ |
130 | | -with: |
131 | 110 |
|
132 | | -$$ |
133 | | -C_i^j(\la{x}) = \left\{ |
134 | | - \begin{array}{l} |
135 | | - \la{a}^T\la{x} \leq b \\ |
136 | | - \la{a}^T\la{x} = b \\ |
137 | | - x_j \in F_{ij} \textrm{with } F_{ij} \subseteq D_j \\ |
138 | | - x_j \not\in F_{ij} \textrm{with } F_{ij} \subseteq D_j \\ |
139 | | - \end{array} |
140 | | -\right. |
141 | | -$$ |
| 111 | +## `cbs_search()` |
142 | 112 |
|
143 | | -## Rule system: |
| 113 | +- Retrieve a list of data tables that match the search words |
144 | 114 |
|
145 | | -The rules form a system $R(\la{x})$: |
| 115 | +\small |
146 | 116 |
|
147 | | -$$ |
148 | | -R_H(\la{x}) = \bigwedge_i r_i |
149 | | -$$ |
150 | | -If $R_H(\la{x})$ is true for record $\la{x}$, then the record is valid, otherwise one (or more) of the rules is violated. |
151 | | - |
152 | | -## Mixed Integer Programming to FH |
| 117 | +```{r} |
| 118 | +theft <- cbs_search("diefstal") # "diefstal" is Dutch for "theft" |
| 119 | +theft[1:4, 2:4] |
| 120 | +``` |
153 | 121 |
|
154 | | -Each rule set $R(\la{x})$ can be translated into a mip problem and solved. |
155 | | -$$ |
156 | | -\begin{array}{r} |
157 | | - \textrm{Minimize } f(\mathbf{x}) = 0; \\ |
158 | | - \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \\ |
159 | | -\end{array} |
160 | | -$$ |
| 122 | +## `cbs_get_meta()` |
161 | 123 |
|
162 | | -- $f(\la{x})$ is the (weighted) number of changed variable: $\delta_i \in {0,1}$ |
| 124 | +- Use `Identifier` to retrieve the metadata of a table |
| 125 | +- Contains all metadata (including the borders of the table). |
| 126 | +- Each property is a `data.frame` containing metadata, e.g.: |
163 | 127 |
|
164 | | -$$ |
165 | | -f(\la{x}) = \sum_{i=1}^N w_i \delta_i |
166 | | -$$ |
| 128 | +```{r} |
| 129 | +m <- cbs_get_meta("83648NED") # registered crime |
| 130 | +names(m) |
| 131 | +m$SoortMisdrijf[1:5, 1:2] |
| 132 | +``` |
167 | 133 |
|
168 | | -- $\la{R}$ contains rules: $\la{R}_H(\la{x}) \leq \la{d}_H$ and soft constraints: $\la{R}_0(\la{x}, \la{\delta}) \leq \la{d}_0$ that |
169 | | -try fix the values of $\la{x}$ to the measured values. |
| 134 | +## `cbs_get_data()` |
170 | 135 |
|
171 | | -## `errorlocate` |
| 136 | +- Retrieve data using an `Identifier` |
| 137 | +- Returns a `data.frame` with metadata "embedded" |
172 | 138 |
|
173 | | -- translates your rules automatically into a mip form. |
174 | | -- Uses `lpSolveAPI` to solve the problem. |
175 | | -- contains a small framework for implementing your own error localization algorithms. |
| 139 | +```{r} |
| 140 | +d <- cbs_get_data("83648NED", Perioden="2021JJ00") |
| 141 | +``` |
176 | 142 |
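The filter value `"2021JJ00"` used above is a StatLine period code. As a rough illustration of how such codes are structured, the sketch below decodes them in base R. It assumes the usual layout of a four-digit year, a two-letter type (`JJ` year, `KW` quarter, `MM` month) and a two-digit number. The helper `decode_period()` is hypothetical and not part of `cbsodataR`; the package offers `cbs_add_date_column()` for this purpose.

```r
# Hypothetical helper (not part of cbsodataR): split a StatLine period code
# into its parts, assuming the layout <year><type><number>.
decode_period <- function(code) {
  year   <- as.integer(substr(code, 1, 4))   # e.g. 2021
  type   <- substr(code, 5, 6)               # "JJ", "KW" or "MM"
  number <- as.integer(substr(code, 7, 8))   # 0 for years, else quarter/month

  # Map the two-letter type to a readable frequency; unknown types become NA.
  freq <- c(JJ = "year", KW = "quarter", MM = "month")[type]

  data.frame(
    code = code,
    year = year,
    frequency = unname(freq),
    number = number,
    stringsAsFactors = FALSE
  )
}

periods <- decode_period(c("2021JJ00", "2021KW03", "2021MM11"))
print(periods)
```

Decoding the codes this way makes it easier to see which filter value to pass as `Perioden` when only yearly, quarterly or monthly figures are wanted.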
|
177 | | -## `errorlocate::locate_errors` |
| 143 | +## Intermezzo: Web API |
178 | 144 |
|
179 | | -```{r, eval=TRUE} |
180 | | -locate_errors( data.frame( age = 3 |
181 | | - , married = TRUE |
182 | | - , attends = "kindergarten" |
183 | | - ) |
184 | | - , validator( if (married == TRUE) age >= 16 |
185 | | - , if (attends == "kindergarten") age <= 6 |
186 | | - ) |
187 | | - )$errors |
188 | | -``` |
| 145 | +\tiny |
189 | 146 |
|
190 | | -## `errorlocate::replace_errors` |
191 | | - |
192 | | -```{r, eval=TRUE} |
193 | | -replace_errors( |
194 | | - data.frame( age = 3 |
195 | | - , married = TRUE |
196 | | - , attends = "kindergarten" |
197 | | - ) |
198 | | - , validator( if (married == TRUE) age >= 16 |
199 | | - , if (attends == "kindergarten") age <= 6 |
200 | | - ) |
201 | | -) |
| 147 | +```{r} |
| 148 | +d <- cbs_get_data("83648NED", Perioden="2021JJ00", verbose=TRUE) |
202 | 149 | ``` |
203 | 150 |
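With `verbose=TRUE` the download reports the requests it makes to the Web API. As a hedged sketch of what such a request may look like, the base R snippet below assembles an OData 3 style URL for the same table and filter. The endpoint path (`ODataFeed/odata`) and the entity set name (`TypedDataSet`) are assumptions on my part and are not stated in these slides; `$format` and `$filter` are standard OData query options.

```r
# Sketch only: the base URL layout is an assumption, not taken from the slides.
base_url <- "https://opendata.cbs.nl/ODataFeed/odata"  # assumed feed endpoint
table_id <- "83648NED"                                 # table used in the slides

# OData filter expression equivalent to Perioden = "2021JJ00"
odata_filter <- "Perioden eq '2021JJ00'"

# Percent-encode the filter so spaces and quotes are safe inside a URL.
url <- paste0(
  base_url, "/", table_id, "/TypedDataSet",
  "?$format=json",
  "&$filter=", URLencode(odata_filter, reserved = TRUE)
)

print(url)
```

The point of the wrapper is that users write `Perioden="2021JJ00"` in R and never have to assemble or encode such a URL by hand.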
|
| 151 | +## Live demo |
204 | 152 |
|
205 | | -## Pipe %>% friendly |
| 153 | +## Future: |
206 | 154 |
|
207 | | -The `replace_errors` function is pipe friendly: |
| 155 | +- Newer version of the Web API, OData4 (still in beta / test version) |
208 | 156 |
|
209 | | -```{r} |
210 | | -rules <- validator(age < 150) |
211 | | -
|
212 | | -data_noerrors <- |
213 | | - data.frame(age=160, driver_license = TRUE) %>% |
214 | | - replace_errors(rules) |
| 157 | +# next version of |
215 | 158 |
|
216 | | -errors_removed(data_noerrors) # contains errors removed |
217 | | -``` |
| 159 | +- <https://statistiekcbs.github.io/cbsodata4> |
218 | 160 |
|
219 | | -## Interested? |
220 | | - |
221 | | -\begincols |
222 | | - \begincol{0.48\textwidth} |
223 | | - \includegraphics[width=0.9\textwidth]{img/SDCR.jpg} |
224 | | - \endcol |
225 | | - |
226 | | - \begincol{0.48\textwidth} |
227 | | - \begin{block}{SDCR} |
228 | | -M. van der Loo and E. de Jonge (2018) |
229 | | -\emph{Statistical Data Cleaning with applications in R} |
230 | | -Wiley, Inc. |
231 | | -\end{block} |
232 | | -\begin{block}{errorlocate} |
233 | | -\begin{itemize} |
234 | | -\item Available on \href{https://CRAN.R-project.org/package=errorlocate}{\underline{CRAN}} |
235 | | -\end{itemize} |
236 | | -\end{block} |
237 | | -\begin{block}{More theory?} |
238 | | -$\leftarrow$ See book |
239 | | -\end{block} |
240 | | -\endcol |
241 | | -\endcols |
242 | | - |
243 | | -Thank you for your attention (and enjoy The Hague)! |
| 161 | +Thank you for your attention! |