Commit fbc213c

committed
tweaking
1 parent 44a7bbc commit fbc213c

3 files changed, 99 additions and 181 deletions

tests/testthat/test-AND.R

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ describe("AND-ing rules work",{
 rules <- validator(if (x1 < 1 && x2 < 1) x3 >= 1 & x4 >= 1)
 d <- data.frame(x1 = 0, x2 = 0, x3 = 0, x4 = 0)
 le <- locate_errors(d, rules, weight=c(x1=Inf))
-expect_equal(values(le)[1,], c(x1=FALSE, x2-FALSE, x3=TRUE, x4=TRUE))
+expect_equal(values(le)[1,], c(x1=FALSE, x2=FALSE, x3=TRUE, x4=TRUE))
 })

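
Note on the fix in this hunk: inside `c()`, `x2-FALSE` is parsed as the subtraction `x2 - FALSE`, not as a named element, so the old expectation could not describe the intended named logical vector. A minimal base R sketch of the difference follows. The free-standing variable `x2 <- 0` is an assumption added only so that the typo can be evaluated here; if no `x2` is in scope, the expression fails with an object-not-found error.

```r
# Assumption for illustration only: a free-standing x2 so the typo can be evaluated.
x2 <- 0

typo  <- c(x1 = FALSE, x2 - FALSE, x3 = TRUE, x4 = TRUE)  # second element is unnamed
fixed <- c(x1 = FALSE, x2 = FALSE, x3 = TRUE, x4 = TRUE)  # all four elements are named

names(typo)    # the second name is the empty string
names(fixed)   # "x1" "x2" "x3" "x4"
typeof(typo)   # "double": the subtraction coerces the whole vector to numeric
typeof(fixed)  # "logical"
```

Either way the comparison in `expect_equal()` could not match the named logical row returned by `values(le)[1,]`, which is presumably why the typo was corrected.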

uRos2018/presentation.Rmd

Lines changed: 98 additions & 180 deletions
@@ -1,243 +1,161 @@
 ---
-title: "Data errors, how to find them?"
-subtitle: "Errorlocate: find and replace erroneous fields in data using validation rules"
+title: "Open Data from CBS using R"
+subtitle: "cbsodataR: find and use public open statistics on NL"
 author: "Edwin de Jonge"
-date: "Statistics Netherlands / uRos 2018"
+date: "Statistics Netherlands / NSC-R 2022"
 output:
   beamer_presentation:
     includes:
       in_header: tex/header.tex
     keep_tex: yes
+editor_options:
+  markdown:
+    wrap: 72
 ---

 ```{r setup, include=FALSE}
-knitr::opts_chunk$set(eval = FALSE)
-library(errorlocate)
-library(magrittr)
+knitr::opts_chunk$set(eval = TRUE)
 ```

 ## Who am I?

-- Data scientist / Methodologist at Statistics Netherlands (aka CBS).
-- Author of several R-packages, including `whisker`, `validate`, `errorlocate`, `docopt`, `daff`, `tableplot`, `ffbase`, `chunked`, ...
-- Co-author of _Statistical Data Cleaning with applications in R (2018)_ (together with @markvdloo)
+- Data scientist / Methodologist at Statistics Netherlands (aka CBS).
+- Author of several R-packages, including `whisker`, `validate`,
+`errorlocate`, `docopt`, `daff`, `chunked` , *`cbsodataR`*...
+- Co-author of *Statistical Data Cleaning with applications in
+R (2018)* (together with @markvdloo)
+- background in theoretical/computational physics, for a long time now a
+"statistician".

-## Data cleaning...
+## Daily work:

-A large part of your job is spent in data-cleaning:
+- expertise: Statistical Computing, (incl R and Python), Network
+analysis, Data Visualisation.
+- Complexity Science Research / Population and Enterprise networks.
+- Internal statistical consultant

-- getting your data in the right shape (e.g. `tidyverse`, `dplyr`)
+## CBS

-- assessing missing data (e.g. `VIM`, `datamaid`)
+- Centraal Bureau voor de Statistiek = Statistics Netherlands

-- checking validity (e.g. `validate`)
+- Founded in 1899

-- locating and removing errors: **`errorlocate`**!
+- Governmental Agency ("ZBO" under EZK)

-- impute values for missing or erroneous data (e.g. `simputation`, `VIM`, `recipes`)
+- publishes official statistics on various topics:

-## Statistical Value Chain
+- demographics
+- economy, GDP (economic growth),
+- education
+- agriculture
+- environment
+- Sustainable Development Goals (sdg)
+- **crime**

-\begin{center}
-\includegraphics[width=\textwidth]{img/valuechain.pdf}
-\end{center}
+## CBS open data?

-## {.plain}
+### StatLine, CBS (Statistics Netherlands)

-\begin{center}
-\includegraphics[height=1\paperheight]{img/keep-calm-and-validate}
-\end{center}
+- Since 1995, Statistics Netherlands has StatLine, output database:
+all (tabular) data produced by CBS.

+- search, select, view and download data

-## Validation rules?
+- statline: open data *avant la lettre* : <https://opendata.cbs.nl>

-Package `validate` allows to:
+## What is open data?

-- formulate explicit data rule that data must conform to:
+Open as in: - free to download and use (with referencing)

-```{r}
-library(validate)
-check_that( data.frame(age=160, driver_license=TRUE),
-age >= 0,
-age < 150,
-if (driver_license == TRUE) age >= 16
-)
-```
+But also: - "machine readable" - should be possible to retrieve data and
+meta data "automagically". - implemented with a Web API. (application
+programming interface)

-## Explicit validation rules:
+## CBS open data:

-- Give a clear overview what the data must conform to.
-- Can be used to reason about.
-- Can be used to fix/correct data!
-- Find error, and when found correct it.
+- implements *odata3* web api for all StatLine tables (>3500) and
+their metadata.
+- allows for retrieving data at any time.

-### Note:
+### What is `cbsodataR`?

-- Manual fix is error prone, not reproducible and not feasible for large data sets.
-- Large rule set have (very) complex behavior, e.g. entangled rules: adjusting one value may
-invalidate other rules.
+- R wrapper for Web API to facilitate downloading data.
+- Adds metadata easily
+- Allows for finding and searching data
+- Allows for filtering data.

-## Error localization
+Is used by:

-> Error localization is a procedure that points out fields in a data set
-that can be altered or imputed in such a way that all validation rules
-can be satisfied.
+- government agencies: e.g. CPB, SCP, RIVM, etc.
+- data journalists

-## Find the error:
+## cbsodataR Functionality

-```{r}
-library(validate)
-check_that( data.frame(age=160, driver_license=TRUE),
-age >= 0,
-age < 150,
-if (driver_license == TRUE) age >= 16
-)
-```
+- Retrieve a list of datasets
+- Find a specific data set
+- Download data
+- Add metadata
+- Filter data

-It is clear that `age` has an erroneous value, but for more complex rule sets
-it is less clear.
+## `cbs_get_datasets()`

-## Multivariate example:
+- retrieve a list of all open data tables + their publication meta
+data \tiny

 ```{r}
-check_that( data.frame( age = 3
-, married = TRUE
-, attends = "kindergarten"
-)
-, if (married == TRUE) age >= 16
-, if (attends == "kindergarten") age <= 6
-)
+library(cbsodataR)
+ds <- cbs_get_datasets()
+ds[1:4, 1:3]
 ```
-Ok, clear that this is a faulty record, but what is the error?
-
-## Feligi Holt formalism:
-
-> Find the minimal (weighted) number of variables that cause the invalidation of the data rules.
-
-Makes sense! (But there are exceptions...)
-
-Implemented in `errorlocate` (second generation of `editrules`).
-
-## Formal description (1)
-
-### Rule $r_i(x)$
-
-A rule a disjunction of atomic clauses:
-
-$$
-r_i(\la{x}) = \bigvee_j C_i^j(\la{x})
-$$
-with:

-$$
-C_i^j(\la{x}) = \left\{
-\begin{array}{l}
-\la{a}^T\la{x} \leq b \\
-\la{a}^T\la{x} = b \\
-x_j \in F_{ij} \textrm{with } F_{ij} \subseteq D_j \\
-x_j \not\in F_{ij} \textrm{with } F_{ij} \subseteq D_j \\
-\end{array}
-\right.
-$$
+## `cbs_search`

-## Rule system:
+- retrieve a list of data table containing search words

-The rules form a system $R(\la{x})$:
+\small

-$$
-R_H(\la{x}) = \bigwedge_i r_i
-$$
-If $R_H(\la{x})$ is true for record $\la{x}$, then the record is valid, otherwise one (or more) of the rules is violated.
-
-## Mixed Integer Programming to FH
+```{r}
+theft <- cbs_search("diefstal")
+theft[1:4, 2:4]
+```

-Each rule set $R(\la{x})$ can be translated into a mip problem and solved.
-$$
-\begin{array}{r}
-\textrm{Minimize } f(\mathbf{x}) = 0; \\
-\textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \\
-\end{array}
-$$
+## `cbs_get_meta()`

-- $f(\la{x})$ is the (weighted) number of changed variable: $\delta_i \in {0,1}$
+- Use `Identifier` to retrieve metadata of a table
+- contains all metadata (including borders of table).
+- Each property is a data.frame containing metadata e.g.

-$$
-f(\la{x}) = \sum_{i=1}^N w_i \delta_i
-$$
+```{r}
+m <- cbs_get_meta("83648NED") # registered crime
+names(m)
+m$SoortMisdrijf[1:5, 1:2]
+```

-- $\la{R}$ contains rules: $\la{R}_H(\la{x}) \leq \la{d}_H$ and soft constraints: $\la{R}_0(\la{x}, \la{\delta}) \leq \la{d}_0$ that
-try fix the values of $\la{x}$ to the measured values.
+## `cbs_get_data`

-## `errorlocate`
+- retrieve data using an Identifier
+- returns a `data.frame` with metadata "embedded"

-- translates your rules automatically into a mip form.
-- Uses `lpSolveAPI` to solve the problem.
-- contains a small framework for implementing your own error localization algorithms.
+```{r}
+d <- cbs_get_data("83648NED", Perioden="2021JJ00")
+```

-## `errorlocate::locate_errors`
+## intermezzo: Web API

-```{r, eval=TRUE}
-locate_errors( data.frame( age = 3
-, married = TRUE
-, attends = "kindergarten"
-)
-, validator( if (married == TRUE) age >= 16
-, if (attends == "kindergarten") age <= 6
-)
-)$errors
-```
+\tiny

-## `errorlocate::replace_errors`
-
-```{r, eval=TRUE}
-replace_errors(
-data.frame( age = 3
-, married = TRUE
-, attends = "kindergarten"
-)
-, validator( if (married == TRUE) age >= 16
-, if (attends == "kindergarten") age <= 6
-)
-)
+```{r}
+d <- cbs_get_data("83648NED", Perioden="2021JJ00", verbose=TRUE)
 ```

+## Live demo

-## Pipe %>% friendly
+## Future:

-The `replace_errors` function is pipe friendly:
+- Newer version of the Web API, OData4 (still in beta / test version)

-```{r}
-rules <- validator(age < 150)
-
-data_noerrors <-
-data.frame(age=160, driver_license = TRUE) %>%
-replace_errors(rules)
+# next version of

-errors_removed(data_noerrors) # contains errors removed
-```
+- <https://statistiekcbs.github.io/cbsodata4>

-## Interested?
-
-\begincols
-\begincol{0.48\textwidth}
-\includegraphics[width=0.9\textwidth]{img/SDCR.jpg}
-\endcol
-
-\begincol{0.48\textwidth}
-\begin{block}{SDCR}
-M. van der Loo and E. de Jonge (2018)
-\emph{Statistical Data Cleaning with applications in R}
-Wiley, Inc.
-\end{block}
-\begin{block}{errorlocate}
-\begin{itemize}
-\item Available on \href{https://CRAN.R-project.org/package=errorlocate}{\underline{CRAN}}
-\end{itemize}
-\end{block}
-\begin{block}{More theory?}
-$\leftarrow$ See book
-\end{block}
-\endcol
-\endcols
-
-Thank you for your attention (and enjoy The Hague)!
+Thank you for your attention!
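
Note on the new slides: "Add metadata" and "Filter data" are listed under the package functionality, but no chunk shows labels being added. A minimal sketch of how these steps might be combined is given below. `get_labelled_table()` is a hypothetical helper and not part of `cbsodataR`; the table id and period in the usage comment are the ones used in the slides. It assumes `cbsodataR` is installed and that the CBS open data service is reachable when the helper is called.

```r
# Hypothetical helper (not part of cbsodataR): download one period of a table
# and attach human-readable labels and a date column.
get_labelled_table <- function(id, period) {
  d <- cbsodataR::cbs_get_data(id, Perioden = period)  # filter on the Perioden column
  d <- cbsodataR::cbs_add_label_columns(d)             # add label columns for coded values
  cbsodataR::cbs_add_date_column(d)                    # add a date column derived from Perioden
}

# Usage (needs network access), with the table and period from the slides:
# crime <- get_labelled_table("83648NED", "2021JJ00")
# head(crime)
```

Since the commit switches the setup chunk to `eval = TRUE`, every chunk that calls the web service is executed when the deck is knitted, so a helper like this would also need network access at render time.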

uRos2018/presentation.pdf

-479 KB
Binary file not shown.
