Commit dbe0d06

feat: add first draft for data
with boxes and further reading
1 parent cc26e82 commit dbe0d06

3 files changed: +216 -20 lines changed

chapters/chapter1.qmd

Lines changed: 59 additions & 19 deletions
Original file line number | Diff line number | Diff line change
@@ -3,17 +3,27 @@ title: "Data"
33
format: html # This qmd file will be compiled as an HTML web page
44
---
55

6-
7-
One of the most important practices for Machine Learning projects is to strictly separate data, code (incl. model architecture, training code, APT etc.) and the compute environment.
6+
One of the most important practices for Machine Learning projects is to strictly separate data, code (incl. model architecture, training code, API etc.) and the compute environment.
87

98
Enforcing such a separation provides:
9+
1010
- strict reproducibility of the full pipeline
1111
- independence and better maintainability for each of the components
1212

1313
# Data storage
1414

15-
In that spirit, data should absolutely lie in a stable storage, far from the messy environment of code and compute. If your code or your computer crashes, your data should be safe.
15+
In that spirit, data should live in **stable storage**, preferably cloud-based, far from the messy environment of code and compute. If your code or your computer crashes, your data should be safe.
16+
17+
Any preprocessing step should be clearly documented and implemented as a fully reproducible script.
18+
1619

20+
<details class="insee">
21+
<summary class="insee-header">
22+
<span class="solutionbox-icon"></span>
23+
Insee: S3-based storage
24+
</summary>
25+
26+
<div class="solutionbox-body">
1727
At Insee, we extensively use a cloud-based S3 data storage solution built on the open-source MinIO framework, be it on the SSP Cloud (public Onyxia instance for collaborative, non-sensitive use cases) or LS3 (the internal Onyxia instance for secured, offline projects).
1828

1929
Accessing your data from the storage is then very easy from any compute environment (think of it as a Google Drive share link, for instance).
@@ -36,35 +46,63 @@ df_train = pd.read_parquet("df_train.parquet", filesystem=fs)
3646
# Saving too
3747
df_train.to_parquet("df_train.parquet", filesystem=fs)
3848
```
39-
# Data storage
4049

41-
## Germany
50+
</div>
51+
</details>
4252

43-
In order to ensure that the data is stored and used efficiently we make use of the Hadoop Distributed File System (HDFS) and parquet for data partitioning. HDFS is especially made for handling a large amount of data.
44-
For programming and data processing, we use Cloudera Machine Learning (CML) with PySpark, which allows us to efficiently work on the data.
45-
We store our data in the Parquet format, which is ideal for big data and in addition, to make it easier for users to handle and cross-check the data, we use Hue (Hadoop User Experience), an open-source SQL-based cloud editor.
46-
For rights management, we use Ranger, which provides a big variety of access control to ensure data security.
4753

48-
## Austria
49-
Training data is stored as csv-files. New files are added quarterly by the subject matter experts (between 300-500 data entries), which are then used as to retrain the model.
5054

55+
<details class="destatis">
56+
<summary class="destatis-header">
57+
<span class="solutionbox-icon"></span>
58+
Destatis:
59+
</summary>
5160

52-
# Data cleaning
61+
<div class="solutionbox-body">
62+
63+
To store and use the data efficiently, we rely on the Hadoop Distributed File System (HDFS), which is designed for handling large amounts of data, and on Parquet for data partitioning.
64+
For programming and data processing, we use Cloudera Machine Learning (CML) with PySpark, which allows us to efficiently work on the data.
65+
We store our data in the Parquet format, which is well suited to big data. In addition, to make it easier for users to handle and cross-check the data, we use Hue (Hadoop User Experience), an open-source SQL-based cloud editor.
66+
For rights management, we use Ranger, which provides a wide variety of access controls to ensure data security.
5367

54-
## Germany
5568

5669
The data cleaning in our project is quite straightforward, since the text entries contain short texts (mostly keywords) instead of long ones.
5770
First, data augmentation is performed by adding new text entries (e.g. text like "groceries" or "beverages") to the dataset, adding multiple newly generated text values to each household to enrich the data.
5871
Adding a variety of new textual entries helps the model to generalize better.
5972
Secondly, we clean the data by removing punctuation and handling missing values.
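
The cleaning step described above could look roughly like the following sketch. It is not the project's actual code: the function name `clean_entry`, the `missing_token` placeholder and the sample entries are assumptions made for illustration, and only the Python standard library is used.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_entry(text, missing_token=""):
    """Strip punctuation from one text entry and handle missing values.

    `missing_token` is a hypothetical placeholder for empty or missing entries.
    """
    if text is None:
        return missing_token
    # Delete punctuation, then collapse runs of whitespace.
    cleaned = " ".join(text.translate(_PUNCT_TABLE).split())
    # Entries that are empty after cleaning are treated as missing too.
    return cleaned if cleaned else missing_token


# Made-up sample entries, including one missing value.
entries = ["groceries, beverages!", None, "  bread;  milk "]
cleaned = [clean_entry(e) for e in entries]
```

Whether a missing entry should become an empty string, a dedicated token, or be dropped altogether is a modelling decision; the placeholder is only one option.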
6073

61-
## Austria
74+
</div>
75+
</details>
76+
77+
78+
<details class="austria">
79+
<summary class="austria-header">
80+
<span class="solutionbox-icon"></span>
81+
Austria:
82+
</summary>
83+
84+
<div class="solutionbox-body">
85+
86+
Training data is stored as CSV files. New files (between 300 and 500 data entries each) are added quarterly by the subject matter experts and are then used to retrain the model.
87+
6288
Duplicate entries are removed from the data. Text inputs are converted to lower case. Further, we remove stop words, umlaut characters (ä, ö, ü), special characters (e.g. -, +, #), gender-specific word endings (e.g. "-in", ":innen"), and numbers.
6389
Each categorical variable has a predefined set of valid input classes, since the model can only handle known classes. All known inputs are translated into this set of classes. Unknown inputs are set to their respective "unknown" category.
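
As a rough illustration of the steps listed above, here is a minimal standard-library sketch. The stop-word list, the class mapping `SEX_CLASSES`, the function names and the sample strings are all hypothetical. Umlauts are simply deleted here, which is one reading of the description; transliterating them (ä to ae) would be an equally plausible choice.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"und", "der", "die", "das"}
# The two gender-specific endings quoted in the text above.
GENDER_ENDING = re.compile(r"(?::innen|-in)\b")
# Hypothetical mapping from raw inputs to the predefined set of classes.
SEX_CLASSES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def clean_text(text):
    """Lower-case, then drop gender endings, umlauts, digits, special characters and stop words."""
    text = text.lower()
    # Strip endings first, while ':' and '-' are still present.
    text = GENDER_ENDING.sub("", text)
    # Assumption: umlauts are deleted rather than transliterated.
    text = re.sub(r"[äöü]", "", text)
    # Everything that is not a-z or whitespace (digits, special characters) becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def to_known_class(value, mapping, unknown="unknown"):
    """Translate a raw categorical input into a known class, else the 'unknown' category."""
    return mapping.get(value.strip().lower(), unknown)


raw = [
    "Verkäufer:innen und Lehrer-in 2023",
    "Verkäufer:innen und Lehrer-in 2023",  # duplicate entry
    "Brot + Milch #1",
]
unique = list(dict.fromkeys(raw))  # drop duplicates while keeping order
cleaned = [clean_text(t) for t in unique]
```

The order of operations matters in this sketch: gender endings are removed before special characters, because the markers `:` and `-` would otherwise already be gone.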
6490

91+
</div>
92+
</details>
93+
94+
6595

6696
# Data versioning
6797

98+
99+
<details class="insee">
100+
<summary class="insee-header">
101+
<span class="solutionbox-icon"></span>
102+
Insee: MLflow Datasets
103+
</summary>
104+
105+
<div class="solutionbox-body">
68106
Just as with code (see chapter 2), it is good practice to version the dataset, so that you know exactly which data the model was trained on (or which is the latest version for the model to be trained on).
69107

70108
Several tools are available to seamlessly achieve this versioning:
@@ -73,10 +111,12 @@ Several tools are available to seamlessly achieve this versioning:
73111
- DVC
74112

75113
Still WIP at Insee.
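
Independently of any particular tool, the underlying idea can be sketched with a content hash: two files with the same fingerprint contain the same bytes, so recording the fingerprint next to a trained model pins down the data it saw. The snippet below is a simplified stand-in using only the standard library, not the API of any of the tools listed above; `dataset_fingerprint` and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a data file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file: changing the data changes the fingerprint.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "df_train.csv"
    data_file.write_text("id,label\n1,a\n2,b\n")
    v1 = dataset_fingerprint(data_file)
    data_file.write_text("id,label\n1,a\n2,b\n3,c\n")
    v2 = dataset_fingerprint(data_file)
```

Storing such a fingerprint as a parameter or tag of each training run is one lightweight way to trace a model back to its data while a fuller versioning setup is still being put in place.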
76-
## Germany
77-
78-
None
114+
</div>
115+
</details>
79116

80-
## Austria
81-
None
117+
::: {.callout-tip}
118+
## Further reading
82119

120+
- [Three Levels of ML Software by ml-ops.org](https://ml-ops.org/content/three-levels-of-ml-software)
121+
- [Reproducibility guidelines by Anaconda](https://www.anaconda.com/blog/8-levels-of-reproducibility)
122+
:::

chapters/metadata.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
{"abstract":{},"authors":{},"category":"training courses with R and Python","deploymentUrl":{},"lastModification":"2025-10-21","name":{"en":"Monitoring","fr":"Monitoring"},"skills":{},"suggestedRequirements":{},"tags":{},"timeRequired":0}
1+
{"abstract":{},"authors":{},"category":"MLOps guidelines for NSIs","deploymentUrl":{},"lastModification":"2026-01-12","name":{"en":"Data","fr":"Data"},"skills":{},"suggestedRequirements":{},"tags":{},"timeRequired":0}

styles.css

Lines changed: 156 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1 +1,157 @@
11
/* css styles */
2+
3+
/*-- scss:defaults --*/
4+
5+
.insee {
6+
margin-top: 1em;
7+
margin-bottom: 1em;
8+
border-radius: .25rem;
9+
border-left: solid #acacac .3rem;
10+
border-right: solid 0.5px silver;
11+
border-top: solid 0.5px silver;
12+
border-bottom: solid 0.5px silver;
13+
border-left-color: #90b7f3 !important;
14+
}
15+
.insee-header {
16+
/* margin-top: 0.5em; */
17+
margin-bottom: 0.5em;
18+
border-bottom: none;
19+
font-weight: 600;
20+
opacity: 85%;
21+
font-size: 0.9rem;
22+
padding-left: 0.5em;
23+
padding-right: 0.5em;
24+
display: flex;
25+
background-color: #90b7f3;
26+
height: 1.7em;
27+
overflow: hidden;
28+
}
29+
30+
31+
.destatis {
32+
margin-top: 1em;
33+
margin-bottom: 1em;
34+
border-radius: .25rem;
35+
border-left: solid #acacac .3rem;
36+
border-right: solid 0.5px silver;
37+
border-top: solid 0.5px silver;
38+
border-bottom: solid 0.5px silver;
39+
border-left-color: #f3c790 !important;
40+
}
41+
.destatis-header {
42+
/* margin-top: 0.5em; */
43+
margin-bottom: 0.5em;
44+
border-bottom: none;
45+
font-weight: 600;
46+
opacity: 85%;
47+
font-size: 0.9rem;
48+
padding-left: 0.5em;
49+
padding-right: 0.5em;
50+
display: flex;
51+
background-color: #f3c790;
52+
height: 1.7em;
53+
overflow: hidden;
54+
}
55+
56+
57+
.austria {
58+
margin-top: 1em;
59+
margin-bottom: 1em;
60+
border-radius: .25rem;
61+
border-left: solid #acacac .3rem;
62+
border-right: solid 0.5px silver;
63+
border-top: solid 0.5px silver;
64+
border-bottom: solid 0.5px silver;
65+
border-left-color: #9ce49f !important;
66+
}
67+
.austria-header {
68+
/* margin-top: 0.5em; */
69+
margin-bottom: 0.5em;
70+
border-bottom: none;
71+
font-weight: 600;
72+
opacity: 85%;
73+
font-size: 0.9rem;
74+
padding-left: 0.5em;
75+
padding-right: 0.5em;
76+
display: flex;
77+
background-color: #9ce49f;
78+
height: 1.7em;
79+
overflow: hidden;
80+
}
81+
82+
.solutionbox-icon {
83+
height: 0.9rem;
84+
width: 0.9rem;
85+
display: inline-block;
86+
content: "";
87+
background-repeat: no-repeat;
88+
background-size: 0.9rem 0.9rem;
89+
margin-top: .5rem;
90+
padding-right: 1.25rem;
91+
}
92+
93+
.solutionbox-header {
94+
/* margin-top: 0.5em; */
95+
margin-bottom: 0.5em;
96+
border-bottom: none;
97+
font-weight: 600;
98+
opacity: 85%;
99+
font-size: 0.9rem;
100+
padding-left: 0.5em;
101+
padding-right: 0.5em;
102+
display: flex;
103+
background-color: #90b7f3;
104+
height: 2em;
105+
overflow: hidden;
106+
}
107+
108+
109+
.solutionbox-body {
110+
font-size: 0.9rem;
111+
font-weight: 400;
112+
padding-left: 0.5em;
113+
padding-right: 0.5em;
114+
}
115+
116+
.solutionbox-body > :last-child {
117+
padding-bottom: 0.5rem;
118+
margin-bottom: 0;
119+
}
120+
121+
summary.insee-header {
122+
color: inherit;
123+
opacity: 1;
124+
}
125+
summary.destatis-header {
126+
color: inherit;
127+
opacity: 1;
128+
}
129+
summary.austria-header {
130+
color: inherit;
131+
opacity: 1;
132+
}
133+
134+
135+
summary.insee-header::before {
136+
content: "▸";
137+
}
138+
139+
details[open] summary.insee-header::before {
140+
content: "▾";
141+
}
142+
143+
summary.destatis-header::before {
144+
content: "▸";
145+
}
146+
147+
details[open] summary.destatis-header::before {
148+
content: "▾";
149+
}
150+
151+
summary.austria-header::before {
152+
content: "▸";
153+
}
154+
155+
details[open] summary.austria-header::before {
156+
content: "▾";
157+
}
