---
title: "Data"
format: html # This qmd file will be compiled as an HTML web page
---
One of the most important practices for Machine Learning projects is to strictly separate data, code (incl. model architecture, training code, API, etc.) and the compute environment.
Enforcing such a separation makes it possible to:

- reproduce the full pipeline exactly
- keep each component independent and easier to maintain
# Data storage
In that spirit, data should absolutely live in **stable storage** - preferably cloud-based - far from the messy environment of code and compute. If your code or your computer crashes, your data should be safe.
Any preprocessing step should be clearly documented, with a fully reproducible script.
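As a minimal sketch of such a script (file paths, column names and cleaning steps here are illustrative, not taken from any of the projects below):

```python
# A minimal sketch of a fully reproducible preprocessing script: every step
# is explicit, and re-running it on the same raw file yields the same output.
# Paths and column names are illustrative only.
import os
import tempfile
import pandas as pd

def preprocess(path_in: str, path_out: str) -> None:
    df = pd.read_csv(path_in)
    df = df.dropna(subset=["text"])          # step 1: drop entries with missing text
    df["text"] = df["text"].str.lower()      # step 2: normalise casing
    df.to_csv(path_out, index=False)

# Demonstration on a throwaway raw file
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.csv")
processed = os.path.join(workdir, "processed.csv")
pd.DataFrame({"text": ["Groceries", None, "BEVERAGES"]}).to_csv(raw, index=False)
preprocess(raw, processed)
```

The point is not the specific steps but that the whole transformation from raw to processed data lives in one script that anyone can re-run.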
<details class="insee">
<summary class="insee-header">
<span class="solutionbox-icon"></span>
Insee: S3-based storage
</summary>
<div class="solutionbox-body">
At Insee, we extensively use a cloud-based S3 data storage solution based on the open-source MinIO framework - be it on the SSP Cloud (the public Onyxia instance for collaborative, non-sensitive use cases) or LS3 (the internal Onyxia instance for secured, offline projects).
Accessing your data from the storage is then very easy, from any compute environment (think of it as a Google Drive share link, for instance).
</div>
</details>
<details class="destatis">
<summary class="destatis-header">
<span class="solutionbox-icon"></span>
Destatis: HDFS and Parquet on Cloudera
</summary>
<div class="solutionbox-body">
To ensure that the data is stored and used efficiently, we use the Hadoop Distributed File System (HDFS) and Parquet for data partitioning. HDFS is designed specifically to handle large amounts of data.
For programming and data processing, we use Cloudera Machine Learning (CML) with PySpark, which allows us to work on the data efficiently.
We store our data in the Parquet format, which is ideal for big data. In addition, to make it easier for users to handle and cross-check the data, we use Hue (Hadoop User Experience), an open-source SQL-based cloud editor.
For rights management, we use Apache Ranger, which provides a wide variety of access controls to ensure data security.
The data cleaning in our project is quite straightforward, since the text entries are short (mostly keywords) rather than long texts.
First, data augmentation is performed: multiple newly generated text entries (e.g. "groceries" or "beverages") are added to each household to enrich the data.
Adding a variety of new textual entries helps the model generalize better.
Secondly, we clean the data by removing punctuation and handling missing values.
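These two steps might look as follows (a sketch only; the example entries and the exact cleaning rules of the project are invented for illustration):

```python
# Sketch of the cleaning steps described above: strip punctuation and handle
# missing values. The example entries are invented.
import re
import pandas as pd

def clean_text(entry):
    if pd.isna(entry):                               # handle missing values
        return ""
    return re.sub(r"[^\w\s]", "", entry).strip()     # remove punctuation

entries = pd.Series(["groceries!", None, "beverages,"])
cleaned = entries.map(clean_text)
```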
</div>
</details>
<details class="austria">
<summary class="austria-header">
<span class="solutionbox-icon"></span>
Austria: quarterly CSV files
</summary>
<div class="solutionbox-body">
Training data is stored as CSV files. New files are added quarterly by the subject matter experts (between 300 and 500 data entries), which are then used to retrain the model.
Duplicated entries are removed from the data. Text inputs are transformed to all lower-case letters. Further, we remove stop words, umlaut characters (ä, ö, ü), special characters (e.g. -, +, #), gender-specific word endings (e.g. "-in", ":innen"), and numbers.
Each categorical variable has a predefined set of valid input classes, since the model can only handle known classes. All known inputs are translated into this set of classes. Unknown inputs are set to their respective "unknown" category.
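A compact sketch of these normalisation rules follows; stop-word removal is omitted, and the ending patterns, class names and umlaut substitutions are illustrative, not the project's actual rules:

```python
# Illustrative sketch of the Austrian normalisation: lower-casing, umlaut
# substitution, removal of gender-specific endings, special characters and
# numbers, and mapping unknown categorical inputs to "unknown".
import re

UMLAUTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def normalise(text: str) -> str:
    text = text.lower().translate(UMLAUTS)
    text = re.sub(r"(:innen|-in)\b", "", text)   # gender-specific endings
    text = re.sub(r"[^a-z\s]", "", text)         # special characters, numbers
    return text.strip()

VALID_CLASSES = {"employed", "self-employed"}    # hypothetical class set

def to_known_class(value: str) -> str:
    return value if value in VALID_CLASSES else "unknown"
```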
</div>
</details>
# Data versioning
<details class="insee">
<summary class="insee-header">
<span class="solutionbox-icon"></span>
Insee: MLFlow Datasets
</summary>
<div class="solutionbox-body">
Just as with code (see chapter 2), a good practice is to version the dataset, so that you know exactly which data the model has been trained on (or which is the latest version for it to be trained on).
Several tools are available to seamlessly achieve this versioning:
- DVC
This is still work in progress at Insee.
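Until such a tool is in place, even a low-tech fingerprint recorded next to each training run captures the core idea (a sketch only; tools like DVC and MLflow Datasets automate exactly this kind of bookkeeping):

```python
# Sketch: pin "which data was this model trained on" by recording a content
# hash of the dataset file alongside the training run.
import hashlib
import json
import pathlib
import tempfile

def dataset_fingerprint(path) -> dict:
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    return {"path": str(path), "sha256": digest}

# Demonstration on a throwaway dataset file
data = pathlib.Path(tempfile.mkdtemp()) / "train.csv"
data.write_text("label,text\n1,groceries\n")
record = dataset_fingerprint(data)
(data.parent / "fingerprint.json").write_text(json.dumps(record))
```

Any change to the dataset changes the hash, so a run's fingerprint uniquely identifies the data it saw.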
</div>
115
+
</details>
::: {.callout-tip}
118
+
## Further reading
- [Three Levels of ML Software by ml-ops.org](https://ml-ops.org/content/three-levels-of-ml-software)
- [Reproducibility guidelines by Anaconda](https://www.anaconda.com/blog/8-levels-of-reproducibility)
:::