---
title: "Data"
format: html # This qmd file will be compiled as an HTML web page
---
One of the most important practices for Machine Learning projects is to strictly separate data, code (incl. model architecture, training code, API, etc.) and the compute environment.
Enforcing such a separation makes it possible to:

- reproduce the full pipeline exactly
- keep each component independent and easier to maintain
# Data storage
In that spirit, data should absolutely live in **stable storage** - preferably cloud-based - far from the messy environment of code and compute. If your code or your computer crashes, your data should be safe.
Any preprocessing step should be clearly documented, with a fully reproducible script.
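As a minimal sketch of such a script (file paths, column names and cleaning steps here are illustrative, not taken from any of the projects below):

```python
# A minimal sketch of a fully reproducible preprocessing script: every step
# is explicit, and re-running it on the same raw file yields the same output.
# Paths and column names are illustrative only.
import os
import tempfile
import pandas as pd

def preprocess(path_in: str, path_out: str) -> None:
    df = pd.read_csv(path_in)
    df = df.dropna(subset=["text"])          # step 1: drop entries with missing text
    df["text"] = df["text"].str.lower()      # step 2: normalise casing
    df.to_csv(path_out, index=False)

# Demonstration on a throwaway raw file
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.csv")
processed = os.path.join(workdir, "processed.csv")
pd.DataFrame({"text": ["Groceries", None, "BEVERAGES"]}).to_csv(raw, index=False)
preprocess(raw, processed)
```

The point is not the specific steps but that the whole transformation from raw to processed data lives in one script that anyone can re-run.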
<details class="insee">
<summary class="insee-header">
<span class="solutionbox-icon"></span>
Insee: S3-based storage
</summary>
<div class="solutionbox-body">
At Insee, we extensively use a cloud-based S3 data storage solution based on the open-source MinIO framework - be it on the SSP Cloud (the public Onyxia instance for collaborative, non-sensitive use cases) or LS3 (the internal Onyxia instance for secured, offline projects).
Accessing your data from the storage is then very easy, from any compute environment (think of it as a Google Drive share link, for instance).
</div>
</details>
<details class="destatis">
<summary class="destatis-header">
<span class="solutionbox-icon"></span>
Destatis: HDFS and Parquet on Cloudera
</summary>
<div class="solutionbox-body">
To ensure that the data is stored and used efficiently, we use the Hadoop Distributed File System (HDFS) and Parquet for data partitioning. HDFS is designed specifically to handle large amounts of data.
For programming and data processing, we use Cloudera Machine Learning (CML) with PySpark, which allows us to work on the data efficiently.
We store our data in the Parquet format, which is ideal for big data. In addition, to make it easier for users to handle and cross-check the data, we use Hue (Hadoop User Experience), an open-source SQL-based cloud editor.
For rights management, we use Apache Ranger, which provides a wide variety of access controls to ensure data security.
The data cleaning in our project is quite straightforward, since the text entries are short (mostly keywords) rather than long texts.
First, data augmentation is performed: multiple newly generated text entries (e.g. "groceries" or "beverages") are added to each household to enrich the data.
Adding a variety of new textual entries helps the model generalize better.
Secondly, we clean the data by removing punctuation and handling missing values.
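These two steps might look as follows (a sketch only; the example entries and the exact cleaning rules of the project are invented for illustration):

```python
# Sketch of the cleaning steps described above: strip punctuation and handle
# missing values. The example entries are invented.
import re
import pandas as pd

def clean_text(entry):
    if pd.isna(entry):                               # handle missing values
        return ""
    return re.sub(r"[^\w\s]", "", entry).strip()     # remove punctuation

entries = pd.Series(["groceries!", None, "beverages,"])
cleaned = entries.map(clean_text)
```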
</div>
</details>
<details class="austria">
<summary class="austria-header">
<span class="solutionbox-icon"></span>
Austria: quarterly CSV files
</summary>
<div class="solutionbox-body">
Training data is stored as CSV files. New files are added quarterly by the subject matter experts (between 300 and 500 data entries), which are then used to retrain the model.
Duplicated entries are removed from the data. Text inputs are transformed to all lower-case letters. Further, we remove stop words, umlaut characters (ä, ö, ü), special characters (e.g. -, +, #), gender-specific word endings (e.g. "-in", ":innen"), and numbers.
Each categorical variable has a predefined set of valid input classes, since the model can only handle known classes. All known inputs are translated into this set of classes. Unknown inputs are set to their respective "unknown" category.
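A compact sketch of these normalisation rules follows; stop-word removal is omitted, and the ending patterns, class names and umlaut substitutions are illustrative, not the project's actual rules:

```python
# Illustrative sketch of the Austrian normalisation: lower-casing, umlaut
# substitution, removal of gender-specific endings, special characters and
# numbers, and mapping unknown categorical inputs to "unknown".
import re

UMLAUTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def normalise(text: str) -> str:
    text = text.lower().translate(UMLAUTS)
    text = re.sub(r"(:innen|-in)\b", "", text)   # gender-specific endings
    text = re.sub(r"[^a-z\s]", "", text)         # special characters, numbers
    return text.strip()

VALID_CLASSES = {"employed", "self-employed"}    # hypothetical class set

def to_known_class(value: str) -> str:
    return value if value in VALID_CLASSES else "unknown"
```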
</div>
</details>
# Data versioning
<details class="insee">
<summary class="insee-header">
<span class="solutionbox-icon"></span>
Insee: MLFlow Datasets
</summary>
<div class="solutionbox-body">
Just as with code (see chapter 2), a good practice is to version the dataset, so that you know exactly which data the model has been trained on (or which is the latest version for it to be trained on).
Several tools are available to seamlessly achieve this versioning:
- DVC
This is still work in progress at Insee.
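Until such a tool is in place, even a low-tech fingerprint recorded next to each training run captures the core idea (a sketch only; tools like DVC and MLflow Datasets automate exactly this kind of bookkeeping):

```python
# Sketch: pin "which data was this model trained on" by recording a content
# hash of the dataset file alongside the training run.
import hashlib
import json
import pathlib
import tempfile

def dataset_fingerprint(path) -> dict:
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    return {"path": str(path), "sha256": digest}

# Demonstration on a throwaway dataset file
data = pathlib.Path(tempfile.mkdtemp()) / "train.csv"
data.write_text("label,text\n1,groceries\n")
record = dataset_fingerprint(data)
(data.parent / "fingerprint.json").write_text(json.dumps(record))
```

Any change to the dataset changes the hash, so a run's fingerprint uniquely identifies the data it saw.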
</div>
115
+
</details>
::: {.callout-tip}
118
+
## Further reading
- [Three Levels of ML Software by ml-ops.org](https://ml-ops.org/content/three-levels-of-ml-software)
- [Reproducibility guidelines by Anaconda](https://www.anaconda.com/blog/8-levels-of-reproducibility)
:::