
Commit cc26e82

Merge pull request #3 from AIML4OS/France
France
2 parents 659af58 + de4c210 commit cc26e82

File tree

2 files changed: +62 -1 lines changed


chapters/chapter1.qmd

Lines changed: 40 additions & 0 deletions
@@ -4,6 +4,38 @@ format: html # This qmd file will be compiled as an HTML web page
---

One of the most important practices for Machine Learning projects is to strictly separate the data, the code (incl. model architecture, training code, API, etc.) and the compute environment.
Enforcing such a separation makes it possible to:

- have strict reproducibility of the full pipeline
- keep each component independent and more maintainable
# Data storage
In that spirit, data should absolutely live in stable storage, away from the messy environment of code and compute. If your code or your computer crashes, your data should remain safe.
At Insee, we extensively use a cloud-based S3 data storage solution based on the open-source MinIO framework, be it on the SSP Cloud (the public Onyxia instance for collaborative, non-sensitive use cases) or on LS3 (the internal Onyxia instance for secure, offline projects).
Accessing your data from the storage is then very easy from any compute environment (think of it as a Google Drive share link, for instance).
For instance, in Python:

```{python}
#| eval: false

import os

import pandas as pd
from s3fs import S3FileSystem

# Connecting to the storage via a filesystem
fs = S3FileSystem(
    client_kwargs={"endpoint_url": f"https://{os.environ['AWS_S3_ENDPOINT']}"},
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Loading a dataframe is very easy!
df_train = pd.read_parquet("df_train.parquet", filesystem=fs)

# Saving too
df_train.to_parquet("df_train.parquet", filesystem=fs)
```
# Data storage

## Germany
@@ -33,6 +65,14 @@ Each categorical variable has a predefined set of valid input classes, since the

# Data versioning

Just like code (see chapter 2), a good practice is to version the dataset, so that you know exactly which data the model has been trained on (or which version is the latest for the model to be trained on).
Several tools are available to seamlessly achieve this versioning (see the sketch below):
- MLFlow Datasets
- DVC
This is still a work in progress at Insee.
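As a hedged illustration with MLflow Datasets (the source URI and dataset name below are placeholders, and `fs` reuses the filesystem from the storage example above):

```{python}
#| eval: false

import mlflow
import mlflow.data
import pandas as pd

# Reusing the `fs` filesystem from the storage example above
df_train = pd.read_parquet("df_train.parquet", filesystem=fs)

# The source URI is a placeholder: adapt it to your bucket/key
dataset = mlflow.data.from_pandas(
    df_train,
    source="s3://my-bucket/df_train.parquet",
    name="df_train",
)

with mlflow.start_run():
    # Logs the dataset's name, schema, profile and source with the run
    mlflow.log_input(dataset, context="training")
```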
## Germany

None

chapters/chapter2.qmd

Lines changed: 22 additions & 1 deletion
@@ -5,10 +5,31 @@ format: html # This qmd file will be compiled as an HTML web page
# Model training

- Parallelized training for maximum reproducibility (see the sketch below):
  - fix the seed
  - parallelize training with tools such as Argo Workflows
  - use logging tools (MLflow, Weights & Biases...)
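For instance, a minimal seed-fixing sketch, assuming a PyTorch-based training script (`set_seed` is our own helper name, not a library function):

```{python}
#| eval: false

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all the usual sources of randomness for reproducible training."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```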

# Model validation

- Key metrics should be checked before deployment
- You should have a fully reproducible validation script/pipeline that seamlessly takes a trained model and outputs the validation metrics
- Best practice: the validation should be run automatically after training, and its results logged (see the sketch below)
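A hedged sketch of such a validation step, logged with MLflow (`model`, `X_val` and `y_val` are assumed to exist, and the metric choices are illustrative):

```{python}
#| eval: false

import mlflow
from sklearn.metrics import accuracy_score, f1_score

def validate(model, X_val, y_val) -> dict:
    """Take a trained model, output the validation metrics."""
    preds = model.predict(X_val)
    return {
        "accuracy": accuracy_score(y_val, preds),
        "f1_macro": f1_score(y_val, preds, average="macro"),
    }

with mlflow.start_run():
    # Run validation right after training and log the results
    metrics = validate(model, X_val, y_val)
    mlflow.log_metrics(metrics)
```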
# Model wrapping
- Encapsulates a trained and validated model for easy serving
- While the model *per se* takes preprocessed/tokenized tensors as input, the wrapper takes RAW text and outputs readable predictions (not logits)
- A single `.predict()` method should work seamlessly
- It handles all the preprocessing steps (and the internal machinery needed to run inference)
- The package torchTextClassifiers has been developed with this mindset
- MLflow is also naturally designed to help you do that (see the sketch below)
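For instance, a minimal sketch of such a wrapper using MLflow's pyfunc flavor (the tokenizer, model and label mapping are illustrative assumptions, not the actual torchTextClassifiers code):

```{python}
#| eval: false

import mlflow.pyfunc

class TextClassifierWrapper(mlflow.pyfunc.PythonModel):
    """Wraps a trained model so that `.predict()` goes from raw text to labels."""

    def __init__(self, model, tokenizer, labels):
        self.model = model          # trained and validated model
        self.tokenizer = tokenizer  # preprocessing: raw text -> tensors
        self.labels = labels        # class index -> human-readable name

    def predict(self, context, model_input):
        # model_input is RAW text: the wrapper handles all preprocessing
        tensors = self.tokenizer(model_input)
        logits = self.model(tensors)
        # Return readable predictions, not logits
        return [self.labels[i] for i in logits.argmax(dim=1).tolist()]

# The wrapped model can then be logged for serving, e.g.:
# mlflow.pyfunc.log_model(
#     artifact_path="model",
#     python_model=TextClassifierWrapper(model, tokenizer, labels),
# )
```

This keeps the serving interface stable even when the underlying architecture changes.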
# Model storage & versioning

- You should keep track of all the experiments (all the architectures, all the different hyperparameters), and you should be able to reload any experiment you have tried, at any time
- The logging tools generally also handle the storage part
- To "promote" a model once you are satisfied with its performance (and make it ready for deployment), you should have a way to tag and version your models (e.g. SVM-v2, BERT-v4...)
- At deployment time, you should be able to fetch a model using only its tag and version (including a previous one if something suddenly breaks!)

At Insee: MLflow.
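For instance, a hedged sketch of fetching a registered model by name and version from the MLflow Model Registry (the model name "BERT" and version 4 are illustrative):

```{python}
#| eval: false

import mlflow

# Fetch a registered model using only its name (tag) and version
model = mlflow.pyfunc.load_model("models:/BERT/4")
predictions = model.predict(["raw text to classify"])
```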
