chapters/chapter1.qmd (40 additions, 0 deletions)
@@ -4,6 +4,38 @@ format: html # This qmd file will be compiled as an HTML web page
---
One of the most important practices in Machine Learning projects is to strictly separate data, code (incl. model architecture, training code, API, etc.) and the compute environment.
Enforcing such a separation enables you to:
- ensure strict reproducibility of the full pipeline
- keep each component independent and easier to maintain
# Data storage
In that spirit, data should absolutely live in stable storage, far from the messy environment of code and compute. If your code or your computer crashes, your data should be safe.
At Insee, we extensively use a cloud-based S3 data storage solution built on the open-source MinIO framework, be it on the SSP Cloud (the public Onyxia instance for collaborative, non-sensitive use cases) or LS3 (the internal Onyxia instance for secured, offline projects).
Accessing your data from the storage is then very easy, from any compute environment (think of it as a Google Drive share link, for instance).
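As a minimal sketch, reading a dataset straight from such an S3 bucket takes only a few lines. The bucket and file names below are hypothetical, and the endpoint shown is the SSP Cloud's MinIO endpoint; adapt both to your own instance.

```python
import os

# Assumed endpoint of the SSP Cloud MinIO instance; change for your deployment.
SSP_CLOUD_ENDPOINT = "https://minio.lab.sspcloud.fr"

def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI of an object, usable by pandas/fsspec."""
    return f"s3://{bucket}/{key}"

def read_dataset(bucket: str, key: str):
    """Read a parquet dataset directly from S3 storage.

    Credentials are taken from the standard AWS_* environment variables,
    which Onyxia injects into every service it launches.
    """
    import pandas as pd  # deferred import: the helpers above stay dependency-free

    return pd.read_parquet(
        s3_uri(bucket, key),
        storage_options={"client_kwargs": {"endpoint_url": SSP_CLOUD_ENDPOINT}},
    )
```

From a notebook or a batch job, `read_dataset("my-project-bucket", "data/train.parquet")` (hypothetical names) then returns the same DataFrame regardless of which compute environment it runs in.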
@@ -33,6 +65,14 @@ Each categorical variable has a predefined set of valid input classes, since the
# Data versioning
Just like code (see chapter 2), a good practice is to version the dataset, so as to know exactly which data the model has been trained on (or which is the latest version for the model to be trained on).
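The core idea can be illustrated with stdlib only: fingerprint the dataset's content and record that hash alongside each training run. This is a minimal sketch of the principle, not a substitute for a dedicated versioning tool.

```python
import hashlib

def dataset_fingerprint(path: str) -> str:
    """Hash the dataset's bytes: identical data => identical fingerprint."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:12]  # short id to store next to the trained model
```

Storing this id in the training logs is enough to later verify that a model was trained on exactly the data you think it was.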
Several tools are available to seamlessly achieve this versioning:
chapters/chapter2.qmd (22 additions, 1 deletion)
@@ -5,10 +5,31 @@ format: html # This qmd file will be compiled as an HTML web page
# Model training
- Parallelized training for maximum reproducibility
  - Fix the seed
  - Parallelize training with tools such as Argo Workflows
  - Use logging tools (MLflow, Weights & Biases...)
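A minimal seed-fixing helper might look like this (the function name is our own, and the PyTorch lines are left as comments since they depend on your framework):

```python
import os
import random

import numpy as np

def fix_seed(seed: int = 42) -> None:
    """Pin every source of randomness so two runs draw identical numbers."""
    random.seed(seed)                           # Python's own RNG
    np.random.seed(seed)                        # NumPy's global RNG
    os.environ["PYTHONHASHSEED"] = str(seed)    # hash-based orderings
    # With PyTorch, additionally:
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
    # torch.use_deterministic_algorithms(True)
```

Calling `fix_seed()` at the very start of each training script (and of each parallel worker) is what makes reruns comparable.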
# Model validation
- Key metrics to be checked before deployment
- You should have a fully reproducible validation script/pipeline that seamlessly takes a trained model and outputs the validation metrics
- Best practice: validation should run automatically after training, and its results should be logged
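Such a pipeline can be sketched as a single function: model in, metrics dict out. The function name and the accuracy-only metric set are our own, purely for illustration.

```python
from typing import Callable, Sequence

def validate(predict: Callable[[Sequence], Sequence],
             inputs: Sequence, labels: Sequence) -> dict:
    """Reproducible validation step: any model exposing `predict`, metrics out."""
    preds = list(predict(inputs))
    n = len(labels)
    accuracy = sum(p == y for p, y in zip(preds, labels)) / n
    metrics = {"n_samples": n, "accuracy": accuracy}
    # In a real pipeline, these metrics would be logged automatically
    # (e.g. to MLflow) right after training, not inspected by hand.
    return metrics
```

Because the function only depends on its arguments, the exact same script can score any trained model, which is what makes the validation step itself reproducible.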
# Model wrapping
- Encapsulates a trained and validated model for easy serving
- While the model *per se* takes preprocessed/tokenized tensors as input, the wrapper aims at taking RAW text and outputting readable predictions (not logits)
- A single `.predict()` method should work seamlessly
- It handles all the preprocessing steps (and the internal machinery needed to run inference)
- The torchTextClassifiers package has been developed in this mindset
- MLflow is also naturally designed to help you do that
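A minimal wrapper sketch, assuming a hypothetical tokenizer/model pair (the real torchTextClassifiers or MLflow pyfunc interfaces differ in the details):

```python
class TextClassifierWrapper:
    """Hides tokenizer + model + label map behind a single .predict()."""

    def __init__(self, tokenize, model, labels):
        self.tokenize = tokenize   # raw text -> model-ready features
        self.model = model         # features -> one score per class
        self.labels = labels       # class index -> human-readable label

    def predict(self, texts):
        """Raw strings in, readable labels out: no tensors or logits exposed."""
        out = []
        for text in texts:
            scores = self.model(self.tokenize(text))
            best = max(range(len(scores)), key=scores.__getitem__)  # argmax
            out.append(self.labels[best])
        return out
```

The caller only ever sees `wrapper.predict(["some raw sentence"])`, so the preprocessing can change without breaking any downstream consumer.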
# Model storage & versioning
- You should keep track of all the experiments (all the architectures, all the hyperparameter combinations), and you should be able to reload any experiment you have tried, at any time
- The logging tools generally also handle the storage part
- To "promote" a model once you are satisfied with its performance (and make it ready for deployment), you should have a way to tag and version your models (e.g. SVM-v2, BERT-v4, and so on)
- At deployment time, you should be able to fetch a model using only its tag and its version (including a previous one if something suddenly breaks!)