
Commit cc26e82

Merge pull request #3 from AIML4OS/France
France
2 parents 659af58 + de4c210 commit cc26e82

File tree

2 files changed: +62 -1 lines changed


chapters/chapter1.qmd

Lines changed: 40 additions & 0 deletions
@@ -4,6 +4,38 @@ format: html # This qmd file will be compiled as an HTML web page
---

One of the most important practices for Machine Learning projects is to strictly separate the data, the code (incl. model architecture, training code, API, etc.) and the compute environment.
Enforcing such a separation makes it possible to:

- have strict reproducibility of the full pipeline
- keep each component independent and more maintainable
# Data storage
In that spirit, data should absolutely live in stable storage, away from the messy environment of code and compute. If your code or your computer crashes, your data should remain safe.
At Insee, we extensively use a cloud-based S3 data storage solution based on the open-source MinIO framework, be it on the SSP Cloud (the public Onyxia instance for collaborative, non-sensitive use cases) or on LS3 (the internal Onyxia instance for secure, offline projects).
Accessing your data from the storage is then very easy from any compute environment (think of it as a Google Drive share link, for instance).
For instance, in Python:

```{python}
#| eval: false

import os

import pandas as pd
from s3fs import S3FileSystem

# Connecting to the storage via a filesystem
fs = S3FileSystem(
    client_kwargs={"endpoint_url": f"https://{os.environ['AWS_S3_ENDPOINT']}"},
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Loading a dataframe is very easy!
df_train = pd.read_parquet("df_train.parquet", filesystem=fs)

# Saving too
df_train.to_parquet("df_train.parquet", filesystem=fs)
```
# Data storage

## Germany
@@ -33,6 +65,14 @@ Each categorical variable has a predefined set of valid input classes, since the

# Data versioning

Just like code (see chapter 2), a good practice is to version the dataset, so that you know exactly which data the model has been trained on (or which version is the latest for the model to be trained on).
Several tools are available to seamlessly achieve this versioning (see the sketch below):
- MLFlow Datasets
- DVC
This is still a work in progress at Insee.
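As a hedged illustration with MLflow Datasets (the source URI and dataset name below are placeholders, and `fs` reuses the filesystem from the storage example above):

```{python}
#| eval: false

import mlflow
import mlflow.data
import pandas as pd

# Reusing the `fs` filesystem from the storage example above
df_train = pd.read_parquet("df_train.parquet", filesystem=fs)

# The source URI is a placeholder: adapt it to your bucket/key
dataset = mlflow.data.from_pandas(
    df_train,
    source="s3://my-bucket/df_train.parquet",
    name="df_train",
)

with mlflow.start_run():
    # Logs the dataset's name, schema, profile and source with the run
    mlflow.log_input(dataset, context="training")
```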
## Germany

None

chapters/chapter2.qmd

Lines changed: 22 additions & 1 deletion
@@ -5,10 +5,31 @@ format: html # This qmd file will be compiled as an HTML web page
# Model training

- Parallelized training for maximum reproducibility (see the sketch below):
  - fix the seed
  - parallelize training with tools such as Argo Workflows
  - use logging tools (MLflow, Weights & Biases...)
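For instance, a minimal seed-fixing sketch, assuming a PyTorch-based training script (`set_seed` is our own helper name, not a library function):

```{python}
#| eval: false

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix all the usual sources of randomness for reproducible training."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```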

# Model validation

- Key metrics should be checked before deployment
- You should have a fully reproducible validation script/pipeline that seamlessly takes a trained model and outputs the validation metrics
- Best practice: the validation should be run automatically after training, and its results logged (see the sketch below)
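A hedged sketch of such a validation step, logged with MLflow (`model`, `X_val` and `y_val` are assumed to exist, and the metric choices are illustrative):

```{python}
#| eval: false

import mlflow
from sklearn.metrics import accuracy_score, f1_score

def validate(model, X_val, y_val) -> dict:
    """Take a trained model, output the validation metrics."""
    preds = model.predict(X_val)
    return {
        "accuracy": accuracy_score(y_val, preds),
        "f1_macro": f1_score(y_val, preds, average="macro"),
    }

with mlflow.start_run():
    # Run validation right after training and log the results
    metrics = validate(model, X_val, y_val)
    mlflow.log_metrics(metrics)
```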
# Model wrapping
- Encapsulates a trained and validated model for easy serving
- While the model *per se* takes preprocessed/tokenized tensors as input, the wrapper takes RAW text and outputs readable predictions (not logits)
- A single `.predict()` method should work seamlessly
- It handles all the preprocessing steps (and the internal machinery needed to run inference)
- The package torchTextClassifiers has been developed with this mindset
- MLflow is also naturally designed to help you do that (see the sketch below)
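For instance, a minimal sketch of such a wrapper using MLflow's pyfunc flavor (the tokenizer, model and label mapping are illustrative assumptions, not the actual torchTextClassifiers code):

```{python}
#| eval: false

import mlflow.pyfunc

class TextClassifierWrapper(mlflow.pyfunc.PythonModel):
    """Wraps a trained model so that `.predict()` goes from raw text to labels."""

    def __init__(self, model, tokenizer, labels):
        self.model = model          # trained and validated model
        self.tokenizer = tokenizer  # preprocessing: raw text -> tensors
        self.labels = labels        # class index -> human-readable name

    def predict(self, context, model_input):
        # model_input is RAW text: the wrapper handles all preprocessing
        tensors = self.tokenizer(model_input)
        logits = self.model(tensors)
        # Return readable predictions, not logits
        return [self.labels[i] for i in logits.argmax(dim=1).tolist()]

# The wrapped model can then be logged for serving, e.g.:
# mlflow.pyfunc.log_model(
#     artifact_path="model",
#     python_model=TextClassifierWrapper(model, tokenizer, labels),
# )
```

This keeps the serving interface stable even when the underlying architecture changes.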
# Model storage & versioning

- You should keep track of all the experiments (all the architectures, all the different hyperparameters), and you should be able to reload any experiment you have tried, at any time
- The logging tools generally also handle the storage part
- To "promote" a model once you are satisfied with its performance (and make it ready for deployment), you should have a way to tag and version your models (e.g. SVM-v2, BERT-v4...)
- At deployment time, you should be able to fetch a model using only its tag and version (including a previous one if something suddenly breaks!)

At Insee: MLflow.
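For instance, a hedged sketch of fetching a registered model by name and version from the MLflow Model Registry (the model name "BERT" and version 4 are illustrative):

```{python}
#| eval: false

import mlflow

# Fetch a registered model using only its name (tag) and version
model = mlflow.pyfunc.load_model("models:/BERT/4")
predictions = model.predict(["raw text to classify"])
```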
