
Commit 17d94ac

Update Blog “end-to-end-easy-to-use-pipeline-for-training-a-model-on-medmnist-v2-using-hpe-machine-learning-development-environment-flask”
1 parent a14a1ca commit 17d94ac

File tree

1 file changed: +48 -48 lines changed

content/blog/end-to-end-easy-to-use-pipeline-for-training-a-model-on-medmnist-v2-using-hpe-machine-learning-development-environment-flask.md

Lines changed: 48 additions & 48 deletions
@@ -45,63 +45,22 @@ In this blog post, you'll get to see firsthand how HPE Machine Learning Developm

If you are interested in more details about how this example was developed, take a look at the "Practice" section. For a full, in-depth model porting guide, check out this [model porting guide](https://docs.determined.ai/latest/tutorials/pytorch-porting-tutorial.html). The code for this example and the instructions used to run it can be found in the [repository](https://github.com/ighodgao/determined_medmnist_e2e).

-| *Feature* | *Without HPE Machine Learning Development Environment* | *With HPE Machine Learning Development Environment* |
-| ----------- | ----------- | ----------- |
-| Distributed Training | Configure using open-source tools of your choice (e.g. Ray, Horovod) | Fault tolerant distributed training automatically enabled |
-| Experiment Visualization | Write custom code or configure using open-source tools of your choice, (e.g. Weights & Biases, Tensorboard) | Training metrics (model accuracy, model loss) available natively in WebUI, including Tensorboard extension |
-Checkpointing | Write custom logic to save checkpoints during training, which may not be robust to code failures, or configure using open-source tools of your choice | Automatic, robust checkpoint management (e.g. best checkpoint saved at end of training, automatic checkpoint deletion, save checkpoint on experiment pause)|
-Hyperparameter Search | Write custom code or configure using tools of your choice (e.g. Ray Tune, Optuna) | State-of-the-art hyperparameter search algorithm (Adaptive ASHA) automatically available out of the box
+| *Feature*                | *Without HPE Machine Learning Development Environment*                                                                                                 | *With HPE Machine Learning Development Environment*                                                                                                          |
+| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Distributed Training     | Configure using open-source tools of your choice (e.g. Ray, Horovod)                                                                                   | Fault tolerant distributed training automatically enabled                                                                                                   |
+| Experiment Visualization | Write custom code or configure using open-source tools of your choice, (e.g. Weights & Biases, Tensorboard)                                            | Training metrics (model accuracy, model loss) available natively in WebUI, including Tensorboard extension                                                  |
+| Checkpointing            | Write custom logic to save checkpoints during training, which may not be robust to code failures, or configure using open-source tools of your choice  | Automatic, robust checkpoint management (e.g. best checkpoint saved at end of training, automatic checkpoint deletion, save checkpoint on experiment pause) |
+| Hyperparameter Search    | Write custom code or configure using tools of your choice (e.g. Ray Tune, Optuna)                                                                      | State-of-the-art hyperparameter search algorithm (Adaptive ASHA) automatically available out of the box                                                     |

As you can see, without a centralized training platform to handle all necessary features in one place, users are left to write custom code or use a variety of open-source tools. This can get complicated very quickly, as it’s difficult to manage multiple dependencies, and compatibility issues start to arise between tools.

In many cases, HPE Machine Learning Development Environment can reduce the length of a training script to nearly half its original size, due to the sheer amount of boilerplate code normally required to enable these features.

Let's take a closer look at the core features of HPE Machine Learning Development Environment!

-### Experiment visualization and metric logging
-
-Visualization tools are important when developing models due to the probabilistic nature of machine learning. Debugging a model often involves analyzing a model’s training journey by visualizing metrics at different timestamps during an experiment. Commonly used tools for visualization often require manual configuration. Let’s take a look at how the [original training script](https://github.com/MedMNIST/experiments/blob/main/MedMNIST2D/train_and_eval_pytorch.py) handles visualization:
-
-The original script uses a library called [tensorboardX](https://tensorboardx.readthedocs.io/en/latest/tensorboard.html#module-tensorboardX)
-
-```python
-from tensorboardX import SummaryWriter
-```
-
-Using this library, a writer object is created for handling visualization data:
-
-```python
-writer = SummaryWriter(log_dir=os.path.join(output_root, 'Tensorboard_Results'))
-```
-
-The writer object is referenced a total of 9 times throughout the script.
-
-In addition, training and testing metrics are manually calculated and logged in various places throughout the script, e.g.:
-
-```python
-logs = ['loss', 'auc', 'acc']
-train_logs = ['train_'+log for log in logs]
-val_logs = ['val_'+log for log in logs]
-test_logs = ['test_'+log for log in logs]
-log_dict = OrderedDict.fromkeys(train_logs+val_logs+test_logs, 0)
-```
-
-```python
-train_log = 'train auc: %.5f acc: %.5f\n' % (train_metrics[1], train_metrics[2])
-val_log = 'val auc: %.5f acc: %.5f\n' % (val_metrics[1], val_metrics[2])
-test_log = 'test auc: %.5f acc: %.5f\n' % (test_metrics[1], test_metrics[2])
-
-log = '%s\n' % (data_flag) + train_log + val_log + test_log
-print(log)
-```
-
-With Determined, no manual metric tracking or logging is necessary. When porting your model to one of our high-level APIs, the default training and testing metrics, such as model losses, are automatically configured and rendered natively in the WebUI:
-
-![](/img/screenshot1.png)
-
### Distributed training

-Distributed training refers to the process of distributing a model training workload across multiple devices, such as GPUs. It’s very common for machine learning workloads to run for weeks on end due to large model and dataset sizes, so distributing model training across GPUs can drastically speed up the time it takes to develop a machine learning model.
+Distributed training refers to the process of distributing a model training workload across multiple devices, such as GPUs. It’s very common for machine learning workloads to run for weeks on end due to large model and dataset sizes, so distributing model training across GPUs can drastically speed up the time it takes to develop a machine learning model, from weeks to hours.

However, this is difficult to set up and difficult to manage: manual interaction with GPUs through code is often necessary when setting up distributed training, and, once set up, managing distributed training is cumbersome due to issues like fault tolerance. Fault tolerance refers to the ability of a system to gracefully handle and continue a training job even if something on the infrastructure level goes wrong, such as a device failing. Setting up a fault-tolerant solution manually is an enormous lift for an ML team, and not normally within the scope of a researcher’s abilities.

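To make the distributed training and hyperparameter search rows of the comparison table above concrete, here is a minimal sketch of how these features are typically switched on in Determined: declaratively, through an experiment configuration, rather than through custom training-loop code. The snippet expresses the configuration as a Python dict and submits it with the Determined Python SDK; the experiment name, entrypoint, and hyperparameter values are illustrative placeholders (they are not taken from the linked repository), and the exact configuration fields should be checked against the Determined documentation for your version.

```python
# Illustrative sketch only: a Determined experiment configuration written as a Python
# dict and submitted through the Determined Python SDK. Names and values are placeholders.
from determined.experimental import client

config = {
    "name": "medmnist-demo",                      # placeholder experiment name
    "entrypoint": "model_def:TinyMedMNISTTrial",  # placeholder module:TrialClass
    "resources": {"slots_per_trial": 4},          # distributed training across 4 GPUs
    "searcher": {                                 # Adaptive ASHA hyperparameter search
        "name": "adaptive_asha",
        "metric": "val_acc",
        "smaller_is_better": False,
        "max_trials": 16,
        "max_length": {"epochs": 10},
    },
    "hyperparameters": {
        "global_batch_size": 128,
        "lr": {"type": "log", "base": 10, "minval": -4, "maxval": -1},
    },
}

# Fault tolerance, checkpoint management, and metric logging are handled by the platform;
# nothing about them needs to appear in the config or in the model code.
client.login(master="http://localhost:8080", user="determined")  # adjust for your cluster
experiment = client.create_experiment(config=config, model_dir=".")
print(f"Submitted experiment {experiment.id}")
```

The same trial code runs on one GPU or many; changing `slots_per_trial` is the only difference the user sees.
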
@@ -144,6 +103,47 @@ After taking these steps, you’d be able to watch your experiment progress in t

![](/img/screenshot2.png)

+### Experiment visualization and metric logging
+
+Visualization tools are important when developing models due to the probabilistic nature of machine learning. Debugging a model often involves analyzing a model’s training journey by visualizing metrics at different timestamps during an experiment. Commonly used tools for visualization often require manual configuration. Let’s take a look at how the [original training script](https://github.com/MedMNIST/experiments/blob/main/MedMNIST2D/train_and_eval_pytorch.py) handles visualization:
+
+The original script uses a library called [tensorboardX](https://tensorboardx.readthedocs.io/en/latest/tensorboard.html#module-tensorboardX)
+
+```python
+from tensorboardX import SummaryWriter
+```
+
+Using this library, a writer object is created for handling visualization data:
+
+```python
+writer = SummaryWriter(log_dir=os.path.join(output_root, 'Tensorboard_Results'))
+```
+
+The writer object is referenced a total of 9 times throughout the script.
+
+In addition, training and testing metrics are manually calculated and logged in various places throughout the script, e.g.:
+
+```python
+logs = ['loss', 'auc', 'acc']
+train_logs = ['train_'+log for log in logs]
+val_logs = ['val_'+log for log in logs]
+test_logs = ['test_'+log for log in logs]
+log_dict = OrderedDict.fromkeys(train_logs+val_logs+test_logs, 0)
+```
+
+```python
+train_log = 'train auc: %.5f acc: %.5f\n' % (train_metrics[1], train_metrics[2])
+val_log = 'val auc: %.5f acc: %.5f\n' % (val_metrics[1], val_metrics[2])
+test_log = 'test auc: %.5f acc: %.5f\n' % (test_metrics[1], test_metrics[2])
+
+log = '%s\n' % (data_flag) + train_log + val_log + test_log
+print(log)
+```
+
+With Determined, no manual metric tracking or logging is necessary. When porting your model to one of our high-level APIs, the default training and testing metrics, such as model losses, are automatically configured and rendered natively in the WebUI:
+
+![](/img/screenshot1.png)
+
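
To illustrate the "no manual metric tracking" point, below is a minimal, hypothetical Determined `PyTorchTrial` sketch (not the trial class from the linked repository; the tiny MLP, the class name, and the choice of PathMNIST as the MedMNIST v2 dataset are placeholders): the dictionaries returned from `train_batch()` and `evaluate_batch()` are what gets tracked and plotted in the WebUI, with no `SummaryWriter` and no hand-built log strings.

```python
# A minimal, self-contained sketch of metric reporting with Determined's PyTorchTrial API.
# Placeholder model and class name -- not the blog's actual trial code.
import torch
import torch.nn as nn
from torchvision import transforms
from medmnist import PathMNIST  # one of the MedMNIST v2 datasets; 3x28x28 images, 9 classes
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext


class TinyMedMNISTTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        # A deliberately tiny MLP standing in for the ResNet used in the blog.
        self.model = self.context.wrap_model(
            nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 128), nn.ReLU(), nn.Linear(128, 9))
        )
        self.optimizer = self.context.wrap_optimizer(
            torch.optim.Adam(self.model.parameters(), lr=1e-3)
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def build_training_data_loader(self) -> DataLoader:
        data = PathMNIST(split="train", transform=transforms.ToTensor(), download=True)
        # get_per_slot_batch_size() relies on the global_batch_size hyperparameter.
        return DataLoader(data, batch_size=self.context.get_per_slot_batch_size(), shuffle=True)

    def build_validation_data_loader(self) -> DataLoader:
        data = PathMNIST(split="val", transform=transforms.ToTensor(), download=True)
        return DataLoader(data, batch_size=self.context.get_per_slot_batch_size())

    def train_batch(self, batch, epoch_idx: int, batch_idx: int):
        inputs, targets = batch
        loss = self.loss_fn(self.model(inputs), targets.squeeze(1).long())
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        # No writer object and no manual log strings: this dict shows up in the WebUI.
        return {"train_loss": loss}

    def evaluate_batch(self, batch):
        inputs, targets = batch
        outputs = self.model(inputs)
        targets = targets.squeeze(1).long()
        acc = (outputs.argmax(dim=1) == targets).float().mean()
        return {"val_loss": self.loss_fn(outputs, targets), "val_acc": acc}
```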

### Automatic checkpointing

Checkpointing a model throughout an experiment is important for maintaining training progress and for preserving the best model at the end of an experiment. Let’s take a look at how the original training script handles model checkpointing.
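
The diff ends before the original script's checkpointing logic, but as a rough illustration of the kind of bookkeeping being discussed, a hand-rolled PyTorch checkpointing helper usually looks something like the sketch below (a generic example, not the MedMNIST script itself). With HPE Machine Learning Development Environment, this logic, along with resuming after failures and keeping only the best checkpoints, is handled automatically.

```python
# Generic, hand-rolled checkpointing in plain PyTorch -- the kind of bookkeeping that
# Determined's automatic checkpoint management replaces. Not the MedMNIST script itself.
import os
import torch


def save_if_best(model, optimizer, epoch, val_acc, best_acc, output_dir="./checkpoints"):
    """Persist model and optimizer state whenever validation accuracy improves."""
    os.makedirs(output_dir, exist_ok=True)
    if val_acc > best_acc:
        torch.save(
            {
                "epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "val_acc": val_acc,
            },
            os.path.join(output_dir, "best_model.pth"),
        )
        best_acc = val_acc
    return best_acc
```

Resuming after a crash (reloading the state dicts, restoring the epoch counter and the best-accuracy bookkeeping) also has to be written and tested by hand, which is exactly the fault-tolerance gap described earlier.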
