If you are interested in more details about how this example was developed, take a look at the "Practice" section. For a full, in-depth walkthrough, check out this [model porting guide](https://docs.determined.ai/latest/tutorials/pytorch-porting-tutorial.html). The code for this example and the instructions used to run it can be found in the [repository](https://github.com/ighodgao/determined_medmnist_e2e).
|*Feature*|*Without HPE Machine Learning Development Environment*|*With HPE Machine Learning Development Environment*|
| ----------- | ----------- | ----------- |
| Distributed Training | Configure using open-source tools of your choice (e.g. Ray, Horovod) | Fault tolerant distributed training automatically enabled |
| Experiment Visualization | Write custom code or configure using open-source tools of your choice (e.g. Weights & Biases, Tensorboard) | Training metrics (model accuracy, model loss) available natively in WebUI, including Tensorboard extension |
| Checkpointing | Write custom logic to save checkpoints during training, which may not be robust to code failures, or configure using open-source tools of your choice | Automatic, robust checkpoint management (e.g. best checkpoint saved at end of training, automatic checkpoint deletion, save checkpoint on experiment pause) |
| Hyperparameter Search | Write custom code or configure using tools of your choice (e.g. Ray Tune, Optuna) | State-of-the-art hyperparameter search algorithm (Adaptive ASHA) automatically available out of the box |
As you can see, without a centralized training platform to handle all necessary features in one place, users are left to write custom code or use a variety of open-source tools. This can get complicated very quickly, as it’s difficult to manage multiple dependencies, and compatibility issues start to arise between tools.
In many cases, HPE Machine Learning Development Environment can reduce the length of a training script to nearly half its original size, due to the sheer amount of boilerplate code normally required to enable these features.
Let's take a closer look at the core features of HPE Machine Learning Development Environment!
### Distributed training
Distributed training refers to the process of distributing a model training workload across multiple devices, such as GPUs. It’s very common for machine learning workloads to run for weeks on end due to large model and dataset sizes, so distributing model training across GPUs can drastically cut the time it takes to develop a machine learning model, from weeks to hours.
However, this is difficult to set up and difficult to manage: manual interaction with GPUs through code is often necessary when setting up distributed training, and, once set up, managing distributed training is cumbersome due to issues like fault tolerance. Fault tolerance refers to the ability of a system to gracefully handle and continue a training job even if something on the infrastructure level goes wrong, such as a device failing. Setting up a fault-tolerant solution manually is an enormous lift for an ML team, and not normally within the scope of a researcher’s abilities.
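To give a rough sense of what porting looks like, here is a minimal, illustrative sketch of a Determined `PyTorchTrial` for an image classifier of this kind. The class name, network, and hyperparameter names below are assumptions for illustration, not the exact code from the linked repository:

```python
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset


class MedMNISTTrial(PyTorchTrial):  # hypothetical class name, for illustration only
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context

        # Wrapping the model and optimizer is what lets Determined distribute the
        # workload across GPUs and recover from failures without extra user code.
        model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )
        self.model = self.context.wrap_model(model)
        self.optimizer = self.context.wrap_optimizer(
            # "lr" is assumed to be defined under `hyperparameters` in the experiment config.
            torch.optim.Adam(self.model.parameters(), lr=self.context.get_hparam("lr"))
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def _stand_in_dataset(self, n: int = 512) -> TensorDataset:
        # Random data with MedMNIST-like shapes (1x28x28 images, integer labels);
        # the real example loads MedMNIST v2 here instead.
        return TensorDataset(torch.rand(n, 1, 28, 28), torch.randint(0, 10, (n,)))

    def build_training_data_loader(self) -> DataLoader:
        # The per-slot batch size is derived from `global_batch_size` in the experiment config.
        return DataLoader(
            self._stand_in_dataset(), batch_size=self.context.get_per_slot_batch_size()
        )

    def build_validation_data_loader(self) -> DataLoader:
        return DataLoader(
            self._stand_in_dataset(), batch_size=self.context.get_per_slot_batch_size()
        )

    def train_batch(self, batch, epoch_idx, batch_idx):
        inputs, labels = batch
        loss = self.loss_fn(self.model(inputs), labels)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        # Whatever is returned here is logged automatically and shows up in the WebUI.
        return {"loss": loss}

    def evaluate_batch(self, batch):
        inputs, labels = batch
        outputs = self.model(inputs)
        accuracy = (outputs.argmax(dim=1) == labels).float().mean()
        return {"validation_loss": self.loss_fn(outputs, labels), "accuracy": accuracy}
```

Once the model is expressed this way, scaling from one GPU to many is a configuration change (for example, increasing `slots_per_trial` under `resources` in the experiment configuration) rather than a code change, and fault tolerance is handled for you.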
After taking these steps, you’d be able to watch your experiment progress in the WebUI:

### Experiment visualization and metric logging
Visualization tools are important when developing models due to the probabilistic nature of machine learning. Debugging a model often involves analyzing a model’s training journey by visualizing metrics at different timestamps during an experiment. Commonly used tools for visualization often require manual configuration. Let’s take a look at how the [original training script](https://github.com/MedMNIST/experiments/blob/main/MedMNIST2D/train_and_eval_pytorch.py) handles visualization:
The original script uses a library called [tensorboardX](https://tensorboardx.readthedocs.io/en/latest/tensorboard.html#module-tensorboardX):
```python
from tensorboardX import SummaryWriter
```
Using this library, a writer object is created for handling visualization data:
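In outline, the tensorboardX pattern looks something like the sketch below; the log directory and tag names here are illustrative assumptions, not the exact values from the original script:

```python
import os
from tensorboardX import SummaryWriter

# Illustrative sketch of manual metric logging with tensorboardX.
writer = SummaryWriter(log_dir=os.path.join("output", "tensorboard_logs"))

for iteration in range(100):
    train_loss = 1.0 / (iteration + 1)  # stand-in for a real training-loss value
    writer.add_scalar("train_loss", train_loss, iteration)

writer.close()
```

Every metric you want to inspect later has to be wired up by hand this way.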
With Determined, no manual metric tracking or logging is necessary. When porting your model to one of our high-level APIs, the default training and testing metrics, such as model losses, are automatically configured and rendered natively in the WebUI:

### Automatic checkpointing
Checkpointing a model throughout an experiment is important for maintaining training progress and preserving the best model at the end of an experiment. Let’s take a look at how the original training script handles model checkpointing.
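In a plain PyTorch training loop, hand-rolled checkpointing usually amounts to something like the sketch below; the file names and best-metric logic are illustrative assumptions, not the original script’s exact code:

```python
import os
import torch
import torch.nn as nn

# Illustrative sketch of manual checkpointing, not the original script's code.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
output_dir = "checkpoints"
os.makedirs(output_dir, exist_ok=True)

best_accuracy = 0.0
for epoch in range(10):
    val_accuracy = 0.1 * epoch  # stand-in for a real validation metric

    # Save the latest weights every epoch, and keep the best model separately.
    torch.save(model.state_dict(), os.path.join(output_dir, "last.pth"))
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        torch.save(model.state_dict(), os.path.join(output_dir, "best.pth"))
```

If the process dies between saves, or the bookkeeping above has a bug, training progress is simply lost. With Determined, this logic disappears from the training script: checkpoints are saved automatically, the best checkpoint is kept at the end of training, and old checkpoints are cleaned up for you.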