Commit 667c589 (parent d586e9b)

Update Blog “production-ready-object-detection-model-training-workflow-with-hpe-machine-learning-development-environment”

1 file changed: 9 additions, 15 deletions

content/blog/production-ready-object-detection-model-training-workflow-with-hpe-machine-learning-development-environment.md

Lines changed: 9 additions & 15 deletions
@@ -11,7 +11,7 @@ tags:
 - machine-learning
 - data-ml-engineer
 ---
-This blog will recount the seamless user experience with [HPE Machine Learning Development Environment](https://www.hpe.com/us/en/solutions/artificial-intelligence/machine-learning-development-environment.html), pointing out how easy it is to achieve machine learning at scale with HPE.
+This in-depth blog tutorial is divided into five sections, recounting the seamless user experience of working with [HPE Machine Learning Development Environment](https://www.hpe.com/us/en/solutions/artificial-intelligence/machine-learning-development-environment.html) and pointing out how easy it is to achieve machine learning at scale with HPE.

 Over the five parts of this blog, we’re going to review end-to-end training of an object detection model using NVIDIA’s PyTorch container from [NVIDIA's NGC Catalog](https://www.nvidia.com/en-us/gpu-cloud/), a Jupyter notebook, the open-source training platform from [Determined AI](http://www.determined.ai/), and [KServe](https://www.kubeflow.org/docs/external-add-ons/kserve/kserve/) to deploy the model into production.

@@ -184,10 +184,7 @@ Here we are using the SAHI library to slice our large satellite images. Satellit
 ## 4. Upload to S3 bucket to support distributed training

 We will now upload our exported data to a publicly accessible S3 bucket. This enables a large-scale distributed experiment to access the dataset without installing it on every device.
-View these links to learn how to upload your dataset to an S3 bucket. Review the `S3Backend` class in `data.py`
-
-* https://docs.determined.ai/latest/training/load-model-data.html#streaming-from-object-storage
-* https://codingsight.com/upload-files-to-aws-s3-with-the-aws-cli/
+View the [Determined documentation](https://docs.determined.ai/latest/training/load-model-data.html#streaming-from-object-storage) and [AWS instructions](https://codingsight.com/upload-files-to-aws-s3-with-the-aws-cli/) to learn how to upload your dataset to an S3 bucket, and review the `S3Backend` class in `data.py`.

 Once you create an S3 bucket that is publicly accessible, here are example commands to upload the preprocessed dataset to S3:
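As a minimal sketch of such an upload (the bucket name is a placeholder, and the boto3-based approach is an assumption — the tutorial's own commands may use the AWS CLI instead):

```python
import os


def s3_key_for(local_path: str, root: str, prefix: str) -> str:
    """Map a local file under `root` to an S3 object key under `prefix`."""
    rel = os.path.relpath(local_path, root)
    return f"{prefix}/{rel}".replace(os.sep, "/")


def upload_dir(root: str, bucket: str, prefix: str) -> None:
    """Recursively upload a directory to S3 (requires boto3 and AWS credentials)."""
    import boto3

    s3 = boto3.client("s3")
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            s3.upload_file(path, bucket, s3_key_for(path, root, prefix))


# Example with a placeholder bucket: upload the sliced training set.
# upload_dir("/tmp/train_sliced_no_neg", "my-xview-bucket", "train_sliced_no_neg")
```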

@@ -215,7 +212,7 @@ Let's get started!

 ## Execute docker run to create NGC environment for Data Prep

-make sure to map host directory to docker directory, we will use the host directory again to
+Make sure to map the host directory to a Docker directory; we will use the host directory again later.

 * `docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/ubuntu:/home/ubuntu -p 8008:8888 -it nvcr.io/nvidia/pytorch:21.11-py3 /bin/bash`

@@ -516,10 +513,7 @@ mkdir /tmp/val_sliced_no_neg
 mv val_300_02.json /tmp/val_sliced_no_neg/val_300_02.json
 ```

-*Note that completing this tutorial requires you to upload your dataset from Step 2 into a publically accessible S3 bucket. This will enable for a large scale distributed experiment to have access to the dataset without installing the dataset on device. View these links to learn how to upload your dataset to an S3 bucket. Review the `S3Backend` class in `data.py`.*
-
-* https://docs.determined.ai/latest/training/load-model-data.html#streaming-from-object-storage
-* https://codingsight.com/upload-files-to-aws-s3-with-the-aws-cli/
+*Note that completing this tutorial requires you to upload your dataset from Step 2 into a publicly accessible S3 bucket. This enables a large-scale distributed experiment to access the dataset without installing it on every device. View the [Determined documentation](https://docs.determined.ai/latest/training/load-model-data.html#streaming-from-object-storage) and [AWS instructions](https://codingsight.com/upload-files-to-aws-s3-with-the-aws-cli/) to learn how to upload your dataset to an S3 bucket.* Review the `S3Backend` class in `data.py`.

 Once you have defined your S3 bucket and uploaded your dataset, make sure to change the `TARIN_DATA_DIR` in `build_training_data_loader` to the defined path in the S3 bucket.

@@ -559,7 +553,7 @@ Run the below commands in a terminal, and complete logging into the determined c

 ## Define Determined Experiment

-In [Determined](www.determined.ai), a *trial* is a training task that consists of a dataset, a deep learning model, and values for all of the model’s hyperparameters. An *experiment* is a collection of one or more trials: an experiment can either train a single model (with a single trial), or can train multiple models via. a hyperparameter sweep a user-defined hyperparameter space.
+In [Determined](https://www.determined.ai/), a *trial* is a training task that consists of a dataset, a deep learning model, and values for all of the model’s hyperparameters. An *experiment* is a collection of one or more trials: an experiment can either train a single model (with a single trial) or train multiple models via a hyperparameter sweep over a user-defined hyperparameter space.

 Here is what a configuration file looks like for a distributed training experiment.
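For illustration, a Determined distributed-training experiment configuration might look like the sketch below; the experiment name, entrypoint, and all values are placeholder assumptions, not the tutorial's actual file:

```yaml
# Illustrative Determined experiment config (placeholder values).
name: xview-fasterrcnn-distributed
hyperparameters:
  global_batch_size: 64
  learning_rate: 0.01
resources:
  slots_per_trial: 8        # GPUs used for one distributed trial
searcher:
  name: single              # train a single model, no hyperparameter sweep
  metric: val_loss
  smaller_is_better: true
  max_length:
    batches: 5000
entrypoint: model_def:ObjectDetectionTrial
```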

@@ -752,7 +746,7 @@ Let's get started!

 Run the below commands to set up a Python virtual environment and install all the Python packages needed for this tutorial:

-```
+```bash
 sudo apt-get update && sudo apt-get install python3.8-venv
 python3 -m venv kserve_env
 source kserve_env/bin/activate
@@ -889,7 +883,7 @@ Checkpoints created from a Determined Experiment will save both the model parame

 Run the below command in a terminal:

 ```bash
 python kserve_utils/torchserve_utils/strip_checkpoint.py --ckpt-path kserve_utils/torchserve_utils/trained_model.pth \
 --new-ckpt-name kserve_utils/torchserve_utils/trained_model_stripped.pth
 ```
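The `strip_checkpoint.py` script belongs to the tutorial's repository; as a hedged sketch of what such a stripping step typically does (the `models_state_dict` key is an assumption about the checkpoint layout, not the script's confirmed implementation):

```python
def strip_checkpoint(ckpt: dict) -> dict:
    """Keep only the model weights from a training checkpoint.

    Drops optimizer/trainer state so the file can be packaged for serving.
    The 'models_state_dict' key is an assumed layout; a plain state dict
    passes through unchanged.
    """
    return ckpt.get("models_state_dict", ckpt)


# Typical usage (requires torch; paths match the command above):
# import torch
# ckpt = torch.load("trained_model.pth", map_location="cpu")
# torch.save(strip_checkpoint(ckpt), "trained_model_stripped.pth")
```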
@@ -898,7 +892,7 @@ python kserve_utils/torchserve_utils/strip_checkpoint.py --ckpt-path kserve_util

 Run the below command to export the PyTorch checkpoint into a .mar file, which is required for TorchServe inference. Our KServe InferenceService will automatically deploy a Pod with a Docker image that supports TorchServe inferencing.

 ```bash
 torch-model-archiver --model-name xview-fasterrcnn \
 --version 1.0 \
 --model-file kserve_utils/torchserve_utils/model-xview.py \
@@ -943,7 +937,7 @@ model_snapshot={"name": "startup.cfg","modelCount": 1,"models": {"xview-fasterrc

 #### What the properties.json looks like

-```
+```json
 [
 {
 "model-name": "xview-fasterrcnn",
