
Commit 0eacbdb

Merge branch 'master' into chore/remove-support-for-ecr-spec-fallback-for-jumpstart-models

2 parents: 71f67ab + 8d08c99

File tree: 141 files changed (+10751 / -500 lines)


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@ env/
 .python-version
 *.html
 **/_repack_script_launcher.sh
+src/sagemaker/modules/train/container_drivers/sm_train.sh
+src/sagemaker/modules/train/container_drivers/sourcecode.json
+src/sagemaker/modules/train/container_drivers/distributed.json
 tests/data/**/_repack_model.py
 tests/data/experiment/sagemaker-dev-1.0.tar.gz
 src/sagemaker/serve/tmp_workspace

.pydocstylerc

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 inherit = false
 ignore = D104,D107,D202,D203,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
 match = (?!record_pb2).*\.py
+match-dir = (?!.*test).*
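
A minimal Python sketch (not part of the commit) of what the new ``match-dir`` value does: the negative lookahead makes pydocstyle skip any directory whose name contains ``test``. The sketch assumes the pattern is anchored against each directory name, as pydocstyle effectively does:

    import re

    # pydocstyle anchors match-dir patterns against the directory name
    # (trailing "$" added here to mimic that; an assumption of this sketch)
    match_dir = re.compile(r"(?!.*test).*$")

    print(bool(match_dir.match("src")))         # True  -> docstrings are checked
    print(bool(match_dir.match("unit_tests")))  # False -> directory is skipped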

CHANGELOG.md

Lines changed: 97 additions & 0 deletions
@@ -1,5 +1,102 @@
 # Changelog

+## v2.237.0 (2024-12-05)
+
+### Features
+
+* Support SageMakerTrainingPlan for training jobs
+* AMI support for BRM
+* Adding Bedrock Store model support for HubService
+
+### Bug Fixes and Other Changes
+
+* Fix unit tests
+* update boto3 and sagemaker-core version
+* fix gpu_image uri
+* Hotfix to construct rubik uri correctly
+* fix codestyles
+* fix merge artifact
+* fix merge artifact
+* fix test_requiremenets.txt
+* chore: Merge from main
+
+## v2.236.0 (2024-12-04)
+
+### Features
+
+* Partner App Auth Provider for SDK support
+* add pre-processing and post-processing logic to inference_spec
+* add utility function to capture local snapshot
+* support script mode with local train.sh
+
+### Bug Fixes and Other Changes
+
+* Add graphene to doc requirements
+* Add graphne to the doc requirements
+* Enable the Recipe tests marked with @pytest.mark.skip(reason="Hyperpod recipe code unavailable"
+* Add model trainer documentation
+* Usage docs for training recipes
+* Neuron URIs update
+* Update URIs to public for training recipes
+* Changes for SMP v2.7.0
+* Change default source directory to current, add option to specify source dir
+* Remove default values for fields in recipe_overrides and fix recipe path.
+* Update MANIFEST.in so that wheel builds correctly
+* fix the file uploading signature verification error
+* remove example notebooks artifacts
+* Morpheus tests
+* Integ tests for local mode model trainer
+* Update hyperpod recipe uris
+* Add interface units for ModelTrainer
+* Model Trainer Bucket improvements
+* Update ModelTrainer Interface Parameters
+* add in-process mode definition to docs
+* Intelligent defaults for Model Trainer
+* Fix tests and codestyle
+* add integ test for base_model_builder_deploy and remove print statement
+* Revert image builder
+* pin xgboost dlc to 1.7.1 to fix test
+* Skip JS model mapping with env vars or image URI provided
+* Use sagemaker core Session
+* Integration tests for Model Builder Handshake
+* [Updated] Add telemetry to ModelTrainer, Estimator and ModelBuilder
+* Update kandinsky in ModelTrainer and allow setting requirements
+* add modelID support to model builder InProcess model
+* Add Rich Logging to Model Builder
+* Notebooks update for Bugbash
+* Add bugbash bootstrapping
+* add inference morpheus nbs
+* Update ModelTrainer Notebooks
+* Bug fixes
+* Single container local training
+* update notebooks
+* update notebooks
+* Add recipes examples
+* Unified Deployment interface in Model Builder
+* Use exact python path in trainer template
+* Support building image from Dockerfile
+* Add Support for Training Recipes
+* Trainer handshake
+* Pass hyperparameters as CLI args
+* Add in_process mode support for DJL and TorchServe servers
+* Remove ignored files
+* Simplify Config Class Names and DistributedRunner structures
+* Fix bug in script mode setup ModelTrainer
+* Mask Sensitive Env Logs in Container
+* Add path to set Additional Settings in ModelTrainer
+* Add Distributed Training Support Model Trainer
+* Cleanup ModelTrainer code
+* Latest Container Image
+* General image builder
+* Cleanup ModelTrainer
+* Revert Image Spec
+* Support intelligent parameters
+* Add enviornment variable bootstrapping script
+* Add example notebook
+* Add unit tests for ModelTrainer
+* Image Spec refactoring and updates
+* Base model trainer
+
 ## v2.235.2 (2024-11-22)

 ## v2.235.1 (2024-11-20)

MANIFEST.in

Lines changed: 2 additions & 0 deletions
@@ -1,8 +1,10 @@
 recursive-include src/sagemaker *.py

 include src/sagemaker/image_uri_config/*.json
+include src/sagemaker/pytorch/training_recipes.json
 include src/sagemaker/serve/schema/*.json
 include src/sagemaker/serve/requirements.txt
+include src/sagemaker/modules/train/sm_recipes/training_recipes.json
 recursive-include requirements *

 include VERSION
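
The two new ``include`` lines ship the training-recipe JSON files with built distributions (the changelog entry "Update MANIFEST.in so that wheel builds correctly" refers to this). A quick post-install sanity check, as a sketch that assumes the files land at package paths mirroring the MANIFEST entries:

    from importlib.resources import files

    # Assumed install locations mirroring the MANIFEST.in entries above;
    # adjust if the installed package layout differs.
    for pkg, name in [
        ("sagemaker.pytorch", "training_recipes.json"),
        ("sagemaker.modules.train.sm_recipes", "training_recipes.json"),
    ]:
        print(pkg, files(pkg).joinpath(name).is_file())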

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.235.3.dev0
+2.237.1.dev0

doc/api/training/index.rst

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ Training APIs
 .. toctree::
     :maxdepth: 4

+    model_trainer
     algorithm
     analytics
     automl

doc/api/training/model_trainer.rst

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+ModelTrainer
+------------
+
+.. autoclass:: sagemaker.modules.train.model_trainer.ModelTrainer
+    :members:
+
+Configs
+~~~~~~~
+
+.. automodule:: sagemaker.modules.configs
+    :members:
+
+Distributed
+~~~~~~~~~~~
+
+.. automodule:: sagemaker.modules.distributed
+    :members:

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 125 additions & 7 deletions
@@ -21,12 +21,9 @@ To train a PyTorch model by using the SageMaker Python SDK:
 .. |create pytorch estimator| replace:: Create a ``sagemaker.pytorch.PyTorch`` Estimator
 .. _create pytorch estimator: #create-an-estimator

-.. |call fit| replace:: Call the estimator's ``fit`` method
-.. _call fit: #call-the-fit-method
-
-1. `Prepare a training script <#prepare-a-pytorch-training-script>`_
+1. `Prepare a training script <#prepare-a-pytorch-training-script>`_ OR `Choose an Amazon SageMaker HyperPod recipe`_
 2. |create pytorch estimator|_
-3. |call fit|_
+3. `Call the estimator's fit method or ModelTrainer's train method`_

 Prepare a PyTorch Training Script
 =================================

@@ -175,6 +172,16 @@ see `AWS Deep Learning Containers <https://github.com/aws/deep-learning-containers>`_:
 - `Images for HuggingFace <https://github.com/aws/deep-learning-containers/tree/master/huggingface>`__


+Choose an Amazon SageMaker HyperPod recipe
+==========================================
+
+Alternatively, instead of using your own training script, you can choose an
+`Amazon SageMaker HyperPod recipe <https://github.com/aws/sagemaker-hyperpod-recipes>`_ to launch training for a supported model.
+If you use a recipe, you do not need to provide your own training script; you only need to decide
+which recipe to run. You can modify a recipe as explained in the next section.
+
+
 Create an Estimator
 ===================
@@ -196,10 +203,121 @@
                            'test': 's3://my-data-bucket/path/to/my/test/data'})


+Amazon SageMaker HyperPod recipes
+---------------------------------
+Alternatively, if you are using Amazon SageMaker HyperPod recipes, follow these instructions:

+Prerequisites: you need ``git`` installed on your client to access the Amazon SageMaker HyperPod recipes code.

-Call the fit Method
-===================
+When using a recipe, you must set the ``training_recipe`` arg in place of providing a training script.
+This can be a recipe from the `Amazon SageMaker HyperPod recipes repository <https://github.com/aws/sagemaker-hyperpod-recipes>`_,
+a local file, or a custom URL. Note that you must override the following using
+``recipe_overrides``:
+
+* directory paths for the local container in the recipe, as appropriate for the Python SDK
+* the output S3 URIs
+* the Hugging Face access token
+* any other recipe fields you wish to edit
+
+The code snippet below shows an example.
+Refer to the `SageMaker docs <https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html>`_
+for more details about the expected local paths in the container, and to the Amazon SageMaker
+HyperPod recipes tutorial for more examples.
+You can override the fields either by setting ``recipe_overrides`` or by
+providing a modified ``training_recipe`` through a local file or a custom URL.
+When using a recipe, any provided ``entry_point`` is ignored.
+
+SageMaker automatically sets up the distribution args.
+It also determines the image to use for your model and device type,
+but you can override this with the ``image_uri`` arg.
+
+You can also override the number of nodes in the recipe with the ``instance_count`` arg to the estimator.
+``source_dir`` defaults to the current working directory unless specified.
+A local copy of the training scripts and the recipe is saved in the ``source_dir``.
+You can specify any additional packages to install for training in an optional ``requirements.txt`` in the ``source_dir``.
+
+Note that for Llama 3.2 multi-modal models, you need to upgrade the transformers library by providing a ``requirements.txt`` in the ``source_dir`` with ``transformers==4.45.2``.
+Refer to the Amazon SageMaker HyperPod recipes documentation for more details.
+
+
+Here is an example usage for the recipe ``hf_llama3_8b_seq8k_gpu_p5x16_pretrain``.
+
+.. code:: python
+
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+    pytorch_estimator = PyTorch(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        role=role,
+        instance_type="ml.p5.48xlarge",
+        training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        sagemaker_session=sagemaker_session,
+        tensorboard_output_config=tensorboard_output_config,
+    )
+    pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
+                           'test': 's3://my-data-bucket/path/to/my/test/data'})
+
+    # Or alternatively with ModelTrainer
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+
+    model_trainer = ModelTrainer.from_recipe(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        compute=Compute(instance_type="ml.p5.48xlarge"),
+        sagemaker_session=sagemaker_session,
+    ).with_tensorboard_output_config(
+        tensorboard_output_config=tensorboard_output_config
+    )
+
+    train_input = Input(
+        channel_name="train",
+        data_source="s3://my-data-bucket/path/to/my/training/data",
+    )
+
+    test_input = Input(
+        channel_name="test",
+        data_source="s3://my-data-bucket/path/to/my/test/data",
+    )
+
+    model_trainer.train(input_data_config=[train_input, test_input])
+
+
+Call the estimator's fit method or ModelTrainer's train method
+==============================================================

 You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional
 arguments.
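
To make the renamed section concrete, here is a minimal sketch (not part of the commit) of the two call styles side by side, reusing the objects defined in the recipe example above; the S3 paths are placeholders:

    # Estimator path: ``fit`` takes the data channels plus optional controls.
    pytorch_estimator.fit(
        {"train": "s3://my-data-bucket/path/to/my/training/data"},
        wait=True,    # block until the training job finishes
        logs="All",   # stream CloudWatch logs while waiting
    )

    # ModelTrainer path: ``train`` takes a list of input channels instead.
    model_trainer.train(input_data_config=[train_input], wait=True)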

doc/overview.rst

Lines changed: 43 additions & 2 deletions
@@ -4,6 +4,7 @@ Using the SageMaker Python SDK

 SageMaker Python SDK provides several high-level abstractions for working with Amazon SageMaker. These are:

+- **ModelTrainer**: New interface encapsulating training on SageMaker.
 - **Estimators**: Encapsulate training on SageMaker.
 - **Models**: Encapsulate built ML models.
 - **Predictors**: Provide real-time inference and transformation using Python data-types against a SageMaker endpoint.

@@ -24,8 +25,8 @@ Train a Model with the SageMaker Python SDK
 To train a model by using the SageMaker Python SDK, you:

 1. Prepare a training script
-2. Create an estimator
-3. Call the ``fit`` method of the estimator
+2. Create a ModelTrainer or Estimator
+3. Call the ``train`` method of the ModelTrainer or the ``fit`` method of the Estimator

 After you train a model, you can save it, and then serve the model as an endpoint to get real-time inferences or get inferences for an entire dataset by using batch transform.

@@ -85,6 +86,46 @@ If you want to use, for example, boolean hyperparameters, you need to specify ``
 For more on training environment variables, please visit `SageMaker Containers <https://github.com/aws/sagemaker-containers>`_.


+Using ModelTrainer
+==================
+
+To use the ModelTrainer class, you need to provide a few essential parameters, such as the training image URI and the source code configuration. The class allows you to spin up a SageMaker training job with minimal parameters, specifically the source code and training image.
+
+For more information about class definitions, see `ModelTrainer <https://sagemaker.readthedocs.io/en/stable/api/training/model_trainer.html>`_.
+
+Example: Launching a Training Job with a Custom Script
+
+.. code:: python
+
+    from sagemaker.modules.train import ModelTrainer
+    from sagemaker.modules.configs import SourceCode, InputData
+
+    # Image URI for the training job
+    pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"
+
+    # Define the script to be run
+    source_code = SourceCode(
+        source_dir="basic-script-mode",
+        requirements="requirements.txt",
+        entry_script="custom_script.py",
+    )
+
+    # Define the ModelTrainer
+    model_trainer = ModelTrainer(
+        training_image=pytorch_image,
+        source_code=source_code,
+        base_job_name="script-mode",
+    )
+
+    # Pass the input data
+    input_data = InputData(
+        channel_name="train",
+        data_source=training_input_path,  # S3 path where training data is stored
+    )
+
+    # Start the training job
+    model_trainer.train(input_data_config=[input_data], wait=False)
+
 Using Estimators
 ================
doc/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ packaging==20.9
 jinja2==3.1.4
 schema==0.7.5
 accelerate>=0.24.1,<=0.27.0
+graphene<4.0
