Commit d3cc946

Merge branch 'master' into fix-duplicate-transformer-docstring
2 parents: 8908fd4 + 2102bb7

File tree

204 files changed (+24920 additions, -3510 deletions)


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@ env/
 .python-version
 *.html
 **/_repack_script_launcher.sh
+src/sagemaker/modules/train/container_drivers/sm_train.sh
+src/sagemaker/modules/train/container_drivers/sourcecode.json
+src/sagemaker/modules/train/container_drivers/distributed.json
 tests/data/**/_repack_model.py
 tests/data/experiment/sagemaker-dev-1.0.tar.gz
 src/sagemaker/serve/tmp_workspace

.pydocstylerc

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 inherit = false
 ignore = D104,D107,D202,D203,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
 match = (?!record_pb2).*\.py
+match-dir = (?!.*test).*
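
For context, a rough sketch (not part of the commit) of how the new ``match-dir`` pattern behaves: pydocstyle matches it against each directory name it traverses, and the negative lookahead rejects any name containing ``test``, so test directories are skipped during docstring linting.

.. code:: python

    import re

    # Roughly how pydocstyle applies match-dir: the pattern is matched
    # against each directory name; (?!.*test) rejects names containing "test".
    MATCH_DIR = re.compile(r"(?!.*test).*")

    for name in ["sagemaker", "integ", "tests", "test_utils"]:
        print(name, "->", "checked" if MATCH_DIR.match(name) else "skipped")
    # sagemaker -> checked, integ -> checked, tests -> skipped, test_utils -> skipped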

CHANGELOG.md

Lines changed: 148 additions & 0 deletions
@@ -1,5 +1,153 @@
 # Changelog

+## v2.237.2 (2024-12-17)
+
+### Bug Fixes and Other Changes
+
+* update image_uri_configs 12-13-2024 17:07:12 PST
+* Cloudpickle upgrade
+
+## v2.237.1 (2024-12-12)
+
+### Bug Fixes and Other Changes
+
+* chore: remove support for ecr spec fallbacks for jumpstart models
+* Cloudpickle Revert
+* Cloudpickle update
+* Numpy update
+* Protobuf update
+* Update to fetch latest Cloudpickle version
+
+## v2.237.0 (2024-12-05)
+
+### Features
+
+* Support SageMakerTrainingPlan for training jobs
+* AMI support for BRM
+* Adding Bedrock Store model support for HubService
+
+### Bug Fixes and Other Changes
+
+* Fix unit tests
+* update boto3 and sagemaker-core version
+* fix gpu_image uri
+* Hotfix to construct rubik uri correctly
+* fix codestyles
+* fix merge artifact
+* fix merge artifact
+* fix test_requiremenets.txt
+* chore: Merge from main
+
+## v2.236.0 (2024-12-04)
+
+### Features
+
+* Partner App Auth Provider for SDK support
+* add pre-processing and post-processing logic to inference_spec
+* add utility function to capture local snapshot
+* support script mode with local train.sh
+
+### Bug Fixes and Other Changes
+
+* Add graphene to doc requirements
+* Add graphne to the doc requirements
+* Enable the Recipe tests marked with @pytest.mark.skip(reason="Hyperpod recipe code unavailable"
+* Add model trainer documentation
+* Usage docs for training recipes
+* Neuron URIs update
+* Update URIs to public for training recipes
+* Changes for SMP v2.7.0
+* Change default source directory to current, add option to specify source dir
+* Remove default values for fields in recipe_overrides and fix recipe path.
+* Update MANIFEST.in so that wheel builds correctly
+* fix the file uploading signature verification error
+* remove example notebooks artifacts
+* Morpheus tests
+* Integ tests for local mode model trainer
+* Update hyperpod recipe uris
+* Add interface units for ModelTrainer
+* Model Trainer Bucket improvements
+* Update ModelTrainer Interface Parameters
+* add in-process mode definition to docs
+* Intelligent defaults for Model Trainer
+* Fix tests and codestyle
+* add integ test for base_model_builder_deploy and remove print statement
+* Revert image builder
+* pin xgboost dlc to 1.7.1 to fix test
+* Skip JS model mapping with env vars or image URI provided
+* Use sagemaker core Session
+* Integration tests for Model Builder Handshake
+* [Updated] Add telemetry to ModelTrainer, Estimator and ModelBuilder
+* Update kandinsky in ModelTrainer and allow setting requirements
+* add modelID support to model builder InProcess model
+* Add Rich Logging to Model Builder
+* Notebooks update for Bugbash
+* Add bugbash bootstrapping
+* add inference morpheus nbs
+* Update ModelTrainer Notebooks
+* Bug fixes
+* Single container local training
+* update notebooks
+* update notebooks
+* Add recipes examples
+* Unified Deployment interface in Model Builder
+* Use exact python path in trainer template
+* Support building image from Dockerfile
+* Add Support for Training Recipes
+* Trainer handshake
+* Pass hyperparameters as CLI args
+* Add in_process mode support for DJL and TorchServe servers
+* Remove ignored files
+* Simplify Config Class Names and DistributedRunner structures
+* Fix bug in script mode setup ModelTrainer
+* Mask Sensitive Env Logs in Container
+* Add path to set Additional Settings in ModelTrainer
+* Add Distributed Training Support Model Trainer
+* Cleanup ModelTrainer code
+* Latest Container Image
+* General image builder
+* Cleanup ModelTrainer
+* Revert Image Spec
+* Support intelligent parameters
+* Add enviornment variable bootstrapping script
+* Add example notebook
+* Add unit tests for ModelTrainer
+* Image Spec refactoring and updates
+* Base model trainer
+
+## v2.235.2 (2024-11-22)
+
+## v2.235.1 (2024-11-20)
+
+### Bug Fixes and Other Changes
+
+* Update sagemaker-core dep
+* update image_uri_configs 11-20-2024 06:17:41 PST
+
+## v2.235.0 (2024-11-19)
+
+### Features
+
+* Optimize() validations across TRT, VLLM, Neuron container optimizations
+
+### Bug Fixes and Other Changes
+
+* update image_uri_configs 11-19-2024 06:17:58 PST
+
+## v2.234.0 (2024-11-19)
+
+### Features
+
+* optimization technique related validations.
+
+### Bug Fixes and Other Changes
+
+* Revert "change: add TGI 2.4.0 image uri (#4922)"
+* pin testing deps
+* add TGI 2.4.0 image uri
+* add jumpstart ap-southeast-5
+* Move sagemaker-mlflow to extras
+
 ## v2.233.0 (2024-11-04)

 ### Features

MANIFEST.in

Lines changed: 2 additions & 0 deletions
@@ -1,8 +1,10 @@
 recursive-include src/sagemaker *.py

 include src/sagemaker/image_uri_config/*.json
+include src/sagemaker/pytorch/training_recipes.json
 include src/sagemaker/serve/schema/*.json
 include src/sagemaker/serve/requirements.txt
+include src/sagemaker/modules/train/sm_recipes/training_recipes.json
 recursive-include requirements *

 include VERSION
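
A quick way to sanity-check these MANIFEST.in additions (a sketch, assuming a built and installed wheel on Python 3.9+) is to confirm the recipe JSON files are reachable as package resources:

.. code:: python

    # Sketch: confirm the recipe JSON files declared above actually ship
    # inside an installed sagemaker wheel.
    from importlib.resources import files

    for rel in (
        "pytorch/training_recipes.json",
        "modules/train/sm_recipes/training_recipes.json",
    ):
        print(rel, "packaged:", files("sagemaker").joinpath(rel).is_file())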

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.233.1.dev0
+2.237.3.dev0

doc/api/training/index.rst

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ Training APIs
 .. toctree::
     :maxdepth: 4

+    model_trainer
     algorithm
     analytics
     automl

doc/api/training/model_trainer.rst

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+ModelTrainer
+------------
+
+.. autoclass:: sagemaker.modules.train.model_trainer.ModelTrainer
+    :members:
+
+Configs
+~~~~~~~
+
+.. automodule:: sagemaker.modules.configs
+    :members:
+
+Distributed
+~~~~~~~~~~~
+
+.. automodule:: sagemaker.modules.distributed
+    :members:
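
For orientation, here is a hypothetical minimal use of the class documented above. The parameter names (``training_image``, ``source_code``, ``compute``) and the ``SourceCode`` config are assumptions inferred from the docs in this commit and may differ from the released interface:

.. code:: python

    # Hypothetical sketch of a minimal ModelTrainer run; names inferred from
    # this commit's docs, not a definitive API reference.
    from sagemaker.modules.train import ModelTrainer
    from sagemaker.modules.configs import SourceCode, Compute

    model_trainer = ModelTrainer(
        training_image="<training-image-uri>",  # assumption: any training DLC URI
        source_code=SourceCode(source_dir=".", entry_script="train.py"),
        compute=Compute(instance_type="ml.m5.xlarge"),
    )
    model_trainer.train()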

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 125 additions & 7 deletions
@@ -21,12 +21,9 @@ To train a PyTorch model by using the SageMaker Python SDK:
 .. |create pytorch estimator| replace:: Create a ``sagemaker.pytorch.PyTorch`` Estimator
 .. _create pytorch estimator: #create-an-estimator

-.. |call fit| replace:: Call the estimator's ``fit`` method
-.. _call fit: #call-the-fit-method
-
-1. `Prepare a training script <#prepare-a-pytorch-training-script>`_
+1. `Prepare a training script <#prepare-a-pytorch-training-script>`_ OR `Choose an Amazon SageMaker HyperPod recipe`_
 2. |create pytorch estimator|_
-3. |call fit|_
+3. `Call the estimator's fit method or ModelTrainer's train method`_

 Prepare a PyTorch Training Script
 =================================
@@ -175,6 +172,16 @@ see `AWS Deep Learning Containers <https://github.com/aws/deep-learning-containe
 - `Images for HuggingFace <https://github.com/aws/deep-learning-containers/tree/master/huggingface>`__


+Choose an Amazon SageMaker HyperPod recipe
+==========================================
+
+Alternatively, instead of using your own training script, you can choose an
+`Amazon SageMaker HyperPod recipe <https://github.com/aws/sagemaker-hyperpod-recipes>`_ to launch training for a supported model.
+If you use a recipe, you do not need to provide your own training script; you only need to decide
+which recipe to run. You can modify a recipe as explained in the next section.
+
+
+
 Create an Estimator
 ===================

@@ -196,10 +203,121 @@ directories ('train' and 'test').
                            'test': 's3://my-data-bucket/path/to/my/test/data'})


+Amazon SageMaker HyperPod recipes
+---------------------------------
+Alternatively, if you are using Amazon SageMaker HyperPod recipes, follow these instructions:

+Prerequisites: you need ``git`` installed on your client to access the Amazon SageMaker HyperPod recipes code.

-Call the fit Method
-===================
+When using a recipe, you must set the ``training_recipe`` argument in place of providing a training script.
+This can be a recipe from the `Amazon SageMaker HyperPod recipes repository <https://github.com/aws/sagemaker-hyperpod-recipes>`_,
+a local file, or a custom URL. Please note that you must override the following using
+``recipe_overrides``:
+
+* the directory paths for the local container in the recipe, as appropriate for the Python SDK
+* the output S3 URIs
+* the Hugging Face access token
+* any other recipe fields you wish to edit
+
+The code snippet below shows an example.
+Please refer to the `SageMaker docs <https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html>`_
+for more details about the expected local paths in the container, and to the Amazon SageMaker
+HyperPod recipes tutorial for more examples.
+You can override the fields either by setting ``recipe_overrides`` or by
+providing a modified ``training_recipe`` through a local file or a custom URL.
+When using a recipe, any provided ``entry_point`` is ignored.
+
+SageMaker automatically sets up the distribution arguments.
+It also determines the image to use for your model and device type,
+but you can override this with the ``image_uri`` argument.
+
+You can also override the number of nodes in the recipe with the estimator's ``instance_count`` argument.
+``source_dir`` defaults to the current working directory unless specified.
+A local copy of the training scripts and the recipe is saved in the ``source_dir``.
+You can specify any additional packages to install for training in an optional ``requirements.txt`` in the ``source_dir``.
+
+Note: for Llama 3.2 multi-modal models, you need to upgrade the ``transformers`` library by providing a ``requirements.txt`` in the ``source_dir`` that pins ``transformers==4.45.2``.
+Please refer to the Amazon SageMaker HyperPod recipes documentation for more details.
+
+
+Here is an example usage for the recipe ``hf_llama3_8b_seq8k_gpu_p5x16_pretrain``.
+
+
+.. code:: python
+
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+    pytorch_estimator = PyTorch(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        role=role,
+        instance_type="ml.p5.48xlarge",
+        training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        sagemaker_session=sagemaker_session,
+        tensorboard_output_config=tensorboard_output_config,
+    )
+    pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
+                           'test': 's3://my-data-bucket/path/to/my/test/data'})
+
+    # Or alternatively with ModelTrainer
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+
+    model_trainer = ModelTrainer.from_recipe(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        compute=Compute(instance_type="ml.p5.48xlarge"),
+        sagemaker_session=sagemaker_session,
+    ).with_tensorboard_output_config(
+        tensorboard_output_config=tensorboard_output_config
+    )
+
+    train_input = Input(
+        channel_name="train",
+        data_source="s3://my-data-bucket/path/to/my/training/data",
+    )
+
+    test_input = Input(
+        channel_name="test",
+        data_source="s3://my-data-bucket/path/to/my/test/data",
+    )
+
+    model_trainer.train(input_data_config=[train_input, test_input])
+
+
+Call the estimator's fit method or ModelTrainer's train method
+==============================================================

 You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional
 arguments.
