[PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release by bhanutejagk · Pull Request #5644 · aws/deep-learning-containers

bhanutejagk · 2026-02-09T23:35:21Z

Add buildspecs for EC2 and SageMaker
Add CPU and GPU Dockerfiles
Add EC2 test file for PyTorch 2.10
Update conftest.py with pytorch_training___2__10 fixture
Update SageMaker conftest.py skip_smppy_test for 2.10

GitHub Issue #, if available:

Note:

If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
All PRs are checked weekly for staleness. This PR will be closed if not updated in 30 days.

Description

Tests Run

75e2e56 - passed all tests

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

Using dlc_developer_config.toml
Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)

How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
sagemaker_local_tests = true

How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:

# /buildspec <buildspec_path>
- e.g.: # /buildspec pytorch/training/buildspec.yml
- If this line is commented out, dlc_developer_config.toml will be used.
# /tests <test_list>
- e.g.: # /tests sanity security ec2
- If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.

# /buildspec <buildspec_path>
# /tests <test_list>

Formatting

I have run black -l 100 on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)

PR Checklist

I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
If the PR change affects SM tests, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non backward-compatible changes need special approval.)
(If applicable) I've documented below the DLC image/dockerfile this relates to
(If applicable) I've documented below the tests I've run on the DLC image
(If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
(If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

(If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
(If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
(If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
(If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Split pip install into separate commands to prevent dependency resolver from downgrading torch 2.10.0 to 2.9.1
- Add torch version constraint when installing fastai/accelerate/spacy
- Increase CPU image_size_baseline from 7200 to 12000 in buildspec files

- Update Dockerfiles to use sagemaker>=3.0.0
- Rewrite __init__.py with v3 utilities (ModelTrainer, SourceCode, Compute, InputData)
- Convert all active SageMaker training tests to v3 API:
  - Use ModelTrainer instead of PyTorch Estimator
  - Use Torchrun() and SMDataParallel() for distributed training
  - Use SourceCode, Compute, InputData configs
- Convert local training tests to v3 API with Mode.LOCAL_CONTAINER
- Preserve skipped tests' v2 code as comments for reference
- Add China region skip for tests that previously used _disable_sm_profiler (ModelTrainer doesn't support disable_profiler parameter)

SM SDK v3 moved UnexpectedStatusException from sagemaker.exceptions to sagemaker.core.exceptions. Use try/except to import from the correct location based on the installed SDK version.

Files fixed:
- test/sagemaker_tests/__init__.py
- test/sagemaker_tests/pytorch/__init__.py
- test/sagemaker_tests/pytorch/training/integration/sagemaker/__init__.py
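
The import fallback described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `import_first` is a hypothetical helper name, and the two module paths are taken from the commit message rather than verified against the SDK.

```python
import importlib


def import_first(attr, module_names):
    """Return `attr` from the first importable module that defines it, else None.

    `import_first` is a hypothetical helper used only for illustration.
    """
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # This module path does not exist in the installed version; try the next.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


# Module paths as stated in the commit message: v3 location first, then v2.
UnexpectedStatusException = import_first(
    "UnexpectedStatusException",
    ["sagemaker.core.exceptions", "sagemaker.exceptions"],
)
```

When neither path is importable, the name is bound to None, so callers would still need a guard or a placeholder class before using it in an `except` clause.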

Jyothirmaikottu · 2026-03-12T00:59:03Z

test/vllm/sagemaker/test_sm_endpoint.py

We are using these APIs in the script so removing them might cause issues. Also, vllm tests are migrated to V2, check recent currency work - #5716

So we are no longer using the V1 tests for vllm at all?

Jyothirmaikottu · 2026-03-12T01:01:30Z

test/requirements.txt

 pytest-json-report
 pytest-xdist
-sagemaker>=2,<3
-sagemaker-experiments


This file is used by other frameworks to run SM local tests. The prior containers are still at SM<3, so updating it here might cause issues with those tests. I believe we are migrating to v3 only for the 2.10 currency release?

Jyothirmaikottu · 2026-03-12T01:02:05Z

test/sagemaker_tests/pytorch/training/requirements.txt

Same here, you can try rerunning tests for prior versions to test this change.

Thanks, will test the changes against prior versions once all changes are confirmed.

jinyan-li1

PR desc only includes the EC2 image test run 75e2e56 - could you please add the link to SM image test run as well?

jinyan-li1 · 2026-03-12T02:47:41Z

test/dlc_tests/benchmark/sagemaker/tensorflow/training/test_trcomp_performance.py

+try:
+    from sagemaker.tensorflow import TensorFlow
+except ImportError:
+    TensorFlow = None


nit: setting these to None on import failure helps avoid collection errors, but the actual tests still call them. Under sagemaker>=3 they might hit type errors. We can add skipif guards to the tests that call them, such as TestImageClassification in test_trcomp_performance, something like:
@pytest.mark.skipif(TensorFlow is None, reason="requires sagemaker<3")
Same for PyTorch in test_performance_inductor.
Both tests are skipped right now, though, so it's up to you whether this is necessary.

jinyan-li1 · 2026-03-12T02:55:40Z

pytorch/training/docker/2.10/py3/Dockerfile.cpu

+########################################################
+
+FROM common AS ec2
+


redeclare ARG PYTHON here to be consistent with GPU image

I checked the prior versions: it is included up to 2.7 but not present in 2.8 and 2.9, and I wonder why. I think we should include it for clean maintenance of the code.

jinyan-li1 · 2026-03-12T02:56:08Z

pytorch/training/docker/2.10/py3/Dockerfile.cpu

+#################################################################
+
+FROM common AS sagemaker
+


same as EC2, redeclare ARG PYTHON here

jinyan-li1 · 2026-03-12T02:58:44Z

pytorch/training/docker/2.10/py3/cu130/Dockerfile.gpu

+    spacy \
+    thinc \
+    blis \
+    numpy \


Can we keep only one of the numpy installations? See line 96.

Yeah, previously fastai downgraded numpy to a compatible version, so reinstalling numpy was required. Now that fastai is removed, reinstalling it is not needed. spacy, thinc and blis don't affect the numpy version. Will remove that. Thanks.

jinyan-li1 · 2026-03-12T02:59:14Z

pytorch/training/docker/2.10/py3/Dockerfile.cpu

+    spacy \
+    thinc \
+    blis \
+    numpy \


same issue with numpy being installed twice

jinyan-li1 · 2026-03-12T03:17:00Z

test/sagemaker_tests/pytorch/training/integration/local/test_single_machine_training.py

@@ -61,22 +94,22 @@ def test_fastai_mnist(docker_image, instance_type, py_version, sagemaker_local_s
        pytest.skip("Fast ai is not supported on PyTorch v1.9.x, v1.10.x, v1.11.x, v1.12.x")
    if Version(image_framework_version) in SpecifierSet("~=2.6.0"):


Let's add a skip for >=2.10 or ~=2.10 since we removed fastai.

jinyan-li1

fastai has a new release that seems to support PyTorch 2.10; I think we'd want to try adding it back: https://github.com/fastai/fastai/releases/tag/2.8.7

- test_utility_installation.py: Use double quotes in version_cmd so they survive the python -c '...' wrapping by run_cmd_on_container
- test_pre_release.py: Relax mlflow pandas regex to match with or without lower bound (pandas<3,>=X vs pandas<3)
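
The relaxed mlflow/pandas pattern mentioned in this commit message could look something like the sketch below. The regex, the function name, and the sample messages are assumptions made for illustration; the actual pattern in test_pre_release.py and the exact pip check wording are not shown on this page.

```python
import re

# Hypothetical pattern: "pandas<3", optionally followed by a ",>=X" lower bound.
# The (?!\d) lookahead keeps "pandas<30" from matching as "pandas<3".
PANDAS_REQ = re.compile(r"pandas<3(?!\d)(?:,>=(?P<lower>\d+(?:\.\d+)*))?")


def match_pandas_requirement(message):
    """Return a match object if `message` names either form of the requirement."""
    return PANDAS_REQ.search(message)


# Made-up sample messages covering both forms of the requirement string.
without_bound = match_pandas_requirement("mlflow has requirement pandas<3, but ...")
with_bound = match_pandas_requirement("mlflow has requirement pandas<3,>=1.5.3, but ...")
```

Making the lower bound an optional group lets one pattern accept both requirement strings, and the named group exposes the bound when it is present.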

- Replace sagemaker.modules.* imports with sagemaker.train.* (v3 GA path)
- Remove all try/except ImportError v2 fallbacks from test files
- Tighten boto3/botocore bounds to >=1.42.0 to fix resolution-too-deep
- Bump awscli to >=1.38.0 (compatible with sagemaker-core requirements)

- Add mlflow CVEs (71577-71693) and skops CVE (71782) to SM allowlists
- Preserve existing protobuf 85151 entry in both CPU and GPU allowlists
- Fix sagemaker_v3/requirements.txt: remove botocore/awscli pins that caused ResolutionImpossible, simplify to boto3>=1.35.0,<2.0
- Set do_build=true to bake allowlists into fresh image

- conftest.py: guard sagemaker.pytorch.PyTorch import with try/except
- sagemaker/__init__.py: guard sagemaker.pytorch and sagemaker.exceptions imports
- pytorch/__init__.py: guard sagemaker.exceptions import
- sagemaker_v3/timeout.py: standalone implementation, no v2 dependency

In SM SDK v3, the following are removed from the top-level namespace:
- sagemaker.LocalSession
- sagemaker.Session
- sagemaker.utils
- sagemaker.pytorch.PyTorch
- sagemaker.exceptions

Guard these imports with try/except in all shared files that are loaded by pytest when collecting v3 tests:
- conftest.py: LocalSession, Session -> None with pytest.skip in fixtures
- sagemaker/__init__.py: utils -> None, exceptions -> placeholder class
- pytorch/__init__.py: exceptions -> placeholder class
- sagemaker_tests/__init__.py: Session -> v3 path, exceptions -> placeholder

Verified locally: pytest --collect-only on sagemaker_v3/ collects 59 tests with zero import errors using sagemaker==3.5.0.
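
The two guard styles named in this commit message (a None sentinel and a placeholder class) might be sketched as below. This is an illustration under stated assumptions, not the PR's code: the import paths are the ones the commit message lists as removed in v3, and `require` is a hypothetical helper standing in for the `pytest.skip` call that the commit message says the fixtures use.

```python
try:
    # Listed in the commit message as removed from the top level in SM SDK v3.
    from sagemaker.exceptions import UnexpectedStatusException
except ImportError:
    class UnexpectedStatusException(Exception):
        """Placeholder so `except UnexpectedStatusException` still resolves."""

try:
    from sagemaker import LocalSession
except ImportError:
    # None sentinel: callers must check it before use.
    LocalSession = None


def require(obj, name):
    """Hypothetical helper: fail loudly if a guarded import is unavailable.

    In a pytest fixture the same check would call pytest.skip instead.
    """
    if obj is None:
        raise RuntimeError(f"{name} is unavailable in the installed sagemaker SDK")
    return obj
```

The placeholder class suits names that appear in `except` clauses, where None would raise a TypeError at handling time; the None sentinel suits objects that are only constructed inside tests or fixtures that can check for it first.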

Add PyTorch 2.10 Training DLC with CUDA 13.0 and Python 3.13

8dbd698

- Add buildspecs for EC2 and SageMaker
- Add CPU and GPU Dockerfiles
- Add EC2 test file for PyTorch 2.10
- Update conftest.py with pytorch_training___2__10 fixture
- Update SageMaker conftest.py skip_smppy_test for 2.10

bhanutejagk requested a review from a team as a code owner February 9, 2026 23:35

aws-deep-learning-containers-ci bot added the authorized, build (Reflects file change in build folder), ec2, pytorch (Reflects file change in pytorch folder), sagemaker_tests, Size:XL (Determines the size of the PR), and test (Reflects file change in test folder) labels Feb 9, 2026

Bhanu Teja Goshikonda added 3 commits February 9, 2026 15:58

Configure build for PyTorch 2.10 training EC2 images

ca9fd66

fix: add setuptools for pkg_resources in Python 3.13 (OSS compliance)

9eaae59

Merge remote-tracking branch 'upstream/master' into pytorch-2.10-currency

f333b46

bhanutejagk force-pushed the pytorch-2.10-currency branch from f3a0b07 to f333b46 Compare February 10, 2026 18:31

Bhanu Teja Goshikonda added 5 commits February 10, 2026 11:57

fix: move setuptools install to EC2/SageMaker stages for pkg_resources

a67b5b1

fix: pin setuptools to 81.0.0 for pkg_resources compatibility

4e2c069

fix: pin setuptools to 80.10.1 (pkg_resources removed in 81+)

2d9b14e

fix: pin setuptools to 81.0.0 and remove redundant installs

20a7fe6

bhanutejagk force-pushed the pytorch-2.10-currency branch from 7c26f36 to 73dfc44 Compare February 11, 2026 16:23

bhanutejagk and others added 2 commits February 11, 2026 08:23

Merge branch 'master' into pytorch-2.10-currency

3b9e8b3

Fix torch 2.10 version pinning and remove setuptools pin

a94e483

bhanutejagk force-pushed the pytorch-2.10-currency branch from dad8bde to a94e483 Compare February 11, 2026 22:41

Bhanu Teja Goshikonda added 3 commits February 11, 2026 16:26

Revert pytorch install changes to match 2.9 style

0bffd28

Set build_inference to false

89aeed1

Remove fastai - requires torch<2.10, not compatible with PyTorch 2.10

5df3ed9

bhanutejagk requested review from a team as code owners February 18, 2026 02:15

Bhanu Teja Goshikonda added 2 commits February 17, 2026 19:22

Merge upstream/master into pytorch-2.10-currency

80b9b3f

Bhanu Teja Goshikonda added 4 commits March 11, 2026 13:12

Merge remote-tracking branch 'upstream/master' into pytorch-2.10-currency

3c251e7

Disable sagemaker_benchmark_tests - all PT benchmarks are skipped and TF benchmarks use v2 SDK imports

75e2e56

Revert dlc_developer_config.toml to defaults

3af55f1

Wrap SM SDK v2 imports in try/except to prevent pytest collection failure under sagemaker>=3

029617b

aws-deep-learning-containers-ci bot added the benchmark label Mar 11, 2026

Bhanu Teja Goshikonda added 2 commits March 11, 2026 15:23

Also wrap sagemaker.utils import in try/except for SM SDK v3 compatibility

95fcf60

Jyothirmaikottu reviewed Mar 12, 2026

jinyan-li1 reviewed Mar 12, 2026

Bhanu Teja Goshikonda added 2 commits March 12, 2026 14:05

Add SM SDK v3 test files for PyTorch 2.10, route v3 tests in sagemaker.py, configure EC2 buildspec

86bc532

Add fastai back to 2.10 Dockerfiles (fastai 2.8.7 supports torch<3), rebuild image

36574d7

bhanutejagk changed the title ~~Add PyTorch 2.10 Training DLC with CUDA 13.0 and Python 3.13~~ [PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release Mar 13, 2026

Bhanu Teja Goshikonda added 2 commits March 12, 2026 17:18

Switch to SM buildspec with do_build=true, enable all SM tests (efa, rc, benchmark)

6c1fc9a

Fix sanity test failures for SM SDK v3: sagemaker version check, remote_function skip, pip_check mlflow/pandas exception, do_build=false

79ad620

aws-deep-learning-containers-ci bot added the sanity label Mar 13, 2026

Bhanu Teja Goshikonda added 6 commits March 13, 2026 05:35

Fix v3 requirements: boto3>=1.42.2, mock>=4.0; run SM tests only

4cb3681

- boto3>=1.42.2 matches sagemaker-core>=2.1.0 requirement
- mock>=4.0 overrides shared mock==2.0.0 pin (sagemaker-core needs >4.0)
- Disable all tests except SM remote + SM EFA
- do_build=false (image already built)

3 participants