Skip to content

[PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release#5644

Open
bhanutejagk wants to merge 38 commits intoaws:masterfrom
bhanutejagk:pytorch-2.10-currency
Open

[PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release#5644
bhanutejagk wants to merge 38 commits intoaws:masterfrom
bhanutejagk:pytorch-2.10-currency

Conversation

@bhanutejagk
Copy link
Contributor

@bhanutejagk bhanutejagk commented Feb 9, 2026

  • Add buildspecs for EC2 and SageMaker
  • Add CPU and GPU Dockerfiles
  • Add EC2 test file for PyTorch 2.10
  • Update conftest.py with pytorch_training___2__10 fixture
  • Update SageMaker conftest.py skip_smppy_test for 2.10

GitHub Issue #, if available:

Note:

  • If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.

  • All PR's are checked weekly for staleness. This PR will be closed if not updated in 30 days.

Description

Tests Run

75e2e56 - passed all tests

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
How to use PR description Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>

Formatting

PR Checklist

Expand
  • I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • If the PR changes affects SM test, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
  • If this PR changes existing code, the change fully backward compatible with pre-existing code. (Non backward-compatible changes need special approval.)
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

Expand
  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Add buildspecs for EC2 and SageMaker
- Add CPU and GPU Dockerfiles
- Add EC2 test file for PyTorch 2.10
- Update conftest.py with pytorch_training___2__10 fixture
- Update SageMaker conftest.py skip_smppy_test for 2.10
@bhanutejagk bhanutejagk requested a review from a team as a code owner February 9, 2026 23:35
@aws-deep-learning-containers-ci aws-deep-learning-containers-ci bot added authorized build Reflects file change in build folder ec2 pytorch Reflects file change in pytorch folder sagemaker_tests Size:XL Determines the size of the PR test Reflects file change in test folder labels Feb 9, 2026
@bhanutejagk bhanutejagk force-pushed the pytorch-2.10-currency branch from f3a0b07 to f333b46 Compare February 10, 2026 18:31
Bhanu Teja Goshikonda added 5 commits February 10, 2026 11:57
- Split pip install into separate commands to prevent dependency resolver
  from downgrading torch 2.10.0 to 2.9.1
- Add torch version constraint when installing fastai/accelerate/spacy
- Increase CPU image_size_baseline from 7200 to 12000 in buildspec files
@bhanutejagk bhanutejagk force-pushed the pytorch-2.10-currency branch from 7c26f36 to 73dfc44 Compare February 11, 2026 16:23
@bhanutejagk bhanutejagk force-pushed the pytorch-2.10-currency branch from dad8bde to a94e483 Compare February 11, 2026 22:41
@bhanutejagk bhanutejagk requested review from a team as code owners February 18, 2026 02:15
Bhanu Teja Goshikonda added 2 commits February 17, 2026 19:22
- Update Dockerfiles to use sagemaker>=3.0.0
- Rewrite __init__.py with v3 utilities (ModelTrainer, SourceCode, Compute, InputData)
- Convert all active SageMaker training tests to v3 API:
  - Use ModelTrainer instead of PyTorch Estimator
  - Use Torchrun() and SMDataParallel() for distributed training
  - Use SourceCode, Compute, InputData configs
- Convert local training tests to v3 API with Mode.LOCAL_CONTAINER
- Preserve skipped tests' v2 code as comments for reference
- Add China region skip for tests that previously used _disable_sm_profiler
  (ModelTrainer doesn't support disable_profiler parameter)
Bhanu Teja Goshikonda added 2 commits March 11, 2026 15:23
SM SDK v3 moved UnexpectedStatusException from sagemaker.exceptions to
sagemaker.core.exceptions. Use try/except to import from the correct
location based on the installed SDK version.

Files fixed:
- test/sagemaker_tests/__init__.py
- test/sagemaker_tests/pytorch/__init__.py
- test/sagemaker_tests/pytorch/training/integration/sagemaker/__init__.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using these APIs in the script so removing them might cause issues. Also, vllm tests are migrated to V2, check recent currency work - #5716

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are now not at all using tests in V1 for vlllm?

pytest-json-report
pytest-xdist
sagemaker>=2,<3
sagemaker-experiments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is used by other frameworks to run SM local tests, the prior containers are still at SM<3 so updating here might cause issues with those tests. I believe we are migrating to v3 only for 2.10 Currency ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, you can try rerunning tests for prior versions to test this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, wlll test changes for prior versions after all changes were confirmed

Copy link
Contributor

@jinyan-li1 jinyan-li1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR desc only includes the EC2 image test run 75e2e56 - could you please add the link to SM image test run as well?

try:
from sagemaker.tensorflow import TensorFlow
except ImportError:
TensorFlow = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: setting these to None on import failure helps avoiding collection errors, but the actual tests still call these. If sagemaker>=3, they might hit type errors. We can add some skipif guards to tests that call those like TestImageClassification in test_trcomp_performance - something like:
@pytest.mark.skipif(TensorFlow is None, reason="requires sagemaker<3")
same for PyTorch in test_performance_inductor
but both tests are skipped right now so up to you if you think this is necessary

########################################################

FROM common AS ec2

Copy link
Contributor

@jinyan-li1 jinyan-li1 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redeclare ARG PYTHON here to be consistent with GPU image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the prior versions, it is included until 2.7 but not present in 2.8 and 2.9 I wonder why. I think we should include for clean maintainance of code.

#################################################################

FROM common AS sagemaker

Copy link
Contributor

@jinyan-li1 jinyan-li1 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as EC2, redeclare ARG PYTHON here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

spacy \
thinc \
blis \
numpy \
Copy link
Contributor

@jinyan-li1 jinyan-li1 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep only one of numpy installations, see line 96

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah previously fastai downgraded numpy version to a compatible version so reinstalling numpy was required. but now since fastai is removed reinstalling it is not needed. spacy, thinc and blis doesnt effect numpy version. Will remove that. Thanks.

spacy \
thinc \
blis \
numpy \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue with numpy being installed twice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -61,22 +94,22 @@ def test_fastai_mnist(docker_image, instance_type, py_version, sagemaker_local_s
pytest.skip("Fast ai is not supported on PyTorch v1.9.x, v1.10.x, v1.11.x, v1.12.x")
if Version(image_framework_version) in SpecifierSet("~=2.6.0"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a skip for >=2.10 or ~=2.10 since we removed fastai

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

Copy link
Contributor

@jinyan-li1 jinyan-li1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fastai has a new release that seems to support pytorch 2.10, i think we'd want to try to add it back https://github.com/fastai/fastai/releases/tag/2.8.7

@bhanutejagk bhanutejagk changed the title Add PyTorch 2.10 Training DLC with CUDA 13.0 and Python 3.13 [PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release Mar 13, 2026
Bhanu Teja Goshikonda added 2 commits March 12, 2026 17:18
Bhanu Teja Goshikonda added 6 commits March 13, 2026 05:35
- test_utility_installation.py: Use double quotes in version_cmd so
  they survive the python -c '...' wrapping by run_cmd_on_container
- test_pre_release.py: Relax mlflow pandas regex to match with or
  without lower bound (pandas<3,>=X vs pandas<3)
- Replace sagemaker.modules.* imports with sagemaker.train.* (v3 GA path)
- Remove all try/except ImportError v2 fallbacks from test files
- Tighten boto3/botocore bounds to >=1.42.0 to fix resolution-too-deep
- Bump awscli to >=1.38.0 (compatible with sagemaker-core requirements)
- Add mlflow CVEs (71577-71693) and skops CVE (71782) to SM allowlists
- Preserve existing protobuf 85151 entry in both CPU and GPU allowlists
- Fix sagemaker_v3/requirements.txt: remove botocore/awscli pins that caused
  ResolutionImpossible, simplify to boto3>=1.35.0,<2.0
- Set do_build=true to bake allowlists into fresh image
- boto3>=1.42.2 matches sagemaker-core>=2.1.0 requirement
- mock>=4.0 overrides shared mock==2.0.0 pin (sagemaker-core needs >4.0)
- Disable all tests except SM remote + SM EFA
- do_build=false (image already built)
- conftest.py: guard sagemaker.pytorch.PyTorch import with try/except
- sagemaker/__init__.py: guard sagemaker.pytorch and sagemaker.exceptions imports
- pytorch/__init__.py: guard sagemaker.exceptions import
- sagemaker_v3/timeout.py: standalone implementation, no v2 dependency
In SM SDK v3, the following are removed from the top-level namespace:
- sagemaker.LocalSession
- sagemaker.Session
- sagemaker.utils
- sagemaker.pytorch.PyTorch
- sagemaker.exceptions

Guard these imports with try/except in all shared files that are loaded
by pytest when collecting v3 tests:
- conftest.py: LocalSession, Session -> None with pytest.skip in fixtures
- sagemaker/__init__.py: utils -> None, exceptions -> placeholder class
- pytorch/__init__.py: exceptions -> placeholder class
- sagemaker_tests/__init__.py: Session -> v3 path, exceptions -> placeholder

Verified locally: pytest --collect-only on sagemaker_v3/ collects 59 tests
with zero import errors using sagemaker==3.5.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

authorized benchmark build Reflects file change in build folder ec2 pytorch Reflects file change in pytorch folder sagemaker_tests sanity Size:XL Determines the size of the PR test Reflects file change in test folder

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants