[PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release#5644
[PyTorch][Training][EC2][SageMaker]PyTorch 2.10 Currency Release#5644bhanutejagk wants to merge 38 commits intoaws:masterfrom
Conversation
- Add buildspecs for EC2 and SageMaker - Add CPU and GPU Dockerfiles - Add EC2 test file for PyTorch 2.10 - Update conftest.py with pytorch_training___2__10 fixture - Update SageMaker conftest.py skip_smppy_test for 2.10
f3a0b07 to
f333b46
Compare
- Split pip install into separate commands to prevent dependency resolver from downgrading torch 2.10.0 to 2.9.1 - Add torch version constraint when installing fastai/accelerate/spacy - Increase CPU image_size_baseline from 7200 to 12000 in buildspec files
7c26f36 to
73dfc44
Compare
dad8bde to
a94e483
Compare
- Update Dockerfiles to use sagemaker>=3.0.0 - Rewrite __init__.py with v3 utilities (ModelTrainer, SourceCode, Compute, InputData) - Convert all active SageMaker training tests to v3 API: - Use ModelTrainer instead of PyTorch Estimator - Use Torchrun() and SMDataParallel() for distributed training - Use SourceCode, Compute, InputData configs - Convert local training tests to v3 API with Mode.LOCAL_CONTAINER - Preserve skipped tests' v2 code as comments for reference - Add China region skip for tests that previously used _disable_sm_profiler (ModelTrainer doesn't support disable_profiler parameter)
… TF benchmarks use v2 SDK imports
…lure under sagemaker>=3
SM SDK v3 moved UnexpectedStatusException from sagemaker.exceptions to sagemaker.core.exceptions. Use try/except to import from the correct location based on the installed SDK version. Files fixed: - test/sagemaker_tests/__init__.py - test/sagemaker_tests/pytorch/__init__.py - test/sagemaker_tests/pytorch/training/integration/sagemaker/__init__.py
There was a problem hiding this comment.
We are using these APIs in the script so removing them might cause issues. Also, vllm tests are migrated to V2, check recent currency work - #5716
There was a problem hiding this comment.
So we are now not at all using tests in V1 for vlllm?
| pytest-json-report | ||
| pytest-xdist | ||
| sagemaker>=2,<3 | ||
| sagemaker-experiments |
There was a problem hiding this comment.
This file is used by other frameworks to run SM local tests, the prior containers are still at SM<3 so updating here might cause issues with those tests. I believe we are migrating to v3 only for 2.10 Currency ?
There was a problem hiding this comment.
Same here, you can try rerunning tests for prior versions to test this change.
There was a problem hiding this comment.
Thanks, wlll test changes for prior versions after all changes were confirmed
jinyan-li1
left a comment
There was a problem hiding this comment.
PR desc only includes the EC2 image test run 75e2e56 - could you please add the link to SM image test run as well?
| try: | ||
| from sagemaker.tensorflow import TensorFlow | ||
| except ImportError: | ||
| TensorFlow = None |
There was a problem hiding this comment.
nit: setting these to None on import failure helps avoiding collection errors, but the actual tests still call these. If sagemaker>=3, they might hit type errors. We can add some skipif guards to tests that call those like TestImageClassification in test_trcomp_performance - something like:
@pytest.mark.skipif(TensorFlow is None, reason="requires sagemaker<3")
same for PyTorch in test_performance_inductor
but both tests are skipped right now so up to you if you think this is necessary
| ######################################################## | ||
|
|
||
| FROM common AS ec2 | ||
|
|
There was a problem hiding this comment.
redeclare ARG PYTHON here to be consistent with GPU image
There was a problem hiding this comment.
I saw the prior versions, it is included until 2.7 but not present in 2.8 and 2.9 I wonder why. I think we should include for clean maintainance of code.
| ################################################################# | ||
|
|
||
| FROM common AS sagemaker | ||
|
|
There was a problem hiding this comment.
same as EC2, redeclare ARG PYTHON here
| spacy \ | ||
| thinc \ | ||
| blis \ | ||
| numpy \ |
There was a problem hiding this comment.
can we keep only one of numpy installations, see line 96
There was a problem hiding this comment.
Yeah previously fastai downgraded numpy version to a compatible version so reinstalling numpy was required. but now since fastai is removed reinstalling it is not needed. spacy, thinc and blis doesnt effect numpy version. Will remove that. Thanks.
| spacy \ | ||
| thinc \ | ||
| blis \ | ||
| numpy \ |
There was a problem hiding this comment.
same issue with numpy being installed twice
| @@ -61,22 +94,22 @@ def test_fastai_mnist(docker_image, instance_type, py_version, sagemaker_local_s | |||
| pytest.skip("Fast ai is not supported on PyTorch v1.9.x, v1.10.x, v1.11.x, v1.12.x") | |||
| if Version(image_framework_version) in SpecifierSet("~=2.6.0"): | |||
There was a problem hiding this comment.
lets add a skip for >=2.10 or ~=2.10 since we removed fastai
jinyan-li1
left a comment
There was a problem hiding this comment.
fastai has a new release that seems to support pytorch 2.10, i think we'd want to try to add it back https://github.com/fastai/fastai/releases/tag/2.8.7
…r.py, configure EC2 buildspec
…te_function skip, pip_check mlflow/pandas exception, do_build=false
- test_utility_installation.py: Use double quotes in version_cmd so they survive the python -c '...' wrapping by run_cmd_on_container - test_pre_release.py: Relax mlflow pandas regex to match with or without lower bound (pandas<3,>=X vs pandas<3)
- Replace sagemaker.modules.* imports with sagemaker.train.* (v3 GA path) - Remove all try/except ImportError v2 fallbacks from test files - Tighten boto3/botocore bounds to >=1.42.0 to fix resolution-too-deep - Bump awscli to >=1.38.0 (compatible with sagemaker-core requirements)
- Add mlflow CVEs (71577-71693) and skops CVE (71782) to SM allowlists - Preserve existing protobuf 85151 entry in both CPU and GPU allowlists - Fix sagemaker_v3/requirements.txt: remove botocore/awscli pins that caused ResolutionImpossible, simplify to boto3>=1.35.0,<2.0 - Set do_build=true to bake allowlists into fresh image
- boto3>=1.42.2 matches sagemaker-core>=2.1.0 requirement - mock>=4.0 overrides shared mock==2.0.0 pin (sagemaker-core needs >4.0) - Disable all tests except SM remote + SM EFA - do_build=false (image already built)
- conftest.py: guard sagemaker.pytorch.PyTorch import with try/except - sagemaker/__init__.py: guard sagemaker.pytorch and sagemaker.exceptions imports - pytorch/__init__.py: guard sagemaker.exceptions import - sagemaker_v3/timeout.py: standalone implementation, no v2 dependency
In SM SDK v3, the following are removed from the top-level namespace: - sagemaker.LocalSession - sagemaker.Session - sagemaker.utils - sagemaker.pytorch.PyTorch - sagemaker.exceptions Guard these imports with try/except in all shared files that are loaded by pytest when collecting v3 tests: - conftest.py: LocalSession, Session -> None with pytest.skip in fixtures - sagemaker/__init__.py: utils -> None, exceptions -> placeholder class - pytorch/__init__.py: exceptions -> placeholder class - sagemaker_tests/__init__.py: Session -> v3 path, exceptions -> placeholder Verified locally: pytest --collect-only on sagemaker_v3/ collects 59 tests with zero import errors using sagemaker==3.5.0.
GitHub Issue #, if available:
Note:
If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.
All PR's are checked weekly for staleness. This PR will be closed if not updated in 30 days.
Description
Tests Run
75e2e56 - passed all tests
By default, docker image builds and tests are disabled. Two ways to run builds and tests:
How to use the helper utility for updating dlc_developer_config.toml
Assuming your remote is called
origin(you can find out more withgit remote -v)...python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp originpython src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp originpython src/prepare_dlc_dev_environment.py -rcp originNOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:
sagemaker_remote_tests = truesagemaker_efa_tests = truesagemaker_rc_tests = truesagemaker_local_tests = trueHow to use PR description
Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:# /buildspec <buildspec_path># /buildspec pytorch/training/buildspec.yml# /tests <test_list># /tests sanity security ec2sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.Formatting
black -l 100on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)PR Checklist
Expand
Pytest Marker Checklist
Expand
@pytest.mark.model("<model-type>")to the new tests which I have added, to specify the Deep Learning model that is used in the test (use"N/A"if the test doesn't use a model)@pytest.mark.integration("<feature-being-tested>")to the new tests which I have added, to specify the feature that will be tested@pytest.mark.multinode(<integer-num-nodes>)to the new tests which I have added, to specify the number of nodes used on a multi-node test@pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">)to the new tests which I have added, if a test is specifically applicable to only one processor typeBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.