Add warning about not supporting torch.nn.SyncBatchNorm #5046
Warning: ``torch.nn.SyncBatchNorm`` is not supported and its existence in
``init_process_group`` will cause an exception during distributed training.
Can we make the wording more generic instead of covering just this use case?
smdistributed with init_process_group will not allow use of certain other torch features, such as (and likely not limited to) torch.nn.SyncBatchNorm.
updated
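Because the warning only documents the limitation, a training script could fail fast by scanning the model for SyncBatchNorm layers before it initializes the process group through smdistributed. The following is a minimal sketch, assuming PyTorch is installed; `find_sync_batchnorm` and `check_no_sync_batchnorm` are hypothetical helpers and are not part of the SageMaker Python SDK or smdistributed.

```python
from typing import List

import torch


def find_sync_batchnorm(model: torch.nn.Module) -> List[str]:
    """Return the qualified names of all SyncBatchNorm submodules in `model`."""
    # named_modules() walks the whole module tree, including `model` itself
    # (under the empty name ""), so nested layers are found too.
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.SyncBatchNorm)
    ]


def check_no_sync_batchnorm(model: torch.nn.Module) -> None:
    """Hypothetical pre-flight guard: raise early if SyncBatchNorm is present."""
    offending = find_sync_batchnorm(model)
    if offending:
        raise ValueError(
            "torch.nn.SyncBatchNorm is not supported with smdistributed's "
            "init_process_group; found at: " + ", ".join(offending)
        )
```

Calling `check_no_sync_batchnorm(model)` before initializing the process group turns the incompatibility into a readable `ValueError` that lists the offending submodule names, instead of an exception partway through distributed training. Using `isinstance` rather than an exact type comparison also catches subclasses of `torch.nn.SyncBatchNorm`.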
* Add warning about not supporting
* update wording
---------
Co-authored-by: pintaoz <[email protected]>
* feature: integrate amtviz for visualization of tuning jobs
* Move RecordSerializer and RecordDeserializer to sagemaker.serializers and sagemaker.deserialzers (#5037)
* Move RecordSerializer and RecordDeserializer to sagemaker.serializers and sagemaker.deserializers
* fix codestyle
* fix test
---------
Co-authored-by: pintaoz <[email protected]>
* Add framework_version to all TensorFlowModel examples (#5038)
* Add framework_version to all TensorFlowModel examples
* update framework_version to x.x.x
---------
Co-authored-by: pintaoz <[email protected]>
* Fix hyperparameter strategy docs (#5045)
* fix: pass in inference_ami_version to model_based endpoint type (#5043)
* fix: pass in inference_ami_version to model_based endpoint type
* documentation: update contributing.md w/ venv instructions and pip install fixes
---------
Co-authored-by: Zhaoqi <[email protected]>
* Add warning about not supporting torch.nn.SyncBatchNorm (#5046)
* Add warning about not supporting
* update wording
---------
Co-authored-by: pintaoz <[email protected]>
* prepare release v2.239.2
* update development version to v2.239.3.dev0
* change: update image_uri_configs 02-19-2025 06:18:15 PST
* fix: codestyle, type hints, license, and docstrings
* documentation: add docstring for amtviz module
* fix: fix docstyle and flake8 errors
* fix: code reformat using black
---------
Co-authored-by: Uemit Yoldas <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
Co-authored-by: parknate@ <[email protected]>
Co-authored-by: timkuo-amazon <[email protected]>
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: sagemaker-bot <[email protected]>
Issue #, if available:
#2571
Description of changes:
Add warning about not supporting torch.nn.SyncBatchNorm
Testing done:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
unique_name_from_base to create resource names in integ tests (if appropriate)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.