forked from pytorch/pytorch
Add torch.backends.cuda.math_sdp.fp32_precision #2841
Closed
Conversation
…jectVariable subclasses (pytorch#167801)
Removed redundant `_nonvar_fields` assignments from 5 UserDefinedObjectVariable subclasses. These explicit re-assignments are unnecessary because Python's class attribute inheritance automatically provides access to parent class attributes.
**Classes cleaned up:**
- UserDefinedDictVariable
- UserDefinedSetVariable
- UserDefinedListVariable
- UserDefinedTupleVariable
- MutableMappingVariable
All 5 classes inherit from `UserDefinedObjectVariable`, which defines `_nonvar_fields`. The pattern `_nonvar_fields = UserDefinedObjectVariable._nonvar_fields` is pure redundancy - the child classes will automatically inherit this attribute from the parent.
## Changes
- **Lines removed:** 10 (5 redundant assignments + 5 blank lines)
- **File modified:** `torch/_dynamo/variables/user_defined.py`
## Impact
- **Code reduction:** -10 lines
- **Maintainability:** ↑ (less redundancy)
- **Risk:** Zero (identical behavior via inheritance)
Pull Request resolved: pytorch#167801 Approved by: https://github.com/guilhermeleobas
pytorch#168146)" This reverts commit 08bfadf. Reverted pytorch#168146 on behalf of https://github.com/yangw-dev due to failed internal tests due to AttributeError: 'LocalIntNode' object has no attribute 'int_', please fix it and re-merge again ([comment](pytorch#168146 (comment)))
This fixes an issue with the tests in fbcode Pull Request resolved: pytorch#168931 Approved by: https://github.com/anijain2305
…ytorch#168914) Pull Request resolved: pytorch#168914 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931
…ch#168915) Pull Request resolved: pytorch#168915 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914
Pull Request resolved: pytorch#168924 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914, pytorch#168915
…torch#168925) They are leaking state and breaking other tests Pull Request resolved: pytorch#168925 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914, pytorch#168915, pytorch#168924
Test would fail because op names were already in use. Pull Request resolved: pytorch#168926 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914, pytorch#168915, pytorch#168924, pytorch#168925
…g of pow func (pytorch#167723)" This reverts commit f1c49c9. Reverted pytorch#167723 on behalf of https://github.com/yangw-dev due to break trunk inductor tests test/inductor/test_triton_cpu_backend.py ([comment](pytorch#167723 (comment)))
This reverts commit 4909fd8. Reverted pytorch#158219 on behalf of https://github.com/jeffdaily due to broke ROCm dynamo inductor benchmarks on ciflow/inductor-periodic label which wasn't run by default for this PR ([comment](pytorch#158219 (comment)))
Pull Request resolved: pytorch#168396 Approved by: https://github.com/xmfan ghstack dependencies: pytorch#168289
…iodic frequency from every 6 hours to every 2 hours (pytorch#168990) Fix low utilization issue for linux.dgx.b200. linux.dgx.b200.8 is much busier. According to https://hud.pytorch.org/runners/pytorch?search=dgx.b200 Pull Request resolved: pytorch#168990 Approved by: https://github.com/drisspg
….8 (pytorch#168985) Summary: Fix pytorch#168353. aot_inductor.emit_multi_arch_kernel requires a newer CUDA version. Pull Request resolved: pytorch#168985 Approved by: https://github.com/yushangdi
…#166044)
Related to pytorch#163970.
Changes (addressing review feedback from @malfet and @atalman):
1. Simplified the x86 TORCH_CUDA_ARCH_LIST logic to reuse the base list in `.ci/manywheel/build_cuda.sh`.
2. Added function filter_aarch64_archs() that filters TORCH_CUDA_ARCH_LIST for aarch64 based on the x86 code.
3. Added a function in `.ci/pytorch/build.sh` to report an error if ACL is not present.
4. Deprecated the previous aarch64 scripts (`.ci/aarch64_linux/` folder).
Improvements:
1. Significant improvement in build time for the CUDA ARM wheel build - reduced from 5.5–6 hours to 1 hour 40–50 minutes. Taking this 13.0 build as an example: 6h 11m 46s to 1h 50m 1s, roughly a 70% faster build.
old: https://github.com/pytorch/pytorch/actions/runs/19304934204/job/55209695430
new: https://github.com/pytorch/pytorch/actions/runs/19301014750/job/55195226316
Reason: MAX_JOBS=5 is removed now that we have moved away from the original aarch64 build workflow (it previously hit OOM while building flash-attn); the new MAX_JOBS is 12. https://github.com/pytorch/pytorch/pull/166044/files#diff-ccef31095e4f2d203710232531c38bff3251e41cf73ec84ee59f224bb64034aeL280
2. Unified workflow for building x86 and sbsa wheels - more maintainable code.
Pull Request resolved: pytorch#166044 Approved by: https://github.com/atalman
…ng clarity (pytorch#168272)
Fixes pytorch#168160
TYPE_MATCH guards currently generate code like:
___check_type_id(x, 94229757490048)
The numeric type-id provides no information about the type being checked. This PR appends a human-readable `repr(type)` as a trailing comment:
___check_type_id(x, 94229757490048) # <class 'torch.nn.modules.linear.Linear'>
### What This Change Does
- Adds `repr(t)` to improve readability of guard output.
- No behavior or semantics are changed; this is a debug-only improvement.
### Testing
Verified that `repr(type)` produces readable, accurate names for built-in, user-defined, and torch.nn module types. Runtime behavior is unchanged; CI will validate everything end-to-end.
Pull Request resolved: pytorch#168272 Approved by: https://github.com/williamwen42, https://github.com/anijain2305
Found in pytorch#167407 but affects non-threaded builds as well Pull Request resolved: pytorch#168325 Approved by: https://github.com/williamwen42
…h#168989) Right now we get a pretty hard to understand error message: ``` Traceback (most recent call last): File "/home/bobren/local/a/pytorch/spc.py", line 80, in <module> .save_compiled_function(path) File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 129, in save_compiled_function f.write(type(self).serialize(self)) File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 145, in serialize type(compiled_fn).serialize_compile_artifacts(compiled_fn), File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile_types.py", line 54, in serialize_compile_artifacts def deserialize_compile_artifacts(cls, data: bytes) -> Any: TypeError: 'NoneType' object is not callable ``` which happens because cache is bypassed, so the "serialize" field on compiled_fn is set to None. after this PR we get a much more direct error message: ``` (/home/bobren/local/a/pytorch-env) [9:18] devgpu009:/home/bobren/local/a/pytorch [130] ❯ cache_tlp python spc.py Wrapped class is <class '__main__.WrappedBasicNN_1'> my_property: 123 Traceback (most recent call last): File "/home/bobren/local/a/pytorch/spc.py", line 79, in <module> .aot_compile(((input_tensor,), {})) File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 806, in aot_compile return aot_compile_fullgraph( File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 236, in aot_compile_fullgraph compiled_fn = backend( File "/home/bobren/local/a/pytorch/torch/__init__.py", line 2445, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2525, in compile_fx return _maybe_wrap_and_compile_fx_main( File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2602, in _maybe_wrap_and_compile_fx_main return _compile_fx_main( File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2797, in _compile_fx_main return aot_autograd( File "/home/bobren/local/a/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__ cg = aot_module_simplified(gm, example_inputs, **self.kwargs) File "/home/bobren/local/a/pytorch/torch/_functorch/aot_autograd.py", line 1097, in aot_module_simplified compiled_fn = AOTAutogradCache.try_load( File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 708, in try_load raise e File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 639, in try_load cache_key, debug_lines = autograd_cache_key( File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 499, in autograd_cache_key check_cacheable(gm) File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 292, in check_cacheable check_node_safe(node) File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 240, in check_node_safe raise BypassAOTAutogradCache( torch._functorch._aot_autograd.autograd_cache.BypassAOTAutogradCache: Unsupported call_function target tag_activation_checkpoint. Function module: torch.ops.higher_order, Function name: tag_activation_checkpoint ``` Pull Request resolved: pytorch#168989 Approved by: https://github.com/jamesjwu
`call_function` is starting to get pretty long so pull the `nonstrict_traceable` portion out into a helper function before we make it even longer for pytorch#168890. Pull Request resolved: pytorch#168932 Approved by: https://github.com/anijain2305
…ch#168119)" This reverts commit c566552. Reverted pytorch#168119 on behalf of https://github.com/yushangdi due to This PR caused DebugMode to hang/segfault sometimes. See repro in P2054777054 ([comment](pytorch#168119 (comment)))
The last change to this file was back in 2021, and the last CircleCI job was wound down probably in 2022, so it's safe to assume it's unused. Pull Request resolved: pytorch#169003 Approved by: https://github.com/huydhn
We may pick wrong contiguous node in mix-order reduction fusion due to dynamic shapes. Differential Revision: [D87788131](https://our.internmc.facebook.com/intern/diff/D87788131) Pull Request resolved: pytorch#168371 Approved by: https://github.com/PaulZhang12
As titled, there is a comm size estimation regression after this PR: pytorch#167852, which causes a DSV3 dynamic shape estimation error: pytorch/torchtitan#2037. Also added dynamic shape comm estimation test cases in this PR. cc @eellison @ezyang Pull Request resolved: pytorch#168199 Approved by: https://github.com/laithsakka
This PR adds support for effectful ops within invoke_subgraphs.
* Most of the logic is in `invoke_subgraph.py_functionalize_impl`.
* In the functionalization metadata collection phase, we note the tokens before going further down the dispatcher, and then note the tokens again after coming back from the dispatcher. If there are nodes in the invoke_subgraph subgraph that contain effects, then either the number of effects or the tokens used for an effect should change between the two observations.
* We will store this effect difference in the `InvokeSubgraphCache` where the key is the identifier and value is the effect. For now we only support one effect within a subgraph.
* During the tracing part of AOTAutograd, we will then wrap the subgraph to take in and output a token.
Before:
```
def forward(self, x):
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', x)
return invoke_subgraph
def repeated_subgraph(self, x):
record_memory = torch.ops.mylib.record_memory.default("forward", "N")
add = torch.ops.aten.add(x, x)
return add
```
After:
```
def forward(self, token, x):
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', token, x)
getitem = invoke_subgraph[0] # output token
getitem_1 = invoke_subgraph[1]
return (getitem, getitem_1)
def repeated_subgraph(self, token, x):
with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
getitem = with_effects[0] # output token
add = torch.ops.aten.add(x, x)
return (getitem, add)
```
* Then there is a bunch of logic within `_remove_effect_tokens` to handle removing the effects from the invoke_subgraph subgraph
Differential Revision: [D87392741](https://our.internmc.facebook.com/intern/diff/D87392741)
Pull Request resolved: pytorch#167231
Approved by: https://github.com/anijain2305
…torch#167245) In the [previous PR](https://github.com/pytorch/pytorch/pull/167231/files#diff-e2b74af5d8b538a7d07d18507d27010703742ddad5f819992b55f5abc6d9a502R964-R966) we found that the autograd eager impl of invoke_subgraph calls the subgraph twice. If the subgraph contains effects then effects will be run twice, which is bad. This PR fixes the issue by getting the output metadata from `subgraph`'s `node.meta` if it exists. Differential Revision: [D87392740](https://our.internmc.facebook.com/intern/diff/D87392740) Pull Request resolved: pytorch#167245 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#167231
Updates the implementation of `unlift_tokens` to handle unlifting invoke_subgraph.
The context of `unlift_tokens`: currently, tokens are threaded as inputs and outputs of the toplevel graph produced by AOTAutograd. However, we don't want the Inductor traced graph to have any notion of effects/tokens; the tokens should only introduce some extra dependency behavior. So, we unlift the tokens from the toplevel graph. Instead of coming from placeholder nodes, the tokens will come from a `_make_token` call, and instead of outputting the tokens we will sink all of them into `_sink_tokens`.
Similarly, we want the invoke_subgraph subgraph to not have any notion of tokens, so we also remove the tokens from the inputs of the invoke_subgraph subgraph. However, we still need a way to mark the invoke_subgraph call as being effectful at the toplevel module, to prevent invoke_subgraph calls from being reordered, so the invoke_subgraph call is wrapped with an effect.
Before:
```
def forward(self, token, x):
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', token, x)
getitem = invoke_subgraph[0] # output token
getitem_1 = invoke_subgraph[1]
return (getitem, getitem_1)
def repeated_subgraph(self, token, x):
with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
getitem = with_effects[0] # output token
add = torch.ops.aten.add(x, x)
return (getitem, add)
```
After:
```
def forward(self, x):
token = torch.ops.prims._make_token.default()
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.with_effects(
token, torch.ops.higher_order.invoke_subgraph, repeated_subgraph0, 'subgraph_0', token, x
)
getitem = invoke_subgraph[0] # output token
getitem_1 = invoke_subgraph[1]
_ = torch.ops.prims._sink_tokens.default([getitem])
return (getitem_1,)
def repeated_subgraph(self, x):
token = torch.ops.prims._make_token.default()
with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
getitem = with_effects[0] # output token
add = torch.ops.aten.add(x, x)
_ = torch.ops.prims._sink_tokens.default([getitem])
return (add,)
```
Differential Revision: [D87668981](https://our.internmc.facebook.com/intern/diff/D87668981)
Pull Request resolved: pytorch#167363
Approved by: https://github.com/fxdawnn
ghstack dependencies: pytorch#167231, pytorch#167245
This PR uses context managers and suppresses ruff `SIM115` warnings in some places. Pull Request resolved: pytorch#167788 Approved by: https://github.com/albanD
The test fails due to undefined variable: ``` Running inductor/test_flex_decoding 1/1 ... SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set Executing ['/opt/app-root/bin/python', '-bb', 'inductor/test_flex_decoding.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2'] ... Running inductor/test_fxir_backend 1/1 ... SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set Executing ['/opt/app-root/bin/python', '-bb', 'inductor/test_fxir_backend.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2'] ... /opt/app-root/lib64/python3.12/site-packages/torch/__init__.py:1613: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /root/pytorch/aten/src/ATen/Context.cpp:80.) _C._set_float32_matmul_precision(precision) Traceback (most recent call last): File "/root/pytorch/test/inductor/test_flex_decoding.py", line 309, in <module> class TestFlexDecoding(InductorTestCase): File "/root/pytorch/test/inductor/test_flex_decoding.py", line 751, in TestFlexDecoding @unittest.skipIf(SKIP_UT_ON_CPU, "Skip on CPU as not supported") ^^^^^^^^^^^^^^ NameError: name 'SKIP_UT_ON_CPU' is not defined ``` Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#165404 Approved by: https://github.com/drisspg, https://github.com/Skylion007
…152161)
Fixes: pytorch#146211
This PR fixes an issue with `torch.take_along_dim()` not correctly handling negative indices. Previously, using negative values in the `indices` tensor caused an out-of-bounds error. This update wraps indices correctly, matching Python-style indexing semantics.
### 🔧 Changes
- Modified `_take_along_dim_helper` to apply modulo logic for dimension-safe negative indexing.
- Added a unit test `test_take_along_dim_negative_indices` to `test/test_indexing.py` to assert correctness of negative indexing behavior.
### 🧪 Testing
```bash
pytest test/test_indexing.py -k test_take_along_dim_negative_indices
```
Pull Request resolved: pytorch#152161 Approved by: https://github.com/albanD
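Below is a minimal illustration of the behavior this commit describes, with made-up tensor values; it is only a sketch of Python-style negative-index wrapping in `torch.take_along_dim`, not the PR's own test.
```python
import torch

x = torch.tensor([[10., 20., 30.],
                  [40., 50., 60.]])
idx = torch.tensor([[-1], [-3]])           # negative indices, one per row
out = torch.take_along_dim(x, idx, dim=1)  # -1 -> last column, -3 -> first column
print(out)                                 # tensor([[30.], [40.]])
```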
pytorch#162720) …or FPE. Fixes pytorch#142462 Pull Request resolved: pytorch#162720 Approved by: https://github.com/isuruf
This PR uses `key in dict` expressions for existence checks of dict elements in Python code. This operation is more efficient than `key in dict.keys()`. Pull Request resolved: pytorch#168350 Approved by: https://github.com/albanD
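For illustration, a tiny sketch of the pattern this commit applies (the dict and keys here are made up):
```python
config = {"precision": "ieee", "backend": "math"}

# Preferred: the membership test goes straight to the dict's hash lookup.
if "precision" in config:
    print(config["precision"])

# Avoided: `config.keys()` constructs a keys view first, only to do the same lookup.
if "precision" in config.keys():
    print(config["precision"])
```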
Pull Request resolved: pytorch#169324 Approved by: https://github.com/malfet, https://github.com/yarongmu-google, https://github.com/jansel ghstack dependencies: pytorch#169323
…atio_chain (pytorch#169309) Fixes https://www.internalfb.com/tasks/?t=246834114 Pull Request resolved: pytorch#169309 Approved by: https://github.com/ezyang
…pytorch#169310) Fixes https://www.internalfb.com/tasks/?t=246782196 Pull Request resolved: pytorch#169310 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#169309
…se_observed_exception (pytorch#168337)" This reverts commit fb5be22. Reverted pytorch#168337 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail some dynamo tests in trunk ([comment](pytorch#168337 (comment)))
…uards and verbose_code_parts (pytorch#169102)
Fix pytorch#168379.
1. The improved testing validates that the ``___dict_contains`` guards are sorted based on the verbose part. This first solution was also suggested in https://fb.workplace.com/groups/1075192433118967/permalink/1650742858897252/ by sorting ``get_leaf_guards()`` in ``construct_manager_string``.
2. The second solution adopts an ``OrderedSet`` in setGuards during guard construction so that the ``contain_dict`` guards are displayed in the order they were added.
We decided to pursue the second option for its simplicity and to avoid the sorting time overhead.
Pull Request resolved: pytorch#169102 Approved by: https://github.com/anijain2305
Summary: Compress aoti stack (replace full paths with filenames). Test Plan: ``` [nbeloborodov@devgpu031]~/fbsource/fbcode% strobe gpuevent --duration-ms=60000 --collect-kernel-events --kernel-sample-interval=0 --pids 1016951 Running "gpuevent" with run id -4456078642709746 and group_trace_id "" on hosts: ["::1"] Press Ctrl-C to stop the run > Queuing... (00:00:00.001) > Preparing... (00:00:04.055) > Profiling... (00:01:00.383) > Processing... (00:00:00.643) > Logging... (00:00:00.025) > Finished | Host | Return Code | Samples | Result Links | |------|-------------|---------|------------------------------------------------------------| | ::1 | SUCCESS | 4 | Raw samples: | | | | | https://fburl.com/scuba/strobelight_gpu/on_demand/zsglu6sc | | | | | | | | | | Run Details: | | | | | https://fburl.com/scuba/strobelight_runs/hmcuaz8u | ``` Differential Revision: D88005763 Pull Request resolved: pytorch#169291 Approved by: https://github.com/yushangdi
… is defined (pytorch#167496) Fixes pytorch#161660 This extends the `TORCH_STABLE_ONLY` stopgap added in pytorch#161658 Pull Request resolved: pytorch#167496 Approved by: https://github.com/janeyx99, https://github.com/malfet, https://github.com/atalman
Adding reduce_scatter_tensor_out to use in fx passes to efficiently decompose reduce_scatter without concatenation. Pull Request resolved: pytorch#168260 Approved by: https://github.com/wconstab
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see pytorch#165038 and pytorch#143112). Sometimes we have to look these capabilities up in the documentation and then hard-code them. This PR proposes a new unified API for users to check their accelerator capabilities.
# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (set to be supported as default): `fp16`, `int32`, `complex`, etc.
- Other capabilities (need to be discussed)
To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as an input and returns a dictionary containing the capabilities (currently with `supported_dtypes` as the key).
# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
# TODO
- So far, precision is the only capability tracked, to my knowledge. But we can find more capabilities in common, and the API should be designed for good extensibility.
- It will support other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software- or hardware-supported. (By @guangyey)
Pull Request resolved: pytorch#165631 Approved by: https://github.com/guangyey, https://github.com/albanD Co-authored-by: Yu, Guangye <[email protected]> Co-authored-by: Jiawei Li <[email protected]>
…#169239) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#169239 Approved by: https://github.com/jamesjwu, https://github.com/laithsakka
…ytorch#166610) Pull Request resolved: pytorch#166610 Approved by: https://github.com/SherlockNoMad
Pull Request resolved: pytorch#169131 Approved by: https://github.com/anijain2305
…ch#167342) Provides type coverage to torch/_dynamo/variables/nn_module.py Coverage report: `mypy torch/_dynamo/variables/nn_module.py --linecount-report /tmp/coverage_log` Compare before to after - we go from 0 lines and 0 funcs covered to 1378 lines and 31 funcs covered Pull Request resolved: pytorch#167342 Approved by: https://github.com/williamwen42
…den to use ieee rather than tf32
Jenkins build for b996c2e2f44dbc4d952beb09bdb7dbfa84837d03 commit finished as NOT_BUILT
9497fc8 to b996c2e
Jenkins build for b996c2e2f44dbc4d952beb09bdb7dbfa84837d03 commit is in progress
Author: Branch was rebased upstream. Need to do cherry picks.
Overview
This PR adds a new float32 precision API, torch.backends.cuda.math_sdp.fp32_precision, to configure the fp32 precision behavior of SDPBackend.MATH.
Rationale
The test/test_transformers.py test suite calculates the numerical tolerance by comparing output tensors computed at the same precision ("reference") and at higher precision ("golden"), both calculated by SDPBackend.MATH. However, the golden output is calculated with TF32 rather than FP32, which is in fact less accurate than the FA/ME backends if they use IEEE rather than TF32 for their accumulation.
The loss of precision causes false negatives in SDPA tests like TestSDPACudaOnlyCUDA.test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_enable_gqa_True_n_heads1_cuda_float16, at least on the ROCm platform. The false negative disappears after forcing higher_precision_dtype = torch.float64.
Major Changes
To restore the precision of the golden output, a new API, torch.backends.cuda.math_sdp.fp32_precision, is introduced, which allows configuring the "matmul" precision used during SDPBackend.MATH, and a new decorator @math_sdp_precision("ieee") is added to all tests that use check_out_and_grad. Finally, an assert is added to the innermost function _check_equal as a sanity check to ensure math_sdp has the right precision configured for torch.float32 golden tensors.
Known Issues
The backward pass honors the configuration in effect when backward() is called, regardless of the configuration in effect when the graph was created.
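A brief sketch of this known issue (tensors and shapes are assumptions): whichever setting is active when backward() runs is the one the backward pass uses, not the setting that was active when the graph was built.
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 4, 128, 64, device="cuda", requires_grad=True) for _ in range(3))

torch.backends.cuda.math_sdp.fp32_precision = "tf32"
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)  # forward runs with tf32

torch.backends.cuda.math_sdp.fp32_precision = "ieee"
out.sum().backward()  # backward runs with ieee, despite the graph being built under tf32
```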