forked from pytorch/pytorch
Add torch.backends.cuda.math_sdp.fp32_precision #2841
Closed
Conversation
…jectVariable subclasses (pytorch#167801)
Removed redundant `_nonvar_fields` assignments from 5 UserDefinedObjectVariable subclasses. These explicit re-assignments are unnecessary because Python's class attribute inheritance automatically provides access to parent class attributes.
**Classes cleaned up:**
- UserDefinedDictVariable
- UserDefinedSetVariable
- UserDefinedListVariable
- UserDefinedTupleVariable
- MutableMappingVariable
All 5 classes inherit from `UserDefinedObjectVariable`, which defines `_nonvar_fields`. The pattern `_nonvar_fields = UserDefinedObjectVariable._nonvar_fields` is pure redundancy - the child classes will automatically inherit this attribute from the parent.
## Changes
- **Lines removed:** 10 (5 redundant assignments + 5 blank lines)
- **File modified:** `torch/_dynamo/variables/user_defined.py`
## Impact
- **Code reduction:** -10 lines
- **Maintainability:** ↑ (less redundancy)
- **Risk:** Zero (identical behavior via inheritance)
Pull Request resolved: pytorch#167801 Approved by: https://github.com/guilhermeleobas
pytorch#168146)" This reverts commit 08bfadf. Reverted pytorch#168146 on behalf of https://github.com/yangw-dev due to failed internal tests due to AttributeError: 'LocalIntNode' object has no attribute 'int_', please fix it and re-merge again ([comment](pytorch#168146 (comment)))
This fixes an issue with the tests in fbcode Pull Request resolved: pytorch#168931 Approved by: https://github.com/anijain2305
…ytorch#168914) Pull Request resolved: pytorch#168914 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931
…ch#168915) Pull Request resolved: pytorch#168915 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914
Pull Request resolved: pytorch#168924 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914, pytorch#168915
…torch#168925) They are leaking state and breaking other tests Pull Request resolved: pytorch#168925 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914, pytorch#168915, pytorch#168924
Test would fail because op names were already in use. Pull Request resolved: pytorch#168926 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#168931, pytorch#168914, pytorch#168915, pytorch#168924, pytorch#168925
…g of pow func (pytorch#167723)" This reverts commit f1c49c9. Reverted pytorch#167723 on behalf of https://github.com/yangw-dev due to break trunk inductor tests test/inductor/test_triton_cpu_backend.py ([comment](pytorch#167723 (comment)))
This reverts commit 4909fd8. Reverted pytorch#158219 on behalf of https://github.com/jeffdaily due to broke ROCm dynamo inductor benchmarks on ciflow/inductor-periodic label which wasn't run by default for this PR ([comment](pytorch#158219 (comment)))
Pull Request resolved: pytorch#168396 Approved by: https://github.com/xmfan ghstack dependencies: pytorch#168289
…iodic frequency from every 6 hours to every 2 hours (pytorch#168990) Fix low utilization issue for linux.dgx.b200. linux.dgx.b200.8 is much busier. According to https://hud.pytorch.org/runners/pytorch?search=dgx.b200 Pull Request resolved: pytorch#168990 Approved by: https://github.com/drisspg
….8 (pytorch#168985) Summary: Fix pytorch#168353. aot_inductor.emit_multi_arch_kernel requires a newer CUDA version. Pull Request resolved: pytorch#168985 Approved by: https://github.com/yushangdi
…#166044)
Related to pytorch#163970.
Changes (addressing review feedback from @malfet and @atalman):
1. Simplified the x86 TORCH_CUDA_ARCH_LIST logic to reuse the base list in `.ci/manywheel/build_cuda.sh`.
2. Added function filter_aarch64_archs() that filters TORCH_CUDA_ARCH_LIST for aarch64 based on the x86 code.
3. Added a function in `.ci/pytorch/build.sh` to report an error if ACL is not present.
4. Deprecated the previous aarch64 scripts (`.ci/aarch64_linux/` folder).
Improvements:
1. Significant improvement in build time for the CUDA ARM wheel build - reduced from 5.5–6 hours to 1 hour 40–50 minutes. Taking this 13.0 build as an example: 6h 11m 46s to 1h 50m 1s, roughly a 70% faster build.
old: https://github.com/pytorch/pytorch/actions/runs/19304934204/job/55209695430
new: https://github.com/pytorch/pytorch/actions/runs/19301014750/job/55195226316
Reason: MAX_JOBS=5 is removed now that we have moved away from the original aarch64 build workflow (it previously hit OOM while building flash-attn); the new MAX_JOBS is 12. https://github.com/pytorch/pytorch/pull/166044/files#diff-ccef31095e4f2d203710232531c38bff3251e41cf73ec84ee59f224bb64034aeL280
2. Unified workflow for building x86 and sbsa wheels - more maintainable code.
Pull Request resolved: pytorch#166044 Approved by: https://github.com/atalman
…ng clarity (pytorch#168272)
Fixes pytorch#168160
TYPE_MATCH guards currently generate code like:
___check_type_id(x, 94229757490048)
The numeric type-id provides no information about the type being checked. This PR appends a human-readable `repr(type)` as a trailing comment:
___check_type_id(x, 94229757490048) # <class 'torch.nn.modules.linear.Linear'>
### What This Change Does
- Adds `repr(t)` to improve readability of guard output.
- No behavior or semantics are changed; this is a debug-only improvement.
### Testing
Verified that `repr(type)` produces readable, accurate names for built-in, user-defined, and torch.nn module types. Runtime behavior is unchanged; CI will validate everything end-to-end.
Pull Request resolved: pytorch#168272 Approved by: https://github.com/williamwen42, https://github.com/anijain2305
Found in pytorch#167407 but affects non-threaded builds as well Pull Request resolved: pytorch#168325 Approved by: https://github.com/williamwen42
…h#168989) Right now we get a pretty hard to understand error message: ``` Traceback (most recent call last): File "/home/bobren/local/a/pytorch/spc.py", line 80, in <module> .save_compiled_function(path) File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 129, in save_compiled_function f.write(type(self).serialize(self)) File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 145, in serialize type(compiled_fn).serialize_compile_artifacts(compiled_fn), File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile_types.py", line 54, in serialize_compile_artifacts def deserialize_compile_artifacts(cls, data: bytes) -> Any: TypeError: 'NoneType' object is not callable ``` which happens because cache is bypassed, so the "serialize" field on compiled_fn is set to None. after this PR we get a much more direct error message: ``` (/home/bobren/local/a/pytorch-env) [9:18] devgpu009:/home/bobren/local/a/pytorch [130] ❯ cache_tlp python spc.py Wrapped class is <class '__main__.WrappedBasicNN_1'> my_property: 123 Traceback (most recent call last): File "/home/bobren/local/a/pytorch/spc.py", line 79, in <module> .aot_compile(((input_tensor,), {})) File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 806, in aot_compile return aot_compile_fullgraph( File "/home/bobren/local/a/pytorch/torch/_dynamo/aot_compile.py", line 236, in aot_compile_fullgraph compiled_fn = backend( File "/home/bobren/local/a/pytorch/torch/__init__.py", line 2445, in __call__ return compile_fx(model_, inputs_, config_patches=self.config) File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2525, in compile_fx return _maybe_wrap_and_compile_fx_main( File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2602, in _maybe_wrap_and_compile_fx_main return _compile_fx_main( File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx.py", line 2797, in _compile_fx_main return aot_autograd( File "/home/bobren/local/a/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__ cg = aot_module_simplified(gm, example_inputs, **self.kwargs) File "/home/bobren/local/a/pytorch/torch/_functorch/aot_autograd.py", line 1097, in aot_module_simplified compiled_fn = AOTAutogradCache.try_load( File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 708, in try_load raise e File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 639, in try_load cache_key, debug_lines = autograd_cache_key( File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 499, in autograd_cache_key check_cacheable(gm) File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 292, in check_cacheable check_node_safe(node) File "/home/bobren/local/a/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 240, in check_node_safe raise BypassAOTAutogradCache( torch._functorch._aot_autograd.autograd_cache.BypassAOTAutogradCache: Unsupported call_function target tag_activation_checkpoint. Function module: torch.ops.higher_order, Function name: tag_activation_checkpoint ``` Pull Request resolved: pytorch#168989 Approved by: https://github.com/jamesjwu
`call_function` is starting to get pretty long so pull the `nonstrict_traceable` portion out into a helper function before we make it even longer for pytorch#168890. Pull Request resolved: pytorch#168932 Approved by: https://github.com/anijain2305
…ch#168119)" This reverts commit c566552. Reverted pytorch#168119 on behalf of https://github.com/yushangdi due to This PR caused DebugMode to hang/segfault sometimes. See repro in P2054777054 ([comment](pytorch#168119 (comment)))
The last change to this file was back in 2021, and the last CircleCI job was wound down probably in 2022, so it's safe to assume it's unused. Pull Request resolved: pytorch#169003 Approved by: https://github.com/huydhn
We may pick wrong contiguous node in mix-order reduction fusion due to dynamic shapes. Differential Revision: [D87788131](https://our.internmc.facebook.com/intern/diff/D87788131) Pull Request resolved: pytorch#168371 Approved by: https://github.com/PaulZhang12
As titled, there is a comm size estimation regression after this PR: pytorch#167852, which causes a DSV3 dynamic shape estimation error: pytorch/torchtitan#2037. Also added dynamic shape comm estimation test cases in this PR. cc @eellison @ezyang Pull Request resolved: pytorch#168199 Approved by: https://github.com/laithsakka
This PR adds support for effectful ops within invoke_subgraphs.
* Most of the logic is in `invoke_subgraph.py_functionalize_impl`.
* In the functionalization metadata collection phase, we note the tokens before going further down the dispatcher, and then note the tokens again after coming back from the dispatcher. If there are nodes in the invoke_subgraph subgraph that contain effects, then either the number of effects or the tokens used for an effect should change between the two observations.
* We will store this effect difference in the `InvokeSubgraphCache` where the key is the identifier and value is the effect. For now we only support one effect within a subgraph.
* During the tracing part of AOTAutograd, we will then wrap the subgraph to take in and output a token.
Before:
```
def forward(self, x):
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', x)
return invoke_subgraph
def repeated_subgraph(self, x):
record_memory = torch.ops.mylib.record_memory.default("forward", "N")
add = torch.ops.aten.add(x, x)
return add
```
After:
```
def forward(self, token, x):
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', token, x)
getitem = invoke_subgraph[0] # output token
getitem_1 = invoke_subgraph[1]
return (getitem, getitem_1)
def repeated_subgraph(self, token, x):
with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
getitem = with_effects[0] # output token
add = torch.ops.aten.add(x, x)
return (getitem, add)
```
* Then there is a bunch of logic within `_remove_effect_tokens` to handle removing the effects from the invoke_subgraph subgraph
Differential Revision: [D87392741](https://our.internmc.facebook.com/intern/diff/D87392741)
Pull Request resolved: pytorch#167231
Approved by: https://github.com/anijain2305
…torch#167245) In the [previous PR](https://github.com/pytorch/pytorch/pull/167231/files#diff-e2b74af5d8b538a7d07d18507d27010703742ddad5f819992b55f5abc6d9a502R964-R966) we found that the autograd eager impl of invoke_subgraph calls the subgraph twice. If the subgraph contains effects then effects will be run twice, which is bad. This PR fixes the issue by getting the output metadata from `subgraph`'s `node.meta` if it exists. Differential Revision: [D87392740](https://our.internmc.facebook.com/intern/diff/D87392740) Pull Request resolved: pytorch#167245 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#167231
Updates the implementation of `unlift_tokens` to handle unlifting invoke_subgraph.
The context of `unlift_tokens`: currently, tokens are threaded as inputs and outputs of the toplevel graph produced by AOTAutograd. However, we don't want the Inductor traced graph to have any notion of effects/tokens; the tokens should only introduce some extra dependency behavior. So, we unlift the tokens from the toplevel graph. Instead of coming from placeholder nodes, the tokens will come from a `_make_token` call, and instead of outputting the tokens we will sink all of them into `_sink_tokens`.
Similarly, we want the invoke_subgraph subgraph to not have any notion of tokens, so we also remove the tokens from the inputs of the invoke_subgraph subgraph. However, we still need a way to mark the invoke_subgraph call as being effectful at the toplevel module, to prevent invoke_subgraph calls from being reordered, so the invoke_subgraph call is wrapped with an effect.
Before:
```
def forward(self, token, x):
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.invoke_subgraph(repeated_subgraph0, 'subgraph_0', token, x)
getitem = invoke_subgraph[0] # output token
getitem_1 = invoke_subgraph[1]
return (getitem, getitem_1)
def repeated_subgraph(self, token, x):
with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
getitem = with_effects[0] # output token
add = torch.ops.aten.add(x, x)
return (getitem, add)
```
After:
```
def forward(self, x):
token = torch.ops.prims._make_token.default()
repeated_subgraph0 = self.repeated_subgraph0
invoke_subgraph = torch.ops.higher_order.with_effects(
token, torch.ops.higher_order.invoke_subgraph, repeated_subgraph0, 'subgraph_0', token, x
)
getitem = invoke_subgraph[0] # output token
getitem_1 = invoke_subgraph[1]
_ = torch.ops.prims._sink_tokens.default([getitem])
return (getitem_1,)
def repeated_subgraph(self, x):
token = torch.ops.prims._make_token.default()
with_effects = torch.ops.higher_order.with_effects(token, torch.ops.mylib.record_memory.default, 'forward', 'N')
getitem = with_effects[0] # output token
add = torch.ops.aten.add(x, x)
_ = torch.ops.prims._sink_tokens.default([getitem])
return (add,)
```
Differential Revision: [D87668981](https://our.internmc.facebook.com/intern/diff/D87668981)
Pull Request resolved: pytorch#167363
Approved by: https://github.com/fxdawnn
ghstack dependencies: pytorch#167231, pytorch#167245
This PR uses context managers and suppresses ruff `SIM115` warnings in some places. Pull Request resolved: pytorch#167788 Approved by: https://github.com/albanD
The test fails due to undefined variable: ``` Running inductor/test_flex_decoding 1/1 ... SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set Executing ['/opt/app-root/bin/python', '-bb', 'inductor/test_flex_decoding.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2'] ... Running inductor/test_fxir_backend 1/1 ... SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set Executing ['/opt/app-root/bin/python', '-bb', 'inductor/test_fxir_backend.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2'] ... /opt/app-root/lib64/python3.12/site-packages/torch/__init__.py:1613: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /root/pytorch/aten/src/ATen/Context.cpp:80.) _C._set_float32_matmul_precision(precision) Traceback (most recent call last): File "/root/pytorch/test/inductor/test_flex_decoding.py", line 309, in <module> class TestFlexDecoding(InductorTestCase): File "/root/pytorch/test/inductor/test_flex_decoding.py", line 751, in TestFlexDecoding @unittest.skipIf(SKIP_UT_ON_CPU, "Skip on CPU as not supported") ^^^^^^^^^^^^^^ NameError: name 'SKIP_UT_ON_CPU' is not defined ``` Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#165404 Approved by: https://github.com/drisspg, https://github.com/Skylion007
…152161)
Fixes: pytorch#146211
This PR fixes an issue with `torch.take_along_dim()` not correctly handling negative indices. Previously, using negative values in the `indices` tensor caused an out-of-bounds error. This update wraps indices correctly, matching Python-style indexing semantics.
### 🔧 Changes
- Modified `_take_along_dim_helper` to apply modulo logic for dimension-safe negative indexing.
- Added a unit test `test_take_along_dim_negative_indices` to `test/test_indexing.py` to assert correctness of negative indexing behavior.
### 🧪 Testing
```bash
pytest test/test_indexing.py -k test_take_along_dim_negative_indices
```
Pull Request resolved: pytorch#152161 Approved by: https://github.com/albanD
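Below is a minimal illustration of the behavior this commit describes, with made-up tensor values; it is only a sketch of Python-style negative-index wrapping in `torch.take_along_dim`, not the PR's own test.
```python
import torch

x = torch.tensor([[10., 20., 30.],
                  [40., 50., 60.]])
idx = torch.tensor([[-1], [-3]])           # negative indices, one per row
out = torch.take_along_dim(x, idx, dim=1)  # -1 -> last column, -3 -> first column
print(out)                                 # tensor([[30.], [40.]])
```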
pytorch#162720) …or FPE. Fixes pytorch#142462 Pull Request resolved: pytorch#162720 Approved by: https://github.com/isuruf
This PR uses `key in dict` expressions for existence checks of dict elements in Python code. This operation is more efficient than `key in dict.keys()`. Pull Request resolved: pytorch#168350 Approved by: https://github.com/albanD
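For illustration, a tiny sketch of the pattern this commit applies (the dict and keys here are made up):
```python
config = {"precision": "ieee", "backend": "math"}

# Preferred: the membership test goes straight to the dict's hash lookup.
if "precision" in config:
    print(config["precision"])

# Avoided: `config.keys()` constructs a keys view first, only to do the same lookup.
if "precision" in config.keys():
    print(config["precision"])
```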
Pull Request resolved: pytorch#169324 Approved by: https://github.com/malfet, https://github.com/yarongmu-google, https://github.com/jansel ghstack dependencies: pytorch#169323
…atio_chain (pytorch#169309) Fixes https://www.internalfb.com/tasks/?t=246834114 Pull Request resolved: pytorch#169309 Approved by: https://github.com/ezyang
…pytorch#169310) Fixes https://www.internalfb.com/tasks/?t=246782196 Pull Request resolved: pytorch#169310 Approved by: https://github.com/williamwen42 ghstack dependencies: pytorch#169309
…se_observed_exception (pytorch#168337)" This reverts commit fb5be22. Reverted pytorch#168337 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail some dynamo tests in trunk ([comment](pytorch#168337 (comment)))
…uards and verbose_code_parts (pytorch#169102)
Fix pytorch#168379.
1. The improved testing validates that the ``___dict_contains`` guards are sorted based on the verbose part. This first solution was also suggested in https://fb.workplace.com/groups/1075192433118967/permalink/1650742858897252/ by sorting ``get_leaf_guards()`` in ``construct_manager_string``.
2. The second solution adopts an ``OrderedSet`` in setGuards during guard construction so that the ``contain_dict`` guards are displayed in the order they were added.
We decided to pursue the second option for its simplicity and to avoid the sorting time overhead.
Pull Request resolved: pytorch#169102 Approved by: https://github.com/anijain2305
Summary: Compress aoti stack (replace full paths with filenames). Test Plan: ``` [nbeloborodov@devgpu031]~/fbsource/fbcode% strobe gpuevent --duration-ms=60000 --collect-kernel-events --kernel-sample-interval=0 --pids 1016951 Running "gpuevent" with run id -4456078642709746 and group_trace_id "" on hosts: ["::1"] Press Ctrl-C to stop the run > Queuing... (00:00:00.001) > Preparing... (00:00:04.055) > Profiling... (00:01:00.383) > Processing... (00:00:00.643) > Logging... (00:00:00.025) > Finished | Host | Return Code | Samples | Result Links | |------|-------------|---------|------------------------------------------------------------| | ::1 | SUCCESS | 4 | Raw samples: | | | | | https://fburl.com/scuba/strobelight_gpu/on_demand/zsglu6sc | | | | | | | | | | Run Details: | | | | | https://fburl.com/scuba/strobelight_runs/hmcuaz8u | ``` Differential Revision: D88005763 Pull Request resolved: pytorch#169291 Approved by: https://github.com/yushangdi
… is defined (pytorch#167496) Fixes pytorch#161660 This extends the `TORCH_STABLE_ONLY` stopgap added in pytorch#161658 Pull Request resolved: pytorch#167496 Approved by: https://github.com/janeyx99, https://github.com/malfet, https://github.com/atalman
Adding reduce_scatter_tensor_out to use in fx passes to efficiently decompose reduce_scatter without concatenation. Pull Request resolved: pytorch#168260 Approved by: https://github.com/wconstab
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see pytorch#165038 and pytorch#143112). Sometimes we have to look these capabilities up in the documentation and then hard-code them. This PR proposes a new unified API for users to check their accelerator capabilities.
# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (set to be supported as default): `fp16`, `int32`, `complex`, etc.
- Other capabilities (need to be discussed)
To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as an input and returns a dictionary containing the capabilities (currently with `supported_dtypes` as the key).
# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
# TODO
- So far, precision is the only capability tracked, to my knowledge. But we can find more capabilities in common, and the API should be designed for good extensibility.
- It will support other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software- or hardware-supported. (By @guangyey)
Pull Request resolved: pytorch#165631 Approved by: https://github.com/guangyey, https://github.com/albanD Co-authored-by: Yu, Guangye <[email protected]> Co-authored-by: Jiawei Li <[email protected]>
…#169239) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#169239 Approved by: https://github.com/jamesjwu, https://github.com/laithsakka
…ytorch#166610) Pull Request resolved: pytorch#166610 Approved by: https://github.com/SherlockNoMad
Pull Request resolved: pytorch#169131 Approved by: https://github.com/anijain2305
…ch#167342) Provides type coverage to torch/_dynamo/variables/nn_module.py Coverage report: `mypy torch/_dynamo/variables/nn_module.py --linecount-report /tmp/coverage_log` Compare before to after - we go from 0 lines and 0 funcs covered to 1378 lines and 31 funcs covered Pull Request resolved: pytorch#167342 Approved by: https://github.com/williamwen42
…den to use ieee rather than tf32
Jenkins build for b996c2e2f44dbc4d952beb09bdb7dbfa84837d03 commit finished as NOT_BUILT
9497fc8 to b996c2e
Jenkins build for b996c2e2f44dbc4d952beb09bdb7dbfa84837d03 commit is in progress
Author: Branch was rebased upstream. Need to do cherry picks.
Overview
This PR adds a new float32 precision API, torch.backends.cuda.math_sdp.fp32_precision, to configure the fp32 precision behavior of SDPBackend.MATH.
Rationale
The test/test_transformers.py test suite calculates the numerical tolerance by comparing output tensors computed at the same precision ("reference") and at higher precision ("golden"), both calculated by SDPBackend.MATH. However, the golden output is calculated with TF32 rather than FP32, which is in fact less accurate than the FA/ME backends if they use IEEE rather than TF32 for their accumulation.
The loss of precision causes false negatives in SDPA tests like TestSDPACudaOnlyCUDA.test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_enable_gqa_True_n_heads1_cuda_float16, at least on the ROCm platform. The false negative disappears after forcing higher_precision_dtype = torch.float64.
Major Changes
To restore the precision of the golden output, a new API, torch.backends.cuda.math_sdp.fp32_precision, is introduced, which allows configuring the "matmul" precision used during SDPBackend.MATH, and a new decorator @math_sdp_precision("ieee") is added to all tests that use check_out_and_grad. Finally, an assert is added to the innermost function _check_equal as a sanity check to ensure math_sdp has the right precision configured for torch.float32 golden tensors.
Known Issues
The backward pass honors the configuration in effect when backward() is called, regardless of the configuration in effect when the graph was created.
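A brief sketch of this known issue (tensors and shapes are assumptions): whichever setting is active when backward() runs is the one the backward pass uses, not the setting that was active when the graph was built.
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 4, 128, 64, device="cuda", requires_grad=True) for _ in range(3))

torch.backends.cuda.math_sdp.fp32_precision = "tf32"
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)  # forward runs with tf32

torch.backends.cuda.math_sdp.fp32_precision = "ieee"
out.sum().backward()  # backward runs with ieee, despite the graph being built under tf32
```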