forked from pytorch/pytorch
[AUTOGENERATED] develop_IFU_20251114 #2806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
pragupta wants to merge 511 commits into develop from develop_IFU_20251114
+41,679 −13,862
Conversation
pytorch#167413) Second attempt for pytorch#167138 with fixes for name conflicts in downstream packages. Should slightly simplify pytorch#166342 Pull Request resolved: pytorch#167413 Approved by: https://github.com/Skylion007
This PR moves some implemented types from typing_extensions to typing due to the recent update to Python 3.10. Pull Request resolved: pytorch#167185 Approved by: https://github.com/janeyx99
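A minimal illustration of the kind of import change this enables (assuming the affected names are ones that landed in the standard `typing` module in Python 3.10, such as `ParamSpec` and `TypeAlias`):

```python
# Before: needed the typing_extensions backport on Python < 3.10
# from typing_extensions import ParamSpec, TypeAlias

# After: with Python 3.10 as the minimum version, import from typing directly
from typing import ParamSpec, TypeAlias

Handler: TypeAlias = dict[str, int]
P = ParamSpec("P")
```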
I see failures like https://github.com/pytorch/pytorch/actions/runs/19189378182/job/54865171317?pr=167389; maybe this will fix it. Pull Request resolved: pytorch#167417 Approved by: https://github.com/yf225
- The logger name in test_fully_shard_logging.py was wrong, so the logs didn't happen.
- The `device` variable in test_fully_shard_logging is expected to be a string, so quote it.
- `unittest.skipIf` is used, so importing `unittest` instead of `unittest.mock` is required.

Pull Request resolved: pytorch#167312 Approved by: https://github.com/Skylion007, https://github.com/cyyever
This PR enables `UP035` rule of ruff. Pull Request resolved: pytorch#167307 Approved by: https://github.com/Lucaskabela
This PR applies new `Union` and `Optional` typing syntax to some files. Pull Request resolved: pytorch#167167 Approved by: https://github.com/XuehaiPan, https://github.com/mlazos
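For context, a small sketch of the syntax change being applied (PEP 604 unions, available once Python 3.10 is the minimum version):

```python
from typing import Optional, Union

# Old style
def scale_old(x: Union[int, float], factor: Optional[float] = None) -> float:
    return float(x) * (factor if factor is not None else 1.0)

# New style: X | Y replaces Union[X, Y], and X | None replaces Optional[X]
def scale_new(x: int | float, factor: float | None = None) -> float:
    return float(x) * (factor if factor is not None else 1.0)
```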
This PR fixes more context manager usage in Python code. Pull Request resolved: pytorch#167404 Approved by: https://github.com/mlazos
…orch#167405)

## Summary

Previously, fake/functionalized tensors that have a `null` storage_ptr could segfault when checking `.expired()` on the weak storage ref, so handle `nullptr` storages separately, without checking their weakrefs.

Diagnosis and PR created by codex: [Codex Task](https://chatgpt.com/codex/tasks/task_e_690ea8790054832f90eaffb37ee0d8c8)

Pull Request resolved: pytorch#167405 Approved by: https://github.com/Skylion007
Summary: This is a reland of pytorch#165036, which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs.

Test Plan:

Inductor test (fbcode): `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"`

Tritonbench (fbcode): `clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`

Tritonbench (oss): `clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`

Unit tests (oss): `clear; python test/inductor/test_cutedsl_grouped_mm.py`

Differential Revision: D86537373

Pull Request resolved: pytorch#167340 Approved by: https://github.com/jananisriram
# Motivation

`torch.cuda.mem_get_info` and `torch.xpu.mem_get_info` are widely used in other popular repos, such as:

- https://github.com/sgl-project/sglang/blob/076313bd099ac1ee484ee77009eaae864eacf396/python/sglang/srt/utils.py#L378
- https://github.com/huggingface/accelerate/blob/7ecc2d7f394fc0686062a18d46128a8bd97c7dad/src/accelerate/utils/modeling.py#L822
- https://github.com/vllm-project/vllm/blob/7ba34b1241ada58f8212f350a8b17382cb412cf2/vllm/worker/worker.py#L150

This PR introduces a unified API `torch.accelerator.get_memory_info` to cover this scenario.

Pull Request resolved: pytorch#156812 Approved by: https://github.com/albanD
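A hedged usage sketch of the unified call, assuming `torch.accelerator.get_memory_info` mirrors the `(free_bytes, total_bytes)` return of the per-backend `mem_get_info` functions:

```python
import torch

# Device-agnostic: queries whichever accelerator backend is active
# (CUDA, XPU, ...) instead of branching on torch.cuda / torch.xpu.
if torch.accelerator.is_available():
    free_bytes, total_bytes = torch.accelerator.get_memory_info()
    print(f"free: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")
```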
Getting some weird failures building CUDA 13; let's stick to what we know works. Pull Request resolved: pytorch#167428 Approved by: https://github.com/jansel
Fixes TestOperatorsXPU.test_data_write_errors_under_transform_xpu (intel/torch-xpu-ops#2237).

Tests on other devices throw the runtime error "_mutating directly with `.data` inside functorch transform is not allowed._", but XPU/HPU fails earlier on `_has_compatible_shallow_copy_type`. This check fails only when calling `tensor.data` inside a functorch call.

```cpp
bool _has_compatible_shallow_copy_type(const Tensor& self, const Tensor& from) {
  return self.unsafeGetTensorImpl()->has_compatible_shallow_copy_type(
      from.key_set());
}
```

### t.data

| Tensor | Device | Dispatch Keys |
|--------|--------|---------------|
| `self` | `xpu` | `XPU, ADInplaceOrView, AutogradXPU, AutocastXPU` |
| `from` | `cpu` | `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU` |

### t.data inside functorch transform

| Tensor | Device | Dispatch Keys |
|--------|--------|---------------|
| `self` | `xpu` | `ADInplaceOrView, AutogradOther, FuncTorchGradWrapper` |
| `from` | `cpu` | `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU, FuncTorchGradWrapper` |

### t.data inside functorch transform + XPU dispatch key

| Tensor | Device | Dispatch Keys |
|--------|--------|---------------|
| `self` | `xpu` | `XPU, ADInplaceOrView, AutogradXPU, AutocastXPU, FuncTorchGradWrapper` |
| `from` | `cpu` | `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU, FuncTorchGradWrapper` |

Pull Request resolved: pytorch#167095 Approved by: https://github.com/guangyey, https://github.com/albanD
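A rough, hedged sketch of the failing pattern (my reconstruction of what the test exercises, not the test's exact code):

```python
import torch
from torch.func import grad

def f(x):
    # Writing through .data inside a functorch transform is the pattern the
    # test exercises.
    x.data = torch.zeros_like(x)
    return x.sum()

try:
    grad(f)(torch.randn(3))  # use device="xpu" on XPU builds; CPU shown here
except RuntimeError as e:
    # Expected on CPU/CUDA: "mutating directly with `.data` inside functorch
    # transform is not allowed"; XPU/HPU used to fail earlier in
    # _has_compatible_shallow_copy_type before this fix.
    print(e)
```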
Fixes pytorch#165130 Pull Request resolved: pytorch#165423 Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/mlazos
Pass `dim_map` to `_requires_data_exchange` and return False if both the spatial and channel dimensions are replicated. Modify `test_conv1d` and `test_conv3d` to check values rather than just shapes, and replicate `conv3d` across the batch dimension. In general, it feels like the current convolution implementation was written to work only if the tensor is sharded across the last dimension. Pull Request resolved: pytorch#167402 Approved by: https://github.com/ezyang
Should be merged after pytorch#166708 Pull Request resolved: pytorch#166711 Approved by: https://github.com/Skylion007, https://github.com/malfet
Sparse sparse mm op implementation Pull Request resolved: pytorch#167013 Approved by: https://github.com/malfet
…ytorch#162564)

# Motivation

Support XPU for `torch.accelerator.get_memory_info`.

Pull Request resolved: pytorch#162564 Approved by: https://github.com/albanD ghstack dependencies: pytorch#156812
…ytorch#166740)

We plan to use `StridedShard` to express `shard_order`. This PR adds the function to support the conversion between `StridedShard` and `shard_order`. I moved some test-related functions into torch/testing/_internal/common_utils.py. For review, we mainly care about **_dtensor_spec.py** and **test_utils.py** in this PR.

### How to convert shard order to StridedShard

Consider the example: placements = $[x_0, x_1, x_2, x_3, x_4]$, where all $x_?$ are sharded on the same tensor dim. Let's see how the shard order impacts the split_factor (sf). We loop from right to left over the placements to construct the split_factor, considering the possible shard orders. Starting from $x_4$: this is a normal shard. Then $x_3$: there are two possibilities. $x_3$'s order can be before $x_4$; if so, $x_3$'s sf=1, because $x_3$ is before $x_4$ in the placements. Otherwise $x_3$'s order is after $x_4$, and $x_3$'s sf should be the mesh dim size of $x_4$, which is $T(x_4)$.

(Figure omitted: diagram of the two cases for $x_3$'s split factor.)

We can use the same method to decide the split factor for $x_2$, $x_1$, and so on.

### How to convert StridedShard to shard order

This follows the same method as above: we check all possible paths and use the real split_factor to see which path matches. If no path matches, the StridedShard cannot be converted to a shard order.

Pull Request resolved: pytorch#166740 Approved by: https://github.com/ezyang
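To make the rule above concrete, here is a small hypothetical helper (names and signature are mine, not from the PR) that computes the split factor of the $i$-th placement from a given shard order and mesh dim sizes, following the right-to-left argument:

```python
def split_factor(i: int, mesh_sizes: list[int], shard_order: list[int]) -> int:
    """Hypothetical illustration of the rule described above.

    mesh_sizes[j] is T(x_j), the mesh dim size of placement x_j, and
    shard_order[k] is the placement index that is applied k-th. The split
    factor of x_i is the product of T(x_j) over placements x_j that come
    after x_i in the placements list but before x_i in the shard order.
    """
    position = {p: k for k, p in enumerate(shard_order)}
    sf = 1
    for j in range(i + 1, len(mesh_sizes)):
        if position[j] < position[i]:
            sf *= mesh_sizes[j]
    return sf


# Example: with placements [x_0, x_1] on mesh sizes [2, 3], sharding x_1
# before x_0 gives x_0 a split factor of 3 (= T(x_1)); the natural order gives 1.
assert split_factor(0, [2, 3], shard_order=[1, 0]) == 3
assert split_factor(0, [2, 3], shard_order=[0, 1]) == 1
```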
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#167452 Approved by: https://github.com/pytorchbot
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#166844 Approved by: https://github.com/pytorchbot
`torch.ao.quantization` and `torch.fx.experimental` (screenshots omitted). Pull Request resolved: pytorch#167334 Approved by: https://github.com/janeyx99
Summary: Fix pytorch#166841. AOTI incorrectly generates a call to aoti_torch_cuda_scatter_reduce_two_out while the op should actually run on CPU. Fix by using the correct device when calling _generate_scatter_fallback in the wrapper codegen. Pull Request resolved: pytorch#167341 Approved by: https://github.com/yushangdi
Part of pytorch#164878. We can start narrowing the skips and remove them as PRs keep landing. This PR just sets up the scaffolding; the fix will be in a follow-up. Pull Request resolved: pytorch#167360 Approved by: https://github.com/janeyx99
My understanding is this is needed for performance. Pull Request resolved: pytorch#167441 Approved by: https://github.com/oulgen
Discovered while enabling assertions on out-of-bounds accesses. Otherwise the test fails with:
```
ERROR: test_sdpa_mask_fp16_L6_S17_NH23_HS121 (__main__.TestSDPA.test_sdpa_mask_fp16_L6_S17_NH23_HS121)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3334, in wrapper
method(*args, **kwargs)
~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/build/../test/test_mps.py", line 9494, in test_sdpa_mask_fp16_L6_S17_NH23_HS121
self._test_sdpa_mask(torch.float16, 7, 17, 23, 121)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/malfet/git/pytorch/pytorch/build/../test/test_mps.py", line 9478, in _test_sdpa_mask
y_ref = F.scaled_dot_product_attention(q.cpu(), k.cpu(), v.cpu(), attn_mask=mask.cpu(), dropout_p=0.0, is_causal=False)
~~~~~^^
torch.AcceleratorError: index out of range
```
Pull Request resolved: pytorch#167444
Approved by: https://github.com/Skylion007, https://github.com/manuelcandales
Fixes pytorch#166918. The output may not be on the same device as the predicate. Repro: `python test/inductor/test_control_flow.py -k test_output_on_different_device` Pull Request resolved: pytorch#167354 Approved by: https://github.com/ydwu4, https://github.com/zou3519
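A hedged sketch of the shape of this scenario (my own minimal repro under the assumption that the predicate lives on CPU while the branch outputs live on an accelerator; not the test's exact code):

```python
import torch

def true_fn(x):
    return x + 1

def false_fn(x):
    return x - 1

@torch.compile
def f(pred, x):
    # pred is a CPU scalar tensor; the branch outputs live on the GPU.
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4, device="cuda")
pred = torch.tensor(True)  # CPU predicate
print(f(pred, x))
```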
…167330) See pytorch/test-infra#7446 for the paths Pull Request resolved: pytorch#167330 Approved by: https://github.com/huydhn
Add a "--virtual-local-rank" mode to torchrun. When used instead of passing the local rank in LOCAL_RANK it uses a LOCAL_RANK of "0" and adjusts CUDA_VISIBLE_DEVICES to reflect the desired GPU index. Testing: (tweaked run_train.sh to use `--log-dir`) ``` export NGPU=8 export CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 ``` And then comparing ranks: Without --virtual-local-rank gives a lot of differences like: ``` [rank#]: mul_1: "f32[8, 512, 256]" = torch.ops.aten.mul.Tensor(mul, view_9); mul = None -[rank#]: _to_copy_3: "bf16[8, 512, 256]" = torch.ops.aten._to_copy.default(mul_1, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0)); mul_1 = None +[rank#]: _to_copy_3: "bf16[8, 512, 256]" = torch.ops.aten._to_copy.default(mul_1, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=1)); mul_1 = None [rank#]: detach: "f32[8, 512, 1]" = torch.ops.aten.detach.default(rsqrt); rsqrt = None ``` With --virtual-local-rank makes those differences go away. Pull Request resolved: pytorch#166680 Approved by: https://github.com/ezyang
… unblock production (pytorch#167443)

Summary: pytorch#166609 updates `node.is_impure` to consider a submodule as impure if the submodule contains an impure node. This in turn changes the behavior of `graph.eliminate_dead_code()`, which does not eliminate nodes with side effects; see the [pytorch documentation](https://docs.pytorch.org/docs/stable/fx.html#torch.fx.Graph.eliminate_dead_code):

> Remove all dead code from the graph, based on each node's number of users, and whether the nodes have any side effects.

While it is correct that a submodule containing side-effectful ops is side-effectful and should not be dead-code eliminated, some customers rely on dead code elimination to remove submodules that contain impure ops, which was the behavior before the pytorch#166609 fix. Due to production environment constraints, we have to revert pytorch#166609 and move the side-effectful-submodule check to `const_fold.py`, which will correctly **not** const-fold a submodule that contains impure ops.

NOTE: other call sites that use `node.is_impure()` to make decisions are still incorrectly eliminating side-effectful submodules, but we can't safely change that today.

## This PR

- Move `_subgraph_has_impure_op` into `fx/experimental/const_fold.py`; check and prevent const-folding an impure submodule.
- Added a note in `node.is_impure` to highlight the incorrect behavior and context in case people go looking in the future.

Test Plan: run test_fx_const_fold; all tests pass.

Differential Revision: D86641994

Pull Request resolved: pytorch#167443 Approved by: https://github.com/jfix71
Fixes pytorch#167037 Move the module definition outside of the unit test so when we run the unit test multiple times, the module is not re-compiled. Pull Request resolved: pytorch#167268 Approved by: https://github.com/angelayi
Failing test was `pytest test/export/test_export.py -k test_python_asserts_with_sym_int` Pull Request resolved: pytorch#167700 Approved by: https://github.com/bobrenjc93 ghstack dependencies: pytorch#167382, pytorch#167383, pytorch#167384, pytorch#167387, pytorch#167396, pytorch#167669
Currently, conv1d converts the 3D view to 4D before calling onednn::convolution(). However, this function converts the 4D tensor to a channels-last memory format for computation, resulting in incorrect return results (the correct result should be channels-first). This PR fixes this issue, ensuring that the output memory format is consistent with the expected format. Pull Request resolved: pytorch#162944 Approved by: https://github.com/EikanWang
Summary: Adds documentation for EventList, FunctionEvent and FunctionEventAvg. Closes pytorch#165907 Test Plan: N/A Documentation Differential Revision: D86913697 Pull Request resolved: pytorch#167688 Approved by: https://github.com/sanrise
## MOTIVATION

To generalize distributed test cases for non-CUDA devices.

## CHANGES

- Replaced hard-coded devices/backends with torch.accelerator.current_accelerator() and dist.get_default_backend_for_device.
- Use DistributedTestBase instead of MultiProcessTestCase to use common utilities.
- Remove instantiate_device_tests and make use of torch.accelerator.current_accelerator for test/distributed/test_c10d_object_collectives.py.
- Fix the deterministic context issue for non-CUDA devices in test/distributed/optim/test_zero_redundancy_optimizer.py.
- Use torch.accelerator.device_count() for the multi-GPU check in torch/testing/_internal/distributed/_tensor/common_dtensor.py.

Pull Request resolved: pytorch#165067 Approved by: https://github.com/guangyey, https://github.com/albanD
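A hedged sketch of the device-agnostic pattern the changes above move the tests to (assuming the `torch.accelerator` and `dist.get_default_backend_for_device` APIs named in the commit message):

```python
import torch
import torch.distributed as dist

# Pick whichever accelerator is present (cuda, xpu, ...) instead of
# hard-coding "cuda"/"nccl" in the test.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
backend = dist.get_default_backend_for_device(device_type)

world_size = max(torch.accelerator.device_count(), 1)
print(device_type, backend, world_size)
```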
…o "original_aten" node meta (pytorch#167749) Fixes pytorch#167706 - Add `torch.fx.experimental.proxy_tensor.set_original_aten_op()` around flex_atention HOP dispatch so we have `original_aten` populated for flex_attention - Update the usages of `original_aten` to also expect HOP in addition to OpOverload Pull Request resolved: pytorch#167749 Approved by: https://github.com/drisspg
) Summary: Autovectorization of casting to bfloat16_t is broken in clang-[17, 20] and fixed in clang-21. We are adding workaround vectorized code, which improves conversion speed from smaller int data types. We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

Before:
- uint8->bfloat16_t ===> 319.433us
- int8->bfloat16_t ===> 320.216us
- int16->bfloat16_t ===> 326.899us
- int32->bfloat16_t ===> 327.925us

After:
- uint8->bfloat16_t ===> 185.189us -----> 72% higher throughput
- int8->bfloat16_t ===> 169.790us -----> 89% higher throughput
- int16->bfloat16_t ===> 180.744us -----> 81% higher throughput
- int32->bfloat16_t ===> 185.129us -----> 77% higher throughput

Test Plan:

Correctness: buck2 test mode/opt //caffe2/test:test_ops and buck2 test mode/opt //caffe2/test:torch

Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86207189

Pull Request resolved: pytorch#166958 Approved by: https://github.com/mcfi
Fixes pytorch#150477

### Summary

- Added frame information (function name, file, line number) to all graph break/skip messages.
- Standardized message format: "torch.compile will skip tracing the frame <name> (<file> line <N>) and fall back to eager. Reason: <reason>"

### Impacts

module: dynamo

Pull Request resolved: pytorch#167067 Approved by: https://github.com/williamwen42
… and add focused documentation (pytorch#165897)

## Summary

This PR enriches the OpenReg device management code and adds focused documentation.

## Key Changes

- Introduced device management documentation in `device.md`.
- Updated `OpenRegFunctions.h` and `OpenRegFunctions.cpp` to use `DeviceIndex` and added error handling.
- Implemented the `check_device_index` function for validating device indices.
- Enhanced Python bindings in `Module.cpp` for device management.
- Added tests for invalid device index handling in `test_device.py`.

Pull Request resolved: pytorch#165897 Approved by: https://github.com/fffrog
…ytorch#166573)

We need to track all symbols. We used to skip `u = item()` and fail with:

```
File "/home/lsakka/pytorch10/pytorch/torch/fx/passes/_tensorify_python_scalars.py", line 149, in _sympy_interp
    expr_to_sym_proxy[expr]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: KeyError: u0
```

Pull Request resolved: pytorch#166573 Approved by: https://github.com/bobrenjc93
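A hedged sketch of the kind of program that produces an unbacked symbol like `u0` from `.item()` (my own minimal shape for illustration, not the PR's test case):

```python
import torch

# Allow .item() results to be captured as unbacked symbols during tracing.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(backend="inductor")
def f(x, y):
    u = y.item()  # creates an unbacked symbol (e.g. u0)
    return x * u

print(f(torch.randn(4), torch.tensor(3.0)))
```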
To support the use case in pytorch/helion#1122, i.e.

```
@helion.kernel
def foo(
    x: Tensor,
    group_name: str
):
    x_remotes = torch.ops.symm_mem.get_remote_tensors(x, group_name)
    for t in x_remotes:
        ...
```

Helion uses fake tensors to trace a program, thus we cannot use the following code in a Helion function:

```
hdl = rendezvous(tensor)
remote_tensors = tuple(
    hdl.get_remote_tensor(peer, ...) for peer in range(world_size)
)
```

The reason is that when `tensor` is fake, the returned `hdl` is None, so any subsequent call on it will fail. This PR wraps the above functionality as an op:

```
lib.define("get_remote_tensors(Tensor x, str group_name) -> Tensor[]")
```

so that things like `hdl` are not exposed to Helion. The op also provides a `meta` implementation so that Helion can trace it without actually running the rendezvous.

Pull Request resolved: pytorch#167779 Approved by: https://github.com/yf225
Differential Revision: D86685546 Pull Request resolved: pytorch#167481 Approved by: https://github.com/eellison
Pull Request resolved: pytorch#167198 Approved by: https://github.com/bobrenjc93
This reverts commit c78e646. Reverted pytorch#167481 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167481 (comment)))
…h#165978)

This PR implements `scaled_mm` for XPU. It enables the following data types:

1. TensorWise scaling: `fp8_e4m3` and `fp8_e5m2`
2. RowWise scaling: `fp8_e4m3` and `fp8_e5m2`

It leaves BlockWise scaling to the next PR, so that each PR needs less reviewing effort. This first PR only adds `scaled_mm_xpu` but does not register it; we separate this out to keep the review small. Secondly, there is a `scaled_mm_v2` API in pytorch#164141; we will align with it once v1 is cleaned up.

**Co-authors:** @yuchengliu1, @carsonwang

## PR stack

- -> pytorch#165978: implementation of XPU scaled_mm and oneDNN kernel
- pytorch#167518: implementation of XPU scaled_mm_v2
- pytorch#166056: op registration

## Test Status

1. Relies on the changes in intel/torch-xpu-ops#1746; otherwise the op will fall back to CPU.
2. This PR does not include tests; the tests are enabled in pytorch#166056.

## Credit

This work is based on @yuchengliu1's work at pytorch#140972. The purpose of creating a new PR is to align the API/checks with CUDA, so there will be less porting effort.

## FP8 task tracker

We will track all the scaled_mm related tasks in pytorch#167170.

Pull Request resolved: pytorch#165978 Approved by: https://github.com/liangan1, https://github.com/EikanWang Co-authored-by: Eikan Wang <[email protected]>
)" This reverts commit 50bf1f0. Reverted pytorch#167198 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#167198 (comment)))
…ytorch#164729)

Fixes pytorch#163374. Here is the output from the repro code:

```
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811]
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1006 09:09:26.329000 2457 /home/fedora/github/pytorch/torch/distributed/run.py:811] *****************************************
aten::clamp_(dt: f32[][R], None, 2)
redistribute_input(0, [P] -> [R])
redistribute_input(t: f32[], [P] -> [R])
_c10d_functional::all_reduce(t: f32[], sum, 0)
_c10d_functional::wait_tensor(t: f32[])
aten::clamp_(t: f32[], None, 2)
aten::view(t: f32[], [])
(Replicate(),)
tensor(2., device='cuda:0')
```

The behavior now matches what was expected in issue pytorch#163374.

Expected behavior (from the issue):
1. The placement should change from Partial(sum) to Replicate().
2. The value should be tensor(2.) instead of tensor(144.).

Actual output from this build:
1. (Replicate(),), so the placement is correct.
2. tensor(2., device='cuda:0'), so the value is correct.

So the in-place operation now properly redistributes the partial DTensor to replicate before performing the clamp and maintains the correct aliasing semantics. It also produces the expected clamped value.

Pull Request resolved: pytorch#164729 Approved by: https://github.com/ezyang
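A hedged sketch of the pattern being fixed (my own minimal repro shape using the public DTensor API, meant to be launched under torchrun; not the issue's exact script):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Partial

dist.init_process_group()  # expects torchrun-style env vars
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# A Partial(sum) DTensor: the logical value is the sum of the local pieces.
local = torch.full((), 12.0, device="cuda")
dt = DTensor.from_local(local, mesh, [Partial()])

# The in-place clamp must first redistribute Partial -> Replicate; otherwise
# each rank clamps its local piece and the summed value comes out wrong.
dt.clamp_(max=2)
print(dt.placements, dt.full_tensor())
```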
This PR adds an sm_121a flag for row-wise scaled matmuls on DGX Spark. Pull Request resolved: pytorch#167734 Approved by: https://github.com/eqy, https://github.com/cyyever
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#167338 Approved by: https://github.com/jamesjwu
This PR adds a basic spin configuration to allow for linting. It is designed as a drop-in replacement for the current Makefile-based solution, i.e. it sets up and updates lintrunner based on the hashes of certain configuration files. Lintrunner is called via uv's `uvx` command, separating its environment from the general development environment in an effort to reduce instances of competing requirements breaking environments. Pull Request resolved: pytorch#167226 Approved by: https://github.com/atalman, https://github.com/albanD
…sLtWorkspace" (pytorch#167722) Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes. This diff adds mutexes to synchronize access to the static maps. Note: this is a re-land of D86316117 / pytorch#167248 (see comments for details) Test Plan: Use a GPU OD, run multi-threaded tests (cuda_cublas_handle_pool_test) with TSAN: ``` buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100 ``` https://www.internalfb.com/intern/testinfra/testrun/14355223937501118 TSAN output (before synchronization was added): P2026731804 Differential Revision: D86964261 Pull Request resolved: pytorch#167722 Approved by: https://github.com/malfet
# Conflicts:
#   .ci/docker/ci_commit_pins/triton.txt
#   requirements.txt
Jenkins build for b012f56ca2fd5edb6431a6e296a6006a9a9036fc commit finished as NOT_BUILT

Jenkins build for b012f56ca2fd5edb6431a6e296a6006a9a9036fc commit finished as FAILURE

b012f56 to 2903e7a (Compare)

Jenkins build for 2903e7a671b6e093b002fbb9442adfb5914018b3 commit finished as FAILURE
rocm_base: 3d74218
Wheel Generated: http://rocm-ci.amd.com/view/preview/job/pytorch-latest-manylinux-wheels-preview/195/