
Conversation

@castigli commented Sep 4, 2025

This PR aims to fix the nvdsl examples, which got out of sync because they are not tested in CI.

The fixed bugs were related to the following PRs:

  • move to nanobind #118583
  • split gpu module initialization #135478

github-actions bot commented Sep 4, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR, i.e. adding a comment that says "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@castigli changed the title from "Fix nvdsl examples" to "[NVGPU] Fix nvdsl examples" on Sep 4, 2025
@Wolfram70 (Contributor)

Thanks for bringing this to our attention!

I looked into this a bit, and it does look like the crash is occurring in the --gpu-module-to-binary pass.
For Ch4.py, dumping the .mlir file and extracting the LLVM IR generated during --gpu-module-to-binary gives
extracted-llvmir.txt. Running llc -mtriple=nvptx64 -mcpu=sm_90a -mattr=+ptx80 on it reproduces the crash exactly, so it seems to be an issue during codegen. I am not sure why this occurs in this specific case.
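
For anyone reproducing locally, a small driver along the lines of the steps above (a sketch only; the file names assume the attached extracted-llvmir.txt and the before.mlir from the stack dump below are in the current directory):

import subprocess

# Lowering the dumped MLIR with --gpu-module-to-binary triggers the crash.
subprocess.run(["mlir-opt", "before.mlir", "--gpu-module-to-binary"], check=True)

# Equivalently, reproduce straight from the LLVM IR extracted during that pass.
subprocess.run(
    ["llc", "-mtriple=nvptx64", "-mcpu=sm_90a", "-mattr=+ptx80", "extracted-llvmir.txt"],
    check=True,
)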

@durga4github @abhilash1910 Do you have any idea why this might be happening?

@abhilash1910 (Contributor) commented Sep 5, 2025

Taking a look at the codegen. Thanks for highlighting. The IR does not seem incorrect at first glance, though.
Edit: Fix is in progress.

durga4github pushed a commit that referenced this pull request Sep 24, 2025
Context: Highlighted from #156830, this is an ISel lowering issue in the NVPTX backend for the prefetch.tensormap intrinsic.

It is caused by an unchecked pattern rewrite during the infer-address-space pass. This intrinsic is valid only for the const, param, and generic address spaces; any other address space is invalid. Currently, the intrinsic gets falsely rewritten to target AS(1) when its pointer argument comes in as an argument of a kernel function.

So, this patch adds a check for the correct address spaces before rewriting them.

cc @durga4github 

FYI: @Wolfram70 @rupprecht  @castigli
@castigli marked this pull request as ready for review October 2, 2025 09:36
@castigli requested a review from grypp as a code owner October 2, 2025 09:36
@llvmbot (Member) commented Oct 2, 2025

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-nvgpu

@llvm/pr-subscribers-mlir-gpu

Author: Giacomo Castiglioni (castigli)

Changes

This PR aims to fix the nvdsl examples, which got out of sync because they are not tested in CI.

The fixed bugs were related to the following PRs:

  • move to nanobind #118583
  • split gpu module initialization #135478
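
For context on the second item above, a minimal sketch of the JIT flow after the initialization split, as I read the change (the explicit initialize() call mirrors the nvgpucompiler.py fix in the diff below):

from mlir import execution_engine, ir


def make_engine(module: ir.Module) -> execution_engine.ExecutionEngine:
    # Assumption drawn from the fix below: construction alone no longer runs
    # module initialization, so initialize() must be called before invocation.
    ee = execution_engine.ExecutionEngine(module, opt_level=3, shared_libs=[])
    ee.initialize()
    return ee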

There is one remaining bug, which I think #153134 introduced. When running Ch4 and Ch5, the nvvm.prefetch.tensormap intrinsic leads to the following error on sm_90a:

LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.prefetch.tensormap
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: mlir-opt before.mlir --gpu-module-to-binary
1.	Running pass 'Function Pass Manager' on module 'LLVMDialectModule'.
2.	Running pass 'NVPTX DAG->DAG Pattern Instruction Selection' on function '@gemm_multistage_kernel'
...

Perhaps @Wolfram70 or @grypp could help me out with the last bug? Could the solution be to momentarily revert to inline PTX?
[edit] This was resolved in #159253.


Full diff: https://github.com/llvm/llvm-project/pull/156830.diff

3 Files Affected:

  • (modified) mlir/test/Examples/NVGPU/Ch5.py (+1-1)
  • (modified) mlir/test/Examples/NVGPU/tools/nvdsl.py (+3-4)
  • (modified) mlir/test/Examples/NVGPU/tools/nvgpucompiler.py (+3-1)
diff --git a/mlir/test/Examples/NVGPU/Ch5.py b/mlir/test/Examples/NVGPU/Ch5.py
index f98cfd758a75f..91c346c837dda 100644
--- a/mlir/test/Examples/NVGPU/Ch5.py
+++ b/mlir/test/Examples/NVGPU/Ch5.py
@@ -156,7 +156,7 @@ def producer_loop(
 ):
     phase = const(True, ty=T.bool())
 
-    for iv, phase in scf.for_(0, (K // TILE_K), 1, [phase]):
+    for iv, phase, _ in scf.for_(0, (K // TILE_K), 1, [phase]):
         stage = iv % num_stages
         # Wait MMA to be done
         mbar_mma[stage].try_wait(phase)
diff --git a/mlir/test/Examples/NVGPU/tools/nvdsl.py b/mlir/test/Examples/NVGPU/tools/nvdsl.py
index 90dbb2355e1c8..d4c50fc9bc28d 100644
--- a/mlir/test/Examples/NVGPU/tools/nvdsl.py
+++ b/mlir/test/Examples/NVGPU/tools/nvdsl.py
@@ -84,8 +84,7 @@ def arrive(self, txcount: int = 0, predicate=None):
                 self.mbar_group_op, txcount_op, self.id_op, predicate=predicate
             )
         else:
-            nvgpu.mbarrier_arrive(
-                ir.Type.parse("!nvgpu.mbarrier.token"), self.mbar_group_op, self.id_op
+            nvgpu.mbarrier_arrive(self.mbar_group_op, self.id_op
             )
 
     def try_wait(self, phase: bool = False, ticks: int = 10000000):
@@ -144,7 +143,7 @@ def create_descriptor(self, device_ptr):
             device_ptr,
         )
         self.tma_descriptor = nvgpu.TmaCreateDescriptorOp(
-            tma_descriptor_ty, device_unranked_memref, map(const, self.tma_box_shape)
+            tma_descriptor_ty, device_unranked_memref, list(map(const, self.tma_box_shape))
         )
         return self.tma_descriptor.result
 
@@ -156,7 +155,7 @@ def load(self, dest, mbarrier: Mbarriers, coords=[0], predicate=None):
             dest,
             mbarrier.mbar_group_op,
             self.tma_descriptor,
-            coordinates=map(const, coords),
+            coordinates=list(map(const, coords)),
             mbarId=mbarrier.id_op,
             predicate=predicate,
         )
diff --git a/mlir/test/Examples/NVGPU/tools/nvgpucompiler.py b/mlir/test/Examples/NVGPU/tools/nvgpucompiler.py
index 1c9cc74fcd169..4b661f8df6a9f 100644
--- a/mlir/test/Examples/NVGPU/tools/nvgpucompiler.py
+++ b/mlir/test/Examples/NVGPU/tools/nvgpucompiler.py
@@ -35,9 +35,11 @@ def compile(self, module: ir.Module):
 
     def jit(self, module: ir.Module) -> execution_engine.ExecutionEngine:
         """Wraps the module in a JIT execution engine."""
-        return execution_engine.ExecutionEngine(
+        ee = execution_engine.ExecutionEngine(
             module, opt_level=self.opt_level, shared_libs=self.shared_libs
         )
+        ee.initialize()
+        return ee
 
     def compile_and_jit(self, module: ir.Module) -> execution_engine.ExecutionEngine:
         """Compiles and jits the module."""

@durga4github (Contributor)

@castigli , I updated the commit-msg since the prefetch.tensormap issue is resolved.

Could you please rebase and push once? I can initiate the workflows to run, to get CI results.

@castigli (Author) commented Oct 8, 2025

@castigli , I updated the commit-msg since the prefetch.tensormap issue is resolved.

Could you please rebase and push once? I can initiate the workflows to run, to get CI results.

Done!

@durga4github (Contributor)

@castigli , I updated the commit-msg since the prefetch.tensormap issue is resolved.
Could you please rebase and push once? I can initiate the workflows to run, to get CI results.

Done!

Thanks, initiated the CI.

github-actions bot commented Oct 8, 2025

⚠️ Python code formatter, darker found issues in your code. ⚠️

You can test this locally with the following command:
darker --check --diff -r origin/main...HEAD mlir/test/Examples/NVGPU/Ch5.py mlir/test/Examples/NVGPU/tools/nvdsl.py mlir/test/Examples/NVGPU/tools/nvgpucompiler.py

⚠️ The reproduction instructions above might return results for more than one PR in a stack if you are using a stacked PR workflow. You can limit the results by changing origin/main to the base branch/commit you want to compare against.

View the diff from darker here.
--- tools/nvdsl.py	2025-10-08 11:51:06.000000 +0000
+++ tools/nvdsl.py	2025-10-08 13:01:05.815271 +0000
@@ -82,12 +82,11 @@
             txcount_op = const(txcount)
             nvgpu.mbarrier_arrive_expect_tx(
                 self.mbar_group_op, txcount_op, self.id_op, predicate=predicate
             )
         else:
-            nvgpu.mbarrier_arrive(self.mbar_group_op, self.id_op
-            )
+            nvgpu.mbarrier_arrive(self.mbar_group_op, self.id_op)
 
     def try_wait(self, phase: bool = False, ticks: int = 10000000):
         ticks_op = const(ticks)
         phase_op = const(phase, T.bool())
         nvgpu.MBarrierTryWaitParityOp(
@@ -141,11 +140,13 @@
                 self.memref_ty.element_type, self.memref_ty.memory_space
             ),
             device_ptr,
         )
         self.tma_descriptor = nvgpu.TmaCreateDescriptorOp(
-            tma_descriptor_ty, device_unranked_memref, list(map(const, self.tma_box_shape))
+            tma_descriptor_ty,
+            device_unranked_memref,
+            list(map(const, self.tma_box_shape)),
         )
         return self.tma_descriptor.result
 
     def prefetch(self, predicate=None):
         nvgpu.tma_prefetch_descriptor(self.tma_descriptor, predicate=predicate)

@durga4github (Contributor)

@grypp , Kindly take a look when you get a chance.

@grypp (Member) left a comment

Thanks for fixing these

@grypp (Member) commented Oct 8, 2025

I feel like we should enable LIT testing for these tests without running them, so that at least they get compiled.

@castigli (Author) commented Oct 8, 2025

I feel like we should enable LIT testing for these tests without running them, so that at least they get compiled.

In principle I agree, but I don't have a good way to support both compile-only and compile-and-run without mucking up the code too much.
What about something like:

...
# RUN: env MLIR_RUN_CUDA_SM90_TESTS=%mlir_run_cuda_sm90_tests
...
if run_if_cuda_sm90_enabled(lambda: saxpy(x, y, alpha)) is None:
    #  4. Verify MLIR with reference computation
    ref = np.ones((M, N), np.float32)
    ref += x * alpha
    np.testing.assert_allclose(y, ref, rtol=5e-03, atol=1e-01)
print("PASS")
# CHECK: PASS

with the util defined as

import os


def run_if_cuda_sm90_enabled(func, *args, **kwargs):
    """Execute a function if CUDA SM90 tests are enabled, otherwise print a warning."""
    mlir_run_cuda_sm90_tests = os.getenv("MLIR_RUN_CUDA_SM90_TESTS")
    # Run by default (variable unset) or when the substitution expands to "1".
    if mlir_run_cuda_sm90_tests == "1" or mlir_run_cuda_sm90_tests is None:
        return func(*args, **kwargs)
    else:
        # Non-None sentinel so callers can tell a skip from an actual run.
        print("warning: skipping test execution")
        return -1
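
For the RUN line above to mean anything, lit would also have to expand %mlir_run_cuda_sm90_tests. A hypothetical wiring in the lit site configuration (the config attribute name is an assumption, modeled on the existing MLIR_RUN_*_TESTS options; not an existing API):

# lit.site.cfg.py sketch (hypothetical): map the CMake-level switch to the
# substitution used in the RUN line, so execution can be toggled per site.
config.substitutions.append(
    ("%mlir_run_cuda_sm90_tests", "1" if config.run_cuda_sm90_tests else "0")
)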
