
Conversation

@liqiangxl
Collaborator

@liqiangxl liqiangxl commented Feb 3, 2026

Two extensions to the transpose benchmark in benchmarks/python/test_transpose.py:

(1) Adds coverage for copy vs. view transpose
Previously, we only exercised view transpose, which returns a non-contiguous tensor and is handled by the pointwise scheduler. As a result, the transpose scheduler was never actually exercised.
This PR adds .contiguous() to enforce a contiguous output layout, which triggers a copy-based transpose.
For manually defined fusions, a segment_set was added so that the pre-segmentation pass (AllocationDomainPass) cannot change the transpose output layout, ensuring the copy-transpose path is taken.
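
To make the intent concrete, the output branch of the manually defined fusion looks roughly like the sketch below (only the fd.ops.segment_set / fd.add_output calls are taken from the diff quoted in the reviewer guide further down; the surrounding add/permute/relu body is elided):

    # Sketch of the fusion's output branch; T9 is the final (transposed) tensor.
    if is_copy_transpose:
        # segment_set keeps the pre-segmentation AllocationDomainPass from turning
        # the output into a view of the input, so the copy path (and hence the
        # transpose scheduler) is exercised.
        T10 = fd.ops.segment_set(T9)
        fd.add_output(T10)
    else:
        # View transpose: the output keeps the input's allocation domain and the
        # pointwise scheduler handles the fusion.
        fd.add_output(T9)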

For view transpose, the output has an allocation domain of (iS11{i0}, iS10{i1}), which is the same as the input's:

T5_g_float[iS10{i1}, iS11{i0}]
 logical domain : (iS10{i1}, iS11{i0})
 allocation domain : (iS11{i0}, iS10{i1})
 contiguity: t t
 loop domain : (iS10{i1}, iS11{i0})

Final fusion is:

Segmented_Fusion{ 
groups: 
  pointwise{0, 1, 2, 3}
edges: 

group details:
g{(pointwise)
group id: 0
inputs:
  T0_g_float[iS0{i0}, iS1{i1}] float
  T1_g_float[iS12{i0}, iS13{i1}] float
outputs:
  T5_g_float[iS10{i1}, iS11{i0}] float


T2_l_float[iS4{i0}, iS5{i1}]
   = T0_g_float[iS0{i0}, iS1{i1}]
   + T1_g_float[iS12{i0}, iS13{i1}];
(0)
T3_l_float[iS7{i1}, iS6{i0}]
   = Set.Permute( T2_l_float[iS4{i0}, iS5{i1}], cache_op=Streaming )
(1)
T4_l_bool[iS8{i1}, iS9{i0}]
   = T3_l_float[iS7{i1}, iS6{i0}]
   > double(0);
(2)
T5_g_float[iS10{i1}, iS11{i0}]
   = where(T4_l_bool[iS8{i1}, iS9{i0}]
  , T3_l_float[iS7{i1}, iS6{i0}]
  , double(0));
(3)
}

} //Segmented_Fusion

For copy transpose, the output is T6, which has a transposed allocation domain of (iS12{i1}, iS13{i0}):

T6_g_float[iS12{i1}, iS13{i0}]
   = SegmenterSet( T5_l_float[iS10{i1}, iS11{i0}] )

T5_l_float[iS10{i1}, iS11{i0}]
 logical domain : (iS10{i1}, iS11{i0})
 contiguity: t t
 loop domain : (iS10{i1}, iS11{i0})
T6_g_float[iS12{i1}, iS13{i0}]
 logical domain : (iS12{i1}, iS13{i0})
 allocation domain : (iS12{i1}, iS13{i0})
 contiguity: t t
 loop domain : (iS12{i1}, iS13{i0})

Final fusion is:

Segmented_Fusion{ 
groups: 
  transpose{0, 1, 2, 3, 4}
edges: 

group details:
g{(transpose)
group id: 0
inputs:
  T0_g_float[iS0{i0}, iS1{i1}] float
  T1_g_float[iS14{i0}, iS15{i1}] float
outputs:
  T6_g_float[iS12{i1}, iS13{i0}] float


T2_l_float[iS4{i0}, iS5{i1}]
   = T0_g_float[iS0{i0}, iS1{i1}]
   + T1_g_float[iS14{i0}, iS15{i1}];
(0)
T3_g_float[iS7{i1}, iS6{i0}]
   = Set.Permute( T2_l_float[iS4{i0}, iS5{i1}], cache_op=Streaming )
(1)
T4_g_bool[iS8{i1}, iS9{i0}]
   = T3_g_float[iS7{i1}, iS6{i0}]
   > double(0);
(2)
T5_g_float[iS10{i1}, iS11{i0}]
   = where(T4_g_bool[iS8{i1}, iS9{i0}]
  , T3_g_float[iS7{i1}, iS6{i0}]
  , double(0));
(3)
T6_g_float[iS12{i1}, iS13{i0}]
   = SegmenterSet( T5_g_float[iS10{i1}, iS11{i0}] )
(4)
}

} //Segmented_Fusion

(2) Generalizes fusion input ranks to include 2D
Previously, fusion inputs were limited to 3D shapes, with roughly 100 test cases per data type. This PR expands coverage to include 2D input shapes as well.
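
The parameter generation described above could be sketched as follows (a hedged sketch: the helper name _generate_transpose_params and the per-rank axis choices come from the review summary below, while drawing the sizes from generate_input_sizes is an assumption):

    # Hypothetical sketch of _generate_transpose_params: yields (size, axes, dims)
    # tuples covering both 2D and 3D inputs.
    def _generate_transpose_params():
        params = []
        for dims in (2, 3):
            # 2D inputs have a single transpose choice; 3D inputs cover all axis pairs.
            axes_choices = [(0, 1)] if dims == 2 else [(0, 1), (0, 2), (1, 2)]
            for size in generate_input_sizes(dims=dims):  # assumed size source
                for axes in axes_choices:
                    params.append((size, axes, dims))
        return params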

@github-actions

github-actions bot commented Feb 3, 2026

Review updated until commit af22dc7

Description

  • Add support for 2D and 3D input tensors in transpose benchmark

  • Implement copy vs view transpose testing with contiguous() calls

  • Add segment_set to fusion definition for copy transpose path

  • Update test parametrization to cover both transpose modes

Changes walkthrough

Relevant files
Enhancement
test_transpose.py
Extend transpose benchmark with 2D inputs and copy transpose

benchmarks/python/test_transpose.py

  • Modified transpose_fusion function to support dynamic rank and copy
    transpose mode
  • Added segment_set logic to enforce copy transpose path when needed
  • Updated transpose_fwd_fn to handle contiguous() calls for copy
    transpose (see the sketch below)
  • Added _generate_transpose_params for 2D/3D parameter generation
  • Enhanced test parametrization with is_copy_transpose parameter
  • Updated both nvFuser and baseline benchmarks to test both modes
  • +65/-22 
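
On the eager/baseline side, the copy-transpose variant of transpose_fwd_fn could look roughly like this (a hedged sketch based on the Add + Transpose + ReLU flow in the sequence diagram below; the argument names are assumptions):

    import torch

    # Hypothetical sketch of transpose_fwd_fn: add, transpose, relu, and optionally
    # materialize a contiguous copy so the baseline matches the copy-transpose fusion.
    def transpose_fwd_fn(input1, input2, axes, is_copy_transpose):
        out = torch.relu(torch.transpose(input1 + input2, axes[0], axes[1]))
        if is_copy_transpose:
            out = out.contiguous()  # forces a real copy instead of a strided view
        return out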

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Parameter Integration

    The new is_copy_transpose and rank parameters are properly integrated into the test framework, but verify that all test combinations (2D/3D × copy/view transpose) are correctly generated and executed without parameter conflicts.

    @pytest.mark.parametrize("size,axes,dims", _generate_transpose_params())
    @pytest.mark.parametrize("dtype", FLOAT_DTYPES)
    @pytest.mark.parametrize(
        "is_copy_transpose",
        [True, False],
        ids=["copy_transpose", "view_transpose"],
    )
    Segment Set Logic

    The segment_set operation is added only for copy transpose to enforce the transpose scheduler. Ensure this logic correctly distinguishes between scenarios where transpose scheduler vs pointwise scheduler should be used, and that the segment_set doesn't interfere with other operations.

    if is_copy_transpose:
        T10 = fd.ops.segment_set(T9)
        fd.add_output(T10)
    else:
        fd.add_output(T9)

    @liqiangxl liqiangxl marked this pull request as ready for review February 3, 2026 15:26
    @liqiangxl liqiangxl requested review from Priya2698 and naoyam February 3, 2026 15:26
    @greptile-apps
    Contributor

    greptile-apps bot commented Feb 3, 2026

    Greptile Overview

    Greptile Summary

    Extended transpose benchmark to test both copy and view transpose operations, and added 2D input coverage alongside existing 3D inputs.

    Key Changes:

    • Added is_copy_transpose parameter to toggle between copy (contiguous) and view (non-contiguous) transpose
    • For copy transpose: added segment_set operation in fusion definition to prevent presegmentation passes from optimizing to view, and .contiguous() call in eager function to materialize copy
    • Generalized input tensor rank from hardcoded 3D to dynamic 2D/3D using rank parameter
    • Introduced _generate_transpose_params() helper to generate test combinations of (size, axes, dims)
    • 2D inputs test only (0,1) axes while 3D inputs test (0,1), (0,2), (1,2) axes

    Impact:

    • Significantly expands test coverage (roughly doubles test cases with copy/view split, plus adds 2D coverage)
    • Ensures transpose scheduler is properly exercised (copy transpose) vs. pointwise scheduler (view transpose)
    • Better aligns benchmark with real-world usage patterns

    Confidence Score: 4/5

    • Safe to merge with minor review comments addressed
    • Code is well-structured with clear separation between copy and view transpose paths. The implementation correctly handles dynamic rank tensors and test parameter generation. Previous review comments about typos and formatting have been noted. No critical logical errors or security issues found.
    • No files require special attention beyond addressing formatting feedback from previous threads

    Important Files Changed

    Filename: benchmarks/python/test_transpose.py
    Overview: Extended transpose benchmark to cover 2D inputs and copy vs. view transpose, adding segment_set to enforce copy transpose path and .contiguous() for materialization

    Sequence Diagram

    sequenceDiagram
        participant Test as Test Function
        participant Gen as _generate_transpose_params
        participant Fusion as transpose_fusion
        participant Eager as transpose_fwd_fn
        participant Bench as run_benchmark
    
        Test->>Gen: Request test parameters
        Gen->>Gen: Generate params for dims=2,3
        Gen->>Gen: For each size, axes, dims
        Gen-->>Test: Return (size, axes, dims) tuples
    
        Test->>Test: Create input tensors (input1, input2)
        Test->>Test: Compute permute_axes from axes
    
        Test->>Fusion: Define fusion with rank, is_copy_transpose
        Fusion->>Fusion: Define tensors with dynamic rank
        Fusion->>Fusion: Add + Permute + ReLU ops
        alt is_copy_transpose
            Fusion->>Fusion: Apply segment_set(T9) → T10
            Fusion->>Fusion: add_output(T10)
        else view_transpose
            Fusion->>Fusion: add_output(T9)
        end
    
        opt Validation enabled
            Test->>Eager: Execute eager function
            Eager->>Eager: Add + Transpose + ReLU
            alt is_copy_transpose
                Eager->>Eager: Apply .contiguous()
            end
            Eager-->>Test: Return eager_output
            Test->>Fusion: Validate against eager
        end
    
        opt Benchmarking enabled
            Test->>Bench: Run nvFuser benchmark
            Bench->>Fusion: Execute fusion
        end
    

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, 1 comment

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, no comments

    Collaborator

    @naoyam naoyam left a comment


    LGTM

    @liqiangxl
    Collaborator Author

    !test

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, no comments


    @pytest.mark.parametrize("size,axes,dims", _generate_transpose_params())
    @pytest.mark.parametrize("dtype", FLOAT_DTYPES)
    @pytest.mark.parametrize("axes", [(0, 1), (0, 2), (1, 2)])
    @pytest.mark.parametrize(
    Collaborator

    @Priya2698 Priya2698 Feb 3, 2026


    Do we need to benchmark view transpose? Should we remove it instead?

    Collaborator Author


    I don't know; it's not an expensive benchmark, so I'll just leave it as is in this PR.

    Collaborator


    Got it. Please work with @xwang233 for dashboard integration.


    @pytest.mark.parametrize("executor", DEFAULT_EXECUTORS)
    @pytest.mark.parametrize("size", generate_input_sizes(dims=3))
    @pytest.mark.parametrize("size,axes,dims", _generate_transpose_params())
    Collaborator


    IIRC, I used 3D inputs to match the C++ benchmark. If 2D inputs are sufficient for benchmarking, we should remove the 3D benchmarking. This should also simplify the dashboard for this benchmark.

    Collaborator Author


    We should keep 3D to cover the different axes.

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, 2 comments


    Comment on lines 37 to 38
    # add segmenter set to avoid presegment passes setting the output as a view of the input without any data movement. It leads to pointwise instead of transpose scheduler.
    #we can also expose OptimizationPassGuard to python frontend and disable presegmentation passes to enforce output to be contiguous and then transpose scheduler will be used.
    Contributor


    Break these long comments into multiple lines for better readability

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, no comments


    Collaborator

    @Priya2698 Priya2698 left a comment


    LGTM. If you find that the view transpose variant is not meaningful anymore, please remove it in a future follow-up.

    @liqiangxl
    Collaborator Author

    !build

    @liqiangxl
    Collaborator Author

    LGTM. If you find that the view transpose variant is not meaningful anymore, please remove it in a future follow-up.

    It provides an apples-to-apples comparison and ensures nvFuser is smart enough to detect and use view transpose.

    Contributor

    @greptile-apps greptile-apps bot left a comment


    1 file reviewed, no comments


    @liqiangxl liqiangxl merged commit 37d40a5 into main Feb 5, 2026
    18 checks passed
    @liqiangxl liqiangxl deleted the llu/extend_transpose_benchmark branch February 5, 2026 13:42