
65_geglu: relax atol/rtol from 1e-5 to 1e-4#207

Merged
kunal-mansukhani merged 2 commits into main from
copilot/fix-geglu-numerical-mismatch
Mar 4, 2026
Conversation

Contributor

Copilot AI commented Mar 4, 2026

tl.erf and torch.erf diverge by ~3e-5 on CUDA, causing correct pure Triton GEGLU implementations to fail validation against the PyTorch reference at the previous 1e-5 tolerance.

Change

  • challenges/easy/65_geglu/challenge.py: bump atol and rtol from 1e-05 to 1e-04
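Assuming the grader applies an allclose-style acceptance criterion (|actual − expected| ≤ atol + rtol·|expected|, as in torch.allclose), a minimal stdlib sketch shows why the observed ~3.05e-5 divergence fails at 1e-5 but passes at 1e-4; `within_tol` is a hypothetical helper name:

```python
def within_tol(actual, expected, atol=1e-4, rtol=1e-4):
    """allclose-style acceptance test: |a - e| <= atol + rtol * |e|."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))

# A last-bit erf divergence of ~3.05e-5 near |x| ~ 1:
# fails at 1e-5 (allowed slack ~2e-5) but passes at 1e-4.
diff = 3.0517578125e-05
```

With atol=rtol=1e-5 the allowed slack near |expected| ≈ 1 is only ~2e-5, below the observed diff; at 1e-4 the slack is ~2e-4, comfortably above it.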
Original prompt

This section describes the original issue you should resolve

<issue_title>65_geglu: strict atol/rtol (1e-5) + torch.erf reference makes pure Triton fail (tl.erf differs by ~3e-5)</issue_title>
<issue_description>I’m reporting an issue with challenge 65_geglu (Gaussian Error Gated Linear Unit): a correct pure Triton implementation fails due to a consistent small numerical mismatch vs the reference implementation, even when matching the same formula.

Reference implementation (from challenge)

The grader reference is:

x1, x2 = input.chunk(2)
gelu = 0.5 * x2 * (1.0 + torch.erf(x2 / math.sqrt(2.0)))
output.copy_(x1 * gelu)

with tolerances:

  • atol=1e-05
  • rtol=1e-05
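For context, the reference formula can be reproduced on CPU with only the standard library; `geglu_reference` is a hypothetical name, and the list halving mirrors `input.chunk(2)` on a 1-D tensor:

```python
import math

def geglu_reference(xs):
    """CPU mirror of the challenge reference: out = x1 * GELU(x2),
    with exact (erf-based) GELU applied to the second half of xs."""
    n = len(xs)
    assert n % 2 == 0
    x1, x2 = xs[:n // 2], xs[n // 2:]
    out = []
    for a, b in zip(x1, x2):
        gelu = 0.5 * b * (1.0 + math.erf(b / math.sqrt(2.0)))
        out.append(a * gelu)
    return out
```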

Problem

A pure Triton kernel computing the same formula using tl.erf fails with:

  • max abs diff = 3.0517578125e-05

This happens with:

  • fp32 compute
  • fp64 intermediates (still fails after casting to fp32 output)
  • hardcoded constants 0.7071067811865476

This strongly suggests tl.erf does not numerically match torch.erf on CUDA closely enough to satisfy 1e-5 tolerance for all random inputs (e.g., N=1024 uniform(-100,100)).
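Notably, the reported max abs diff of 3.0517578125e-05 is exactly 2^-15, which is one single-precision ULP for magnitudes in [256, 512) — consistent with the two erf implementations differing only in the last mantissa bit on outputs of that magnitude. A small stdlib sketch (with a hypothetical `fp32_ulp` helper) confirms the ULP value:

```python
import struct

def fp32_ulp(x):
    """Distance from float32(x) to the next representable float32 above it,
    computed by incrementing the raw 32-bit pattern."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    nxt = struct.unpack('<f', struct.pack('<I', bits + 1))[0]
    return nxt - x
```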

Why this matters

On the Triton leaderboard, the few “Triton” solutions appear to pass by bypassing Triton and using PyTorch ops instead. That implies the task is effectively not solvable with pure Triton under current grading tolerances/reference.

Repro (pure Triton)

import torch
import triton
import triton.language as tl


@triton.jit
def geglu(input, output, N, BLOCK_SIZE: tl.constexpr):
    middle = N // 2
    idx = tl.program_id(0)
    offset = idx * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offset < middle
    x = tl.load(input + offset, mask=mask, other=0.0)
    y = tl.load(input + offset + middle, mask=mask, other=0.0)

    # exact GELU: 0.5 * y * (1 + erf(y / sqrt(2))); note sqrt(2) * 0.5 == 1 / sqrt(2)
    res = x * 0.5 * y * (1.0 + tl.erf(y * tl.sqrt(2.0) * 0.5))
    tl.store(output + offset, res, mask=mask)


# input, output are tensors on the GPU
def solve(input: torch.Tensor, output: torch.Tensor, N: int):
    BLOCK_SIZE = 256
    grid = (triton.cdiv(N // 2, BLOCK_SIZE),)
    geglu[grid](input, output, N, BLOCK_SIZE=BLOCK_SIZE)

Request / Possible fixes

Could you please consider one of:

  1. Relax tolerance for this challenge (e.g. atol=5e-5 or 1e-4), since different GPU erf implementations can differ slightly.
  2. Change reference to a formulation that matches Triton’s available math more robustly (or compare against a CPU high-precision reference with looser tolerance).
  3. Document that the Triton track is allowed to call torch.erf (if that’s intended), though it defeats the purpose of a Triton-only challenge.

Extra: what to change (suggested tolerance)

Given the observed diff is ~3.05e-05, a minimal safe fix is:

  • atol = 5e-05 (or 1e-4 to be robust across GPUs)
    </issue_description>

<agent_instructions>bump the tol to 1e-4</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Co-authored-by: kunal-mansukhani <24417742+kunal-mansukhani@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix numerical mismatch in 65_geglu Triton implementation" to "65_geglu: relax atol/rtol from 1e-5 to 1e-4" Mar 4, 2026
@kunal-mansukhani kunal-mansukhani marked this pull request as ready for review March 4, 2026 18:54
@kunal-mansukhani kunal-mansukhani merged commit 7d503ed into main Mar 4, 2026
1 check failed
