
gpt-oss sft megatron does not support sequence packing #1685


Description

@jordane95

Describe the bug

When sequence_packing=true is set and the Megatron backend is used for fine-tuning gpt-oss, the debug info says no attention backend is available.
With sequence_packing=False, only UnfusedAttention is available: FlashAttention is disabled because softmax_type = learnable, and FusedAttention is disabled because no backend supports the provided input.
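
For reference, a minimal sketch of how to surface this backend-selection reasoning, assuming the messages above come from Transformer Engine's attention debug logging (NVTE_DEBUG / NVTE_DEBUG_LEVEL must be set before TE is imported):

```python
# Sketch: enable Transformer Engine's attention-backend selection logs.
# Assumes the debug lines quoted above are emitted by TE.
import os

os.environ["NVTE_DEBUG"] = "1"        # turn on TE debug logging
os.environ["NVTE_DEBUG_LEVEL"] = "2"  # level 2 prints why each backend is enabled/disabled

import transformer_engine  # noqa: E402  (import only after the env vars are set)
```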

This differs from fine-tuning in megatron-bridge, which supports FusedAttention with learnable softmax.

Update: the difference in enabled backends is due to different cuDNN versions in the Docker images. The latest nemo-rl nano image has cuDNN version 91002, while the nemo image has cuDNN version 91310.
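
To confirm which cuDNN build a given container ships, one can query it through PyTorch:

```python
import torch

# Prints the cuDNN version as a single integer,
# e.g. 91002 (nemo-rl nano image) vs. 91310 (nemo image) per the update above.
print(torch.backends.cudnn.version())
```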

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.
