[Feature]: Remove the autotuner invocation from AD's nvfp4_gemm operator #9496

@nzmora-nvidia

Description


🚀 The feature, motivation and pitch

The autotuner is used while AD records CUDA graphs, and per-shape tactics are cached.
The operator should not explicitly enter the auto-tuner context, because that is the responsibility of the AD engine.
TRTLLM's nvfp4_gemm_runner is responsible for auto-tuning and caching a tactic, and it uses that cached tactic during inference, but only when it is not inside an auto-tuner context.

Alternatives

No response

Additional context

No response

Metadata

Labels

AutoDeploy — <NV> AutoDeploy Backend
Customized kernels — <NV> Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
bug — Something isn't working
feature request — New feature or request. This includes new model, dtype, functionality support
triaged — Issue has been triaged by maintainers

Status

Done
