MLSys 2026 Competition Track
Participants will write custom kernels with the Neuron Kernel Interface (NKI) for the Qwen3-30B-A3B Mixture of Experts model and optimize inference performance on AWS Trainium2/3 hardware.
For full details on the competition, read the competition guidelines.
To register your team, enter your information here (just one entry per team).
To learn NKI, follow the official NKI guide and various example NKI kernels from the nki-samples repository. Another tool to help with optimizing NKI kernels is NKI autotune.
- Create a Trainium2 instance with AWS Neuron SDK v2.27 using EC2 based on the setup guide.
- Activate the Neuron virtual environment to run inference by running the appropriate activation command for your SDK version:

  ```
  source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
  ```

- Clone this repository and run `cd [PATH]/nki-moe`, where `[PATH]` is the directory where you performed the clone.
- Download the Qwen3-30B-A3B model to a `~/qwen-30b-a3b/hf_model` folder in your home directory. We recommend doing so with the Hugging Face CLI, which you can install by running `pip3 install "huggingface_hub[cli]"`.
- To run inference, navigate to `[PATH]/nki-moe` and run:

  ```
  python3 main.py --mode generate --model-path ~/qwen-30b-a3b/hf_model --compiled-model-path ~/qwen-30b-a3b/traced_model --prompt "What is the capital of France?"
  ```
This repository contains the standard model implementation in qwen.py.
Your task is to identify parts of the model (operators, fused operators, layers, or even the whole model) that can be implemented as NKI kernels and add them to create optimized versions of the model.
This repository includes two NKI kernel examples to help you get started:
A simple NKI kernel demonstrating basic tensor operations. This serves as a minimal reference implementation showing:
- Basic NKI kernel structure with the `@nki.jit` decorator
- Tensor indexing and loading from HBM to SBUF
- Element-wise operations
- Storing results back to HBM
This example is not integrated into the model but provides a foundation for understanding NKI kernel development.
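The data flow that example demonstrates (load a tile, apply an element-wise operation, store the result) can be sketched on the host in plain Python. This is a conceptual sketch only, not the repository's kernel: the tile size mirrors NKI's 128-partition SBUF layout, and the doubling operation is an assumed placeholder for whatever element-wise op the example uses.

```python
# Host-side sketch (no Neuron SDK) of a tiled element-wise kernel's
# structure: iterate over tiles, "load" each tile, compute, "store" back.
# TILE=128 mirrors the 128-partition outer dimension of NKI SBUF tiles.
TILE = 128

def elementwise_double(matrix):
    rows, cols = len(matrix), len(matrix[0])
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, TILE):            # loop over row tiles
        for c0 in range(0, cols, TILE):        # loop over column tiles
            for r in range(r0, min(r0 + TILE, rows)):
                for c in range(c0, min(c0 + TILE, cols)):
                    out[r][c] = 2 * matrix[r][c]   # element-wise op
    return out
```

In the real kernel the two inner loops become vectorized on-chip operations; only the tile-level loop structure survives.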
A production-ready NKI RMSNorm implementation integrated into the Qwen model. This kernel follows the pattern from the official AWS NKI RMSNorm tutorial.
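For reference, the math the RMSNorm kernel computes is small enough to state in plain Python: each element is divided by the root mean square of its row and scaled by a learned weight. The sketch below shows the math only (the NKI version additionally tiles the hidden dimension across SBUF), and `eps=1e-6` is an assumed default rather than a value taken from the repository.

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```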
We also provide qwen_with_nki.py, which contains the model implementation with custom NKI kernels integrated. To test the different implementations:

```
# Standard inference (uses qwen.py)
python3 main.py --mode generate --model-path ~/qwen-30b-a3b/hf_model --compiled-model-path ~/qwen-30b-a3b/traced_model --prompt "What is the capital of France?"

# With NKI RMSNorm kernel (uses qwen_with_nki.py)
python3 main.py --mode generate --enable-nki --model-path ~/qwen-30b-a3b/hf_model --compiled-model-path ~/qwen-30b-a3b/traced_model --prompt "What is the capital of France?"
```

Important: When switching between NKI and standard modes, remove the traced model directory and the compile cache to ensure proper recompilation:

```
rm -rf ~/qwen-30b-a3b/traced_model
rm -rf /var/tmp/neuron-compile-cache/*
```

The `--enable-nki` flag in `main.py` controls which model file is loaded:

- Without the flag: loads `qwen.py` (standard implementation)
- With the flag: loads `qwen_with_nki.py` (NKI-accelerated implementation)
Key areas to focus on:
- MoE routing and expert selection logic
- Expert computation (gate_proj, up_proj, down_proj)
- Attention mechanisms with MoE-specific optimizations
- Memory-efficient tensor operations for sparse expert execution
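To make the first bullet concrete, a generic top-k softmax router looks like the sketch below: softmax over the router logits, select the k highest-scoring experts, and renormalize their probabilities to sum to 1. This is an illustrative sketch of the general technique; Qwen3's actual router (its value of k and its normalization choices) may differ, so check qwen.py before optimizing.

```python
import math

def route(logits, k=2):
    """Return (expert_id, weight) pairs for the top-k experts of one token."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]     # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)            # renormalize over top-k
    return [(i, probs[i] / norm) for i in top]
```

A kernel implementing this must handle the irregular, data-dependent gather of tokens to experts, which is exactly where the "sparse expert execution" bullet above comes in.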
The contest organizers will execute each team's submission across the twenty withheld benchmarks on a dedicated Trainium instance. The submissions will be evaluated on:
- Accuracy of generated output vs. our reference implementation. Accuracy evaluation is binary: any benchmark that fails the accuracy threshold receives a score of 0.
- Latency (Time to first token (TTFT))
- Throughput measured as output tokens / second
- Amount of the model written in NKI, measured as NKI FLOPS / total model FLOPS; this is applied as a scaling factor to the latency and throughput scores. Note: NKI FLOPS counts the number of multiply-accumulate (MAC) operations.
Rankings will be determined by each team's total number of points, where points are normalized against the baseline.
We define points as Accuracy (binary) * Reduced Latency * Increased Throughput * (1 + Normalized NKI FLOPS), where:
- Accuracy = 1 if accuracy matches or exceeds a predetermined threshold, 0 otherwise
- Reduced Latency = Reference implementation TTFT divided by submission TTFT
- Increased Throughput = Submission tokens/sec divided by reference implementation tokens/sec
- Normalized NKI FLOPS = Submission NKI FLOPS divided by total model FLOPS
For example, a submission that is sufficiently accurate, with 10x reduced latency, 2x increased throughput, and 0.85 normalized NKI FLOPS would obtain 1 * 10 * 2 * 1.85 = 37 points.
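The scoring formula can be written as a small helper that reproduces the worked example above; the parameter names are illustrative, not taken from the organizers' harness.

```python
def points(accurate, ref_ttft, sub_ttft, sub_tps, ref_tps,
           nki_flops, total_flops):
    """Score one submission: Accuracy * Reduced Latency
    * Increased Throughput * (1 + Normalized NKI FLOPS)."""
    accuracy = 1 if accurate else 0
    reduced_latency = ref_ttft / sub_ttft          # reference TTFT / submission TTFT
    increased_throughput = sub_tps / ref_tps       # submission tok/s / reference tok/s
    normalized_nki = nki_flops / total_flops
    return accuracy * reduced_latency * increased_throughput * (1 + normalized_nki)
```

Note that because accuracy is a hard gate, a failed benchmark zeroes out any latency or throughput gains on that benchmark.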
- Profiling: To better understand performance bottlenecks and opportunities for optimization, you can profile your implementation with the Neuron Explorer.
- Benchmarking: You can also leverage the NKI benchmarking API to retrieve execution latency statistics.
Email: nki-mlsys-2026@amazon.com