MLSys 2026 Competition Track
Participants will write custom kernels with the Neuron Kernel Interface (NKI) for the Qwen3-30B-A3B Mixture of Experts model and optimize inference performance on AWS Trainium2/3 hardware.
For full details on the competition, read the competition guidelines.
To register your team, enter your information here (just one entry per team).
To learn NKI, follow the official NKI guide and various example NKI kernels from the nki-samples repository. Another tool to help with optimizing NKI kernels is NKI autotune.
- Create a Trainium2 instance with AWS Neuron SDK v2.27 using EC2 based on the setup guide.
- Activate the Neuron virtual environment to run inference by running the appropriate activation command for your SDK version:

  ```
  source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
  ```

- Clone this repository and run `cd [PATH]/nki-moe`, where `[PATH]` is the directory where you performed the clone.
- Download the Qwen3-30B-A3B model to a `~/qwen-30b-a3b/hf_model` folder in your home directory. We recommend doing so with the Hugging Face CLI, which you can install by running `pip3 install "huggingface_hub[cli]"`.
- To run inference, navigate to `[PATH]/nki-moe` and run:

  ```
  python3 main.py --mode generate --model-path ~/qwen-30b-a3b/hf_model --compiled-model-path ~/qwen-30b-a3b/traced_model --prompt "What is the capital of France?"
  ```
This repository contains the standard model implementation in qwen.py.
Your task is to identify parts of the model (operators, fused operators, layers, or even the whole model) that can be implemented as NKI kernels and add them to create optimized versions of the model.
This repository includes two NKI kernel examples to help you get started:
A simple NKI kernel demonstrating basic tensor operations. This serves as a minimal reference implementation showing:
- Basic NKI kernel structure with the `@nki.jit` decorator
- Tensor indexing and loading from HBM to SBUF
- Element-wise operations
- Storing results back to HBM
This example is not integrated into the model but provides a foundation for understanding NKI kernel development.
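The data flow that example demonstrates (load a tile, apply an element-wise operation, store the result) can be sketched on the host in plain Python. This is a conceptual sketch only, not the repository's kernel: the tile size mirrors NKI's 128-partition SBUF layout, and the doubling operation is an assumed placeholder for whatever element-wise op the example uses.

```python
# Host-side sketch (no Neuron SDK) of a tiled element-wise kernel's
# structure: iterate over tiles, "load" each tile, compute, "store" back.
# TILE=128 mirrors the 128-partition outer dimension of NKI SBUF tiles.
TILE = 128

def elementwise_double(matrix):
    rows, cols = len(matrix), len(matrix[0])
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, TILE):            # loop over row tiles
        for c0 in range(0, cols, TILE):        # loop over column tiles
            for r in range(r0, min(r0 + TILE, rows)):
                for c in range(c0, min(c0 + TILE, cols)):
                    out[r][c] = 2 * matrix[r][c]   # element-wise op
    return out
```

In the real kernel the two inner loops become vectorized on-chip operations; only the tile-level loop structure survives.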
A production-ready NKI RMSNorm implementation integrated into the Qwen model. This kernel follows the pattern from the official AWS NKI RMSNorm tutorial.
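For reference, the math the RMSNorm kernel computes is small enough to state in plain Python: each element is divided by the root mean square of its row and scaled by a learned weight. The sketch below shows the math only (the NKI version additionally tiles the hidden dimension across SBUF), and `eps=1e-6` is an assumed default rather than a value taken from the repository.

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```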
We also provide qwen_with_nki.py, which contains the model implementation with custom NKI kernels integrated. To test the different implementations:

```
# Standard inference (uses qwen.py)
python3 main.py --mode generate --model-path ~/qwen-30b-a3b/hf_model --compiled-model-path ~/qwen-30b-a3b/traced_model --prompt "What is the capital of France?"

# With NKI RMSNorm kernel (uses qwen_with_nki.py)
python3 main.py --mode generate --enable-nki --model-path ~/qwen-30b-a3b/hf_model --compiled-model-path ~/qwen-30b-a3b/traced_model --prompt "What is the capital of France?"
```

Important: When switching between NKI and standard modes, remove the traced model directory and the compile cache to ensure proper recompilation:

```
rm -rf ~/qwen-30b-a3b/traced_model
rm -rf /var/tmp/neuron-compile-cache/*
```

The `--enable-nki` flag in `main.py` controls which model file is loaded:

- Without the flag: loads `qwen.py` (standard implementation)
- With the flag: loads `qwen_with_nki.py` (NKI-accelerated implementation)
Key areas to focus on:
- MoE routing and expert selection logic
- Expert computation (gate_proj, up_proj, down_proj)
- Attention mechanisms with MoE-specific optimizations
- Memory-efficient tensor operations for sparse expert execution
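To make the first bullet concrete, a generic top-k softmax router looks like the sketch below: softmax over the router logits, select the k highest-scoring experts, and renormalize their probabilities to sum to 1. This is an illustrative sketch of the general technique; Qwen3's actual router (its value of k and its normalization choices) may differ, so check qwen.py before optimizing.

```python
import math

def route(logits, k=2):
    """Return (expert_id, weight) pairs for the top-k experts of one token."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]     # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)            # renormalize over top-k
    return [(i, probs[i] / norm) for i in top]
```

A kernel implementing this must handle the irregular, data-dependent gather of tokens to experts, which is exactly where the "sparse expert execution" bullet above comes in.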
The contest organizers will execute each team's submission across the twenty withheld benchmarks on a dedicated Trainium instance. The submissions will be evaluated on:
- Accuracy of generated output vs. our reference implementation. Accuracy evaluation is binary: any benchmark that fails the accuracy threshold receives a score of 0.
- Latency (Time to first token (TTFT))
- Throughput measured as output tokens / second
- Amount of the model written in NKI, measured as NKI FLOPS / total model FLOPS; this is applied as a scaling factor to the latency and throughput scores. Note: NKI FLOPS counts the number of multiply-accumulate (MAC) operations.
Rankings will be determined by each team's total number of points, where points are normalized against the baseline.
We define points as Accuracy (binary) * Reduced Latency * Increased Throughput * (1 + Normalized NKI FLOPS), where:
- Accuracy = 1 if accuracy matches or exceeds a predetermined threshold, 0 otherwise
- Reduced Latency = Reference implementation TTFT divided by submission TTFT
- Increased Throughput = Submission tokens/sec divided by reference implementation tokens/sec
- Normalized NKI FLOPS = Submission NKI FLOPS divided by total model FLOPS
For example, a submission that is sufficiently accurate, with 10x reduced latency, 2x increased throughput, and 0.85 normalized NKI FLOPS would obtain 1 * 10 * 2 * 1.85 = 37 points.
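The scoring formula can be written as a small helper that reproduces the worked example above; the parameter names are illustrative, not taken from the organizers' harness.

```python
def points(accurate, ref_ttft, sub_ttft, sub_tps, ref_tps,
           nki_flops, total_flops):
    """Score one submission: Accuracy * Reduced Latency
    * Increased Throughput * (1 + Normalized NKI FLOPS)."""
    accuracy = 1 if accurate else 0
    reduced_latency = ref_ttft / sub_ttft          # reference TTFT / submission TTFT
    increased_throughput = sub_tps / ref_tps       # submission tok/s / reference tok/s
    normalized_nki = nki_flops / total_flops
    return accuracy * reduced_latency * increased_throughput * (1 + normalized_nki)
```

Note that because accuracy is a hard gate, a failed benchmark zeroes out any latency or throughput gains on that benchmark.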
- Profiling: To better understand performance bottlenecks and opportunities for optimization, you can profile your implementation with the Neuron Explorer.
- Benchmarking: You can also leverage the NKI benchmarking API to retrieve execution latency statistics.
Email: nki-mlsys-2026@amazon.com