Releases: huggingface/optimum-neuron
v0.4.5: serving LLM Embeddings models
What's Changed
- doc: add a guide explaining how to deploy vLLM on Inference Endpoints by @tengomucho in #1057
- Add Qwen embedding guide and notebook by @pinak-p in #1045
- Serve embedding models using vLLM by @dacorvo in #1072
Other changes
- Update container URIs by @dacorvo in #1056
- Implement predownload and instance type detection for trn2 example by @jimburtoft in #1041
- Fix vllm IE Images by @tengomucho in #1058
- Fixed broken link to Neuron setup in README.md by @mlopezr in #1059
- Update LLM deployment documentation by @dacorvo in #1060
- doc: remove last mentions of TGI by @dacorvo in #1061
- Optimize cache lookup by @dacorvo in #1062
- Update vLLM container for Sagemaker to v0.4.4 in documentation by @tengomucho in #1063
- Optimize lookup by @dacorvo in #1066
- Fix cache registry for embedding models by @dacorvo in #1067
- dlc doc vllm new tag by @pagezyhf in #1069
- chore: add agentic instructions by @dacorvo in #1070
- Ci sequential and cache by @tengomucho in #1075
- Fix ci sanity in PRs by @tengomucho in #1077
- fix(exporter): remove deprecation warning by @dacorvo in #1076
- feat(docker): vllm container uses uv to install package by @tengomucho in #1071
Full Changelog: v0.4.4...v0.4.5
v0.4.4: improved vLLM perf with on-device-sampling disable, fix speculation algo, PEFT update for GRPO
What's Changed
Inference
- vLLM searches local configuration files by @tengomucho in #1046
- Fix speculation algorithm by @dacorvo in #1047
- Simplify LLM inference modeling and use longer sequences in tests by @dacorvo in #1052
- Remove Inf1 support by @tengomucho in #1054
- Improve vLLM performance when on-device-sampling is disabled by @dacorvo in #1055
Training
- PEFT update for GRPO by @michaelbenayoun in #1044
- Collective ops in `optimum/neuron/accelerate` by @michaelbenayoun in #1042
- Gradient checkpointing fix by @michaelbenayoun in #1043
Other
- Update pyproject.toml for uv by @michaelbenayoun in #1040
- chore: update pyproject.toml to fix incoherences by @tengomucho in #1050
Full Changelog: v0.4.3...v0.4.4
v0.4.3: fix for Llama4, device memory usage details, vLLM container accepts params
What's Changed
Inference
- Reduce llm tests by @dacorvo in #1033
- (Re-)disable hard-coded check in vLLM ModelConfig (fix for llama4) by @dacorvo in #1035
- fix: Flux timeout issue + nxd implementation refactoring by @JingyaHuang in #1022
- vLLM docker takes params by @tengomucho in #1039
Other
- AWS Neuron SDK 2.6.1 by @dacorvo in #1037
- Device memory usage by @dacorvo in #1036
- Cleanup CI workflows and bump development version by @dacorvo in #1034
Full Changelog: v0.4.2...v0.4.3
v0.4.2: Training cache fixes, Qwen3 Embedding support added, vLLM v1 API
What's Changed
Inference
- Fix input slots exhaustion in vLLM plugin by @dacorvo in #1028
- Agentic example by @tengomucho in #1030
- perf: move accuracy benchmark to vllm by @dacorvo in #1031
- Add support for Qwen3 embedding model by @dacorvo in #1023
- Update vllm version to 0.11.0 by @dacorvo in #1027
- feat: Add `encode` and `similarity` of Sentence Transformers by @JingyaHuang in #1012
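The `similarity` method added for Sentence Transformers-style models boils down to cosine similarity between embedding vectors. A minimal pure-Python sketch of that computation (illustrative only, not the optimum-neuron implementation; the function name is hypothetical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0, orthogonal directions score 0.0.
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```

In practice the real API operates on batched tensors produced by `encode`, but the score semantics are the same.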
Training
- Metrics for training by @michaelbenayoun in #982
- Update `trl` version to the latest release (`0.11.4` -> `0.24.0`) by @michaelbenayoun in #1000
- Add cache features to the `NeuronTrainer` by @michaelbenayoun in #1026
Other
- Sync with transformers 4.57.1 by @michaelbenayoun in #1016
- ci(vllm): login to docker by @tengomucho in #1010
- Fix small typos by @tengomucho in #1021
- Bump optimum to 2.0 by @JingyaHuang in #1018
- Unpin protobuf version by @JingyaHuang in #1014
- Fixing link in error message by @jimburtoft in #1029
- fix(vllm): fix base_neuron_llm_config fixture by @tengomucho in #1032
Full Changelog: v0.4.1...v0.4.2
v0.4.1: Xet High Performance transfers, vLLM served model name
What's Changed
- chore: bump huggingface_hub version, set HF_XET_HIGH_PERFORMANCE by @tengomucho in #998
Inference
- Cleanup llm export and serving by @dacorvo in #999
- Midsize models trn2 benchmarks by @dacorvo in #1002
- Refactor autofill cache tools by @dacorvo in #1003
- Isolate Trainium 1 and Trainium 2 autofill cache workflows by @dacorvo in #1005
- Fix: wav2vec2 export + reduce transformers test workload by @JingyaHuang in #1006
- add `--served_model_name` to vLLM by @tengomucho in #1004
- NxD backend refactoring by @dacorvo in #1007
- Lookup without neuronx by @tengomucho in #1009
Training
- ZeRO-1 precision args for the NeuronTrainer by @michaelbenayoun in #997
Documentation
- DLC doc by @pagezyhf in #995
- doc: update doc to avoid using image_uri when optimum neuron not avail by @tengomucho in #994
Full Changelog: v0.4.0...v0.4.1
v0.4.0: AWS Neuron SDK 2.6, Trainium 2 support, Qwen3-MoE, Llama4 (text)
What's Changed
Inference
- Add Flux Inpaint support by @JingyaHuang in #909
- chore: Bump diffusers to 0.35.* by @JingyaHuang in #935
- fix: flux neuron cache detection by @JingyaHuang in #937
- Add flux inpaint to supported by @JingyaHuang in #932
- feat: Support flux kontext for text2img by @jlonge4 in #916
- allow safetensors to be downloaded for flux by @Abdennacer-Badaoui in #939
- Add support for SmolLM3 models by @dacorvo in #934
- Add support for Qwen3Moe models by @dacorvo in #945
- Cleanup inference backend modules by @dacorvo in #948
- Remove unsupported modeling flags by @dacorvo in #950
- Remove optimized model dependency in LLM models by @dacorvo in #955
- Add tests for modules used in inference by @tengomucho in #957
- Add support for text generation in Llama4 models by @dacorvo in #959
- test(inference): add tests to check decoder layer accuracy by @tengomucho in #962
- Add vLLM docker image by @dacorvo in #967
- Add trn1 vllm llama benchmark by @dacorvo in #970
- Enable CPU compilation by @Abdennacer-Badaoui in #961
- Enable `instance_type` tag to export by @JingyaHuang in #974
- Add trn2 benchmarks for a few big models by @dacorvo in #991
- Improve DX when exporting and deploying LLM neuron models by @dacorvo in #986
- ECR Image URI retrieval by @tengomucho in #985
- Add support for Trainium 2 for decoder models by @dacorvo in #988
- Automatically detect platform when serving models by @dacorvo in #989
- Add trn1 qwen3 and llama4 moe vLLM benchmark by @dacorvo in #973
Training
- Trainers refactor by @michaelbenayoun in #918
- fix: uses processing_class instead of tokenizer in base trainer by @michaelbenayoun in #927
- fix: fixes barrier issue at the end of training with hub sync by @michaelbenayoun in #925
- Sync training custom modeling to transformers=4.55.4 by @michaelbenayoun in #954
- Trainer simplification by @michaelbenayoun in #938
- Fix attention implementation argument in custom modeling by @michaelbenayoun in #963
- ZeRO-1 and mixed-precision by @michaelbenayoun in #956
- PEFT and PP by @michaelbenayoun in #964
- Fix async save by @michaelbenayoun in #976
Documentation
- docs: remove finetune with AWS guide by @michaelbenayoun in #905
- Contribute custom modeling by @michaelbenayoun in #908
- docs: remove sagemaker guide by @michaelbenayoun in #906
- Supported architectures page by @michaelbenayoun in #907
- Update cache system guide by @michaelbenayoun in #910
- [docs] Getting started page by @michaelbenayoun in #911
- [docs] Move inference API section by @michaelbenayoun in #913
- [docs] Tutorial sections by @michaelbenayoun in #914
- [docs] Update the link for the card images by @michaelbenayoun in #915
- [docs] Quickstart page by @michaelbenayoun in #912
- chore: remove doc-builder dependency from quality extra by @tengomucho in #917
- [docs] Trainers api by @michaelbenayoun in #922
- [docs] Distributed training guide by @michaelbenayoun in #921
- [docs] Transformations specs api ref by @michaelbenayoun in #923
- [docs] Lora API reference page by @michaelbenayoun in #924
- Update pipelines.mdx by @Abdennacer-Badaoui in #942
- [docs] LLama tutorial: adapt the Llama tutorial to the new format by @michaelbenayoun in #919
- Add whitepaper by @pagezyhf in #958
- Add vllm install instructions to documentation by @jimburtoft in #952
New Contributors
- @Abdennacer-Badaoui made their first contribution in #939
Full Changelog: v0.3.0...v0.4.0
v0.3.0: vLLM plugin, FLUX support, SDK 2.24
What's Changed
- chore: bump aws neuron sdk version to 2.24.0 by @JingyaHuang in #856
- Add BlackForest Flux Support by @JingyaHuang in #815
Inference
- [LLM] Reenable on device sampling for (almost) all configurations by @dacorvo in #886
- Add vLLM plugin by @dacorvo in #888
- Move `NEURON_FUSE_SOFTMAX` and `NEURON_CUSTOM_SILU` env vars to diffusers model loading by @JingyaHuang in #889
- Update LLM benchmarks by @dacorvo in #895
- Bump accelerate to 1.3.0 + peft to 0.15.2+diffusers>=0.31.0 by @JingyaHuang in #901
- chore: move inference modeling code by @JingyaHuang in #902
Training
- Few inference fixes by @tengomucho in #880
- Auto model classes for custom modeling by @michaelbenayoun in #883
- Finetune llm example by @michaelbenayoun in #894
General
- Remove `is_torch_xla_available` and `is_neuronx_available` by @michaelbenayoun in #884
- Type hint cleaning by @michaelbenayoun in #887
Documentation
- doc(vllm): change reco for models that are not cached by @dacorvo in #899
- Remove example scripts by @michaelbenayoun in #893
- ci: align doc workflow on doc-pr by @dacorvo in #896
- Update README by @michaelbenayoun in #900
- Benchmark on TGI + optimum-neuron by @jlonge4 in #904
Full Changelog: v0.2.2...v0.3.0
release: 0.2.2 - Fix LLM inference modeling
What's Changed
The LLM inference code led to compilation errors for models whose `head_dim` is not equal to `hidden_size // num_attention_heads`, such as Qwen3-0.6B and Qwen3-32B. This release fixes the inference modeling to handle these models.
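The failure mode above can be illustrated with a small sketch: some configs declare an explicit `head_dim` that differs from the derived value, so modeling code must prefer the explicit field instead of re-deriving it (the helper name and the example values are illustrative; Qwen3-0.6B is believed to use `hidden_size=1024`, 16 heads, and `head_dim=128`):

```python
def resolve_head_dim(config: dict) -> int:
    """Prefer an explicit head_dim over the value derived from hidden_size.

    For most models head_dim == hidden_size // num_attention_heads, but
    models like Qwen3 set head_dim explicitly to a different value.
    """
    derived = config["hidden_size"] // config["num_attention_heads"]
    return config.get("head_dim", derived)

# Qwen3-0.6B-like config: derived value would be 1024 // 16 == 64,
# but the explicit head_dim is 128.
qwen3_like = {"hidden_size": 1024, "num_attention_heads": 16, "head_dim": 128}
print(resolve_head_dim(qwen3_like))  # 128
```

Code that always used the derived value would build attention projections with the wrong shapes for such models, which is the kind of mismatch that surfaced as a compilation error.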
Full Changelog: v0.2.1...v0.2.2
v0.2.1: NxD refactoring
What's Changed
Inference
- Add qwen2 nxd by @dacorvo in #863
- Support Qwen3 by @jlonge4 in #847
- Add support for phi3 models using the nxd backend by @dacorvo in #867
- Add pixart models to cache CI by @JingyaHuang in #869
- Add granite nxd modeling and remove HLO backend by @dacorvo in #873
- chore(mixtral): align compile options to NXDi by @tengomucho in #875
- Refactoring T5 implementation for NxD support by @JingyaHuang in #876
- Improve diffusers cache CIs by @JingyaHuang in #872
Training
- Initial PR for peft by @michaelbenayoun in #839
- Support for PP with custom modeling by @michaelbenayoun in #857
- Cleanup legacy parallelism support by @michaelbenayoun in #866
- Fix workflows for training by @tengomucho in #874
- Remove `optimum/neuron/distributed` by @michaelbenayoun in #877
Documentation
- update guidellm version to reproduce examples properly by @jlonge4 in #852
- Tutorial for Qwen3 Fine-tuning by @tengomucho in #865
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- Bump to AWS neuron sdk 2.22 by @JingyaHuang in #828
- chore: bump AMI base version for Neuron SDK 2.22 by @dacorvo in #831
Inference
- Cache granite and phi4 models by @dacorvo in #809
- Refactor hub neuronx cache by @dacorvo in #829
- Add Whisper for the task "automatic-speech-recognition" w/o. KV cache by @JingyaHuang in #789
- Add support for Modern BERT by @JingyaHuang in #818
- Set task to none for multi models cache entry by @dacorvo in #832
- ci: add cv2 to workaround transformers spurious import by @dacorvo in #834
- Refactor decoder modeling by @dacorvo in #835
- Refactor decoder export by @dacorvo in #837
- Add decoder custom modeling for inference based on NxD by @dacorvo in #840
- Activate continuous batching for Llama on NxD by @dacorvo in #848
- Tgi integration by @dacorvo in #855
- Avoid loading weights when exporting an NxD model using the CLI by @dacorvo in #860
- test(speculation): do not load weights during export by @dacorvo in #861
Training
- Training remove gpt neo models support by @tengomucho in #807
- chore(test): add test comparing Linear and RowParallelLinear outputs by @tengomucho in #814
- More training tests updates by @tengomucho in #808
- test(training): add flash attention test by @tengomucho in #824
- Granite modeling for training by @tengomucho in #830
- Cache Hub API Changes by @tengomucho in #836
- Custom modeling for training by @michaelbenayoun in #801
- 🪨 Granite Training by @tengomucho in #845
- Training granite warning flash attention by @michaelbenayoun in #849
- Add Qwen3 modeling for training by @tengomucho in #850
Documentation
- latest available tgi dlc uri by @pagezyhf in #812
- Add guidelines on EC2 creation with the DLAMI by @pagezyhf in #795
- Add per service section in tutorials and a first example for tutorial > inference > SageMaker by @pagezyhf in #796
- Mixtral Sagemaker Inference tutorial by @pagezyhf in #820
- spelling nit in pipelines.mdx by @jimburtoft in #823
- Initial PR for the documentation refactoring by @JingyaHuang in #791
- training dlc doc by @pagezyhf in #844
- Adding environment options explanation by @jimburtoft in #798
- Update the list of supported LLM models by @dacorvo in #859
- Update Llama benchmarks by @dacorvo in #858
- feat: Add Continuous pre-training example for SageMaker hyperpod by @Captainia in #842
- Fix typos by @omahs in #846
Bug fixes
- Fix broken cache for traced models & fix runtime error of diffusion models when batch_size > 1 by @JingyaHuang in #811
- Fix doc ci by @JingyaHuang in #838
New Contributors
- @omahs made their first contribution in #846
- @Captainia made their first contribution in #842
Full Changelog: v0.1.0...v0.2.0