Releases: huggingface/optimum-neuron
v0.4.5: serving LLM Embeddings models
What's Changed
- doc: add a guide explaining how to deploy vLLM on Inference Endpoints by @tengomucho in #1057
- Add Qwen embedding guide and notebook by @pinak-p in #1045
- Serve embedding models using vLLM by @dacorvo in #1072
Other changes
- Update container URIs by @dacorvo in #1056
- Implement predownload and instance type detection for trn2 example by @jimburtoft in #1041
- Fix vllm IE Images by @tengomucho in #1058
- Fixed broken link to Neuron setup in README.md by @mlopezr in #1059
- Update LLM deployment documentation by @dacorvo in #1060
- doc: remove last mentions of TGI by @dacorvo in #1061
- Optimize cache lookup by @dacorvo in #1062
- Update vLLM container for Sagemaker to v0.4.4 in documentation by @tengomucho in #1063
- Optimize lookup by @dacorvo in #1066
- Fix cache registry for embedding models by @dacorvo in #1067
- dlc doc vllm new tag by @pagezyhf in #1069
- chore: add agentic instructions by @dacorvo in #1070
- Ci sequential and cache by @tengomucho in #1075
- Fix ci sanity in PRs by @tengomucho in #1077
- fix(exporter): remove deprecation warning by @dacorvo in #1076
- feat(docker): vllm container uses uv to install package by @tengomucho in #1071
Full Changelog: v0.4.4...v0.4.5
v0.4.4: improved vLLM perf with on-device-sampling disable, fix speculation algo, PEFT update for GRPO
What's Changed
Inference
- vLLM searches local configuration files by @tengomucho in #1046
- Fix speculation algorithm by @dacorvo in #1047
- Simplify LLM inference modeling and use longer sequences in tests by @dacorvo in #1052
- Remove Inf1 support by @tengomucho in #1054
- Improve vLLM performance when on-device-sampling is disabled by @dacorvo in #1055
Training
- PEFT update for GRPO by @michaelbenayoun in #1044
- Collective ops in `optimum/neuron/accelerate` by @michaelbenayoun in #1042
- Gradient checkpointing fix by @michaelbenayoun in #1043
Other
- Update pyproject.toml for uv by @michaelbenayoun in #1040
- chore: update pyproject.toml to fix incoherences by @tengomucho in #1050
Full Changelog: v0.4.3...v0.4.4
v0.4.3: fix for Llama4, device memory usage details, vLLM container accepts params
What's Changed
Inference
- Reduce llm tests by @dacorvo in #1033
- (Re-)disable hard-coded check in vLLM ModelConfig (fix for llama4) by @dacorvo in #1035
- fix: Flux timeout issue + nxd implementation refactoring by @JingyaHuang in #1022
- vLLM docker takes params by @tengomucho in #1039
Other
- AWS Neuron SDK 2.6.1 by @dacorvo in #1037
- Device memory usage by @dacorvo in #1036
- Cleanup CI workflows and bump development version by @dacorvo in #1034
Full Changelog: v0.4.2...v0.4.3
v0.4.2: Training cache fixes, Qwen3 Embedding support added, vLLM v1 API
What's Changed
Inference
- Fix input slots exhaustion in vLLM plugin by @dacorvo in #1028
- Agentic example by @tengomucho in #1030
- perf: move accuracy benchmark to vllm by @dacorvo in #1031
- Add support for Qwen3 embedding model by @dacorvo in #1023
- Update vllm version to 0.11.0 by @dacorvo in #1027
- feat: Add `encode` and `similarity` of Sentence Transformers by @JingyaHuang in #1012
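The `similarity` method added for Sentence Transformers-style models boils down to cosine similarity between embedding vectors. A minimal pure-Python sketch of that computation (illustrative only, not the optimum-neuron implementation; the function name is hypothetical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0, orthogonal directions score 0.0.
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```

In practice the real API operates on batched tensors produced by `encode`, but the score semantics are the same.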
Training
- Metrics for training by @michaelbenayoun in #982
- Update `trl` version to the latest release (`0.11.4` -> `0.24.0`) by @michaelbenayoun in #1000
- Add cache features to the `NeuronTrainer` by @michaelbenayoun in #1026
Other
- Sync with transformers 4.57.1 by @michaelbenayoun in #1016
- ci(vllm): login to docker by @tengomucho in #1010
- Fix small typos by @tengomucho in #1021
- Bump optimum to 2.0 by @JingyaHuang in #1018
- Unpin protobuf version by @JingyaHuang in #1014
- Fixing link in error message by @jimburtoft in #1029
- fix(vllm): fix base_neuron_llm_config fixture by @tengomucho in #1032
Full Changelog: v0.4.1...v0.4.2
v0.4.1: Xet High Performance transfers, vLLM served model name
What's Changed
- chore: bump huggingface_hub version, set HF_XET_HIGH_PERFORMANCE by @tengomucho in #998
Inference
- Cleanup llm export and serving by @dacorvo in #999
- Midsize models trn2 benchmarks by @dacorvo in #1002
- Refactor autofill cache tools by @dacorvo in #1003
- Isolate Trainium 1 and Trainium 2 autofill cache workflows by @dacorvo in #1005
- Fix: wav2vec2 export + reduce transformers test workload by @JingyaHuang in #1006
- add `--served_model_name` to vLLM by @tengomucho in #1004
- NxD backend refactoring by @dacorvo in #1007
- Lookup without neuronx by @tengomucho in #1009
Training
- ZeRO-1 precision args for the NeuronTrainer by @michaelbenayoun in #997
Documentation
- DLC doc by @pagezyhf in #995
- doc: update doc to avoid using image_uri when optimum neuron not avail by @tengomucho in #994
Full Changelog: v0.4.0...v0.4.1
v0.4.0: AWS Neuron SDK 2.6, Trainium 2 support, Qwen3-MoE, Llama4 (text)
What's Changed
Inference
- Add Flux Inpaint support by @JingyaHuang in #909
- chore: Bump diffusers to 0.35.* by @JingyaHuang in #935
- fix: flux neuron cache detection by @JingyaHuang in #937
- Add flux inpaint to supported by @JingyaHuang in #932
- feat: Support flux kontext for text2img by @jlonge4 in #916
- allow safetensors to be downloaded for flux by @Abdennacer-Badaoui in #939
- Add support for SmolLM3 models by @dacorvo in #934
- Add support for Qwen3Moe models by @dacorvo in #945
- Cleanup inference backend modules by @dacorvo in #948
- Remove unsupported modeling flags by @dacorvo in #950
- Remove optimized model dependency in LLM models by @dacorvo in #955
- Add tests for modules used in inference by @tengomucho in #957
- Add support for text generation in Llama4 models by @dacorvo in #959
- test(inference): add tests to check decoder layer accuracy by @tengomucho in #962
- Add vLLM docker image by @dacorvo in #967
- Add trn1 vllm llama benchmark by @dacorvo in #970
- Enable CPU compilation by @Abdennacer-Badaoui in #961
- Enable `instance_type` tag to export by @JingyaHuang in #974
- Add trn2 benchmarks for a few big models by @dacorvo in #991
- Improve DX when exporting and deploying LLM neuron models by @dacorvo in #986
- ECR Image URI retrieval by @tengomucho in #985
- Add support for Trainium 2 for decoder models by @dacorvo in #988
- Automatically detect platform when serving models by @dacorvo in #989
- Add trn1 qwen3 and llama4 moe vLLM benchmark by @dacorvo in #973
Training
- Trainers refactor by @michaelbenayoun in #918
- fix: uses processing_class instead of tokenizer in base trainer by @michaelbenayoun in #927
- fix: fixes barrier issue at the end of training with hub sync by @michaelbenayoun in #925
- Sync training custom modeling to transformers=4.55.4 by @michaelbenayoun in #954
- Trainer simplification by @michaelbenayoun in #938
- Fix attention implementation argument in custom modeling by @michaelbenayoun in #963
- ZeRO-1 and mixed-precision by @michaelbenayoun in #956
- PEFT and PP by @michaelbenayoun in #964
- Fix async save by @michaelbenayoun in #976
Documentation
- docs: remove finetune with AWS guide by @michaelbenayoun in #905
- Contribute custom modeling by @michaelbenayoun in #908
- docs: remove sagemaker guide by @michaelbenayoun in #906
- Supported architectures page by @michaelbenayoun in #907
- Update cache system guide by @michaelbenayoun in #910
- [docs] Getting started page by @michaelbenayoun in #911
- [docs] Move inference API section by @michaelbenayoun in #913
- [docs] Tutorial sections by @michaelbenayoun in #914
- [docs] Update the link for the card images by @michaelbenayoun in #915
- [docs] Quickstart page by @michaelbenayoun in #912
- chore: remove doc-builder dependency from quality extra by @tengomucho in #917
- [docs] Trainers api by @michaelbenayoun in #922
- [docs] Distributed training guide by @michaelbenayoun in #921
- [docs] Transformations specs api ref by @michaelbenayoun in #923
- [docs] Lora API reference page by @michaelbenayoun in #924
- Update pipelines.mdx by @Abdennacer-Badaoui in #942
- [docs] LLama tutorial: adapt the Llama tutorial to the new format by @michaelbenayoun in #919
- Add whitepaper by @pagezyhf in #958
- Add vllm install instructions to documentation by @jimburtoft in #952
New Contributors
- @Abdennacer-Badaoui made their first contribution in #939
Full Changelog: v0.3.0...v0.4.0
v0.3.0: vLLM plugin, FLUX support, SDK 2.24
What's Changed
- chore: bump aws neuron sdk version to 2.24.0 by @JingyaHuang in #856
- Add BlackForest Flux Support by @JingyaHuang in #815
Inference
- [LLM] Reenable on device sampling for (almost) all configurations by @dacorvo in #886
- Add vLLM plugin by @dacorvo in #888
- Move `NEURON_FUSE_SOFTMAX` and `NEURON_CUSTOM_SILU` env vars to diffusers model loading by @JingyaHuang in #889
- Update LLM benchmarks by @dacorvo in #895
- Bump accelerate to 1.3.0 + peft to 0.15.2+diffusers>=0.31.0 by @JingyaHuang in #901
- chore: move inference modeling code by @JingyaHuang in #902
Training
- Few inference fixes by @tengomucho in #880
- Auto model classes for custom modeling by @michaelbenayoun in #883
- Finetune llm example by @michaelbenayoun in #894
General
- Remove `is_torch_xla_available` and `is_neuronx_available` by @michaelbenayoun in #884
- Type hint cleaning by @michaelbenayoun in #887
Documentation
- doc(vllm): change reco for models that are not cached by @dacorvo in #899
- Remove example scripts by @michaelbenayoun in #893
- ci: align doc workflow on doc-pr by @dacorvo in #896
- Update README by @michaelbenayoun in #900
- Benchmark on TGI + optimum-neuron by @jlonge4 in #904
Full Changelog: v0.2.2...v0.3.0
release: 0.2.2 - Fix LLM inference modeling
What's Changed
The LLM inference code led to compilation errors for models whose `head_dim` is not equal to `hidden_size // num_attention_heads`, such as Qwen3-0.6B and Qwen3-32B. This release fixes the inference modeling to handle these models.
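The failure mode above can be illustrated with a small sketch: some configs declare an explicit `head_dim` that differs from the derived value, so modeling code must prefer the explicit field instead of re-deriving it (the helper name and the example values are illustrative; Qwen3-0.6B is believed to use `hidden_size=1024`, 16 heads, and `head_dim=128`):

```python
def resolve_head_dim(config: dict) -> int:
    """Prefer an explicit head_dim over the value derived from hidden_size.

    For most models head_dim == hidden_size // num_attention_heads, but
    models like Qwen3 set head_dim explicitly to a different value.
    """
    derived = config["hidden_size"] // config["num_attention_heads"]
    return config.get("head_dim", derived)

# Qwen3-0.6B-like config: derived value would be 1024 // 16 == 64,
# but the explicit head_dim is 128.
qwen3_like = {"hidden_size": 1024, "num_attention_heads": 16, "head_dim": 128}
print(resolve_head_dim(qwen3_like))  # 128
```

Code that always used the derived value would build attention projections with the wrong shapes for such models, which is the kind of mismatch that surfaced as a compilation error.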
Full Changelog: v0.2.1...v0.2.2
v0.2.1: NxD refactoring
What's Changed
Inference
- Add qwen2 nxd by @dacorvo in #863
- Support Qwen3 by @jlonge4 in #847
- Add support for phi3 models using the nxd backend by @dacorvo in #867
- Add pixart models to cache CI by @JingyaHuang in #869
- Add granite nxd modeling and remove HLO backend by @dacorvo in #873
- chore(mixtral): align compile options to NXDi by @tengomucho in #875
- Refactoring T5 implementation for NxD support by @JingyaHuang in #876
- Improve diffusers cache CIs by @JingyaHuang in #872
Training
- Initial PR for peft by @michaelbenayoun in #839
- Support for PP with custom modeling by @michaelbenayoun in #857
- Cleanup legacy parallelism support by @michaelbenayoun in #866
- Fix workflows for training by @tengomucho in #874
- Remove `optimum/neuron/distributed` by @michaelbenayoun in #877
Documentation
- update guidellm version to reproduce examples properly by @jlonge4 in #852
- Tutorial for Qwen3 Fine-tuning by @tengomucho in #865
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- Bump to AWS neuron sdk 2.22 by @JingyaHuang in #828
- chore: bump AMI base version for Neuron SDK 2.22 by @dacorvo in #831
Inference
- Cache granite and phi4 models by @dacorvo in #809
- Refactor hub neuronx cache by @dacorvo in #829
- Add Whisper for the task "automatic-speech-recognition" w/o. KV cache by @JingyaHuang in #789
- Add support for Modern BERT by @JingyaHuang in #818
- Set task to none for multi models cache entry by @dacorvo in #832
- ci: add cv2 to workaround transformers spurious import by @dacorvo in #834
- Refactor decoder modeling by @dacorvo in #835
- Refactor decoder export by @dacorvo in #837
- Add decoder custom modeling for inference based on NxD by @dacorvo in #840
- Activate continuous batching for Llama on NxD by @dacorvo in #848
- Tgi integration by @dacorvo in #855
- Avoid loading weights when exporting an NxD model using the CLI by @dacorvo in #860
- test(speculation): do not load weights during export by @dacorvo in #861
Training
- Training remove gpt neo models support by @tengomucho in #807
- chore(test): add test comparing Linear and RowParallelLinear outputs by @tengomucho in #814
- More training tests updates by @tengomucho in #808
- test(training): add flash attention test by @tengomucho in #824
- Granite modeling for training by @tengomucho in #830
- Cache Hub API Changes by @tengomucho in #836
- Custom modeling for training by @michaelbenayoun in #801
- 🪨 Granite Training by @tengomucho in #845
- Training granite warning flash attention by @michaelbenayoun in #849
- Add Qwen3 modeling for training by @tengomucho in #850
Documentation
- latest available tgi dlc uri by @pagezyhf in #812
- Add guidelines on EC2 creation with the DLAMI by @pagezyhf in #795
- Add per service section in tutorials and a first example for tutorial > inference > SageMaker by @pagezyhf in #796
- Mixtral Sagemaker Inference tutorial by @pagezyhf in #820
- spelling nit in pipelines.mdx by @jimburtoft in #823
- Initial PR for the documentation refactoring by @JingyaHuang in #791
- training dlc doc by @pagezyhf in #844
- Adding environment options explanation by @jimburtoft in #798
- Update the list of supported LLM models by @dacorvo in #859
- Update Llama benchmarks by @dacorvo in #858
- feat: Add Continuous pre-training example for SageMaker hyperpod by @Captainia in #842
- Fix typos by @omahs in #846
Bug fixes
- Fix broken cache for traced models & fix runtime error of diffusion models when batch_size > 1 by @JingyaHuang in #811
- Fix doc ci by @JingyaHuang in #838
New Contributors
- @omahs made their first contribution in #846
- @Captainia made their first contribution in #842
Full Changelog: v0.1.0...v0.2.0