Implement custom NXDI modeling for Qwen3 embedding models #1001
When targeting Neuron for compilation, any model repository containing a `config_sentence_transformers.json` is routed to the `sentence_transformers` library and traced with no support for NXDI or tensor parallelism (TP). This happens because of the exporter logic in `exporters/tasks.py:1994`.

Original Behavior

Running a Qwen3 embedding model with the default path:

```
optimum-cli export neuron --model Qwen/Qwen3-Embedding-0.6B --batch_size 1 --sequence_length 1024 --auto_cast matmul --instance_type trn2 --tensor_parallel_size 4 qwen3-embedding-0.6b-neuron/
```

results in the `sentence_transformers` model being traced, without NXDI / tensor parallelism support.
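For context, the routing decision boils down to a file-presence check along these lines (a simplified sketch only; the actual logic in `optimum/exporters/tasks.py` is more involved, and the function name and marker list here are illustrative, not the library's API):

```python
from pathlib import Path


def infer_library(model_dir: str) -> str:
    """Simplified sketch: choose the export path from files in the model repo.

    Any checkpoint shipping a config_sentence_transformers.json (or modules.json)
    gets routed to the sentence_transformers tracer, which has no NXDI/TP support.
    """
    repo = Path(model_dir)
    st_markers = ("config_sentence_transformers.json", "modules.json")
    if any((repo / marker).exists() for marker in st_markers):
        return "sentence_transformers"
    return "transformers"
```

Because Qwen3 embedding checkpoints ship these marker files, they never reach the decoder-based NXDI path under this scheme.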
Solution
I've implemented the Qwen3 embedding model using the decoder modeling code, yielding up to a 4x throughput improvement for the same model and sequence length with tensor parallelism enabled: https://github.com/tonywngzon/optimum-neuron
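The core idea can be pictured with a minimal sketch: reuse the causal-LM decoder stack but return hidden states instead of logits, then pool them into one vector per sequence. The classes and the last-token pooling choice below are illustrative stand-ins, not the PR's actual `NxDQwen3EmbeddingModel` code (which extends `NxDDecoderModel` and carries all the NXDI/TP plumbing elided here):

```python
import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    """Stand-in for the decoder stack (the real code extends NxDDecoderModel)."""

    def __init__(self, vocab_size: int = 64, hidden_size: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.layer = nn.Linear(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def hidden_states(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.layer(self.embed(input_ids))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Causal-LM path: hidden states are projected to vocabulary logits.
        return self.lm_head(self.hidden_states(input_ids))


class TinyEmbeddingDecoder(TinyDecoder):
    """Embedding path: skip the LM head and return hidden states directly."""

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.hidden_states(input_ids)


def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pool (batch, seq, hidden) states to one vector per sequence by taking
    the hidden state of the last non-padding token (assumes right padding)."""
    last_idx = attention_mask.sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), last_idx]
```

For a `(batch, seq_len)` input, the embedding variant returns `(batch, seq_len, hidden_size)` hidden states rather than `(batch, seq_len, vocab_size)` logits, and pooling reduces that to `(batch, hidden_size)` embeddings.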
Changes Made
- Added an optional `embedding_model` parameter to `NxDNeuronConfig` (`optimum-neuron/optimum/neuron/models/inference/backend/config.py`), set to `True` for the Qwen3 embedding model.
- Changed the export order (`optimum-neuron/optimum/exporters/neuron/__main__.py`).
- Registered the embedding model (`optimum-neuron/optimum/neuron/models/inference/auto_models.py:109`).
- Created a new `modeling_qwen3_embedding.py` file that overrides the original decoder forward function to support embeddings:
  - Extends `NxDDecoderModel` with a custom `NxDQwen3EmbeddingModel` that overrides the forward pass to return hidden states instead of logits
  - Adds a `Qwen3NxDModelForCausalLMEmbedding` that wraps the model with embedding-specific methods
  - The `forward()` method returns hidden states directly
  - Disables `continuous_batching` and `on_device_sampling` for embeddings
  - Adds an `encode()` method for getting embeddings with proper `position_ids` handling

Result
With these changes:
- The exporter now correctly handles models with a `modules.json` or `config_sentence_transformers.json` (`optimum/exporters/tasks.py:1834`).
- The same command now exports the NXDI model:

```
optimum-cli export neuron --model Qwen/Qwen3-Embedding-0.6B --batch_size 1 --sequence_length 1024 --auto_cast matmul --instance_type trn2 --tensor_parallel_size 4 qwen3-embedding-0.6b-neuron/
```

This solves the issue where models with a `config_sentence_transformers.json` were forced down the `sentence_transformers` path, preventing the use of optimum-neuron's more performant NXDI implementation with tensor parallelism support.

Testing
Tested locally by: