
@williambarberjr williambarberjr commented Jan 30, 2026

What does this PR do?

This PR adds support for voyageai/voyage-4-nano, a Qwen3-based embedding model that uses bidirectional attention and a projection layer.

Changes

1. Bidirectional Attention Support

  • Added use_bidirectional_attention config field (default: false)
  • When true, disables causal masking in the attention mechanism
  • voyage-4-nano and similar embedding models use bidirectional attention to see the full context
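The effect of the flag can be sketched as a toggle on the attention bias. This is an illustrative stdlib-only sketch (the real implementation builds masks with candle tensors); the function name `attention_bias` is made up for this example, only the `use_bidirectional_attention` field name comes from the PR:

```rust
// Sketch: how a causal vs. bidirectional attention bias differs.
// When `use_bidirectional_attention` is true, no position is masked out.
fn attention_bias(seq_len: usize, use_bidirectional_attention: bool) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| {
                    if use_bidirectional_attention || j <= i {
                        0.0 // position j is visible to position i
                    } else {
                        f32::NEG_INFINITY // causal: mask future positions
                    }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let causal = attention_bias(3, false);
    let bidi = attention_bias(3, true);
    // Causal: token 0 cannot attend to the future token 2.
    assert_eq!(causal[0][2], f32::NEG_INFINITY);
    // Bidirectional: every position sees the full context.
    assert!(bidi.iter().flatten().all(|&b| b == 0.0));
    println!("mask sketch ok");
}
```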

2. Projection Layer Support

  • Added num_labels config field for output projection dimension
  • When set, loads linear.weight from safetensors root level and applies projection after final normalization
  • voyage-4-nano projects from hidden_size=1024 to output_dim=2048
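The projection step itself is just a linear map applied after the final norm. A minimal sketch, with tiny dimensions standing in for the real 2048x1024 `linear.weight` matrix (the `project` helper is hypothetical, not the TEI API):

```rust
// Sketch: applying an output projection after the final normalization,
// as voyage-4-nano does (hidden_size=1024 -> output_dim=2048).
// `weight` is [output_dim][hidden_size]; plain matrix-vector product.
fn project(hidden: &[f32], weight: &[Vec<f32>]) -> Vec<f32> {
    weight
        .iter()
        .map(|row| row.iter().zip(hidden).map(|(w, h)| w * h).sum())
        .collect()
}

fn main() {
    let hidden = vec![1.0, 2.0]; // stands in for the 1024-d hidden state
    let weight = vec![           // stands in for the 2048x1024 projection
        vec![1.0, 0.0],
        vec![0.0, 1.0],
        vec![1.0, 1.0],
        vec![1.0, -1.0],
    ];
    let out = project(&hidden, &weight);
    assert_eq!(out, vec![1.0, 2.0, 3.0, -1.0]);
    println!("projected {} -> {} dims", hidden.len(), out.len());
}
```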

Model Configuration

Models using these features should have in their config.json:

```json
{
  "use_bidirectional_attention": true,
  "num_labels": 2048
}
```
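The backwards-compatibility claim boils down to both fields having "off" defaults. An illustrative struct (not the actual TEI config type, which deserializes these fields from config.json; only the two field names come from the PR):

```rust
// Sketch: the two new fields with their backwards-compatible defaults.
#[derive(Default, Debug, PartialEq)]
struct Qwen3ConfigExt {
    // false -> causal masking, the old Qwen3 behavior
    use_bidirectional_attention: bool,
    // None -> no projection layer, the old Qwen3 behavior
    num_labels: Option<usize>,
}

fn main() {
    // A config.json that omits both keys keeps the old behavior.
    let old = Qwen3ConfigExt::default();
    assert!(!old.use_bidirectional_attention);
    assert!(old.num_labels.is_none());

    // voyage-4-nano sets both fields.
    let voyage = Qwen3ConfigExt {
        use_bidirectional_attention: true,
        num_labels: Some(2048),
    };
    assert_eq!(voyage.num_labels, Some(2048));
    println!("{:?}", voyage);
}
```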

Testing

Tested with voyageai/voyage-4-nano:

  • ✅ Output dimension: 2048 (correct)
  • ✅ Cosine similarity vs HuggingFace transformers: 0.999965
  • ✅ Inference time: ~9ms on L4 GPU (vs 35ms with transformers)
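For reference, the parity number above is a cosine similarity between the two implementations' embeddings. A minimal sketch of that check (the PR's actual tests use insta snapshots; this helper is illustrative):

```rust
// Sketch: cosine similarity between two embedding vectors, the metric
// used to compare against the HuggingFace transformers reference.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Nearly parallel vectors score close to 1.0, like the 0.999965
    // reported for voyage-4-nano vs. transformers.
    let sim = cosine_similarity(&[1.0, 0.0, 1.0], &[1.0, 0.1, 1.0]);
    assert!(sim > 0.99 && sim <= 1.0);
    println!("cosine similarity: {sim}");
}
```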

Files Changed

  • backends/candle/src/models/flash_qwen3.rs - CUDA/flash attention implementation
  • backends/candle/src/models/qwen3.rs - CPU/Metal implementation + config struct
  • backends/candle/Cargo.toml - Added cudarc dev-dependency for CUDA tests
  • backends/candle/tests/test_voyage_nano.rs - CPU test with snapshots
  • backends/candle/tests/test_flash_voyage_nano.rs - CUDA test with snapshots
  • README.md - Added voyage-4-nano to supported models table
  • docs/source/en/supported_models.md - Added voyage-4-nano to docs

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

@Narsil @alvarobartt - This adds two new config fields to support the voyage-4-nano embedding model. The changes are backwards compatible (both fields default to the existing behavior).

@williambarberjr force-pushed the voyage-4-nano-support branch 6 times, most recently from bd2bc16 to 539f322 on January 30, 2026 at 22:59. Commit message:
Add two new config fields to Qwen3 to support voyage-4-nano and similar models:

- `use_bidirectional_attention`: When true, disables causal masking
  for embedding models that use full bidirectional attention
- `num_labels`: When set, loads projection layer from linear.weight
  at safetensors root level (e.g., 1024 -> 2048 for voyage-4-nano)

Both fields are backwards compatible, defaulting to disabled behavior.

Changes:
- backends/candle/src/models/qwen3.rs: Add config fields and CPU impl
- backends/candle/src/models/flash_qwen3.rs: Add CUDA/flash-attn impl
- backends/candle/tests/test_voyage_nano.rs: CPU tests with snapshots
- backends/candle/tests/test_flash_voyage_nano.rs: CUDA tests
- README.md, docs/source/en/supported_models.md: Add voyage-4-nano

Tested with voyageai/voyage-4-nano:
- Output dimension: 2048 (correct)
- Cosine similarity vs transformers: 0.999965
- Inference time: ~9ms on L4 GPU (vs 35ms with transformers)
