Conversation

@zharinov

Adds `from_raw_parts()` for constructing models from pre-parsed components; `from_pretrained()` now delegates to it.

Also fixes a bug where loading would fail if the tokenizer doesn't define an `unk_token` (not all tokenizers have one).

- `from_pretrained` now delegates to `from_raw_parts`
- Fixes BPE tokenizer support (unk_token_id now optional)
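
For orientation, here is a minimal sketch of the shape this constructor takes. The `Model` struct here is a trimmed-down stand-in, not the crate's real type, and the real signature also takes `weights` and `token_mapping` (see the review below):

```rust
use anyhow::{Context, Result};
use ndarray::Array2;
use tokenizers::Tokenizer;

// Hypothetical, simplified stand-in for the crate's model type.
pub struct Model {
    tokenizer: Tokenizer,
    embeddings: Array2<f32>,
    normalize: bool,
}

impl Model {
    /// Build a model from components the caller has already parsed,
    /// so `from_pretrained` (and tests) can delegate to one code path.
    pub fn from_raw_parts(
        tokenizer: Tokenizer,
        embeddings: Vec<f32>, // row-major, rows * cols values
        rows: usize,
        cols: usize,
        normalize: bool,
    ) -> Result<Self> {
        let embeddings = Array2::from_shape_vec((rows, cols), embeddings)
            .context("failed to build embeddings array")?;
        Ok(Self { tokenizer, embeddings, normalize })
    }
}
```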
@Pringled (Member) left a comment


Thanks for making this PR @zharinov! This is nice functionality to have, and good catch on the `unk_token`. I have two small (but nice-to-have) improvements; if you could implement those, this is good to go. Thanks for updating the tests as well 👍

.get("model")
.and_then(|m| m.get("unk_token"))
.and_then(Value::as_str);
let unk_token_id = unk_token.and_then(|tok| tokenizer.token_to_id(tok)).map(|id| id as usize);

This line is too permissive; I agree that the previous logic was too strict, but I'd prefer a middle ground:

- if `unk_token` is absent: `unk_token_id = None`
- if `unk_token` is present but not found in the vocab: error
- if `unk_token` is present and in the vocab: use its id

So something like this should work, I think:

```rust
let unk_token_id = match unk_token {
    None => None, // Allow None if the tokenizer does not declare one
    Some(tok) => match tokenizer.token_to_id(tok) {
        Some(id) => Some(id as usize),
        None => {
            return Err(anyhow!(
                "tokenizer declares unk_token='{tok}' but it isn't in the vocab"
            ))
        }
    },
};
```

```rust
let embeddings = Array2::from_shape_vec((rows, cols), embeddings.to_vec())
```

This does an extra full copy of the embeddings, which adds up for larger models. I think the easiest solution is to change `embeddings` in `from_raw_parts` from `embeddings: &[f32]` to `embeddings: Vec<f32>`, and then this becomes:

```rust
let embeddings = Array2::from_shape_vec((rows, cols), embeddings)
    .context("failed to build embeddings array")?;
```

And then `Self::from_raw_parts(tokenizer, &floats, rows, cols, normalize, weights, token_mapping)` becomes `Self::from_raw_parts(tokenizer, floats, rows, cols, normalize, weights, token_mapping)`. This way `from_pretrained` can hand the embeddings over directly instead of copying them.
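
For what it's worth, `Array2::from_shape_vec` takes ownership of the `Vec` and, for the standard row-major layout, reuses its allocation rather than copying the elements, so this change removes the copy entirely. A quick self-contained check of that (toy sizes, not from the PR):

```rust
use ndarray::Array2;

fn main() {
    let floats: Vec<f32> = vec![0.0; 2 * 3]; // pretend these were parsed from safetensors
    let ptr_before = floats.as_ptr();

    // from_shape_vec takes ownership; for the standard row-major layout it
    // reuses the Vec's allocation instead of copying the elements.
    let embeddings = Array2::from_shape_vec((2, 3), floats).unwrap();
    assert_eq!(embeddings.as_ptr(), ptr_before); // same buffer, no copy
}
```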

@Pringled (Member)

@zharinov, one additional comment: could you also run clippy to fix the formatting issues?
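
For reference, a typical invocation (the exact flags here are a suggestion, not a project requirement):

```
cargo clippy --all-targets -- -D warnings
```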
