feat: Add from_raw_parts() constructor
#33
base: main
Conversation
- `from_pretrained` now delegates to `from_raw_parts`
- Fixes BPE tokenizer support (`unk_token_id` now optional)
Pringled
left a comment
Thanks for making this PR @zharinov! This is a nice functionality to have I think, and good catch about the unk_token. I have two small (but nice to have) improvements; if you could implement those this is good to go. Thanks for updating the tests as well 👍
| .get("model") | ||
| .and_then(|m| m.get("unk_token")) | ||
| .and_then(Value::as_str); | ||
| let unk_token_id = unk_token.and_then(|tok| tokenizer.token_to_id(tok)).map(|id| id as usize); |
This line is too permissive; I agree that the previous logic was too strict, but I prefer a middleground:
- if unk_token is absent: unk_token_id = None
- if unk_token is present but not found in vocab: error
- if unk_token is present and in vocab: use the unk_token
So something like this should work I think:
```rust
let unk_token_id = match unk_token {
    None => None, // Allow None if tokenizer does not declare one
    Some(tok) => match tokenizer.token_to_id(tok) {
        Some(id) => Some(id as usize),
        None => {
            return Err(anyhow!(
                "tokenizer declares unk_token='{tok}' but it isn't in the vocab"
            ))
        }
    },
};
```
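As a side note (not part of this PR), the same three-case policy can be factored into a small, self-contained helper for testing; the `token_to_id` closure here is a stand-in for `tokenizer.token_to_id`, not an existing function in the crate.

```rust
use anyhow::{anyhow, Result};

// Sketch of the policy described above: no declared unk_token is fine (None),
// a declared unk_token must resolve to a vocab id, otherwise return an error.
fn resolve_unk_token_id(
    unk_token: Option<&str>,
    token_to_id: impl Fn(&str) -> Option<u32>,
) -> Result<Option<usize>> {
    match unk_token {
        None => Ok(None),
        Some(tok) => match token_to_id(tok) {
            Some(id) => Ok(Some(id as usize)),
            None => Err(anyhow!(
                "tokenizer declares unk_token='{tok}' but it isn't in the vocab"
            )),
        },
    }
}
```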
```rust
let embeddings = Array2::from_shape_vec((rows, cols), embeddings.to_vec())
```
This does an extra full copy of the embeddings, which adds up for larger models. I think the easiest solution is to change `embeddings` in `from_raw_parts` from `embeddings: &[f32]` to `embeddings: Vec<f32>`, and then this becomes:

```rust
let embeddings = Array2::from_shape_vec((rows, cols), embeddings)
    .context("failed to build embeddings array")?;
```

And then `Self::from_raw_parts(tokenizer, &floats, rows, cols, normalize, weights, token_mapping)` becomes `Self::from_raw_parts(tokenizer, floats, rows, cols, normalize, weights, token_mapping)`. This way `from_pretrained` can move the embeddings over directly.
@zharinov one additional comment, could you also run clippy to fix the formatting issues?
Adds `from_raw_parts()` for constructing models from pre-parsed components; `from_pretrained()` now delegates to it. Also fixes a bug where loading would fail if the tokenizer doesn't define an `unk_token` (not all tokenizers have one).
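For readers skimming the thread, here is a rough sketch of the constructor shape implied by the discussion above. The struct name `StaticModel`, the field set, and the omission of the `weights` and `token_mapping` parameters are assumptions made for brevity; only the general idea (build the embedding matrix from pre-parsed pieces, with `from_pretrained` delegating to it) comes from the PR itself.

```rust
use anyhow::{Context, Result};
use ndarray::Array2;
use tokenizers::Tokenizer;

// Hypothetical sketch; names and fields are illustrative, not the crate's actual API.
pub struct StaticModel {
    tokenizer: Tokenizer,
    embeddings: Array2<f32>,
    normalize: bool,
}

impl StaticModel {
    /// Build a model from components that were already parsed elsewhere.
    pub fn from_raw_parts(
        tokenizer: Tokenizer,
        embeddings: Vec<f32>,
        rows: usize,
        cols: usize,
        normalize: bool,
    ) -> Result<Self> {
        // Taking Vec<f32> lets the matrix reuse the caller's buffer instead of copying it.
        let embeddings = Array2::from_shape_vec((rows, cols), embeddings)
            .context("failed to build embeddings array")?;
        Ok(Self { tokenizer, embeddings, normalize })
    }
}
```

Under this sketch, `from_pretrained` would parse `tokenizer.json` and the embedding file, then forward the pieces to `from_raw_parts`.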