Conversation

@zharinov

Adds `from_raw_parts()` for constructing models from pre-parsed components; `from_pretrained()` now delegates to it.

Also fixes a bug where loading would fail if the tokenizer doesn't define an `unk_token` (not all tokenizers have one).

- `from_pretrained` now delegates to `from_raw_parts`
- Fixes BPE tokenizer support (unk_token_id now optional)
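
For orientation, here is a minimal sketch of the shape this constructor takes. The `Model` struct here is a trimmed-down stand-in, not the crate's real type, and the real signature also takes `weights` and `token_mapping` (see the review below):

```rust
use anyhow::{Context, Result};
use ndarray::Array2;
use tokenizers::Tokenizer;

// Hypothetical, simplified stand-in for the crate's model type.
pub struct Model {
    tokenizer: Tokenizer,
    embeddings: Array2<f32>,
    normalize: bool,
}

impl Model {
    /// Build a model from components the caller has already parsed,
    /// so `from_pretrained` (and tests) can delegate to one code path.
    pub fn from_raw_parts(
        tokenizer: Tokenizer,
        embeddings: Vec<f32>, // row-major, rows * cols values
        rows: usize,
        cols: usize,
        normalize: bool,
    ) -> Result<Self> {
        let embeddings = Array2::from_shape_vec((rows, cols), embeddings)
            .context("failed to build embeddings array")?;
        Ok(Self { tokenizer, embeddings, normalize })
    }
}
```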
@Pringled (Member) left a comment


Thanks for making this PR @zharinov! This is nice functionality to have, and good catch on the `unk_token`. I have two small (but nice-to-have) improvements; if you could implement those, this is good to go. Thanks for updating the tests as well 👍

.get("model")
.and_then(|m| m.get("unk_token"))
.and_then(Value::as_str);
let unk_token_id = unk_token.and_then(|tok| tokenizer.token_to_id(tok)).map(|id| id as usize);

This line is too permissive; I agree that the previous logic was too strict, but I'd prefer a middle ground:

- if `unk_token` is absent: `unk_token_id = None`
- if `unk_token` is present but not found in the vocab: error
- if `unk_token` is present and in the vocab: use its id

So something like this should work, I think:

```rust
let unk_token_id = match unk_token {
    None => None, // Allow None if the tokenizer does not declare one
    Some(tok) => match tokenizer.token_to_id(tok) {
        Some(id) => Some(id as usize),
        None => {
            return Err(anyhow!(
                "tokenizer declares unk_token='{tok}' but it isn't in the vocab"
            ))
        }
    },
};
```

```rust
let embeddings = Array2::from_shape_vec((rows, cols), embeddings.to_vec())
```

This does an extra full copy of the embeddings, which adds up for larger models. I think the easiest solution is to change `embeddings` in `from_raw_parts` from `embeddings: &[f32]` to `embeddings: Vec<f32>`, and then this becomes:

```rust
let embeddings = Array2::from_shape_vec((rows, cols), embeddings)
    .context("failed to build embeddings array")?;
```

And then `Self::from_raw_parts(tokenizer, &floats, rows, cols, normalize, weights, token_mapping)` becomes `Self::from_raw_parts(tokenizer, floats, rows, cols, normalize, weights, token_mapping)`. This way `from_pretrained` can hand the embeddings over directly instead of copying them.
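
For what it's worth, `Array2::from_shape_vec` takes ownership of the `Vec` and, for the standard row-major layout, reuses its allocation rather than copying the elements, so this change removes the copy entirely. A quick self-contained check of that (toy sizes, not from the PR):

```rust
use ndarray::Array2;

fn main() {
    let floats: Vec<f32> = vec![0.0; 2 * 3]; // pretend these were parsed from safetensors
    let ptr_before = floats.as_ptr();

    // from_shape_vec takes ownership; for the standard row-major layout it
    // reuses the Vec's allocation instead of copying the elements.
    let embeddings = Array2::from_shape_vec((2, 3), floats).unwrap();
    assert_eq!(embeddings.as_ptr(), ptr_before); // same buffer, no copy
}
```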

@Pringled (Member)

@zharinov, one additional comment: could you also run clippy to fix the formatting issues?
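
For reference, a typical invocation (the exact flags here are a suggestion, not a project requirement):

```
cargo clippy --all-targets -- -D warnings
```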
