Allow decoding Control Tokens to String #826

@ju6ge

Currently it is not possible to decode control tokens to their string representation:

```rust
pub fn token_to_bytes_with_size(
    &self,
    token: LlamaToken,
    buffer_size: usize,
    special: Special,
    lstrip: Option<NonZeroU16>,
) -> Result<Vec<u8>, TokenToStringError> {
    if token == self.token_nl() {
        return Ok(b"\n".to_vec());
    }

    // unsure what to do with this in the face of the 'special' arg + attr changes
    let attrs = self.token_attr(token);
    if attrs.is_empty()
        || attrs
            .intersects(LlamaTokenAttr::Unknown | LlamaTokenAttr::Byte | LlamaTokenAttr::Unused)
        || attrs.contains(LlamaTokenAttr::Control)
            && (token == self.token_bos() || token == self.token_eos())
    {
        return Ok(Vec::new());
    }

    let special = match special {
        Special::Tokenize => true,
        Special::Plaintext => false,
    };

    let string = CString::new(vec![b'*'; buffer_size]).expect("no null");
    let len = string.as_bytes().len();
    let len = c_int::try_from(len).expect("length fits into c_int");
    let buf = string.into_raw();
    let lstrip = lstrip.map_or(0, |it| i32::from(it.get()));
    let size = unsafe {
        llama_cpp_sys_2::llama_token_to_piece(
            self.vocab_ptr(),
            token.0,
            buf,
            len,
            lstrip,
            special,
        )
    };

    match size {
        0 => Err(TokenToStringError::UnknownTokenType),
        i if i.is_negative() => Err(TokenToStringError::InsufficientBufferSpace(i)),
        size => {
            let string = unsafe { CString::from_raw(buf) };
            let mut bytes = string.into_bytes();
            let len = usize::try_from(size).expect("size is positive and fits into usize");
            bytes.truncate(len);
            Ok(bytes)
        }
    }
}
```

The reason is that `token_to_bytes_with_size` returns an empty byte vector when asked for the string of a control token such as `model.token_eos()`. If I comment out that early `return Ok(Vec::new())`, I can decode the token just fine. Is there a particular reason for this behavior? The comment in the code does not make clear why it is handled this way.
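To illustrate, here is a minimal reproduction sketch. It assumes the crate's higher-level `token_to_str` helper (which, as far as I can tell, forwards to the function above) and a `model` that has already been loaded:

```rust
use llama_cpp_2::model::Special;

// `model` is an already-loaded LlamaModel; loading is omitted here.
let eos = model.token_eos();
// Even with Special::Tokenize requested, the attribute check above returns
// early, so the result is empty instead of e.g. "</s>" or "<|eot_id|>".
let piece = model.token_to_str(eos, Special::Tokenize)?;
assert_eq!(piece, "");
```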

Decoding control tokens would be useful because, when generating a grammar for the model's response, I need to include the EOS token so that generation terminates gracefully. But different models use different EOS tokens, so without knowing which string the EOS token corresponds to, I cannot generate the correct grammar.
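As a stopgap, the raw `llama_token_to_piece` binding shown above can be called directly with `special = true`, skipping the attribute check entirely. A sketch, assuming `llama_cpp_sys_2` exposes the `llama_vocab` and `llama_token` types and that `vocab` is the same pointer the wrapper passes as `self.vocab_ptr()`:

```rust
use std::os::raw::{c_char, c_int};

/// Decode any token, including control tokens such as EOS, into its raw
/// bytes by calling the FFI function directly (a sketch, not a proposed API).
unsafe fn control_token_to_bytes(
    vocab: *const llama_cpp_sys_2::llama_vocab,
    token: llama_cpp_sys_2::llama_token,
) -> Vec<u8> {
    let mut buf = vec![0u8; 256]; // generous buffer for a single piece
    let size = llama_cpp_sys_2::llama_token_to_piece(
        vocab,
        token,
        buf.as_mut_ptr().cast::<c_char>(),
        buf.len() as c_int,
        0,    // lstrip: keep leading whitespace
        true, // special: render control tokens instead of dropping them
    );
    // A negative return value means the buffer was too small (see the
    // InsufficientBufferSpace branch above); 0 means an unknown token type.
    assert!(size > 0, "llama_token_to_piece failed: {size}");
    buf.truncate(size as usize);
    buf
}
```

With something like this, the EOS piece of whichever model is loaded could be embedded into the grammar.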
