Description
Currently it is not possible to decode control tokens:
llama-cpp-rs/llama-cpp-2/src/model.rs, lines 382 to 436 in 1ff8410:

```rust
pub fn token_to_bytes_with_size(
    &self,
    token: LlamaToken,
    buffer_size: usize,
    special: Special,
    lstrip: Option<NonZeroU16>,
) -> Result<Vec<u8>, TokenToStringError> {
    if token == self.token_nl() {
        return Ok(b"\n".to_vec());
    }

    // unsure what to do with this in the face of the 'special' arg + attr changes
    let attrs = self.token_attr(token);
    if attrs.is_empty()
        || attrs.intersects(
            LlamaTokenAttr::Unknown | LlamaTokenAttr::Byte | LlamaTokenAttr::Unused,
        )
        || attrs.contains(LlamaTokenAttr::Control)
            && (token == self.token_bos() || token == self.token_eos())
    {
        return Ok(Vec::new());
    }

    let special = match special {
        Special::Tokenize => true,
        Special::Plaintext => false,
    };

    let string = CString::new(vec![b'*'; buffer_size]).expect("no null");
    let len = string.as_bytes().len();
    let len = c_int::try_from(len).expect("length fits into c_int");
    let buf = string.into_raw();
    let lstrip = lstrip.map_or(0, |it| i32::from(it.get()));
    let size = unsafe {
        llama_cpp_sys_2::llama_token_to_piece(
            self.vocab_ptr(),
            token.0,
            buf,
            len,
            lstrip,
            special,
        )
    };

    match size {
        0 => Err(TokenToStringError::UnknownTokenType),
        i if i.is_negative() => Err(TokenToStringError::InsufficientBufferSpace(i)),
        size => {
            let string = unsafe { CString::from_raw(buf) };
            let mut bytes = string.into_bytes();
            let len = usize::try_from(size).expect("size is positive and fits into usize");
            bytes.truncate(len);
            Ok(bytes)
        }
    }
}
```
The reason is that `token_to_bytes_with_size` returns an empty bytes vector when trying to get the corresponding string for a token like `model.token_eos()`. If I comment out that early return, I can decode the token just fine. Is there a particular reason for this behavior? The comment in the code does not make it clear why it is handled this way.
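To illustrate, here is a minimal reproduction sketch, assuming a `LlamaModel` loaded as in the crate's examples (the helper name `show_eos_bytes` is mine, not part of the crate):

```rust
use llama_cpp_2::model::{LlamaModel, Special};

// Hypothetical repro helper: with the early return above in place,
// the EOS control token decodes to an empty byte vector.
fn show_eos_bytes(model: &LlamaModel) {
    let eos = model.token_eos();
    let bytes = model
        .token_to_bytes_with_size(eos, 32, Special::Tokenize, None)
        .expect("conversion itself does not error");
    // I would expect something like b"</s>" or b"<|eot_id|>",
    // but instead the result is empty:
    assert!(bytes.is_empty());
}
```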
The reason this would be useful: if I want to generate a grammar constraining a model's response, I need to include the EOS token in order to terminate generation gracefully. But different models use different EOS tokens, so without knowing which string the EOS token corresponds to, I am unable to generate the correct grammar.
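For context, here is a sketch of what I am trying to do. The grammar shape and the helper `grammar_with_eos` are illustrative only, not from the crate:

```rust
use llama_cpp_2::model::{LlamaModel, Special};

// Illustrative only: build a GBNF grammar that terminates on the model's
// EOS text. With the current behavior, `eos_bytes` comes back empty for
// control tokens, so the resulting grammar is wrong.
fn grammar_with_eos(model: &LlamaModel) -> String {
    let eos_bytes = model
        .token_to_bytes_with_size(model.token_eos(), 32, Special::Tokenize, None)
        .unwrap_or_default();
    let eos_text = String::from_utf8_lossy(&eos_bytes);
    // e.g. `root ::= answer "</s>"` for a model whose EOS renders as "</s>"
    format!("root ::= answer \"{eos_text}\"\nanswer ::= [a-zA-Z0-9 .,]+")
}
```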