Allow decoding Control Tokens to String #826

@ju6ge

Currently it is not possible to decode control tokens to their string representation:

```rust
pub fn token_to_bytes_with_size(
    &self,
    token: LlamaToken,
    buffer_size: usize,
    special: Special,
    lstrip: Option<NonZeroU16>,
) -> Result<Vec<u8>, TokenToStringError> {
    if token == self.token_nl() {
        return Ok(b"\n".to_vec());
    }

    // unsure what to do with this in the face of the 'special' arg + attr changes
    let attrs = self.token_attr(token);
    if attrs.is_empty()
        || attrs
            .intersects(LlamaTokenAttr::Unknown | LlamaTokenAttr::Byte | LlamaTokenAttr::Unused)
        || attrs.contains(LlamaTokenAttr::Control)
            && (token == self.token_bos() || token == self.token_eos())
    {
        return Ok(Vec::new());
    }

    let special = match special {
        Special::Tokenize => true,
        Special::Plaintext => false,
    };

    let string = CString::new(vec![b'*'; buffer_size]).expect("no null");
    let len = string.as_bytes().len();
    let len = c_int::try_from(len).expect("length fits into c_int");
    let buf = string.into_raw();
    let lstrip = lstrip.map_or(0, |it| i32::from(it.get()));
    let size = unsafe {
        llama_cpp_sys_2::llama_token_to_piece(
            self.vocab_ptr(),
            token.0,
            buf,
            len,
            lstrip,
            special,
        )
    };

    match size {
        0 => Err(TokenToStringError::UnknownTokenType),
        i if i.is_negative() => Err(TokenToStringError::InsufficientBufferSpace(i)),
        size => {
            let string = unsafe { CString::from_raw(buf) };
            let mut bytes = string.into_bytes();
            let len = usize::try_from(size).expect("size is positive and fits into usize");
            bytes.truncate(len);
            Ok(bytes)
        }
    }
}
```

The reason is that `token_to_bytes_with_size` returns an empty byte vector when asked for the string of a control token such as `model.token_eos()`. If I comment out that early `return Ok(Vec::new())`, I can decode the token just fine. Is there a particular reason for this behavior? The comment in the code does not make clear why it is handled this way.
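To illustrate, here is a minimal reproduction sketch. It assumes the crate's higher-level `token_to_str` helper (which, as far as I can tell, forwards to the function above) and a `model` that has already been loaded:

```rust
use llama_cpp_2::model::Special;

// `model` is an already-loaded LlamaModel; loading is omitted here.
let eos = model.token_eos();
// Even with Special::Tokenize requested, the attribute check above returns
// early, so the result is empty instead of e.g. "</s>" or "<|eot_id|>".
let piece = model.token_to_str(eos, Special::Tokenize)?;
assert_eq!(piece, "");
```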

Decoding control tokens would be useful because, when generating a grammar for the model's response, I need to include the EOS token so that generation terminates gracefully. But different models use different EOS tokens, so without knowing which string the EOS token corresponds to, I cannot generate the correct grammar.
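As a stopgap, the raw `llama_token_to_piece` binding shown above can be called directly with `special = true`, skipping the attribute check entirely. A sketch, assuming `llama_cpp_sys_2` exposes the `llama_vocab` and `llama_token` types and that `vocab` is the same pointer the wrapper passes as `self.vocab_ptr()`:

```rust
use std::os::raw::{c_char, c_int};

/// Decode any token, including control tokens such as EOS, into its raw
/// bytes by calling the FFI function directly (a sketch, not a proposed API).
unsafe fn control_token_to_bytes(
    vocab: *const llama_cpp_sys_2::llama_vocab,
    token: llama_cpp_sys_2::llama_token,
) -> Vec<u8> {
    let mut buf = vec![0u8; 256]; // generous buffer for a single piece
    let size = llama_cpp_sys_2::llama_token_to_piece(
        vocab,
        token,
        buf.as_mut_ptr().cast::<c_char>(),
        buf.len() as c_int,
        0,    // lstrip: keep leading whitespace
        true, // special: render control tokens instead of dropping them
    );
    // A negative return value means the buffer was too small (see the
    // InsufficientBufferSpace branch above); 0 means an unknown token type.
    assert!(size > 0, "llama_token_to_piece failed: {size}");
    buf.truncate(size as usize);
    buf
}
```

With something like this, the EOS piece of whichever model is loaded could be embedded into the grammar.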
