Conversation
lib/tokenizers/decode_stream.ex
```elixir
@doc """
Steps through the decode stream with the given tokenizer and token ID.

Returns `{:ok, string}` if there's a decoded string, or `{:ok, nil}` if there's nothing more to decode.
```
I would likely use `:done` for when there is nothing more to decode.
There is a problem here. While it looks like a stream process, it's actually not: it is never exhausted. One can call `step(12345678)` and then call `step(0)`; that's valid from the original library's perspective. One can also call `step(n)` multiple times (and it's not even idempotent).
https://docs.rs/tokenizers/0.21.1/tokenizers/tokenizer/struct.DecodeStream.html

```rust
use tokenizers::Tokenizer;
let tokenizer = Tokenizer::from_file("data/roberta.json").unwrap();
let mut decode_stream = tokenizer.decode_stream(false);
assert_eq!(decode_stream.step(713).unwrap(), Some("This".to_string()));
assert_eq!(decode_stream.step(16).unwrap(), Some(" is".to_string()));
assert_eq!(decode_stream.step(41).unwrap(), Some(" an".to_string()));
assert_eq!(
    decode_stream.step(1246).unwrap(),
    Some(" example".to_string())
);
```

From this perspective, `:done` looks inappropriate: it carries "last action" semantics, while the call is not final.
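One reason a step can yield nothing without the stream being finished: a single token id may map to only part of a multi-byte character, so there is no printable text yet. A toy model of that behavior follows; the byte-buffering scheme here is purely illustrative and is not the tokenizers crate's implementation:

```rust
// Toy model of an incremental decode step: bytes accumulate until they
// form valid UTF-8, so a step can legitimately yield None mid-stream.
struct ToyDecodeStream {
    pending: Vec<u8>,
}

impl ToyDecodeStream {
    fn new() -> Self {
        ToyDecodeStream { pending: Vec::new() }
    }

    // `bytes` stands in for whatever bytes a token id maps to.
    fn step(&mut self, bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let out = s.to_string();
                self.pending.clear();
                Some(out)
            }
            // Incomplete character so far: not "done", just "nothing yet".
            Err(_) => None,
        }
    }
}

fn main() {
    let mut ds = ToyDecodeStream::new();
    // "é" is two bytes in UTF-8: 0xC3 0xA9. Feed them one at a time.
    assert_eq!(ds.step(&[0xC3]), None); // nothing printable yet
    assert_eq!(ds.step(&[0xA9]), Some("é".to_string())); // now a full character
    assert_eq!(ds.step(b"!"), Some("!".to_string())); // stream still usable after a None
    println!("ok");
}
```

This is why a `None`-like result is a normal mid-stream answer rather than a terminal one.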
Well,
`Enum.at/2` returns `nil` for out of range.
Erlang's libs (like `array`) raise `ArgumentError`.
I've never seen `:out_of_range` anywhere across the ecosystem, so it may not work well.
There was a problem hiding this comment.
`Enum.at` does not wrap it in a tuple though. `Enum.fetch`, which returns `{:ok, _}` or `:error`, is a better example, but it has only two states. My understanding is that this code has three states:
- In range
- Out of range
- Error when decoding
So you need three states accordingly. I don't mind what we call them, but I'd say the three states should be distinct.
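The three states above can be sketched as a sum type. The variant names and the toy vocabulary lookup below are ours, not the tokenizers crate's API:

```rust
// Illustrative three-state result for a decode step.
#[derive(Debug, PartialEq)]
enum StepResult {
    Chunk(String),  // in range: some decoded text
    OutOfRange,     // id is outside the tokenizer's vocabulary
    Error(String),  // decoding itself failed
}

// Toy step over a fixed vocabulary, exercising all three states.
fn toy_step(vocab: &[&str], id: usize) -> StepResult {
    match vocab.get(id) {
        None => StepResult::OutOfRange,
        Some(piece) if piece.is_empty() => StepResult::Error("empty piece".to_string()),
        Some(piece) => StepResult::Chunk((*piece).to_string()),
    }
}

fn main() {
    let vocab = ["This", " is", ""];
    assert_eq!(toy_step(&vocab, 0), StepResult::Chunk("This".to_string()));
    assert_eq!(toy_step(&vocab, 99), StepResult::OutOfRange);
    assert_eq!(toy_step(&vocab, 2), StepResult::Error("empty piece".to_string()));
    println!("ok");
}
```

On the Elixir side these would map naturally to `{:ok, chunk}`, `:out_of_range`, and `{:error, reason}`.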
Switched to :out_of_range
```elixir
case Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id) do
  {:ok, result} -> {:ok, result}
  {:error, reason} -> {:error, reason}
end
```
Suggested change:

```elixir
Tokenizers.Native.decoder_stream_step(decode_stream, tokenizer, id)
```
Is it actually efficient to go step after step, passing specific indexes? Or does the upstream code work better by calling something like `.next`? Is this something we should benchmark first?
In this PR I translated the API as is. Some abstraction could definitely be written on top of it using `Stream`, but that seems like a separate effort.
Decoding large chunks this way looks better for the BEAM, because control is returned from the NIF between steps. It would be possible to write the decoding without a dirty scheduler, but I'm not sure anyone will invest in it.
josevalim
left a comment
Thank you, I have dropped some comments. :)
Co-authored-by: José Valim <jose.valim@gmail.com>
@jonatanklosko Can you review please?
```rust
let tk = tokenizer.resource.0.clone();
let mut ds = decode_stream.resource.inner.write().unwrap();
```
If each step involves cloning, it may result in bad performance, even compared to decoding everything at once.
This may actually be a good reason to make the public API a more high-level stream. For example, `DecodeStream.new(...)` could return a `%DecodeStream{}` that implements the `Enumerable` protocol, where the step NIF is called internally and we don't need to clone multiple times, since intermediate resources are not exposed to the user.
Thoughts?
Gah, the ids need to be an input, so it's not so much `Enumerable`, but more like `Stream.transform` on top of a stream of ids. But I can see such an API being cumbersome when the user doesn't want to block :<
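The `Stream.transform`-over-ids shape has a close analogue in Rust's iterator adapters: thread the mutable decode state through `scan` and drop the "nothing yet" steps. `toy_step` below is a stand-in for the real per-id NIF call, purely for illustration:

```rust
// Sketch: lazily mapping a stream of ids to a stream of decoded chunks,
// keeping the stateful "decode stream" hidden inside the adapter.
// Toy rule: even ids yield a chunk, odd ids yield nothing yet.
fn toy_step(state: &mut Vec<usize>, id: usize) -> Option<String> {
    state.push(id); // the per-stream state mutates on every step
    if id % 2 == 0 {
        Some(format!("tok{}", id))
    } else {
        None // nothing printable for this id yet
    }
}

fn main() {
    let ids: Vec<usize> = vec![1, 2, 3, 4];
    let chunks: Vec<String> = ids
        .into_iter()
        // `scan` owns the state and keeps iterating even when a step yields None...
        .scan(Vec::new(), |state, id| Some(toy_step(state, id)))
        // ...and `flatten` drops the "nothing yet" steps.
        .flatten()
        .collect();
    assert_eq!(chunks, vec!["tok2".to_string(), "tok4".to_string()]);
    println!("ok");
}
```

An Elixir wrapper would play the same role with `Stream.transform`, with the decode-stream resource as the accumulator.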
@Virviil Actually, does ds.step mutate the tokenizer itself? Maybe there's no need for that particular clone in the first place.
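If `step` only needs read access to the tokenizer, one way to avoid a deep clone per step is to share it behind an `Arc`, whose `clone` just bumps a reference count. A sketch with toy stand-in types, not the actual NIF resources:

```rust
use std::sync::Arc;

// Toy stand-ins for the NIF resources.
struct Tokenizer {
    vocab: Vec<String>,
}

struct DecodeStream {
    tokenizer: Arc<Tokenizer>, // shared, not deep-cloned
    read_ids: Vec<usize>,
}

impl DecodeStream {
    fn step(&mut self, id: usize) -> Option<String> {
        self.read_ids.push(id); // the stream state mutates...
        self.tokenizer.vocab.get(id).cloned() // ...while the tokenizer is only read
    }
}

fn main() {
    let tokenizer = Arc::new(Tokenizer {
        vocab: vec!["This".to_string(), " is".to_string()],
    });
    // Arc::clone copies a pointer and bumps a counter; the vocab is not copied.
    let mut ds = DecodeStream {
        tokenizer: Arc::clone(&tokenizer),
        read_ids: Vec::new(),
    };
    assert_eq!(ds.step(0), Some("This".to_string()));
    assert_eq!(ds.step(1), Some(" is".to_string()));
    assert_eq!(Arc::strong_count(&tokenizer), 2);
    println!("ok");
}
```

Whether this applies directly depends on whether `ds.step` really leaves the tokenizer untouched, which is exactly the question above.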
@jonatanklosko It was effectively possible to remove cloning.
jonatanklosko
left a comment
> It was effectively possible to remove cloning.
@Virviil great!
Bump the version of tokenizers.
Brings new functionality. Mostly a port of https://github.com/huggingface/tokenizers/pull/1678/files.