Conversation
```rust
    group.finish();
}
```
bench a full `encode_batch` please
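As a rough, std-only sketch of what timing a full batch could look like (the repo's actual criterion harness, which the `group.finish()` above belongs to, is the proper home for this; `encode_batch_stub` is a hypothetical stand-in for the real `Tokenizer::encode_batch`):

```rust
use std::time::Instant;

// Hypothetical stand-in for the real tokenizer's encode_batch; the
// actual API lives in the tokenizers crate and is not reproduced here.
fn encode_batch_stub(inputs: &[&str]) -> usize {
    // Count whitespace-separated tokens across the whole batch.
    inputs.iter().map(|s| s.split_whitespace().count()).sum()
}

fn main() {
    let batch: Vec<&str> = vec!["some text"; 1_000];
    let start = Instant::now();
    let total = encode_batch_stub(&batch);
    println!("{} tokens in {:?}", total, start.elapsed());
}
```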
ArthurZucker
left a comment
```
---- pre_tokenizers::whitespace::tests::assert_equivalent_xnli stdout ----
thread 'pre_tokenizers::whitespace::tests::assert_equivalent_xnli' (40277) panicked at src/pre_tokenizers/whitespace.rs:270:61:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

failures:
    pre_tokenizers::whitespace::tests::assert_equivalent_xnli

test result: FAILED. 199 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.08s
```
Probably because I refactored the big text test and related bits.
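The panic comes from `unwrap()` on a missing fixture file; a minimal sketch of a softer guard (`read_fixture` and its handling are hypothetical, not the test's actual code):

```rust
use std::fs;

// Hypothetical helper: return None instead of panicking when the
// hardcoded test fixture is absent.
fn read_fixture(path: &str) -> Option<String> {
    fs::read_to_string(path).ok()
}

fn main() {
    match read_fixture("data/xnli.txt") {
        Some(text) => println!("loaded {} bytes", text.len()),
        None => eprintln!("fixture missing, skipping equivalence check"),
    }
}
```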
e478098 to
298001b
Compare
Opened https://huggingface.co/datasets/hf-internal-testing/tokenizers-test-data/discussions/2 to have the
Building on the work in #1822 and #1841, I've implemented and tested a faster whitespace split pretokenizer. Tested on xnli, no regressions observed; the results are consistent with the regex split.

I've set up a benchmark to validate this and created a correctness test as well. We can probably refactor or remove these bits, as they depend on a hardcoded path to a `data/xnli.txt` file, but I'm leaving them there in case someone wants to reproduce the results. `regex` is the current impl and `manual` is the new implementation. It works by simply iterating over individual chars and handles unicode properly.

Closes #1821, #1825
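Not the PR's actual code, but the general idea of a char-based split can be sketched like this (`manual_whitespace_split` is a hypothetical name; offsets come from `char_indices`, so multi-byte UTF-8 chars are handled correctly):

```rust
// Sketch of a manual whitespace split: walk the string char by char and
// emit (start, end) byte offsets for each run of non-whitespace chars.
fn manual_whitespace_split(s: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in s.char_indices() {
        if c.is_whitespace() {
            // Close the current word span, if any.
            if let Some(st) = start.take() {
                spans.push((st, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a trailing word that runs to the end of the string.
    if let Some(st) = start {
        spans.push((st, s.len()));
    }
    spans
}

fn main() {
    let s = "héllo  wörld";
    for (a, b) in manual_whitespace_split(s) {
        println!("{}", &s[a..b]);
    }
}
```

Because the spans are byte offsets produced by `char_indices`, slicing them back out of the input never lands inside a multi-byte character.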