HuggingFaceTokenizer took 90ms to process 10^5 length text #2846
Closed
Rfank2021 started this conversation in Development
Replies: 4 comments · 5 replies
- What's your expectation? What are you comparing against? Do you have benchmarks for both the Python and DJL implementations?
  2 replies
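As a point of reference for that benchmark request, here is a minimal Java sketch of how one might time DJL's `HuggingFaceTokenizer` on a 10^5-character input. The `roberta-base` tokenizer name, the synthetic text, and the warm-up loop are illustrative assumptions, not taken from this thread.

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class TokenizerBenchmark {

    public static void main(String[] args) throws Exception {
        // Build a ~10^5-character input, similar to the case reported in the title.
        String text = "hello world ".repeat(10_000).substring(0, 100_000);

        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("roberta-base")) {
            // Warm up so JIT compilation and tokenizer download do not dominate the measurement.
            for (int i = 0; i < 5; i++) {
                tokenizer.encode(text);
            }

            long start = System.nanoTime();
            Encoding encoding = tokenizer.encode(text);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("tokens: " + encoding.getIds().length + ", time: " + elapsedMs + " ms");
        }
    }
}
```

A comparable script against the Python `tokenizers` package on the same text would make the two implementations directly comparable.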
- I don't know about the algorithm, but why not stop once 256 tokens have already been produced?
  2 replies
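As a sketch of that suggestion: DJL's tokenizer builder exposes truncation settings, so assuming `optTruncation` and `optMaxLength` behave as their names suggest, capping the output at 256 tokens would look roughly like this (the model name is illustrative). Whether this also avoids the cost of scanning the full input depends on the underlying tokenizer implementation.

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class TruncationExample {

    public static void main(String[] args) throws Exception {
        String longText = "hello world ".repeat(10_000); // roughly 10^5 characters

        // Ask the tokenizer to keep at most 256 tokens instead of encoding the whole input.
        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
                .optTokenizerName("roberta-base") // illustrative model name
                .optTruncation(true)              // enable truncation
                .optMaxLength(256)                // upper bound on token count
                .build()) {
            Encoding encoding = tokenizer.encode(longText);
            System.out.println("tokens: " + encoding.getIds().length); // expected to be at most 256
        }
    }
}
```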
- I'm able to reproduce your issue. Here is what I found:
  0 replies
- I created a PR to address this issue: #2857
  1 reply
- It feels slow for the roberta-base model; maybe the tokenizer is better suited to batch tokenization than to single-text calls.
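If batch tokenization is indeed the better fit, a hedged sketch using DJL's `batchEncode` (assuming that method is available in the version in question, and again using `roberta-base` only as an example) might look like this:

```java
import java.util.Arrays;
import java.util.List;

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

public class BatchTokenizeExample {

    public static void main(String[] args) throws Exception {
        List<String> sentences = Arrays.asList(
                "DJL wraps the HuggingFace tokenizers library.",
                "Batching amortizes per-call overhead across many inputs.");

        try (HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("roberta-base")) {
            // Encode the whole batch in one call instead of looping over encode().
            Encoding[] encodings = tokenizer.batchEncode(sentences);
            for (Encoding encoding : encodings) {
                System.out.println(Arrays.toString(encoding.getIds()));
            }
        }
    }
}
```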