Commit 87fa30b
Add the tokenizers version in fastertokenizer (#1461)
1 parent eb5385b commit 87fa30b

File tree

1 file changed: +4 −4 lines changed

examples/faster/faster_tokenizer/README.md

@@ -1,15 +1,15 @@
 # PaddlePaddle FasterTokenizer Performance Benchmark
 
 In PaddleNLP v2.2.0, PaddleNLP released a high-performance tokenizer for Transformer-class models, PaddlePaddle FasterTokenizer for short. To verify FasterTokenizer's speed, PaddleNLP benchmarked it against several tokenizers commonly used in the field, chiefly HuggingFace BertTokenizer and TensorFlow Text BertTokenizer. Taking the bert-base-chinese model as an example, we ran tokenization performance experiments on Chinese data. The experimental setup is described below:
-* [HuggingFace Tokenizers(Python)](https://github.com/huggingface/tokenizers):
+* [HuggingFace Tokenizers(Python)](https://github.com/huggingface/tokenizers):
 
 ```python
 from transformers import AutoTokenizer
 
 hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)
 ```
 
-* [HuggingFace Tokenizers(Rust)](https://github.com/huggingface/tokenizers):
+* [HuggingFace Tokenizers(Rust)](https://github.com/huggingface/tokenizers):
 
 ```python
 from transformers import AutoTokenizer
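
The hunk cuts off inside the Rust-backed snippet. Presumably it mirrors the Python one with `use_fast=True`, which selects HuggingFace's Rust implementation; a minimal sketch under that assumption:

```python
from transformers import AutoTokenizer

# use_fast=True loads the Rust-backed tokenizer from the HuggingFace
# Tokenizers library (assumed to match the truncated snippet above).
hf_fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)
```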
@@ -45,6 +45,8 @@ faster_tokenizer = FasterTokenizer.from_pretrained("bert-base-chinese")
 
 * transformers == 4.11.3
 
+* tokenizers == 0.10.3
+
 * tensorflow_text == 2.5.0
 
 
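
The dependency list above pins exact versions (this commit adds the tokenizers pin). A quick sanity check that an environment matches those pins, using only the standard library:

```python
# Verify the pinned benchmark dependencies from the list above.
from importlib.metadata import version

for pkg, pinned in [("transformers", "4.11.3"),
                    ("tokenizers", "0.10.3"),
                    ("tensorflow-text", "2.5.0")]:
    installed = version(pkg)
    assert installed == pinned, f"{pkg}: expected {pinned}, got {installed}"
```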
@@ -69,5 +71,3 @@ python perf.py
 <center><img width="1343" alt="image" src="https://user-images.githubusercontent.com/16698950/145664356-0b766d5a-9ff1-455a-bb85-1ee51e2ad77d.png"></center>
 
 The chart compares PaddlePaddle FasterTokenizer with the other frameworks on tokenization throughput at a fixed text length across different batch sizes. The y-axis is logarithmic, in units of 10k tokens/second. As the batch size grows, FasterTokenizer pulls far ahead of the comparable implementations; on large batches in particular, the PaddlePaddle framework fully exploits multi-core machines and comes out clearly ahead.
-
-
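The perf.py driving these numbers is not shown in this diff; a minimal sketch of the measurement the paragraph describes (fixed text length, varying batch size, throughput in 10k tokens/second), hypothetical rather than the repository's script:

```python
import time

from transformers import AutoTokenizer

# Hypothetical benchmark loop, not the repository's perf.py.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)
text = "飞桨" * 64  # fixed-length Chinese sample text

for batch_size in (1, 8, 64, 256):
    batch = [text] * batch_size
    start = time.perf_counter()
    encodings = tokenizer(batch)  # tokenize the whole batch
    elapsed = time.perf_counter() - start
    n_tokens = sum(len(ids) for ids in encodings["input_ids"])
    print(f"batch_size={batch_size}: {n_tokens / elapsed / 1e4:.2f} x10k tokens/s")
```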