# PaddlePaddle FasterTokenizer Performance Benchmark
In PaddleNLP v2.2.0, PaddleNLP released a high-performance tokenizer for Transformer-class models, PaddlePaddle FasterTokenizer. To verify its speed, PaddleNLP benchmarked it against common tokenizers in the community, primarily the HuggingFace BertTokenizer and the TensorFlow Text BertTokenizer. Using the bert-base-chinese model, we ran tokenization throughput experiments on Chinese data. The experiment setup is as follows:
* [HuggingFace Tokenizers (Python)](https://github.com/huggingface/tokenizers):
``` python
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)
```
* [HuggingFace Tokenizers (Rust)](https://github.com/huggingface/tokenizers):
``` python
from transformers import AutoTokenizer

hf_fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)
```
* transformers == 4.11.3

* tokenizers == 0.10.3

* tensorflow_text == 2.5.0
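The pinned dependency versions above can be installed in one step; this is a suggested command assuming a standard pip environment, not a setup script from the original repository:

``` shell
# Install the exact versions used in the benchmark comparison
pip install transformers==4.11.3 tokenizers==0.10.3 tensorflow_text==2.5.0
```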
<center><img width="1343" alt="benchmark chart" src="https://user-images.githubusercontent.com/16698950/145664356-0b766d5a-9ff1-455a-bb85-1ee51e2ad77d.png"></center>
The comparison between PaddlePaddle FasterTokenizer and the other frameworks measures tokenization throughput at a fixed text length across different batch sizes. The y-axis is logarithmic, in units of 10k tokens/second. As the batch size grows, FasterTokenizer pulls far ahead of the other implementations; on large batches in particular, the PaddlePaddle framework fully exploits multi-core machines and achieves the leading result.
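A tokens-per-second throughput number like the one plotted above can be measured with a simple timing harness. The sketch below is an illustration, not the actual `perf.py` from the repository; it uses a hypothetical whitespace-splitting stub in place of a real tokenizer, so any of the tokenizers set up earlier could be dropped in via the `tokenize` argument:

``` python
import time

def whitespace_tokenize(text):
    # Stand-in stub for a real tokenizer (e.g. an AutoTokenizer instance).
    return text.split()

def measure_throughput(tokenize, texts, batch_size, repeats=5):
    """Return tokenization throughput in 10k tokens/second,
    the unit used on the chart's y-axis."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    total_tokens = sum(len(tokenize(t)) for t in texts)
    start = time.perf_counter()
    for _ in range(repeats):
        for batch in batches:
            for text in batch:
                tokenize(text)
    elapsed = time.perf_counter() - start
    return (total_tokens * repeats) / elapsed / 10_000

texts = ["this is a short benchmark sentence"] * 1000
print(f"{measure_throughput(whitespace_tokenize, texts, batch_size=64):.1f}")
```

Measuring across several `batch_size` values (e.g. 1, 8, 64, 256) reproduces the shape of the comparison: batched workloads are where multi-core tokenizers gain the most.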