
Commit 707a94d

[FastTokenizer] Update fast_tokenizer doc (#3787)

* update link
* rename demo to examples
* add python readme
* update the cmakelist
* add ernie python demo
* add status print after compiling
* Add README for ernie fast tokenizer
* Add clip fast tokenizer cpp readme
* Update docs
* Add some tips
* add deps
* add deps
* update
* update
* update
* update shell

1 parent 0ce21ea commit 707a94d

File tree

18 files changed: +301 −113 lines changed


fast_tokenizer/README.md

Lines changed: 7 additions & 2 deletions

@@ -1,3 +1,4 @@
  # ⚡ FastTokenizer: High-Performance Text Processing Library

  ------------------------------------------------------------------------------------------
@@ -19,12 +20,12 @@ FastTokenizer is an easy-to-use, powerful, cross-platform, high-performance text-preprocessing library
  - High performance. The core is implemented in C++, so it runs far faster than typical pure-Python tokenizers; on text classification tasks, FastTokenizer reaches up to a 20x speedup over the Python tokenizer. Multi-threaded batch tokenization is supported; single-threaded tokenization is the default.
  - Cross-platform. FastTokenizer runs on Windows x64, Linux x64, and MacOS 10.14+.
- - Multi-language support. FastTokenizer can be used from C++ and Python.
+ - Multi-language support. FastTokenizer can be used from [C++](./docs/cpp/README.md) and [Python](./docs/python/README.md).
  - Highly flexible. Users can assemble a customized tokenizer from different FastTokenizer components to meet their needs.

  ## Quick Start

- The following covers the Python version of FastTokenizer. For the C++ version, see the [FastTokenizer C++ Demo](./fast_tokenizer/demo/README.md).
+ The following covers the Python version of FastTokenizer. For the C++ version, see the [FastTokenizer C++ Library Tutorial](./docs/cpp/README.md).

  ### Environment requirements

@@ -128,3 +129,7 @@ A: You can call `fast_tokenizer.set_thread_num(xxx)` to use multiple threads for
  ## Related Documentation

  [FastTokenizer Compilation Guide](docs/compile/README.md)
+
+ [FastTokenizer C++ Library Tutorial](./docs/cpp/README.md)
+
+ [FastTokenizer Python Library Tutorial](./docs/python/README.md)

fast_tokenizer/docs/cpp/README.md

Lines changed: 66 additions & 0 deletions

# FastTokenizer C++ Library Tutorial

## 1. Quick Installation

The current FastTokenizer C++ library supports several operating systems and hardware platforms, and ships prebuilt packages for the following:

|System|Download|
|---|---|
|Linux-x64| [fast_tokenizer-linux-x64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz) |
|Linux-aarch64| [fast_tokenizer-linux-aarch64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-aarch64-1.0.0.tgz) |
|Windows| [fast_tokenizer-win-x64-1.0.0.zip](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-win-x64-1.0.0.zip) |
|MacOS-x64| [fast_tokenizer-osx-x86_64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-x86_64-1.0.0.tgz) |
|MacOS-arm64| [fast_tokenizer-osx-arm64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-arm64-1.0.0.tgz) |

### Environment requirements

#### Supported systems
|System|Version|
|---|---|
|Linux|Ubuntu 16.04+, CentOS 7+|
|Windows|10|
|MacOS|11.4+|

#### Linux/Mac build requirements
|Dependency|Version|
|---|---|
|cmake|>=3.16|
|gcc|>=8.2.0|

#### Windows build requirements
|Dependency|Version|
|---|---|
|cmake|>=3.16|
|VisualStudio|2019|

### Download and extract

```shell
wget -c https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz

tar xvfz fast_tokenizer-linux-x64-1.0.0.tgz
# Extracts into a fast_tokenizer directory
```

The extracted fast_tokenizer directory is laid out as follows:

```shell
fast_tokenizer
|__ commit.log          # Commit id of the build
|__ FastTokenizer.cmake # FastTokenizer CMake file defining the header and shared-library directory variables
|__ include             # FastTokenizer header directory
|__ lib                 # FastTokenizer shared-library directory
|__ third_party         # Third-party dependencies of FastTokenizer
```

We recommend pulling FastTokenizer in via CMake: a single `include(FastTokenizer.cmake)` line exposes the predefined CMake variables `FAST_TOKENIZER_INCS` and `FAST_TOKENIZER_LIBS`, which point to FastTokenizer's header directory and shared-library directory, respectively.

## 2. Quick Start

FastTokenizer currently provides the following C++ examples:

[ErnieFastTokenizer C++ example](../../examples/ernie/)
[ClipFastTokenizer C++ example](../../examples/clip/)
fast_tokenizer/docs/python/README.md

Lines changed: 16 additions & 0 deletions

# FastTokenizer Python Library Tutorial

## 1. Quick Installation

### Environment requirements

- Windows 64-bit
- Linux x64
- MacOS 10.14+ (on M1-chip MacOS, an x86_64 build of Anaconda must be used as the Python environment)
- Python 3.6 ~ 3.10

### Installation

```shell
pip install --upgrade fast_tokenizer
```

fast_tokenizer/examples/clip/README.md

Whitespace-only changes.
Lines changed: 2 additions & 15 deletions

@@ -1,17 +1,7 @@
  cmake_minimum_required(VERSION 3.10)
  project(cpp_fast_tokenizer_demo CXX C)
-
  option(FAST_TOKENIZER_INSTALL_DIR "Path of downloaded fast_tokenizer sdk.")

- # Download ernie vocab for demo
- set(ERNIE_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/ernie_vocab.txt)
- if (EXISTS ${ERNIE_VOCAB_PATH})
-   message(STATUS "The ${ERNIE_VOCAB_PATH} exists already.")
- else()
-   file(DOWNLOAD "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt" ${ERNIE_VOCAB_PATH} SHOW_PROGRESS)
-   message(STATUS "Already download the vocab.txt of ernie to ${CMAKE_CURRENT_BINARY_DIR} for demo.")
- endif()
-
  # Download clip vocab and merge files
  set(CLIP_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_vocab.json)
  set(CLIP_MERGES_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_merges.txt)
@@ -32,10 +22,7 @@ endif()

  # Get FAST_TOKENIZER_INCS and FAST_TOKENIZER_LIBS
  include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake)
-
  include_directories(${FAST_TOKENIZER_INCS})

- add_executable(ernie_fast_tokenizer_demo ${PROJECT_SOURCE_DIR}/ernie_fast_tokenizer_demo.cc)
- add_executable(clip_fast_tokenizer_demo ${PROJECT_SOURCE_DIR}/clip_fast_tokenizer_demo.cc)
- target_link_libraries(ernie_fast_tokenizer_demo ${FAST_TOKENIZER_LIBS})
- target_link_libraries(clip_fast_tokenizer_demo ${FAST_TOKENIZER_LIBS})
+ add_executable(demo ${PROJECT_SOURCE_DIR}/demo.cc)
+ target_link_libraries(demo ${FAST_TOKENIZER_LIBS})
Lines changed: 99 additions & 0 deletions

# ClipFastTokenizer C++ Example

## 1. Quick Installation

The current FastTokenizer C++ library supports several operating systems and hardware platforms; choose the prebuilt package that matches your environment:

|System|Download|
|---|---|
|Linux-x64| [fast_tokenizer-linux-x64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz) |
|Linux-aarch64| [fast_tokenizer-linux-aarch64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-aarch64-1.0.0.tgz) |
|Windows| [fast_tokenizer-win-x64-1.0.0.zip](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-win-x64-1.0.0.zip) |
|MacOS-x64| [fast_tokenizer-osx-x86_64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-x86_64-1.0.0.tgz) |
|MacOS-arm64| [fast_tokenizer-osx-arm64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-arm64-1.0.0.tgz) |

### Environment requirements

#### Supported systems
|System|Version|
|---|---|
|Linux|Ubuntu 16.04+, CentOS 7+|
|Windows|10|
|MacOS|11.4+|

#### Linux/Mac build requirements
|Dependency|Version|
|---|---|
|cmake|>=3.16|
|gcc|>=8.2.0|

#### Windows build requirements
|Dependency|Version|
|---|---|
|cmake|>=3.16|
|VisualStudio|2019|

## 2. Quick Start

The following uses the Linux platform as an example to show how to build and run this demo with the FastTokenizer C++ prebuilt package. The build produces an executable named `demo`.

### 2.1 Download and extract

```shell
wget -c https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz

tar xvfz fast_tokenizer-linux-x64-1.0.0.tgz
# Extracts into a fast_tokenizer directory
```

The extracted fast_tokenizer directory is laid out as follows:

```shell
fast_tokenizer
|__ commit.log          # Commit id of the build
|__ FastTokenizer.cmake # FastTokenizer CMake file defining the header and shared-library directory variables
|__ include             # FastTokenizer header directory
|__ lib                 # FastTokenizer shared-library directory
|__ third_party         # Third-party dependencies of FastTokenizer
```

We recommend pulling FastTokenizer in via CMake: a single `include(FastTokenizer.cmake)` line exposes the predefined CMake variables `FAST_TOKENIZER_INCS` and `FAST_TOKENIZER_LIBS`, which point to FastTokenizer's header directory and shared-library directory, respectively.

### 2.2 Build

The example ships a simple CMakeLists.txt; you only need to point it at the fast_tokenizer package path to build:

```shell
# Create the build directory
mkdir build
cd build

# Run cmake, pointing it at the fast_tokenizer package, to generate the Makefile
cmake .. -DFAST_TOKENIZER_INSTALL_DIR=/path/to/fast_tokenizer

# Build
make
```

### 2.3 Run

```shell
./demo
```

### 2.4 Sample output

The output contains the original input text and the resulting token id sequence (including padding).

```shell
text = "a photo of an astronaut riding a horse on mars"
ids = [49406, 320, 1125, 539, 550, 18376, 6765, 320, 4558, 525, 7496, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407]
```
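The trailing run of 49407s in the sample output is the padding the demo enables via `EnablePadMethod`: ids are padded on the right with the pad token id up to `max_length` (77). Judging from the output, CLIP's end-of-text id 49407 also serves as the pad id. A minimal Python sketch of that padding step (illustrative only, not FastTokenizer's implementation):

```python
# Illustrative right-padding, as applied by the demo's pad configuration
# (assumption: pad id 49407 is CLIP's end-of-text token, per the sample output).
def pad_right(ids, max_length, pad_id):
    """Pad ids on the right with pad_id until the sequence reaches max_length."""
    if len(ids) >= max_length:
        return ids[:max_length]  # truncate if already at or over max_length
    return ids + [pad_id] * (max_length - len(ids))

# Unpadded ids for the sample text (start token 49406, end token 49407).
ids = [49406, 320, 1125, 539, 550, 18376, 6765, 320, 4558, 525, 7496, 49407]
padded = pad_right(ids, max_length=77, pad_id=49407)
print(len(padded))         # 77
print(padded[:12] == ids)  # True
```

With `pad_to_max_length` disabled in the demo, only the 12 unpadded ids would be printed.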

fast_tokenizer/fast_tokenizer/demo/clip_fast_tokenizer_demo.cc renamed to fast_tokenizer/examples/clip/cpp/demo.cc

Lines changed: 13 additions & 9 deletions

@@ -12,9 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. */

- #include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h"
  #include <iostream>
  #include <vector>
+ #include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h"
  using namespace paddlenlp;

  template <typename T>
@@ -35,10 +35,10 @@ fast_tokenizer::tokenizers_impl::ClipFastTokenizer CreateClipFastTokenizer(
      const std::string& vocab_path,
      const std::string& merge_path,
      uint32_t max_length,
-     bool pad = true) {
+     bool pad_to_max_length = true) {
    fast_tokenizer::tokenizers_impl::ClipFastTokenizer tokenizer(
        vocab_path, merge_path, max_length);
-   if (pad) {
+   if (pad_to_max_length) {
      tokenizer.EnablePadMethod(fast_tokenizer::core::RIGHT,
                                tokenizer.GetPadTokenId(),
                                0,
@@ -51,16 +51,20 @@ fast_tokenizer::tokenizers_impl::ClipFastTokenizer CreateClipFastTokenizer(

  int main() {
    // 1. Define a clip fast tokenizer
-   auto tokenizer =
-       CreateClipFastTokenizer("clip_vocab.json", "clip_merges.txt", 77, true);
+   auto tokenizer = CreateClipFastTokenizer("clip_vocab.json",
+                                            "clip_merges.txt",
+                                            /* max_length = */ 77,
+                                            /* pad_to_max_length = */ true);
    // 2. Tokenize the input strings
    std::vector<fast_tokenizer::core::Encoding> encodings;
    std::vector<std::string> texts = {
        "a photo of an astronaut riding a horse on mars"};
    tokenizer.EncodeBatchStrings(texts, &encodings);
-   for (auto&& encoding : encodings) {
-     auto ids = encoding.GetIds();
-     std::cout << ids << std::endl;
+
+   for (size_t i = 0; i < texts.size(); ++i) {
+     std::cout << "text = \"" << texts[i] << "\"" << std::endl;
+     std::cout << "ids = " << encodings[i].GetIds() << std::endl;
    }
+
    return 0;
  }

fast_tokenizer/examples/clip/python/README.md

Whitespace-only changes.
Lines changed: 13 additions & 0 deletions

# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 20 additions & 0 deletions

# ErnieFastTokenizer Tokenization Example

The FastTokenizer library exposes the ErnieFastTokenizer interface in both C++ and Python; users only need to pass in the model's vocabulary to call it and tokenize efficiently. Under the hood the interface tokenizes with the `WordPiece` algorithm. For `WordPiece`, FastTokenizer implements the `MinMaxMatch`-based `FastWordPiece` algorithm proposed in "Fast WordPiece Tokenization". The original `WordPiece` algorithm is quadratic in sequence length, so tokenizing long texts is expensive; `FastWordPiece` uses the `Aho–Corasick` automaton to bring the time complexity down to linear in sequence length, greatly improving tokenization throughput. Besides the ERNIE models, `ErnieFastTokenizer` also supports other models that tokenize with `WordPiece`, such as `BERT` and `TinyBERT`; the full model list is below:
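To make the complexity claim above concrete, here is a minimal pure-Python sketch of classic greedy longest-match-first WordPiece (illustrative only; FastTokenizer's actual FastWordPiece is a C++ Aho–Corasick implementation). Each emitted piece may scan up to O(n) candidate substrings of a length-n word, which is where the quadratic worst case of classic WordPiece comes from:

```python
# Classic greedy longest-match-first WordPiece (illustrative sketch).
# Continuation pieces carry the conventional "##" prefix.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:  # try the longest remaining substring first
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate by one character
        if cur is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "aff"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```

FastWordPiece avoids the repeated substring rescans by matching all vocabulary pieces in one left-to-right pass over the word with an Aho–Corasick automaton.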
## Supported models

- ERNIE
- BERT
- TinyBERT
- ERNIE Gram
- ERNIE ViL

## Detailed tokenization examples

[C++ tokenization example](./cpp)
[Python tokenization example](./python)

## References

- Xinying Song, Alex Salcianu et al. "Fast WordPiece Tokenization", EMNLP, 2021
