
Commit ec922b4

Merge pull request #171 from 123malin/readme

word2vec readme

2 parents c9c2a69 + 34eddbc

File tree

12 files changed: +612 −29 lines

core/trainers/framework/dataset.py

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,8 @@ def get_dataloader(self, context, dataset_name, dataloader):
             reader_ins = SlotReader(context["config_yaml"])
         if hasattr(reader_ins, 'generate_batch_from_trainfiles'):
             dataloader.set_sample_list_generator(reader)
+        elif hasattr(reader_ins, 'batch_tensor_creator'):
+            dataloader.set_batch_generator(reader)
         else:
             dataloader.set_sample_generator(reader, batch_size)
         return dataloader
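
For context, Paddle's DataLoader accepts readers at three granularities, all visible in this diff: set_sample_generator (one sample per yield, batched by the loader), set_sample_list_generator (a list of samples per yield), and set_batch_generator (a fully assembled batch per yield). The new branch routes readers that expose batch_tensor_creator to the last of these. A minimal sketch of the distinction in plain numpy (the toy generators below are illustrative, not from this repo):

```python
import numpy as np

def sample_generator():
    # one sample per yield; set_sample_generator batches these itself
    for i in range(100):
        yield [np.array([i]), np.array([i + 1])]

def batch_generator(batch_size=4):
    # one fully assembled batch per yield, the kind of output a
    # batch_tensor_creator-style reader hands to set_batch_generator
    batch = []
    for sample in sample_generator():
        batch.append(sample)
        if len(batch) == batch_size:
            # stack per-slot columns into batched arrays
            yield [np.stack(col) for col in zip(*batch)]
            batch = []
```
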

core/utils/dataloader_instance.py

Lines changed: 4 additions & 0 deletions
@@ -83,6 +83,10 @@ def gen_batch_reader():

    if hasattr(reader, 'generate_batch_from_trainfiles'):
        return gen_batch_reader()
+
+    if hasattr(reader, "batch_tensor_creator"):
+        return reader.batch_tensor_creator(gen_reader)
+
    return gen_reader

doc/imgs/w2v_train.png

223 KB

models/recall/word2vec/README.md

Lines changed: 248 additions & 0 deletions
@@ -0,0 +1,248 @@
# Skip-Gram W2V

Below is a brief directory layout and description for this example:

```
├── data                    # sample data
    ├── train
        ├── convert_sample.txt
    ├── test
        ├── sample.txt
    ├── dict
        ├── word_count_dict.txt
        ├── word_id_dict.txt
├── preprocess.py           # data preprocessing script
├── __init__.py
├── README.md               # documentation
├── model.py                # model definition
├── config.yaml             # configuration file
├── data_prepare.sh         # one-click data preparation script
├── w2v_reader.py           # training data reader
├── w2v_evaluate_reader.py  # inference data reader
├── infer.py                # custom inference script
├── utils.py                # reader and other utilities used by custom inference
```

Note: before reading this example, we recommend first going through the following:

[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)


---
## Contents

- [Model overview](#model-overview)
- [Data preparation](#data-preparation)
- [Runtime environment](#runtime-environment)
- [Quick start](#quick-start)
- [Reproducing the paper](#reproducing-the-paper)
- [Advanced usage](#advanced-usage)
- [FAQ](#faq)

## Model overview
This example implements the skip-gram variant of the word2vec model, as shown below:
<p align="center">
<img align="center" src="../../../doc/imgs/word2vec.png">
</p>

Each word in turn serves as a center word X, and the neighboring words Y inside the window are paired with it to form training samples (X, Y). During training, negative samples are additionally drawn according to a configurable negative sampling rate to strengthen the training signal; a minimal sketch of this pair generation follows the figure below.
The overall training scheme is illustrated here:
<p align="center">
<img align="center" src="../../../doc/imgs/w2v_train.png">
</p>

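As a concrete illustration of the sampling described above, here is a minimal sketch of skip-gram pair generation with uniform random negative sampling. It is illustrative only; the model's actual reader lives in w2v_reader.py, and real implementations typically draw negatives from a unigram^0.75 distribution rather than uniformly:

```python
import random

def skip_gram_pairs(sentence, window=5, vocab_size=1000, neg_num=5):
    """Yield (center, context, label) triples from a list of word ids:
    label 1 for a true window pair, 0 for a sampled negative."""
    for i, center in enumerate(sentence):
        # draw an effective window size in [1, window], as in the paper
        w = random.randint(1, window)
        for j in range(max(0, i - w), min(len(sentence), i + w + 1)):
            if j == i:
                continue
            yield center, sentence[j], 1            # positive pair
            for _ in range(neg_num):                # negatives for this pair
                yield center, random.randrange(vocab_size), 0
```
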
For more detail, see the [IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/124377) tutorial.

This model configuration uses the demo dataset by default; to verify accuracy, see the [Reproducing the paper](#reproducing-the-paper) section.

This project supports:

Training: single-machine CPU, locally simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).

Inference: single-machine CPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).

## Data preparation
To keep them distinct from the sample data paths, the full training data, test data, and dictionary files are saved under data/all_train, data/all_test, and data/all_dict respectively.
```
mkdir -p data/all_dict
mkdir -p data/all_train
mkdir -p data/all_test
```
Preparing the full dataset takes three steps:
- Step 1: download the data.
```
# full training set
mkdir raw_data
wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar
tar xvf 1-billion-word-language-modeling-benchmark-r13output.tar
mv 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/ raw_data/

# test set
wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/test_dir.tar
tar xzvf test_dir.tar -C raw_data
mv raw_data/data/test_dir/* data/all_test/
```

- Step 2: preprocess the training data. This has three sub-steps. First, build the dictionary from the English corpus; for a Chinese corpus, you can customize the handling by modifying the text_strip method.
```
python preprocess.py --build_dict --build_dict_corpus_dir raw_data/training-monolingual.tokenized.shuffled --dict_path raw_data/word_count_dict.txt
```
The resulting dictionary format is word<space>frequency, with low-frequency words folded into '<UNK>', for example:
```
the 1061396
of 593677
and 416629
one 411764
in 372201
a 325873
<UNK> 324608
to 316376
zero 264975
nine 250430
```
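Conceptually, --build_dict amounts to a word count with rare words folded into a single <UNK> entry. A rough sketch of the idea (the min_count threshold mirrors the --min_count 5 flag used below; preprocess.py's actual implementation may differ in detail):

```python
from collections import Counter

def build_dict(lines, min_count=5):
    """Count words across lines; fold words seen fewer than
    min_count times into a single '<UNK>' entry."""
    counts = Counter(w for line in lines for w in line.split())
    result, unk = {}, 0
    for word, c in counts.items():
        if c >= min_count:
            result[word] = c
        else:
            unk += c
    result['<UNK>'] = unk
    return result
```
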
Second, convert the text to ids according to the dictionary, downsampling frequent words probabilistically along the way, and generate the word-to-id mapping file, whose name is the dictionary path plus the suffix "_word_to_id_".
```
python preprocess.py --filter_corpus --dict_path raw_data/word_count_dict.txt --input_corpus_dir raw_data/training-monolingual.tokenized.shuffled --output_corpus_dir raw_data/convert_text8 --min_count 5 --downsample 0.001
```
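The --downsample flag controls the subsampling of frequent words from the original skip-gram paper: a word with corpus frequency f is kept with probability roughly sqrt(t / f) for threshold t. A hedged sketch (t = 0.001 matches the flag above; the exact formula in preprocess.py may differ slightly):

```python
import math
import random

def keep_word(word_count, total_count, t=0.001):
    """Subsample frequent words: keep a word with probability
    min(1, sqrt(t / f)), where f is its corpus frequency."""
    f = word_count / total_count
    return random.random() < min(1.0, math.sqrt(t / f))
```
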
Third, to make better use of multithreading for training speed-up, split the training corpus into multiple shard files, 1024 by default.
```
python preprocess.py --data_resplit --input_corpus_dir=raw_data/convert_text8 --output_corpus_dir=data/all_train
```
- Step 3: tidy up the paths.
```
mv raw_data/word_count_dict.txt data/all_dict/
mv raw_data/word_count_dict.txt_word_to_id_ data/all_dict/word_id_dict.txt
rm -rf raw_data
```
For convenience, we also provide a one-click data preparation script:
```
sh data_prepare.sh
```

## Runtime environment

PaddlePaddle >= 1.7.2

python 2.7/3.5/3.6/3.7

PaddleRec >= 0.1

OS: windows/linux/macos

## Quick start

### Single-machine training

CPU environment

Set the device, number of epochs, and so on in config.yaml:

```
# select runner by name
mode: [single_cpu_train, single_cpu_infer]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
  class: train
  # num of epochs
  epochs: 5
  # device to run training or infer
  device: cpu
  save_checkpoint_interval: 1 # save model interval of epochs
  save_inference_interval: 1 # save inference
  save_checkpoint_path: "increment_w2v" # save checkpoint path
  save_inference_path: "inference_w2v" # save inference path
  save_inference_feed_varnames: [] # feed vars of save inference
  save_inference_fetch_varnames: [] # fetch vars of save inference
  init_model_path: "" # load model path
  print_interval: 1
  phases: [phase1]
```
### Single-machine inference
We evaluate the trained word2vec model with a word analogy task: given four words A, B, C, D, assume some relation holds such that relation(A, B) = relation(C, D); D is then predicted from A, B, C via emb(D) = emb(B) - emb(A) + emb(C). A numpy sketch of this lookup follows.

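The analogy lookup amounts to a nearest-neighbor search around emb(B) - emb(A) + emb(C). A minimal numpy sketch, assuming an L2-normalized embedding matrix (variable names here are illustrative; the real logic lives in infer.py and utils.py):

```python
import numpy as np

def predict_d(emb, a, b, c):
    """emb: (vocab, dim) L2-normalized embeddings; a, b, c: word ids.
    Returns the id of the best candidate D, skipping the inputs
    just as the custom inference script does."""
    target = emb[b] - emb[a] + emb[c]
    scores = emb @ (target / np.linalg.norm(target))  # cosine similarity
    for w in (a, b, c):
        scores[w] = -np.inf                           # never predict an input
    return int(np.argmax(scores))
```
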
CPU environment

PaddleRec inference configuration:

Set epochs, device, and the other parameters in config.yaml:

```
- name: single_cpu_infer
  class: infer
  # device to run training or infer
  device: cpu
  init_model_path: "increment_w2v" # load model path
  print_interval: 1
  phases: [phase2]
```

To reproduce the paper's results, we provide a custom inference script that skips predictions returning one of the inputs A, B, C and then computes prediction accuracy. Run it as follows:
```
python infer.py --test_dir ./data/test --dict_path ./data/dict/word_id_dict.txt --batch_size 20000 --model_dir ./increment_w2v/ --start_index 0 --last_index 5 --emb_size 300
```

### Run
```
python -m paddlerec.run -m paddlerec.models.recall.word2vec
```

### Results

Training output on the sample data:

```
Running SingleStartup.
Running SingleRunner.
W0813 11:36:16.129736 43843 build_strategy.cc:170] fusion_group is not enabled for Windows/MacOS now, and only effective when running with CUDA GPU.
batch: 1, LOSS: [3.618 3.684 3.698 3.653 3.736]
batch: 2, LOSS: [3.394 3.453 3.605 3.487 3.553]
batch: 3, LOSS: [3.411 3.402 3.444 3.387 3.357]
batch: 4, LOSS: [3.557 3.196 3.304 3.209 3.299]
batch: 5, LOSS: [3.217 3.141 3.168 3.114 3.315]
batch: 6, LOSS: [3.342 3.219 3.124 3.207 3.282]
batch: 7, LOSS: [3.19 3.207 3.136 3.322 3.164]
epoch 0 done, use time: 0.119026899338, global metrics: LOSS=[3.19 3.207 3.136 3.322 3.164]
...
epoch 4 done, use time: 0.097608089447, global metrics: LOSS=[2.734 2.66 2.763 2.804 2.809]
```
Inference output on the sample data:
```
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment_w2v/4
batch: 1, acc: [1.]
batch: 2, acc: [1.]
batch: 3, acc: [1.]
Infer phase2 of epoch 4 done, use time: 4.89376211166, global metrics: acc=[1.]
...
Infer phase2 of epoch 3 done, use time: 4.43099021912, global metrics: acc=[1.]
```

## Reproducing the paper

1. To reproduce the paper's results on the full original dataset, adjust the hyperparameters in config.yaml:

    - name: dataset_train
      batch_size: 100 # 1. set batch_size to 100
      type: DataLoader
      data_path: "{workspace}/data/all_train" # 2. point at the full training data
      word_count_dict_path: "{workspace}/data/all_dict/word_count_dict.txt" # 3. point at the full dictionary
      data_converter: "{workspace}/w2v_reader.py"

    - name: single_cpu_train
      epochs: 5 # 4. set the runner's epochs to 5

After these edits, set 'workspace' in config.yaml to the directory containing config.yaml, then run:
```
python -m paddlerec.run -m /home/your/dir/config.yaml # debug mode: point directly at the absolute path of the local config
```

2. Run the custom inference script on the full test set:
```
python infer.py --test_dir ./data/all_test --dict_path ./data/all_dict/word_id_dict.txt --batch_size 20000 --model_dir ./increment_w2v/ --start_index 0 --last_index 5 --emb_size 300
```

Conclusion: training for 5 epochs on CPU yields a custom inference accuracy of 0.540, with each epoch taking about 7 hours.
## Advanced usage

## FAQ

models/recall/word2vec/config.yaml

Lines changed: 15 additions & 13 deletions
@@ -22,7 +22,7 @@ dataset:
   word_count_dict_path: "{workspace}/data/dict/word_count_dict.txt"
   data_converter: "{workspace}/w2v_reader.py"
 - name: dataset_infer # name
-  batch_size: 50
+  batch_size: 2000
   type: DataLoader # or QueueDataset
   data_path: "{workspace}/data/test"
   word_id_dict_path: "{workspace}/data/dict/word_id_dict.txt"
@@ -42,38 +42,40 @@ hyper_parameters:
   window_size: 5

 # select runner by name
-mode: train_runner
+mode: [single_cpu_train, single_cpu_infer]
 # config of each runner.
 # runner is a kind of paddle training class, which wraps the train/infer process.
 runner:
-- name: train_runner
+- name: single_cpu_train
   class: train
   # num of epochs
-  epochs: 2
+  epochs: 5
   # device to run training or infer
   device: cpu
   save_checkpoint_interval: 1 # save model interval of epochs
   save_inference_interval: 1 # save inference
-  save_checkpoint_path: "increment" # save checkpoint path
-  save_inference_path: "inference" # save inference path
+  save_checkpoint_path: "increment_w2v" # save checkpoint path
+  save_inference_path: "inference_w2v" # save inference path
   save_inference_feed_varnames: [] # feed vars of save inference
   save_inference_fetch_varnames: [] # fetch vars of save inference
   init_model_path: "" # load model path
-  print_interval: 1
-- name: infer_runner
+  print_interval: 1000
+  phases: [phase1]
+- name: single_cpu_infer
   class: infer
   # device to run training or infer
   device: cpu
-  init_model_path: "increment/0" # load model path
+  init_model_path: "increment_w2v" # load model path
   print_interval: 1
+  phases: [phase2]

 # runner will run all the phase in each epoch
 phase:
 - name: phase1
   model: "{workspace}/model.py" # user-defined model
   dataset_name: dataset_train # select dataset by name
+  thread_num: 5
+- name: phase2
+  model: "{workspace}/model.py" # user-defined model
+  dataset_name: dataset_infer # select dataset by name
   thread_num: 1
-# - name: phase2
-#   model: "{workspace}/model.py" # user-defined model
-#   dataset_name: dataset_infer # select dataset by name
-#   thread_num: 1

models/recall/word2vec/data_prepare.sh

Lines changed: 11 additions & 10 deletions
@@ -14,25 +14,26 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+mkdir -p data/all_dict
+mkdir -p data/all_train
+mkdir -p data/all_test

 # download train_data
 mkdir raw_data
 wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar
 tar xvf 1-billion-word-language-modeling-benchmark-r13output.tar
 mv 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/ raw_data/

+# download test data
+wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/test_dir.tar
+tar xzvf test_dir.tar -C raw_data
+mv raw_data/data/test_dir/* data/all_test/
+
 # preprocess data
 python preprocess.py --build_dict --build_dict_corpus_dir raw_data/training-monolingual.tokenized.shuffled --dict_path raw_data/word_count_dict.txt
 python preprocess.py --filter_corpus --dict_path raw_data/word_count_dict.txt --input_corpus_dir raw_data/training-monolingual.tokenized.shuffled --output_corpus_dir raw_data/convert_text8 --min_count 5 --downsample 0.001
-mv raw_data/word_count_dict.txt data/dict/
-mv raw_data/word_id_dict.txt data/dict/
+python preprocess.py --data_resplit --input_corpus_dir=raw_data/convert_text8 --output_corpus_dir=data/all_train

-rm -rf data/train/*
-rm -rf data/test/*
-python preprocess.py --data_resplit --input_corpus_dir=raw_data/convert_text8 --output_corpus_dir=data/train
-
-# download test data
-wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/test_dir.tar
-tar xzvf test_dir.tar -C raw_data
-mv raw_data/data/test_dir/* data/test/
+mv raw_data/word_count_dict.txt data/all_dict/
+mv raw_data/word_count_dict.txt_word_to_id_ data/all_dict/word_id_dict.txt
 rm -rf raw_data
