
Commit f6891b0

woailaosangljshou authored and committed
update Tutorial_zh_CN.md (#31)
* Update Tutorial.md
* Update Tutorial.md
* fix CPU train --> GPU test, GPU train --> CPU test. But CPU train --> multi-GPU test will report error.
* remove ujson
* remove glove for model_zoo/demo/conf.json
* modify glove for Tutorial.md
* add nltk.download('punkt') in problem.py
* Create README_zh_CN.md
* modify README
* Update README_zh_CN.md
* Update README.md
* Update README.md
* Update README_zh_CN.md
* add requirements_new.txt
* modify requirements_new.txt
* fix CPU train --> multi GPUs test.
* Add CPU/GPU table for README.md
* Add CPU/GPU table for README_zh_CN.md
* Update README.md
* add Chinese support
* add license for core/Stopwords.py
* modify conf_chinese_sentiment_analysis_bilstm.json
* add nltk.download('stopwords')
* Update problem.py
* Update README.md
* Update README_zh_CN.md
* add language supported check
* modify dataset of chinese sentiment analysis
* Update Tutorial.md
* Update Tutorial.md
* Update Tutorial.md
* Update Tutorial_zh_CN.md
1 parent 7df5da1 · commit f6891b0

File tree

1 file changed: +33 −1 lines changed


Tutorial_zh_CN.md

Lines changed: 33 additions & 1 deletion
```diff
@@ -6,6 +6,7 @@
 * [Quick Start](#quick-start)
 * [How to Design NLP Models](#design-model)
 * [Define the Model Configuration File](#define-conf)
+* [Chinese Support](#chinese-support)
 * [Model Visualization](#visualize)
 * [Model Zoo for NLP Tasks](#model-zoo)
 * [Task 1: Text Classification](#task-1)
```
```diff
@@ -18,6 +19,7 @@
   2. [Model Compression for Text Matching](#task-6.2)
   3. [Model Compression for Slot Filling](#task-6.3)
   4. [Model Compression for Machine Reading Comprehension](#task-6.4)
+* [Task 7: Chinese Sentiment Analysis](#task-7)
 * [Advanced Usage](#advanced-usage)
   * [Extra Features](#extra-feature)
   * [Learning Rate Decay](#lr-decay)
```
```diff
@@ -70,6 +72,8 @@ python predict.py --conf_path=model_zoo/demo/conf.json
 Take *[PROJECTROOT/model_zoo/demo/conf.json](./model_zoo/demo/conf.json)* as an example (for ease of explaining how the toolkit is used, the network shown in this demo is not a realistic architecture). The task defined by this configuration file is question-answer pair matching, i.e. judging whether an answer can answer the corresponding question. The sample data is stored in *[PROJECTROOT/dataset/demo/](./dataset/demo/)*.
 
 The architecture of the configuration file is as follows:
+
+- **language**. [optional, default: English] Define the language here; English and Chinese are currently supported.
 - **inputs**. This part defines the input configuration.
   - ***use_cache***. If *use_cache* is true, the toolkit builds a cache the first time so that training can be accelerated on later runs.
   - ***dataset_type***. Declare the task type here. Currently we support classification, regression and so on.
```
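For orientation, here is a minimal sketch of how the head of such a configuration might look once the new *language* field is added. It is written as an annotated Python dict rather than strict JSON so the fields can carry comments; all values are illustrative, not taken from the demo config.

```python
# Sketch of the top of a NeuronBlocks configuration file.
# Key names follow the documentation above; values are illustrative only.
conf_head = {
    "language": "Chinese",  # optional; defaults to "English"
    "inputs": {
        "use_cache": True,  # build a cache on the first run to speed up later runs
        "dataset_type": "classification",  # task type, e.g. classification or regression
    },
}
```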
```diff
@@ -135,6 +139,7 @@ python predict.py --conf_path=model_zoo/demo/conf.json
   - ***batch_num_to_show_results***. [necessary for training] During training, show the results every batch_num_to_show_results batches.
   - ***max_epoch***. [necessary for training] The maximum number of epochs to train.
   - ***valid_times_per_epoch***. [optional for training, default: 1] How many times to run validation per epoch. Usually we validate once after each epoch, but for a very large corpus it is better to validate several times so as not to miss the model's best state.
+  - ***tokenizer***. [optional] Define the tokenizer here. Currently we support 'nltk' and 'jieba'; by default, 'nltk' is used for English and 'jieba' for Chinese.
 - **architecture**. Define the model architecture. This node is a list of layers (blocks) from block_zoo that represents a model. The layers supported by this toolkit are given in the [block_zoo overview](https://microsoft.github.io/NeuronBlocks).
 
   - ***Embedding layer***. The first layer of this example (shown below) defines the embedding layer, which is composed of one type of embedding, "word" (word embedding), with a dimension of 300. If you specify pre-trained embeddings in *inputs/data_paths/pre_trained_emb*, you need to keep this dimension consistent with the dimension of the pre-trained embeddings.
```
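Taken together, the training-related fields above might look like the sketch below. The enclosing *training_params* key name and all values are illustrative assumptions, not recommendations from the tutorial.

```python
# Illustrative training-parameter block; key names follow the bullets above,
# but the enclosing "training_params" name and all values are assumptions.
training_params = {
    "batch_num_to_show_results": 10,  # log results every 10 batches
    "max_epoch": 30,                  # train for at most 30 epochs
    "valid_times_per_epoch": 2,       # validate twice per epoch on a large corpus
    "tokenizer": "jieba",             # 'nltk' or 'jieba'; default follows "language"
}
```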
```diff
@@ -184,9 +189,13 @@ python predict.py --conf_path=model_zoo/demo/conf.json
 
 *Tips: The [optional] and [necessary] marks indicate whether the corresponding node in the configuration file is optional or necessary for training/test/prediction. If there is no mark, the node is always necessary. In practice, it is more convenient to prepare one configuration file that contains all the configurations for training, test and prediction.*
 
+### <span id="chinese-support">Chinese Support</span>
+
+When working with Chinese data, set *language* to 'Chinese' in the JSON configuration. Chinese text is tokenized with jieba by default. For a Chinese task example, see [Task 7: Chinese Sentiment Analysis](#task-7).
+
+We also support pre-trained Chinese word embeddings. First download and unzip Chinese word vectors from [Chinese Word Vectors](https://github.com/Embedding/Chinese-Word-Vectors#pre-trained-chinese-word-vectors), then place them in a folder (e.g. *dataset/chinese_word_vectors/*), and finally set *inputs/data_paths/pre_trained_emb* in the JSON configuration.
+
-## <span id="visualize">Model Visualization</span>
+### <span id="visualize">Model Visualization</span>
 
 This project provides a model visualizer for visualizing models and checking the syntactic correctness of model configuration files. Please refer to the [Model Visualizer README](./model_visualizer/README.md).
```
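Concretely, the two fields the new section says to change might look like the sketch below (again as an annotated Python dict; the embedding file name under *dataset/chinese_word_vectors/* is a hypothetical example, substitute whichever file you actually downloaded).

```python
# Config fields to adjust for a Chinese task. The embedding file name is a
# hypothetical example -- use the file you downloaded and unzipped.
chinese_fields = {
    "language": "Chinese",  # selects jieba as the default tokenizer
    "inputs": {
        "data_paths": {
            "pre_trained_emb": "dataset/chinese_word_vectors/sgns.weibo.word",
        },
    },
}
```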
````diff
@@ -492,6 +501,29 @@ This task is to train a query-passage regression model to learn from a heavy tea
 #### <span id="task-6.3">6.3: Model Compression for Slot Filling (ongoing)</span>
 #### <span id="task-6.4">6.4: Model Compression for Machine Reading Comprehension (ongoing)</span>
+
+### <span id="task-7">Task 7: Chinese Sentiment Analysis</span>
+
+Here is an example of Chinese sentiment analysis.
+
+- ***Dataset***
+
+  *PROJECT_ROOT/dataset/chinese_sentiment_analysis* contains the sample data for Chinese sentiment analysis.
+
+- ***Usage***
+
+  1. Train the Chinese sentiment analysis model.
+     ```bash
+     cd PROJECT_ROOT
+     python train.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json
+     ```
+  2. Test the model.
+     ```bash
+     cd PROJECT_ROOT
+     python test.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json
+     ```
+
+  *Tips: You can try different models by running different JSON configuration files. After training finishes, the model file and the training log can be found in the outputs/save_base_dir directory specified in the JSON configuration.*
+
 ## <span id="advanced-usage">Advanced Usage</span>
 
 After building a model, the next goal is to train it to good performance. This depends on a highly expressive model and on model-training tricks; NeuronBlocks provides some of these tricks.
````
