* [Quick Start](#quick-start)
* [How to Design an NLP Model](#design-model)
  * [Define the Model Configuration File](#define-conf)
  * [Chinese Support](#chinese-support)
  * [Model Visualization](#visualize)
* [Model Zoo for NLP Tasks](#model-zoo)
  * [Task 1: Text Classification](#task-1)
    2. [Compression for Text Matching Model](#task-6.2)
    3. [Compression for Slot Filling Model](#task-6.3)
    4. [Compression for Machine Reading Comprehension Model](#task-6.4)
  * [Task 7: Chinese Sentiment Analysis](#task-7)
* [Advanced Usage](#advanced-usage)
  * [Extra Features](#extra-feature)
  * [Learning Rate Decay](#lr-decay)

Take *[PROJECTROOT/model_zoo/demo/conf.json](./model_zoo/demo/conf.json)* as an example (to make the usage of the toolkit easy to explain, the network structure of this demo is not a practical one). The task defined by this configuration file is question-answer pair matching, i.e., judging whether an answer can answer the corresponding question. The related sample data is stored in *[PROJECTROOT/dataset/demo/](./dataset/demo/)*.

The structure of the configuration file is as follows:

- **language**. [optional, default: English] First, define the language type here. Currently we support English and Chinese.
- **inputs**. This part defines the input configuration.
  - ***use_cache***. If *use_cache* is true, the toolkit builds a cache the first time it runs so that the training process can be accelerated the next time.
  - ***dataset_type***. Declare the task type here. Currently, we support classification, regression and so on.
- **training_params**. This part defines the parameters used in the training process.
  - ***batch_num_to_show_results***. [necessary for training] During training, show the results every *batch_num_to_show_results* batches.
  - ***max_epoch***. [necessary for training] The maximum number of epochs to train.
  - ***valid_times_per_epoch***. [optional for training, default: 1] Define how many times to run validation per epoch. Usually we validate once after each epoch, but for a very large corpus it is better to validate several times per epoch so that we do not miss the best state of the model. The default value is 1.
  - ***tokenizer***. [optional] Define the tokenizer here. Currently, we support 'nltk' and 'jieba'. By default, 'nltk' is used for English and 'jieba' for Chinese.
- **architecture**. Define the model architecture. This node is a list of layers (blocks) from block_zoo that represents a model. The supported layers of this toolkit are given in the [block_zoo overview](https://microsoft.github.io/NeuronBlocks).
  - ***Embedding layer***. The first layer of this example (see the sketch below) defines the embedding layer, which is composed of one type of embedding, "word" (word embedding), with a dimension of 300. If you specify pre-trained embeddings in *inputs/data_paths/pre_trained_emb*, you need to keep this dimension consistent with the dimension of the pre-trained embeddings.
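
A minimal, illustrative sketch of how such a configuration file might be laid out is shown below. It is not the actual content of *model_zoo/demo/conf.json*: the column names, paths and layer parameters are placeholders chosen for illustration, and the remaining nodes of a complete configuration (such as the *outputs* node mentioned in the tips of Task 7) are omitted.

```json
{
  "language": "English",
  "inputs": {
    "use_cache": true,
    "dataset_type": "classification",
    "data_paths": {
      "pre_trained_emb": "path/to/pre_trained_word_embeddings.txt"
    }
  },
  "training_params": {
    "batch_num_to_show_results": 10,
    "max_epoch": 30,
    "valid_times_per_epoch": 1
  },
  "architecture": [
    {
      "layer": "Embedding",
      "conf": {
        "word": {
          "cols": ["question_text", "answer_text"],
          "dim": 300
        }
      }
    }
  ]
}
```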
*Tips: The [optional] and [necessary] marks mean that the corresponding node in the configuration file is optional or necessary for training/test/prediction. If there is no mark, the node is always necessary. In practice, it is more convenient to prepare a single configuration file that contains all the configurations for training, test and prediction.*
### <span id="chinese-support">Chinese Support</span>

When using Chinese data, *language* in the JSON configuration should be set to 'Chinese'. Chinese text is segmented with jieba by default. See [Task 7: Chinese Sentiment Analysis](#task-7) for an example of a Chinese task.

In addition, we also support Chinese pre-trained word embeddings. First, download Chinese word vectors from [Chinese Word Vectors](https://github.com/Embedding/Chinese-Word-Vectors#pre-trained-chinese-word-vectors) and unzip them, then place them under some folder (e.g. *dataset/chinese_word_vectors/*), and finally define *inputs/data_paths/pre_trained_emb* in the JSON configuration.
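
As a rough sketch, the Chinese-related parts of such a configuration could look like the snippet below. The embedding file name is only a placeholder for whichever Chinese word vectors you downloaded, and setting *tokenizer* explicitly is optional, since 'jieba' is already the default for Chinese.

```json
{
  "language": "Chinese",
  "inputs": {
    "data_paths": {
      "pre_trained_emb": "dataset/chinese_word_vectors/sgns.example.word"
    }
  },
  "training_params": {
    "tokenizer": "jieba"
  }
}
```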
### <span id="visualize">Model Visualization</span>

This project provides a model visualization tool for visualizing models and checking the syntactic correctness of model configuration files. Please refer to the [Model Visualizer README](./model_visualizer/README.md).
#### <span id="task-6.3">6.3: Compression for Slot Filling Model (ongoing)</span>
#### <span id="task-6.4">6.4: Compression for Machine Reading Comprehension Model (ongoing)</span>
### <span id="task-7">Task 7: Chinese Sentiment Analysis</span>

Here we give an example of Chinese sentiment analysis.

- ***Dataset***

    *PROJECT_ROOT/dataset/chinese_sentiment_analysis* contains the sample data for Chinese sentiment analysis.

- ***Usage***

    1. Train the Chinese sentiment analysis model.
    ```bash
    cd PROJECT_ROOT
    python train.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json
    ```
    2. Test the model.
    ```bash
    cd PROJECT_ROOT
    python test.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json
    ```
    *Tips: you can try different models by running different JSON configuration files. When training is finished, the model file and the training log file can be found in the outputs/save_base_dir directory defined in the JSON configuration.*
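
A hypothetical *outputs* node illustrating where these files end up might look as follows; *save_base_dir* is taken from the tip above, while the directory and the other field names are assumptions for illustration, not verified against the actual configuration schema:

```json
{
  "outputs": {
    "save_base_dir": "./models/chinese_sentiment_analysis/",
    "model_name": "model.nb",
    "train_log_name": "train.log"
  }
}
```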
## <span id="advanced-usage">Advanced Usage</span>

After building a model, the next goal is to train it to good performance. This depends both on a highly expressive model and on the tricks used during training. NeuronBlocks provides several such model training tricks.