|
6 | 6 | * [Quick Start](#quick-start) |
7 | 7 | * [How to Design Your NLP Model](#design-model) |
8 | 8 | * [Define the Model Configuration File](#define-conf) |
| 9 | + * [Chinese Support](#chinese-support) |
9 | 10 | * [Visualize Your Model](#visualize) |
10 | 11 | * [Model Zoo for NLP Tasks](#model-zoo) |
11 | 12 | * [Task 1: Text Classification](#task-1) |
|
18 | 19 | 2. [Compression for Text Matching Model](#task-6.2) |
19 | 20 | 3. [Compression for Slot Filling Model](#task-6.3) |
20 | 21 | 4. [Compression for MRC Model](#task-6.4) |
| 22 | + * [Task 7: Chinese Sentiment Analysis](#task-7) |
21 | 23 | * [Advanced Usage](#advanced-usage) |
22 | 24 | * [Extra Feature Support](#extra-feature) |
23 | 25 | * [Learning Rate Decay](#lr-decay) |
@@ -80,6 +82,8 @@ Take *[PROJECTROOT/model_zoo/demo/conf.json](./model_zoo/demo/conf.json)* as an |
80 | 82 | The sample data lies in *[PROJECTROOT/dataset/demo/](./dataset/demo/)*. |
81 | 83 |
|
82 | 84 | The architecture of the configuration file is: |
| 85 | + |
| 86 | +- **language**. [optional, default: English] Define the language of your data here. Currently, English and Chinese are supported (see the sketch below).
83 | 87 | - **inputs**. This part defines the input configuration. |
84 | 88 | - ***use_cache***. If *use_cache* is true, the toolkit caches the processed data on the first run so that subsequent training runs are faster.
85 | 89 | - ***dataset_type***. Declare the task type here. Currently, we support classification, regression and so on. |
@@ -145,6 +149,7 @@ The architecture of the configuration file is: |
145 | 149 | - ***batch_num_to_show_results***. [necessary for training] During the training process, show the results every batch_num_to_show_results batches. |
146 | 150 | - ***max_epoch***. [necessary for training] The maximum number of epochs to train. |
147 | 151 | - ***valid_times_per_epoch***. [optional for training, default: 1] Define how many times to conduct validation per epoch. Usually, we validate once after each epoch, but for a very large corpus it is better to validate several times per epoch so as not to miss the best state of the model. The default value is 1.
| 152 | + - ***tokenizer***. [optional] Define the tokenizer here. Currently, we support 'nltk' and 'jieba'. By default, 'nltk' is used for English and 'jieba' for Chinese.
148 | 153 | - **architecture**. Define the model architecture. The node is a list of layers (blocks) in block_zoo to represent a model. The supported layers of this toolkit are given in [block_zoo overview](https://microsoft.github.io/NeuronBlocks). |
149 | 154 |
|
150 | 155 | - ***Embedding layer***. The first layer of this example (as shown below) defines the embedding layer, which is composed of one type of embedding: "word" (word embedding), whose dimension is 300. If you specify pre-trained embeddings in *inputs/data_paths/pre_trained_emb*, this dimension must match the dimension of the pre-trained embeddings.
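
To make the nodes above concrete, here is a minimal sketch of how they fit together in a JSON config. This is a sketch only: the paths, column names, and values below are illustrative, and the complete schema can be found in *[PROJECTROOT/model_zoo/demo/conf.json](./model_zoo/demo/conf.json)*.

```json
{
  "language": "English",
  "inputs": {
    "use_cache": true,
    "dataset_type": "classification",
    "data_paths": {
      "train_data_path": "./dataset/demo/train.tsv",
      "valid_data_path": "./dataset/demo/valid.tsv",
      "pre_trained_emb": "./dataset/GloVe/glove.840B.300d.txt"
    }
  },
  "training_params": {
    "batch_size": 32,
    "batch_num_to_show_results": 10,
    "max_epoch": 3,
    "valid_times_per_epoch": 1,
    "tokenizer": "nltk"
  },
  "architecture": [
    {
      "layer": "Embedding",
      "conf": {
        "word": {
          "cols": ["sentence_text"],
          "dim": 300
        }
      }
    }
  ]
}
```

Note how the embedding *dim* (300) matches the dimension of the pre-trained embedding file given in *pre_trained_emb*, as required above.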
@@ -195,8 +200,14 @@ The architecture of the configuration file is: |
195 | 200 | *Tips: The [optional] and [necessary] marks indicate whether the corresponding node in the configuration file is optional or necessary for training/test/prediction. If there is no mark, the node is always necessary. In practice, it is most convenient to prepare a single configuration file that contains all the configurations for training, test and prediction.*
196 | 201 |
|
197 | 202 |
|
| 203 | +### <span id="chinese-support">Chinese Support</span> |
| 204 | +
|
| 205 | +When using Chinese data, set *language* in the JSON config to 'Chinese'. By default, Chinese text is tokenized with jieba. For an example, see [Task 7: Chinese Sentiment Analysis](#task-7).
| 206 | +
|
| 207 | +In addition, pre-trained Chinese word vectors are also supported. First, download the word vectors from [Chinese Word Vectors](https://github.com/Embedding/Chinese-Word-Vectors#pre-trained-chinese-word-vectors) and decompress them (e.g. with *bunzip2*), then place the file in a directory such as *dataset/chinese_word_vectors/*. Finally, remember to set *inputs/data_paths/pre_trained_emb* in the JSON config, as in the sketch below.
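
For instance, a Chinese config may differ from an English one only in the following nodes. In this sketch, *<downloaded_vector_file>* is a placeholder for whichever vector file you downloaded, and the *tokenizer* node is shown for clarity but is optional, since 'jieba' is already the default for Chinese.

```json
{
  "language": "Chinese",
  "inputs": {
    "data_paths": {
      "pre_trained_emb": "./dataset/chinese_word_vectors/<downloaded_vector_file>"
    }
  },
  "training_params": {
    "tokenizer": "jieba"
  }
}
```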
198 | 208 |
|
199 | | -## <span id="visualize">Visualize Your Model</span> |
| 209 | +
|
| 210 | +### <span id="visualize">Visualize Your Model</span> |
200 | 211 |
|
201 | 212 | A model visualizer is provided for visualization and configuration correctness checking; please refer to the [Model Visualizer README](./model_visualizer/README.md).
202 | 213 |
|
@@ -501,6 +512,28 @@ This task is to train a query-passage regression model to learn from a heavy tea |
501 | 512 | #### <span id="task-6.3">6.3: Compression for Slot Filling Model (ongoing)</span> |
502 | 513 | #### <span id="task-6.4">6.4: Compression for MRC (ongoing)</span> |
503 | 514 |
|
| 515 | +### <span id="task-7">Task 7: Chinese Sentiment Analysis</span> |
| 516 | +
|
| 517 | +Here is an example of a sentiment analysis task on Chinese data.
| 518 | +
|
| 519 | +- ***Dataset*** |
| 520 | +
|
| 521 | +  *PROJECT_ROOT/dataset/chinese_sentiment_analysis* contains sample data for Chinese sentiment analysis.
| 522 | +
|
| 523 | +- ***Usage*** |
| 524 | +
|
| 525 | + 1. Train the Chinese sentiment analysis model.
| 526 | + ```bash |
| 527 | + cd PROJECT_ROOT |
| 528 | + python train.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json |
| 529 | + ``` |
| 530 | + 2. Test your model. |
| 531 | + ```bash |
| 532 | + cd PROJECT_ROOT |
| 533 | + python test.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json |
| 534 | + ``` |
| 535 | + *Tips: you can try different models by running different JSON config files. After training finishes, the model file and training log can be found in the directory specified by outputs/save_base_dir in the JSON config (see the sketch below).*
| 536 | + |
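
For reference, the node the tips above refer to sits under *outputs* in the JSON config. A sketch with illustrative names (the exact directory and file names depend on your config):

```json
"outputs": {
  "save_base_dir": "./models/chinese_sentiment_analysis_bilstm/",
  "model_name": "model.nb",
  "train_log_name": "train.log"
}
```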
504 | 537 |
|
505 | 538 | ## <span id="advanced-usage">Advanced Usage</span> |
506 | 539 |
|
|