
Commit 7df5da1

woailaosangljshou authored and committed
Add Chinese usage in Tutorial.md (#27)
1 parent c638eff commit 7df5da1

1 file changed: +34 -1 lines changed


Tutorial.md

Lines changed: 34 additions & 1 deletion
@@ -6,6 +6,7 @@
 * [Quick Start](#quick-start)
 * [How to Design Your NLP Model](#design-model)
 * [Define the Model Configuration File](#define-conf)
+* [Chinese Support](#chinese-support)
 * [Visualize Your Model](#visualize)
 * [Model Zoo for NLP Tasks](#model-zoo)
 * [Task 1: Text Classification](#task-1)
@@ -18,6 +19,7 @@
 2. [Compression for Text Matching Model](#task-6.2)
 3. [Compression for Slot Filling Model](#task-6.3)
 4. [Compression for MRC Model](#task-6.4)
+* [Task 7: Chinese Sentiment Analysis](#task-7)
 * [Advanced Usage](#advanced-usage)
 * [Extra Feature Support](#extra-feature)
 * [Learning Rate Decay](#lr-decay)
@@ -80,6 +82,8 @@ Take *[PROJECTROOT/model_zoo/demo/conf.json](./model_zoo/demo/conf.json)* as an
 The sample data lies in *[PROJECTROOT/dataset/demo/](./dataset/demo/)*.
 
 The architecture of the configuration file is:
+
+- **language**. [optional, default: English] Define the language here first; English and Chinese are currently supported.
 - **inputs**. This part defines the input configuration.
 - ***use_cache***. If *use_cache* is true, the toolkit builds a cache the first time it processes the data, so that later training runs start faster.
 - ***dataset_type***. Declare the task type here. Currently, we support classification, regression, and other task types.
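For orientation, here is a minimal sketch of how the nodes described above fit together at the top of a JSON config. The key names (*language*, *inputs*, *use_cache*, *dataset_type*) come from this tutorial; the values are illustrative placeholders, not copied from a real model_zoo config:

```json
{
  "language": "English",
  "inputs": {
    "use_cache": true,
    "dataset_type": "classification"
  }
}
```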
@@ -145,6 +149,7 @@ The architecture of the configuration file is:
 - ***batch_num_to_show_results***. [necessary for training] During training, display results every *batch_num_to_show_results* batches.
 - ***max_epoch***. [necessary for training] The maximum number of epochs to train.
 - ***valid_times_per_epoch***. [optional for training, default: 1] Define how many times to run validation per epoch. Usually we validate once after each epoch, but for a very large corpus it is better to validate several times per epoch so as not to miss the best state of the model. The default value is 1.
+- ***tokenizer***. [optional] Define the tokenizer here. Currently, we support 'nltk' and 'jieba'; by default, 'nltk' is used for English and 'jieba' for Chinese.
 - **architecture**. Define the model architecture. This node is a list of the layers (blocks) from block_zoo that make up the model. The supported layers of this toolkit are listed in the [block_zoo overview](https://microsoft.github.io/NeuronBlocks).
 
 - ***Embedding layer***. The first layer of this example (as shown below) defines the embedding layer, which is composed of one type of embedding: "word" (word embedding), whose dimension is 300. If you specify pre-trained embeddings in *inputs/data_paths/pre_trained_emb*, you need to keep this dimension consistent with the dimension of the pre-trained embeddings.
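Since the diff omits the surrounding JSON example, here is a hedged sketch of how the training options and the embedding layer described above might look in a config file. The nesting under *training_params* and the *cols* field are assumptions based on the model_zoo examples, not shown in this diff, and the values are illustrative:

```json
{
  "training_params": {
    "batch_num_to_show_results": 10,
    "max_epoch": 30,
    "valid_times_per_epoch": 2,
    "tokenizer": "nltk"
  },
  "architecture": [
    {
      "layer": "Embedding",
      "conf": {
        "word": {
          "cols": ["sentence_text"],
          "dim": 300
        }
      }
    }
  ]
}
```

Note that *dim* here matches the 300-dimensional embeddings mentioned above; if you point *inputs/data_paths/pre_trained_emb* at vectors of a different size, change *dim* accordingly.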
@@ -195,8 +200,14 @@ The architecture of the configuration file is:
 *Tips: The [optional] and [necessary] marks indicate whether the corresponding node in the configuration file is optional or necessary for training/test/prediction. Nodes without a mark are always necessary. In practice, it is more convenient to prepare a single configuration file that contains all the configurations for training, test, and prediction.*
 
+
+### <span id="chinese-support">Chinese Support</span>
+
+When using Chinese data, set *language* in the JSON config to 'Chinese'. By default, Chinese text is tokenized with jieba. For an example, see [Task 7: Chinese Sentiment Analysis](#task-7).
+
+In addition, we also support pre-trained Chinese word vectors. First, download word vectors from [Chinese Word Vectors](https://github.com/Embedding/Chinese-Word-Vectors#pre-trained-chinese-word-vectors) and decompress them (e.g. with *bunzip2*), then place the file in a directory such as *dataset/chinese_word_vectors/*. Finally, remember to set *inputs/data_paths/pre_trained_emb* in the JSON config.
 
-## <span id="visualize">Visualize Your Model</span>
+
+### <span id="visualize">Visualize Your Model</span>
 
 A model visualizer is provided for visualization and configuration correctness checking; please refer to the [Model Visualizer README](./model_visualizer/README.md).
 
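To make the Chinese settings above concrete, here is a hedged sketch of the relevant config fragment. The *language* value and the *inputs/data_paths/pre_trained_emb* node come from this tutorial; the vector file name is an illustrative placeholder, and *tokenizer* can be omitted since 'jieba' is already the default for Chinese:

```json
{
  "language": "Chinese",
  "inputs": {
    "data_paths": {
      "pre_trained_emb": "dataset/chinese_word_vectors/sgns.wiki.word"
    }
  },
  "training_params": {
    "tokenizer": "jieba"
  }
}
```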
@@ -501,6 +512,28 @@ This task is to train a query-passage regression model to learn from a heavy tea
 #### <span id="task-6.3">6.3: Compression for Slot Filling Model (ongoing)</span>
 #### <span id="task-6.4">6.4: Compression for MRC (ongoing)</span>
 
+### <span id="task-7">Task 7: Chinese Sentiment Analysis</span>
+
+Here is an example of a sentiment analysis task using Chinese data.
+
+- ***Dataset***
+
+*PROJECT_ROOT/dataset/chinese_sentiment_analysis* contains sample data for Chinese sentiment analysis.
+
+- ***Usage***
+
+1. Train the Chinese sentiment analysis model.
+```bash
+cd PROJECT_ROOT
+python train.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json
+```
+2. Test your model.
+```bash
+cd PROJECT_ROOT
+python test.py --conf_path=model_zoo/nlp_tasks/chinese_sentiment_analysis/conf_chinese_sentiment_analysis_bilstm.json
+```
+*Tips: you can try different models by running different JSON config files. After training finishes, the model file and training log can be found in the JSON config file's outputs/save_base_dir.*
+
 
 ## <span id="advanced-usage">Advanced Usage</span>
 