* Add new config about knowledge distillation for query binary classifier
* remove inference results in knowledge distillation for query binary classifier
* Add AUC.py in tools folder
* Add test_data_path into conf_kdqbc_bilstmattn_cnn.json
* Modify AUC.py
* Rename AUC.py into calculate_AUC.py
* Modify test & calculate-AUC commands for Knowledge Distillation for Query Binary Classifier
* Add cpu_thread_num parameter in conf.training_params
* Rename cpu_thread_num into cpu_num_workers
* update comments in ModelConf.py
* Add cpu_num_workers in model_zoo/advanced/conf.json
* Add the description of cpu_num_workers in Tutorial.md
* Update inference speed of compressed model
* Add ProcessorsScheduler Class
* Add license in ProcessorScheduler.py
* use lazy loading instead of one-off loading
* Remove Debug Info in problem.py
* use open instead of codecs.open
* update the inference of build dictionary for classification
* add md5 function in common_utils.py
* add merge_encode_* function
* fix typo
* fix typo
* reorg the logical flow in train.py
* remove dummy comments in problem.py
* add encoding cache mechanism
* add lazy-load mechanism for training phase
* enumerate problem types in problem.py
* remove data_encoding.py
* add lazy load train logic
* Modify comment and remove debug code
* Check whether test_path exists
* fix missing parameter when using char embedding
* merge master
* add file_column_num in problem.py
* merge add_encoding_cache branch
* add SST-2 in .gitignore
* merge master
* use steps_per_validation instead of valid_times_per_epoch
* Fix Learning Rate decay logic bug
* add log of calculating md5 of training data
* fix multi-gpu char_emb OOM problem & add char-level fix_lengths
* Modify batch_num_to_show_results in multi-gpu
* Modify batch_num_to_show_results
* delete deepcopy in get_batches
* add new parameters chunk_size and max_building_lines in conf and update tutorials
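Several of the commits above (the md5 helper in common_utils.py, the encoding cache, lazy loading of training data) suggest that preprocessed encodings are cached and keyed by a checksum of the input files. The sketch below illustrates that general idea only; the helper names `md5_of_file` and `load_or_encode` are hypothetical, and the actual common_utils.py / problem.py implementation may differ.

```python
import hashlib
import os
import pickle

def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file and return its md5 hex digest using constant memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def load_or_encode(train_path, cache_dir, encode_fn):
    """Reuse cached encodings when the training file is unchanged."""
    key = md5_of_file(train_path)
    cache_path = os.path.join(cache_dir, key + '.pkl')
    if os.path.exists(cache_path):
        # Cache hit: the training data has the same checksum as before.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    encoded = encode_fn(train_path)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(encoded, f)
    return encoded
```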
logging.info("configuration[training_params][valid_times_per_epoch] is deprecated, please use configuration[training_params][steps_per_validation] instead")
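The deprecation message above implies a backward-compatibility path from the old valid_times_per_epoch setting to the new steps_per_validation one. A minimal sketch of such a shim, assuming a hypothetical helper name; the real logic in ModelConf.py or train.py may be organized differently:

```python
import logging

def resolve_steps_per_validation(training_params, batches_per_epoch):
    """Return the validation interval in steps, honouring the deprecated key."""
    if 'steps_per_validation' in training_params:
        return training_params['steps_per_validation']
    if 'valid_times_per_epoch' in training_params:
        logging.info("configuration[training_params][valid_times_per_epoch] is deprecated, "
                     "please use configuration[training_params][steps_per_validation] instead")
        # Translate "N validations per epoch" into a step interval.
        return max(1, batches_per_epoch // training_params['valid_times_per_epoch'])
    return 10  # documented default for steps_per_validation
```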
Tutorial.md: 5 additions & 1 deletion
@@ -147,10 +147,12 @@ The architecture of the configuration file is:
     CUDA_VISIBLE_DEVICES= python train.py
     ```
 - ***cpu_num_workers***. [default: -1] Define the number of processes used to preprocess the dataset. If the value is negative or 0, the number of processes equals the number of logical cores the CPU supports; otherwise it equals *cpu_num_workers*.
+- ***chunk_size***. [default: 1000000] Define the chunk size of the files that NB reads each time, to avoid running out of memory and to support lazy loading.
 - ***batch_size***. Define the batch size here. If there are multiple GPUs, *batch_size* is the batch size of each GPU.
 - ***batch_num_to_show_results***. [necessary for training] During the training process, show the results every *batch_num_to_show_results* batches.
 - ***max_epoch***. [necessary for training] The maximum number of epochs to train.
-- ***valid_times_per_epoch***. [optional for training, default: 1] Define how many times to conduct validation per epoch. Usually we conduct validation after each epoch, but for a very large corpus we'd better validate multiple times in case we miss the best state of the model. The default value is 1.
+- ~~***valid_times_per_epoch***~~. [**deprecated**] Please use steps_per_validation instead.
+- ***steps_per_validation***. [default: 10] Define after how many steps each validation takes place.
 - ***tokenizer***. [optional] Define the tokenizer here. Currently we support 'nltk' and 'jieba'. By default, 'nltk' is used for English and 'jieba' for Chinese.
 - **architecture**. Define the model architecture. The node is a list of layers (blocks) in block_zoo to represent a model. The supported layers of this toolkit are given in [block_zoo overview](https://microsoft.github.io/NeuronBlocks).

@@ -729,5 +731,7 @@ To solve the above problems, NeuronBlocks supports *fixing embedding weight* (em
 ***training_params/vocabulary/max_vocabulary***. [int, optional for training, default: 800,000] The max size of the corpus vocabulary. If the corpus vocabulary size is larger than *max_vocabulary*, it will be cut according to word frequency.
+
+***training_params/vocabulary/max_building_lines***. [int, optional for training, default: 1,000,000] The maximum number of lines NB will read from every file to build the vocabulary.
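Putting the parameters from this diff together, a training_params fragment could look roughly like the following. The values are illustrative only (taken from the defaults described above), and the surrounding structure of a real conf.json in model_zoo may differ:

```python
import json

# Illustrative training_params fragment; key names follow the tutorial text above.
training_params = {
    "cpu_num_workers": -1,            # <= 0 means "use all logical CPU cores" for preprocessing
    "chunk_size": 1000000,            # lines read per chunk (lazy loading, bounded memory)
    "batch_size": 256,                # per-GPU batch size; pick to fit your hardware
    "batch_num_to_show_results": 10,  # report training results every 10 batches
    "max_epoch": 30,
    "steps_per_validation": 10,       # replaces the deprecated valid_times_per_epoch
    "vocabulary": {
        "max_vocabulary": 800000,
        "max_building_lines": 1000000
    }
}

print(json.dumps({"training_params": training_params}, indent=2))
```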
model_zoo/nlp_tasks/knowledge_distillation/query_binary_classifier_compression/conf_kdqbc_bilstmattn_cnn.json