UDA Example #343
Open

jrxk wants to merge 91 commits into `asyml:master` from `jrxk:uda_example`
Commits (91)
1ca7f31 draft of back translation (haoyuLucas)
f166160 Add a backtranslation augmenter (haoyuLucas)
babda7a after merge (haoyuLucas)
4e1e866 rebase on the new base classes (haoyuLucas)
31b326e Update data_augment_processor.py (haoyuLucas)
681f15c Update data_augment_processor.py (haoyuLucas)
2142587 Update data_augment_processor.py (haoyuLucas)
d6f3a4f Delete base_augmenter.py (haoyuLucas)
653a00e Delete dictionary_replacement_augmenter.py (haoyuLucas)
b85190f Delete text_generation_augment_processor.py (haoyuLucas)
610e99a Delete dictionary_replacement_augmenter_test.py (haoyuLucas)
47d0805 Delete text_generation_augment_processor_test.py (haoyuLucas)
d7b9a5b change the configs to a Texar Config (haoyuLucas)
917090c Merge branch 'bt' of github.com:haoyuLucas/forte into bt (haoyuLucas)
a6b6169 abstract a machine translator class (haoyuLucas)
5d17c13 Update machine_translator.py (haoyuLucas)
45f0100 add the transformer to requirements (haoyuLucas)
fcd33d0 add an extra space (haoyuLucas)
5445fbe add the transformers to travis yml (haoyuLucas)
07d7b27 add travis yml (haoyuLucas)
05947ab Merge branch 'master' into bt (haoyuLucas)
a229a58 add text classifier (jrxk)
ae0ffa5 add list (jrxk)
04e92fa fix main (jrxk)
2dc228c fix (jrxk)
1d1a577 delete some files (jrxk)
3c46e9b Merge branch 'master' into imdb_classifier (jrxk)
8c6e000 first commit of uda (haoyuLucas)
a386c62 Merge branch 'master' of https://github.com/asyml/forte into imdb_cla… (jrxk)
f8c0995 Merge branch 'master' into imdb_classifier (hunterhector)
26de46b add the bool return value (haoyuLucas)
ceff3f5 add some comments (haoyuLucas)
c14024a switch to texar-pytorch (jrxk)
2082208 Merge branch 'imdb_classifier' of https://github.com/jrxk/forte into … (jrxk)
8758d09 fix travis (jrxk)
eeaf010 Merge branch 'master' into bt (hunterhector)
7669a58 modify the setup (haoyuLucas)
7b98a5d Merge branch 'bt' of github.com:haoyuLucas/forte into bt (haoyuLucas)
e9416e7 modify travis config (haoyuLucas)
7422994 initial version of UDA iterator (haoyuLucas)
e9b4379 Merge branch 'master' into UDA (haoyuLucas)
04e7dfc fix mypy error (haoyuLucas)
7a2216c Merge branch 'UDA' of github.com:haoyuLucas/forte into UDA (haoyuLucas)
fed4d5e rerun travis (haoyuLucas)
0f6b1c4 Merge branch 'UDA' of https://github.com/haoyuLucas/forte into uda_ex… (jrxk)
608977f Merge branch 'bt' of https://github.com/haoyuLucas/forte into uda_exp… (jrxk)
a0f1470 add bt pipeline (jrxk)
26e7207 Merge branch 'auto_align_replace' of https://github.com/haoyuLucas/fo… (jrxk)
957cf08 Add toy data (jrxk)
6873fb4 Merge branch 'master' into UDA (haoyuLucas)
b452aa5 changed data for uda (jrxk)
81a484a modify train loop for UDA (jrxk)
0f8d7b0 Add doc for data augmentation (haoyuLucas)
de45a27 add TSA, minor changes (jrxk)
fe18f75 Merge branch 'UDA' of https://github.com/haoyuLucas/forte into uda_ex… (jrxk)
67308ae create imdb classifier (jrxk)
0daf443 fix travis (jrxk)
2349f0c update UDA (jrxk)
69eb738 update config, minor fixes (jrxk)
6ec645c Merge branch 'imdb_classifier_2' of https://github.com/jrxk/forte int… (jrxk)
8b4fdab add README, remove files (jrxk)
de59db0 add file link (jrxk)
9340277 remove data files (jrxk)
8aca85b remove files (jrxk)
8ffaf72 use UDA's back trans data (jrxk)
f7313dd update README (jrxk)
69009fe Merge branch 'master' of https://github.com/asyml/forte into uda_expe… (jrxk)
c4d6991 some refactor (jrxk)
f5740a8 remove imdb model (jrxk)
e565bb8 Merge branch 'master' of https://github.com/asyml/forte into uda_example (jrxk)
9a79456 refactor (jrxk)
f60961a update README (jrxk)
a95023c fix init (jrxk)
2e16182 Merge branch 'master' into uda_example (hunterhector)
b00293f Merge branch 'master' into uda_example (hunterhector)
3984038 Merge branch 'master' of https://github.com/asyml/forte into uda_example (jrxk)
efcae45 removed wget, changed imdb_format to forte pipeline (jrxk)
2800218 Update README with tutorial to UDA (jrxk)
f6a4be1 fix travis (jrxk)
c0aa83f Merge branch 'master' into uda_example (hunterhector)
f17a238 move to da folder, remove classes (jrxk)
8d8016f Merge branch 'uda_example' of https://github.com/jrxk/forte into uda_… (jrxk)
a7fb89d clean some code, update reader (jrxk)
46ccb67 more clean, adding bt (jrxk)
0fec67c update scripts, add requirements for t2t (jrxk)
34727eb add merge sentences code (jrxk)
554f892 Added instructions for back translation
259664f fix docstring, travis (jrxk)
f935d74 fix travis (jrxk)
e319fe9 update test (jrxk)
4c1b6fb Merge branch 'master' into uda_example (jrxk)
README.md
@@ -0,0 +1,64 @@
## Unsupervised Data Augmentation for Text Classification

Unsupervised Data Augmentation (UDA) is a semi-supervised learning method that achieves state-of-the-art results on a wide variety of language and vision tasks. For details, please refer to the [paper](https://arxiv.org/abs/1904.12848) and the [official repository](https://github.com/google-research/uda).

In this example, we demonstrate Forte's implementation of UDA using a simple BERT-based text classifier.
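For orientation, the UDA objective combines a supervised cross-entropy term with a consistency term that pushes the model's predictions on an unlabeled example and its augmented (back-translated) version together. The following is an illustrative pure-Python sketch of that loss; the function and variable names are made up for this README and are not Forte's actual API.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def uda_loss(sup_log_probs, sup_labels, unsup_probs, unsup_aug_probs, lam=1.0):
    """Supervised cross-entropy plus a consistency term on unlabeled pairs.

    sup_log_probs: per-class log-probabilities for the labeled examples.
    unsup_probs / unsup_aug_probs: predictions on each unlabeled example and
    on its back-translated version; the KL term pushes the two together.
    (The paper also stops gradients through the original prediction and
    applies TSA; both are omitted in this sketch.)
    """
    # Supervised term: negative log-likelihood of the gold label.
    ce = -sum(lp[y] for lp, y in zip(sup_log_probs, sup_labels)) / len(sup_labels)
    # Unsupervised term: KL between predictions on original and augmented text.
    consistency = sum(kl_divergence(p, q)
                      for p, q in zip(unsup_probs, unsup_aug_probs)) / len(unsup_probs)
    return ce + lam * consistency
```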
## Quick Start

### Install the dependencies

You need to install [texar-pytorch](https://github.com/asyml/texar-pytorch) first.

### Get the IMDB data

We use the IMDB Text Classification dataset for this example. Use the following script to download the supervised and unsupervised training data.

```bash
python download_imdb.py
```

### Preprocess and generate augmented data

You can use the following script to process the data into CSV format.

```bash
python utils/imdb_format.py --raw_data_dir=data/IMDB_raw/aclImdb --train_id_path=data/IMDB_raw/train_id_list.txt --output_dir=data/IMDB
```

The next step is to generate augmented training data (using your favorite back-translation model) and write it to a TXT file. Each line in the file should correspond to the same line in `train.csv` (without headers).
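Because UDA pairs each unlabeled example with its augmented counterpart purely by line position, it is worth sanity-checking the alignment before training. A small hypothetical helper (`check_alignment` is not part of this example's code; the file names follow the README):

```python
import csv

def check_alignment(train_csv_path, aug_txt_path):
    """Verify the augmented TXT file has one line per row of train.csv.

    A missing or extra line silently mis-pairs every subsequent example,
    so it is cheap insurance to fail loudly here.
    """
    with open(train_csv_path, newline="", encoding="utf-8") as f:
        n_rows = sum(1 for _ in csv.reader(f))  # train.csv has no header row
    with open(aug_txt_path, encoding="utf-8") as f:
        n_aug = sum(1 for _ in f)
    if n_rows != n_aug:
        raise ValueError(f"{n_rows} CSV rows vs {n_aug} augmented lines")
    return n_rows
```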
For demonstration purposes, we provide the processed and augmented [data files](https://drive.google.com/file/d/1OKrbS76mbGCIz3FcFQ8-qPpMTQkQy8bP/view?usp=sharing). Place the CSV and TXT files in the `data/IMDB` directory.

### Train

To train the baseline model without UDA:

```bash
python main.py --do-train --do-eval --do-test
```

To train with UDA:

```bash
python main.py --do-train --do-eval --do-test --use-uda
```

To change the hyperparameters, please see `config_data.py`. You can also change the number of labeled examples used for training (`num_train_data`).

#### GPU Memory Issue

According to the authors' [guidelines for hyperparameters](https://github.com/google-research/uda#general-guidelines-for-setting-hyperparameters), a longer sequence length and a larger batch size lead to better performance, but both are limited by GPU memory. By default, we use `max_seq_length=128` and `batch_size=24` to run on a GTX 1080 Ti with 11GB of memory.

## Results

With the provided data, you should be able to achieve performance similar to the following:

| Number of Labeled Examples | BERT Accuracy | BERT+UDA Accuracy |
| -------------------------- | ------------- | ----------------- |
| 24                         | 61.54         | 84.92             |
| 25000                      | 89.68         | 90.19             |

When training with only 24 labeled examples, we use the Training Signal Annealing (TSA) technique, which can be turned on by setting `tsa=True`.

You can further improve the performance by tuning hyperparameters, generating better back-translation data, using a larger BERT model, using a larger `max_seq_length`, etc.
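TSA gradually releases the supervised training signal: labeled examples whose correct-class probability already exceeds a threshold are dropped from the loss, so the model cannot quickly overfit a handful of labels. An illustrative sketch of the three schedules named in `config_data.py` follows; the function name and the scale constant 5 are assumptions taken from the UDA paper's description, not Forte's implementation.

```python
import math

def tsa_threshold(schedule, step, total_steps, num_classes):
    """Training Signal Annealing keep-threshold at a given training step.

    The threshold starts at chance level (1 / num_classes) and anneals to 1.
    """
    t = step / total_steps
    start = 1.0 / num_classes
    if schedule == "linear_schedule":
        alpha = t
    elif schedule == "exp_schedule":
        alpha = math.exp((t - 1) * 5)   # releases signal slowly, then quickly
    elif schedule == "log_schedule":
        alpha = 1 - math.exp(-t * 5)    # releases signal quickly, then slowly
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return start + alpha * (1 - start)
```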
@@ -0,0 +1,13 @@
```python
# Copyright 2020 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
@@ -0,0 +1,11 @@
```python
name = "bert_classifier"
hidden_size = 768
clas_strategy = "cls_time"
dropout = 0.1
num_classes = 2

# These hyperparams are used in the bert_with_hypertuning_main.py example
hyperparams = {
    "optimizer.warmup_steps": {"start": 10000, "end": 20000, "dtype": int},
    "optimizer.static_lr": {"start": 1e-3, "end": 1e-2, "dtype": float}
}
```
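The `hyperparams` search space above pairs each optimizer setting with start/end bounds and a dtype. As a hedged illustration of how such a space can be consumed (this is not the actual logic of `bert_with_hypertuning_main.py`, and `sample_hyperparams` is a made-up name), a simple random-search sampler might look like:

```python
import random

def sample_hyperparams(space, rng=random):
    """Draw one configuration from a {name: {start, end, dtype}} search space.

    Integer-typed entries are sampled uniformly over the closed integer range;
    float-typed entries uniformly over the real interval.
    """
    config = {}
    for name, spec in space.items():
        if spec["dtype"] is int:
            config[name] = rng.randint(spec["start"], spec["end"])
        else:
            config[name] = rng.uniform(spec["start"], spec["end"])
    return config
```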
config_data.py
@@ -0,0 +1,77 @@
```python
pickle_data_dir = "data/IMDB"
unsup_bt_file = "data/IMDB/para_0.txt"
max_seq_length = 128
num_classes = 2
num_train_data = 24  # supervised data limit; max 25000

train_batch_size = 24
max_train_epoch = 3000
display_steps = 50  # Print training loss every display_steps; -1 to disable

eval_steps = 100  # Eval on the dev set every eval_steps; -1 to eval every epoch
# Proportion of training to perform linear learning rate warmup for.
# E.g., 0.1 = 10% of training.
warmup_proportion = 0.1
eval_batch_size = 8
test_batch_size = 8

feature_types = {
    # Features read from the pickled data file.
    # E.g., feature "input_ids" is read as dtype `int64`;
    # "stacked_tensor" indicates a fixed length for all data instances,
    # limited by `max_seq_length`, so examples stack into one batch tensor.
    "input_ids": ["int64", "stacked_tensor", max_seq_length],
    "input_mask": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids": ["int64", "stacked_tensor", max_seq_length],
    "label_ids": ["int64", "stacked_tensor"]
}

train_hparam = {
    "allow_smaller_final_batch": False,
    "batch_size": train_batch_size,
    "dataset": {
        "data_name": "data",
        "feature_types": feature_types,
        "files": "{}/train.pkl".format(pickle_data_dir)
    },
    "shuffle": True,
    "shuffle_buffer_size": None
}

eval_hparam = {
    "allow_smaller_final_batch": True,
    "batch_size": eval_batch_size,
    "dataset": {
        "data_name": "data",
        "feature_types": feature_types,
        "files": "{}/eval.pkl".format(pickle_data_dir)
    },
    "shuffle": False
}

# UDA config
tsa = True
tsa_schedule = "linear_schedule"  # linear_schedule, exp_schedule, log_schedule

unsup_feature_types = {
    "input_ids": ["int64", "stacked_tensor", max_seq_length],
    "input_mask": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids": ["int64", "stacked_tensor", max_seq_length],
    "label_ids": ["int64", "stacked_tensor"],
    "aug_input_ids": ["int64", "stacked_tensor", max_seq_length],
    "aug_input_mask": ["int64", "stacked_tensor", max_seq_length],
    "aug_segment_ids": ["int64", "stacked_tensor", max_seq_length],
    "aug_label_ids": ["int64", "stacked_tensor"]
}

unsup_hparam = {
    "allow_smaller_final_batch": True,
    "batch_size": train_batch_size,
    "dataset": {
        "data_name": "data",
        "feature_types": unsup_feature_types,
        "files": "{}/unsup.pkl".format(pickle_data_dir)
    },
    "shuffle": True,
    "shuffle_buffer_size": None,
}
```
download_imdb.py
@@ -0,0 +1,37 @@
```python
# Copyright 2020 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import subprocess


def main():
    # Create the target directory (with parents) if it does not exist yet.
    os.makedirs("data/IMDB_raw", exist_ok=True)
    # pylint: disable=line-too-long
    # Use the raw-content URL: the github.com/.../blob/... URL serves an HTML
    # page, not the text file itself.
    subprocess.run(
        'wget -P data/IMDB_raw/ https://raw.githubusercontent.com/google-research/uda/master/text/data/IMDB_raw/train_id_list.txt',
        shell=True, check=True)
    subprocess.run(
        'wget -P data/IMDB_raw/ https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
        shell=True, check=True)
    subprocess.run(
        'tar xzvf data/IMDB_raw/aclImdb_v1.tar.gz -C data/IMDB_raw/ && rm data/IMDB_raw/aclImdb_v1.tar.gz',
        shell=True, check=True)


if __name__ == '__main__':
    sys.exit(main())
```