Cross-lingual Embeddings and Text Classification
Required packages: nltk, sklearn, thulac, numpy, keras
After installing them, run the following Python code to download the WordNet data:
import nltk
nltk.download('wordnet')
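
Before running main.py it can help to confirm that every listed package is importable. The snippet below is a minimal sketch that uses only the standard library; `REQUIRED` and `missing_packages` are illustrative names and are not part of this repository.

```python
import importlib.util

# Import names of the packages listed above (assumed to match the list exactly).
REQUIRED = ("nltk", "sklearn", "thulac", "numpy", "keras")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Note that `sklearn` is the import name; the package is installed from PyPI as `scikit-learn`.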
Run main.py once either the 'Inputs' or the 'Middle outputs' listed below are prepared.
Inputs for data pre-processing:
| File Name | Folder | Description |
|---|---|---|
| wiki.en.align.vec | ./data/ | English aligned word embedding FastText |
| wiki.zh.align.vec | ./data/ | Chinese aligned word embedding FastText |
| ... | ./data/sport | Chinese news corpus THUCTC |
| ... | ./data/politics | Chinese news corpus THUCTC |
| ... | ./data/science | Chinese news corpus THUCTC |
| ... | ./data/UM-Corpus | Chinese news corpus UM-corpus |
| chinese.train.1000 | ./data/ | Chinese training corpus Reuters Corpora |
| chinese.dev | ./data/ | Chinese validation corpus Reuters Corpora |
| chinese.test | ./data/ | Chinese testing corpus Reuters Corpora |
| english.train.1000 | ./data/ | English training corpus Reuters Corpora |
| english.dev | ./data/ | English validation corpus Reuters Corpora |
| english.test | ./data/ | English testing corpus Reuters Corpora |
| stopwords-zh.txt | ./data/ | Chinese stop words |
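
The two `.align.vec` files place English and Chinese words in one shared vector space, so vectors from both languages can be compared directly. The sketch below assumes the files use the fastText text format (a header line with vocabulary size and dimension, then one word followed by its values per line). `load_vec` and `cosine` are illustrative helpers, not functions taken from main.py.

```python
import numpy as np


def load_vec(path, max_words=None):
    """Read a text-format .vec file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        header = f.readline().split()  # "<vocab size> <dimension>"
        dim = int(header[1])
        for i, line in enumerate(f):
            if max_words is not None and i >= max_words:
                break
            parts = line.rstrip().split(" ")
            if len(parts) <= dim:
                continue  # skip malformed lines
            word = " ".join(parts[:-dim])
            values = [float(v) for v in parts[-dim:]]
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

For example, after `en = load_vec("./data/wiki.en.align.vec", max_words=50000)` and the same call for the Chinese file, `cosine(en[w_en], zh[w_zh])` gives a rough measure of how close two words are across languages. The `max_words` cap is only a way to limit memory use, since the full files are large.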
With the middle outputs listed below prepared, one can comment out the section:
==========================Data preparation==========================
in main.py and jump directly to:
==========================Model training==========================
| File Name | Folder | Description |
|---|---|---|
| X_c3 | ./mid/ | Chinese corpus after data pre-processing |
| y_c3 | ./mid/ | Chinese label after data pre-processing |
| X_e3 | ./mid/ | English corpus after data pre-processing |
| y_e3 | ./mid/ | English label after data pre-processing |
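
Whether the data preparation section can be skipped depends on all four middle outputs being present. The check below is a minimal sketch of that decision. It assumes the files sit in `./mid/` and start with the names in the table (any file extension is accepted, since the table does not state one); `middle_outputs_ready` is a hypothetical helper, not part of main.py.

```python
from pathlib import Path

# Names taken from the table above; extensions are not specified there.
MID_FILES = ("X_c3", "y_c3", "X_e3", "y_e3")


def middle_outputs_ready(mid_dir="./mid", names=MID_FILES):
    """Return True if every middle output has a matching file in mid_dir."""
    mid = Path(mid_dir)
    if not mid.is_dir():
        return False
    return all(any(mid.glob(name + "*")) for name in names)


if __name__ == "__main__":
    if middle_outputs_ready():
        print("Middle outputs found: data preparation can be skipped.")
    else:
        print("Middle outputs missing: run data preparation first.")
```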
Outputs of running main.py:
| File Name | Folder | Description |
|---|---|---|
| .log | ./output/ | Logs |
| .png | ./output/ | Charts |