Commit 404785f

adolphk-ykljshou authored and committed
Add Sequence labeling in Tutorial and tag scheme convert script (#65)
* add sequence labeling in tutorial
* add tag scheme convert script
* add paper link
1 parent 9b2adb9 commit 404785f

3 files changed: +173 -0 lines changed

Tutorial.md

Lines changed: 30 additions & 0 deletions
@@ -21,6 +21,7 @@
 4. [Compression for MRC Model](#task-6.4)
 * [Task 7: Chinese Sentiment Analysis](#task-7)
 * [Task 8: Chinese Text Matching](#task-8)
+* [Task 9: Sequence Labeling](#task-9)
 * [Advanced Usage](#advanced-usage)
 * [Extra Feature Support](#extra-feature)
 * [Learning Rate Decay](#lr-decay)
@@ -562,7 +563,36 @@ Here is an example using Chinese data, for text matching task.
 ```
 *Tips: you can try different models by running different JSON config files. The model file and training log file can be found in the JSON config file's outputs/save_base_dir after you finish training.*

+### <span id="task-9">Task 9: Sequence Labeling</span>
+Sequence labeling is an important NLP task that includes NER, slot tagging, POS tagging, etc.

+- ***Dataset***
+
+[CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/) is a popular dataset for sequence labeling. We use the CoNLL 2003 English NER data for our experiment, and you can refer to the data format in the [sample data](https://github.com/microsoft/NeuronBlocks/tree/master/dataset/slot_tagging/conll_2003).
+
+- ***Tagging Scheme***
+
+- NeuronBlocks supports both the BIO and BIOES tag schemes.
+- The IOB scheme is not supported, because it performs worse in most [experiments](https://arxiv.org/pdf/1707.06799.pdf).
+- NeuronBlocks provides a [script](./tools/taggingSchemes_Converter.py) that converts between the IOB, BIO and BIOES tag schemes (NOTE: the script only supports tsv files that have data and labels in two columns).
+
+- ***Usage***
+
+1. BiLSTM representation and Softmax output.
+```bash
+cd PROJECT_ROOT
+python train.py --conf_path=model_zoo/nlp_tasks/slot_tagging/conf_slot_tagging.json
+```
+
+- ***Result***
+
+1. BiLSTM representation and Softmax output.
+
+Model | F1-score
+-------- | --------
+[Ma and Hovy(2016)](https://arxiv.org/pdf/1603.01354.pdf)|87.00
+BiLSTM+Softmax(NeuronBlocks)|88.50
+
 ## <span id="advanced-usage">Advanced Usage</span>

 After building a model, the next goal is to train a model with good performance. It depends on a highly expressive model and tricks of the model training. NeuronBlocks provides some tricks of model training.
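To make the difference between the tag schemes named in the Task 9 section concrete, here is a minimal sketch that labels one sentence in IOB, BIO and BIOES and extracts entity spans from the BIO version. The sentence, its labels (with "Paris" and "France" deliberately treated as two adjacent LOC entities) and the helper `bio_spans` are illustrative assumptions, not NeuronBlocks code.

```python
# Illustrative sentence (an assumption, not taken from CoNLL 2003).
words = ["Alex", "Smith", "flew", "to", "Paris", "France"]

# IOB: B- is used only when an entity directly follows another of the same type.
iob = ["I-PER", "I-PER", "O", "O", "I-LOC", "B-LOC"]
# BIO: every entity starts with B-.
bio = ["B-PER", "I-PER", "O", "O", "B-LOC", "B-LOC"]
# BIOES: E- marks the last token of an entity, S- a single-token entity.
bioes = ["B-PER", "E-PER", "O", "O", "S-LOC", "S-LOC"]


def bio_spans(labels):
    """Return (type, start, end_exclusive) spans from a BIO label sequence."""
    spans = []
    start, current = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for idx, label in enumerate(labels + ["O"]):
        inside = label.startswith("I-") and current == label[2:]
        if not inside and current is not None:
            spans.append((current, start, idx))
            start, current = None, None
        if label.startswith("B-"):
            start, current = idx, label[2:]
    return spans


for label_type, start, end in bio_spans(bio):
    print(label_type, " ".join(words[start:end]))
```

BIOES carries the same spans as BIO but marks entity boundaries explicitly, which is why the two can be converted into each other without loss.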

Tutorial_zh_CN.md

Lines changed: 31 additions & 0 deletions
@@ -21,6 +21,7 @@
 4. [Compression for MRC Model](#task-6.4)
 * [Task 7: Chinese Sentiment Analysis](#task-7)
 * [Task 8: Chinese Text Matching](#task-8)
+* [Task 9: Sequence Labeling](#task-9)
 * [Advanced Usage](#advanced-usage)
 * [Extra Features](#extra-feature)
 * [Learning Rate Decay](#lr-decay)
@@ -552,6 +553,36 @@ This task is to train a query-passage regression model to learn from a heavy tea
 ```
 *Tips: you can try different models by running different JSON config files. After training finishes, the model file and training log file can be found in the outputs/save_base_dir directory set in the JSON config.*

+### <span id="task-9">Task 9: Sequence Labeling</span>
+Sequence labeling is an important NLP task that includes NER, slot tagging, POS tagging, and similar tasks.
+
+- ***Dataset***
+
+[CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/) is a widely used dataset for sequence labeling. Our sequence labeling experiments use the English NER data from CoNLL 2003; for the data format, refer to the [sample data](https://github.com/microsoft/NeuronBlocks/tree/master/dataset/slot_tagging/conll_2003) we provide.
+
+- ***Tagging Scheme***
+
+- NeuronBlocks supports the BIO and BIOES tagging schemes.
+- The IOB tagging scheme is not supported, because it performs poorly in most [experiments](https://arxiv.org/pdf/1707.06799.pdf).
+- NeuronBlocks provides a [conversion script](./tools/taggingSchemes_Converter.py) for converting between tagging schemes (IOB/BIO/BIOES). The script only supports two-column tsv input files containing data and labels.
+
+- ***Usage***
+
+1. BiLSTM word representation and Softmax output
+```bash
+cd PROJECT_ROOT
+python train.py --conf_path=model_zoo/nlp_tasks/slot_tagging/conf_slot_tagging.json
+```
+
+- ***Result***
+
+1. BiLSTM word representation and Softmax output
+
+Model | F1-score
+-------- | --------
+[Ma and Hovy(2016)](https://arxiv.org/pdf/1603.01354.pdf)|87.00
+BiLSTM+Softmax(NeuronBlocks)|88.50
+
 ## <span id="advanced-usage">Advanced Usage</span>

 After building a model, the next goal is to train a model with good performance. It depends on a highly expressive model and tricks of the model training. NeuronBlocks provides some tricks of model training.
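For NER, the F1-score reported in result tables like the one above is commonly computed at the entity level: a predicted entity counts as correct only if its type and both boundaries match a gold entity exactly. The sketch below shows that computation under the assumption that the NeuronBlocks number is measured this way; the commit itself does not say so. The function `span_f1` and the example spans are hypothetical.

```python
def span_f1(gold_spans, pred_spans):
    """Entity-level F1 over (type, start, end_exclusive) spans, exact match only."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two predicted entities has a wrong left boundary.
gold = [("PER", 0, 2), ("LOC", 4, 5)]
pred = [("PER", 0, 2), ("LOC", 3, 5)]
print(round(span_f1(gold, pred), 2))
```

Here precision and recall are both 1/2, so the score is 0.5, even though the mispredicted entity overlaps the gold one.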

tools/taggingSchemes_Converter.py

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT license.
+
+import sys
+
+
+def BIO2BIOES(input_labels_list):
+    output_labels_list = []
+    for labels in input_labels_list:
+        new_labels = []
+        sent_len = len(labels)
+        for idx in range(sent_len):
+            if "-" not in labels[idx]:  # e.g. "O": keep unchanged
+                new_labels.append(labels[idx])
+            else:
+                label_type = labels[idx].split('-', 1)[-1]  # keep hyphens inside the type
+                if labels[idx].startswith("B-"):
+                    if (idx == sent_len - 1) or (not labels[idx + 1].startswith("I-")):
+                        new_labels.append("S-" + label_type)  # single-token entity
+                    else:
+                        new_labels.append("B-" + label_type)
+                elif labels[idx].startswith("I-"):
+                    if (idx == sent_len - 1) or (not labels[idx + 1].startswith("I-")):
+                        new_labels.append("E-" + label_type)  # last token of the entity
+                    else:
+                        new_labels.append("I-" + label_type)
+        assert len(labels) == len(new_labels)  # fails on tags other than O, B-*, I-*
+        output_labels_list.append(new_labels)
+    return output_labels_list
+
+
+def BIOES2BIO(input_labels_list):
+    output_labels_list = []
+    for labels in input_labels_list:
+        new_labels = []
+        sent_len = len(labels)
+        for idx in range(sent_len):
+            if "-" not in labels[idx]:  # e.g. "O": keep unchanged
+                new_labels.append(labels[idx])
+            else:
+                label_type = labels[idx].split('-', 1)[-1]  # keep hyphens inside the type
+                if labels[idx].startswith("E-"):  # entity end becomes inside
+                    new_labels.append("I-" + label_type)
+                elif labels[idx].startswith("S-"):  # single-token entity becomes begin
+                    new_labels.append("B-" + label_type)
+                else:
+                    new_labels.append(labels[idx])
+        assert len(labels) == len(new_labels)
+        output_labels_list.append(new_labels)
+    return output_labels_list
+
+
+def IOB2BIO(input_labels_list):
+    output_labels_list = []
+    for labels in input_labels_list:
+        new_labels = []
+        sent_len = len(labels)
+        for idx in range(sent_len):
+            if labels[idx].startswith("I-"):
+                label_type = labels[idx].split('-', 1)[-1]  # keep hyphens inside the type
+                if (idx == 0) or (labels[idx - 1] == "O") or (label_type != labels[idx - 1].split('-', 1)[-1]):
+                    new_labels.append("B-" + label_type)  # first token of an entity
+                else:
+                    new_labels.append(labels[idx])
+            else:
+                new_labels.append(labels[idx])
+        assert len(labels) == len(new_labels)
+        output_labels_list.append(new_labels)
+    return output_labels_list
+
+
+if __name__ == '__main__':
+    '''Convert NER tagging schemes among IOB/BIO/BIOES.
+    For example, to convert the IOB tagging scheme to BIO, run:
+        python taggingSchemes_Converter.py IOB2BIO input_iob_file output_bio_file
+    Input data is in tsv format, with words and labels in two columns.
+    '''
+    input_file_name, output_file_name = sys.argv[2], sys.argv[3]
+    words_list, labels_list, new_labels_list = [], [], []
+    with open(input_file_name, 'r') as input_file:
+        for line in input_file:
+            item = line.rstrip().split('\t')
+            assert len(item) == 2
+            words, labels = item[0].split(' '), item[1].split(' ')
+            if len(words) != len(labels):  # skip lines whose columns do not align
+                print("Error line: " + line.rstrip())
+                continue
+            words_list.append(words)
+            labels_list.append(labels)
+
+    if sys.argv[1].upper() == "IOB2BIO":
+        print("Convert IOB -> BIO...")
+        new_labels_list = IOB2BIO(labels_list)
+    elif sys.argv[1].upper() == "BIO2BIOES":
+        print("Convert BIO -> BIOES...")
+        new_labels_list = BIO2BIOES(labels_list)
+    elif sys.argv[1].upper() == "BIOES2BIO":
+        print("Convert BIOES -> BIO...")
+        new_labels_list = BIOES2BIO(labels_list)
+    elif sys.argv[1].upper() == "IOB2BIOES":
+        print("Convert IOB -> BIOES...")
+        tmp_labels_list = IOB2BIO(labels_list)
+        new_labels_list = BIO2BIOES(tmp_labels_list)
+    else:
+        print("Argument error: sys.argv[1] should be one of \"IOB2BIO/BIO2BIOES/BIOES2BIO/IOB2BIOES\"")
+        sys.exit(1)  # stop here instead of writing output with no converted labels
+
+    with open(output_file_name, 'w') as output_file:
+        for words, labels in zip(words_list, new_labels_list):
+            line = " ".join(words) + '\t' + " ".join(labels) + '\n'
+            output_file.write(line)
+
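The script reads a two-column tsv file: space-separated words, a tab, then space-separated labels. The sketch below isolates that parsing and writing step so the expected format is easy to check before running a conversion. The helpers `parse_tsv_line` and `format_tsv_line` are hypothetical and not part of the commit. Unlike the script, which asserts on the column count, this sketch returns `None` for any malformed line.

```python
def parse_tsv_line(line):
    """Split one 'words<TAB>labels' line into parallel word and label lists.

    Returns None when the line does not have exactly two columns or when
    the number of words and labels differs.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 2:
        return None
    words, labels = columns[0].split(" "), columns[1].split(" ")
    if len(words) != len(labels):
        return None
    return words, labels


def format_tsv_line(words, labels):
    """Inverse of parse_tsv_line: join words and labels back into one line."""
    return " ".join(words) + "\t" + " ".join(labels) + "\n"


# Illustrative line in the two-column format.
sample = "EU rejects German call\tB-ORG O B-MISC O\n"
parsed = parse_tsv_line(sample)
print(parsed)
```

A line that survives `parse_tsv_line` can be passed through any of the converters and written back with `format_tsv_line` without changing the word column.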
