[Feature] Add Ner Suffix feature #1123
base: v0.x
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1123      +/-   ##
==========================================
- Coverage   88.34%   88.21%   -0.13%
==========================================
  Files          66       66
  Lines        6290     6290
==========================================
- Hits         5557     5549       -8
- Misses        733      741       +8
Job PR-1123/1 is complete.
                        help='Learning rate for optimization')
arg_parser.add_argument('--warmup-ratio', type=float, default=0.1,
                        help='Warmup ratio for learning rate scheduling')
arg_parser.add_argument('--tagging-first-token', type=str2bool, default=True,
How about parser.add_argument('--tag-last-token', action='store_true')? It seems simpler to call finetune_bert.py --tag-last-token than finetune_bert.py --tagging-first-token=False.
In either case, please update the test case in scripts/tests/ to invoke finetune_bert.py with both options. You can parametrize the test following, for example, Haibin's recent PR: https://github.com/dmlc/gluon-nlp/pull/1121/files#diff-fa82d34d543ff657c2fe09553bd0fa34R234
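A minimal sketch of what the suggested flag and a parametrized test could look like (the test name, and checking only the parsed value rather than running the full script, are illustrative assumptions, not the repository's actual test harness):

import argparse
import pytest

def build_parser():
    arg_parser = argparse.ArgumentParser()
    # Boolean flag: absent -> False (tag the first sub-word), present -> True (tag the last one)
    arg_parser.add_argument('--tag-last-token', action='store_true',
                            help='Use the last sub-word piece of each word instead of the first')
    return arg_parser

@pytest.mark.parametrize('tag_last_token', [False, True])
def test_finetune_bert_tag_last_token(tag_last_token):
    argv = ['--tag-last-token'] if tag_last_token else []
    args = build_parser().parse_args(argv)
    assert args.tag_last_token is tag_last_token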
Sure, I will update it.
Have you found any performance differences?
@sxjscience I've tried the default parameters set in the scripts on the CoNLL-2003 dataset. The performance using the suffix feature is a little lower than using the prefix feature.
I think we can try the following:
One problem is that, since we are using self-attention, we can tailor the attention weights to cover the first, last, and average cases. Thus, I don't think selecting the first/last token will impact the performance much.
@sxjscience For a classification task, I think it does not matter. But in a sequence labeling task, each word has one label. If we break a word 'w' into several subwords [sw1, sw2, ...], then only sw1 will have the label, and the labels of the others will be set to NULL. I don't think that makes sense.
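To make the point concrete, here is a small illustration (the word split, the tag, and the NULL_TAG value are made up for the example) of how a word-level tag lands on only one sub-word piece under the first-piece vs. last-piece schemes:

NULL_TAG = 'X'  # placeholder tag for sub-word pieces that carry no label

def assign_subword_tags(pieces, word_tag, tag_first=True):
    # Give the word's tag to exactly one piece; the rest get NULL_TAG.
    tags = [NULL_TAG] * len(pieces)
    tags[0 if tag_first else -1] = word_tag
    return tags

pieces = ['Wash', '##ington']                                  # assumed sub-word split
print(assign_subword_tags(pieces, 'B-LOC', tag_first=True))    # ['B-LOC', 'X']
print(assign_subword_tags(pieces, 'B-LOC', tag_first=False))   # ['X', 'B-LOC']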
  dataset = BERTTaggingDataset(text_vocab, None, None, config.test_path,
-                              config.seq_len, train_config.cased, tag_vocab=tag_vocab)
+                              config.seq_len, train_config.cased, tag_vocab=tag_vocab,tagging_first_token=config.tagging_first_token)
Pls add white space after the comma.
Because we are using attention, the state bound to sw1 will be related to the other sub-words; the same holds for sw_n. A reasonable approach is to mask the loss corresponding to the other sub-word tokens and only use the state of the first sub-word as the contextualized word embedding.
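As a rough sketch of that idea (the shapes and mask values below are made up; in practice the mask would flag each word's first sub-word position), Gluon's sample_weight argument can zero out the loss at the remaining sub-word positions:

import mxnet as mx

batch_size, seq_len, num_tags = 2, 6, 5
logits = mx.nd.random.normal(shape=(batch_size, seq_len, num_tags))
labels = mx.nd.random.randint(0, num_tags, shape=(batch_size, seq_len)).astype('float32')
# 1.0 where a position is a word's first sub-word, 0.0 for the remaining pieces
first_subword_mask = mx.nd.array([[1, 1, 0, 1, 0, 0],
                                  [1, 0, 0, 1, 1, 1]])

loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss()
# sample_weight multiplies the per-token loss, so masked positions contribute nothing
loss = loss_fn(logits, labels, first_subword_mask.expand_dims(axis=2)).mean()
print(loss.asscalar())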
entries.append(PredictedToken(text=text,
                              true_tag=true_tag, pred_tag=pred_tag))
tmptext = ''
else:
Can both cases be merged here? For example, if len(tmptext) == 0, you can still have text = tmptext + token_text which is equivalent to token_text.
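A small sketch of the suggested merge (PredictedToken is recreated here as a namedtuple purely for illustration; the field names follow the diff above):

from collections import namedtuple

PredictedToken = namedtuple('PredictedToken', ['text', 'true_tag', 'pred_tag'])

def append_entry(entries, tmptext, token_text, true_tag, pred_tag):
    # With tmptext defaulting to '', both cases collapse into one:
    # '' + token_text is simply token_text, so no separate branch is needed.
    entries.append(PredictedToken(text=tmptext + token_text,
                                  true_tag=true_tag, pred_tag=pred_tag))
    return ''  # reset the accumulator after flushing a word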
                              true_tag=true_tag, pred_tag=pred_tag))

if true_tag == NULL_TAG:
    tmptext += token_text
Better to name it tmp_text. Or what about partial_text?
@sxjscience I agree with you, and I'll try this method.
I am confused about this part. Why is masking the loss of the other sub-word tokens reasonable? For example, on NER tasks the suffix is much more important than the prefix in words like …
Since we are using attention, the higher-level state associated with …
@sxjscience Do you think we should continue with this pull request?
Description
Add a parameter "tagging_first_token" so you can choose to use the first piece or the last piece of each word. The first piece captures the prefix feature of a word, and the last piece captures the suffix feature.
Checklist
Essentials
Comments