[wip] Corefud v1.3 #1502
base: dev
Conversation
stanza/models/coref/bert.py
Outdated
```python
curr_id = -1
curr_number = -1

list = []
```
`list` shadows a builtin - better to use something else. not a fan of `l` as a name either. it's only used in a short function, but still, it makes it very difficult to search
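For illustration, the rename might look like this (a sketch; `subword_spans` is a hypothetical name, since the diff doesn't show what the list actually holds):

```python
curr_id = -1
curr_number = -1

subword_spans = []  # hypothetical name; avoids shadowing the builtin `list`
```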
stanza/models/common/doc.py
Outdated
```python
words = self._words
empty_words = self._empty_words

all = sorted(words + empty_words, key=lambda x:(x.id,)
```
make the lambda one line? it takes a moment to realize this is one thing across two lines
also `all` is a builtin (and yes, i know `id` is as well - that wasn't my choice)
ping about `all` and the multiline lambda here
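For illustration, the one-line form with a builtin-safe name might read as below (a sketch only: `all_words` is an assumed replacement name, and this assumes the sort key really is just `(x.id,)` - the second line of the original lambda is not shown in the diff):

```python
all_words = sorted(words + empty_words, key=lambda x: (x.id,))
```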
sorry missed that, addressed in 6defc5c
stanza/models/coref/model.py
Outdated
```python
if return_subwords:
    (nonblank_batches,
     nonblank_labels) = bert.get_subwords_batches(doc, self.config,
                                                  self.tokenizer, nonblank_only=True)
all_batches = bert.get_subwords_batches(doc, self.config, self.tokenizer)
```
is there meant to be an `else` here? i was somewhat confused earlier when i saw the return format could be one of two types, which would be confusing for the caller... and here is a perhaps somewhat confusing usage of it
partly i would wonder if this all still works the same when there are no zeros to train for
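For illustration, one way to keep the call sites uniform would be for `get_subwords_batches` to always return a pair (a sketch, not the PR's actual resolution; `_build_batches` is a hypothetical helper standing in for the real logic):

```python
# sketch: always return a (batches, labels) pair, with labels None when not
# requested, so callers never have to branch on the return type
def get_subwords_batches(doc, config, tokenizer, nonblank_only=False):
    batches, labels = _build_batches(doc, config, tokenizer, nonblank_only)
    return batches, (labels if nonblank_only else None)
```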
stanza/models/coref/bert.py
Outdated
```python
start += len(subwords[start:end])

if nonblank_only:
```
not a huge fan of the return which could be 1 thing or 2. likely to confuse people
This is no longer the case as of the changes from 5639048 to 1d0863d.
stanza/models/coref/dataset.py
Outdated
```diff
@@ -38,6 +38,9 @@ def __init__(self, path, config, tokenizer):
     word2subword = []
     subwords = []
     word_id = []
+    nonblank_subwords = []   # a list of subwords, skipping _
+    previous_was_blank = []  # was the word before _?
+    was_blank = False        # a flag to set if we saw "_"
```
maybe something other than `was`, such as `has` or `saw`?
wait... maybe i'm not understanding correctly. i would say these names could use some documentation / editing and there should be some comments on why we're keeping track of this here
great catch; these fields are not needed anymore in the new implementation of zeros
```python
)
logger.info(f"CoNLL-2012 3-Score Average : {w_checker.bakeoff:.5f}")
logger.info(f"Zero prediction accuracy: {z_correct / z_total:.5f}")
```
in general, is this always on? i would think there will be datasets that don't have zeros
in general, reporting this shouldn't hurt, since all we'll have in that case is that all of `doc["is_zero"]` is `False`. Hence, that will give us 100% zeros accuracy, and not break any logging. Do you think we should handle those cases differently? The tricky part is that we currently have no way to tell if a dataset has no zeros, or if a batch has no zeros (which is quite likely, since zeros are relatively rare).
in that case it doesn't matter too much, although i would think a higher level part of the routine could also look at the whole dataset and check if it has zeros or not. but not a big deal
Sounds good; I would err on the side of "no", just because having "100% zeros accuracy" is technically correct still + involves less post-processing. Your call though.
well, no strong opinions except that `/ z_total` is probably not ideal in the case of `z_total == 0`
We have:

```python
zero_targets = torch.tensor(doc["is_zero"], device=res.zero_scores.device)
z_total += zero_targets.numel()
```

so the only situation in which `z_total` would be 0 is the one where the number of elements in `doc["is_zero"]` is zero for the entire corpus (i.e., the corpus has no length); this would be a bad state and not usually possible.
ah, will z_correct include documents correctly predicted to have 0 zeros?
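For reference, a small guard over the logging line from this thread would sidestep the `z_total == 0` division entirely (a sketch, not the PR's code):

```python
if z_total > 0:
    logger.info(f"Zero prediction accuracy: {z_correct / z_total:.5f}")
else:
    logger.info("Zero prediction accuracy: n/a (no zero targets seen)")
```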
stanza/models/coref/model.py
Outdated
```python
    'train_c_loss': c_loss.item(),
    'train_s_loss': s_loss.item(),
}
if z_loss:
```
maybe just keep using `res.zero_scores.size(0) == 0`? i'm picturing a weird edge case where the learning has gone wrong and the losses are all 0, or maybe a weird case where the prediction is so close to the correct value that there's a 0 loss
done. 9331fed
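The suggested check would presumably read along these lines (a sketch; the actual change is in 9331fed, and `stats` is an assumed name for the logging dict):

```python
# test for the presence of zero targets rather than `if z_loss:`,
# which would also skip a loss that is legitimately exactly 0.0
if res.zero_scores.size(0) > 0:
    stats['train_z_loss'] = z_loss.item()
```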
```python
# crap! there's two zeros right next to each other
# we are sad and confused so we give up in this case
if len(sentence_text) > span[1] and sentence_text[span[1]] == "_":
    warnings.warn("Found two zeros next to each other in sequence; we are confused and therefore giving up.")
```
does this ever happen?
Yup! I think I found it in at least Catalan or Spanish
makes sense. can also post an issue with the dataset maintainers
@amir-zeldes and I had a bunch of fun sitting around at ACL trying to come up with cases where this may happen; it seems like it's not totally impossible, so likely not a dataset error. Would love feedback on this front though
I don't think it's common in CorefUD, if it's even attested(?), but conceptually it could happen, for example in an SOV language when both the subject and object are dropped. For example in Japanese:

- [okāsan]1 wa [yasai]2 o katta no? "Did mom buy vegetables?"
- [`_`]1 [`_`]2 katta yo! "Yes she did!" (lit. "Did!")
```python
    candidate_head = span[1]
else:
    try:
        candidate_head = find_cconj_head(sentence_heads, sentence_upos, span[1], span[2]+1)
```
does this ever happen? i don't quite remember the `find_cconj_head` method without looking it up
Yes, it was inserted because for some reason the find cconj head function gets into an infinite loop in `_get_depth_recursive`. This is because the relative head of a zero is `None` (since it's not a word with an annotated dependent); we never saw this before because only verbs (i.e., sentence roots) have `None` heads, and verbs are not coreferent.

We can either have a rule where, if it's a zero, we mark the head as the existing head; or we can try to altruistically run the algorithm and give up if it runs for too long, as here
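The "run it with a cap and give up" option could be sketched as an iterative, depth-capped walk (names and conventions here are assumed; this is not the actual stanza implementation):

```python
def get_depth_capped(heads, idx, max_depth=1000):
    """Walk up the head chain from word idx, giving up after max_depth steps.

    Convention assumed for this sketch: heads[i] is the head index of word i,
    with 0 for the root and None for a zero with no annotated head.
    """
    depth = 0
    while idx is not None and idx != 0:
        if depth >= max_depth:
            # a chain this long almost certainly means a cycle in the head
            # annotations; give up rather than walking forever
            raise RecursionError("head chain exceeded max_depth; giving up")
        idx = heads[idx]
        depth += 1
    return depth
```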
```diff
 if candidate_head is None:
     for candidate_head in range(span[1], span[2] + 1):
         # stanza uses 0 to mark the head, whereas OntoNotes is counting
         # words from 0, so we have to subtract 1 from the stanza heads
         #print(span, candidate_head, parsed_sentence.words[candidate_head].head - 1)
         # treat the head of the phrase as the first word that has a head outside the phrase
-        if (parsed_sentence.words[candidate_head].head - 1 < span[1] or
-                parsed_sentence.words[candidate_head].head - 1 > span[2]):
+        if parsed_sentence.all_words[candidate_head].head and (
```
what does this do with 0 for roots? am i missing something about the root possibility?
`parsed_sentence.all_words[candidate_head].head = None` implies that the word is a zero, which at this point would be a bad state (i.e., somehow it didn't trigger a recursion error with the dynamic depth, but also is a zero/a verb?)
what i was thinking was that `.head` is checking truthiness, which would also fail on 0 for root. if the diagnostic test for it being a zero is `None`, how about `is None` instead of just checking its truthiness?
done, patched with 4d14f4d
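For reference, the `is not None` form of that condition would look something like this (a sketch; the actual patch is in 4d14f4d):

```python
# `is not None` distinguishes a zero (head is None) from the root (head == 0),
# which a plain truthiness check on .head would conflate
if parsed_sentence.all_words[candidate_head].head is not None and (
        parsed_sentence.all_words[candidate_head].head - 1 < span[1] or
        parsed_sentence.all_words[candidate_head].head - 1 > span[2]):
    ...
```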
Pull Request Overview
This PR adds support for zero-node annotation in CorefUD v1.3, enabling the coreference system to recognize and handle zero anaphora (empty pronouns marked with underscores). The implementation includes a new zero anaphora predictor component and comprehensive handling throughout the coreference pipeline.
- Adds zero anaphora detection and handling capabilities
- Updates data processing to work with CorefUD v1.3 format
- Implements zero node creation and coreference cluster integration
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| stanza/utils/datasets/coref/convert_udcoref.py | Updates dataset conversion to handle zero tokens and CorefUD v1.3 format |
| stanza/pipeline/coref_processor.py | Adds zero anaphora handling in the main processing pipeline |
| stanza/models/coref/utils.py | Implements the sigmoid focal loss function for zero prediction training (see the sketch below) |
| stanza/models/coref/model.py | Integrates the zero predictor component and training logic |
| stanza/models/coref/dataset.py | Minor formatting updates |
| stanza/models/coref/const.py | Adds a zero_scores field to CorefResult |
| stanza/models/common/doc.py | Extends the document model to support zero nodes and an all_words property |
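For context, sigmoid focal loss (Lin et al., 2017) in its standard form is sketched below; the PR's actual implementation in stanza/models/coref/utils.py may differ in details such as reduction or default hyperparameters:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard sigmoid focal loss: down-weights easy examples so training
    focuses on the rare positives (here, the rare zero-anaphora targets)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)   # probability of the true class
    loss = ce * (1 - p_t) ** gamma                # focusing term
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss                     # class-balance term
    return loss.mean()
```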
Comments suppressed due to low confidence (1)

stanza/models/common/doc.py:750

- The variable name `all` is ambiguous and shadows the built-in `all()` function. Consider renaming to `all_words_sorted` or similar.

```python
all = sorted(words + empty_words, key=lambda x:(x.id,)
```
- new language groups for coref 1.3
- handling of underscore forms
- ...also support for zero-node annotation! So for a sentence with underscores in it, the system would actually be able to recognize one as a possible coreferent (i.e. zero anaphora) and mark it if needed.