
Conversation

@Jemoka (Member) commented Jun 20, 2025

...also support for zero-node annotation! So for a sentence with underscores (zero tokens) in it, the system will actually be able to recognize each one as a possible coreferent (i.e. zero anaphora) and mark it if needed.

curr_id = -1
curr_number = -1

list = []
Collaborator:

`list` shadows a builtin - better to use something else

not a fan of `l` as a name either. it's only used in a short function, but still, it makes it very difficult to search for

words = self._words
empty_words = self._empty_words

all = sorted(words + empty_words, key=lambda x: (x.id,))
Collaborator:

make the lambda one line? it takes a moment to realize this is one thing across two lines

Collaborator:

also `all` is a builtin (and yes, i know `id` is as well - that wasn't my choice)
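The merged sort under discussion could look something like the following. This is a hedged sketch, not stanza's actual code: the `Word` class and the tuple-shaped ids (a regular word 7 as `(7,)`, a CoNLL-U empty node 7.1 as `(7, 1)`) are assumptions made so the example is self-contained; the point is that a descriptive name avoids shadowing the builtin `all`, and tuple comparison interleaves regular and empty words correctly.

```python
# a minimal sketch (not the actual stanza implementation) of sorting regular
# words and empty (zero) words together by id.
# assumption: ids are tuples, so word 7 -> (7,) and empty node 7.1 -> (7, 1);
# Python's tuple ordering then places 7 < 7.1 < 8 automatically.

class Word:
    def __init__(self, id):
        self.id = id

words = [Word((1,)), Word((2,))]
empty_words = [Word((1, 1))]          # the zero node "1.1"

# one-line lambda, and a descriptive name instead of the builtin `all`
all_words = sorted(words + empty_words, key=lambda w: w.id)

# all_words ids: [(1,), (1, 1), (2,)]
```

Tuple comparison does the interleaving for free: `(1,)` is a proper prefix of `(1, 1)` and therefore sorts before it, so no special-casing of empty nodes is needed in the key.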

Collaborator:

ping about all and the multiline lambda here

Member Author:

sorry missed that, addressed in 6defc5c

if return_subwords:
    (nonblank_batches,
     nonblank_labels) = bert.get_subwords_batches(doc, self.config,
                                                  self.tokenizer, nonblank_only=True)
all_batches = bert.get_subwords_batches(doc, self.config, self.tokenizer)
Collaborator:

is there meant to be an else here? i was somewhat confused earlier when i saw the return format could be one of two types, which would be confusing for the caller... and here is a perhaps somewhat confusing usage of it

Collaborator:

partly i would wonder if this all still works the same when there are no zeros to train for


start += len(subwords[start:end])

if nonblank_only:
Collaborator:

not a huge fan of the return which could be 1 thing or 2. likely to confuse people
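The shape problem being flagged can be shown in isolation. This is a generic sketch, not stanza's actual `get_subwords_batches` signature: the function names and the `"_"` filtering are illustrative stand-ins for the real batching code.

```python
# a generic illustration of the review comment (names are hypothetical, not
# stanza's API): a function whose return *shape* depends on a flag forces
# every caller to remember which mode it was called in.

def get_batches_confusing(items, nonblank_only=False):
    batches = [i for i in items if i != "_"] if nonblank_only else list(items)
    if nonblank_only:
        labels = [i != "_" for i in items]
        return batches, labels          # two values in one mode...
    return batches                      # ...one value in the other

# returning the same shape regardless of the flag keeps callers simple:
def get_batches(items, nonblank_only=False):
    batches = [i for i in items if i != "_"] if nonblank_only else list(items)
    labels = [i != "_" for i in items]
    return batches, labels              # always a (batches, labels) pair
```

With the second version a caller can always write `batches, labels = get_batches(...)` and simply ignore `labels` when it is not needed.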

Member Author:

This is no longer the case

@Jemoka Jemoka force-pushed the corefud_v1.3 branch 2 times, most recently from 5639048 to 1d0863d Compare July 31, 2025 13:27
@Jemoka Jemoka requested a review from AngledLuffa August 3, 2025 18:09
@@ -38,6 +38,9 @@ def __init__(self, path, config, tokenizer):
word2subword = []
subwords = []
word_id = []
nonblank_subwords = [] # a list of subwords, skipping _
previous_was_blank = [] # was the word before _?
was_blank = False # a flag to set if we saw "_"
Collaborator:

maybe something other than `was` - such as `has` or `saw`?

Collaborator:

wait... maybe i'm not understanding correctly. i would say these names could use some documentation / editing and there should be some comments on why we're keeping track of this here

Member Author:

great catch; these fields are not needed anymore in the new implementation of zeros

)
logger.info(f"CoNLL-2012 3-Score Average : {w_checker.bakeoff:.5f}")
logger.info(f"Zero prediction accuracy: {z_correct / z_total:.5f}")
Collaborator:

in general, is this always on? i would think there will be datasets that don't have zeros

Member Author:

in general, reporting this shouldn't hurt: if the dataset has no zeros, all of doc["is_zero"] is False, which gives us 100% zeros accuracy and doesn't break any logging. Do you think we should handle those cases differently? The tricky part is that we currently have no way to tell whether a dataset has no zeros or just a batch has no zeros (which is quite likely, since zeros are relatively rare).

Collaborator:

in that case it doesn't matter too much, although i would think a higher level part of the routine could also look at the whole dataset and check if it has zeros or not. but not a big deal

Member Author (@Jemoka, Aug 13, 2025):

Sounds good; I would err on the side of "no", just because "100% zeros accuracy" is still technically correct and involves less post-processing. Your call though.

Collaborator:

well, no strong opinions, except that `/ z_total` is probably not ideal in the case of z_total == 0
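The guard the reviewer is suggesting would only be a couple of lines. A hedged sketch, using the `z_correct` / `z_total` names from this thread; the helper function and its message text are illustrative, not code from the PR:

```python
# a sketch of guarding the accuracy report against z_total == 0
# (hypothetical helper; the real code logs via logger.info):
def format_zero_accuracy(z_correct, z_total):
    if z_total == 0:
        # no zero targets seen at all: report n/a rather than divide by zero
        return "Zero prediction accuracy: n/a (no zero targets in dataset)"
    return f"Zero prediction accuracy: {z_correct / z_total:.5f}"
```

This keeps the "always report" behavior discussed above while making the degenerate empty-corpus state print something sensible instead of raising `ZeroDivisionError`.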

Member Author:

We have:

                zero_targets = torch.tensor(doc["is_zero"], device=res.zero_scores.device)
                z_total += zero_targets.numel()

so the only situation in which z_total would be zero is when doc["is_zero"] is empty for the entire corpus (i.e., the corpus has no length); that would be a bad state and is not normally possible.

Collaborator:

ah, will z_correct include documents correctly predicted to have 0 zeros?

'train_c_loss': c_loss.item(),
'train_s_loss': s_loss.item(),
}
if z_loss:
Collaborator:

maybe just keep using res.zero_scores.size(0) == 0? i'm picturing a weird edge case where the learning has gone wrong and the losses are all 0, or maybe a weird case where the prediction is so close to the correct value that there's a 0 loss
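The edge case the reviewer pictures can be demonstrated without torch. The variable names below are stand-ins for the thread's `z_loss` and `res.zero_scores.size(0)`; the point is that a float loss of exactly 0.0 is falsy in Python, so a truthiness check conflates "no zero head at all" with "the zero head predicted perfectly":

```python
# illustrating the concern: a loss of exactly 0.0 is falsy, so `if z_loss:`
# silently skips the logging branch even though zeros were present.
z_loss = 0.0              # e.g. predictions happened to match targets exactly
if z_loss:
    logged = True
else:
    logged = False        # skipped, despite a zero head being trained

# checking whether any zero scores existed is unambiguous:
num_zero_scores = 5       # stand-in for res.zero_scores.size(0)
should_log = num_zero_scores > 0
```

That is why gating on the score tensor's size, rather than on the loss value, is the safer condition.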

Member Author (@Jemoka, Aug 6, 2025):

done. 9331fed

# crap! there's two zeros right next to each other
# we are sad and confused so we give up in this case
if len(sentence_text) > span[1] and sentence_text[span[1]] == "_":
    warnings.warn("Found two zeros next to each other in sequence; we are confused and therefore giving up.")
Collaborator:

does this ever happen?

Member Author:

Yup! I think I found it in at least Catalan or Spanish

Collaborator:

makes sense. can also post an issue with the dataset maintainers

Member Author:

@amir-zeldes and I had a bunch of fun sitting around at ACL trying to come up with cases where this may happen; it seems like it's not totally impossible, so it's likely not a dataset error. Would love feedback on this front though

@amir-zeldes (Aug 13, 2025):

I don't think it's common in CorefUD, if it's even attested(?), but conceptually it could happen, for example in an SOV language when both the subject and object are dropped. For example in Japanese:

  • [okāsan]1 wa [yasai]2 o katta no? "Did mom buy vegetables?"
  • [_]1 [_]2 katta yo! "Yes she did!" (lit. "Did!")

    candidate_head = span[1]
else:
    try:
        candidate_head = find_cconj_head(sentence_heads, sentence_upos, span[1], span[2]+1)
Collaborator:

does this ever happen? i don't quite remember the find_cconj_head method without looking it up

Member Author:

Yes, it was inserted because the find_cconj_head function otherwise enters an infinite loop in _get_depth_recursive. This is because the relative head of a zero is None (since it's not a word with an annotated dependency); we never saw this before because only verbs (i.e., sentence roots) have None heads, and verbs are not coreferent.
We can either have a rule where, if it's a zero, we mark the head as the existing head; or we can optimistically run the algorithm and give up if it runs for too long, as done here
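The "give up if it runs for too long" option amounts to bounding the walk up the head chain. A minimal sketch under stated assumptions: this is not stanza's `_get_depth_recursive`, and the flat `heads` list (index of each word's head, `None` for the root or a zero) is a simplification invented for the example.

```python
# hypothetical depth computation that gives up instead of looping forever
# when the head chain is broken by a None head (as with zeros) or a cycle.
def get_depth(heads, idx, max_depth=1000):
    """heads[i] is the index of word i's head, or None for the root / a zero."""
    depth = 0
    while heads[idx] is not None:
        idx = heads[idx]
        depth += 1
        if depth > max_depth:
            # a well-formed tree can never be deeper than its word count,
            # so exceeding the bound means a cycle or malformed annotation
            raise RecursionError("head chain too long; giving up on this span")
    return depth
```

A caller can then wrap the lookup in `try`/`except RecursionError` and fall back to the existing head, mirroring the try block shown in the excerpt above.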

if candidate_head is None:
    for candidate_head in range(span[1], span[2] + 1):
        # stanza uses 0 to mark the head, whereas OntoNotes is counting
        # words from 0, so we have to subtract 1 from the stanza heads
        #print(span, candidate_head, parsed_sentence.words[candidate_head].head - 1)
        # treat the head of the phrase as the first word that has a head outside the phrase
        if (parsed_sentence.words[candidate_head].head - 1 < span[1] or
                parsed_sentence.words[candidate_head].head - 1 > span[2]):
        if parsed_sentence.all_words[candidate_head].head and (
Collaborator:

what does this do with 0 for roots? am i missing something about the root possibility?

Member Author:

`parsed_sentence.all_words[candidate_head].head` being None implies that the word is a zero, which at this point would be a bad state (i.e., somehow it didn't trigger a recursion error with the dynamic depth, yet is also a zero / a verb?)

Collaborator:

what i was thinking was: `.head` is checking truthiness, which would also fail on 0 for root. if the diagnostic test for it being a zero is a None head, how about an explicit `is None` / `is not None` check instead of just checking its truthiness?
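The distinction being raised fits in a few lines. Stanza uses `head == 0` for the root (that convention is stated in the code comment quoted above), and `0` is falsy in Python, so a bare truthiness check cannot tell a root apart from a missing head:

```python
# truthiness vs. identity check on a head value:
root_head = 0       # stanza's marker for the sentence root
zero_head = None    # a zero node with no annotated head

# truthiness treats both the same way - both are falsy
assert not root_head and not zero_head

# `is None` separates the two cases
assert root_head is not None   # root: a real head value, just 0
assert zero_head is None       # zero: genuinely missing
```

Hence the suggestion: when "is this a zero?" is the question, test `head is None` rather than `if head:`.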

Member Author:

done, patched with 4d14f4d

@Jemoka Jemoka marked this pull request as ready for review August 6, 2025 04:57
@Jemoka Jemoka requested review from Copilot and AngledLuffa and removed request for Copilot August 6, 2025 05:02
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR adds support for zero-node annotation in CorefUD v1.3, enabling the coreference system to recognize and handle zero anaphora (empty pronouns marked with underscores). The implementation includes a new zero anaphora predictor component and comprehensive handling throughout the coreference pipeline.

  • Adds zero anaphora detection and handling capabilities
  • Updates data processing to work with CorefUD v1.3 format
  • Implements zero node creation and coreference cluster integration

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Summary per file:
stanza/utils/datasets/coref/convert_udcoref.py Updates dataset conversion to handle zero tokens and CorefUD v1.3 format
stanza/pipeline/coref_processor.py Adds zero anaphora handling in the main processing pipeline
stanza/models/coref/utils.py Implements sigmoid focal loss function for zero prediction training
stanza/models/coref/model.py Integrates zero predictor component and training logic
stanza/models/coref/dataset.py Minor formatting updates
stanza/models/coref/const.py Adds zero_scores field to CorefResult
stanza/models/common/doc.py Extends document model to support zero nodes and all_words property
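The summary above mentions a sigmoid focal loss for zero-prediction training. The PR's actual implementation is in `stanza/models/coref/utils.py` and is not reproduced here; the following is a hedged reference sketch of the standard sigmoid focal loss (Lin et al., 2017) for a single example, written in pure Python so it is self-contained. The `alpha`/`gamma` defaults are the common ones from the paper, not necessarily the PR's settings.

```python
import math

def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Standard sigmoid focal loss for one binary example (a sketch, not
    stanza's code). alpha balances the positive/negative classes; gamma
    down-weights easy, well-classified examples - useful when positives
    (zeros) are rare, as discussed in this thread."""
    p = 1.0 / (1.0 + math.exp(-logit))          # sigmoid probability
    p_t = p if target == 1 else 1.0 - p         # prob of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    ce = -math.log(max(p_t, 1e-12))             # clamped cross-entropy
    return alpha_t * ((1.0 - p_t) ** gamma) * ce
```

With `gamma=0` and `alpha=1` this reduces to plain binary cross-entropy; raising `gamma` shrinks the loss of confident correct predictions, which is the behavior that makes it a common choice for rare-positive problems like zero anaphora detection.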
Comments suppressed due to low confidence (1)

stanza/models/common/doc.py:750

  • The variable name all is ambiguous and shadows the built-in all() function. Consider renaming to all_words_sorted or similar.
      all = sorted(words + empty_words, key=lambda x: (x.id,))
