[wip] Corefud v1.3 #1502
base: dev
Conversation
stanza/models/coref/bert.py
Outdated
```python
curr_id = -1
curr_number = -1

list = []
```
`list` shadows a builtin - better to use something else. not a fan of `l` as a name either. it's only used in a short function, but still, it makes it very difficult to search
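For illustration, the rename might look like this (a sketch; `subword_spans` is a hypothetical name, since the diff doesn't show what the list actually holds):

```python
curr_id = -1
curr_number = -1

subword_spans = []  # hypothetical name; avoids shadowing the builtin `list`
```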
stanza/models/common/doc.py
Outdated
```python
words = self._words
empty_words = self._empty_words

all = sorted(words + empty_words, key=lambda x:(x.id,)
```
make the lambda one line? it takes a moment to realize this is one thing across two lines
also `all` is a builtin (and yes, i know `id` is as well - that wasn't my choice)
ping about `all` and the multiline lambda here
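For illustration, the one-line form with a builtin-safe name might read as below (a sketch only: `all_words` is an assumed replacement name, and this assumes the sort key really is just `(x.id,)` - the second line of the original lambda is not shown in the diff):

```python
all_words = sorted(words + empty_words, key=lambda x: (x.id,))
```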
sorry missed that, addressed in 6defc5c
stanza/models/coref/model.py
Outdated
```python
if return_subwords:
    (nonblank_batches,
     nonblank_labels) = bert.get_subwords_batches(doc, self.config,
                                                  self.tokenizer, nonblank_only=True)
all_batches = bert.get_subwords_batches(doc, self.config, self.tokenizer)
```
is there meant to be an `else` here? i was somewhat confused earlier when i saw the return format could be one of two types, which would be confusing for the caller... and here is a perhaps somewhat confusing usage of it
partly i would wonder if this all still works the same when there are no zeros to train for
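For illustration, one way to keep the call sites uniform would be for `get_subwords_batches` to always return a pair (a sketch, not the PR's actual resolution; `_build_batches` is a hypothetical helper standing in for the real logic):

```python
# sketch: always return a (batches, labels) pair, with labels None when not
# requested, so callers never have to branch on the return type
def get_subwords_batches(doc, config, tokenizer, nonblank_only=False):
    batches, labels = _build_batches(doc, config, tokenizer, nonblank_only)
    return batches, (labels if nonblank_only else None)
```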
stanza/models/coref/bert.py
Outdated
```python
start += len(subwords[start:end])

if nonblank_only:
```
not a huge fan of the return which could be 1 thing or 2. likely to confuse people
This is no longer the case as of the changes from 5639048 to 1d0863d.
stanza/models/coref/dataset.py
Outdated
```diff
@@ -38,6 +38,9 @@ def __init__(self, path, config, tokenizer):
     word2subword = []
     subwords = []
     word_id = []
+    nonblank_subwords = []   # a list of subwords, skipping _
+    previous_was_blank = []  # was the word before _?
+    was_blank = False        # a flag to set if we saw "_"
```
maybe something other than `was`, such as `has` or `saw`?
wait... maybe i'm not understanding correctly. i would say these names could use some documentation / editing and there should be some comments on why we're keeping track of this here
great catch; these fields are not needed anymore in the new implementation of zeros
```python
)
logger.info(f"CoNLL-2012 3-Score Average : {w_checker.bakeoff:.5f}")
logger.info(f"Zero prediction accuracy: {z_correct / z_total:.5f}")
```
in general, is this always on? i would think there will be datasets that don't have zeros
in general, reporting this shouldn't hurt, since all we'll have in that case is that all of `doc["is_zero"]` is `False`. Hence, that will give us 100% zeros accuracy, and not break any logging. Do you think we should handle those cases differently? The tricky part is that we currently have no way to tell if a dataset has no zeros, or if a batch has no zeros (which is quite likely, since zeros are relatively rare).
in that case it doesn't matter too much, although i would think a higher level part of the routine could also look at the whole dataset and check if it has zeros or not. but not a big deal
Sounds good; I would err on the side of "no", just because having "100% zeros accuracy" is technically correct still + involves less post-processing. Your call though.
well, no strong opinions except that `/ z_total` is probably not ideal in the case of `z_total == 0`
We have:

```python
zero_targets = torch.tensor(doc["is_zero"], device=res.zero_scores.device)
z_total += zero_targets.numel()
```

so the only situation in which `z_total` would be 0 is the one where the number of elements in `doc["is_zero"]` is zero for the entire corpus (i.e., the corpus has no length); this would be a bad state and not usually possible.
ah, will z_correct include documents correctly predicted to have 0 zeros?
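For reference, a small guard over the logging line from this thread would sidestep the `z_total == 0` division entirely (a sketch, not the PR's code):

```python
if z_total > 0:
    logger.info(f"Zero prediction accuracy: {z_correct / z_total:.5f}")
else:
    logger.info("Zero prediction accuracy: n/a (no zero targets seen)")
```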
stanza/models/coref/model.py
Outdated
```python
    'train_c_loss': c_loss.item(),
    'train_s_loss': s_loss.item(),
}
if z_loss:
```
maybe just keep using `res.zero_scores.size(0) == 0`? i'm picturing a weird edge case where the learning has gone wrong and the losses are all 0, or maybe a weird case where the prediction is so close to the correct value that there's a 0 loss
done. 9331fed
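The suggested check would presumably read along these lines (a sketch; the actual change is in 9331fed, and `stats` is an assumed name for the logging dict):

```python
# test for the presence of zero targets rather than `if z_loss:`,
# which would also skip a loss that is legitimately exactly 0.0
if res.zero_scores.size(0) > 0:
    stats['train_z_loss'] = z_loss.item()
```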
```python
# crap! there's two zeros right next to each other
# we are sad and confused so we give up in this case
if len(sentence_text) > span[1] and sentence_text[span[1]] == "_":
    warnings.warn("Found two zeros next to each other in sequence; we are confused and therefore giving up.")
```
does this ever happen?
Yup! I think I found it in at least Catalan or Spanish
makes sense. can also post an issue with the dataset maintainers
@amir-zeldes and I had a bunch of fun sitting around at ACL trying to come up with cases where this may happen; it seems like it's not totally impossible, so likely not a dataset error. Would love feedback on this front though
I don't think it's common in CorefUD, if it's even attested(?), but conceptually it could happen, for example in an SOV language when both the subject and object are dropped. For example in Japanese:

- [okāsan]1 wa [yasai]2 o katta no? "Did mom buy vegetables?"
- [`_`]1 [`_`]2 katta yo! "Yes she did!" (lit. "Did!")
```python
    candidate_head = span[1]
else:
    try:
        candidate_head = find_cconj_head(sentence_heads, sentence_upos, span[1], span[2]+1)
```
does this ever happen? i don't quite remember the `find_cconj_head` method without looking it up
Yes, it was inserted because for some reason the find cconj head function gets into an infinite loop in `_get_depth_recursive`. This is because the relative head of a zero is `None` (since it's not a word with an annotated dependent); we never saw this before because only verbs (i.e., sentence roots) have `None` heads, and verbs are not coreferent.

We can either have a rule where, if it's a zero, we mark the head as the existing head; or we can try to altruistically run the algorithm and give up if it runs for too long, as here
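The "run it with a cap and give up" option could be sketched as an iterative, depth-capped walk (names and conventions here are assumed; this is not the actual stanza implementation):

```python
def get_depth_capped(heads, idx, max_depth=1000):
    """Walk up the head chain from word idx, giving up after max_depth steps.

    Convention assumed for this sketch: heads[i] is the head index of word i,
    with 0 for the root and None for a zero with no annotated head.
    """
    depth = 0
    while idx is not None and idx != 0:
        if depth >= max_depth:
            # a chain this long almost certainly means a cycle in the head
            # annotations; give up rather than walking forever
            raise RecursionError("head chain exceeded max_depth; giving up")
        idx = heads[idx]
        depth += 1
    return depth
```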
```diff
 if candidate_head is None:
     for candidate_head in range(span[1], span[2] + 1):
         # stanza uses 0 to mark the head, whereas OntoNotes is counting
         # words from 0, so we have to subtract 1 from the stanza heads
         #print(span, candidate_head, parsed_sentence.words[candidate_head].head - 1)
         # treat the head of the phrase as the first word that has a head outside the phrase
-        if (parsed_sentence.words[candidate_head].head - 1 < span[1] or
-                parsed_sentence.words[candidate_head].head - 1 > span[2]):
+        if parsed_sentence.all_words[candidate_head].head and (
```
what does this do with 0 for roots? am i missing something about the root possibility?
`parsed_sentence.all_words[candidate_head].head = None` implies that the word is a zero, which at this point would be a bad state (i.e., somehow it didn't trigger a recursion error with the dynamic depth, but also is a zero/a verb?)
what i was thinking was that `.head` is checking truthiness, which would also fail on 0 for root. if the diagnostic test for it being a zero is `None`, how about `is None` instead of just checking its truthiness?
done, patched with 4d14f4d
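For reference, the `is not None` form of that condition would look something like this (a sketch; the actual patch is in 4d14f4d):

```python
# `is not None` distinguishes a zero (head is None) from the root (head == 0),
# which a plain truthiness check on .head would conflate
if parsed_sentence.all_words[candidate_head].head is not None and (
        parsed_sentence.all_words[candidate_head].head - 1 < span[1] or
        parsed_sentence.all_words[candidate_head].head - 1 > span[2]):
    ...
```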
Pull Request Overview
This PR adds support for zero-node annotation in CorefUD v1.3, enabling the coreference system to recognize and handle zero anaphora (empty pronouns marked with underscores). The implementation includes a new zero anaphora predictor component and comprehensive handling throughout the coreference pipeline.
- Adds zero anaphora detection and handling capabilities
- Updates data processing to work with CorefUD v1.3 format
- Implements zero node creation and coreference cluster integration
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| stanza/utils/datasets/coref/convert_udcoref.py | Updates dataset conversion to handle zero tokens and CorefUD v1.3 format |
| stanza/pipeline/coref_processor.py | Adds zero anaphora handling in the main processing pipeline |
| stanza/models/coref/utils.py | Implements the sigmoid focal loss function for zero prediction training (see the sketch below) |
| stanza/models/coref/model.py | Integrates the zero predictor component and training logic |
| stanza/models/coref/dataset.py | Minor formatting updates |
| stanza/models/coref/const.py | Adds a zero_scores field to CorefResult |
| stanza/models/common/doc.py | Extends the document model to support zero nodes and an all_words property |
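For context, sigmoid focal loss (Lin et al., 2017) in its standard form is sketched below; the PR's actual implementation in stanza/models/coref/utils.py may differ in details such as reduction or default hyperparameters:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard sigmoid focal loss: down-weights easy examples so training
    focuses on the rare positives (here, the rare zero-anaphora targets)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)   # probability of the true class
    loss = ce * (1 - p_t) ** gamma                # focusing term
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss                     # class-balance term
    return loss.mean()
```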
Comments suppressed due to low confidence (1)

stanza/models/common/doc.py:750

- The variable name `all` is ambiguous and shadows the built-in `all()` function. Consider renaming to `all_words_sorted` or similar.

```python
all = sorted(words + empty_words, key=lambda x:(x.id,)
```
- new language groups for coref 1.3
- handling of underscore forms
- ...also support for zero-node annotation! So for a sentence with underscores in it, the system would actually be able to recognize one as a possible coreferent (i.e. zero anaphora) and mark it if needed.