Commit 58bca82: Merge branch 'hierarchical-coref'
2 parents: 89c0eac + e350fc3

File tree

6 files changed: +609 −176 lines changed

README.md

Lines changed: 37 additions & 6 deletions
@@ -1,16 +1,12 @@
-# Current Status
-
-**Note that this is an early release. Don't hesitate to report bugs/possible improvements! There are surely many.**
-
-
 # Tibert
 
 `Tibert` is a transformers-compatible reproduction of the model from the paper [End-to-end Neural Coreference Resolution](https://aclanthology.org/D17-1018/) with several modifications. Among these:
 
 - Usage of BERT (or any BERT variant) as an encoder as in [BERT for Coreference Resolution: Baselines and Analysis](https://aclanthology.org/D19-1588/)
 - batch size can be greater than 1
 - Support of singletons as in [Adapted End-to-End Coreference Resolution System for Anaphoric Identities in Dialogues](https://aclanthology.org/2021.codi-sharedtask.6)
-
+- Hierarchical merging as in [Coreference in Long Documents using Hierarchical Entity Merging](https://aclanthology.org/2024.latechclfl-1.2/)
+
 
 It can be installed with `pip install tibert`.
 

@@ -90,6 +86,41 @@ print(annotated_doc.coref_chains)
 `>>>[[Mention(tokens=['The', 'princess'], start_idx=11, end_idx=13), Mention(tokens=['Princess', 'Liana'], start_idx=0, end_idx=2)], [Mention(tokens=['Zarth', 'Arn'], start_idx=6, end_idx=8)]]`
 
 
+## Hierarchical Merging
+
+Hierarchical merging reduces RAM usage and computation when running inference on long documents. The user provides the text cut into chunks; the model predicts coreference for each chunk separately, so the whole document never has to be held in memory at once. Hierarchical merging then merges the per-chunk predictions, which allows scaling to arbitrarily long documents. See [Coreference in Long Documents using Hierarchical Entity Merging](https://aclanthology.org/2024.latechclfl-1.2/) for more details.
+
+Hierarchical merging can be used as follows:
+
+```python
+from tibert import BertForCoreferenceResolution, predict_coref
+from tibert.utils import pprint_coreference_document
+from transformers import BertTokenizerFast
+
+model = BertForCoreferenceResolution.from_pretrained(
+    "compnet-renard/bert-base-cased-literary-coref"
+)
+tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
+
+chunk1 = "Princess Liana felt sad, because Zarth Arn was gone."
+chunk2 = "She went to sleep."
+
+annotated_doc = predict_coref(
+    [chunk1, chunk2], model, tokenizer, hierarchical_merging=True
+)
+
+pprint_coreference_document(annotated_doc)
+```
+
+This results in:
+
+`>>>(1 Princess Liana ) felt sad , because (0 Zarth Arn ) was gone . (1 She ) went to sleep .`
+
+Even though the mentions `Princess Liana` and `She` are not in the same chunk, hierarchical merging still resolves this case correctly.
+
+*Note that, at the time of writing, the performance of the hierarchical merging feature has not been benchmarked.*
+
+
 ## Training a model
 
 Aside from the `tibert.train.train_coref_model` function, it is possible to train a model from the command line. Training a model requires installing the `sacred` library. Here is the most basic example:
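The hierarchical merging API above expects the caller to supply the document already cut into chunks, and the README does not prescribe a chunking strategy. A minimal sentence-based chunker, a hypothetical helper that is not part of `tibert`, might look like:

```python
import re

def split_into_chunks(text, max_chars=1000):
    """Greedily pack sentences into chunks of at most max_chars characters.

    Hypothetical helper: tibert leaves the chunking strategy to the caller;
    any splitting that keeps chunks reasonably sized would work.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        # Start a new chunk when adding this sentence would overflow.
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = split_into_chunks(
    "Princess Liana felt sad, because Zarth Arn was gone. She went to sleep.",
    max_chars=60,
)
print(chunks)  # two chunks, split at the sentence boundary
```

The resulting list can then be passed directly as the first argument of `predict_coref(chunks, model, tokenizer, hierarchical_merging=True)`.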

tests/test_bertcoref.py

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@ def test_doc_is_reconstructed(
     print(prep_doc)
     collator = DataCollatorForSpanClassification(bert_tokenizer, max_span_size)
     batch = collator([batch])
-    reconstructed_doc = prep_doc.from_wpieced_to_tokenized(doc.tokens, batch, 0)
+    seq_size = batch["input_ids"].shape[1]
+    wp_to_token = [batch.token_to_word(0, token_index=i) for i in range(seq_size)]
+    reconstructed_doc = prep_doc.from_wpieced_to_tokenized(doc.tokens, wp_to_token)
 
     assert doc.tokens == reconstructed_doc.tokens
     assert doc.coref_chains == reconstructed_doc.coref_chains
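The updated test builds `wp_to_token`, a list mapping each wordpiece index back to the index of the token it came from, via the fast tokenizer's `token_to_word`. The shape of that mapping can be illustrated without loading a tokenizer; this is a simplified standalone sketch assuming BERT-style `##` continuation pieces and `[CLS]`/`[SEP]` specials, not tibert's actual code:

```python
def wordpiece_to_token_map(wordpieces):
    """Map each wordpiece index to the index of its originating token,
    with None for special tokens, mimicking the wp_to_token list the
    test builds via BatchEncoding.token_to_word.
    Simplified sketch: real code should rely on the tokenizer's mapping.
    """
    mapping, token_idx = [], -1
    for wp in wordpieces:
        if wp in ("[CLS]", "[SEP]", "[PAD]"):
            mapping.append(None)
        elif wp.startswith("##"):
            # Continuation piece: still part of the current token.
            mapping.append(token_idx)
        else:
            # A new token starts here.
            token_idx += 1
            mapping.append(token_idx)
    return mapping

print(wordpiece_to_token_map(["[CLS]", "Za", "##rth", "Ar", "##n", "[SEP]"]))
# → [None, 0, 0, 1, 1, None]
```

This is the structure `from_wpieced_to_tokenized` now receives in place of the raw batch, decoupling it from the tokenizer's internals.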
