Coreference Training for Biomedical Text. #13160
-
Training the span resolver is one of the later steps; are you running the …
-
Yeah, I am following the training steps outlined in the guide.
I concur that a dataset of 30 examples is insufficient. However, before we embark on creating a larger training set, it's crucial to first confirm that we can successfully train the model. Creating coreference labels is quite time-consuming, so verifying that the approach works beforehand is important. Here is the full output of the training steps:
-
Well, running coref_clusters …
It sounds like I at least have some values in the COREF_R column of the training output. I assume that to achieve proper training, I need to create a larger dataset.
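For anyone checking the same thing, here is a minimal sketch for inspecting predicted clusters. The model path and example sentence are placeholders; the doc.spans key prefix coref_clusters is the default used by the experimental coref component.

```python
import spacy

# Load the assembled pipeline (path is a placeholder; use your own output dir).
nlp = spacy.load("training/coref/model-best")

doc = nlp("The patient was started on metformin. She tolerated the drug well.")

# The experimental coref component writes each predicted cluster into
# doc.spans under keys like "coref_clusters_1", "coref_clusters_2", ...
for key, spans in doc.spans.items():
    if key.startswith("coref_clusters"):
        print(key, "->", [span.text for span in spans])
```

If nothing prints, the component made no predictions for that text, which matches the behaviour described above.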
-
I am attempting to train a coreference model from scratch for biomedical content. To date, I have created approximately 30 annotated coreference examples using Prodigy for training purposes.
Below is the command sequence I utilized for Prodigy:
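A plausible sketch of such a sequence, assuming Prodigy's experimental coref.manual recipe; the dataset name, base model, and input file below are placeholders:

```sh
# Annotate coreference clusters manually (assumes the experimental
# coref.manual recipe available in recent Prodigy versions).
python -m prodigy coref.manual coref_biomed en_core_web_sm ./abstracts.jsonl --label COREF

# Export the annotations to spaCy's binary training format.
# Passing the coref dataset to data-to-spacy is assumed to be supported
# by the Prodigy version in use, as the workflow described here implies.
python -m prodigy data-to-spacy ./corpus --coref coref_biomed --eval-split 0.2
```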
For training, the only resource I have referenced so far is this guide.
In their example, they used "OntoNotes" for preprocessing and creating the spaCy format. However, I don't require this step, as I can create the spaCy format directly from the Prodigy annotations using "data-to-spacy". Consequently, I have excluded the "preprocess" workflow from my project. Here is the command sequence for training:
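For reference, a sketch of what this training sequence typically looks like, assuming the project follows the layout of spaCy's experimental coref project; the workflow step names are taken from that project's project.yml and may differ in a customized copy:

```sh
# Train the clustering component first.
python -m spacy project run train-cluster

# Prepare data for, then train, the span resolver
# (the later step mentioned in the reply above).
python -m spacy project run prep-span-data
python -m spacy project run train-span-resolver

# Assemble both trained components into a single pipeline.
python -m spacy project run assemble
```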
I am puzzled as to why only one epoch is displayed. Additionally, the model does not seem to produce any predictions. Could you offer any insights or suggestions?