Entity Linker Takes long time to initialize (transformer) #12364
-
A little more scientifically, I've added a print statement to the spaCy code to check the words/sec of the evaluation loop. The first time it runs, the speed is:
And then while training, the speed is dramatically faster:
Here is the code I added to print the speed, around L1415 of spacy/language.py:
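The snippet itself wasn't preserved in this thread. A minimal sketch of what timing code like this could look like (the helper name and call shape are illustrative, not spaCy internals):

```python
import time

def timed_call(n_words, fn, *args, **kwargs):
    """Run fn once and print its throughput in words/sec.

    `n_words` is the number of words the call processes; `fn` stands in
    for the evaluation step being measured (illustrative, not spaCy API).
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    rate = n_words / elapsed if elapsed > 0 else float("inf")
    print(f"{rate:.0f} words/sec")
    return result
```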
-
Hi Jerry, I know you were asking about issues with the [...]. I still wonder how big your KB is (vector length 300? number of entities/aliases? size on disk?), because I also notice in this config that you use a custom candidate generator.
Is it in fact the case that it takes long to start, or does one iteration take long? The difference is this: if it takes long to start, then the following iterations (lines printed on the console) will be faster. But if every iteration takes long, then it'll be a while before you see the first line of results on the console, and it will take an equally long time for the next line. It would be good to understand which of the two is happening to better identify likely culprits.

Have you been able to run this long enough to get at least one or two lines of results printed, and can you share those? One other idea is whether you can try setting [...].
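The distinction above (one-off startup cost vs. slow per-iteration cost) can be checked mechanically. A small, self-contained sketch, not spaCy code:

```python
import time

def iteration_times(iterable, n=3):
    """Measure the wall-clock gap before each of the first `n` items.

    A large first gap followed by small ones points to a one-off startup
    cost; uniformly large gaps point to a slow per-iteration cost.
    """
    deltas = []
    last = time.perf_counter()
    for i, _ in enumerate(iterable):
        now = time.perf_counter()
        deltas.append(now - last)
        last = now
        if i + 1 >= n:
            break
    return deltas
```

Wrapping the training batch iterator in something like this would show whether the 25 minutes is spent before the first batch or spread across all of them.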
-
Hey @jkgenser, thanks for all the details, that's very helpful!
You're right that this is unexpected and unideal. If you could share your project with us, we could look into it more - you can find my email address at https://explosion.ai/about, and of course we'd keep that entirely confidential and would only use the data/code for debugging.

With respect to the long initialization time, I haven't yet been able to identify a likely culprit. The main thing that happens before printing any train lines is assembling the dataset, which is why I asked whether setting [...].

From the thread, I understand that your custom candidate generator is likely not to blame, because you run into the same issues with the built-in one. Additionally, if you use a [...].

One hypothesis is that the EL needs to be further optimized for GPU usage, in particular by avoiding indexing into a GPU-allocated array during a loop. When you ran the [...].

Again, if you could share your project & data so I can rerun it and do some more experiments on both CPU & GPU, I'd love to dig into this further. Thanks for your patience!
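To illustrate the GPU hypothesis above: the anti-pattern is scalar indexing of a device-allocated array inside a Python loop. On a real GPU array (e.g. a cupy.ndarray), each index forces a separate device-to-host sync; NumPy stands in here so the sketch runs anywhere, and both functions are illustrative rather than spaCy code:

```python
import numpy as np

def scores_looped(sims):
    """Anti-pattern: per-element indexing (one sync per element on GPU)."""
    out = []
    for i in range(sims.shape[0]):
        out.append(float(sims[i]))
    return out

def scores_vectorized(sims):
    """A single bulk transfer replaces per-element indexing."""
    return np.asarray(sims).tolist()
```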
-
Hello -
I am running an entity linker model with the following pipeline:
transformer --> sentencizer --> ner --> entity_linker
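In spaCy's config format, that pipeline would correspond to something like the following sketch (only the pipeline ordering is taken from this thread; all other settings are omitted or illustrative):

```ini
[nlp]
lang = "en"
pipeline = ["transformer","sentencizer","ner","entity_linker"]
```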
When I start training, for some reason it takes a very long time for training to even begin. The operation spaCy is running during startup seems related to entity linking, since it prints some statements I have in my custom candidate generator. I'm using the same candidate generator with tok2vec models without this hold-up at the start of training, so I'm wondering if someone can help point me in the right direction on why it takes so long just to start.
Do I need to change something about my initial data? Below I've copied the output that shows the print outs during the initialization step. I've timed it and it takes 25 minutes before training even starts. Once training starts, I get 2-5 its/sec which is reasonable and expected given other transformers I've trained with spacy.
Any recommendations or questions to help diagnose why the entity linker init takes so long, and how to improve it, would be much appreciated.
One note is that I am using 1/12 of my dataset as an evaluation set. I see the span-candidate print-outs during evaluation, since I know that is when these candidate fns are called. However, it's curious to me that initialization takes much more than the evaluation time × 12: it takes about 30 seconds to run an evaluation on 1/12 of my dataset, but as I said earlier, about 25 minutes before the training loop actually starts.
Config file: