Custom model took more time to load than model standard pipeline #12111

Shiyinq · 2023-01-17T04:48:35Z


How to reproduce the behaviour

My custom model takes more than 10 seconds to load, but en_core_web_trf and en_core_web_lg take just a few seconds.
Here is a screenshot: [screenshot not preserved]

I then tried on my laptop and the result was the same: the custom model took more than 10 seconds to load.

How can I reduce the load time?

Your Environment

Google Colab:

Operating System: Ubuntu 18.04.6 LTS
Python Version Used: Python 3.8.16
spaCy Version Used: 3.4.4
Environment Information:

Laptop:

Operating System: Windows 10
Python Version Used: Python 3.10.5
spaCy Version Used: 3.4.4
Environment Information:

Answered by adrianeboyd


adrianeboyd · 2023-01-17T09:14:01Z


This is due to differences in the default English and Indonesian tokenizer settings. The Indonesian defaults include a large number of exceptions to handle cases like "aba-aba", which take longer to load.

If this is a major concern on your end, you can consider customizing the tokenizer settings (https://spacy.io/usage/training#custom-tokenizer), but changing the tokenization can cause misalignments with your training data that can have a big effect on the model performance, especially for token-level annotation like tags and parses. So keep an eye on the token_* scores for your training data while modifying this.
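One rough way to check this locally is to time pipeline construction and count the tokenizer exceptions per language. The sketch below assumes spaCy is installed; `timed` and `tokenizer_report` are hypothetical helper names, and blank pipelines stand in for the trained ones from the question.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokenizer_report(lang_codes=("en", "id")):
    """Build a blank pipeline per language and report its setup time
    and the number of tokenizer exceptions (special cases)."""
    import spacy  # imported lazily so the timing helper works on its own

    report = {}
    for code in lang_codes:
        nlp, seconds = timed(spacy.blank, code)
        # Tokenizer.rules maps each exception string to its token attributes
        report[code] = {
            "seconds": seconds,
            "exceptions": len(nlp.tokenizer.rules),
        }
    return report
```

Calling `tokenizer_report()` returns one entry per language code. The first call may also include one-off import time, so it is worth repeating the measurement before drawing conclusions.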

(As a side note, sys.getsizeof() isn't going to give you any useful info in this context. It's probably not measuring what you're trying to measure.)
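To illustrate the side note: `sys.getsizeof()` reports only the shallow size of the object it is given, not the size of the objects it references. A minimal sketch, where the `Wrapper` class is a hypothetical stand-in for an object that holds its data by reference:

```python
import sys


class Wrapper:
    """Hypothetical stand-in for an object that holds large data by reference."""

    def __init__(self, data):
        self.data = data


payload = list(range(100_000))
wrapper = Wrapper(payload)

# Size of the wrapper object itself; the list it points to is not counted
shallow = sys.getsizeof(wrapper)
# Size of the list's own array of references, still excluding the int objects
referenced = sys.getsizeof(payload)

print(f"wrapper: {shallow} bytes, list it holds: {referenced} bytes")
```

The wrapper reports a small size regardless of how much data it references, which is why the number says little about a loaded pipeline.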

7 replies

Shiyinq Jan 18, 2023
Author

When I train a model for Indonesian, I start with spacy.blank('id').
For multi-language, should I start with spacy.blank('xx') or with the trained pipeline (https://spacy.io/models/xx)?

I'm not only using textcat, though. In the future I will combine textcat and ner. Is it OK to use multi-language for that?

Shiyinq Jan 18, 2023
Author

What if I load every model first:

nlp1 = spacy.load("models/model_name1/models/model-best")
nlp2 = spacy.load("models/model_name2/models/model-best")

and then store nlp1 and nlp2 somewhere, maybe Redis?

That way the API endpoint would not load the model on every request, but would fetch it from storage such as Redis:

@router.post("/{model_name}/predict", tags=["Predict"])
def predict(model_name: str, body: Body):
    nlp = ...  # from somewhere the spaCy model is preloaded

    result = nlp(body.text)

    return result.cats

Is it possible to save the loaded models in something like Redis?

adrianeboyd Jan 18, 2023

You would use spacy.blank("xx"). Whether xx is okay depends entirely on the tokenization in your training data. For textcat, where the labels are on the doc and not the individual tokens, minor differences don't matter much as long as you train and run with the same tokenizer, but for annotation on individual tokens or spans you have to watch out for misalignments between the token boundaries in your training data and tokens from the tokenizer.
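One way to spot such misalignments before training is to check whether each annotated character span maps onto token boundaries. This is only a sketch: `find_misaligned_spans` is a hypothetical helper, and it assumes annotations are stored as `(start, end, label)` character offsets.

```python
def find_misaligned_spans(nlp, text, spans):
    """Return the (start, end, label) character spans that do not fall on
    token boundaries produced by nlp's tokenizer."""
    doc = nlp.make_doc(text)  # tokenization only, no pipeline components
    misaligned = []
    for start, end, label in spans:
        # Doc.char_span returns None when the offsets do not match token boundaries
        if doc.char_span(start, end) is None:
            misaligned.append((start, end, label))
    return misaligned
```

For example, passing `spacy.blank("xx")` and then `spacy.blank("id")` as `nlp` over the same annotated texts would show which spans each tokenizer fails to align.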

Usually you would want to have the models preloaded for your API if possible, to improve the speed.
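A minimal sketch of preloading, assuming the API runs as a long-lived process: each pipeline is loaded once and kept in memory, and the endpoint looks it up instead of calling spacy.load per request. `ModelRegistry` is a hypothetical helper, not part of spaCy or FastAPI.

```python
from threading import Lock


class ModelRegistry:
    """Load each pipeline at most once per process and reuse it across requests."""

    def __init__(self, loader=None):
        # loader is any callable mapping a path to a loaded pipeline;
        # when omitted, spacy.load is used (imported lazily)
        self._loader = loader
        self._models = {}
        self._lock = Lock()

    def _load(self, path):
        if self._loader is not None:
            return self._loader(path)
        import spacy

        return spacy.load(path)

    def get(self, path):
        """Return the cached pipeline for path, loading it on first use."""
        with self._lock:
            if path not in self._models:
                self._models[path] = self._load(path)
            return self._models[path]

    def preload(self, paths):
        """Load all given pipelines up front, for example at application startup."""
        for path in paths:
            self.get(path)
```

At startup you might call `registry.preload(["models/model_name1/models/model-best", "models/model_name2/models/model-best"])`, and inside the endpoint use `registry.get(...)` with the path for the requested model_name. Each worker process holds its own copy of the loaded pipelines.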

Shiyinq Jan 18, 2023
Author

Is it possible to save preloaded models in something like Redis, MySQL, or MongoDB?

Something like:

nlp = spacy.load("models/model_name/models/model-best")

save_to_db_mysql(nlp)
save_to_redis(nlp)
save_to_mongo(nlp)

adrianeboyd Jan 18, 2023

You can save the config + model bytes, here's what that looks like: https://spacy.io/usage/saving-loading#pipeline
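A sketch of that approach, following the pattern on the linked page: the config and the model bytes are stored as two values and the pipeline is rebuilt from them later. The helper names are hypothetical, and `store` is assumed to be any dict-like object standing in for Redis or a database; swap in your client's own set and get calls.

```python
def pipeline_to_blobs(nlp):
    """Split a loaded pipeline into two byte strings: its config and its data."""
    return nlp.config.to_bytes(), nlp.to_bytes()


def pipeline_from_blobs(config_bytes, model_bytes):
    """Rebuild a pipeline from the blobs produced by pipeline_to_blobs."""
    import spacy
    from thinc.api import Config

    config = Config().from_bytes(config_bytes)
    # Create an empty pipeline of the right language, then fill in the data
    lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
    nlp = lang_cls.from_config(config)
    nlp.from_bytes(model_bytes)
    return nlp


def save_pipeline(store, key, nlp):
    """Write both blobs to a dict-like store under keys derived from key."""
    config_bytes, model_bytes = pipeline_to_blobs(nlp)
    store[f"{key}:config"] = config_bytes
    store[f"{key}:model"] = model_bytes


def load_pipeline(store, key):
    """Read both blobs back from the store and rebuild the pipeline."""
    return pipeline_from_blobs(store[f"{key}:config"], store[f"{key}:model"])
```

Restoring from bytes still constructs the pipeline from its config, so this is probably better viewed as a storage and transport format than as a way to skip the load step. Keeping the loaded object in process memory is what avoids reloading on each request.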
