Loading Indonesian is very slow compared to other languages. #11080

geokaragiannis · 2022-07-05T08:48:44Z

How to reproduce the behaviour

Loading Indonesian is very slow compared to other languages.

import time
from spacy.lang.id import Indonesian

t1 = time.time()
nlp_id = Indonesian()
t2 = time.time()
print(t2 - t1)  # ~13 seconds

For other languages (English, Spanish, etc.), the same operation takes about 0.1 seconds or less.
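A comparison like the one above can be sketched with a small timing helper. This is only an illustration: `time_call` and `compare_blank_pipelines` are hypothetical names that do not come from spaCy, and the sketch assumes `spacy.blank(code)` is an acceptable stand-in for constructing the language class directly. The first `spacy.blank` call for a language also imports that language's module, so that cost is included in the measurement.

```python
import time
from typing import Callable, Dict, Iterable


def time_call(factory: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to `factory`."""
    start = time.perf_counter()
    factory()
    return time.perf_counter() - start


def compare_blank_pipelines(lang_codes: Iterable[str]) -> Dict[str, float]:
    """Time spacy.blank(code) once for each language code (hypothetical helper)."""
    # Imported here so the timing helper above has no spaCy dependency.
    import spacy

    return {
        code: time_call(lambda c=code: spacy.blank(c))
        for code in lang_codes
    }
```

With spaCy installed, calling `compare_blank_pipelines(["en", "es", "id"])` should return one duration per language code, which makes the gap reported above easy to check on another machine.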

Your Environment

Operating System: Mac OS Big Sur 11.2.1
Python Version Used: 3.7.1 & 3.8.2
spaCy Version Used: 2.2.4 & 3.3.1
Environment Information: macOS-10.16-x86_64-i386-64bit

Answered by adrianeboyd


adrianeboyd · 2022-07-05T11:26:05Z

Hi, the loading time depends on the language defaults. It looks like Indonesian has a large number of tokenizer exceptions, which take a while to load.

If you don't need these exceptions for your task, you could remove some of the exceptions from the language defaults (IndonesianDefaults.tokenizer_exceptions) before loading the pipeline to reduce the loading time. If you save this pipeline with nlp.to_disk() it will only include the modified exceptions so you can load it directly without having to make the same modifications each time.
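A minimal sketch of that workflow, assuming spaCy v3 and that dropping every Indonesian tokenizer exception is acceptable for the task. The directory name `id_minimal` is made up for illustration, and the sketch only checks that the saved pipeline keeps the modified exceptions; it does not measure loading time.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.lang.id import IndonesianDefaults

# Assumption: no tokenizer exceptions are needed for the task.
# This changes the class-level defaults for the whole Python process.
IndonesianDefaults.tokenizer_exceptions = {}

# Build the blank pipeline only after the defaults have been edited.
nlp = spacy.blank("id")

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "id_minimal"  # hypothetical directory name
    nlp.to_disk(model_dir)

    # The saved tokenizer settings are restored from disk here.
    reloaded = spacy.load(model_dir)

print(len(nlp.tokenizer.rules), len(reloaded.tokenizer.rules))
```

In real use the pipeline would be saved to a permanent directory instead of a temporary one, so that later runs can call `spacy.load` on it directly.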

0 replies

geokaragiannis · 2022-07-06T07:07:28Z

geokaragiannis
Jul 6, 2022
Author

Hi @adrianeboyd, thank you for your response. Is there a way to remove the exceptions without using nlp.to_disk(), i.e. by just removing the Indonesian tokenizer exceptions before calling nlp_id = Indonesian()? Thank you!

2 replies

adrianeboyd Jul 6, 2022

You can just edit the Indonesian defaults before loading, but I'm not sure it will be faster overall if you're deleting a bunch of individual exceptions:

import spacy
from spacy.lang.id import IndonesianDefaults

# delete as many individual exceptions as you want
del IndonesianDefaults.tokenizer_exceptions["ular-ular"]

# or delete all the exceptions
IndonesianDefaults.tokenizer_exceptions = {}

# create the pipeline only after editing the defaults
nlp = spacy.blank("id")

Obviously the tokenization will change as a result, so make sure it's still okay for your task.

geokaragiannis Jul 6, 2022
Author

Thanks a lot @adrianeboyd!
