This is due to differences in the default English and Indonesian tokenizer settings. The Indonesian defaults include a large number of exceptions to handle cases like "aba-aba", which take longer to load.

If this is a major concern on your end, you can customize the tokenizer settings (https://spacy.io/usage/training#custom-tokenizer). Be aware that changing the tokenization can cause misalignments with your training data, which can have a big effect on model performance, especially for token-level annotation like tags and parses. Keep an eye on the token_* scores for your training data while modifying this.
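To make the misalignment concern concrete, here is a minimal sketch of how token-level precision, recall, and F-score can be computed by comparing character offsets. It is a simplified illustration in plain Python, not spaCy's actual scorer. The helper names `token_spans` and `token_prf`, the example sentence, and the assumed hyphen split in `pred` are all hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def token_spans(text: str, tokens: List[str]) -> Set[Span]:
    """Map each token to its (start, end) character offsets in `text`."""
    spans = set()
    pos = 0
    for tok in tokens:
        # Search from the end of the previous token; raises ValueError
        # if the token does not occur in the remaining text.
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.add((start, end))
        pos = end
    return spans


def token_prf(text: str, gold_tokens: List[str], pred_tokens: List[str]):
    """Return (precision, recall, F1) of predicted vs. gold token boundaries."""
    gold = token_spans(text, gold_tokens)
    pred = token_spans(text, pred_tokens)
    matched = len(gold & pred)  # tokens with identical start and end offsets
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: the training data keeps "aba-aba" as one token,
# while a tokenizer without that exception is assumed to split on the hyphen.
text = "aba-aba itu penting"
gold = ["aba-aba", "itu", "penting"]
pred = ["aba", "-", "aba", "itu", "penting"]

print(token_prf(text, gold, pred))
```

Only two of the five predicted tokens line up with the gold tokens here, so a single dropped exception already lowers the scores noticeably. Any tag or parse annotation attached to "aba-aba" in the training data would no longer map onto a single token.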

(As a side note, sys.getsizeof() isn't going to give you any useful info in…

Answer selected by adrianeboyd
Labels: lang / id (Indonesian language data and models), perf / speed (Performance: speed)
This discussion was converted from issue #12110 on January 17, 2023 09:14.