Double-check that you've included the vectors in initialize.vectors and that you've enabled them in the tok2vec with include_static_vectors = true?
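For reference, a minimal sketch of the two config sections involved, assuming the default tok2vec setup with a MultiHashEmbed embedding layer (the vectors value below is just a placeholder for your vectors package or path):

[initialize]
# placeholder: a pipeline package or directory that provides static vectors
vectors = "/path/to/vectors_model"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
include_static_vectors = true

The exact section paths depend on how your pipeline is set up, so adjust them to match your own config.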

For Dutch this isn't the issue, but for Italian the drop is probably mainly related to tokenization. The TOK score of 96-97 is relatively low and every tokenization error turns into at least one lemmatization error.

Docs in spaCy don't support multiword tokens, so in the provided trained pipelines and the examples in that blog post, we merge multiword tokens when converting UD corpora. We also group sentences into paragraph-sized chunks, so our typical conversion with both options is:

spacy convert -T -n 10 file.conllu .
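(Here -T / --merge-subtokens merges the multiword tokens back into single tokens, and -n 10 / --n-sents groups the output into docs of 10 sentences each.)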

There's a longer expl…
