`annotating_components` only sets annotation on the predicted docs, not on the reference docs. If you need sentence boundaries in `get_instances` for the reference docs, you have to set them separately before training, either directly in the saved `.spacy` annotation or with a custom corpus reader.
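
For the first option, a minimal sketch of rewriting a saved `.spacy` file so that its docs carry sentence boundaries might look like the following. The function name `add_sentence_boundaries` and the file paths are placeholders for illustration, and the rule-based `Sentencizer` is only one possible source of boundaries; it assumes the saved docs have no dependency parse that already defines sentences.

```python
from pathlib import Path

import spacy
from spacy.pipeline import Sentencizer
from spacy.tokens import DocBin


def add_sentence_boundaries(in_path: Path, out_path: Path) -> None:
    """Load a .spacy file, add sentence boundaries, and save a new file.

    Hypothetical helper: the name and paths are not part of spaCy.
    """
    nlp = spacy.blank("en")
    sentencizer = Sentencizer()
    docs = DocBin().from_disk(in_path).get_docs(nlp.vocab)
    # Calling the component on the existing Doc keeps the saved
    # tokenization and only sets the sentence starts.
    out = DocBin(docs=[sentencizer(doc) for doc in docs])
    out.to_disk(out_path)
```

Running this once over the training and dev files before `spacy train` means the reference docs already contain the boundaries when the corpus is read.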

For the custom reader, the tokenization from blank:en may not match the saved tokenization, so it is better to run the sentencizer on the gold Doc object itself rather than on gold.text.
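
A custom reader along these lines could be sketched as follows. The registry name `gold_sents_corpus.v1` and the helper `example_with_gold_sents` are made up for illustration, and the sketch assumes the predicted docs should reuse the gold tokenization instead of being re-tokenized from the text.

```python
from pathlib import Path
from typing import Callable, Iterable

import spacy
from spacy.language import Language
from spacy.pipeline import Sentencizer
from spacy.tokens import Doc, DocBin
from spacy.training import Example


def example_with_gold_sents(gold: Doc, sentencizer: Sentencizer) -> Example:
    """Build an Example whose reference doc has sentence boundaries."""
    # Predicted doc: same words and spaces as the gold doc, no annotation.
    predicted = Doc(
        gold.vocab,
        words=[t.text for t in gold],
        spaces=[bool(t.whitespace_) for t in gold],
    )
    # Run the sentencizer on the gold Doc itself, not on gold.text,
    # so the saved tokenization is preserved.
    reference = sentencizer(gold)
    return Example(predicted, reference)


# Hypothetical reader name; reference it from [corpora.*] in the config.
@spacy.registry.readers("gold_sents_corpus.v1")
def create_reader(path: Path) -> Callable[[Language], Iterable[Example]]:
    def read(nlp: Language) -> Iterable[Example]:
        sentencizer = Sentencizer()
        for gold in DocBin().from_disk(path).get_docs(nlp.vocab):
            yield example_with_gold_sents(gold, sentencizer)

    return read
```

The module that defines the reader has to be made available to training, for example through the `--code` argument of `spacy train`.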

For testing, you can also have the corpus reader add the sentence boundaries to the predicted docs, but in practice you would want a component in the pipeline that adds this or you wouldn't be able to run the comp…

Answer selected by Larsdegroot
Labels
feat / pipeline Feature: Processing pipeline and components
2 participants