Pretraining NER and Textcat components of Healthsea #10456

shrinidhin · 2022-03-08T06:43:50Z

shrinidhin
Mar 8, 2022

Hello! I had a few questions about the Healthsea project by spaCy. When pretraining to initialize the tok2vec layer of the NER and clausecat components:

1. If we are using a raw text JSON file for pretraining, is it required to use a different file for each component, or would the same file work for both?
2. Should the unannotated raw text used for pretraining be different from the training and dev sets?
3. I am currently building an entity + sentiment analyzer for news articles, using the logic used in Healthsea. So my raw text file will simply contain the raw text of the news articles, right?

Any help will be appreciated. Thank you!

Answered by thomashacker

thomashacker · 2022-03-09T10:13:53Z

thomashacker
Mar 9, 2022

Hello,

1. One training file should work for both components.
2. There's no strict rule. I'd say that it's fine to use the raw text from both the train and dev datasets.
3. Yes, a JSON/JSONL file with raw text should work.
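
For anyone wondering what such a raw text file might look like, here is a minimal sketch using only the Python standard library. It assumes the JSONL layout that spaCy's pretraining reader expects, one JSON object per line with a `"text"` key. The helper name `write_raw_text_jsonl`, the file name `raw_text.jsonl`, and the sample articles are hypothetical placeholders, not anything from Healthsea itself.

```python
import json
from pathlib import Path


def write_raw_text_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line and return the line count.

    Hypothetical helper for illustration; empty or whitespace-only
    strings are skipped so they do not end up as blank examples.
    """
    path = Path(path)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # Collapse newlines and repeated whitespace into single spaces.
            text = " ".join(text.split())
            if not text:
                continue
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Placeholder articles (assumption: your real data would be loaded elsewhere).
articles = [
    "Shares of the company rose after the earnings call.",
    "   ",
    "The minister\nannounced a new policy on Tuesday.",
]
n_written = write_raw_text_jsonl(articles, "raw_text.jsonl")
```

If your config was generated with pretraining enabled, the resulting file can usually be passed in as an override, along the lines of `python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text raw_text.jsonl`. The config and output paths here are placeholders, so adjust them to your project.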

1 reply

shrinidhin Mar 9, 2022
Author

Thank you!
