Input data for training textcat component in healthsea #10194

shrinidhin · 2022-02-02T08:16:51Z

shrinidhin
Feb 2, 2022

Hello!I am trying to use the approach used in Healthsea by spacy for a project. I am trying to understand what is the format for data to be included?I have gone through the annotation,json file and have observed that for statements with multiple entities, the tagging is somewhat like this:

Example 1: {"text":"great . great for colds , sore throat and very soothing for <CONDITION> ","meta":{"entities":"stomach_issues"},"_input_hash":1624872409,"_task_hash":1047418295,"options":[{"id":"POSITIVE","text":"POSITIVE","meta":"1.00"},{"id":"NEUTRAL","text":"NEUTRAL","meta":"0.00"},{"id":"NEGATIVE","text":"NEGATIVE","meta":"0.00"},{"id":"ANAMNESIS","text":"ANAMNESIS","meta":"0.00"}],"accept":["POSITIVE"],"_session_id":null,"_view_id":"choice","config":{"choice_style":"single"},"answer":"accept","_timestamp":1629206648}

So in a real word scenario where all the 3 entities i.e. colds, sore throat and stomach issues are present in a sentence and I would like to understand the sentiment of each, the clausecat will use the segmentation bit to split them into clauses and apply blinding logic.
In my case, I have large paragraphs of texts to perform the entity+sentiment analysis on. So what will be the annotation format?
There are some examples in the json file where all the entities are tagged together but the sentiment value is just one. For eg:

Example 2: {"text":"I used to have a slight <CONDITION> and since I have been taking Dr. Formulated Probiotics my <BENEFIT> improved significantly.","meta":{"entities":"[inmmune deficiency, immune systems'markers]"},"_input_hash":511398025,"_task_hash":51951885,"options":[{"id":"POSITIVE","text":"POSITIVE"},{"id":"NEGATIVE","text":"NEGATIVE"},{"id":"NEUTRAL","text":"NEUTRAL"},{"id":"ANAMNESIS","text":"ANAMNESIS"}],"_session_id":null,"_view_id":"choice","accept":["POSITIVE"],"config":{"choice_style":"single"},"answer":"accept","_timestamp":1626342559}

I understand from the above example that there are 2 entities here : a CONDITION and a BENEFIT. Both whose sentiment value is Positive. So for cases where sentiment of each entity is different, the annotation has to be in the format given in example 1?
Any inputs will be really appreciated. Thank you!

Answered by thomashacker

Feb 2, 2022

Hello,
Yes, you are right.
The Segmentation step tries to break down long documents/sentences into smaller chunks while maintaining their context:

"This is great for joint pain, but it also causes rashes" -> "This is great for joint pain", "It also causes rashes"

The segmented chunks make it easier for the textcat to predict the sentiment. As you already mentioned, what if the doc contains multiple entities with multiple sentiments even after segmentation? For this case, we try to use blinding:

"This is great for joint pain, rashes but not arthritis"
-> "This is great for <CONDITION>, rashes but not arthritis" (POSITIVE)
-> "This is great for  joint pain, <CONDITION> but not arthritis" (…

View full answer

thomashacker · 2022-02-02T15:19:21Z

thomashacker
Feb 2, 2022

Hello,
Yes, you are right.
The Segmentation step tries to break down long documents/sentences into smaller chunks while maintaining their context:

"This is great for joint pain, but it also causes rashes" -> "This is great for joint pain", "It also causes rashes"

The segmented chunks make it easier for the textcat to predict the sentiment. As you already mentioned, what if the doc contains multiple entities with multiple sentiments even after segmentation? For this case, we try to use blinding:

"This is great for joint pain, rashes but not arthritis"
-> "This is great for <CONDITION>, rashes but not arthritis" (POSITIVE)
-> "This is great for  joint pain, <CONDITION> but not arthritis" (POSITIVE)
-> "This is great for  joint pain, rashes but not <CONDITION>" (NEUTRAL)

For every entity found, the blinding step creates multiple versions of the doc with the specific entity blinded/replaced. The goal here is to "tell the textcat" for which entity the sentiment should focus on.

The data for annotation should be processed the same way as the textcat will receive them in training. This means, you'll have to segment your data first (Segmentation step) and then blind entities + create multiple versions if a sentence has multiple entities (Blinding step). To create the textcat data I trained the NER first and then used it together with the Segmentation & Blinding algorithm.

The second example that you show from the Healthsea dataset is actually wrong, for every annotated example there should only exist one blinded entity. I think, fixing that could potentially improve the overall performance of the Healthsea model 😄

1 reply

shrinidhin Feb 3, 2022
Author

@thomashacker This is super helpful!Thanks a lot for your explanation!

shrinidhin · 2022-02-17T15:51:13Z

shrinidhin
Feb 17, 2022
Author

Hello!So I am having a very absurd error. I have trained my pipeline with my custom data and have my final trained pipeline saved in the model_best folder. Now I am trying to use the trained model for predictions. I am loading the model as follows:

import spacy
path='entity+sentiment/clausecat/output/model-best/'
sentiment=spacy.load(path)

When I execute this, I get an error as follows:
ValueError: [E002] Can't find factory for 'benepar' for language English (en).

I then used import benepar but then I got a factory error for another component. Am I missing something here?I thought once you have trained your final pipeline, your custom components are already registered. I checked the final config.cfg file generated in the model_best folder and all details of components and their factory are correct. How can I fix this?Is there any additional code required?
Thank you!

4 replies

thomashacker Feb 17, 2022

You can use the spacy package command to package your model into an installable module in which the custom code is registered. Otherwise, you can share the factory error and we'll have a closer look.

shrinidhin Feb 17, 2022
Author

So my custom_code.py file contains:
import clause_segmentation import clausecat_component import clausecat_model import clausecat_reader import clause_aggregation import benepar
I package my model using the following command:

python -m spacy package output/model-best package -c custom_code.py

After installing the package with pip while trying to load the model, I get the following error:

ModuleNotFoundError: No module named 'clausecat_component'

I have verified that the file names are proper, the factory names included in the code and config file are the same.

thomashacker Feb 18, 2022

You can add multiple scripts to the -c parameter. For packaging the Healthsea model I used this command (full code)

python -m spacy package ${vars.models.model_clausecat} package --code scripts/clausecat/clause_aggregation.py,scripts/clausecat/clause_segmentation.py,scripts/clausecat/clausecat_component.py,scripts/clausecat/clausecat_model.py,scripts/clausecat/clausecat_reader.py --name healthsea --create-meta --build wheel --force

This way all the scripts are included in the package 🎉

shrinidhin Feb 20, 2022
Author

@thomashacker This is great!Thank you very much! Grateful for your help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Input data for training textcat component in healthsea #10194

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 5 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Input data for training textcat component in healthsea #10194

Uh oh!

shrinidhin Feb 2, 2022

Replies: 2 comments · 5 replies

Uh oh!

thomashacker Feb 2, 2022

Uh oh!

shrinidhin Feb 3, 2022 Author

Uh oh!

shrinidhin Feb 17, 2022 Author

Uh oh!

thomashacker Feb 17, 2022

Uh oh!

Uh oh!

shrinidhin Feb 17, 2022 Author

Uh oh!

thomashacker Feb 18, 2022

Uh oh!

shrinidhin Feb 20, 2022 Author

shrinidhin
Feb 2, 2022

Replies: 2 comments 5 replies

thomashacker
Feb 2, 2022

shrinidhin Feb 3, 2022
Author

shrinidhin
Feb 17, 2022
Author

shrinidhin Feb 17, 2022
Author

shrinidhin Feb 20, 2022
Author