Input data for training textcat component in healthsea #10194
-
Hello! I am trying to use the approach from Healthsea by spaCy for a project, and I am trying to understand the format of the data to be included. I have gone through the annotation.json file and have observed that for statements with multiple entities, the tagging is somewhat like this:
So in a real-world scenario where all three entities (colds, sore throat, and stomach issues) are present in a sentence and I would like to understand the sentiment of each, the clausecat will use the segmentation step to split the sentence into clauses and apply the blinding logic.
I understand from the above example that there are two entities here: a CONDITION and a BENEFIT, both of which have a Positive sentiment value. So for cases where the sentiment of each entity is different, does the annotation have to be in the format given in example 1?
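Purely as an illustration of that scenario, segmentation plus blinding would turn such a sentence into one training example per entity, roughly like the sketch below. The field names, category labels, and the `<CONDITION>` placeholder are assumptions for this example, not the actual Healthsea annotation schema:

```python
# Hypothetical sketch only: field names, category labels, and the <CONDITION>
# placeholder are assumptions, not the actual Healthsea annotation format.
sentence = "This product helped my colds and sore throat but caused stomach issues."

# After segmentation into clauses and blinding one entity at a time,
# each entity gets its own example with that entity replaced by a placeholder:
blinded_examples = [
    {"text": "This product helped my <CONDITION> and sore throat", "cats": {"POSITIVE": 1.0}},
    {"text": "This product helped my colds and <CONDITION>", "cats": {"POSITIVE": 1.0}},
    {"text": "caused <CONDITION>", "cats": {"NEGATIVE": 1.0}},
]
```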
-
Hello,
Yes, you are right.
The Segmentation step tries to break down long documents/sentences into smaller chunks while maintaining their context.
The segmented chunks make it easier for the textcat to predict the sentiment. As you already mentioned, what happens if the doc contains multiple entities with multiple sentiments even after segmentation? For this case we use blinding:
For every entity found, the blinding step creates multiple versions of the doc, each with one specific entity blinded/replaced. The goal is to "tell the textcat" which entity the sentiment prediction should focus on. The annotation data should be processed the same way the textcat will receive it in training: segment your data first (Segmentation step), then blind the entities and create multiple versions whenever a sentence contains more than one entity (Blinding step). To create the textcat data, I trained the NER first and then used it together with the Segmentation & Blinding algorithm. The second example that you show from the Healthsea dataset is actually wrong; for every annotated example there should only be one blinded entity. I think fixing that could potentially improve the overall performance of the Healthsea model 😄
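For illustration, here is a minimal sketch of that blinding step, assuming a simple (start_char, end_char, label) span format and an `<ENTITY>` placeholder token; both choices are assumptions for this example rather than the actual Healthsea implementation:

```python
from typing import Dict, List, Tuple


def blind_entities(clause: str, entities: List[Tuple[int, int, str]],
                   placeholder: str = "<ENTITY>") -> List[Dict]:
    """Create one copy of the clause per entity, with that entity blinded.

    `entities` holds (start_char, end_char, label) tuples. Both the tuple
    format and the placeholder token are illustrative assumptions.
    """
    versions = []
    for start, end, label in entities:
        blinded_text = clause[:start] + placeholder + clause[end:]
        versions.append({"text": blinded_text, "blinded_label": label})
    return versions


# One clause with two entities -> two blinded training examples,
# matching the "one blinded entity per annotated example" rule above.
clause = "This helped my colds and sore throat"
entities = [(15, 20, "CONDITION"), (25, 36, "CONDITION")]
for example in blind_entities(clause, entities):
    print(example)
```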
-
Hello! So I am having a very strange error. I have trained the pipeline on my custom data and have the final trained pipeline saved in the model_best folder. Now I am trying to use the trained model for predictions. I am loading the model as follows:
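The original loading snippet is not reproduced here, but loading a trained spaCy pipeline from the model_best folder usually looks something like this sketch (the path and example text are placeholders):

```python
import spacy

# Note: if the pipeline uses custom components (as Healthsea does), the code
# that registers those factories must be imported before calling spacy.load.

nlp = spacy.load("training/model_best")  # assumed path to the trained pipeline
doc = nlp("This product helped my joint pain a lot.")
print(doc.cats)  # textcat scores, if a textcat component is in the pipeline
```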
When I execute this, I get an error as follows: I then used