Making the model more robust #9096

JimDunlop · 2021-08-31T06:28:31Z

JimDunlop
Aug 31, 2021

I am using spaCy for sentiment (or urgency) analysis of customer complaints. I work with word lists and get reasonably good results.
Now I want to explore more sophisticated possibilities, like topic modeling, similarities between texts, aspects of sentiment, etc.

The POS tagger and dependency parser seem to be rather inaccurate on my dataset, which might be because of the direct speech in the complaints, the long and complex sentences in some of them, or because of our specialised domain with lots of words unknown to spaCy.

So what is the best practice for getting a more robust model in my case? Is it possible without a POS- and DEP-annotated training set?
Can I get better results with a custom vocab?

These are my first attempts at enhancing the model/pipeline, so any help is greatly appreciated.

svlandeg · 2021-08-31T14:01:19Z

svlandeg
Aug 31, 2021

Hi!

The pretrained models that we provide will only get you so far, as you've found. If your domain is significantly different, you should definitely consider retraining the models or even training them from scratch. The relevant documentation is here: https://spacy.io/usage/training

5 replies

JimDunlop Aug 31, 2021
Author

Thanks for your reply, but what should I be training with? I don't have any POS- or DEP-annotated data. Should I just annotate domain-specific words? Would they even need to be annotated, or would it be enough for them to be in a custom vocab?

polm Sep 1, 2021

To be clear, are you using the pos or dep output? If you aren't then you can just train a model without worrying about them. If you want to use them but they aren't good enough, then the only way to make them better is to provide training data relevant for your domain.

You cannot just provide a custom vocab, or just annotate individual words to improve pos and dep accuracy. You need to annotate words in context in whole sentences. You can see some example English training data here.
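For illustration, here is a minimal sketch of what "words in context" could look like for a single sentence in spaCy v3. The sentence, its tags, and its dependency labels are made up for this example and are not taken from any real training corpus; a blank English pipeline is assumed so that no pretrained model is needed.

```python
import spacy
from spacy.training import Example

# Blank pipeline: only the tokenizer, no pretrained components (assumption).
nlp = spacy.blank("en")

# Hypothetical annotated sentence. Every token gets a tag, a head and a
# dependency label, so the annotation covers the whole sentence in context.
text = "The router keeps dropping the connection."
annotations = {
    "words": ["The", "router", "keeps", "dropping", "the", "connection", "."],
    "tags": ["DT", "NN", "VBZ", "VBG", "DT", "NN", "."],
    "pos": ["DET", "NOUN", "VERB", "VERB", "DET", "NOUN", "PUNCT"],
    # Absolute index of each token's head; the root points at itself.
    "heads": [1, 2, 2, 2, 5, 3, 2],
    "deps": ["det", "nsubj", "ROOT", "xcomp", "det", "dobj", "punct"],
}

# The reference side of the Example carries the gold annotations.
example = Example.from_dict(nlp.make_doc(text), annotations)

for token in example.reference:
    print(token.text, token.tag_, token.dep_, token.head.text)
```

A list of such `Example` objects (or the equivalent `.spacy` file) is what the tagger and parser are trained on, which is why a word list alone cannot stand in for it.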

Building annotated pos or dep data is a big undertaking and may be out of scope for you. But for sentiment analysis you don't necessarily need it - that's usually more of a text classification problem. So maybe just try using a textcat model and seeing how that does first.
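A minimal sketch of the textcat route, assuming spaCy v3 and a blank English pipeline. The labels `URGENT` / `NOT_URGENT` and the four training texts are hypothetical placeholders; a real model would need far more labelled complaints and would normally be trained through `spacy train` with a config file rather than a hand-written loop.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical toy data, only here to make the sketch runnable.
TRAIN_DATA = [
    ("The line has been dead for a week and nobody calls back!",
     {"cats": {"URGENT": 1.0, "NOT_URGENT": 0.0}}),
    ("Our whole office is offline, fix this immediately.",
     {"cats": {"URGENT": 1.0, "NOT_URGENT": 0.0}}),
    ("Could you send me a copy of last month's invoice?",
     {"cats": {"URGENT": 0.0, "NOT_URGENT": 1.0}}),
    ("I would like to update my postal address.",
     {"cats": {"URGENT": 0.0, "NOT_URGENT": 1.0}}),
]

# No tagger or parser is involved: the classifier works on the raw tokens.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
for label in ("URGENT", "NOT_URGENT"):
    textcat.add_label(label)

examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in TRAIN_DATA
]
optimizer = nlp.initialize(lambda: examples)

for _ in range(10):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("My connection has been down for three days.")
print(doc.cats)  # one score per label
```

Nothing here depends on pos or dep annotations, so it can be tried before investing in any tagger or parser training data.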

JimDunlop Sep 1, 2021
Author

I use dep to find the head of a sentiment token and then iterate over its children. I use the POS tag to find negation, but this is so inaccurate that I have to double-check for negation in a word list.
One of the next goals is to find out what aspect the negative sentiment is about. The aspect is, in most cases, a word that spaCy doesn't know yet. My first thought was to make those aspects custom entities. I've got some results with the PhraseMatcher and the EntityRuler.
Sentence segmentation is another thing. There are a lot of special-character words and abbreviations that cause spaCy to split sentences wrongly. I've looked into the SentenceRecognizer and should be able to fix this with some training on its own.
So while doing all those little fixes I wondered whether there is a standard way of doing things. Thus the title, making the model more robust.
For the pos and dep training: could I just train the new model on top of the old one? What I mean is, just provide training data relevant to my domain and add the results to the existing model?

polm Sep 1, 2021

Ah, OK, if you really need to improve the accuracy of pos and dep annotations then you'll want domain-appropriate training data.

> Could I just train the new model on top of the old one? What I mean is, just provide training data relevant to my domain and add the results to the existing model?

This is kind of possible, but you'll run into a few problems. One is "catastrophic forgetting", where the model will forget everything it learned that isn't in your training data. See the FAQ, there are a few entries about this.

Another thing is that your new data still needs to be annotated, though you don't have to do it from scratch. You can run your raw data through spaCy's existing models to get annotations and then correct those and use them as training data.
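The pre-annotation step could be sketched roughly like this, assuming spaCy v3. The helper name `bootstrap_annotations` and the file name are hypothetical; the pipeline passed in would typically be a pretrained one such as `en_core_web_sm`, and the correction itself still has to happen by hand or in an annotation tool.

```python
import spacy
from spacy.tokens import DocBin


def bootstrap_annotations(nlp, texts, out_path):
    """Run raw texts through an existing pipeline and save the predictions.

    The saved annotations are only a starting point: they contain the
    model's mistakes and must be corrected before being used for training.
    Returns the number of docs written.
    """
    doc_bin = DocBin()
    for doc in nlp.pipe(texts):
        doc_bin.add(doc)
    doc_bin.to_disk(out_path)
    return len(doc_bin)


# Usage (assumes the en_core_web_sm package is installed):
#   nlp = spacy.load("en_core_web_sm")
#   bootstrap_annotations(nlp, my_complaint_texts, "to_correct.spacy")
```

The resulting `.spacy` file is in the same binary format that `spacy train` reads, so once corrected it can be used as training data directly.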

JimDunlop Sep 1, 2021
Author

I guess I will stick with the little fixes and phrase matching for now then :)
Thanks a lot for taking the time to help me! Appreciate it
