Custom textcat model - multiple text columns as input? #7236
-
This is a great question. Unfortunately, spaCy's built-in models do not support multi-modal input with multiple fields or sources of features. As you mention, the easiest thing would be to just concatenate the headline and article body. It might not make a difference, but when you do that you could also add a period and newline between them. I dealt with a vaguely similar problem involving multi-field objects a while ago using product data, which typically has a title and a description. I tested a few variants of the input. They're pretty easy to set up, so it should be fast to test them with your data.
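As a rough illustration of the concatenation idea, here is a minimal sketch in plain Python. The column names `text_headline` and `text_article` come from the question; the `label` column, the helper names, and the three variants (headline only, body only, merged) are assumptions made for this example, since the exact variants tested are not listed in the reply.

```python
import csv
import io

# Column names from the question; LABEL_COL is an assumed name.
HEADLINE_COL = "text_headline"
BODY_COL = "text_article"
LABEL_COL = "label"


def merge_fields(headline: str, body: str) -> str:
    """Join headline and body with a period and newline between them."""
    headline = headline.strip()
    body = body.strip()
    if not headline:
        return body
    if not body:
        return headline
    # Only add a period if the headline does not already end a sentence.
    if headline[-1] not in ".!?":
        headline += "."
    return headline + "\n" + body


def build_variants(row: dict) -> dict:
    """Return the (assumed) input-text variants to compare for one row."""
    return {
        "headline_only": row[HEADLINE_COL].strip(),
        "body_only": row[BODY_COL].strip(),
        "merged": merge_fields(row[HEADLINE_COL], row[BODY_COL]),
    }


def read_rows(tsv_text: str) -> list:
    """Parse TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


# Made-up sample row, purely for illustration.
sample_tsv = (
    "text_headline\ttext_article\tlabel\n"
    "Local team wins final\tThe match ended 2-1 after extra time.\tsports\n"
)
rows = read_rows(sample_tsv)
variants = build_variants(rows[0])
print(variants["merged"])
```

Each variant is a single string, so it can be written to a one-text-column TSV and fed through the same conversion step the tutorial project already uses. Training one model per variant and comparing the evaluation scores should show which input works best for your data.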
In my case the descriptions ended up not being very helpful, so I dropped them from my model. Which of the combinations will work for you depends on whether your headlines can be understood in isolation and how focused your articles are. For example, if a sports article has a paragraph in the middle about a player's immigration status, it could be mistaken for an article on international affairs, so including the full body text might be less helpful if you have many cases like that.
If you want to actually generate separate representations for the different fields, one way to do that would be to train two textcat models and then combine their values using Thinc, though that would be more involved.
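For the two-model route, the simplest thing to try before building anything in Thinc is to combine the two models' output scores after prediction. The sketch below is not the Thinc approach mentioned above; it swaps in a plain weighted average of per-label scores, assuming each model yields a dict of label scores in the same shape as spaCy's `doc.cats`. The function names and the `headline_weight` parameter are hypothetical.

```python
def combine_scores(headline_cats: dict, article_cats: dict,
                   headline_weight: float = 0.5) -> dict:
    """Weighted average of two per-label score dicts (shaped like doc.cats)."""
    if not 0.0 <= headline_weight <= 1.0:
        raise ValueError("headline_weight must be between 0 and 1")
    article_weight = 1.0 - headline_weight
    # Labels missing from one model are treated as a score of 0.0.
    labels = sorted(set(headline_cats) | set(article_cats))
    return {
        label: headline_weight * headline_cats.get(label, 0.0)
        + article_weight * article_cats.get(label, 0.0)
        for label in labels
    }


def top_label(cats: dict) -> str:
    """Return the label with the highest combined score."""
    return max(cats, key=cats.get)


# Made-up scores, as if read from doc.cats of each model's output.
headline_scores = {"sports": 0.9, "politics": 0.1}
article_scores = {"sports": 0.4, "politics": 0.6}

combined = combine_scores(headline_scores, article_scores, headline_weight=0.5)
print(top_label(combined))
```

The weight can be tuned on the dev set. If headlines turn out to be more reliable than article bodies, a higher `headline_weight` should help; a learned combination in Thinc would replace this fixed weight with trained parameters.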
-
*... makes notes ...* I think spaCy won't be able to support this directly. But I do find this inspiring as a potential feature for tokenwiser. The idea is to attach a
-
Hi there
I am following this tutorial on how to train my own custom text classification model: textcat_goemotions
The corpus is converted like this:
As you can see, there is only one text column in the provided TSV asset files.
Let's say I have a dataset that looks like this:
I want to train the model on both `text_headline` and `text_article`. Is this possible with spaCy? And what are the best practices? Should I merge the two columns into one large text column?