Should I standardise text prior to annotation/creation of a Q&A dataset? #3639
Avs-safety started this conversation in General
Replies: 2 comments 2 replies
-
Hi there @Avs-safety. I am fairly confident that transformer-based LMs "know" that all these different brands of cars are just cars. Are you actually experiencing issues, or are you being overly cautious before proceeding further?
-
I'm planning to create my own dataset; however, before starting, I am trying to understand whether any standardisation of the text needs to be performed.
The data I am using contains words that are polysemous: for example, there may be many different car makes (Ford, Renault, etc.), but these are all 'cars'. It is envisaged that when the model trained on the completed dataset is in use, people will not ask questions about specific car makes but will use the term 'car' in their questions.
Essentially, would I be better off standardising the data (taking the above example, replacing all car makes with the single term 'car') to suit the type of questions asked? And would this have a detrimental effect on a fine-tuned model?
This is a general question, but I would appreciate feedback from anyone who has encountered a similar problem. Thank you.
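If you did decide to standardise, a minimal sketch might look like the following. This is purely illustrative: the list of car makes and the `standardise` function are hypothetical, and in practice you would build the mapping from your own corpus (and be careful with makes that collide with ordinary words).

```python
import re

# Hypothetical, hand-built list of car makes to collapse into 'car'.
# In a real pipeline this would come from inspecting your corpus.
CAR_MAKES = ["ford", "renault", "toyota", "volkswagen"]
MAKE_PATTERN = re.compile(r"\b(" + "|".join(CAR_MAKES) + r")\b", re.IGNORECASE)

def standardise(text: str) -> str:
    """Replace every listed car make with the generic token 'car'."""
    return MAKE_PATTERN.sub("car", text)

print(standardise("The Ford was parked next to a Renault."))
# -> The car was parked next to a car.
```

Note that this is a lossy transformation: once the make names are gone, the fine-tuned model can never answer make-specific questions, which is one reason the reply below suggests it may be unnecessary.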