Should I standardise text prior to annotation/creation of a Q&A dataset? #3639
Avs-safety started this conversation in General
Replies: 2 comments 2 replies
-
Hi there @Avs-safety. I am fairly confident that transformer-based LMs "know" that all these different brands of cars are just cars. Are you actually experiencing issues, or are you being overly cautious before proceeding further?
-
I'm planning to create my own dataset; however, before starting, I am trying to understand whether any standardisation of the text needs to be performed.
The data I am using contains words that are polysemous: for example, there may be many different car makes (Ford, Renault, etc.), but these are all 'cars'. It is envisaged that when the model trained on the completed dataset is in use, people will not ask questions about specific car makes but will use the term 'car' in their questions.
Essentially, would I be better off standardising the data (taking the above example, replacing all car makes with the single term 'car') to suit the type of questions asked? And would this have a detrimental effect on a fine-tuned model?
This is a general question, but I would appreciate feedback from anyone who has encountered a similar problem. Thank you.
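If you did decide to standardise, a minimal sketch might look like the following. This is purely illustrative: the list of car makes and the `standardise` function are hypothetical, and in practice you would build the mapping from your own corpus (and be careful with makes that collide with ordinary words).

```python
import re

# Hypothetical, hand-built list of car makes to collapse into 'car'.
# In a real pipeline this would come from inspecting your corpus.
CAR_MAKES = ["ford", "renault", "toyota", "volkswagen"]
MAKE_PATTERN = re.compile(r"\b(" + "|".join(CAR_MAKES) + r")\b", re.IGNORECASE)

def standardise(text: str) -> str:
    """Replace every listed car make with the generic token 'car'."""
    return MAKE_PATTERN.sub("car", text)

print(standardise("The Ford was parked next to a Renault."))
# -> The car was parked next to a car.
```

Note that this is a lossy transformation: once the make names are gone, the fine-tuned model can never answer make-specific questions, which is one reason the reply below suggests it may be unnecessary.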