Want to treat the token "Chat GPT" as one token AND transform it to "ChatGPT" #12843

fabriziox1234 · 2023-07-20T02:35:37Z

As the title states, my text document contains different forms of "Chat GPT", for example "chatgpt", "chat gpt", and "Chat GPT". spaCy splits the last two into two separate tokens. I want them to be treated as one token AND normalized to the single word "ChatGPT". This is because I don't want my TF-IDF to treat "chatgpt" and "Chat GPT" as two different terms when they refer to the same thing. I have searched nonstop for an answer, to no avail. The closest thing I've found is merging "Chat" and "GPT" into one token, "Chat GPT", but as mentioned, that's only half of what I need.


Norky101 · 2023-07-20T13:46:40Z

I am not an expert at all, but I believe you can customize how the spaCy tokenizer works. On the Python side, you may also need the .join() method to combine the pieces into one word.
I would first try searching for how to customize the tokenizer, though, if you have not already.

0 replies

rmitsch · 2023-07-21T07:16:35Z

Maintainer

Hi @fabriziox1234! Customizing the tokenizer wouldn't work in this case, as it would have to modify the text, and spaCy's tokenization is non-destructive. Are you looking to do this with just "ChatGPT" variants? If so, I recommend a very simple approach: search for all variants and replace them with the desired one (e.g. "ChatGPT") as a preprocessing step.
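For illustration, the search-and-replace preprocessing step could look roughly like the sketch below. The regular expression and the helper name `normalize_chatgpt` are assumptions made for this example, not something from the thread; the pattern only covers the variants mentioned in the question ("chatgpt", "chat gpt", "Chat GPT") and would need extending for other spellings.

```python
import re

# Illustrative pattern (an assumption): "chat", optional whitespace, "gpt",
# matched case-insensitively and only as a whole word.
CHATGPT_VARIANTS = re.compile(r"\bchat\s*gpt\b", flags=re.IGNORECASE)


def normalize_chatgpt(text: str) -> str:
    """Replace every matched variant with the canonical form "ChatGPT"."""
    return CHATGPT_VARIANTS.sub("ChatGPT", text)


raw = "I use chat gpt, Chat GPT and chatgpt."
clean = normalize_chatgpt(raw)
print(clean)
```

The cleaned string can then be passed to spaCy (or directly to a TF-IDF vectorizer), so every variant ends up as the same single token.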

1 reply

fabriziox1234 Jul 21, 2023
Author

This is how I solved my problem :). I was asking because I want to develop my knowledge of spaCy, though, so I was looking for a spaCy solution. Thank you for the reply!
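For anyone looking for a spaCy-side option: one possible sketch, assuming spaCy v3, is to leave the text untouched, merge the two-token variants with the retokenizer, and store the canonical form in the token's `norm_` attribute. The component name `chatgpt_merger`, the match key, and the patterns are hypothetical choices for this example; this is not the approach recommended in the thread, just one way it might be done.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.util import filter_spans


@Language.factory("chatgpt_merger")  # hypothetical component name
def create_chatgpt_merger(nlp, name):
    matcher = Matcher(nlp.vocab)
    # Illustrative patterns covering "chatgpt", "chat gpt" and "Chat GPT".
    matcher.add("CHATGPT", [
        [{"LOWER": "chatgpt"}],
        [{"LOWER": "chat"}, {"LOWER": "gpt"}],
    ])

    def merge_chatgpt(doc):
        spans = filter_spans(matcher(doc, as_spans=True))
        # Merge multi-token matches; the underlying text is not changed.
        with doc.retokenize() as retokenizer:
            for span in spans:
                if len(span) > 1:
                    retokenizer.merge(span)
        # Give every variant the same normalized form.
        for token in doc:
            if "".join(token.text.lower().split()) == "chatgpt":
                token.norm_ = "ChatGPT"
        return doc

    return merge_chatgpt


nlp = spacy.blank("en")  # tokenizer only, no trained model required
nlp.add_pipe("chatgpt_merger")

doc = nlp("I asked chat gpt, Chat GPT and chatgpt.")
terms = [token.norm_ for token in doc]
```

Feeding `token.norm_` rather than `token.text` into the TF-IDF step would then count all variants as one term, while `doc.text` stays identical to the input.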
