Want to treat the token "Chat GPT" as one token AND transform it to "ChatGPT" #12843
-
As the title states, my text document contains different forms of "Chat GPT", for example "chatgpt", "chat gpt", and "Chat GPT". spaCy treats each of the last two as two separate tokens. I want them to be treated as one token AND normalized to the single word "ChatGPT". This is because I don't want my TF-IDF to treat "chatgpt" and "Chat GPT" as two different tokens, even though both are the same thing. I have searched nonstop for an answer, to no avail. The closest thing I've found is merging "Chat" and "GPT" into one token "Chat GPT", but as mentioned, that's only half of what I need.
Replies: 3 comments 1 reply
-
I am not an expert at all, but I believe you can customize how the spaCy tokenizer works. On the Python side, you may then need to use the str.join() method to combine the pieces into one word.
-
Hi @fabriziox1234! Customizing the tokenizer wouldn't work in this case, as it would require modifying the text, and spaCy's tokenization is non-destructive. Are you looking to do this with just "ChatGPT" variants? If so, I recommend a very simple approach: search for all variants and replace them with the desired one (e.g. "ChatGPT") as a preprocessing step.
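For illustration, here is a minimal sketch of that search-and-replace step using only Python's standard `re` module. The pattern and the helper name `normalize_chatgpt` are my own assumptions, not something from spaCy; the pattern covers "chat" and "gpt" in any casing, optionally separated by whitespace or hyphens, so adjust it if your data contains other variants.

```python
import re

# Assumed pattern (not from the thread): "chat" + optional whitespace/hyphens + "gpt",
# matched case-insensitively and only at word boundaries.
CHATGPT_VARIANTS = re.compile(r"\bchat[\s\-]*gpt\b", flags=re.IGNORECASE)


def normalize_chatgpt(text: str) -> str:
    """Replace every matched variant with the canonical spelling "ChatGPT"."""
    return CHATGPT_VARIANTS.sub("ChatGPT", text)


sample = "I asked chat gpt, then Chat GPT, then chatgpt."
print(normalize_chatgpt(sample))
```

Running this on the raw text before passing it to spaCy or to a TF-IDF vectorizer should leave a single surface form, which is then tokenized as one token.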