Hello, I'm trying to split my text into chunks that do not exceed 512 tokens for the 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' model (I load its tokenizer with from transformers import AutoTokenizer), but the chunks I get are often more than 500 or 900 characters long. How can I split the text by tokens instead of characters?
Answer generated by a 🤖

I understand that you're trying to split your text into chunks that do not exceed 512 tokens using the 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' model. However, you're encountering an issue where the resulting chunks are often more than 500 or 900 characters long. This is likely because you're using the CharacterTextSplitter class, which splits the text based on characters, not tokens.

To address this issue, I suggest using the SentenceTransformersTokenTextSplitter class instead. This class is designed to split the text based on tokens, which is more suitable for Sentence Transformers models like the one you're using. Here's how you can modify your code:

from transformers import AutoTokenizer
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# the splitter loads its own tokenizer from model_name; this copy is handy for inspecting chunks
tokenizer = AutoTokenizer.from_pretrained(model_name)
with open("mybook.txt") as f:
    state_of_the_union = f.read()
text_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    model_name=model_name,
    tokens_per_chunk=512,
)
texts = text_splitter.split_text(state_of_the_union)
# check the chunk sizes (in characters)
for i, chunk in enumerate(texts):
    print(i, "--", len(chunk))

In this revised code, SentenceTransformersTokenTextSplitter splits the text into chunks of at most 512 tokens, with an overlap of 50 tokens between consecutive chunks, so each chunk stays within the model's token limit.

Please note that this is one potential solution based on the information you've provided. If you find this solution helpful and believe it could benefit others, I encourage you to make a pull request to update the LangChain documentation. This way, other users facing the same issue can also benefit from your experience.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.