Hello, I'm trying to split my text into chunks that do not exceed 512 tokens for the 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' model (I load its tokenizer with from transformers import AutoTokenizer), but the chunks I get are often more than 500 or 900 characters long. How can I split the text by tokens instead of characters?
Answer generated by a 🤖

I understand that you're trying to split your text into chunks that do not exceed 512 tokens using the 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' model. However, you're encountering an issue where the resulting chunks are often more than 500 or 900 characters long. This is likely because you're using the CharacterTextSplitter class, which splits the text based on characters, not tokens.

To address this issue, I suggest using the SentenceTransformersTokenTextSplitter class instead. This class is designed to split the text based on tokens, which is more suitable for Sentence Transformers models like the one you're using. Here's how you can modify your code:

from transformers import AutoTokenizer
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# the splitter loads its own tokenizer from model_name; this copy is handy for inspecting chunks
tokenizer = AutoTokenizer.from_pretrained(model_name)
with open("mybook.txt") as f:
    state_of_the_union = f.read()
text_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    model_name=model_name,
    tokens_per_chunk=512,
)
texts = text_splitter.split_text(state_of_the_union)
# check the chunk sizes (in characters)
for i, chunk in enumerate(texts):
    print(i, "--", len(chunk))

In this revised code, SentenceTransformersTokenTextSplitter splits the text into chunks of at most 512 tokens, with an overlap of 50 tokens between consecutive chunks, so each chunk stays within the model's token limit.

Please note that this is one potential solution based on the information you've provided. If you find this solution helpful and believe it could benefit others, I encourage you to make a pull request to update the LangChain documentation. This way, other users facing the same issue can also benefit from your experience.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.