Answer generated by a 🤖

Answer

I understand that you're trying to split your text into chunks of at most 512 tokens for the 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' model, but the resulting chunks often come out longer than 500 or even 900 characters. This is likely because you're using the CharacterTextSplitter class, which measures chunk size in characters rather than tokens, so its size limit does not correspond to the model's token limit.

To address this issue, I suggest using the SentenceTransformersTokenTextSplitter class instead. This class is designed to split the text based on tokens, which is more suitable for Sentence Transformers models like the one you're using.
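As a rough sketch of what that could look like (not necessarily the exact code from the original answer): the import path `langchain_text_splitters` is an assumption about your installed version (older LangChain releases expose the same class from `langchain.text_splitter`), and the helper names `split_by_tokens` and `oversized_chunks` are hypothetical. `tokens_per_chunk` is left unset here so the splitter falls back to the model's own maximum sequence length.

```python
from typing import Callable, List

MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


def split_by_tokens(text: str, model_name: str = MODEL_NAME) -> List[str]:
    """Split text into chunks whose size is measured in model tokens.

    Requires the langchain-text-splitters and sentence-transformers packages.
    """
    # Imported lazily so the checker below works without these packages.
    # Assumption: on older LangChain versions the import is
    # `from langchain.text_splitter import SentenceTransformersTokenTextSplitter`.
    from langchain_text_splitters import SentenceTransformersTokenTextSplitter

    splitter = SentenceTransformersTokenTextSplitter(
        model_name=model_name,
        chunk_overlap=0,  # no tokens shared between neighbouring chunks
    )
    return splitter.split_text(text)


def oversized_chunks(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    limit: int,
) -> List[str]:
    """Return the chunks whose token count exceeds `limit` (hypothetical helper)."""
    return [chunk for chunk in chunks if count_tokens(chunk) > limit]
```

To check the result, you could pass the model's own tokenizer as `count_tokens`; any callable that maps a string to a token count works, so a quick sanity check with a whitespace counter such as `lambda s: len(s.split())` is also possible, keeping in mind that it will undercount compared with the model's subword tokenizer.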

Here's h…

Answer selected by dmet6789