Conversation

@khaledsulayman (Member) commented Jun 10, 2025

In the sdg_hub code this came from, chunk_document() was called whenever the token count of a particular chunk exceeded 1024.

The chunk_document function uses the recursive text splitter to re-split chunks down to a given token-size threshold, but it should no longer be needed now that we use the hybrid chunker. This PR instead raises a more explicit error in the case that a chunk does exceed 1024 tokens.
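A minimal sketch of the check this PR introduces. The real code counts tokens with a model tokenizer; here a simple whitespace splitter stands in for it, and `get_token_count`/`validate_chunks` mirror the names in the diff below (the toy tokenizer itself is an assumption for illustration):

```python
MAX_TOKENS = 1024

def get_token_count(text, tokenizer):
    # The real implementation encodes with the model's tokenizer;
    # this sketch just calls whatever callable is passed in.
    return len(tokenizer(text))

def validate_chunks(chunks, tokenizer):
    # Instead of re-chunking oversized chunks, fail loudly.
    for c in chunks:
        if get_token_count(c["document"], tokenizer) > MAX_TOKENS:
            raise ValueError("Chunk exceeds token count of 1024")

toy_tokenizer = str.split  # stand-in for a real tokenizer

validate_chunks([{"document": "a short chunk"}], toy_tokenizer)  # passes

too_big = [{"document": " ".join(["tok"] * 2000)}]
try:
    validate_chunks(too_big, toy_tokenizer)
except ValueError as e:
    print(e)  # prints: Chunk exceeds token count of 1024
```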

@khaledsulayman khaledsulayman marked this pull request as draft June 12, 2025 17:35
@khaledsulayman khaledsulayman force-pushed the fix-rechunking branch 2 times, most recently from 328e9ef to c78d1f7 Compare June 13, 2025 16:44
@khaledsulayman khaledsulayman marked this pull request as ready for review June 13, 2025 16:50
```python
)
for c in chunked_document_all_icl:
    if get_token_count(c["document"], tokenizer) > 1024:
        raise ValueError("Chunk exceeds token count of 1024")
```
Contributor

Could we get the first X and last X tokens of each chunk printed out, and have all chunks that are too big collected into a list and printed at once? That way users know where they need to trim down their chunks before moving forward.

It would look something like:

```
Chunk Size Errors:
Chunk "foo bar baz ... biz baz bar." exceeds max token count of 1024.
Chunk "foo2 bar2 baz2 ... biz2 baz2 bar2." exceeds max token count of 1024.
...
```
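One way the suggestion above could be implemented: collect every oversized chunk, keep a head/tail token preview, and report them all in a single error. This is a sketch, not the merged code; `get_token_count`-style counting and the 1024 limit follow the snippet in this PR, while `preview`, `PREVIEW_TOKENS`, and the toy whitespace tokenizer are hypothetical:

```python
MAX_TOKENS = 1024
PREVIEW_TOKENS = 3  # the "X" in the suggestion above

def preview(text, tokenizer, n=PREVIEW_TOKENS):
    # First n and last n tokens, elided in the middle.
    toks = tokenizer(text)
    if len(toks) <= 2 * n:
        return text
    return " ".join(toks[:n]) + " ... " + " ".join(toks[-n:])

def validate_chunks(chunks, tokenizer):
    # Gather ALL oversized chunks before failing, so the user sees
    # everything that needs trimming in one pass.
    errors = []
    for c in chunks:
        if len(tokenizer(c["document"])) > MAX_TOKENS:
            errors.append(preview(c["document"], tokenizer))
    if errors:
        lines = ["Chunk Size Errors:"] + [
            f'Chunk "{p}" exceeds max token count of {MAX_TOKENS}.'
            for p in errors
        ]
        raise ValueError("\n".join(lines))

toy_tokenizer = str.split  # stand-in for a real tokenizer
big = {"document": " ".join(f"w{i}" for i in range(1500))}
try:
    validate_chunks([big], toy_tokenizer)
except ValueError as e:
    print(e)
```

Batching the report (rather than raising on the first oversized chunk) trades an immediate failure for a complete list, which matches the reviewer's goal of letting users fix all their chunks in one go.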

@alimaredia alimaredia merged commit 935d356 into instructlab:main Jun 18, 2025
1 check passed
