Conversation

@khaledsulayman (Member) commented Jun 10, 2025

In the sdg_hub code this came from, chunk_document() was called whenever the token count of a particular chunk exceeded 1024.

The chunk_document function uses the recursive text splitter to re-split chunks down to a given token-size threshold, but it should no longer be needed now that we use the hybrid chunker. This PR instead raises a more explicit error in the case that a chunk does exceed 1024 tokens.
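A minimal sketch of the check this PR introduces. The real code counts tokens with a model tokenizer; here a simple whitespace splitter stands in for it, and `get_token_count`/`validate_chunks` mirror the names in the diff below (the toy tokenizer itself is an assumption for illustration):

```python
MAX_TOKENS = 1024

def get_token_count(text, tokenizer):
    # The real implementation encodes with the model's tokenizer;
    # this sketch just calls whatever callable is passed in.
    return len(tokenizer(text))

def validate_chunks(chunks, tokenizer):
    # Instead of re-chunking oversized chunks, fail loudly.
    for c in chunks:
        if get_token_count(c["document"], tokenizer) > MAX_TOKENS:
            raise ValueError("Chunk exceeds token count of 1024")

toy_tokenizer = str.split  # stand-in for a real tokenizer

validate_chunks([{"document": "a short chunk"}], toy_tokenizer)  # passes

too_big = [{"document": " ".join(["tok"] * 2000)}]
try:
    validate_chunks(too_big, toy_tokenizer)
except ValueError as e:
    print(e)  # prints: Chunk exceeds token count of 1024
```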

@khaledsulayman khaledsulayman marked this pull request as draft June 12, 2025 17:35
@khaledsulayman khaledsulayman force-pushed the fix-rechunking branch 2 times, most recently from 328e9ef to c78d1f7 Compare June 13, 2025 16:44
@khaledsulayman khaledsulayman marked this pull request as ready for review June 13, 2025 16:50
```python
)
for c in chunked_document_all_icl:
    if get_token_count(c["document"], tokenizer) > 1024:
        raise ValueError("Chunk exceeds token count of 1024")
```
Contributor

Could we get the first X and last X tokens of each chunk printed out, and have all chunks that are too big collected into a list and printed at once? That way users know where they need to trim down their chunks before moving forward.

It would look something like:

```
Chunk Size Errors:
Chunk "foo bar baz ... biz baz bar." exceeds max token count of 1024.
Chunk "foo2 bar2 baz2 ... biz2 baz2 bar2." exceeds max token count of 1024.
...
```
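One way the suggestion above could be implemented: collect every oversized chunk, keep a head/tail token preview, and report them all in a single error. This is a sketch, not the merged code; `get_token_count`-style counting and the 1024 limit follow the snippet in this PR, while `preview`, `PREVIEW_TOKENS`, and the toy whitespace tokenizer are hypothetical:

```python
MAX_TOKENS = 1024
PREVIEW_TOKENS = 3  # the "X" in the suggestion above

def preview(text, tokenizer, n=PREVIEW_TOKENS):
    # First n and last n tokens, elided in the middle.
    toks = tokenizer(text)
    if len(toks) <= 2 * n:
        return text
    return " ".join(toks[:n]) + " ... " + " ".join(toks[-n:])

def validate_chunks(chunks, tokenizer):
    # Gather ALL oversized chunks before failing, so the user sees
    # everything that needs trimming in one pass.
    errors = []
    for c in chunks:
        if len(tokenizer(c["document"])) > MAX_TOKENS:
            errors.append(preview(c["document"], tokenizer))
    if errors:
        lines = ["Chunk Size Errors:"] + [
            f'Chunk "{p}" exceeds max token count of {MAX_TOKENS}.'
            for p in errors
        ]
        raise ValueError("\n".join(lines))

toy_tokenizer = str.split  # stand-in for a real tokenizer
big = {"document": " ".join(f"w{i}" for i in range(1500))}
try:
    validate_chunks([big], toy_tokenizer)
except ValueError as e:
    print(e)
```

Batching the report (rather than raising on the first oversized chunk) trades an immediate failure for a complete list, which matches the reviewer's goal of letting users fix all their chunks in one go.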

@alimaredia alimaredia merged commit 935d356 into instructlab:main Jun 18, 2025
1 check passed
