Skip to content

Update PDF extraction and OCR options for hybrid chunking#557

Merged
mergify[bot] merged 42 commits intoinstructlab:mainfrom
aakankshaduggal:hybrid-chunker
Apr 9, 2025
Merged

Update PDF extraction and OCR options for hybrid chunking#557
mergify[bot] merged 42 commits intoinstructlab:mainfrom
aakankshaduggal:hybrid-chunker

Conversation

@aakankshaduggal
Copy link
Contributor

@aakankshaduggal aakankshaduggal commented Feb 12, 2025

Addresses #503 , #436

@mergify mergify bot added the ci-failure label Feb 12, 2025
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
@mergify mergify bot added ci-failure dependencies Pull requests that update a dependency file and removed ci-failure labels Feb 12, 2025
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
@mergify mergify bot added ci-failure and removed ci-failure labels Feb 12, 2025
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
aakankshaduggal and others added 18 commits March 1, 2025 00:36
… import semchunk and transformers in chunkers.py

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
…ormers import

Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
…ct mac-os-latest-xlarge platform

Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
…ing MPS

Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
…ture

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
- Updated tox.ini to pass CI environment variable.
- Modified DocumentChunker to check for CI environment before disabling MPS on macOS.

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
…ences to docling parse

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
tox
env:
# Increase from 1.7 to a greater value to avoid the PyTorch MPS backend running OOM.
PYTORCH_MPS_HIGH_WATERMARK_RATIO: 2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed since we force Mac runners to use CPU in other parts of this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch @bbrowning, this was removed in the PR we rebased on, I accidentally kept this incoming change instead of discarding during the rebase. Fixing in a new commit now ⚡

fused_texts = self.fuse_texts(chunks, 200)
num_tokens_per_doc = _num_tokens_from_words(self.chunk_word_count)
chunk_size = _num_chars_from_tokens(num_tokens_per_doc)
final_chunks = chunk_markdowns(fused_texts, chunk_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we're doing another pass of chunking on the chunks produced by the HybridChunker. Are we seeing it generate chunks larger than expected? If so, I'd consider that a bug to file with the Docling team.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra pass isn’t performing chunking twice—it’s a standardization step. After the HybridChunker produces chunks (which can sometimes be too large or too small), we fuse and then re-chunk the text so that the final output is more consistent in size. This helps ensure that all chunks meet our expected size constraints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a question of semantics, but we're definitely chunking again. We're taking the chunks generated by docling, combining any short ones into their preceding chunks (via fuse_texts), and then splitting them back up again. The fusing and re-chunking are done without any of the context awareness that Docling originally had when it made these chunks. Do we have evidence to suggest that this results in better output than just using what Docling produced for us? How often is Docling gives us very short chunks that we end up fusing into other chunks? And how often is our final chunking step breaking up the Docling-produced chunks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Ben's right here. I believe we only initially kept the markdown text splitter because we considered it part of the custom chunk building we were doing after the docling conversions, etc. Moving to the hybrid chunker should allow us to get rid of this step just as we did with the rest of the custom stuff.

I think it would be worth doing a test and inspect the output of the hybrid chunker. Happy to support on that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think we can simplify our workflow by removing the extra fuse/re‐chunking step and rely solely on the HybridChunker. We’ll monitor the results—if we don’t see positive outcomes, I’ll bring this issue to the docling team’s attention. Thanks for your suggestions!

aakankshaduggal and others added 5 commits March 30, 2025 19:33
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
…p in test.yml

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Copy link
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been through a lot of iterations and looks good overall. I appreciate all the extra tests added and the removal of large portions of code that we no longer have to maintain around docling json parsing 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI/CD Affects CI/CD configuration dependencies Pull requests that update a dependency file testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants