Update PDF extraction and OCR options for hybrid chunking #557
mergify[bot] merged 42 commits into instructlab:main
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
d97fb71 to 6790918
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
8de7679 to 51fb86d
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
935c050 to 45ea8f4
45ea8f4 to bdd771a
bdd771a to 87e4278
87e4278 to 16427d5
… import semchunk and transformers in chunkers.py Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
…ormers import Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
…ct mac-os-latest-xlarge platform Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
…ing MPS Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>
…ture Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
- Updated tox.ini to pass CI environment variable.
- Modified DocumentChunker to check for CI environment before disabling MPS on macOS.
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
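A minimal sketch of the kind of check this commit describes, assuming the chunker chooses a device string before conversion. `select_device` and its parameters are hypothetical names used only for illustration; the actual logic in DocumentChunker may differ.

```python
import os
import platform


def select_device(env=None, system=None, mps_available=False):
    """Return "cpu" or "mps" for a hypothetical document-conversion step.

    Assumption: MPS is disabled only when running on macOS *and* the CI
    environment variable is set, so local Mac users keep acceleration.
    """
    env = os.environ if env is None else env
    system = platform.system() if system is None else system

    on_macos = system == "Darwin"
    in_ci = env.get("CI", "").lower() in ("1", "true")

    if on_macos and in_ci:
        # Hosted macOS runners: fall back to CPU.
        return "cpu"
    return "mps" if (on_macos and mps_available) else "cpu"
```

The check only works inside tox if the `CI` variable actually reaches the test environment, which is presumably why tox.ini was updated in the same commit.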
…ences to docling parse Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
.github/workflows/test.yml (outdated)
    tox
    env:
      # Increase from 1.7 to a greater value to avoid the PyTorch MPS backend running OOM.
      PYTORCH_MPS_HIGH_WATERMARK_RATIO: 2.0
Is this still needed since we force Mac runners to use CPU in other parts of this change?
Good catch, @bbrowning. This was removed in the PR we rebased on; I accidentally kept this incoming change instead of discarding it during the rebase. Fixing in a new commit now ⚡
    fused_texts = self.fuse_texts(chunks, 200)
    num_tokens_per_doc = _num_tokens_from_words(self.chunk_word_count)
    chunk_size = _num_chars_from_tokens(num_tokens_per_doc)
    final_chunks = chunk_markdowns(fused_texts, chunk_size)
I'm curious why we're doing another pass of chunking on the chunks produced by the HybridChunker. Are we seeing it generate chunks larger than expected? If so, I'd consider that a bug to file with the Docling team.
The extra pass isn’t performing chunking twice—it’s a standardization step. After the HybridChunker produces chunks (which can sometimes be too large or too small), we fuse and then re-chunk the text so that the final output is more consistent in size. This helps ensure that all chunks meet our expected size constraints.
It may be a question of semantics, but we're definitely chunking again. We're taking the chunks generated by Docling, combining any short ones into their preceding chunks (via fuse_texts), and then splitting them back up again. The fusing and re-chunking are done without any of the context awareness that Docling originally had when it made these chunks. Do we have evidence to suggest that this results in better output than just using what Docling produced for us? How often is Docling giving us very short chunks that we end up fusing into other chunks? And how often is our final chunking step breaking up the Docling-produced chunks?
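To make the concern concrete, here is a rough sketch of a fuse-then-split pass. `fuse_short` and `split_fixed` are hypothetical stand-ins written for this illustration, not the repository's `fuse_texts` or `chunk_markdowns`, and the fixed-width split is deliberately naive.

```python
def fuse_short(chunks, min_chars):
    """Append any chunk shorter than min_chars to the chunk before it."""
    fused = []
    for chunk in chunks:
        if fused and len(chunk) < min_chars:
            fused[-1] = fused[-1] + " " + chunk
        else:
            fused.append(chunk)
    return fused


def split_fixed(texts, max_chars):
    """Split each text into pieces of at most max_chars, ignoring structure."""
    pieces = []
    for text in texts:
        for start in range(0, len(text), max_chars):
            pieces.append(text[start:start + max_chars])
    return pieces


# Made-up input standing in for structure-aware chunks.
docling_chunks = [
    "# Intro\nShort heading text.",
    "ok",
    "A longer paragraph about OCR options.",
]
fused = fuse_short(docling_chunks, min_chars=10)
final = split_fixed(fused, max_chars=20)
```

In this toy input the short chunk is absorbed into its predecessor, and the final split then cuts both texts at character offsets that have nothing to do with the original boundaries.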
I think Ben's right here. I believe we only initially kept the markdown text splitter because we considered it part of the custom chunk building we were doing after the docling conversions, etc. Moving to the hybrid chunker should allow us to get rid of this step just as we did with the rest of the custom stuff.
I think it would be worth running a test and inspecting the output of the hybrid chunker. Happy to support on that!
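One way such an inspection could look, as a sketch only: `chunk_stats` is a hypothetical helper, the thresholds are placeholders, and it counts whitespace-separated words rather than using the model tokenizer.

```python
from statistics import mean


def chunk_stats(chunks, min_words, max_words):
    """Summarize chunk sizes by whitespace word count."""
    counts = [len(c.split()) for c in chunks]
    return {
        "total": len(counts),
        "mean_words": mean(counts) if counts else 0,
        "too_short": sum(1 for n in counts if n < min_words),
        "too_long": sum(1 for n in counts if n > max_words),
    }
```

Running this over the chunk texts produced for a few representative PDFs would show how often chunks fall outside the expected size range, which is the evidence the questions above ask for.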
Okay, I think we can simplify our workflow by removing the extra fuse/re-chunking step and relying solely on the HybridChunker. We’ll monitor the results; if we don’t see positive outcomes, I’ll bring this issue to the docling team’s attention. Thanks for your suggestions!
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
…p in test.yml Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
bbrowning left a comment
This has been through a lot of iterations and looks good overall. I appreciate all the extra tests added and the removal of large portions of code that we no longer have to maintain around docling json parsing 🎉
Addresses #503, #436