Update PDF extraction and OCR options for hybrid chunking by aakankshaduggal · Pull Request #557 · instructlab/sdg

aakankshaduggal · 2025-02-12T20:25:52Z

Addresses #503, #436

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

bbrowning · 2025-03-28T12:58:55Z

.github/workflows/test.yml

          tox
+        env:
+          # Increase from 1.7 to a greater value to avoid the PyTorch MPS backend running OOM.
+          PYTORCH_MPS_HIGH_WATERMARK_RATIO: 2.0


Is this still needed since we force Mac runners to use CPU in other parts of this change?

Good catch, @bbrowning. This was removed in the PR we rebased on; I accidentally kept the incoming change instead of discarding it during the rebase. Fixing in a new commit now ⚡

bbrowning · 2025-03-28T14:05:19Z

src/instructlab/sdg/utils/chunkers.py

+            fused_texts = self.fuse_texts(chunks, 200)
+            num_tokens_per_doc = _num_tokens_from_words(self.chunk_word_count)
+            chunk_size = _num_chars_from_tokens(num_tokens_per_doc)
+            final_chunks = chunk_markdowns(fused_texts, chunk_size)


I'm curious why we're doing another pass of chunking on the chunks produced by the HybridChunker. Are we seeing it generate chunks larger than expected? If so, I'd consider that a bug to file with the Docling team.

The extra pass isn’t performing chunking twice—it’s a standardization step. After the HybridChunker produces chunks (which can sometimes be too large or too small), we fuse and then re-chunk the text so that the final output is more consistent in size. This helps ensure that all chunks meet our expected size constraints.

It may be a question of semantics, but we're definitely chunking again. We're taking the chunks generated by Docling, combining any short ones into their preceding chunks (via fuse_texts), and then splitting them back up again. The fusing and re-chunking are done without any of the context awareness that Docling originally had when it made these chunks. Do we have evidence to suggest that this results in better output than just using what Docling produced for us? How often is Docling giving us very short chunks that we end up fusing into other chunks? And how often is our final chunking step breaking up the Docling-produced chunks?
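To make the concern concrete, here is a minimal sketch of a fuse-then-split pass. It is not the code from this PR: `fuse_short_chunks` and `split_by_size` are hypothetical stand-ins for `fuse_texts` and `chunk_markdowns`, and the character-based thresholds are assumptions chosen only for illustration.

```python
def fuse_short_chunks(chunks, min_chars):
    """Merge any chunk shorter than min_chars into the chunk before it.

    Hypothetical stand-in for a fusing step; the real fuse_texts may
    use a different unit or rule.
    """
    fused = []
    for chunk in chunks:
        if fused and len(chunk) < min_chars:
            # Append the short chunk to its predecessor.
            fused[-1] = fused[-1] + "\n" + chunk
        else:
            fused.append(chunk)
    return fused


def split_by_size(texts, max_chars):
    """Naively cut each text into pieces of at most max_chars characters.

    The cut points depend only on length, so they ignore whatever
    structure the original chunker used to place its boundaries.
    """
    pieces = []
    for text in texts:
        for start in range(0, len(text), max_chars):
            pieces.append(text[start:start + max_chars])
    return pieces


original = ["a" * 10, "bb", "c" * 10]
fused = fuse_short_chunks(original, min_chars=5)
final = split_by_size(fused, max_chars=8)
```

In this toy run the short chunk `"bb"` is folded into the first chunk, and the size-based split then cuts both fused texts at positions the original chunker never chose. That is the loss of context awareness described above.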

I think Ben's right here. I believe we only initially kept the markdown text splitter because we considered it part of the custom chunk building we were doing after the docling conversions, etc. Moving to the hybrid chunker should allow us to get rid of this step just as we did with the rest of the custom stuff.

I think it would be worth running a test and inspecting the output of the hybrid chunker. Happy to support on that!

Okay, I think we can simplify our workflow by removing the extra fuse/re‐chunking step and rely solely on the HybridChunker. We’ll monitor the results—if we don’t see positive outcomes, I’ll bring this issue to the docling team’s attention. Thanks for your suggestions!

bbrowning

This has been through a lot of iterations and looks good overall. I appreciate all the extra tests added and the removal of large portions of code that we no longer have to maintain around docling json parsing 🎉

mergify bot added the ci-failure label Feb 12, 2025

Update PDF extraction and OCR options for hybrid chunking

6790918

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

aakankshaduggal force-pushed the hybrid-chunker branch from d97fb71 to 6790918 Compare February 12, 2025 20:29

mergify bot added the ci-failure and dependencies labels and removed the ci-failure label Feb 12, 2025

Update docling versions

51fb86d

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

aakankshaduggal force-pushed the hybrid-chunker branch from 8de7679 to 51fb86d Compare February 12, 2025 21:05

mergify bot added ci-failure and removed ci-failure labels Feb 12, 2025

Update easyocr params

90a7a4b

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

mergify bot added ci-failure and removed ci-failure labels Feb 12, 2025

Add docling-core[chunking] to requirements

19ba945

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

mergify bot added ci-failure and removed ci-failure labels Feb 12, 2025

aakankshaduggal force-pushed the hybrid-chunker branch from 935c050 to 45ea8f4 Compare February 13, 2025 16:16

mergify bot added ci-failure and removed ci-failure labels Feb 13, 2025

aakankshaduggal force-pushed the hybrid-chunker branch from 45ea8f4 to bdd771a Compare February 13, 2025 16:32

mergify bot added ci-failure and removed ci-failure labels Feb 13, 2025

aakankshaduggal force-pushed the hybrid-chunker branch from bdd771a to 87e4278 Compare February 13, 2025 16:43

mergify bot added ci-failure and removed ci-failure labels Feb 13, 2025

aakankshaduggal force-pushed the hybrid-chunker branch from 87e4278 to 16427d5 Compare February 13, 2025 19:30

aakankshaduggal and others added 18 commits March 1, 2025 00:36

Remove docling core test from lean imports, add transformers back and…

7709f0c

… import semchunk and transformers in chunkers.py Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

fix: Move docling_core import inside method to avoid top-level transf…

ec99a89

…ormers import Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>

fix: Remove semchunk and transformers from requirements.txt

09e7b09

Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>

ci: Update GitHub Actions workflow to use macos-latest-xlarge runners

bd3488b

Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>

ci: Update GitHub Actions workflow free disk space condition to refle…

7c11b3c

…ct mac-os-latest-xlarge platform Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>

test: Add fixture to force CPU usage on macOS CI environments

2cdb25e

Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>

test: Add fixture to force CPU usage on macOS CI environments, disabl…

0a3eecd

…ing MPS Signed-off-by: Cloud User <ec2-user@ip-172-31-44-225.ec2.internal>

test: test debug logging and MPS handling in macOS CI environment fix…

755851c

…ture Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

test: Reorder imports and minor formatting in macOS CI fixture

16f6b9f

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

test: Simplify macOS device handling in chunkers and test fixture

bde48a3

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

refactor: Simplify MPS handling in macOS CI test fixture

cb5f4b1

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

fix: add CI env check condition to disabling MPS

259bdb5

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

Revert latest commit to fix broken test

0d5a749

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

ci: update test workflow to use macos-latest platform

728443e

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

ci: add CI environment variable to enable macOS MPS handling

a33749b

- Updated tox.ini to pass CI environment variable. - Modified DocumentChunker to check for CI environment before disabling MPS on macOS. Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>
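The check described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `DocumentChunker`: the function name `should_force_cpu` and its parameters are hypothetical, and it assumes the `CI` variable carries a truthy string such as `true`, which is what GitHub Actions sets by default.

```python
import os
import platform


def should_force_cpu(environ=None, system=None):
    """Return True when running on macOS inside a CI environment.

    Hypothetical helper: the arguments exist so the decision can be
    exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    system = platform.system() if system is None else system
    # Treat common truthy spellings of the CI variable as "in CI".
    in_ci = environ.get("CI", "").strip().lower() in ("1", "true")
    # platform.system() reports "Darwin" on macOS.
    return system == "Darwin" and in_ci
```

Because tox does not forward every environment variable by default, a check like this only sees `CI` if tox.ini passes it through, which is why the commit touches both files.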

refactor: Remove PDF extraction using docling parse, remove all refer…

a8273f4

…ences to docling parse Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

Merge remote-tracking branch 'upstream/main' into hybrid-chunker

c7563b7

Fix duplicate imports

40072d0

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

bbrowning reviewed Mar 28, 2025

aakankshaduggal and others added 5 commits March 30, 2025 19:33

Update test_chunkers to check for token count per chunk over char length

f1ca2d3

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>
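A token-based check of the kind this commit title describes might look like the sketch below. It is not the test from `test_chunkers`: `count_tokens` is a hypothetical whitespace tokenizer standing in for whatever tokenizer the chunker is configured with, and `oversized_chunks` is an assumed helper name.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words.

    A real test would count tokens with the same tokenizer the
    chunker uses, since token and word counts can differ.
    """
    return len(text.split())


def oversized_chunks(chunks, max_tokens):
    """Return (index, token_count) for every chunk over the limit."""
    results = []
    for index, chunk in enumerate(chunks):
        tokens = count_tokens(chunk)
        if tokens > max_tokens:
            results.append((index, tokens))
    return results
```

A test can then assert that `oversized_chunks(chunks, max_tokens)` is empty, which ties the check to the chunker's token budget rather than to character length.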

Get rid of the standardization/extra chunking step

8ef0429

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

Get rid of function and test for docs containing html

88de4c6

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

fix: remove PYTORCH_MPS_HIGH_WATERMARK_RATIO environment variable bum…

65032f1

…p in test.yml Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

fix: fix / add semantic checks to test_chunker functional test

5678215

Signed-off-by: eshwarprasadS <eshwarprasad.s01@gmail.com>

bbrowning reviewed Apr 3, 2025

requirements.txt (outdated, resolved)

Update the docling version to 2.28.4 and docling core to 2.25.0

9e627a5

Signed-off-by: Aakanksha Duggal <aduggal@redhat.com>

eshwarprasadS approved these changes Apr 8, 2025

bbrowning approved these changes Apr 8, 2025

khaledsulayman approved these changes Apr 9, 2025

mergify[bot] merged 42 commits into instructlab:main from aakankshaduggal:hybrid-chunker

6 participants
