@MAVRICK-1

/claim #2

📊 Test Results

JSON Results:

{
  "file": "sample_test.pdf",
  "format": ".pdf",
  "strategies": {
    "fixed": {
      "status": "success",
      "chunks": 1,
      "time": 6.198883056640625e-06
    },
    "paragraph": {
      "status": "success",
      "chunks": 1,
      "time": 4.887580871582031e-05
    },
    "heading": {
      "status": "success",
      "chunks": 1,
      "time": 6.175041198730469e-05
    },
    "page": {
      "status": "success",
      "chunks": 1,
      "time": 0.0016906261444091797
    },
    "semantic": {
      "status": "success",
      "chunks": 0,
      "time": 2.586862564086914,
      "pages": 1
    }
  }
}
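
The report above marks every strategy as "success", including the semantic one that returned 0 chunks. A minimal sketch of a check that flags this case; `empty_strategies` is a hypothetical helper, not part of this PR, and the report is a trimmed copy of the JSON above:

```python
import json


def empty_strategies(report: dict) -> list[str]:
    """Return names of strategies that reported success but produced no chunks."""
    return [
        name
        for name, result in report.get("strategies", {}).items()
        if result.get("status") == "success" and result.get("chunks", 0) == 0
    ]


# Trimmed copy of the report above (timings omitted).
report = json.loads("""
{
  "file": "sample_test.pdf",
  "format": ".pdf",
  "strategies": {
    "fixed": {"status": "success", "chunks": 1},
    "paragraph": {"status": "success", "chunks": 1},
    "heading": {"status": "success", "chunks": 1},
    "page": {"status": "success", "chunks": 1},
    "semantic": {"status": "success", "chunks": 0, "pages": 1}
  }
}
""")

print(empty_strategies(report))  # ['semantic']
```

A check like this could let the test script fail loudly when a strategy silently returns nothing.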

✨ Solution

Fix the content extraction pipeline so that semantic chunking produces actual chunks instead of 0.

🧪 How to Test

Current Test:

python3 test_chunker.py sample_test.pdf
# Result: 0 chunks from the semantic strategy

Test with Other Files:

# Test your own PDF
python3 test_chunker.py /path/to/your/document.pdf

# Test multiple files
python3 test_chunker.py doc1.pdf doc2.pdf doc3.pdf

# Test all sample files
python3 test_chunker.py

Expected After Fix:

python3 test_chunker.py sample_test.pdf
# Result: 2-3 chunks extracted

…or chunking strategies

- Updated `yolo_model_utils.py` to dynamically retrieve the YOLO model path.
- Implemented error handling for model loading.
- Created `test_chunker.py` to test various chunking strategies across multiple document formats.
- Added functionality to auto-download sample documents for testing.
- Included tests for fixed-size, paragraph, heading, page, and semantic chunking.
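
For illustration, a harness in the spirit of `test_chunker.py` could time each strategy and emit a report shaped like the JSON results above. This is a sketch under stated assumptions, not the script from this PR: `fixed_chunks`, `paragraph_chunks`, and `run_strategies` are hypothetical names, and the two chunkers are simplified stand-ins that work on plain text rather than PDFs.

```python
import json
import time
from typing import Callable


def fixed_chunks(text: str, size: int = 500) -> list[str]:
    """Stand-in fixed-size chunker: split text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str) -> list[str]:
    """Stand-in paragraph chunker: split on blank lines, drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def run_strategies(text: str, strategies: dict[str, Callable[[str], list[str]]]) -> dict:
    """Run each strategy, recording status, chunk count, and elapsed seconds."""
    results = {}
    for name, chunker in strategies.items():
        start = time.perf_counter()
        try:
            chunks = chunker(text)
            results[name] = {"status": "success", "chunks": len(chunks)}
        except Exception as exc:  # one failing strategy should not stop the run
            results[name] = {"status": "error", "error": str(exc)}
        results[name]["time"] = time.perf_counter() - start
    return results


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph."
    report = run_strategies(sample, {"fixed": fixed_chunks, "paragraph": paragraph_chunks})
    print(json.dumps({"strategies": report}, indent=2))
```

Catching exceptions per strategy keeps one broken chunker from hiding the results of the others.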