
Conversation

@harsh-791

Optimize Document Processing and Fix Critical Issues (closes #2)

Overview

This PR addresses several critical issues in the document processing system and implements significant performance optimizations to reduce latency and improve reliability.

Changes Made

Bug Fixes

  • Fixed circular import between openai.py and chunking.py by moving paragraph_chunking import inside functions
  • Removed the request_timeout parameter from OpenAI client calls (the 1.x client no longer accepts it; it uses timeout instead)
  • Upgraded the OpenAI package to the latest version (1.78.1)
  • Fixed FastAPI server startup issues
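
The circular-import fix can be sketched as follows. The module and function names (chunking, paragraph_chunking) come from the PR description; the rest is a self-contained illustration of the deferred-import pattern, with a stand-in chunking module so the snippet runs on its own:

```python
import sys
import types

# Stand-in for the real chunking.py, so this sketch is runnable on its own.
chunking = types.ModuleType("chunking")

def _paragraph_chunking(text):
    # Hypothetical implementation: split on blank lines.
    return [p for p in text.split("\n\n") if p]

chunking.paragraph_chunking = _paragraph_chunking
sys.modules["chunking"] = chunking

def chunk_document(text):
    # Deferred import: chunking is resolved at call time rather than at
    # module-import time, which is what breaks the openai.py <-> chunking.py
    # import cycle described above.
    from chunking import paragraph_chunking
    return paragraph_chunking(text)

print(chunk_document("first paragraph\n\nsecond paragraph"))
```

The key point is that moving the import inside the function delays it until after both modules have finished loading, so neither module needs the other at import time.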

Performance Optimizations

  • Implemented thread-safe caching mechanism with hit/miss tracking
  • Optimized PDF text extraction with parallel processing
  • Improved chunk merging logic for better text processing
  • Added memory-efficient streaming for large documents
  • Enhanced error handling with proper fallback mechanisms
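
The first bullet can be sketched roughly as below. The PR only states that a thread-safe cache with hit/miss tracking was added; the class name and API here are illustrative, not the PR's actual code:

```python
import threading

class ThreadSafeCache:
    """Minimal sketch of a thread-safe cache with hit/miss counters."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # The lock guards both the lookup and the counters, so concurrent
        # requests for the same document cannot corrupt the statistics.
        with self._lock:
            if key in self._data:
                self.hits += 1
                return self._data[key]
            self.misses += 1
            value = compute()
            self._data[key] = value
            return value

cache = ThreadSafeCache()
cache.get_or_compute("doc-1", lambda: "chunks for doc-1")  # miss, computes
cache.get_or_compute("doc-1", lambda: "chunks for doc-1")  # hit, cached
print(cache.hits, cache.misses)  # 1 1
```

Tracking hits and misses makes it easy to verify the cache is actually reducing repeated document-processing work.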

Code Quality

  • Improved error handling in semantic chunking
  • Added proper logging for better debugging
  • Updated API documentation
  • Enhanced example usage
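
The error-handling and logging improvements might look something like this sketch. The function names and the always-failing helper are hypothetical, used only to exercise the fallback path; the PR describes the pattern (log the failure, fall back to fixed-size chunking) without showing code:

```python
import logging

logger = logging.getLogger("chunking")

def _semantic_chunk(text):
    # Stand-in that always fails, to demonstrate the fallback path.
    raise RuntimeError("embedding service unavailable")

def semantic_chunk(text, chunk_size=1000):
    """Try semantic chunking; on failure, log and fall back to fixed-size."""
    try:
        return _semantic_chunk(text)
    except Exception:
        logger.exception("semantic chunking failed; using fixed-size fallback")
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = semantic_chunk("x" * 2500, chunk_size=1000)
print(len(chunks))  # 3
```

logger.exception records the full traceback, so a failed semantic pass is debuggable rather than silently degraded.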

Testing

  • Tested document processing with various file types (PDF, DOCX, PPTX)
  • Verified chunking strategies (semantic, fixed) work as expected
  • Confirmed API endpoints are accessible and functioning
  • Validated performance improvements with large documents

Performance Impact

  • Reduced document processing latency
  • Improved memory usage during large document processing
  • Enhanced reliability of chunking operations

Documentation

  • Updated API documentation
  • Added example usage in README
  • Included performance monitoring capabilities

Dependencies

  • Upgraded OpenAI package to 1.78.1
  • No breaking changes to existing dependencies

Checklist

  • Code follows project style guidelines
  • All tests pass
  • Documentation has been updated
  • No breaking changes to existing functionality
  • Performance improvements verified

/claim #2

@harsh-791
Author

@mubashir-oss can you please review the PR and suggest any changes if required?

@mubashir-oss
Contributor

@harsh-791 please attach a recording

@harsh-791
Author

Screencast from 2025-05-23 13-14-07.webm
Screenshot from 2025-05-23 13-14-57

Hi @mubashir-oss ,

I’ve attached a screenshot and a video showing the latency results for the document chunking endpoint. To measure this, I used a simple curl command with timing flags while running everything locally on my machine.

As you can see in the screenshot:

  • The total time to process and get a response was about 1.45 seconds.
  • The time to first byte was also 1.45 seconds, which means nearly all of the time was spent on server-side processing rather than transfer.
  • The connection itself was almost instant (0.00025 seconds).

For this test, I used a real PDF file and the following command:

curl -X 'POST' \
  'http://127.0.0.1:8000/chunking' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F "document_file=@/home/harsh/Downloads/6.pdf" \
  -F 'strategy=semantic' \
  -F 'chunk_size=1000' \
  -F 'overlap=100' \
  -w "\nTotal time: %{time_total}s\nTime to first byte: %{time_starttransfer}s\nTime to connect: %{time_connect}s\n"

@harsh-791
Author

@mubashir-oss can you please review this?
(I'm new to this, so I might have made some mistakes. I would be grateful for any guidance or feedback.)



Development

Successfully merging this pull request may close these issues.

Fix: Reduce the latency of document parser
