
Conversation

@harsh-791

Optimize Document Processing and Fix Critical Issues (closes #2)

Overview

This PR addresses several critical issues in the document processing system and implements significant performance optimizations to reduce latency and improve reliability.

Changes Made

Bug Fixes

  • Fixed circular import between openai.py and chunking.py by moving paragraph_chunking import inside functions
  • Removed the request_timeout parameter from OpenAI client calls (the 1.x client no longer accepts it; it uses timeout instead)
  • Upgraded the OpenAI package to the latest version (1.78.1)
  • Fixed FastAPI server startup issues
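
The circular-import fix can be sketched as follows. The module and function names (chunking, paragraph_chunking) come from the PR description; the rest is a self-contained illustration of the deferred-import pattern, with a stand-in chunking module so the snippet runs on its own:

```python
import sys
import types

# Stand-in for the real chunking.py, so this sketch is runnable on its own.
chunking = types.ModuleType("chunking")

def _paragraph_chunking(text):
    # Hypothetical implementation: split on blank lines.
    return [p for p in text.split("\n\n") if p]

chunking.paragraph_chunking = _paragraph_chunking
sys.modules["chunking"] = chunking

def chunk_document(text):
    # Deferred import: chunking is resolved at call time rather than at
    # module-import time, which is what breaks the openai.py <-> chunking.py
    # import cycle described above.
    from chunking import paragraph_chunking
    return paragraph_chunking(text)

print(chunk_document("first paragraph\n\nsecond paragraph"))
```

The key point is that moving the import inside the function delays it until after both modules have finished loading, so neither module needs the other at import time.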

Performance Optimizations

  • Implemented thread-safe caching mechanism with hit/miss tracking
  • Optimized PDF text extraction with parallel processing
  • Improved chunk merging logic for better text processing
  • Added memory-efficient streaming for large documents
  • Enhanced error handling with proper fallback mechanisms
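
The first bullet can be sketched roughly as below. The PR only states that a thread-safe cache with hit/miss tracking was added; the class name and API here are illustrative, not the PR's actual code:

```python
import threading

class ThreadSafeCache:
    """Minimal sketch of a thread-safe cache with hit/miss counters."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # The lock guards both the lookup and the counters, so concurrent
        # requests for the same document cannot corrupt the statistics.
        with self._lock:
            if key in self._data:
                self.hits += 1
                return self._data[key]
            self.misses += 1
            value = compute()
            self._data[key] = value
            return value

cache = ThreadSafeCache()
cache.get_or_compute("doc-1", lambda: "chunks for doc-1")  # miss, computes
cache.get_or_compute("doc-1", lambda: "chunks for doc-1")  # hit, cached
print(cache.hits, cache.misses)  # 1 1
```

Tracking hits and misses makes it easy to verify the cache is actually reducing repeated document-processing work.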

Code Quality

  • Improved error handling in semantic chunking
  • Added proper logging for better debugging
  • Updated API documentation
  • Enhanced example usage
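
The error-handling and logging improvements might look something like this sketch. The function names and the always-failing helper are hypothetical, used only to exercise the fallback path; the PR describes the pattern (log the failure, fall back to fixed-size chunking) without showing code:

```python
import logging

logger = logging.getLogger("chunking")

def _semantic_chunk(text):
    # Stand-in that always fails, to demonstrate the fallback path.
    raise RuntimeError("embedding service unavailable")

def semantic_chunk(text, chunk_size=1000):
    """Try semantic chunking; on failure, log and fall back to fixed-size."""
    try:
        return _semantic_chunk(text)
    except Exception:
        logger.exception("semantic chunking failed; using fixed-size fallback")
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = semantic_chunk("x" * 2500, chunk_size=1000)
print(len(chunks))  # 3
```

logger.exception records the full traceback, so a failed semantic pass is debuggable rather than silently degraded.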

Testing

  • Tested document processing with various file types (PDF, DOCX, PPTX)
  • Verified chunking strategies (semantic, fixed) work as expected
  • Confirmed API endpoints are accessible and functioning
  • Validated performance improvements with large documents

Performance Impact

  • Reduced document processing latency
  • Improved memory usage during large document processing
  • Enhanced reliability of chunking operations

Documentation

  • Updated API documentation
  • Added example usage in README
  • Included performance monitoring capabilities

Dependencies

  • Upgraded OpenAI package to 1.78.1
  • No breaking changes to existing dependencies

Checklist

  • Code follows project style guidelines
  • All tests pass
  • Documentation has been updated
  • No breaking changes to existing functionality
  • Performance improvements verified

/claim #2

@harsh-791
Author

@mubashir-oss can you please review the PR and suggest any changes if required?

@mubashir-oss
Contributor

@harsh-791 please attach a recording

@harsh-791
Author

Screencast from 2025-05-23 13-14-07.webm
Screenshot from 2025-05-23 13-14-57

Hi @mubashir-oss ,

I’ve attached a screenshot and a video showing the latency results for the document chunking endpoint. To measure this, I used a simple curl command with timing flags while running everything locally on my machine.

As you can see in the screenshot:

  • The total time to process and get a response was about 1.45 seconds.
  • The time to first byte was also 1.45 seconds, which means nearly all of the time was spent on server-side processing rather than transfer.
  • The connection itself was almost instant (0.00025 seconds).

For this test, I used a real PDF file and the following command:

curl -X 'POST' \
  'http://127.0.0.1:8000/chunking' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F "document_file=@/home/harsh/Downloads/6.pdf" \
  -F 'strategy=semantic' \
  -F 'chunk_size=1000' \
  -F 'overlap=100' \
  -w "\nTotal time: %{time_total}s\nTime to first byte: %{time_starttransfer}s\nTime to connect: %{time_connect}s\n"

@harsh-791
Author

@mubashir-oss can you please review this?
(I'm new to this, so I might have made some mistakes. I would be grateful for any guidance or feedback.)



Development

Successfully merging this pull request may close these issues.

Fix: Reduce the latency of document parser
