@Dallas98
Collaborator

This pull request introduces several improvements and optimizations to the generation_service.py module, focusing on processing efficiency, resource control, and enhanced data extraction. The most significant changes include increasing concurrency limits, introducing randomization to QA generation, extracting image URLs from documents, and adjusting batch sizes for chunk processing.

Performance and Resource Management:

  • Increased the concurrency limit for question processing from 10 to 20 and the batch size for chunk processing from 20 to 100, allowing more tasks to be processed in parallel and improving throughput.
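
For readers who have not opened the diff, here is a rough sketch of how a concurrency limit and a chunk batch size usually interact in an asyncio service. Only the values 20 and 100 come from this PR; the constant names, `batched`, `process_question`, and `process_chunks` are hypothetical stand-ins, not the identifiers used in generation_service.py.

```python
import asyncio

# Values taken from the PR description; the constant names are assumptions.
QUESTION_CONCURRENCY = 20  # previously 10
CHUNK_BATCH_SIZE = 100     # previously 20


def batched(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def process_question(chunk, semaphore):
    """Hypothetical stand-in for the real per-question work."""
    async with semaphore:  # at most QUESTION_CONCURRENCY bodies run at once
        await asyncio.sleep(0)  # placeholder for an LLM or I/O call
        return f"processed:{chunk}"


async def process_chunks(chunks):
    """Process chunks batch by batch, bounding in-flight work with a semaphore."""
    semaphore = asyncio.Semaphore(QUESTION_CONCURRENCY)
    results = []
    for batch in batched(chunks, CHUNK_BATCH_SIZE):
        tasks = (process_question(chunk, semaphore) for chunk in batch)
        results.extend(await asyncio.gather(*tasks))
    return results
```

With these numbers, each batch schedules up to 100 coroutines, but the semaphore lets only 20 of them do work at the same time, so the batch size mostly controls scheduling and memory while the semaphore controls load on the downstream service.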

Feature Enhancements:

  • Added a new function, extract_img_urls, that extracts image URLs from document content using a regular expression, and integrated this extraction into the question processing workflow to store found image URLs in the data object.
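
As an illustration of what such a helper might look like, the sketch below assumes the document content uses Markdown image syntax. The function name extract_img_urls is from the PR; the regular expression, the `IMG_URL_PATTERN` name, and the return type are assumptions, since the actual pattern is not quoted in this conversation.

```python
import re

# Assumed pattern: Markdown images of the form ![alt](url "optional title").
# The regex actually used in the PR may differ.
IMG_URL_PATTERN = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def extract_img_urls(content: str) -> list[str]:
    """Return every image URL found in `content`, in order of appearance."""
    return IMG_URL_PATTERN.findall(content)
```

The caller could then attach the result to the record it is building, for example under a key such as `img_urls` (a hypothetical field name).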

Quality Control and Randomization:

  • Introduced a randomization step in the QA generation process: for each chunk, QA generation is now probabilistically skipped based on the temperature parameter in question_cfg, which can help diversify output and control resource usage.
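
One plausible reading of this step is sketched below. The exact rule is not spelled out in the description, so the comparison direction (generate with probability equal to temperature, skip otherwise), the default of 1.0, and the helper names `should_generate_qa` and `select_chunks` are all assumptions made for illustration.

```python
import random


def should_generate_qa(temperature: float, rng: random.Random) -> bool:
    """Assumed rule: generate with probability `temperature`, otherwise skip."""
    return rng.random() < temperature


def select_chunks(chunks, question_cfg, seed=None):
    """Return the chunks that survive the probabilistic skip."""
    rng = random.Random(seed)  # seedable so runs can be reproduced
    temperature = question_cfg.get("temperature", 1.0)  # assumed default
    return [chunk for chunk in chunks if should_generate_qa(temperature, rng)]
```

Under this assumption, a temperature of 1.0 keeps every chunk and 0.0 skips all of them, so the parameter acts as a sampling rate over chunks in addition to whatever role it plays in the model call.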

@Dallas98 Dallas98 merged commit 8fc4455 into main Dec 22, 2025
2 checks passed
@Dallas98 Dallas98 deleted the dev branch December 24, 2025 04:10
