Optimize Document Embedding with Multi-threading #22

@longdafeng

Description

The DocumentEmbedder.embed_from_directory() method in src/rag/rag.py processes documents sequentially, one file at a time, which is slow for large document collections. We should use multi-threading to parallelize file processing, embedding generation, and database insertion so that embedding a large collection takes less wall-clock time.

Proposed Solution

Add multi-threading support to the DocumentEmbedder class to process multiple files concurrently. This would involve:

  1. Using a thread pool to process files in parallel
  2. Batching embeddings and database insertions efficiently
  3. Maintaining thread safety for database operations
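
The three steps above could be sketched roughly as follows. This is a minimal illustration, not the actual DocumentEmbedder code: `embed_files_concurrently`, `embed_file`, `insert_batch`, `max_workers`, and `batch_size` are all hypothetical names standing in for whatever src/rag/rag.py really uses. Worker threads only read and embed files; the calling thread collects results and performs every database insertion, which is one simple way to keep database access thread-safe without sharing a connection across threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical record shape: (file path, embedding vector).
Record = Tuple[str, List[float]]


def embed_files_concurrently(
    paths: Iterable[Path],
    embed_file: Callable[[Path], Record],          # hypothetical: read + embed one file
    insert_batch: Callable[[Sequence[Record]], None],  # hypothetical: one bulk DB insert
    max_workers: int = 4,
    batch_size: int = 32,
) -> int:
    """Embed files in a thread pool and insert the results in batches.

    Returns the number of records handed to insert_batch.
    """
    pending: List[Record] = []
    inserted = 0

    def flush() -> None:
        # Runs only on the calling thread, so database writes are never concurrent.
        nonlocal inserted
        if pending:
            insert_batch(list(pending))
            inserted += len(pending)
            pending.clear()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Step 1: process files in parallel.
        futures = [pool.submit(embed_file, path) for path in paths]
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            pending.append(future.result())
            # Step 2: insert in batches rather than one row at a time.
            if len(pending) >= batch_size:
                flush()

    flush()  # insert whatever is left after the last full batch
    return inserted
```

Threads help most when the per-file work is I/O-bound, for example reading files or calling a remote embedding service. If embeddings are computed locally in pure Python, CPython's global interpreter lock may limit the speedup, so the gain should be measured rather than assumed.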

Related Code

  • src/rag/rag.py - DocumentEmbedder class (lines 328-464)
