Refactor: DocAgent #2076
## DocAgent Refactor

### Current state of DocAgent

The existing DocAgent follows a swarm architecture with multiple specialized agents (Triage, Task Manager, Parser, Data Ingestion, Query, Error, and Summary agents). While this design provides clear separation of concerns, it introduces several production-readiness issues. The new design will instead feature 4 layers, reflected in the rough FS structure further below.
### How do we solve this problem?
Instead of processing documents at query time, the new architecture will use an event-driven approach, where documents are ingested in response to triggered events such as button clicks and file uploads.

```python
# Before: synchronous processing during query
user_query = "What's in this PDF?"
# Agent processes PDF → chunks → vectorizes → stores → queries (slow!)

# After: event-driven ingestion
ingestion_service.ingest_document("large_report.pdf")  # Async event
# Later...
user_query = "What's in this PDF?"
# Agent queries pre-processed data (fast!)
```
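For intuition (not part of this PR), here is a minimal sketch of the event-driven idea using a plain in-process queue. All names are hypothetical, and the real `DocumentIngestionService` may well use async events or an external queue instead:

```python
import queue
import threading


class SimpleIngestionWorker:
    """Toy event-driven ingestion: callers enqueue document paths and
    return immediately; a background thread does the slow work."""

    def __init__(self) -> None:
        self._events: "queue.Queue[str | None]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def ingest_document(self, path: str) -> None:
        self._events.put(path)  # non-blocking for the caller

    def _run(self) -> None:
        while True:
            path = self._events.get()
            if path is None:  # shutdown sentinel
                return
            # Stand-in for the real pipeline: parse -> chunk -> embed -> store
            print(f"ingested {path}")

    def stop(self) -> None:
        self._events.put(None)
        self._worker.join()


worker = SimpleIngestionWorker()
worker.ingest_document("large_report.pdf")  # returns before processing finishes
worker.stop()
```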
Configuration is expressed as plain dataclasses:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class StorageConfig:
    storage_type: str = "local"  # "local", "s3", "azure", "gcs", "minio"
    base_path: Path = field(default_factory=lambda: Path("./storage"))
    bucket_name: str | None = None
    credentials: dict[str, Any] | None = None


@dataclass
class RAGConfig:
    rag_type: str = "vector"  # "vector", "structured", "graph"
    backend: str = "chromadb"  # "chromadb", "weaviate", "neo4j", "inmemory"
    collection_name: str | None = None
    embedding_model: str = "all-MiniLM-L6-v2"
```
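`ProcessingConfig` and `DocAgentConfig` are referenced below but not shown in this excerpt. A plausible sketch for completeness, reusing the imports above and the same dataclass style; only `chunk_size` and `max_file_size` appear in the original, the other fields and defaults are guesses:

```python
@dataclass
class ProcessingConfig:
    chunk_size: int = 512                    # overridden to 1024 in the example below
    max_file_size: int = 100 * 1024 * 1024   # bytes; overridden to 500 MB below


@dataclass
class DocAgentConfig:
    rag: RAGConfig = field(default_factory=RAGConfig)
    storage: StorageConfig = field(default_factory=StorageConfig)
    processing: ProcessingConfig = field(default_factory=ProcessingConfig)
```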
Putting it together:

```python
config = DocAgentConfig(
    rag=RAGConfig(
        rag_type="vector",
        backend="weaviate",
        embedding_model="all-MiniLM-L6-v2",
    ),
    storage=StorageConfig(
        storage_type="s3",
        bucket_name="my-docs-bucket",
    ),
    processing=ProcessingConfig(
        chunk_size=1024,
        max_file_size=500 * 1024 * 1024,  # 500 MB
    ),
)
```

### Example usage

```python
from autogen.agents.experimental.document_agent import DocAgent2, DocumentIngestionService
from autogen.agents.experimental.document_agent.core import DocAgentConfig, RAGConfig, StorageConfig
# WeaviateQueryEngine is also used below; its import path would come from the proposed rag/ layer (TBD)
# Configure for production use
config = DocAgentConfig(
    rag=RAGConfig(backend="weaviate", rag_type="vector"),
    storage=StorageConfig(storage_type="s3", bucket_name="company-docs"),
)
# Initialize query engine (supports multiple backends)
query_engine = WeaviateQueryEngine(config.rag)
# Create ingestion service (handles document processing)
ingestion_service = DocumentIngestionService(query_engine, config)
# Process documents asynchronously (event-driven)
ingestion_service.ingest_document("large_manual.pdf") # Non-blocking
# Create query agent (fast, no document processing)
doc_agent = DocAgent2(
    query_engine=query_engine,
    config=config,
)
# Query pre-processed documents
response = doc_agent.query("What are the safety procedures?")
```
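A consequence of this split: because `DocAgent2` only reads from a pre-populated query engine, slow document processing never blocks a user-facing query, and several agents can share the same engine.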
### Rough FS structure

```
document_agent/
├── core/
│   ├── __init__.py
│   ├── base_interfaces.py      # Extract interfaces from existing code
│   └── config.py               # Configuration from existing code
├── ingestion/
│   ├── __init__.py
│   ├── document_processor.py   # Move from parser_utils.py + docling_doc_ingest_agent.py
│   └── chunking_strategies.py  # Extract from existing parsing logic
├── storage/
│   ├── __init__.py
│   └── local_storage.py        # Move from document_utils.py
├── rag/
│   ├── __init__.py
│   ├── base_rag.py             # Extract from chroma_query_engine.py + inmemory_query_engine.py
│   └── vector_rag.py           # Move chroma_query_engine.py
└── agents/
    ├── __init__.py
    ├── doc_agent.py            # Simplified version of document_agent.py
    └── ingestion_agent.py      # Move from docling_doc_ingest_agent.py
```
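`core/base_interfaces.py` is called out above but not shown. A speculative sketch of what the extracted interfaces might look like, using `typing.Protocol`; the method names are guesses, not the PR's actual API:

```python
from typing import Any, Protocol


class QueryEngine(Protocol):
    """Hypothetical shared surface that both DocumentIngestionService
    and DocAgent2 would depend on."""

    def add_documents(self, chunks: list[str], metadata: dict[str, Any] | None = None) -> None: ...
    def query(self, question: str) -> str: ...


class DocumentStorage(Protocol):
    """Hypothetical storage interface behind storage/local_storage.py
    and the cloud backends named in StorageConfig."""

    def save(self, name: str, data: bytes) -> str: ...
    def load(self, name: str) -> bytes: ...
```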
The refactored DocAgent transforms from a research prototype into a production/enterprise-ready AG2 feature.
@marklysze can you help review? Thank you!
*Force-pushed from 87d30b5 to da2723a.*
### Why are these changes needed?

Identified enterprise-readiness issues:

The base refactor solves the first two problems defined above, runtime performance and resource waste, by decoupling data ingestion from the parent architecture.
Example output:
### Related issue number

Closes #2078

### Checks