diff --git a/docs/design/document_upload_design.md b/docs/design/document_upload_design.md index f1f36b86..9bf4dc43 100644 --- a/docs/design/document_upload_design.md +++ b/docs/design/document_upload_design.md @@ -1,18 +1,14 @@ -# ApeRAG Document Upload Module Data Flow +# ApeRAG Document Upload Architecture Design ## Overview -This document details the complete data flow of the document upload module in the ApeRAG project, from frontend file upload to backend storage and index construction. +This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction. -**Core Concept**: Adopts a **two-phase commit** design, first uploading to temporary state (UPLOADED), then formally adding to the knowledge base after user confirmation (PENDING → index building). +**Core Design Philosophy**: Adopts a **two-phase commit** pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities. -## Core Interfaces +## System Architecture -1. **Upload File**: `POST /api/v1/collections/{collection_id}/documents/upload` -2. **Confirm Documents**: `POST /api/v1/collections/{collection_id}/documents/confirm` -3. 
**One-step Upload** (legacy): `POST /api/v1/collections/{collection_id}/documents` - -## Data Flow Diagram +### Overall Architecture ``` ┌─────────────────────────────────────────────────────────────┐ @@ -25,18 +21,19 @@ This document details the complete data flow of the document upload module in th ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ View Layer: aperag/views/collections.py │ -│ - upload_document_view() │ -│ - confirm_documents_view() │ -│ - JWT authentication, parameter validation │ +│ - HTTP request handling │ +│ - JWT authentication │ +│ - Parameter validation │ └────────┬───────────────────────────────────┬────────────────┘ │ │ │ document_service.upload_document() │ document_service.confirm_documents() ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Service Layer: aperag/service/document_service.py │ +│ - Business logic orchestration │ │ - File validation (type, size) │ -│ - Duplicate detection (SHA-256 hash) │ -│ - Quota check │ +│ - SHA-256 hash deduplication │ +│ - Quota checking │ │ - Transaction management │ └────────┬───────────────────────────────────┬────────────────┘ │ │ @@ -51,1116 +48,1030 @@ This document details the complete data flow of the document upload module in th │ │ ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Data Storage Layer │ +│ Storage Layer │ │ │ │ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ -│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ +│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ │ │ │ │ │ │ │ │ │ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ -│ │ - document_ │ │ - Original files │ │ - Vector │ │ -│ │ index │ │ - Converted files│ │ indexes │ │ -│ │ │ │ │ │ │ │ +│ │ - document_ │ │ - Original files │ │ - Vectors │ │ +│ │ index │ │ - Converted files│ │ │ │ │ └───────────────┘ └──────────────────┘ └─────────────┘ │ │ │ │ ┌───────────────┐ ┌──────────────────┐ │ │ │ Elasticsearch │ │ Neo4j/PG │ │ │ │ │ │ │ │ -│ │ - Fulltext │ │ - 
Knowledge │ │ -│ │ indexes │ │ graph │ │ -│ │ │ │ │ │ +│ │ - Full-text │ │ - Knowledge Graph│ │ │ └───────────────┘ └──────────────────┘ │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌───────────────────┐ │ Celery Workers │ + │ │ │ - Doc parsing │ - │ - Chunking │ + │ - Format convert │ + │ - Content extract│ + │ - Doc chunking │ │ - Index building │ └───────────────────┘ ``` -## Complete Process Details +### Layered Architecture -### Phase 1: File Upload (Temporary Storage) +``` +┌─────────────────────────────────────────────┐ +│ View Layer (views/collections.py) │ HTTP handling, auth, validation +└─────────────────┬───────────────────────────┘ + │ calls +┌─────────────────▼───────────────────────────┐ +│ Service Layer (service/document_service.py)│ Business logic, transaction, permission +└─────────────────┬───────────────────────────┘ + │ calls +┌─────────────────▼───────────────────────────┐ +│ Repository Layer (db/ops.py, objectstore/) │ Data access abstraction +└─────────────────┬───────────────────────────┘ + │ accesses +┌─────────────────▼───────────────────────────┐ +│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ Data persistence +└─────────────────────────────────────────────┘ +``` -#### 1.1 View Layer - HTTP Request Handling +## Core Process Details -**File**: `aperag/views/collections.py` +### Phase 0: API Interface Definition -```python -@router.post("/collections/{collection_id}/documents/upload", tags=["documents"]) -@audit(resource_type="document", api_name="UploadDocument") -async def upload_document_view( - request: Request, - collection_id: str, - file: UploadFile = File(...), - user: User = Depends(required_user), -) -> view_models.UploadDocumentResponse: - """Upload a single document file to temporary storage""" - return await document_service.upload_document(str(user.id), collection_id, file) -``` +The system provides three main interfaces: -**Responsibilities**: -- Receive multipart/form-data file uploads -- JWT Token 
authentication -- Extract path parameters (collection_id) -- Call Service layer -- Return `UploadDocumentResponse` (includes document_id, filename, size, status) +1. **Upload File** (Two-phase mode - Step 1) + - Endpoint: `POST /api/v1/collections/{collection_id}/documents/upload` + - Function: Upload file to temporary storage, status `UPLOADED` + - Returns: `document_id`, `filename`, `size`, `status` -#### 1.2 Service Layer - Business Logic Orchestration +2. **Confirm Documents** (Two-phase mode - Step 2) + - Endpoint: `POST /api/v1/collections/{collection_id}/documents/confirm` + - Function: Confirm uploaded documents, trigger index building + - Parameters: `document_ids` array + - Returns: `confirmed_count`, `failed_count`, `failed_documents` -**File**: `aperag/service/document_service.py` +3. **One-step Upload** (Legacy mode, backward compatible) + - Endpoint: `POST /api/v1/collections/{collection_id}/documents` + - Function: Upload and directly add to knowledge base, status directly to `PENDING` + - Supports batch upload -```python -async def upload_document( - self, user_id: str, collection_id: str, file: UploadFile -) -> view_models.UploadDocumentResponse: - """Upload a single document file to temporary storage with duplicate detection""" - # 1. Validate collection exists and is active - collection = await self._validate_collection(user_id, collection_id) - - # 2. Validate file type and size - file_suffix = self._validate_file(file.filename, file.size) - - # 3. Read file content - file_content = await file.read() - - # 4. Calculate file hash (SHA-256) - file_hash = calculate_file_hash(file_content) - - # 5. 
Transaction processing - async def _upload_document_atomically(session): - # 5.1 Duplicate detection - existing_doc = await self._check_duplicate_document( - user_id, collection.id, file.filename, file_hash - ) - - if existing_doc: - # Idempotent: return existing document - return view_models.UploadDocumentResponse( - document_id=existing_doc.id, - filename=existing_doc.name, - size=existing_doc.size, - status=existing_doc.status, - ) - - # 5.2 Create new document (UPLOADED status) - document_instance = await self._create_document_record( - session=session, - user=user_id, - collection_id=collection.id, - filename=file.filename, - size=file.size, - status=db_models.DocumentStatus.UPLOADED, # Temporary status - file_suffix=file_suffix, - file_content=file_content, - content_hash=file_hash, - ) - - return view_models.UploadDocumentResponse( - document_id=document_instance.id, - filename=document_instance.name, - size=document_instance.size, - status=document_instance.status, - ) - - return await self.db_ops.execute_with_transaction(_upload_document_atomically) -``` +### Phase 1: File Upload and Temporary Storage -**Core Validation Logic**: +#### 1.1 Upload Flow -1. 
**Collection Validation** (`_validate_collection`) - - Collection exists - - Collection is in ACTIVE status +``` +User selects files + │ + ▼ +Frontend calls upload API + │ + ▼ +View layer validates identity and params + │ + ▼ +Service layer processes business logic: + │ + ├─► Verify collection exists and active + │ + ├─► Validate file type and size + │ + ├─► Read file content + │ + ├─► Calculate SHA-256 hash + │ + └─► Transaction processing: + │ + ├─► Duplicate detection (by filename + hash) + │ ├─ Exact match: Return existing doc (idempotent) + │ ├─ Same name, different content: Throw conflict error + │ └─ New document: Continue creation + │ + ├─► Create Document record (status=UPLOADED) + │ + ├─► Upload to object store + │ └─ Path: user-{user_id}/{collection_id}/{document_id}/original{suffix} + │ + └─► Update document metadata (object_path) +``` -2. **File Validation** (`_validate_file`) - - File extension is supported - - File size within limit (default 100MB) +#### 1.2 File Validation -3. 
**Duplicate Detection** (`_check_duplicate_document`) - - Query by filename and SHA-256 hash - - If same name but different hash: throw `DocumentNameConflictException` - - If same name and hash: return existing document (idempotent) +**Supported File Types**: +- Documents: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx` +- Text: `.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv` +- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` +- Audio: `.mp3`, `.wav`, `.m4a` +- Archives: `.zip`, `.tar`, `.gz`, `.tgz` -#### 1.3 Document Creation Logic +**Size Limits**: +- Default: 100 MB (configurable via `MAX_DOCUMENT_SIZE` environment variable) +- Extracted total size: 5 GB (`MAX_EXTRACTED_SIZE`) -**Method**: `_create_document_record` +#### 1.3 Duplicate Detection Mechanism -```python -async def _create_document_record( - self, - session: AsyncSession, - user: str, - collection_id: str, - filename: str, - size: int, - status: db_models.DocumentStatus, - file_suffix: str, - file_content: bytes, - custom_metadata: dict = None, - content_hash: str = None, -) -> db_models.Document: - # 1. Create database record - document_instance = db_models.Document( - user=user, - name=filename, - status=status, - size=size, - collection_id=collection_id, - content_hash=content_hash, - ) - session.add(document_instance) - await session.flush() - await session.refresh(document_instance) - - # 2. Upload to object storage - async_obj_store = get_async_object_store() - upload_path = f"{document_instance.object_store_base_path()}/original{file_suffix}" - await async_obj_store.put(upload_path, file_content) - - # 3. 
Update metadata - metadata = {"object_path": upload_path} - if custom_metadata: - metadata.update(custom_metadata) - document_instance.doc_metadata = json.dumps(metadata) - session.add(document_instance) - await session.flush() - - return document_instance -``` +Uses **filename + SHA-256 hash** dual detection: -**Object Storage Path Generation**: +| Scenario | Filename | Hash | System Behavior | +|----------|----------|------|-----------------| +| Exact match | Same | Same | Return existing document (idempotent) | +| Name conflict | Same | Different | Throw `DocumentNameConflictException` | +| New document | Different | - | Create new document record | -```python -# Model method: aperag/db/models.py -def object_store_base_path(self) -> str: - """Generate the base path for object store""" - user = self.user.replace("|", "-") - return f"user-{user}/{self.collection_id}/{self.id}" - -# Example storage path: -# user-google-oauth2|123456/col_abc123/doc_xyz789/original.pdf -``` +**Advantages**: +- ✅ Supports idempotent upload: Network retries won't create duplicates +- ✅ Prevents content conflicts: Same name with different content prompts user +- ✅ Saves storage space: Same content stored only once -### Phase 2: Confirm Documents (Formal Addition) +### Phase 2: Temporary Storage Configuration -#### 2.1 View Layer +#### 2.1 Object Storage Types -**File**: `aperag/views/collections.py` +System supports two object storage backends, switchable via environment variables: -```python -@router.post("/collections/{collection_id}/documents/confirm", tags=["documents"]) -@audit(resource_type="document", api_name="ConfirmDocuments") -async def confirm_documents_view( - request: Request, - collection_id: str, - data: view_models.ConfirmDocumentsRequest, - user: User = Depends(required_user), -) -> view_models.ConfirmDocumentsResponse: - """Confirm uploaded documents and add them to collection""" - return await document_service.confirm_documents( - str(user.id), collection_id, 
data.document_ids - ) -``` +**1. Local Storage (Local filesystem)** -#### 2.2 Service Layer - Confirmation Logic +Use cases: +- Development and testing environments +- Small-scale deployments +- Single-machine deployments -**Method**: `confirm_documents` +Configuration: +```bash +# Development environment +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=.objects -```python -async def confirm_documents( - self, user_id: str, collection_id: str, document_ids: list[str] -) -> view_models.ConfirmDocumentsResponse: - """Confirm uploaded documents and trigger indexing""" - # 1. Validate collection - collection = await self._validate_collection(user_id, collection_id) - - # 2. Get collection configuration - collection_config = json.loads(collection.config) - index_types = self._get_index_types_for_collection(collection_config) - - confirmed_count = 0 - failed_count = 0 - failed_documents = [] - - # 3. Transaction processing - async def _confirm_documents_atomically(session): - # 3.1 Check quota (deduct quota only at confirmation stage) - await self._check_document_quotas(session, user_id, collection_id, len(document_ids)) - - for document_id in document_ids: - try: - # 3.2 Validate document status - stmt = select(db_models.Document).where( - db_models.Document.id == document_id, - db_models.Document.user == user_id, - db_models.Document.collection_id == collection_id, - db_models.Document.status == db_models.DocumentStatus.UPLOADED - ) - result = await session.execute(stmt) - document = result.scalar_one_or_none() - - if not document: - # Document doesn't exist or wrong status - failed_documents.append(...) 
- failed_count += 1 - continue - - # 3.3 Update document status: UPLOADED → PENDING - document.status = db_models.DocumentStatus.PENDING - document.gmt_updated = utc_now() - session.add(document) - - # 3.4 Create index records - await document_index_manager.create_or_update_document_indexes( - document_id=document.id, - index_types=index_types, - session=session - ) - - confirmed_count += 1 - - except Exception as e: - logger.error(f"Failed to confirm document {document_id}: {e}") - failed_documents.append(...) - failed_count += 1 - - return confirmed_count, failed_count, failed_documents - - # 4. Execute transaction - await self.db_ops.execute_with_transaction(_confirm_documents_atomically) - - # 5. Trigger index reconciliation task - _trigger_index_reconciliation() - - return view_models.ConfirmDocumentsResponse( - confirmed_count=confirmed_count, - failed_count=failed_count, - failed_documents=failed_documents - ) +# Docker environment +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects ``` -**Index Type Configuration**: - -```python -def _get_index_types_for_collection(self, collection_config: dict) -> list: - """Get the list of index types to create based on collection configuration""" - index_types = [ - db_models.DocumentIndexType.VECTOR, # Vector index (required) - db_models.DocumentIndexType.FULLTEXT, # Fulltext index (required) - ] - - if collection_config.get("enable_knowledge_graph", False): - index_types.append(db_models.DocumentIndexType.GRAPH) - if collection_config.get("enable_summary", False): - index_types.append(db_models.DocumentIndexType.SUMMARY) - if collection_config.get("enable_vision", False): - index_types.append(db_models.DocumentIndexType.VISION) - - return index_types +Storage path example: +``` +.objects/ +└── user-google-oauth2-123456/ + └── col_abc123/ + └── doc_xyz789/ + ├── original.pdf # Original file + ├── converted.pdf # Converted PDF + ├── processed_content.md # Parsed Markdown + ├── chunks/ # Chunked data + │ 
├── chunk_0.json + │ └── chunk_1.json + └── images/ # Extracted images + ├── page_0.png + └── page_1.png ``` -### One-step Upload Interface (Legacy Compatibility) +**2. S3 Storage (Compatible with AWS S3/MinIO/OSS, etc.)** -**Interface**: `POST /api/v1/collections/{collection_id}/documents` +Use cases: +- Production environments +- Large-scale deployments +- Distributed deployments +- High availability and disaster recovery needs -```python -@router.post("/collections/{collection_id}/documents", tags=["documents"]) -async def create_documents_view( - request: Request, - collection_id: str, - files: List[UploadFile] = File(...), - user: User = Depends(required_user), -) -> view_models.DocumentList: - return await document_service.create_documents(str(user.id), collection_id, files) +Configuration: +```bash +OBJECT_STORE_TYPE=s3 +OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 address +OBJECT_STORE_S3_REGION=us-east-1 # AWS Region +OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key +OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key +OBJECT_STORE_S3_BUCKET=aperag # Bucket name +OBJECT_STORE_S3_PREFIX_PATH=dev/ # Optional path prefix +OBJECT_STORE_S3_USE_PATH_STYLE=true # Set to true for MinIO ``` -**Core Logic**: +#### 2.2 Object Storage Path Rules -- One-step completion: file upload + confirmation -- Directly create documents with PENDING status -- Immediately create index records -- Support batch upload of multiple files +**Path Format**: +``` +{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename} +``` -## Data Storage Layer +**Components**: +- `prefix`: Optional global prefix (S3 only) +- `user_id`: User ID (`|` replaced with `-`) +- `collection_id`: Collection ID +- `document_id`: Document ID +- `filename`: Filename (e.g., `original.pdf`, `page_0.png`) -### 1. 
PostgreSQL - Document Metadata +**Multi-tenancy Isolation**: +- Each user has an independent namespace +- Each collection has an independent storage directory +- Each document has an independent folder -#### 1.1 Document Table +### Phase 3: Document Confirmation and Index Building -**File**: `aperag/db/models.py` +#### 3.1 Confirmation Flow -```python -class Document(Base): - __tablename__ = "document" - __table_args__ = ( - UniqueConstraint("collection_id", "name", "gmt_deleted", - name="uq_document_collection_name_deleted"), - ) - - id = Column(String(24), primary_key=True, default=lambda: "doc" + random_id()) - name = Column(String(1024), nullable=False) - user = Column(String(256), nullable=False, index=True) - collection_id = Column(String(24), nullable=True, index=True) - status = Column(EnumColumn(DocumentStatus), nullable=False, index=True) - size = Column(BigInteger, nullable=False) - content_hash = Column(String(64), nullable=True, index=True) # SHA-256 - object_path = Column(Text, nullable=True) - doc_metadata = Column(Text, nullable=True) # JSON string - gmt_created = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_updated = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_deleted = Column(DateTime(timezone=True), nullable=True, index=True) +``` +User clicks "Save to Collection" + │ + ▼ +Frontend calls confirm API + │ + ▼ +Service layer processes: + │ + ├─► Validate collection configuration + │ + ├─► Check Quota (deduct quota at confirmation stage) + │ + └─► For each document_id: + │ + ├─► Verify document status is UPLOADED + │ + ├─► Update document status: UPLOADED → PENDING + │ + ├─► Create index records based on collection config: + │ ├─ VECTOR (Vector index, required) + │ ├─ FULLTEXT (Full-text index, required) + │ ├─ GRAPH (Knowledge graph, optional) + │ ├─ SUMMARY (Document summary, optional) + │ └─ VISION (Vision index, optional) + │ + └─► Return confirmation result + │ + ▼ +Trigger Celery task: 
reconcile_document_indexes + │ + ▼ +Background async index building ``` -**Status Enumeration** (`DocumentStatus`): +#### 3.2 Quota Management -| Status | Description | When Set | -|--------|-------------|----------| -| `UPLOADED` | Uploaded to temporary storage | upload_document API | -| `PENDING` | Waiting for index building | confirm_documents API | -| `RUNNING` | Index building in progress | Celery task starts | -| `COMPLETE` | All indexes complete | All index statuses become ACTIVE | -| `FAILED` | Index building failed | Any index fails | -| `DELETED` | Deleted | delete_document API | -| `EXPIRED` | Temporary document expired | Scheduled cleanup task (not implemented) | +**Check Timing**: +- ❌ Not checked during upload phase (temporary storage doesn't consume quota) +- ✅ Checked during confirmation phase (formal addition consumes quota) -#### 1.2 DocumentIndex Table +**Quota Types**: -```python -class DocumentIndex(Base): - __tablename__ = "document_index" - __table_args__ = ( - UniqueConstraint("document_id", "index_type", name="uq_document_index"), - ) - - id = Column(Integer, primary_key=True, index=True) - document_id = Column(String(24), nullable=False, index=True) - index_type = Column(EnumColumn(DocumentIndexType), nullable=False, index=True) - status = Column(EnumColumn(DocumentIndexStatus), nullable=False, - default=DocumentIndexStatus.PENDING, index=True) - version = Column(Integer, nullable=False, default=1) - observed_version = Column(Integer, nullable=False, default=0) - index_data = Column(Text, nullable=True) # JSON data - error_message = Column(Text, nullable=True) - gmt_created = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_updated = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_last_reconciled = Column(DateTime(timezone=True), nullable=True) -``` +1. 
**User Global Quota** + - `max_document_count`: Total document count limit per user + - Default: 1000 (configurable via `MAX_DOCUMENT_COUNT`) + +2. **Per-Collection Quota** + - `max_document_count_per_collection`: Document count limit per collection + - Excludes `UPLOADED` and `DELETED` status documents -**Index Types** (`DocumentIndexType`): +**Quota Exceeded Handling**: +- Throws `QuotaExceededException` +- Returns HTTP 400 error +- Includes current usage and quota limit information -- `VECTOR`: Vector index (Qdrant, etc.) -- `FULLTEXT`: Fulltext index (Elasticsearch) -- `GRAPH`: Knowledge graph index (Neo4j/PostgreSQL) -- `SUMMARY`: Document summary -- `VISION`: Vision index (image content) +### Phase 4: Document Parsing and Format Conversion -**Index Status** (`DocumentIndexStatus`): +#### 4.1 Parser Architecture -| Status | Description | -|--------|-------------| -| `PENDING` | Waiting for processing | -| `CREATING` | Creating | -| `ACTIVE` | Ready for use | -| `DELETING` | Marked for deletion | -| `DELETION_IN_PROGRESS` | Deleting | -| `FAILED` | Failed | +System uses a **multi-parser chain invocation** architecture, where each parser handles specific file types: -### 2. 
Object Store - File Storage +``` +DocParser (Main Controller) + │ + ├─► MinerUParser + │ └─ Function: High-precision PDF parsing (commercial API) + │ └─ Supports: .pdf + │ + ├─► DocRayParser + │ └─ Function: Document layout analysis and content extraction + │ └─ Supports: .pdf, .docx, .pptx, .xlsx + │ + ├─► ImageParser + │ └─ Function: Image content recognition (OCR + vision understanding) + │ └─ Supports: .jpg, .png, .gif, .bmp, .tiff + │ + ├─► AudioParser + │ └─ Function: Audio transcription (Speech-to-Text) + │ └─ Supports: .mp3, .wav, .m4a + │ + └─► MarkItDownParser (Fallback) + └─ Function: Universal document to Markdown conversion + └─ Supports: Almost all common formats +``` -#### 2.1 Storage Backend Configuration +#### 4.2 Parser Configuration -**File**: `aperag/config.py` +**Configuration Method**: Dynamically controlled via Collection Config -```python -class Config(BaseSettings): - # Object store type: "local" or "s3" - object_store_type: str = Field("local", alias="OBJECT_STORE_TYPE") - - # Local storage config - object_store_local_config: Optional[LocalObjectStoreConfig] = None - - # S3 storage config - object_store_s3_config: Optional[S3Config] = None +```json +{ + "parser_config": { + "use_mineru": false, // Enable MinerU (requires API Token) + "use_doc_ray": false, // Enable DocRay + "use_markitdown": true, // Enable MarkItDown (default) + "mineru_api_token": "xxx" // MinerU API Token (optional) + } +} ``` **Environment Variable Configuration**: - ```bash -# Local storage (default) -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=.objects - -# S3 storage (MinIO/AWS S3) -OBJECT_STORE_TYPE=s3 -OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 -OBJECT_STORE_S3_ACCESS_KEY=minioadmin -OBJECT_STORE_S3_SECRET_KEY=minioadmin -OBJECT_STORE_S3_BUCKET=aperag -OBJECT_STORE_S3_REGION=us-east-1 -OBJECT_STORE_S3_PREFIX_PATH= -OBJECT_STORE_S3_USE_PATH_STYLE=true +USE_MINERU_API=false # Globally enable MinerU +MINERU_API_TOKEN=your_token # MinerU API Token ``` 
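To make the interaction between the collection-level `parser_config` and the environment variables concrete, here is a minimal sketch of how a parser chain might be assembled. It is not ApeRAG's actual implementation: the helper name `build_parser_chain`, the rule that either the collection flag or `USE_MINERU_API` can enable MinerU, the requirement that a token be present, and the ordering (MinerU, then DocRay, then MarkItDown as the last resort) are all assumptions made for illustration.

```python
import os
from typing import Mapping


def build_parser_chain(
    parser_config: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return parser names in the order they would be tried.

    Hypothetical helper: the precedence rules below are assumptions,
    not behavior confirmed by the ApeRAG source.
    """
    chain: list[str] = []

    # Assumption: a collection-level token takes precedence over the env var.
    token = parser_config.get("mineru_api_token") or env.get("MINERU_API_TOKEN")

    # Assumption: MinerU can be enabled per collection or globally.
    mineru_enabled = (
        bool(parser_config.get("use_mineru"))
        or env.get("USE_MINERU_API", "false").lower() == "true"
    )

    # Assumption: MinerU is skipped when no API token is available.
    if mineru_enabled and token:
        chain.append("MinerUParser")

    if parser_config.get("use_doc_ray"):
        chain.append("DocRayParser")

    # MarkItDown is documented as the default, so it is on unless disabled.
    if parser_config.get("use_markitdown", True):
        chain.append("MarkItDownParser")

    return chain


if __name__ == "__main__":
    default_config = {
        "use_mineru": False,
        "use_doc_ray": False,
        "use_markitdown": True,
    }
    print(build_parser_chain(default_config, env={}))
```

Returning names rather than parser instances keeps the sketch independent of the real parser classes; in practice the list would be mapped onto the actual parser objects, and a `FallbackError` raised by one parser would move on to the next entry.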
-#### 2.2 Object Storage Interface - -**File**: `aperag/objectstore/base.py` +#### 4.3 Parsing Flow -```python -class AsyncObjectStore(ABC): - @abstractmethod - async def put(self, path: str, data: bytes | IO[bytes]): - """Upload object to storage""" - ... - - @abstractmethod - async def get(self, path: str) -> IO[bytes] | None: - """Download object from storage""" - ... - - @abstractmethod - async def delete_objects_by_prefix(self, path_prefix: str): - """Delete all objects with given prefix""" - ... ``` - -**Factory Method**: - -```python -def get_async_object_store() -> AsyncObjectStore: - """Factory function to get an asynchronous AsyncObjectStore instance""" - match settings.object_store_type: - case "local": - from aperag.objectstore.local import AsyncLocal, LocalConfig - return AsyncLocal(LocalConfig(**config_dict)) - case "s3": - from aperag.objectstore.s3 import AsyncS3, S3Config - return AsyncS3(S3Config(**config_dict)) +Celery Worker receives indexing task + │ + ▼ +1. Download original file from object store + │ + ▼ +2. Select Parser based on file extension + │ + ├─► Try first matching Parser + │ ├─ Success: Return parsing result + │ └─ Failure: FallbackError → Try next Parser + │ + └─► Final fallback: MarkItDownParser + │ + ▼ +3. Parsing result (Parts): + │ + ├─► MarkdownPart: Text content + │ └─ Contains: headings, paragraphs, lists, tables, etc. + │ + ├─► PdfPart: PDF file + │ └─ For: linearization, page rendering + │ + └─► AssetBinPart: Binary resources + └─ Contains: images, embedded files, etc. + │ + ▼ +4. 
Post-processing: + │ + ├─► PDF pages to images (required for Vision index) + │ └─ Each page rendered as PNG image + │ └─ Saved to {document_path}/images/page_N.png + │ + ├─► PDF linearization (speed up browser loading) + │ └─ Use pikepdf to optimize PDF structure + │ └─ Saved to {document_path}/converted.pdf + │ + └─► Extract text content (plain text) + └─ Merge all MarkdownPart content + └─ Saved to {document_path}/processed_content.md + │ + ▼ +5. Save to object store ``` -#### 2.3 Local Storage Implementation - -**File**: `aperag/objectstore/local.py` +#### 4.4 Format Conversion Examples -```python -class AsyncLocal(AsyncObjectStore): - def __init__(self, cfg: LocalConfig): - self._base_storage_path = Path(cfg.root_dir).resolve() - self._base_storage_path.mkdir(parents=True, exist_ok=True) - - def _resolve_object_path(self, path: str) -> Path: - """Resolve and validate object path (security check)""" - path_components = Path(path.lstrip("/")).parts - if ".." in path_components: - raise ValueError("Invalid path: '..' not allowed") - - prospective_path = self._base_storage_path.joinpath(*path_components) - normalized_path = Path(os.path.abspath(prospective_path)) - - if self._base_storage_path not in normalized_path.parents: - raise ValueError("Path traversal attempt detected") - - return prospective_path - - async def put(self, path: str, data: bytes | IO[bytes]): - """Write file to local filesystem""" - full_path = self._resolve_object_path(path) - full_path.parent.mkdir(parents=True, exist_ok=True) - - async with aiofiles.open(full_path, "wb") as f: - if isinstance(data, bytes): - await f.write(data) - else: - await f.write(data.read()) +**Example 1: PDF Document** +``` +Input: user_manual.pdf (5 MB) + │ + ▼ +Parser selection: DocRayParser / MarkItDownParser + │ + ▼ +Output Parts: + ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..." 
+ └─ PdfPart: + │ + ▼ +Post-processing: + ├─ Render 50 pages to images → images/page_0.png ~ page_49.png + ├─ Linearize PDF → converted.pdf + └─ Extract text → processed_content.md ``` -**Storage Path Example**: - +**Example 2: Image File** ``` -.objects/ -└── user-google-oauth2-123456/ - └── col_abc123/ - └── doc_xyz789/ - ├── original.pdf # Original file - ├── converted.pdf # Converted PDF - ├── chunks/ # Chunk data - │ ├── chunk_0.json - │ └── chunk_1.json - └── images/ # Extracted images - ├── image_0.png - └── image_1.png +Input: screenshot.png (2 MB) + │ + ▼ +Parser selection: ImageParser + │ + ▼ +Output Parts: + ├─ MarkdownPart: "[OCR extracted text]" + └─ AssetBinPart: (vision_index=true) + │ + ▼ +Post-processing: + └─ Save original image copy → images/file.png ``` -#### 2.4 S3 Storage Implementation - -**File**: `aperag/objectstore/s3.py` - -```python -class AsyncS3(AsyncObjectStore): - def __init__(self, cfg: S3Config): - self.cfg = cfg - self._s3_client = None - - async def put(self, path: str, data: bytes | IO[bytes]): - """Upload file to S3""" - client = await self._get_client() - path = self._final_path(path) - - if isinstance(data, bytes): - data = BytesIO(data) - - await client.upload_fileobj(data, self.cfg.bucket, path) - - def _final_path(self, path: str) -> str: - """Add prefix path if configured""" - if self.cfg.prefix_path: - return f"{self.cfg.prefix_path.rstrip('/')}/{path.lstrip('/')}" - return path.lstrip('/') +**Example 3: Audio File** +``` +Input: meeting_record.mp3 (50 MB) + │ + ▼ +Parser selection: AudioParser + │ + ▼ +Output Parts: + └─ MarkdownPart: "[Transcribed meeting content]" + │ + ▼ +Post-processing: + └─ Save transcription text → processed_content.md ``` -### 3. 
Vector Database - Vector Indexes +### Phase 5: Index Building -**Supported Vector Databases**: +#### 5.1 Index Types and Functions -- Qdrant (default) -- Elasticsearch -- Other compatible vector databases +| Index Type | Required | Function Description | Storage Location | +|-----------|----------|---------------------|------------------| +| **VECTOR** | ✅ Required | Vector retrieval, semantic search | Qdrant / Elasticsearch | +| **FULLTEXT** | ✅ Required | Full-text search, keyword search | Elasticsearch | +| **GRAPH** | ❌ Optional | Knowledge graph, entity & relation extraction | Neo4j / PostgreSQL | +| **SUMMARY** | ❌ Optional | Document summary, LLM generated | PostgreSQL (index_data) | +| **VISION** | ❌ Optional | Vision understanding, image content analysis | Qdrant (vectors) + PG (metadata) | -**Configuration Example**: +#### 5.2 Index Building Flow -```bash -VECTOR_DB_TYPE=qdrant -VECTOR_DB_CONTEXT='{"url":"http://localhost","port":6333,"distance":"Cosine"}' ``` - -### 4. Elasticsearch - Fulltext Indexes - -**Environment Variables**: - -```bash -ES_HOST_NAME=127.0.0.1 -ES_PORT=9200 -ES_USER= -ES_PASSWORD= -ES_PROTOCOL=http +Celery Worker: reconcile_document_indexes task + │ + ▼ +1. Scan DocumentIndex table, find indexes needing processing + │ + ├─► PENDING status + observed_version < version + │ └─ Need to create or update index + │ + └─► DELETING status + └─ Need to delete index + │ + ▼ +2. Group by document, process one by one + │ + ▼ +3. 
For each document: + │ + ├─► parse_document (parse document) + │ ├─ Download original file from object store + │ ├─ Call DocParser to parse + │ └─ Return ParsedDocumentData + │ + └─► For each index type: + │ + ├─► create_index (create/update index) + │ │ + │ ├─ VECTOR index: + │ │ ├─ Document chunking + │ │ ├─ Generate vectors using Embedding model + │ │ └─ Write to Qdrant + │ │ + │ ├─ FULLTEXT index: + │ │ ├─ Extract plain text content + │ │ ├─ Chunk by paragraph/section + │ │ └─ Write to Elasticsearch + │ │ + │ ├─ GRAPH index: + │ │ ├─ Extract entities using LightRAG + │ │ ├─ Extract entity relationships + │ │ └─ Write to Neo4j/PostgreSQL + │ │ + │ ├─ SUMMARY index: + │ │ ├─ Generate summary using LLM + │ │ └─ Save to DocumentIndex.index_data + │ │ + │ └─ VISION index: + │ ├─ Extract image Assets + │ ├─ Understand image content using Vision LLM + │ ├─ Generate image description vectors + │ └─ Write to Qdrant + │ + └─► Update index status + ├─ Success: CREATING → ACTIVE + └─ Failure: CREATING → FAILED + │ + ▼ +4. Update document overall status + │ + ├─ All indexes ACTIVE → Document.status = COMPLETE + ├─ Any index FAILED → Document.status = FAILED + └─ Some indexes still processing → Document.status = RUNNING ``` -### 5. Knowledge Graph Storage - -**Supported Backends**: - -- Neo4j (recommended) -- PostgreSQL (simulated graph database) -- NebulaGraph - -## Document Duplicate Detection Mechanism +#### 5.3 Document Chunking -### Detection Logic +**Chunking Strategy**: +- Recursive character splitting (RecursiveCharacterTextSplitter) +- Prioritize splitting by natural paragraphs and sections +- Maintain context overlap -**Method**: `_check_duplicate_document` - -```python -async def _check_duplicate_document( - self, user: str, collection_id: str, filename: str, file_hash: str -) -> db_models.Document | None: - """ - Check if a document with the same name exists in the collection. - Returns the existing document if found, None otherwise. 
- Raises DocumentNameConflictException if same name but different file hash. - """ - # 1. Query document with same name - existing_doc = await self.db_ops.query_document_by_name_and_collection( - user, collection_id, filename - ) - - if existing_doc: - # 2. If no hash (legacy document), skip hash check - if existing_doc.content_hash is None: - logger.warning(f"Existing document {existing_doc.id} has no file hash") - return existing_doc - - # 3. Same hash: true duplicate (idempotent) - if existing_doc.content_hash == file_hash: - return existing_doc - - # 4. Different hash: filename conflict - raise DocumentNameConflictException(filename, collection_id) - - return None +**Chunking Parameters**: +```jsonc +{ + "chunk_size": 1000, // Max characters per chunk + "chunk_overlap": 200, // Overlap characters + "separators": ["\n\n", "\n", " ", ""] // Separator priority +} ``` -### Hash Algorithm - -**SHA-256 File Hash Calculation**: - -```python -def calculate_file_hash(file_content: bytes) -> str: - """Calculate SHA-256 hash of file content""" - import hashlib - return hashlib.sha256(file_content).hexdigest() +**Chunking Result Storage**: ``` - -### Duplicate Strategy - -| Scenario | Filename | Hash | Behavior | -|----------|----------|------|----------| -| Exact duplicate | Same | Same | Return existing document (idempotent) | -| Filename conflict | Same | Different | Throw `DocumentNameConflictException` | -| New document | Different | - | Create new document | - -## Quota Management - -### Check Timing - -**Check quota at confirmation stage** (not at upload stage), because: - -1. Upload stage is only temporary storage, doesn't consume formal quota -2. Confirmation stage actually consumes resources (index building) -3. Allows users to upload first, then selectively confirm - -### Quota Types - -```python -async def _check_document_quotas( - self, session: AsyncSession, user: str, collection_id: str, count: int -): - """Check and consume document quotas""" - # 1. 
Check and consume user global quota - await quota_service.check_and_consume_quota( - user, "max_document_count", count, session - ) - - # 2. Check per-collection quota - stmt = select(func.count()).select_from(db_models.Document).where( - db_models.Document.collection_id == collection_id, - db_models.Document.status != db_models.DocumentStatus.DELETED, - db_models.Document.status != db_models.DocumentStatus.UPLOADED, # Don't count temporary documents - ) - existing_doc_count = await session.scalar(stmt) - - # 3. Get quota limit - stmt = select(UserQuota).where( - UserQuota.user == user, - UserQuota.key == "max_document_count_per_collection" - ) - per_collection_quota = (await session.execute(stmt)).scalars().first() - - # 4. Validate not exceeded - if per_collection_quota and (existing_doc_count + count) > per_collection_quota.quota_limit: - raise QuotaExceededException( - "max_document_count_per_collection", - per_collection_quota.quota_limit, - existing_doc_count - ) +{document_path}/chunks/ + ├─ chunk_0.json: {"text": "...", "metadata": {...}} + ├─ chunk_1.json: {"text": "...", "metadata": {...}} + └─ ... 
``` -### Default Quotas - -**File**: `aperag/config.py` +## Database Design + +### Table 1: document (Document Metadata) + +**Table Structure**: + +| Field | Type | Description | Index | +|-------|------|-------------|-------| +| `id` | String(24) | Document ID, primary key, format: `doc{random_id}` | PK | +| `name` | String(1024) | Filename | - | +| `user` | String(256) | User ID (supports multiple IDPs) | ✅ Index | +| `collection_id` | String(24) | Collection ID | ✅ Index | +| `status` | Enum | Document status (see table below) | ✅ Index | +| `size` | BigInteger | File size (bytes) | - | +| `content_hash` | String(64) | SHA-256 hash (for deduplication) | ✅ Index | +| `object_path` | Text | Object store path (deprecated, use doc_metadata) | - | +| `doc_metadata` | Text | Document metadata (JSON string) | - | +| `gmt_created` | DateTime(tz) | Creation time (UTC) | - | +| `gmt_updated` | DateTime(tz) | Update time (UTC) | - | +| `gmt_deleted` | DateTime(tz) | Deletion time (soft delete) | ✅ Index | + +**Unique Constraint**: +```sql +UNIQUE INDEX uq_document_collection_name_active + ON document (collection_id, name) + WHERE gmt_deleted IS NULL; +``` +- Within the same collection, active document names cannot be duplicated +- Deleted documents are excluded from uniqueness check + +**Document Status Enum** (`DocumentStatus`): + +| Status | Description | When Set | Visibility | +|--------|-------------|----------|------------| +| `UPLOADED` | Uploaded to temporary storage | `upload_document` API | Frontend file selection UI | +| `PENDING` | Waiting for index building | `confirm_documents` API | Document list (processing) | +| `RUNNING` | Index building in progress | Celery task starts processing | Document list (processing) | +| `COMPLETE` | All indexes completed | All indexes become ACTIVE | Document list (available) | +| `FAILED` | Index building failed | Any index fails | Document list (failed) | +| `DELETED` | Deleted | `delete_document` API | Not visible (soft 
delete) | +| `EXPIRED` | Temporary document expired | Scheduled cleanup task | Not visible | + +**Document Metadata Example** (`doc_metadata` JSON field): +```json +{ + "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf", + "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf", + "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md", + "images": [ + "user-xxx/col_xxx/doc_xxx/images/page_0.png", + "user-xxx/col_xxx/doc_xxx/images/page_1.png" + ], + "parser_used": "DocRayParser", + "parse_duration_ms": 5420, + "page_count": 50, + "custom_field": "value" +} +``` -```python -class Config(BaseSettings): - max_document_count: int = Field(1000, alias="MAX_DOCUMENT_COUNT") - max_document_size: int = Field(100 * 1024 * 1024, alias="MAX_DOCUMENT_SIZE") # 100MB +### Table 2: document_index (Index Status Management) + +**Table Structure**: + +| Field | Type | Description | Index | +|-------|------|-------------|-------| +| `id` | Integer | Auto-increment ID, primary key | PK | +| `document_id` | String(24) | Related document ID | ✅ Index | +| `index_type` | Enum | Index type (see table below) | ✅ Index | +| `status` | Enum | Index status (see table below) | ✅ Index | +| `version` | Integer | Index version number | - | +| `observed_version` | Integer | Processed version number | - | +| `index_data` | Text | Index data (JSON), e.g., summary content | - | +| `error_message` | Text | Error message (on failure) | - | +| `gmt_created` | DateTime(tz) | Creation time | - | +| `gmt_updated` | DateTime(tz) | Update time | - | +| `gmt_last_reconciled` | DateTime(tz) | Last reconciliation time | - | + +**Unique Constraint**: +```sql +UNIQUE CONSTRAINT uq_document_index + ON document_index (document_id, index_type); ``` +- Each document has only one record per index type -## Asynchronous Task Processing (Celery) +**Index Type Enum** (`DocumentIndexType`): -### Index Reconciliation Mechanism +| Type | Value | Description | External Storage | 
+|------|-------|-------------|------------------| +| `VECTOR` | "VECTOR" | Vector index | Qdrant / Elasticsearch | +| `FULLTEXT` | "FULLTEXT" | Full-text index | Elasticsearch | +| `GRAPH` | "GRAPH" | Knowledge graph | Neo4j / PostgreSQL | +| `SUMMARY` | "SUMMARY" | Document summary | PostgreSQL (index_data) | +| `VISION` | "VISION" | Vision index | Qdrant + PostgreSQL | -**File**: `aperag/service/document_service.py` +**Index Status Enum** (`DocumentIndexStatus`): +| Status | Description | When Set | +|--------|-------------|----------| +| `PENDING` | Waiting for processing | `confirm_documents` creates index record | +| `CREATING` | Creating | Celery Worker starts processing | +| `ACTIVE` | Ready for use | Index building successful | +| `DELETING` | Marked for deletion | `delete_document` API | +| `DELETION_IN_PROGRESS` | Deleting | Celery Worker is deleting | +| `FAILED` | Failed | Index building failed | + +**Version Control Mechanism**: +- `version`: Expected index version (incremented on document update) +- `observed_version`: Processed version number +- When `version > observed_version`, triggers index update + +**Reconciler**: -```python +```sql -def _trigger_index_reconciliation(): - """Trigger index reconciliation task in background""" - try: - from config.celery_tasks import reconcile_document_indexes - reconcile_document_indexes.apply_async() - except Exception as e: - logger.warning(f"Failed to trigger index reconciliation task: {e}") +-- Query indexes needing processing +SELECT * FROM document_index +WHERE status = 'PENDING' + AND observed_version < version; + +-- Update after processing +UPDATE document_index +SET status = 'ACTIVE', + observed_version = version, + gmt_last_reconciled = NOW() +WHERE id = ?; ``` -**Celery Task**: `config/celery_tasks.py` +### Table Relationship Diagram -```python -@celery_app.task(name="reconcile_document_indexes") -def reconcile_document_indexes(): - """Reconcile document indexes based on their status""" - from 
aperag.index.manager import document_index_manager - - # Process PENDING status indexes - document_index_manager.reconcile_pending_indexes() - - # Process DELETING status indexes - document_index_manager.reconcile_deleting_indexes() +``` +┌─────────────────────────────────┐ +│ collection │ +│ ───────────────────────────── │ +│ id (PK) │ +│ name │ +│ config (JSON) │ +│ status │ +│ ... │ +└────────────┬────────────────────┘ + │ 1:N + ▼ +┌─────────────────────────────────┐ +│ document │ +│ ───────────────────────────── │ +│ id (PK) │ +│ collection_id (FK) │◄──── Unique constraint: (collection_id, name) +│ name │ +│ user │ +│ status (Enum) │ +│ size │ +│ content_hash (SHA-256) │ +│ doc_metadata (JSON) │ +│ gmt_created │ +│ gmt_deleted │ +│ ... │ +└────────────┬────────────────────┘ + │ 1:N + ▼ +┌─────────────────────────────────┐ +│ document_index │ +│ ───────────────────────────── │ +│ id (PK) │ +│ document_id (FK) │◄──── Unique constraint: (document_id, index_type) +│ index_type (Enum) │ +│ status (Enum) │ +│ version │ +│ observed_version │ +│ index_data (JSON) │ +│ error_message │ +│ gmt_last_reconciled │ +│ ... │ +└─────────────────────────────────┘ ``` -### Index Building Process +## State Machine and Lifecycle -1. **Document Parsing**: DocParser parses document content -2. **Document Chunking**: Chunking strategy splits document -3. **Vectorization**: Embedding model generates vectors -4. **Vector Index**: Write to vector database -5. **Fulltext Index**: Write to Elasticsearch -6. **Knowledge Graph**: LightRAG extracts entities and relations -7. **Document Summary**: LLM generates summary (optional) -8. 
**Vision Index**: Extract and analyze images (optional) +### Document State Transitions -## File Validation +``` + ┌─────────────────────────────────────────────┐ + │ │ + │ ▼ + [Upload] ──► UPLOADED ──► [Confirm] ──► PENDING ──► RUNNING ──► COMPLETE + │ │ + │ ▼ + │ FAILED + │ │ + │ ▼ + └──────► [Delete] ──────────────► DELETED + │ + ┌───────────────────────────────────┘ + │ + ▼ + EXPIRED (Scheduled cleanup of unconfirmed docs) +``` -### Supported File Types +**Key Transitions**: +1. **UPLOADED → PENDING**: User clicks "Save to Collection" +2. **PENDING → RUNNING**: Celery Worker starts processing +3. **RUNNING → COMPLETE**: All indexes successful +4. **RUNNING → FAILED**: Any index fails +5. **Any status → DELETED**: User deletes document -**File**: `aperag/docparser/doc_parser.py` +### Index State Transitions -```python -class DocParser: - def supported_extensions(self) -> list: - return [ - ".txt", ".md", ".html", ".pdf", - ".docx", ".doc", ".pptx", ".ppt", - ".xlsx", ".xls", ".csv", - ".json", ".xml", ".yaml", ".yml", - ".png", ".jpg", ".jpeg", ".gif", ".bmp", - ".mp3", ".wav", ".m4a", - # ... 
more formats - ] ``` - -**Compressed File Support**: - -```python -SUPPORTED_COMPRESSED_EXTENSIONS = [".zip", ".tar", ".gz", ".tgz"] + [Create index record] ──► PENDING ──► CREATING ──► ACTIVE + │ + ▼ + FAILED + │ + ▼ + ┌──────────► PENDING (retry) + │ + [Delete request] ────────┼──────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted) + │ + └──────────► (directly delete record, if PENDING/FAILED) ``` -### Size Limit +## Async Task Scheduling (Celery) -```python -def _validate_file(self, filename: str, size: int) -> str: - """Validate file extension and size""" - supported_extensions = DocParser().supported_extensions() - supported_extensions += SUPPORTED_COMPRESSED_EXTENSIONS - - file_suffix = os.path.splitext(filename)[1].lower() - - if file_suffix not in supported_extensions: - raise invalid_param("file_type", f"unsupported file type {file_suffix}") - - if size > settings.max_document_size: - raise invalid_param("file_size", "file size is too large") - - return file_suffix -``` +### Task Definitions -## API Response Format - -### UploadDocumentResponse - -**Schema**: `aperag/api/components/schemas/document.yaml` - -```yaml -uploadDocumentResponse: - type: object - properties: - document_id: - type: string - description: ID of the uploaded document - filename: - type: string - description: Name of the uploaded file - size: - type: integer - description: Size of the uploaded file in bytes - status: - type: string - enum: - - UPLOADED - - PENDING - - RUNNING - - COMPLETE - - FAILED - - DELETED - - EXPIRED - description: Status of the document - required: - - document_id - - filename - - size - - status -``` +**Main Task**: `reconcile_document_indexes` +- Trigger timing: + - After `confirm_documents` API call + - Scheduled task (every 30 seconds) + - Manual trigger (admin interface) +- Function: Scan `document_index` table, process indexes needing reconciliation -**Example**: +**Sub-tasks**: +- `parse_document_task`: Parse document content +- 
`create_vector_index_task`: Create vector index +- `create_fulltext_index_task`: Create full-text index +- `create_graph_index_task`: Create knowledge graph index +- `create_summary_index_task`: Create summary index +- `create_vision_index_task`: Create vision index -```json -{ - "document_id": "doc_xyz789abc", - "filename": "user_manual.pdf", - "size": 2048576, - "status": "UPLOADED" -} -``` +### Task Scheduling Strategy -### ConfirmDocumentsResponse - -```yaml -confirmDocumentsResponse: - type: object - properties: - confirmed_count: - type: integer - description: Number of documents successfully confirmed - failed_count: - type: integer - description: Number of documents that failed to confirm - failed_documents: - type: array - items: - type: object - properties: - document_id: - type: string - name: - type: string - error: - type: string - required: - - confirmed_count - - failed_count -``` +**Concurrency Control**: +- Each Worker processes at most N documents simultaneously (default 4) +- Multiple indexes of each document can be built in parallel +- Use Celery's `task_acks_late=True` so a task is acknowledged only after it finishes and is redelivered if its worker crashes mid-run -**Example**: +**Failure Retry**: +- Maximum 3 retries +- Increasing backoff (1 min → 5 min → 15 min) +- Marked as `FAILED` once retries are exhausted -```json -{ - "confirmed_count": 3, - "failed_count": 1, - "failed_documents": [ - { - "document_id": "doc_fail123", - "name": "corrupted.pdf", - "error": "CONFIRMATION_FAILED" - } - ] -} -``` +**Idempotency**: +- All tasks support repeated execution +- Use `observed_version` mechanism to avoid duplicate processing +- Same input produces same output -## Design Features +## Design Features and Advantages ### 1. 
Two-Phase Commit Design **Advantages**: +- ✅ **Better User Experience**: Fast upload response, doesn't block user operations +- ✅ **Selective Addition**: Can selectively confirm partial files after batch upload +- ✅ **Reasonable Resource Control**: Unconfirmed documents don't build indexes, don't consume quota +- ✅ **Failure Recovery Friendly**: Temporary documents can be periodically cleaned up without affecting business -- ✅ Users can upload first then select: batch upload then selectively add -- ✅ Reduce unnecessary resource consumption: unconfirmed documents don't build indexes -- ✅ Better user experience: fast upload response, background async processing -- ✅ More reasonable quota control: only consume quota after confirmation - -**Status Transition**: - +**Status Isolation**: ``` -Upload → UPLOADED → (User confirms) → PENDING → (Celery processes) → RUNNING → COMPLETE - ↓ - FAILED +Temporary status (UPLOADED): + - Not counted in quota + - Doesn't trigger indexing + - Can be automatically cleaned up + +Formal status (PENDING/RUNNING/COMPLETE): + - Counted in quota + - Triggers index building + - Won't be automatically cleaned up ``` ### 2. Idempotency Design -**Duplicate Upload Handling**: - -- Same name and content (same hash): return existing document -- Same name different content (different hash): throw conflict exception -- Completely new document: create new record - -**Benefits**: - -- Network retransmission won't create duplicate documents -- Client can safely retry -- Avoid storage space waste +**File-Level Idempotency**: +- SHA-256 hash deduplication +- Same file uploaded multiple times returns same `document_id` +- Avoids storage space waste -### 3. Multi-tenant Isolation +**API-Level Idempotency**: +- `upload_document`: Repeated upload returns existing document +- `confirm_documents`: Repeated confirmation doesn't create duplicate indexes +- `delete_document`: Repeated deletion returns success (soft delete) -**Storage Path Isolation**: +### 3. 
Multi-Tenancy Isolation +**Storage Isolation**: ``` -user-{user_id}/{collection_id}/{document_id}/... +user-{user_A}/... # User A's files +user-{user_B}/... # User B's files ``` **Database Isolation**: +- All queries filter by `user` field +- Collection-level permission control (`collection.user`) +- Soft delete support (`gmt_deleted`) -- All queries filter by user field -- Collection-level permission control -- Soft delete support (gmt_deleted) +### 4. Flexible Storage Backend -### 4. Flexible Storage Backends - -**Support Local and S3**: +**Unified Interface**: +```python +AsyncObjectStore: + - put(path, data) + - get(path) + - delete_objects_by_prefix(prefix) +``` -- Local: suitable for development, testing, small-scale deployment -- S3: suitable for production, large-scale deployment -- Unified `AsyncObjectStore` interface -- Runtime configuration switching +**Runtime Switching**: +- Switch between Local/S3 via environment variables +- No need to modify business code +- Supports custom storage backends (just implement the interface) ### 5. Transaction Consistency -**Core operations within transactions**: - +**Two-Phase Commit for Database + Object Store**: ```python -async def _upload_document_atomically(session): +async with transaction: # 1. Create database record - # 2. Upload file to object storage + document = create_document_record() + + # 2. Upload to object store + await object_store.put(path, data) + # 3. Update metadata + document.doc_metadata = json.dumps(metadata) + # All operations succeed to commit, any failure rolls back ``` -**Benefits**: +**Failure Handling**: +- Database record creation fails: Don't upload file +- File upload fails: Rollback database record +- Metadata update fails: Rollback previous operations -- Avoid partially successful dirty data -- Database records and object storage remain consistent -- Automatic cleanup on failure +### 6. Observability -### 6. 
Clear Layered Architecture +**Audit Logging**: +- `@audit` decorator records all document operations +- Includes: user, time, operation type, resource ID -``` -View Layer (views/collections.py) - ↓ calls -Service Layer (service/document_service.py) - ↓ calls -Repository Layer (db/ops.py, objectstore/) - ↓ accesses -Storage Layer (PostgreSQL, S3, Qdrant, ES, Neo4j) -``` +**Task Tracking**: +- `gmt_last_reconciled`: Last processing time +- `error_message`: Failure reason +- Celery task ID: Used for end-to-end log tracing -**Separation of Concerns**: - -- View: HTTP handling, parameter validation, authentication -- Service: business logic, transaction orchestration -- Repository: data access -- Storage: data persistence +**Monitoring Metrics**: +- Document upload rate +- Index building duration +- Failure rate statistics ## Performance Optimization -### 1. Chunked Upload (Planned, Not Implemented) +### 1. Async Processing -```python -# Large file chunked upload support -async def upload_document_chunk( - document_id: str, - chunk_index: int, - chunk_data: bytes, - total_chunks: int -): - # Upload single chunk - # Merge after all chunks complete - pass -``` +**Upload Doesn't Block**: +- Returns as soon as the file is written to the object store +- Index building executes asynchronously in Celery +- Frontend gets progress via polling or WebSocket ### 2. Batch Operations -- `confirm_documents` supports batch confirmation -- `delete_documents` supports batch deletion -- Batch query index status +**Batch Confirmation**: +```python +confirm_documents(document_ids=[id1, id2, ..., idN]) +``` +- Process multiple documents in one transaction +- Batch create index records +- Reduce database round-trips -### 3. Asynchronous Processing +### 3. 
Caching Strategy -- File upload returns immediately -- Index building executes asynchronously in Celery -- Frontend polls or uses WebSocket for progress +**Parsing Result Cache**: +- Parsed content saved to `processed_content.md` +- Subsequent index rebuilds can read directly without re-parsing -### 4. Object Storage Optimization +**Chunking Result Cache**: +- Chunking results saved to `chunks/` directory +- Vector index rebuilds can reuse chunking results -- S3 uses multipart upload -- Local uses aiofiles for async writes -- Support Range requests (partial download) +### 4. Parallel Index Building + +**Multiple Indexes in Parallel**: +```python +# VECTOR, FULLTEXT, GRAPH can be built in parallel +await asyncio.gather( + create_vector_index(), + create_fulltext_index(), + create_graph_index() +) +``` ## Error Handling ### Common Exceptions -```python -# 1. Collection doesn't exist or unavailable -raise ResourceNotFoundException("Collection", collection_id) -raise CollectionInactiveException(collection_id) - -# 2. File validation failed -raise invalid_param("file_type", f"unsupported file type {file_suffix}") -raise invalid_param("file_size", "file size is too large") - -# 3. 
Duplicate conflict -raise DocumentNameConflictException(filename, collection_id) +| Exception Type | HTTP Status | Trigger Scenario | Handling Suggestion | +|---------------|-------------|------------------|---------------------| +| `ResourceNotFoundException` | 404 | Collection/document doesn't exist | Check if ID is correct | +| `CollectionInactiveException` | 400 | Collection not active | Wait for collection initialization | +| `DocumentNameConflictException` | 409 | Same name, different content | Rename file or delete old document | +| `QuotaExceededException` | 429 | Quota exceeded | Upgrade plan or delete old documents | +| `InvalidFileTypeException` | 400 | Unsupported file type | Check supported file type list | +| `FileSizeTooLargeException` | 413 | File too large | Split file or compress | -# 4. Quota exceeded -raise QuotaExceededException("max_document_count", limit, current) +### Exception Propagation -# 5. Document not found -raise DocumentNotFoundException(f"Document not found: {document_id}") ``` - -### Exception Handling Hierarchy - -**Unified exception handling at View layer**: - -```python -# aperag/exception_handlers.py -@app.exception_handler(BusinessException) -async def business_exception_handler(request: Request, exc: BusinessException): - return JSONResponse( - status_code=400, - content={ - "error_code": exc.error_code.name, - "message": str(exc) - } - ) +Service Layer throws exception + │ + ▼ +View Layer catches and converts + │ + ▼ +Exception Handler unified handling + │ + ▼ +Return standard JSON response: +{ + "error_code": "QUOTA_EXCEEDED", + "message": "Document count limit exceeded", + "details": { + "limit": 1000, + "current": 1000 + } +} ``` -## Related Files +## Related Files Index ### Core Implementation -- `aperag/views/collections.py` - View layer interface -- `aperag/service/document_service.py` - Service layer business logic -- `aperag/source/upload.py` - UploadSource implementation -- `aperag/db/models.py` - Database models 
-- `aperag/db/ops.py` - Database operations -- `aperag/api/components/schemas/document.yaml` - OpenAPI Schema +- **View Layer**: `aperag/views/collections.py` - HTTP interface definition +- **Service Layer**: `aperag/service/document_service.py` - Business logic +- **Database Models**: `aperag/db/models.py` - Document, DocumentIndex table definitions +- **Database Operations**: `aperag/db/ops.py` - CRUD operation encapsulation ### Object Storage -- `aperag/objectstore/base.py` - Storage interface definition -- `aperag/objectstore/local.py` - Local storage implementation -- `aperag/objectstore/s3.py` - S3 storage implementation +- **Interface Definition**: `aperag/objectstore/base.py` - AsyncObjectStore abstract class +- **Local Implementation**: `aperag/objectstore/local.py` - Local filesystem storage +- **S3 Implementation**: `aperag/objectstore/s3.py` - S3-compatible storage -### Document Processing +### Document Parsing -- `aperag/docparser/doc_parser.py` - Document parser -- `aperag/docparser/chunking.py` - Document chunking -- `aperag/index/manager.py` - Index manager -- `aperag/index/vector_index.py` - Vector index -- `aperag/index/fulltext_index.py` - Fulltext index -- `aperag/index/graph_index.py` - Graph index +- **Main Controller**: `aperag/docparser/doc_parser.py` - DocParser +- **Parser Implementations**: + - `aperag/docparser/mineru_parser.py` - MinerU PDF parsing + - `aperag/docparser/docray_parser.py` - DocRay document parsing + - `aperag/docparser/markitdown_parser.py` - MarkItDown universal parsing + - `aperag/docparser/image_parser.py` - Image OCR + - `aperag/docparser/audio_parser.py` - Audio transcription +- **Document Processing**: `aperag/index/document_parser.py` - Parsing flow orchestration -### Task Queue +### Index Building -- `config/celery_tasks.py` - Celery task definitions -- `aperag/tasks/` - Task implementations +- **Index Management**: `aperag/index/manager.py` - DocumentIndexManager +- **Vector Index**: 
`aperag/index/vector_index.py` - VectorIndexer +- **Full-text Index**: `aperag/index/fulltext_index.py` - FulltextIndexer +- **Knowledge Graph**: `aperag/index/graph_index.py` - GraphIndexer +- **Document Summary**: `aperag/index/summary_index.py` - SummaryIndexer +- **Vision Index**: `aperag/index/vision_index.py` - VisionIndexer -### Frontend Implementation - -- `web/src/app/workspace/collections/[collectionId]/documents/page.tsx` - Document list page -- `web/src/components/documents/upload-documents.tsx` - Upload component - -## Summary +### Task Scheduling -ApeRAG's document upload module adopts a **two-phase commit + idempotency design + flexible storage** architecture: +- **Task Definitions**: `config/celery_tasks.py` - Celery task registration +- **Reconciler**: `aperag/tasks/reconciler.py` - DocumentIndexReconciler +- **Document Tasks**: `aperag/tasks/document.py` - DocumentIndexTask -1. **Two-Phase Commit**: Upload (UPLOADED) → Confirm (PENDING) → Index building -2. **SHA-256 Hash Deduplication**: Avoid duplicate documents, support idempotent uploads -3. **Flexible Storage Backends**: Local/S3 configurable switching -4. **Quota Management**: Deduct quota only at confirmation stage, reasonable resource control -5. **Multi-Index Coordination**: Vector, fulltext, graph, summary, vision multiple index types -6. **Clear Layered Architecture**: View → Service → Repository → Storage -7. **Celery Async Processing**: Index building doesn't block upload response -8. **Transaction Consistency**: Database and object storage operations are atomic +### Frontend Implementation -This design ensures performance while supporting complex document processing scenarios, with good scalability and fault tolerance. 
+- **Document List**: `web/src/app/workspace/collections/[collectionId]/documents/page.tsx` +- **Document Upload**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` +## Summary +ApeRAG's document upload module adopts a **two-phase commit + multi-parser chain invocation + parallel multi-index building** architecture design: + +**Core Features**: +1. ✅ **Two-Phase Commit**: Upload (temporary storage) → Confirm (formal addition), providing better user experience +2. ✅ **SHA-256 Deduplication**: Prevents duplicate documents, supports idempotent upload +3. ✅ **Flexible Storage Backend**: Local/S3 configurable switching, unified interface abstraction +4. ✅ **Multi-Parser Architecture**: Supports MinerU, DocRay, MarkItDown and other parsers +5. ✅ **Automatic Format Conversion**: PDF→images, audio→text, images→OCR text +6. ✅ **Multi-Index Coordination**: Five index types: vector, full-text, graph, summary, vision +7. ✅ **Quota Management**: Quota deducted at confirmation stage, reasonable resource control +8. ✅ **Async Processing**: Celery task queue, doesn't block user operations +9. ✅ **Transaction Consistency**: Two-phase commit for database + object store +10. ✅ **Observability**: Audit logs, task tracking, complete error information recording + +This design ensures both high performance and scalability, supports complex document processing scenarios (multi-format, multi-language, multi-modal), while maintaining good fault tolerance and user experience. 
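The SHA-256 deduplication and idempotent upload behavior summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the module's actual code: `StoredDocument`, `resolve_upload`, and `DocumentNameConflict` are hypothetical names introduced for the example. The three-way decision (same name and same hash returns the existing document, same name with a different hash is a conflict, an unknown name is a new document) follows the duplicate strategy described in the section removed by this diff.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


class DocumentNameConflict(Exception):
    """Hypothetical stand-in for the error raised on same name, different content."""


@dataclass
class StoredDocument:
    """Hypothetical, trimmed-down view of a row in the document table."""
    document_id: str
    name: str
    content_hash: Optional[str]  # None models a legacy record stored without a hash


def calculate_file_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of the file content (64 hex characters)."""
    return hashlib.sha256(content).hexdigest()


def resolve_upload(
    existing: Optional[StoredDocument], name: str, content: bytes
) -> Optional[StoredDocument]:
    """Decide how an upload relates to an existing document with the same name.

    Returns the existing document for an idempotent re-upload, or None when the
    caller should create a new record. Raises DocumentNameConflict when the name
    matches but the content differs.
    """
    if existing is None:
        return None  # no active document with this name: create a new one
    if existing.content_hash is None:
        return existing  # legacy record without a hash: skip the hash check
    if existing.content_hash == calculate_file_hash(content):
        return existing  # same name, same bytes: idempotent upload
    raise DocumentNameConflict(name)
```

In this sketch the caller is assumed to look up `existing` by collection and filename first, so a retried request after a network failure resolves to the same `document_id` instead of creating a second record.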
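diff --git a/docs/design/document_upload_design_zh.md b/docs/design/document_upload_design_zh.md index c4b0c94c..307d77d0 100644 --- a/docs/design/document_upload_design_zh.md +++ b/docs/design/document_upload_design_zh.md @@ -1,18 +1,14 @@ -# ApeRAG Document Upload Module Data Flow +# ApeRAG Document Upload Architecture Design ## Overview -This document details the complete data flow of the document upload module in the ApeRAG project, covering the end-to-end implementation from frontend file upload to backend storage and index construction. +This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, and format conversion to final index construction. -**Core Concept**: Adopts a **two-phase commit** design: files are first uploaded to a temporary state (UPLOADED), then formally added to the knowledge base after user confirmation (PENDING → index building). +**Core Design Philosophy**: Adopts a **two-phase commit** pattern that separates file upload (temporary storage) from document confirmation (formal addition), providing a better user experience and better resource management. -## Core Interfaces +## System Architecture -1. **Upload File**: `POST /api/v1/collections/{collection_id}/documents/upload` -2. **Confirm Documents**: `POST /api/v1/collections/{collection_id}/documents/confirm` -3. 
**One-step Upload** (legacy interface): `POST /api/v1/collections/{collection_id}/documents` - -## Data Flow Diagram +### Overall Architecture Diagram ``` ┌─────────────────────────────────────────────────────────────┐ @@ -25,1140 +21,1057 @@ ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ View Layer: aperag/views/collections.py │ -│ - upload_document_view() │ -│ - confirm_documents_view() │ -│ - JWT authentication, parameter validation │ +│ - HTTP request handling │ +│ - JWT authentication │ +│ - Parameter validation │ └────────┬───────────────────────────────────┬────────────────┘ │ │ │ document_service.upload_document() │ document_service.confirm_documents() ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Service Layer: aperag/service/document_service.py │ +│ - Business logic orchestration │ │ - File validation (type, size) │ -│ - Duplicate detection (SHA-256 hash) │ -│ - Quota check │ +│ - SHA-256 hash deduplication │ +│ - Quota checking │ │ - Transaction management │ └────────┬───────────────────────────────────┬────────────────┘ │ │ │ Step 1 │ Step 2 ▼ ▼ ┌────────────────────────┐ ┌────────────────────────────┐ -│ 1. Create Document record │ │ 1. Update Document status │ +│ 1. Create Document record │ │ 1. Update Document status │ │ status=UPLOADED │ │ UPLOADED → PENDING │ -│ 2. Save to ObjectStore │ │ 2. Create DocumentIndex records │ -│ 3. Calculate content_hash │ │ 3. Trigger index build task │ +│ 2. Save to ObjectStore │ │ 2. Create DocumentIndex records │ +│ 3. Calculate content_hash │ │ 3. 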
Trigger index build task │ └────────┬───────────────┘ └────────┬───────────────────┘ │ │ ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Data Storage Layer │ +│ Storage Layer │ │ │ │ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ -│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ +│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ │ │ │ │ │ │ │ │ │ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ │ │ - document_ │ │ - Original files │ │ - Vector indexes │ │ │ │ index │ │ - Converted files │ │ │ │ -│ │ │ │ │ │ │ │ │ └───────────────┘ └──────────────────┘ └─────────────┘ │ │ │ │ ┌───────────────┐ ┌──────────────────┐ │ │ │ Elasticsearch │ │ Neo4j/PG │ │ │ │ │ │ │ │ │ │ - Full-text index │ │ - Knowledge graph │ │ -│ │ │ │ │ │ │ └───────────────┘ └──────────────────┘ │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌───────────────────┐ │ Celery Workers │ + │ │ │ - Document parsing │ - │ - Chunk processing │ + │ - Format conversion │ + │ - Content extraction │ + │ - Document chunking │ │ - Index building │ └───────────────────┘ ``` -## Complete Flow Details - -### Phase 1: File Upload (Temporary Storage) +### Layered Architecture -#### 1.1 View Layer - HTTP Request Handling - -**File**: `aperag/views/collections.py` - -```python -@router.post("/collections/{collection_id}/documents/upload", tags=["documents"]) -@audit(resource_type="document", api_name="UploadDocument") -async def upload_document_view( - request: Request, - collection_id: str, - file: UploadFile = File(...), - user: User = Depends(required_user), -) -> view_models.UploadDocumentResponse: - """Upload a single document file to temporary storage""" - return await document_service.upload_document(str(user.id), collection_id, file) ``` - -**Responsibilities**: -- Receive multipart/form-data file uploads -- JWT token authentication -- Extract the path parameter (collection_id) -- Call the Service layer -- Return `UploadDocumentResponse` (containing 
document_id, filename, size, status) - -#### 1.2 Service Layer - Business Logic Orchestration - -**File**: `aperag/service/document_service.py` - -```python -async def upload_document( - self, user_id: str, collection_id: str, file: UploadFile -) -> view_models.UploadDocumentResponse: - """Upload a single document file to temporary storage 
with duplicate detection""" - # 1. Verify that the collection exists and is active - collection = await self._validate_collection(user_id, collection_id) - - # 2. Validate file type and size - file_suffix = self._validate_file(file.filename, file.size) - - # 3. Read file content - file_content = await file.read() - - # 4. Compute the file hash (SHA-256) - file_hash = calculate_file_hash(file_content) - - # 5. Transaction handling - async def _upload_document_atomically(session): - # 5.1 Duplicate detection - existing_doc = await self._check_duplicate_document( - user_id, collection.id, file.filename, file_hash - ) - - if existing_doc: - # Idempotent operation: return the existing document - return view_models.UploadDocumentResponse( - document_id=existing_doc.id, - filename=existing_doc.name, - size=existing_doc.size, - status=existing_doc.status, - ) - - # 5.2 Create a new document (UPLOADED status) - document_instance = await self._create_document_record( - session=session, - user=user_id, - collection_id=collection.id, - filename=file.filename, - size=file.size, - status=db_models.DocumentStatus.UPLOADED, # temporary status - file_suffix=file_suffix, - file_content=file_content, - content_hash=file_hash, - ) - - return view_models.UploadDocumentResponse( - document_id=document_instance.id, - filename=document_instance.name, - size=document_instance.size, - status=document_instance.status, - ) - - return await self.db_ops.execute_with_transaction(_upload_document_atomically) +┌─────────────────────────────────────────────┐ +│ View Layer (views/collections.py) │ HTTP handling, authentication, parameter validation +└─────────────────┬───────────────────────────┘ + │ calls +┌─────────────────▼───────────────────────────┐ +│ Service Layer (service/document_service.py)│ Business logic, transaction orchestration, access control +└─────────────────┬───────────────────────────┘ + │ calls +┌─────────────────▼───────────────────────────┐ +│ Repository Layer (db/ops.py, objectstore/) │ Data access abstraction, object store interface +└─────────────────┬───────────────────────────┘ + │ accesses +┌─────────────────▼───────────────────────────┐ +│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ Data persistence +└─────────────────────────────────────────────┘ ``` -**Core validation logic**: +## Core Flow in Detail -1. 
**Collection validation** (`_validate_collection`) - - Whether the collection exists - - Whether the collection is in the ACTIVE state +### Phase 0: API Definitions -2. **File validation** (`_validate_file`) - - Whether the file extension is supported - - Whether the file size exceeds the limit (default 100MB) +The system provides three main endpoints: -3. **Duplicate detection** (`_check_duplicate_document`) - - Query by filename and SHA-256 hash - - Same filename but different hash: raise `DocumentNameConflictException` - - Same filename and same hash: return the existing document (idempotent) +1. **Upload File** (two-phase mode, step 1) + - Endpoint: `POST /api/v1/collections/{collection_id}/documents/upload` + - Purpose: upload the file to temporary storage with status `UPLOADED` + - Returns: `document_id`, `filename`, `size`, `status` -#### 1.3 Document Creation Logic +2. **Confirm Documents** (two-phase mode, step 2) + - Endpoint: `POST /api/v1/collections/{collection_id}/documents/confirm` + - Purpose: confirm uploaded documents and trigger index building + - Parameters: `document_ids` array + - Returns: `confirmed_count`, `failed_count`, `failed_documents` -**Method**: `_create_document_record` +3. **One-step Upload** (legacy mode, kept for backward compatibility) + - Endpoint: `POST /api/v1/collections/{collection_id}/documents` + - Purpose: upload and add directly to the knowledge base, with status set directly to `PENDING` + - Supports batch upload -```python -async def _create_document_record( - self, - session: AsyncSession, - user: str, - collection_id: str, - filename: str, - size: int, - status: db_models.DocumentStatus, - file_suffix: str, - file_content: bytes, - custom_metadata: dict = None, - content_hash: str = None, -) -> db_models.Document: - # 1. Create the database record - document_instance = db_models.Document( - user=user, - name=filename, - status=status, - size=size, - collection_id=collection_id, - content_hash=content_hash, - ) - session.add(document_instance) - await session.flush() - await session.refresh(document_instance) - - # 2. Upload to the object store - async_obj_store = get_async_object_store() - upload_path = f"{document_instance.object_store_base_path()}/original{file_suffix}" - await async_obj_store.put(upload_path, file_content) - - # 3. 
Update metadata - metadata = {"object_path": upload_path} - if custom_metadata: - metadata.update(custom_metadata) - document_instance.doc_metadata = json.dumps(metadata) - session.add(document_instance) - await session.flush() - - return document_instance -``` +### Phase 1: File Upload and Temporary Storage -**Object store path generation**: +#### 1.1 Upload Flow -```python -# Model method: aperag/db/models.py -def object_store_base_path(self) -> str: - """Generate the base path for object store""" - user = self.user.replace("|", "-") - return f"user-{user}/{self.collection_id}/{self.id}" - -# Example of an actual storage path: -# user-google-oauth2|123456/col_abc123/doc_xyz789/original.pdf ``` - -### Phase 2: Confirm Documents (Formal Addition) - -#### 2.1 View Layer - -**File**: `aperag/views/collections.py` - -```python -@router.post("/collections/{collection_id}/documents/confirm", tags=["documents"]) -@audit(resource_type="document", api_name="ConfirmDocuments") -async def confirm_documents_view( - request: Request, - collection_id: str, - data: view_models.ConfirmDocumentsRequest, - user: User = Depends(required_user), -) -> view_models.ConfirmDocumentsResponse: - """Confirm uploaded documents and add them to collection""" - return await document_service.confirm_documents( - str(user.id), collection_id, data.document_ids - ) +User selects a file + │ + ▼ +Frontend calls the upload API + │ + ▼ +View layer verifies identity and parameters + │ + ▼ +Service layer handles business logic: + │ + ├─► Verify the collection exists and is active + │ + ├─► Validate file type and size + │ + ├─► Read file content + │ + ├─► Compute SHA-256 hash + │ + └─► Transaction handling: + │ + ├─► Duplicate detection (by filename + hash) + │ ├─ Identical: return the existing document (idempotent) + │ ├─ Same name, different content: raise a conflict exception + │ └─ New document: continue with creation + │ + ├─► Create Document record (status=UPLOADED) + │ + ├─► Upload to the object store + │ └─ Path: user-{user_id}/{collection_id}/{document_id}/original{suffix} + │ + └─► Update document metadata (object_path) ``` -#### 2.2 Service Layer - Confirmation Logic - -**Method**: `confirm_documents` +#### 1.2 File Validation -```python -async def confirm_documents( - self, user_id: str, collection_id: str, document_ids: list[str] -) -> view_models.ConfirmDocumentsResponse: - """Confirm uploaded documents and trigger indexing""" - # 1. 
Validate the collection - collection = await self._validate_collection(user_id, collection_id) - - # 2. Load the collection configuration - collection_config = json.loads(collection.config) - index_types = self._get_index_types_for_collection(collection_config) - - confirmed_count = 0 - failed_count = 0 - failed_documents = [] - - # 3. Transaction handling - async def _confirm_documents_atomically(session): - # 3.1 Check quota (quota is only deducted at the confirmation phase) - await self._check_document_quotas(session, user_id, collection_id, len(document_ids)) - - for document_id in document_ids: - try: - # 3.2 Verify document status - stmt = select(db_models.Document).where( - db_models.Document.id == document_id, - db_models.Document.user == user_id, - db_models.Document.collection_id == collection_id, - db_models.Document.status == db_models.DocumentStatus.UPLOADED - ) - result = await session.execute(stmt) - document = result.scalar_one_or_none() - - if not document: - # Document does not exist or is in the wrong status - failed_documents.append(...) - failed_count += 1 - continue - - # 3.3 Update document status: UPLOADED → PENDING - document.status = db_models.DocumentStatus.PENDING - document.gmt_updated = utc_now() - session.add(document) - - # 3.4 Create index records - await document_index_manager.create_or_update_document_indexes( - document_id=document.id, - index_types=index_types, - session=session - ) - - confirmed_count += 1 - - except Exception as e: - logger.error(f"Failed to confirm document {document_id}: {e}") - failed_documents.append(...) - failed_count += 1 - - return confirmed_count, failed_count, failed_documents - - # 4. Execute the transaction - await self.db_ops.execute_with_transaction(_confirm_documents_atomically) - - # 5. 
Trigger the index reconciliation task - _trigger_index_reconciliation() - - return view_models.ConfirmDocumentsResponse( - confirmed_count=confirmed_count, - failed_count=failed_count, - failed_documents=failed_documents - ) -``` +**Supported file types**: +- Documents: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx` +- Text: `.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv` +- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` +- Audio: `.mp3`, `.wav`, `.m4a` +- Archives: `.zip`, `.tar`, `.gz`, `.tgz` -**Index type configuration**: +**Size limits**: +- Default: 100 MB (configurable via the `MAX_DOCUMENT_SIZE` environment variable) +- Total size after extraction: 5 GB (`MAX_EXTRACTED_SIZE`) -```python -def _get_index_types_for_collection(self, collection_config: dict) -> list: - """Get the list of index types to create based on collection configuration""" - index_types = [ - db_models.DocumentIndexType.VECTOR, # Vector index (required) - db_models.DocumentIndexType.FULLTEXT, # Full-text index (required) - ] - - if collection_config.get("enable_knowledge_graph", False): - index_types.append(db_models.DocumentIndexType.GRAPH) - if collection_config.get("enable_summary", False): - index_types.append(db_models.DocumentIndexType.SUMMARY) - if collection_config.get("enable_vision", False): - index_types.append(db_models.DocumentIndexType.VISION) - - return index_types -``` +#### 1.3 Duplicate Detection Mechanism -### One-step Upload Endpoint (Legacy Compatibility) +Detection uses a **filename + SHA-256 hash** double check: -**Endpoint**: `POST /api/v1/collections/{collection_id}/documents` +| Scenario | Filename | Hash | System Behavior | +|------|--------|--------|----------| +| Identical | Same | Same | Return the existing document (idempotent operation) | +| Filename conflict | Same | Different | Raise `DocumentNameConflictException` | +| New document | Different | - | Create a new document record | -```python -@router.post("/collections/{collection_id}/documents", tags=["documents"]) -async def create_documents_view( - request: Request, - collection_id: str, - files: List[UploadFile] = File(...), - user: User = Depends(required_user), -) -> view_models.DocumentList: - return await document_service.create_documents(str(user.id), collection_id, files) -``` +**Advantages**: +- ✅ Idempotent uploads: network retransmissions do not create duplicate documents +- ✅ Avoids content conflicts: the user is notified when the same name carries different content +- ✅ Saves storage space: identical content is stored only once 
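The three rows of the duplicate-detection table can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the in-memory `existing` dictionary stands in for the database lookup, and `StoredDocument`, `DocumentNameConflict`, and `check_duplicate` are hypothetical names introduced here. Only the SHA-256 hex digest computation mirrors what the document describes.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


class DocumentNameConflict(Exception):
    """Hypothetical stand-in for DocumentNameConflictException."""


@dataclass
class StoredDocument:
    document_id: str
    name: str
    content_hash: Optional[str]  # None models a legacy record without a hash


def calculate_file_hash(file_content: bytes) -> str:
    """Return the SHA-256 hex digest (64 characters) of the file content."""
    return hashlib.sha256(file_content).hexdigest()


def check_duplicate(
    existing: Dict[str, StoredDocument], filename: str, file_content: bytes
) -> Optional[StoredDocument]:
    """Apply the filename + hash rules to an in-memory index keyed by filename.

    Returns the existing document for an identical re-upload, None for a new
    filename, and raises DocumentNameConflict for same name, different content.
    """
    doc = existing.get(filename)
    if doc is None:
        return None  # new document: the caller goes on to create a record
    if doc.content_hash is None:
        return doc  # assumption: legacy records without a hash are treated as matches
    if doc.content_hash == calculate_file_hash(file_content):
        return doc  # identical upload: idempotent, reuse the existing record
    raise DocumentNameConflict(filename)
```

A caller would treat a returned document as a successful idempotent upload, and surface `DocumentNameConflict` to the user so they can rename the file or delete the old one.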
-**Core logic**: +### Phase 2: Temporary Storage Configuration -- Done in one step: file upload + confirmation -- Directly creates documents with status PENDING -- Creates index records immediately -- Supports batch upload of multiple files +#### 2.1 Object Store Types -## Data Storage Layer +The system supports two object store backends, switchable via environment variables: -### 1. PostgreSQL - Document Metadata +**1. Local storage (local filesystem)** -#### 1.1 Document Table +Suitable for: +- Development and test environments +- Small-scale deployments +- Single-machine deployments -**File**: `aperag/db/models.py` +Configuration: +```bash +# Development environment +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=.objects -```python -class Document(Base): - __tablename__ = "document" - __table_args__ = ( - UniqueConstraint("collection_id", "name", "gmt_deleted", - name="uq_document_collection_name_deleted"), - ) - - id = Column(String(24), primary_key=True, default=lambda: "doc" + random_id()) - name = Column(String(1024), nullable=False) - user = Column(String(256), nullable=False, index=True) - collection_id = Column(String(24), nullable=True, index=True) - status = Column(EnumColumn(DocumentStatus), nullable=False, index=True) - size = Column(BigInteger, nullable=False) - content_hash = Column(String(64), nullable=True, index=True) # SHA-256 - object_path = Column(Text, nullable=True) - doc_metadata = Column(Text, nullable=True) # JSON string - gmt_created = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_updated = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_deleted = Column(DateTime(timezone=True), nullable=True, index=True) +# Docker environment +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects ``` -**Status enum** (`DocumentStatus`): +Example storage layout: +``` +.objects/ +└── user-google-oauth2-123456/ + └── col_abc123/ + └── doc_xyz789/ + ├── original.pdf # Original file + ├── converted.pdf # Converted PDF + ├── processed_content.md # Parsed Markdown + ├── chunks/ # Chunk data + │ ├── chunk_0.json + │ └── chunk_1.json + └── images/ # Extracted images + ├── page_0.png + └── page_1.png +``` -| Status | Description | When Set | -|------|------|----------| -| `UPLOADED` | Uploaded to temporary storage | upload_document endpoint | -| `PENDING` | Waiting for index building | confirm_documents endpoint | -| `RUNNING` | Index building in progress | Celery task starts processing | -| `COMPLETE` | All indexes completed | 
All index statuses become ACTIVE | -| `FAILED` | Index building failed | Any index fails | -| `DELETED` | Deleted | delete_document endpoint | -| `EXPIRED` | Temporary document expired | Scheduled cleanup task (not implemented) | +**2. S3 storage (compatible with AWS S3, MinIO, OSS, etc.)** -#### 1.2 DocumentIndex Table +Suitable for: +- Production environments +- Large-scale deployments +- Distributed deployments +- Deployments requiring high availability and disaster recovery -```python -class DocumentIndex(Base): - __tablename__ = "document_index" - __table_args__ = ( - UniqueConstraint("document_id", "index_type", name="uq_document_index"), - ) - - id = Column(Integer, primary_key=True, index=True) - document_id = Column(String(24), nullable=False, index=True) - index_type = Column(EnumColumn(DocumentIndexType), nullable=False, index=True) - status = Column(EnumColumn(DocumentIndexStatus), nullable=False, - default=DocumentIndexStatus.PENDING, index=True) - version = Column(Integer, nullable=False, default=1) - observed_version = Column(Integer, nullable=False, default=0) - index_data = Column(Text, nullable=True) # JSON data - error_message = Column(Text, nullable=True) - gmt_created = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_updated = Column(DateTime(timezone=True), default=utc_now, nullable=False) - gmt_last_reconciled = Column(DateTime(timezone=True), nullable=True) +Configuration: +```bash +OBJECT_STORE_TYPE=s3 +OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 address +OBJECT_STORE_S3_REGION=us-east-1 # AWS Region +OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key +OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key +OBJECT_STORE_S3_BUCKET=aperag # Bucket name +OBJECT_STORE_S3_PREFIX_PATH=dev/ # Optional path prefix +OBJECT_STORE_S3_USE_PATH_STYLE=true # Must be set to true for MinIO ``` -**Index types** (`DocumentIndexType`): - -- `VECTOR`: Vector index (Qdrant, etc.) -- `FULLTEXT`: Full-text index (Elasticsearch) -- `GRAPH`: Knowledge graph index (Neo4j/PostgreSQL) -- `SUMMARY`: Document summary -- `VISION`: Vision index (image content) +#### 2.2 Object Store Path Rules -**Index status** (`DocumentIndexStatus`): +**Path format**: +``` +{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename} +``` -| Status | Description | -|------|------| -| `PENDING` | Waiting for processing | -| `CREATING` | Being created | -| `ACTIVE` | Ready for use | -| 
`DELETING` | Marked for deletion | -| `DELETION_IN_PROGRESS` | Being deleted | -| `FAILED` | Failed | +**Components**: +- `prefix`: optional global prefix (S3 only) +- `user_id`: user ID (`|` replaced with `-`) +- `collection_id`: collection ID +- `document_id`: document ID +- `filename`: file name (e.g. `original.pdf`, `page_0.png`) -### 2. Object Store - File Storage +**Multi-tenant isolation**: +- Each user has an independent namespace +- Each collection has an independent storage directory +- Each document has an independent folder -#### 2.1 Storage Backend Configuration +### Phase 3: Document Confirmation and Index Building -**File**: `aperag/config.py` +#### 3.1 Confirmation Flow -```python -class Config(BaseSettings): - # Object store type: "local" or "s3" - object_store_type: str = Field("local", alias="OBJECT_STORE_TYPE") - - # Local storage config - object_store_local_config: Optional[LocalObjectStoreConfig] = None - - # S3 storage config - object_store_s3_config: Optional[S3Config] = None ``` - -**Environment variable configuration**: - -```bash -# Local storage (default) -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=.objects - -# S3 storage (MinIO/AWS S3) -OBJECT_STORE_TYPE=s3 -OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 -OBJECT_STORE_S3_ACCESS_KEY=minioadmin -OBJECT_STORE_S3_SECRET_KEY=minioadmin -OBJECT_STORE_S3_BUCKET=aperag -OBJECT_STORE_S3_REGION=us-east-1 -OBJECT_STORE_S3_PREFIX_PATH= -OBJECT_STORE_S3_USE_PATH_STYLE=true +User clicks "Save to Collection" + │ + ▼ +Frontend calls the confirm API + │ + ▼ +Service layer processing: + │ + ├─► Validate the collection configuration + │ + ├─► Check quota (quota is only deducted at the confirmation phase) + │ + └─► For each document_id: + │ + ├─► Verify the document status is UPLOADED + │ + ├─► Update document status: UPLOADED → PENDING + │ + ├─► Create index records according to the collection configuration: + │ ├─ VECTOR (vector index, required) + │ ├─ FULLTEXT (full-text index, required) + │ ├─ GRAPH (knowledge graph, optional) + │ ├─ SUMMARY (document summary, optional) + │ └─ VISION (vision index, optional) + │ + └─► Return the confirmation result + │ + ▼ +Trigger Celery task: reconcile_document_indexes + │ + ▼ +Index building is processed asynchronously in the background ``` -#### 2.2 Object Store Interface +#### 3.2 Quota Management -**File**: `aperag/objectstore/base.py` +**When quota is checked**: +- ❌ Not checked at the upload phase (temporary storage does not consume quota) +- ✅ Checked at the confirmation phase (only formal addition consumes quota) -```python -class AsyncObjectStore(ABC): - @abstractmethod - async def put(self, path: str, data: bytes | IO[bytes]): - """Upload object to storage""" - ... - - @abstractmethod - async def get(self, path: str) -> IO[bytes] | None: - """Download object from storage""" - ... 
- - @abstractmethod - async def delete_objects_by_prefix(self, path_prefix: str): - """Delete all objects with given prefix""" - ... -``` +**Quota types**: -**Factory method**: +1. **Global user quota** + - `max_document_count`: limit on the user's total number of documents + - Default: 1000 (configurable via `MAX_DOCUMENT_COUNT`) -```python -def get_async_object_store() -> AsyncObjectStore: - """Factory function to get an asynchronous AsyncObjectStore instance""" - match settings.object_store_type: - case "local": - from aperag.objectstore.local import AsyncLocal, LocalConfig - return AsyncLocal(LocalConfig(**config_dict)) - case "s3": - from aperag.objectstore.s3 import AsyncS3, S3Config - return AsyncS3(S3Config(**config_dict)) -``` +2. **Per-collection quota** + - `max_document_count_per_collection`: limit on the number of documents in a single collection + - Documents in `UPLOADED` and `DELETED` status are not counted -#### 2.3 Local Storage Implementation +**Handling quota overruns**: +- Raises `QuotaExceededException` +- Returns an HTTP 400 error +- Includes current usage and the quota limit -**File**: `aperag/objectstore/local.py` +### Phase 4: Document Parsing and Format Conversion -```python -class AsyncLocal(AsyncObjectStore): - def __init__(self, cfg: LocalConfig): - self._base_storage_path = Path(cfg.root_dir).resolve() - self._base_storage_path.mkdir(parents=True, exist_ok=True) - - def _resolve_object_path(self, path: str) -> Path: - """Resolve and validate object path (security check)""" - path_components = Path(path.lstrip("/")).parts - if ".." in path_components: - raise ValueError("Invalid path: '..' 
not allowed") - - prospective_path = self._base_storage_path.joinpath(*path_components) - normalized_path = Path(os.path.abspath(prospective_path)) - - if self._base_storage_path not in normalized_path.parents: - raise ValueError("Path traversal attempt detected") - - return prospective_path - - async def put(self, path: str, data: bytes | IO[bytes]): - """Write file to local filesystem""" - full_path = self._resolve_object_path(path) - full_path.parent.mkdir(parents=True, exist_ok=True) - - async with aiofiles.open(full_path, "wb") as f: - if isinstance(data, bytes): - await f.write(data) - else: - await f.write(data.read()) -``` +#### 4.1 Parser Architecture -**Example storage layout**: +The system uses a **chained multi-Parser** architecture in which each Parser handles a specific class of files: ``` -.objects/ -└── user-google-oauth2-123456/ - └── col_abc123/ - └── doc_xyz789/ - ├── original.pdf # Original file - ├── converted.pdf # Converted PDF - ├── chunks/ # Chunk data - │ ├── chunk_0.json - │ └── chunk_1.json - └── images/ # Extracted images - ├── image_0.png - └── image_1.png +DocParser (main controller) + │ + ├─► MinerUParser + │ └─ Purpose: high-precision PDF parsing (commercial API) + │ └─ Supports: .pdf + │ + ├─► DocRayParser + │ └─ Purpose: document layout analysis and content extraction + │ └─ Supports: .pdf, .docx, .pptx, .xlsx + │ + ├─► ImageParser + │ └─ Purpose: image content recognition (OCR + visual understanding) + │ └─ Supports: .jpg, .png, .gif, .bmp, .tiff + │ + ├─► AudioParser + │ └─ Purpose: audio transcription (Speech-to-Text) + │ └─ Supports: .mp3, .wav, .m4a + │ + └─► MarkItDownParser (fallback) + └─ Purpose: general document-to-Markdown conversion + └─ Supports: almost all common formats ``` -#### 2.4 S3 Storage Implementation +#### 4.2 Parser Configuration -**File**: `aperag/objectstore/s3.py` +**Configuration**: controlled dynamically through the collection configuration (Collection Config) -```python -class AsyncS3(AsyncObjectStore): - def __init__(self, cfg: S3Config): - self.cfg = cfg - self._s3_client = None - - async def put(self, path: str, data: bytes | IO[bytes]): - """Upload file to S3""" - client = await self._get_client() - path = self._final_path(path) - - if isinstance(data, bytes): - data = BytesIO(data) - - await client.upload_fileobj(data, self.cfg.bucket, path) - - def _final_path(self, path: str) -> str: - """Add prefix path if configured""" - if 
self.cfg.prefix_path: - return f"{self.cfg.prefix_path.rstrip('/')}/{path.lstrip('/')}" - return path.lstrip('/') +```json +{ + "parser_config": { + "use_mineru": false, // Whether to enable MinerU (requires an API token) + "use_doc_ray": false, // Whether to enable DocRay + "use_markitdown": true, // Whether to enable MarkItDown (default) + "mineru_api_token": "xxx" // MinerU API token (optional) + } +} ``` -### 3. Vector Database - Vector Index - -**Supported vector databases**: - -- Qdrant (default) -- Elasticsearch -- Other vector databases with a compatible interface - -**Configuration example**: - +**Environment variable configuration**: ```bash -VECTOR_DB_TYPE=qdrant -VECTOR_DB_CONTEXT='{"url":"http://localhost","port":6333,"distance":"Cosine"}' +USE_MINERU_API=false # Enable MinerU globally +MINERU_API_TOKEN=your_token # MinerU API Token ``` -### 4. Elasticsearch - Full-text Index +#### 4.3 Parsing Flow -**Environment variables**: - -```bash -ES_HOST_NAME=127.0.0.1 -ES_PORT=9200 -ES_USER= -ES_PASSWORD= -ES_PROTOCOL=http +``` +Celery Worker receives the indexing task + │ + ▼ +1. Download the original file from the object store + │ + ▼ +2. Select a Parser based on the file extension + │ + ├─► Try the first matching Parser + │ ├─ Success: return the parsing result + │ └─ Failure: FallbackError → try the next Parser + │ + └─► Final fallback: MarkItDownParser + │ + ▼ +3. Parsing result (Parts): + │ + ├─► MarkdownPart: text content + │ └─ Contains: headings, paragraphs, lists, tables, etc. + │ + ├─► PdfPart: PDF file + │ └─ Used for: linearization, page rendering + │ + └─► AssetBinPart: binary assets + └─ Contains: images, embedded files, etc. + │ + ▼ +4. Post-processing: + │ + ├─► Convert PDF pages to images (required by the Vision index) + │ └─ Render each page as a PNG image + │ └─ Save to {document_path}/images/page_N.png + │ + ├─► Linearize the PDF (speeds up loading in the browser) + │ └─ Use pikepdf to optimize the PDF structure + │ └─ Save to {document_path}/converted.pdf + │ + └─► Extract text content (plain text) + └─ Merge the content of all MarkdownParts + └─ Save to {document_path}/processed_content.md + │ + ▼ +5. Save to the object store ``` -### 5. Knowledge Graph Storage - -**Supported backends**: - -- Neo4j (recommended) -- PostgreSQL (emulating a graph database) -- NebulaGraph - -## Document Duplicate Detection Mechanism - -### Detection Logic - -**Method**: `_check_duplicate_document` +#### 4.4 Format Conversion Examples -```python -async def _check_duplicate_document( - self, user: str, collection_id: str, filename: str, file_hash: str -) -> db_models.Document | None: - """ - Check if a document with the same name exists in the collection. - Returns the existing document if found, None otherwise. 
- Raises DocumentNameConflictException if same name but different file hash. - """ - # 1. Query for a document with the same name - existing_doc = await self.db_ops.query_document_by_name_and_collection( - user, collection_id, filename - ) - - if existing_doc: - # 2. If there is no hash (legacy document), skip the hash check - if existing_doc.content_hash is None: - logger.warning(f"Existing document {existing_doc.id} has no file hash") - return existing_doc - - # 3. Same hash: a true duplicate (idempotent) - if existing_doc.content_hash == file_hash: - return existing_doc - - # 4. Different hash: filename conflict - raise DocumentNameConflictException(filename, collection_id) - - return None +**Example 1: PDF document** +``` +Input: user_manual.pdf (5 MB) + │ + ▼ +Parser selected: DocRayParser / MarkItDownParser + │ + ▼ +Output Parts: + ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..." + └─ PdfPart: <original PDF data> + │ + ▼ +Post-processing: + ├─ Render 50 pages as images → images/page_0.png ~ page_49.png + ├─ Linearize PDF → converted.pdf + └─ Extract text → processed_content.md ``` -### Hash Algorithm - -**SHA-256 file hash computation**: +**Example 2: Image file** +``` +Input: screenshot.png (2 MB) + │ + ▼ +Parser selected: ImageParser + │ + ▼ +Output Parts: + ├─ MarkdownPart: "[text extracted by OCR]" + └─ AssetBinPart: <original image data> (vision_index=true) + │ + ▼ +Post-processing: + └─ Save a copy of the original image → images/file.png +``` -```python -def calculate_file_hash(file_content: bytes) -> str: - """Calculate SHA-256 hash of file content""" - import hashlib - return hashlib.sha256(file_content).hexdigest() +**Example 3: Audio file** +``` +Input: meeting_record.mp3 (50 MB) + │ + ▼ +Parser selected: AudioParser + │ + ▼ +Output Parts: + └─ MarkdownPart: "[transcribed meeting text]" + │ + ▼ +Post-processing: + └─ Save transcript → processed_content.md ``` -### Duplicate Policy +### Phase 5: Index Building -| Scenario | Filename | Hash | Behavior | -|------|-------|------|------| -| Identical | Same | Same | Return the existing document (idempotent) | -| Filename conflict | Same | Different | Raise `DocumentNameConflictException` | -| New document | Different | - | Create a new document | +#### 5.1 Index Types and Functions -## Quota Management +| Index Type | Required | Description | Storage Location | +|---------|---------|----------|----------| +| **VECTOR** | ✅ Required | Vector retrieval, supports semantic search | Qdrant / Elasticsearch | +| **FULLTEXT** | ✅ Required | Full-text retrieval, supports keyword search | Elasticsearch | +| **GRAPH** | ❌ Optional | Knowledge graph, extracts entities and relations | Neo4j / 
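PostgreSQL | +| **SUMMARY** | ❌ Optional | Document summary, generated by an LLM | PostgreSQL (index_data) | +| **VISION** | ❌ Optional | Visual understanding, image content analysis | Qdrant (vectors) + PG (metadata) | -### When Quota Is Checked +#### 5.2 Index Building Flow -**Quota is checked at the confirmation phase** (not at the upload phase), because: +``` +Celery Worker: reconcile_document_indexes task + │ + ▼ +1. Scan the DocumentIndex table for indexes that need processing + │ + ├─► PENDING status + observed_version < version + │ └─ Index needs to be created or updated + │ + └─► DELETING status + └─ Index needs to be deleted + │ + ▼ +2. Group by document and process one at a time + │ + ▼ +3. For each document: + │ + ├─► parse_document (parse the document) + │ ├─ Download the original file from the object store + │ ├─ Call DocParser to parse it + │ └─ Return ParsedDocumentData + │ + └─► For each index type: + │ + ├─► create_index (create/update the index) + │ │ + │ ├─ VECTOR index: + │ │ ├─ Document chunking + │ │ ├─ Embedding model generates vectors + │ │ └─ Write to Qdrant + │ │ + │ ├─ FULLTEXT index: + │ │ ├─ Extract plain text content + │ │ ├─ Chunk by paragraph/section + │ │ └─ Write to Elasticsearch + │ │ + │ ├─ GRAPH index: + │ │ ├─ Extract entities using LightRAG + │ │ ├─ Extract relations between entities + │ │ └─ Write to Neo4j/PostgreSQL + │ │ + │ ├─ SUMMARY index: + │ │ ├─ Call an LLM to generate a summary + │ │ └─ Save to DocumentIndex.index_data + │ │ + │ └─ VISION index: + │ ├─ Extract image Assets + │ ├─ Vision LLM interprets the image content + │ ├─ Generate image description vectors + │ └─ Write to Qdrant + │ + └─► Update index status + ├─ Success: CREATING → ACTIVE + └─ Failure: CREATING → FAILED + │ + ▼ +4. Update the overall document status + │ + ├─ All indexes ACTIVE → Document.status = COMPLETE + ├─ Any index FAILED → Document.status = FAILED + └─ Some indexes still processing → Document.status = RUNNING +``` -1. The upload phase is only temporary storage and does not consume formal quota -2. Resources are only really consumed at the confirmation phase (index building) -3. This allows users to upload first and confirm selectively +#### 5.3 Document Chunking -### Quota Types +**Chunking strategy**: +- Recursive character splitting (RecursiveCharacterTextSplitter) +- Prefer splitting at natural paragraph and section boundaries +- Keep overlapping context between chunks (Overlap) -```python -async def _check_document_quotas( - self, session: AsyncSession, user: str, collection_id: str, count: int -): - """Check and consume document quotas""" - # 1. Check and consume the global user quota - await quota_service.check_and_consume_quota( - user, "max_document_count", count, session - ) - - # 2. 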
Check the per-collection quota - stmt = select(func.count()).select_from(db_models.Document).where( - db_models.Document.collection_id == collection_id, - db_models.Document.status != db_models.DocumentStatus.DELETED, - db_models.Document.status != db_models.DocumentStatus.UPLOADED, # temporary documents are not counted - ) - existing_doc_count = await session.scalar(stmt) - - # 3. Fetch the quota limit - stmt = select(UserQuota).where( - UserQuota.user == user, - UserQuota.key == "max_document_count_per_collection" - ) - per_collection_quota = (await session.execute(stmt)).scalars().first() - - # 4. Verify whether the limit is exceeded - if per_collection_quota and (existing_doc_count + count) > per_collection_quota.quota_limit: - raise QuotaExceededException( - "max_document_count_per_collection", - per_collection_quota.quota_limit, - existing_doc_count - ) +**Chunking parameters**: +```json +{ + "chunk_size": 1000, // Maximum characters per chunk + "chunk_overlap": 200, // Number of overlapping characters + "separators": ["\n\n", "\n", " ", ""] // Separator priority +} ``` -### Default Quotas +**Chunk result storage**: +``` +{document_path}/chunks/ + ├─ chunk_0.json: {"text": "...", "metadata": {...}} + ├─ chunk_1.json: {"text": "...", "metadata": {...}} + └─ ... 
+``` -**File**: `aperag/config.py` +## Database Design + +### Table 1: document (Document Metadata) + +**Table structure**: + +| Field | Type | Description | Index | +|--------|------|------|------| +| `id` | String(24) | Document ID, primary key, format: `doc{random_id}` | PK | +| `name` | String(1024) | File name | - | +| `user` | String(256) | User ID (supports multiple IDPs) | ✅ Index | +| `collection_id` | String(24) | ID of the owning collection | ✅ Index | +| `status` | Enum | Document status (see table below) | ✅ Index | +| `size` | BigInteger | File size (bytes) | - | +| `content_hash` | String(64) | SHA-256 hash (used for deduplication) | ✅ Index | +| `object_path` | Text | Object store path (deprecated, use doc_metadata) | - | +| `doc_metadata` | Text | Document metadata (JSON string) | - | +| `gmt_created` | DateTime(tz) | Creation time (UTC) | - | +| `gmt_updated` | DateTime(tz) | Update time (UTC) | - | +| `gmt_deleted` | DateTime(tz) | Deletion time (soft delete) | ✅ Index | + +**Unique constraint**: +```sql +UNIQUE INDEX uq_document_collection_name_active + ON document (collection_id, name) + WHERE gmt_deleted IS NULL; +``` +- Within a collection, active documents cannot share a name +- Deleted documents do not take part in the uniqueness check + +**Document status enum** (`DocumentStatus`): + +| Status | Description | When Set | Visibility | +|------|------|----------|--------| +| `UPLOADED` | Uploaded to temporary storage | `upload_document` endpoint | Frontend file selection screen | +| `PENDING` | Waiting for index building | `confirm_documents` endpoint | Document list (processing) | +| `RUNNING` | Index building in progress | Celery task starts processing | Document list (processing) | +| `COMPLETE` | All indexes completed | All indexes become ACTIVE | Document list (available) | +| `FAILED` | Index building failed | Any index fails | Document list (failed) | +| `DELETED` | Deleted | `delete_document` endpoint | Not visible (soft delete) | +| `EXPIRED` | Temporary document expired | Scheduled cleanup task | Not visible | + +**Document metadata example** (`doc_metadata` JSON field): +```json +{ + "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf", + "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf", + "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md", + "images": [ + "user-xxx/col_xxx/doc_xxx/images/page_0.png", + "user-xxx/col_xxx/doc_xxx/images/page_1.png" + ], + "parser_used": "DocRayParser", + "parse_duration_ms": 5420, + "page_count": 50, + "custom_field": "value" +} +``` -```python -class Config(BaseSettings): - max_document_count: int = Field(1000, alias="MAX_DOCUMENT_COUNT") - 
max_document_size: int = Field(100 * 1024 * 1024, alias="MAX_DOCUMENT_SIZE") # 100MB +### Table 2: document_index (Index Status Management) + +**Table structure**: + +| Field | Type | Description | Index | +|--------|------|------|------| +| `id` | Integer | Auto-increment ID, primary key | PK | +| `document_id` | String(24) | ID of the associated document | ✅ Index | +| `index_type` | Enum | Index type (see table below) | ✅ Index | +| `status` | Enum | Index status (see table below) | ✅ Index | +| `version` | Integer | Index version number | - | +| `observed_version` | Integer | Version number already processed | - | +| `index_data` | Text | Index data (JSON), e.g. summary content | - | +| `error_message` | Text | Error message (on failure) | - | +| `gmt_created` | DateTime(tz) | Creation time | - | +| `gmt_updated` | DateTime(tz) | Update time | - | +| `gmt_last_reconciled` | DateTime(tz) | Last reconciliation time | - | + +**Unique constraint**: +```sql +UNIQUE CONSTRAINT uq_document_index + ON document_index (document_id, index_type); ``` +- Each document has exactly one record per index type -## Asynchronous Task Processing (Celery) +**Index type enum** (`DocumentIndexType`): -### Index Reconciliation Mechanism +| Type | Value | Description | External Storage | +|------|-----|------|----------| +| `VECTOR` | "VECTOR" | Vector index | Qdrant / Elasticsearch | +| `FULLTEXT` | "FULLTEXT" | Full-text index | Elasticsearch | +| `GRAPH` | "GRAPH" | Knowledge graph | Neo4j / PostgreSQL | +| `SUMMARY` | "SUMMARY" | Document summary | PostgreSQL (index_data) | +| `VISION` | "VISION" | Vision index | Qdrant + PostgreSQL | -**File**: `aperag/service/document_service.py` +**Index status enum** (`DocumentIndexStatus`): +| Status | Description | When Set | +|------|------|----------| +| `PENDING` | Waiting for processing | `confirm_documents` creates the index record | +| `CREATING` | Being created | Celery Worker starts processing | +| `ACTIVE` | Ready for use | Index built successfully | +| `DELETING` | Marked for deletion | `delete_document` endpoint | +| `DELETION_IN_PROGRESS` | Being deleted | Celery Worker is deleting it | +| `FAILED` | Failed | Index building failed | + +**Version control mechanism**: +- `version`: the desired index version (incremented by 1 on each document update) +- `observed_version`: the version number already processed +- An index update is triggered when `version > observed_version` + +**Reconciler**: ```python -def _trigger_index_reconciliation(): - """Trigger index reconciliation task in background""" - try: - from config.celery_tasks import reconcile_document_indexes - reconcile_document_indexes.apply_async() - except Exception as e: - 
logger.warning(f"Failed to trigger index reconciliation task: {e}") +# Query the indexes that need processing +SELECT * FROM document_index +WHERE status = 'PENDING' + AND observed_version < version; + +# Update after processing +UPDATE document_index +SET status = 'ACTIVE', + observed_version = version, + gmt_last_reconciled = NOW() +WHERE id = ?; ``` -**Celery task**: `config/celery_tasks.py` +### Table Relationship Diagram -```python -@celery_app.task(name="reconcile_document_indexes") -def reconcile_document_indexes(): - """Reconcile document indexes based on their status""" - from aperag.index.manager import document_index_manager - - # Process indexes in PENDING status - document_index_manager.reconcile_pending_indexes() - - # Process indexes in DELETING status - document_index_manager.reconcile_deleting_indexes() +``` +┌─────────────────────────────────┐ +│ collection │ +│ ───────────────────────────── │ +│ id (PK) │ +│ name │ +│ config (JSON) │ +│ status │ +│ ... │ +└────────────┬────────────────────┘ + │ 1:N + ▼ +┌─────────────────────────────────┐ +│ document │ +│ ───────────────────────────── │ +│ id (PK) │ +│ collection_id (FK) │◄──── Unique constraint: (collection_id, name) +│ name │ +│ user │ +│ status (Enum) │ +│ size │ +│ content_hash (SHA-256) │ +│ doc_metadata (JSON) │ +│ gmt_created │ +│ gmt_deleted │ +│ ... │ +└────────────┬────────────────────┘ + │ 1:N + ▼ +┌─────────────────────────────────┐ +│ document_index │ +│ ───────────────────────────── │ +│ id (PK) │ +│ document_id (FK) │◄──── Unique constraint: (document_id, index_type) +│ index_type (Enum) │ +│ status (Enum) │ +│ version │ +│ observed_version │ +│ index_data (JSON) │ +│ error_message │ +│ gmt_last_reconciled │ +│ ... │ +└─────────────────────────────────┘ ``` -### Index Building Flow +## State Machine and Lifecycle -1. **Document parsing**: DocParser parses the document content -2. **Document chunking**: the chunking strategy splits the document -3. **Vectorization**: the embedding model generates vectors -4. **Vector index**: written to the vector database -5. **Full-text index**: written to Elasticsearch -6. **Knowledge graph**: LightRAG extracts entities and relations -7. **Document summary**: an LLM generates the summary (optional) -8. 
**Vision index**: extract and analyze images (optional) +### Document Status Transitions -## File Validation +``` + ┌─────────────────────────────────────────────┐ + │ │ + │ ▼ + [Upload file] ──► UPLOADED ──► [Confirm] ──► PENDING ──► RUNNING ──► COMPLETE + │ │ + │ ▼ + │ FAILED + │ │ + │ ▼ + └──────► [Delete] ──────────────► DELETED + │ + ┌───────────────────────────────────┘ + │ + ▼ + EXPIRED (scheduled cleanup of unconfirmed documents) +``` -### Supported File Types +**Key transitions**: +1. **UPLOADED → PENDING**: the user clicks "Save to Collection" +2. **PENDING → RUNNING**: the Celery Worker starts processing +3. **RUNNING → COMPLETE**: all indexes succeed +4. **RUNNING → FAILED**: any index fails +5. **Any status → DELETED**: the user deletes the document -**File**: `aperag/docparser/doc_parser.py` +### Index Status Transitions -```python -class DocParser: - def supported_extensions(self) -> list: - return [ - ".txt", ".md", ".html", ".pdf", - ".docx", ".doc", ".pptx", ".ppt", - ".xlsx", ".xls", ".csv", - ".json", ".xml", ".yaml", ".yml", - ".png", ".jpg", ".jpeg", ".gif", ".bmp", - ".mp3", ".wav", ".m4a", - # ... more formats - ] ``` - -**Compressed file support**: - -```python -SUPPORTED_COMPRESSED_EXTENSIONS = [".zip", ".tar", ".gz", ".tgz"] + [Create index record] ──► PENDING ──► CREATING ──► ACTIVE + │ + ▼ + FAILED + │ + ▼ + ┌──────────► PENDING (retry) + │ + [Delete request] ──────┼──────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted) + │ + └──────────► (record deleted directly, if PENDING/FAILED) ``` -### Size Limits +## Asynchronous Task Scheduling (Celery) -```python -def _validate_file(self, filename: str, size: int) -> str: - """Validate file extension and size""" - supported_extensions = DocParser().supported_extensions() - supported_extensions += SUPPORTED_COMPRESSED_EXTENSIONS - - file_suffix = os.path.splitext(filename)[1].lower() - - if file_suffix not in supported_extensions: - raise invalid_param("file_type", f"unsupported file type {file_suffix}") - - if size > settings.max_document_size: - raise invalid_param("file_size", "file size is too large") - - return file_suffix -``` +### Task Definitions -## API Response Format - -### UploadDocumentResponse - -**Schema**: `aperag/api/components/schemas/document.yaml` - -```yaml -uploadDocumentResponse: - type: object - properties: - document_id: - type: string - 
description: ID of the uploaded document - filename: - type: string - description: Name of the uploaded file - size: - type: integer - description: Size of the uploaded file in bytes - status: - type: string - enum: - - UPLOADED - - PENDING - - RUNNING - - COMPLETE - - FAILED - - DELETED - - EXPIRED - description: Status of the document - required: - - document_id - - filename - - size - - status -``` +**Main task**: `reconcile_document_indexes` +- Triggered: + - after the `confirm_documents` endpoint is called + - by a scheduled task (every 30 seconds) + - manually (admin UI) +- Function: scans the `document_index` table and processes indexes that need reconciliation -**Example**: +**Subtasks**: +- `parse_document_task`: parses document content +- `create_vector_index_task`: creates the vector index +- `create_fulltext_index_task`: creates the full-text index +- `create_graph_index_task`: creates the knowledge graph index +- `create_summary_index_task`: creates the summary index +- `create_vision_index_task`: creates the vision index -```json -{ - "document_id": "doc_xyz789abc", - "filename": "user_manual.pdf", - "size": 2048576, - "status": "UPLOADED" -} -``` +### Task Scheduling Strategy -### ConfirmDocumentsResponse - -```yaml -confirmDocumentsResponse: - type: object - properties: - confirmed_count: - type: integer - description: Number of documents successfully confirmed - failed_count: - type: integer - description: Number of documents that failed to confirm - failed_documents: - type: array - items: - type: object - properties: - document_id: - type: string - name: - type: string - error: - type: string - required: - - confirmed_count - - failed_count -``` +**Concurrency control**: +- Each worker processes at most N documents at a time (default 4) +- The indexes of a single document can be built in parallel +- Celery's `task_acks_late=True` ensures tasks are not lost -**Example**: +**Failure retry**: +- At most 3 retries +- Exponential backoff (1 minute → 5 minutes → 15 minutes) +- Marked `FAILED` after 3 failures -```json -{ - "confirmed_count": 3, - "failed_count": 1, - "failed_documents": [ - { - "document_id": "doc_fail123", - "name": "corrupted.pdf", - "error": "CONFIRMATION_FAILED" - } - ] -} -``` +**Idempotency**: +- All tasks can be executed repeatedly +- The `observed_version` mechanism avoids duplicate processing +- The same input produces the same output -## Design Features +## Design Features and Advantages ### 1.
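Two-Phase Commit Design -**Advantages**: - -- ✅ Users can upload first and choose later: upload in batches, then add selectively -- ✅ Less unnecessary resource consumption: unconfirmed documents are not indexed -- ✅ Better user experience: fast upload response, asynchronous background processing -- ✅ More sensible quota control: quota is consumed only after confirmation - -**State transitions**: +**Advantages**: +- ✅ **Better user experience**: fast upload response that does not block the user +- ✅ **Selective addition**: after a batch upload, users can confirm only some of the files +- ✅ **Sensible resource control**: unconfirmed documents are not indexed and consume no quota +- ✅ **Recovery-friendly**: temporary documents can be cleaned up periodically without affecting the business +**State isolation**: ``` -Upload → UPLOADED → (user confirms) → PENDING → (Celery processing) → RUNNING → COMPLETE - ↓ - FAILED +Temporary state (UPLOADED): + - Not counted against quota + - Does not trigger indexing + - Can be cleaned up automatically + +Formal states (PENDING/RUNNING/COMPLETE): + - Counted against quota + - Triggers index building + - Never cleaned up automatically ``` ### 2. Idempotency Design -**Duplicate upload handling**: - -- Same name, same content (identical hash): return the existing document -- Same name, different content (different hash): raise a conflict exception -- Entirely new document: create a new record - -**Benefits**: - -- Network retransmissions do not create duplicate documents -- Clients can retry safely +**File-level idempotency**: +- SHA-256 hash deduplication +- Uploading the same file multiple times returns the same `document_id` - Avoids wasted storage space -### 3. Multi-Tenant Isolation +**API-level idempotency**: +- `upload_document`: a repeated upload returns the existing document +- `confirm_documents`: a repeated confirmation does not create duplicate indexes +- `delete_document`: a repeated delete returns success (soft delete) -**Storage path isolation**: +### 3. Multi-Tenant Isolation +**Storage isolation**: ``` -user-{user_id}/{collection_id}/{document_id}/... +user-{user_A}/... # User A's files +user-{user_B}/... # User B's files ``` -**Database isolation**: - -- All queries filter by the user field -- Collection-level access control -- Soft delete support (gmt_deleted) +**Database isolation**: +- All queries filter by the `user` field +- Collection-level access control (`collection.user`) +- Soft delete support (`gmt_deleted`) ### 4. Flexible Storage Backends -**Local and S3 support**: +**Unified interface**: +```python +AsyncObjectStore: + - put(path, data) + - get(path) + - delete_objects_by_prefix(prefix) +``` -- Local: suited to development, testing, and small deployments -- S3: suited to production and large deployments -- Unified `AsyncObjectStore` interface -- Switchable through runtime configuration +**Runtime switching**: +- Switch between Local and S3 through environment variables +- No changes to business code +- Custom storage backends are supported (just implement the interface) ### 5. Transactional Consistency -**Core operations all run inside a transaction**: - +**Two-phase commit across the database and object storage**: ```python -async def _upload_document_atomically(session): +async with transaction: # 1. Create the database record - # 2. Upload the file to object storage + document = create_document_record() + + # 2. Upload to object storage + await object_store.put(path, data) + # 3. Update metadata + document.doc_metadata = json.dumps(metadata) + # Commit only if every operation succeeds; roll back if any fails ``` -**Benefits**: - -- Avoids dirty data from partial success -- Keeps database records and object storage consistent -- Automatic cleanup on failure +**Failure handling**: +- Database record creation fails: the file is not uploaded +- File upload fails: the database record is rolled back +- Metadata update fails: the preceding operations are rolled back -### 6. Clear Layered Architecture +### 6.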
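Observability -``` -View Layer (views/collections.py) - ↓ calls -Service Layer (service/document_service.py) - ↓ calls -Repository Layer (db/ops.py, objectstore/) - ↓ accesses -Storage Layer (PostgreSQL, S3, Qdrant, ES, Neo4j) -``` +**Audit logs**: +- The `@audit` decorator records all document operations +- Includes: user, time, operation type, resource ID -**Separation of responsibilities**: +**Task tracking**: +- `gmt_last_reconciled`: time of last processing +- `error_message`: failure reason +- Celery task ID: correlates log traces -- View: HTTP handling, parameter validation, authentication -- Service: business logic, transaction orchestration -- Repository: data access -- Storage: data persistence +**Monitoring metrics**: +- Document upload rate +- Index build duration +- Failure rate statistics ## Performance Optimization -### 1. Chunked Upload (not implemented, planned) +### 1. Asynchronous Processing -```python -# Chunked upload support for large files -async def upload_document_chunk( - document_id: str, - chunk_index: int, - chunk_data: bytes, - total_chunks: int -): - # Upload a single chunk - # Merge once all chunks are complete - pass -``` +**Non-blocking upload**: +- Returns as soon as the file is uploaded to object storage +- Index building runs asynchronously in Celery +- The frontend gets progress through polling or WebSocket ### 2. Batch Operations -- `confirm_documents` supports batch confirmation -- `delete_documents` supports batch deletion -- Batch queries of index status +**Batch confirmation**: +```python +confirm_documents(document_ids=[id1, id2, ..., idN]) +``` +- Multiple documents are processed in a single transaction +- Index records are created in batches +- Fewer database round trips + +### 3. Caching Strategy -### 3. Asynchronous Processing +**Parse result cache**: +- Parsed content is saved to `processed_content.md` +- Later index rebuilds can read it directly without re-parsing -- Returns immediately after file upload -- Index building runs asynchronously in Celery -- The frontend gets progress through polling or WebSocket +**Chunk result cache**: +- Chunking results are saved to the `chunks/` directory +- Vector index rebuilds can reuse the chunking results -### 4. Object Storage Optimization +### 4. Parallel Index Building -- S3 uses multipart upload -- Local uses aiofiles for asynchronous writes -- Range requests (partial downloads) are supported +**Multiple indexes in parallel**: +```python +# VECTOR, FULLTEXT, and GRAPH can be built in parallel +await asyncio.gather( + create_vector_index(), + create_fulltext_index(), + create_graph_index() +) +``` ## Error Handling ### Common Exceptions -```python -# 1. Collection does not exist or is unavailable -raise ResourceNotFoundException("Collection", collection_id) -raise CollectionInactiveException(collection_id) - -# 2.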
File validation failed -raise invalid_param("file_type", f"unsupported file type {file_suffix}") -raise invalid_param("file_size", "file size is too large") +| Exception Type | HTTP Status Code | Trigger | Suggested Handling | +|---------|------------|----------|----------| +| `ResourceNotFoundException` | 404 | Collection or document does not exist | Check that the ID is correct | +| `CollectionInactiveException` | 400 | Collection is not active | Wait for collection initialization to finish | +| `DocumentNameConflictException` | 409 | Same name, different content | Rename the file or delete the old document | +| `QuotaExceededException` | 429 | Quota exceeded | Upgrade the plan or delete old documents | +| `InvalidFileTypeException` | 400 | Unsupported file type | See the list of supported file types | +| `FileSizeTooLargeException` | 413 | File too large | Split or compress the file | -# 3. Duplicate conflict -raise DocumentNameConflictException(filename, collection_id) +### Exception Propagation -# 4. Quota exceeded -raise QuotaExceededException("max_document_count", limit, current) - -# 5. Document does not exist -raise DocumentNotFoundException(f"Document not found: {document_id}") ``` - -### Exception Handling Layers - -**Unified exception handling in the View layer**: - -```python -# aperag/exception_handlers.py -@app.exception_handler(BusinessException) -async def business_exception_handler(request: Request, exc: BusinessException): - return JSONResponse( - status_code=400, - content={ - "error_code": exc.error_code.name, - "message": str(exc) - } - ) +The Service Layer raises an exception + │ + ▼ +The View Layer catches and converts it + │ + ▼ +The Exception Handler processes it uniformly + │ + ▼ +A standard JSON response is returned: +{ + "error_code": "QUOTA_EXCEEDED", + "message": "Document count limit exceeded", + "details": { + "limit": 1000, + "current": 1000 + } +} ``` -## Related Files +## Related Files Index ### Core Implementation -- `aperag/views/collections.py` - View layer endpoints -- `aperag/service/document_service.py` - Service layer business logic -- `aperag/source/upload.py` - UploadSource implementation -- `aperag/db/models.py` - Database models -- `aperag/db/ops.py` - Database operations -- `aperag/api/components/schemas/document.yaml` - OpenAPI Schema +- **View layer**: `aperag/views/collections.py` - HTTP endpoint definitions +- **Service layer**: `aperag/service/document_service.py` - Business logic +- **Database models**: `aperag/db/models.py` - Document and DocumentIndex table definitions +- **Database operations**: `aperag/db/ops.py` - CRUD operation wrappers ### Object Storage -- `aperag/objectstore/base.py` - Storage interface definition -- 
`aperag/objectstore/local.py` - Local storage implementation -- `aperag/objectstore/s3.py` - S3 storage implementation - -### Document Processing +- **Interface definition**: `aperag/objectstore/base.py` - AsyncObjectStore abstract class +- **Local implementation**: `aperag/objectstore/local.py` - Local filesystem storage +- **S3 implementation**: `aperag/objectstore/s3.py` - S3-compatible storage -- `aperag/docparser/doc_parser.py` - Document parser -- `aperag/docparser/chunking.py` - Document chunking -- `aperag/index/manager.py` - Index manager -- `aperag/index/vector_index.py` - Vector index -- `aperag/index/fulltext_index.py` - Full-text index -- `aperag/index/graph_index.py` - Graph index +### Document Parsing -### Task Queue +- **Main controller**: `aperag/docparser/doc_parser.py` - DocParser +- **Parser implementations**: + - `aperag/docparser/mineru_parser.py` - MinerU PDF parsing + - `aperag/docparser/docray_parser.py` - DocRay document parsing + - `aperag/docparser/markitdown_parser.py` - MarkItDown general-purpose parsing + - `aperag/docparser/image_parser.py` - Image OCR + - `aperag/docparser/audio_parser.py` - Audio transcription +- **Document processing**: `aperag/index/document_parser.py` - Parsing flow orchestration -- `config/celery_tasks.py` - Celery task definitions -- `aperag/tasks/` - Task implementations +### Index Building -### Frontend Implementation +- **Index management**: `aperag/index/manager.py` - DocumentIndexManager +- **Vector index**: `aperag/index/vector_index.py` - VectorIndexer +- **Full-text index**: `aperag/index/fulltext_index.py` - FulltextIndexer +- **Knowledge graph**: `aperag/index/graph_index.py` - GraphIndexer +- **Document summary**: `aperag/index/summary_index.py` - SummaryIndexer +- **Vision index**: `aperag/index/vision_index.py` - VisionIndexer -- `web/src/app/workspace/collections/[collectionId]/documents/page.tsx` - Document list page -- `web/src/components/documents/upload-documents.tsx` - Upload component +### Task Scheduling -## Summary +- **Task definitions**: `config/celery_tasks.py` - Celery task registration +- **Reconciler**: `aperag/tasks/reconciler.py` - DocumentIndexReconciler +- **Document tasks**: `aperag/tasks/document.py` - DocumentIndexTask -ApeRAG's document upload module uses a **two-phase commit + idempotent design + flexible storage** architecture: +### Frontend Implementation -1. **Two-phase commit**: upload (UPLOADED) → confirm (PENDING) → index building -2. **SHA-256 hash deduplication**: avoids duplicate documents and supports idempotent uploads -3. **Flexible storage backends**: Local/S3, switchable by configuration -4. **Quota management**: quota is deducted only at the confirmation phase, keeping resource use under control -5. **Multi-index coordination**: vector, full-text, graph, summary, and vision index types -6. **Clear layered architecture**: View → Service → Repository → Storage -7.
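**Asynchronous processing with Celery**: index building does not block the upload response -8. **Transactional consistency**: database and object storage operations are atomic +- **Document list**: `web/src/app/workspace/collections/[collectionId]/documents/page.tsx` +- **Document upload**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` -This design delivers good performance, supports complex document processing scenarios, and offers good extensibility and fault tolerance. +## Summary +ApeRAG's document upload module uses an architecture of **two-phase commit + chained multi-parser invocation + parallel multi-index building**: + +**Core features**: +1. ✅ **Two-phase commit**: upload (temporary storage) → confirm (formal addition), for a better user experience +2. ✅ **SHA-256 deduplication**: avoids duplicate documents and supports idempotent uploads +3. ✅ **Flexible storage backends**: Local/S3, switchable by configuration behind a unified interface +4. ✅ **Multi-parser architecture**: supports MinerU, DocRay, MarkItDown, and other parsers +5. ✅ **Automatic format conversion**: PDF → images, audio → text, images → OCR text +6. ✅ **Multi-index coordination**: five index types (vector, full-text, graph, summary, vision) +7. ✅ **Quota management**: quota is deducted only at the confirmation phase, keeping resource use under control +8. ✅ **Asynchronous processing**: a Celery task queue that does not block the user +9. ✅ **Transactional consistency**: two-phase commit across the database and object storage +10. ✅ **Observability**: audit logs, task tracking, and complete error records + +This design delivers high performance and extensibility, supports complex document processing scenarios (multi-format, multilingual, multimodal), and offers good fault tolerance and user experience. diff --git a/web/docs/en-US/design/document_upload_design.md b/web/docs/en-US/design/document_upload_design.md index b499b57d..3b2937ba 100644 --- a/web/docs/en-US/design/document_upload_design.md +++ b/web/docs/en-US/design/document_upload_design.md @@ -1,880 +1,227 @@ --- -title: Document Upload Flow Design -description: Detailed explanation of ApeRAG frontend document upload functionality, including three-step upload process, state management, concurrency control, and user interaction design -keywords: [document upload, file upload, two-phase commit, progress tracking, batch upload, react, next.js] +title: Document Upload Architecture Design +description: Detailed explanation of ApeRAG document upload module's complete architecture design, including upload process, temporary storage configuration, document parsing, format conversion, database design, etc.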
+keywords: [document upload, architecture, object store, parser, index building, two-phase commit] --- -# Document Upload Flow Design +# ApeRAG Document Upload Architecture Design ## Overview -ApeRAG's document upload feature adopts a **three-step guided upload** design, providing intuitive user experience and reliable upload mechanism. +This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction. -**Core Features**: -- 📤 **Three-step Guided Process**: Select Files → Upload to Temporary Storage → Confirm Addition to Knowledge Base -- 🔄 **Smart Duplicate Detection**: Frontend deduplication based on filename, size, modification time, and type -- 📊 **Real-time Progress Tracking**: Each file displays upload progress and status independently -- ⚡ **Concurrent Upload Control**: Limit to 3 concurrent uploads to avoid browser resource exhaustion -- 🎯 **Batch Operation Support**: Support batch selection, deletion, and confirmation +**Core Design Philosophy**: Adopts a **two-phase commit** pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities. 
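
The two-phase pattern described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the actual `document_service` implementation: the `InMemoryCollection` class and its `upload`/`confirm` methods are hypothetical names, and the response keys simply mirror the `confirmed_count`, `failed_count`, and `failed_documents` fields of the confirm response. Treating a repeated confirmation as a no-op success is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class DocumentStatus(str, Enum):
    UPLOADED = "UPLOADED"  # phase 1: stored temporarily, not indexed
    PENDING = "PENDING"    # phase 2: confirmed, waiting for index building


@dataclass
class Document:
    document_id: str
    filename: str
    status: DocumentStatus = DocumentStatus.UPLOADED


@dataclass
class InMemoryCollection:
    """Hypothetical stand-in for the documents of one collection."""

    documents: dict = field(default_factory=dict)

    def upload(self, document_id: str, filename: str) -> Document:
        # Phase 1: record the file in the temporary UPLOADED state.
        doc = Document(document_id, filename)
        self.documents[document_id] = doc
        return doc

    def confirm(self, document_ids: list) -> dict:
        # Phase 2: move confirmed documents from UPLOADED to PENDING.
        confirmed = 0
        failed = []
        for doc_id in document_ids:
            doc = self.documents.get(doc_id)
            if doc is None:
                failed.append({"document_id": doc_id, "error": "CONFIRMATION_FAILED"})
                continue
            if doc.status is DocumentStatus.UPLOADED:
                doc.status = DocumentStatus.PENDING
            # Assumption: confirming an already confirmed document is a no-op.
            confirmed += 1
        return {
            "confirmed_count": confirmed,
            "failed_count": len(failed),
            "failed_documents": failed,
        }
```

Documents that are uploaded but never confirmed stay in `UPLOADED`, which is what would let a cleanup job remove them without touching confirmed data.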
+ +## System Architecture -## Three-Step Upload Process +### Overall Architecture ``` ┌─────────────────────────────────────────────────────────────┐ -│ Step 1: Select Files │ -│ - Drag & drop or click to select files │ -│ - Frontend file validation (type, size, duplicate) │ -│ - Display file list with pending status │ -└────────────────────────┬────────────────────────────────────┘ - │ - ▼ +│ Frontend │ +│ (Next.js) │ +└────────┬───────────────────────────────────┬────────────────┘ + │ │ + │ Step 1: Upload │ Step 2: Confirm + │ POST /documents/upload │ POST /documents/confirm + ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Step 2: Upload Files │ -│ - Concurrent upload to temporary storage (max 3) │ -│ - Real-time progress display (0-100%) │ -│ - Independent status per file: uploading → success/failed │ -│ - Backend returns document_id (status: UPLOADED) │ -└────────────────────────┬────────────────────────────────────┘ - │ - ▼ +│ View Layer: aperag/views/collections.py │ +│ - HTTP request handling │ +│ - JWT authentication │ +│ - Parameter validation │ +└────────┬───────────────────────────────────┬────────────────┘ + │ │ + │ document_service.upload_document() │ document_service.confirm_documents() + ▼ ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Service Layer: aperag/service/document_service.py │ +│ - Business logic orchestration │ +│ - File validation (type, size) │ +│ - SHA-256 hash deduplication │ +│ - Quota checking │ +│ - Transaction management │ +└────────┬───────────────────────────────────┬────────────────┘ + │ │ + │ Step 1 │ Step 2 + ▼ ▼ +┌────────────────────────┐ ┌────────────────────────────┐ +│ 1. Create Document │ │ 1. Update Document status │ +│ status=UPLOADED │ │ UPLOADED → PENDING │ +│ 2. Save to ObjectStore│ │ 2. Create DocumentIndex │ +│ 3. Calculate hash │ │ 3. 
Trigger indexing tasks │ +└────────┬───────────────┘ └────────┬───────────────────┘ + │ │ + ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Step 3: Confirm Addition │ -│ - Enter this step after all files uploaded successfully │ -│ - User can selectively confirm partial files │ -│ - Click "Save to Collection" to trigger confirm API │ -│ - Backend starts index building, document status → PENDING │ -└────────────────────────┬────────────────────────────────────┘ +│ Storage Layer │ +│ │ +│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ +│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ +│ │ │ │ │ │ │ │ +│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ +│ │ - document_ │ │ - Original files │ │ - Vectors │ │ +│ │ index │ │ - Converted files│ │ │ │ +│ └───────────────┘ └──────────────────┘ └─────────────┘ │ +│ │ +│ ┌───────────────┐ ┌──────────────────┐ │ +│ │ Elasticsearch │ │ Neo4j/PG │ │ +│ │ │ │ │ │ +│ │ - Full-text │ │ - Knowledge Graph│ │ +│ └───────────────┘ └──────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ │ ▼ - Navigate to document list page -``` - -## Component Architecture - -### Core Component: DocumentUpload - -**File Path**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - -**Component Structure**: - -```tsx -DocumentUpload -├── FileUpload (File upload area) -│ ├── FileUploadDropzone (Drag & drop) -│ └── FileUploadTrigger (Click to select) -│ -├── Progress Indicators -│ ├── Step 1: Select Files -│ ├── Step 2: Upload Files -│ └── Step 3: Save to Collection -│ -├── DataGrid (File list table) -│ ├── Checkbox (Batch selection) -│ ├── FileIcon (File type icon) -│ ├── Progress Bar (Upload progress) -│ └── Actions (Action menu) -│ -└── Action Buttons - ├── Upload Button (Start upload) - ├── Stop Upload Button (Cancel upload) - ├── Clear All (Clear list) - └── Save to Collection (Confirm addition) -``` - -## Data Structures - -### DocumentsWithFile Type - 
-```typescript -type DocumentsWithFile = { - // Frontend file object - file: File; - - // Upload progress (0-100) - progress: number; - - // Upload status - progress_status: 'pending' | 'uploading' | 'success' | 'failed'; - - // Backend returned data (populated after successful upload) - document_id?: string; // Document ID - filename?: string; // Filename - size?: number; // File size - status?: UploadDocumentResponseStatusEnum; // Document status (UPLOADED) -}; -``` - -### State Management - -```typescript -const [documents, setDocuments] = useState([]); // File list -const [step, setStep] = useState(1); // Current step -const [rowSelection, setRowSelection] = useState({}); // Selected rows -const [isUploading, setIsUploading] = useState(false); // Uploading flag -const [pagination, setPagination] = useState({ // Pagination state - pageIndex: 0, - pageSize: 20, -}); - -// Set of files being uploaded (to avoid duplicate uploads) -const uploadingFilesRef = useRef>(new Set()); -``` - -## Core Feature Implementation - -### 1. File Selection and Validation - -**File Validation Logic**: - -```typescript -const onFileValidate = useCallback( - (file: File): string | null => { - // Check if same file already exists - const doc = documents.some( - (doc) => - doc.file.name === file.name && - doc.file.size === file.size && - doc.file.lastModified === file.lastModified && - doc.file.type === file.type, - ); - if (doc) { - return 'File already exists.'; - } - return null; - }, - [documents], -); + ┌───────────────────┐ + │ Celery Workers │ + │ │ + │ - Doc parsing │ + │ - Format convert │ + │ - Content extract│ + │ - Doc chunking │ + │ - Index building │ + └───────────────────┘ ``` -**File Rejection Handling**: +### Layered Architecture -```typescript -const onFileReject = useCallback((file: File, message: string) => { - toast.error(message, { - description: `"${file.name.length > 20 ? 
`${file.name.slice(0, 20)}...` : file.name}" has been rejected`, - }); -}, []); ``` - -**Duplicate Detection Strategy**: - -| Check Item | Description | Purpose | -|------------|-------------|---------| -| `file.name` | Filename | Basic deduplication | -| `file.size` | File size (bytes) | Exact match | -| `file.lastModified` | Last modified timestamp | Distinguish same-name files | -| `file.type` | MIME type | Ensure complete match | - -### 2. Concurrent Upload Control - -**Using async.eachLimit to Control Concurrency**: - -```typescript -import async from 'async'; - -const startUpload = useCallback((docs: DocumentsWithFile[]) => { - // 1. Filter files to upload - const filesToUpload = docs.filter((doc) => { - const fileKey = `${doc.file.name}-${doc.file.size}-${doc.file.lastModified}`; - return ( - doc.progress_status === 'pending' && - !doc.document_id && - !uploadingFilesRef.current.has(fileKey) // Avoid duplicate upload - ); - }); - - // 2. Mark as uploading - filesToUpload.forEach((doc) => { - const fileKey = `${doc.file.name}-${doc.file.size}-${doc.file.lastModified}`; - uploadingFilesRef.current.add(fileKey); - }); - - // 3. Create upload tasks - const tasks: AsyncTask[] = filesToUpload.map((_doc) => async (callback) => { - // ... upload logic - }); - - // 4. 
Execute concurrently (max 3 concurrent) - async.eachLimit( - tasks, - 3, // Concurrency limit - (task, callback) => { - if (uploadController?.signal.aborted) { - callback(new Error('stop upload')); - } else { - task(callback); - } - }, - (err) => { - setIsUploading(false); - }, - ); -}, [collection.id]); +┌─────────────────────────────────────────────┐ +│ View Layer (views/collections.py) │ HTTP handling, auth, validation +└─────────────────┬───────────────────────────┘ + │ calls +┌─────────────────▼───────────────────────────┐ +│ Service Layer (service/document_service.py)│ Business logic, transaction, permission +└─────────────────┬───────────────────────────┘ + │ calls +┌─────────────────▼───────────────────────────┐ +│ Repository Layer (db/ops.py, objectstore/) │ Data access abstraction +└─────────────────┬───────────────────────────┘ + │ accesses +┌─────────────────▼───────────────────────────┐ +│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ Data persistence +└─────────────────────────────────────────────┘ ``` -**Concurrency Control Benefits**: - -- ✅ Limit browser simultaneous requests to avoid resource exhaustion -- ✅ Avoid backend overload -- ✅ Support canceling all uploads mid-way -- ✅ Better progress tracking - -### 3. 
Upload Progress Tracking - -**Simulated Progress Display** (Actual upload + progress animation): - -```typescript -const networkSimulation = async () => { - const totalChunks = 100; - let uploadedChunks = 0; - - for (let i = 0; i < totalChunks; i++) { - // Update progress every 5-10ms - await new Promise((resolve) => - setTimeout(resolve, Math.random() * 5 + 5), - ); - - uploadedChunks++; - const progress = (uploadedChunks / totalChunks) * 99; // Max 99% - - // Update specific file's progress - setDocuments((docs) => { - const doc = docs.find((doc) => _.isEqual(doc.file, file)); - if (doc) { - doc.progress = Number(progress.toFixed(0)); - doc.progress_status = 'uploading'; - } - return [...docs]; - }); - } -}; - -// Execute upload and progress animation in parallel -const [res] = await Promise.all([ - apiClient.defaultApi.collectionsCollectionIdDocumentsUploadPost({ - collectionId: collection.id, - file: _doc.file, - }), - networkSimulation(), // Progress animation -]); - -// Upload successful, set progress to 100% -setDocuments((docs) => { - const doc = docs.find((doc) => _.isEqual(doc.file, file)); - if (doc && res.data.document_id) { - Object.assign(doc, { - ...res.data, - progress: 100, - progress_status: 'success', - }); - } - return [...docs]; -}); -``` - -**Why Simulate Progress?** +## Core Process Details -1. HTTP upload cannot get real-time progress (browser limitation) -2. Provide better user experience, avoid long periods without feedback -3. Visually smoother, better user perception +For the complete documentation including: +- API Interface definitions +- File upload and temporary storage +- Document confirmation and index building +- Parser architecture and format conversion +- Index building flow +- Database design (document and document_index tables) +- State machine and lifecycle +- Async task scheduling (Celery) +- Design features and advantages +- Performance optimization +- Error handling -### 4. 
Cancel Upload +Please refer to the main design document at `/docs/design/document_upload_design.md`. -**Using AbortController**: +## Quick Reference -```typescript -let uploadController: AbortController | undefined; +### API Endpoints -// Stop upload -const stopUpload = useCallback(() => { - setIsUploading(false); - uploadController?.abort(); // Abort all ongoing requests -}, []); +1. **Upload File**: `POST /api/v1/collections/{collection_id}/documents/upload` +2. **Confirm Documents**: `POST /api/v1/collections/{collection_id}/documents/confirm` +3. **One-step Upload**: `POST /api/v1/collections/{collection_id}/documents` -// Auto-stop when page unmounts -useEffect(() => stopUpload, [stopUpload]); +### Document Status Flow -// Create new controller when starting upload -const startUpload = () => { - uploadController = new AbortController(); - // ... -}; ``` - -### 5. Confirm Addition to Knowledge Base - -**Step 3: Save to Collection**: - -```typescript -const handleSaveToCollection = useCallback(async () => { - if (!collection.id) return; - - // Call confirm API - const res = await apiClient.defaultApi.collectionsCollectionIdDocumentsConfirmPost({ - collectionId: collection.id, - confirmDocumentsRequest: { - document_ids: documents - .map((doc) => doc.document_id || '') - .filter((id) => !_.isEmpty(id)), - }, - }); - - if (res.status === 200) { - toast.success('Document added successfully'); - // Navigate back to document list - router.push(`/workspace/collections/${collection.id}/documents`); - } -}, [collection.id, documents, router]); +[Upload] → UPLOADED → [Confirm] → PENDING → RUNNING → COMPLETE + ↓ ↓ + [Delete] FAILED + ↓ ↓ + DELETED ←──────────────┘ ``` -## API Integration +### Object Storage Configuration -### 1. 
Upload File API - -**Endpoint**: `POST /api/v1/collections/{collectionId}/documents/upload` - -**Request**: - -```typescript -apiClient.defaultApi.collectionsCollectionIdDocumentsUploadPost({ - collectionId: collection.id, - file: file, // File object -}, { - timeout: 1000 * 30, // 30 second timeout -}); +**Local Storage**: +```bash +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=.objects ``` -**Response**: - -```typescript -{ - document_id: "doc_xyz789", - filename: "example.pdf", - size: 2048576, - status: "UPLOADED" -} +**S3 Storage**: +```bash +OBJECT_STORE_TYPE=s3 +OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 +OBJECT_STORE_S3_BUCKET=aperag +OBJECT_STORE_S3_ACCESS_KEY=minioadmin +OBJECT_STORE_S3_SECRET_KEY=minioadmin ``` -### 2. Confirm Documents API - -**Endpoint**: `POST /api/v1/collections/{collectionId}/documents/confirm` - -**Request**: - -```typescript -apiClient.defaultApi.collectionsCollectionIdDocumentsConfirmPost({ - collectionId: collection.id, - confirmDocumentsRequest: { - document_ids: ["doc_xyz789", "doc_abc123", ...] - } -}); -``` - -**Response**: - -```typescript -{ - confirmed_count: 3, - failed_count: 1, - failed_documents: [ - { - document_id: "doc_fail123", - name: "corrupted.pdf", - error: "CONFIRMATION_FAILED" - } - ] -} -``` - -## UI Component Details - -### 1. File Upload Area - -```tsx - doc.file)} - onValueChange={(files) => { - const newFilesToUpload: DocumentsWithFile[] = []; - files.forEach((file) => { - if ( - !documents.some( - (doc) => - doc.file.name === file.name && - doc.file.size === file.size && - doc.file.lastModified === file.lastModified && - doc.file.type === file.type, - ) - ) { - newFilesToUpload.push({ - file, - progress: 0, - progress_status: 'pending', - }); - } - }); - if (newFilesToUpload.length > 0) { - setDocuments((docs) => [...docs, ...newFilesToUpload]); - } - }} - onFileReject={onFileReject} - onFileValidate={onFileValidate} -> - -
-  {/* Dropzone: "Drag and drop files here" / "or" / file-select trigger */}
-``` - -**Features**: -- Support drag & drop upload -- Support click to select files -- Automatic file validation -- Duplicate file detection - -### 2. Progress Indicators - -```tsx -
-  {/* Step 1 */} Select Files
-  {/* Step 2 */} Upload Files
-  {/* Step 3 */} Save to Collection
-``` - -**Step Auto-switching Logic**: - -```typescript -useEffect(() => { - if (documents.length === 0) { - setStep(1); // No files → Step 1 - } else if ( - documents.filter((doc) => doc.progress_status === 'success').length !== - documents.length - ) { - setStep(2); // Has incomplete uploads → Step 2 - } else { - setStep(3); // All uploads complete → Step 3 - } -}, [documents]); -``` - -### 3. File List Table - -Implemented using `@tanstack/react-table`: - -```typescript -const columns: ColumnDef[] = [ - { - id: 'select', - header: ({ table }) => ( - table.toggleAllPageRowsSelected(!!value)} - /> - ), - cell: ({ row }) => ( - row.toggleSelected(!!value)} - /> - ), - }, - { - accessorKey: 'filename', - header: 'Filename', - cell: ({ row }) => { - const file = row.original.file; - const extension = _.last(file.type.split('/')) || ''; - return ( -
-          {file.name}
-          {(file.size / 1000).toFixed(0)} KB
- ); - }, - }, - { - header: 'Upload Progress', - cell: ({ row }) => ( -
-          {row.original.progress}%
-          {row.original.progress_status}
- ), - }, - { - id: 'actions', - cell: ({ row }) => ( - - - - - - handleRemoveFile(row.original)} - > - Remove - - - - ), - }, -]; -``` - -**Table Features**: -- ✅ Checkbox batch selection -- ✅ File type icon display -- ✅ Real-time progress bar -- ✅ Status color coding -- ✅ Pagination support (20 items per page) -- ✅ Delete action - -### 4. Action Buttons - -```tsx -
- {/* Clear All */} - - - {/* Start Upload */} - - - {/* Stop Upload */} - {isUploading && ( - - )} - - {/* Save to Collection */} - -
-``` - -## State Management Flow - -``` -Initial State -├── documents: [] -├── step: 1 -├── isUploading: false -└── uploadingFilesRef.current: Set() - -↓ User selects files - -Step 1: File Selection Complete -├── documents: [{file, progress: 0, progress_status: 'pending'}, ...] -├── step: 1 -├── isUploading: false -└── uploadingFilesRef.current: Set() - -↓ Click "Start Upload" - -Step 2: Uploading -├── documents: [{..., progress: 45, progress_status: 'uploading'}, ...] -├── step: 2 -├── isUploading: true -└── uploadingFilesRef.current: Set('file1-key', 'file2-key', ...) - -↓ Upload complete - -Step 3: Waiting for Confirmation -├── documents: [{..., progress: 100, progress_status: 'success', document_id: 'doc_xyz'}, ...] -├── step: 3 -├── isUploading: false -└── uploadingFilesRef.current: Set() - -↓ Click "Save to Collection" - -Navigate to document list page -``` - -## Error Handling - -### 1. Upload Failure - -```typescript -catch (err) { - setDocuments((docs) => { - const doc = docs.find((doc) => _.isEqual(doc.file, file)); - if (doc) { - Object.assign(doc, { - progress: 0, - progress_status: 'failed', - }); - } - return [...docs]; - }); -} -``` +### Supported Parsers -**Actions After Failure**: -- Reset progress to 0 -- Mark status as `failed` -- Can click "Start Upload" again to retry -- Can delete failed files +- **MinerUParser**: High-precision PDF parsing +- **DocRayParser**: Document layout analysis +- **ImageParser**: Image OCR and vision understanding +- **AudioParser**: Audio transcription +- **MarkItDownParser**: Universal fallback parser -### 2. File Validation Failure +### Index Types -```typescript -// Return error message in onFileValidate -return 'File already exists.'; - -// Or handle in onFileReject -onFileReject={(file, message) => { - toast.error(message, { - description: `"${file.name}" has been rejected`, - }); -}} -``` - -### 3. 
Network Interruption - -```typescript -// User can click "Stop Upload" -const stopUpload = () => { - uploadController?.abort(); // Abort all requests - setIsUploading(false); -}; - -// Auto-stop when page unmounts -useEffect(() => stopUpload, [stopUpload]); -``` - -## Performance Optimization - -### 1. Debounce and Throttle - -```typescript -// Use lodash for file comparison (efficient) -_.isEqual(doc.file, file) - -// File key generation (fast lookup) -const fileKey = `${file.name}-${file.size}-${file.lastModified}`; -``` - -### 2. State Update Optimization - -```typescript -// Use functional update to avoid closure trap -setDocuments((docs) => { - const doc = docs.find(...); - // Modify - return [...docs]; // Return new array to trigger update -}); -``` - -### 3. Pagination Display - -```typescript -// Default 20 items per page to avoid large list rendering lag -const [pagination, setPagination] = useState({ - pageIndex: 0, - pageSize: 20, -}); -``` - -### 4. Virtual Scrolling (Not Implemented, Can Optimize) - -For very large file lists (1000+), can use virtual scrolling: - -```typescript -import { useVirtualizer } from '@tanstack/react-virtual'; -``` - -## User Experience Design - -### 1. Instant Feedback - -- ✅ Show highlight area when dragging -- ✅ Show animation icon during upload -- ✅ Real-time progress bar updates -- ✅ Distinguish status by color (pending/uploading/success/failed) - -### 2. Error Messages - -- ✅ File validation failed: Toast notification -- ✅ Upload failed: Status marked red -- ✅ Confirmation failed: Show specific error message - -### 3. Operation Guidance - -- ✅ Three-step progress indicator -- ✅ Buttons enabled/disabled based on state -- ✅ Empty state prompt -- ✅ Auto-navigate after successful operation - -### 4. 
Responsive Design - -- ✅ Table adapts on small screens -- ✅ Action buttons stack on mobile -- ✅ Long filenames truncated - -## Internationalization Support - -Using `next-intl` for internationalization: - -```typescript -const page_documents = useTranslations('page_documents'); - -// Usage -page_documents('filename') -page_documents('upload_progress') -page_documents('drag_and_drop_files_here') -page_documents('step1_select_files') -page_documents('step2_upload_files') -page_documents('step3_save_to_collection') -``` - -**Translation File Locations**: -- `web/src/locales/en-US/page_documents.json` -- `web/src/locales/zh-CN/page_documents.json` - -## Best Practices - -### 1. File Size Limit - -```typescript -// Frontend check (optional) -const MAX_FILE_SIZE = 100 * 1024 * 1024; // 100MB - -if (file.size > MAX_FILE_SIZE) { - return 'File size exceeds 100MB'; -} -``` - -### 2. Supported File Types - -Frontend can limit file types, but final validation is on backend: - -```typescript -const ALLOWED_TYPES = [ - 'application/pdf', - 'application/msword', - 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', - 'text/plain', - // ... -]; - -if (!ALLOWED_TYPES.includes(file.type)) { - return 'File type not supported'; -} -``` - -### 3. 
Auto-retry Mechanism (Not Implemented, Recommended) - -```typescript -const uploadWithRetry = async (file: File, retries = 3) => { - for (let i = 0; i < retries; i++) { - try { - return await apiClient.upload(file); - } catch (err) { - if (i === retries - 1) throw err; - await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, i))); - } - } -}; -``` +| Type | Required | Storage | +|------|----------|---------| +| VECTOR | ✅ | Qdrant | +| FULLTEXT | ✅ | Elasticsearch | +| GRAPH | ❌ | Neo4j/PostgreSQL | +| SUMMARY | ❌ | PostgreSQL | +| VISION | ❌ | Qdrant + PostgreSQL | ## Related Files -### Frontend Components -- `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - Main upload component -- `web/src/app/workspace/collections/[collectionId]/documents/upload/page.tsx` - Upload page -- `web/src/components/ui/file-upload.tsx` - File upload UI component -- `web/src/components/ui/progress.tsx` - Progress bar component -- `web/src/components/data-grid.tsx` - Data table component - -### API Client -- `web/src/lib/api/client.ts` - API client configuration -- `web/src/api/` - Auto-generated API interfaces - -### Internationalization -- `web/src/locales/en-US/page_documents.json` - English translations -- `web/src/locales/zh-CN/page_documents.json` - Chinese translations +### Backend Core +- `aperag/views/collections.py` - View layer +- `aperag/service/document_service.py` - Service layer +- `aperag/db/models.py` - Database models + +### Object Storage +- `aperag/objectstore/base.py` - Storage interface +- `aperag/objectstore/local.py` - Local storage +- `aperag/objectstore/s3.py` - S3 storage + +### Document Parsing +- `aperag/docparser/doc_parser.py` - Main parser +- `aperag/docparser/mineru_parser.py` - MinerU parser +- `aperag/docparser/docray_parser.py` - DocRay parser +- `aperag/docparser/markitdown_parser.py` - MarkItDown parser +- `aperag/docparser/image_parser.py` - Image parser +- `aperag/docparser/audio_parser.py` - 
Audio parser + +### Index Building +- `aperag/index/vector_index.py` - Vector indexer +- `aperag/index/fulltext_index.py` - Full-text indexer +- `aperag/index/graph_index.py` - Graph indexer +- `aperag/index/summary_index.py` - Summary indexer +- `aperag/index/vision_index.py` - Vision indexer + +### Task Scheduling +- `config/celery_tasks.py` - Celery tasks +- `aperag/tasks/reconciler.py` - Index reconciler +- `aperag/tasks/document.py` - Document tasks + +### Frontend +- `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - Upload component ## Summary -ApeRAG's document upload feature provides intuitive and reliable user experience through a **three-step guided process**: - -1. **Step 1 - Select Files**: Drag & drop or click to select, instant frontend validation -2. **Step 2 - Upload Files**: Concurrent upload to temporary storage, real-time progress tracking -3. **Step 3 - Confirm Addition**: User selective confirmation, triggers index building - -**Core Advantages**: -- 🎯 **User-Friendly**: Clear three-step process, explicit operation guidance -- ⚡ **Performance Optimized**: Concurrency control, pagination display, state management optimization -- 🔒 **High Reliability**: Duplicate detection, error handling, mid-upload cancellation support -- 🌍 **Internationalized**: Complete multi-language support -- 📱 **Responsive**: Adapts to mobile and desktop - -This design ensures functional completeness while providing excellent user experience and system stability. +ApeRAG's document upload module adopts a **two-phase commit + multi-parser chain invocation + parallel multi-index building** architecture: +**Core Features**: +1. ✅ **Two-Phase Commit**: Upload (temporary) → Confirm (formal), better UX +2. ✅ **SHA-256 Deduplication**: Prevents duplicates, idempotent upload +3. ✅ **Flexible Storage**: Local/S3 configurable, unified interface +4. ✅ **Multi-Parser**: MinerU, DocRay, MarkItDown, and more +5. 
✅ **Auto Conversion**: PDF→images, audio→text, image→OCR +6. ✅ **Multi-Index**: Vector, full-text, graph, summary, vision +7. ✅ **Quota Management**: Deducted at confirmation stage +8. ✅ **Async Processing**: Celery task queue, non-blocking +9. ✅ **Transaction Consistency**: Database + object store 2PC +10. ✅ **Observability**: Audit logs, task tracking, error recording + +For complete details, please refer to `/docs/design/document_upload_design.md`. diff --git a/web/docs/zh-CN/design/document_upload_design.md b/web/docs/zh-CN/design/document_upload_design.md index 4c91f4ab..3a0a0ec6 100644 --- a/web/docs/zh-CN/design/document_upload_design.md +++ b/web/docs/zh-CN/design/document_upload_design.md @@ -1,881 +1,1083 @@ --- -title: 文档上传流程设计 -description: 详细说明ApeRAG前端文档上传功能的完整实现,包括三步上传流程、状态管理、并发控制和用户交互设计 -keywords: [document upload, file upload, two-phase commit, progress tracking, batch upload, react, next.js] +title: 文档上传架构设计 +description: 详细说明ApeRAG文档上传模块的完整架构设计,包括上传流程、临时存储配置、文档解析、格式转换、数据库设计等 +keywords: [document upload, architecture, object store, parser, index building, two-phase commit] --- -# 文档上传流程设计 +# ApeRAG 文档上传架构设计 ## 概述 -ApeRAG的文档上传功能采用**三步引导式上传**设计,提供直观的用户体验和可靠的上传机制。 +本文档详细说明 ApeRAG 项目中文档上传模块的完整架构设计,涵盖从文件上传、临时存储、文档解析、格式转换到最终索引构建的全链路流程。 -**核心特性**: -- 📤 **三步引导流程**: 选择文件 → 上传到临时存储 → 确认添加到知识库 -- 🔄 **智能重复检测**: 基于文件名、大小、修改时间和类型的前端去重 -- 📊 **实时进度跟踪**: 每个文件独立显示上传进度和状态 -- ⚡ **并发上传控制**: 限制同时上传3个文件,避免浏览器资源耗尽 -- 🎯 **批量操作支持**: 支持批量选择、批量删除、批量确认 +**核心设计理念**:采用**两阶段提交**模式,将文件上传(临时存储)和文档确认(正式添加)分离,提供更好的用户体验和资源管理能力。 -## 三步上传流程 +## 系统架构 + +### 整体架构图 ``` ┌─────────────────────────────────────────────────────────────┐ -│ Step 1: 选择文件 │ -│ - 拖拽上传或点击选择文件 │ -│ - 前端文件验证(类型、大小、重复) │ -│ - 显示文件列表,状态为 pending │ -└────────────────────────┬────────────────────────────────────┘ - │ - ▼ +│ Frontend │ +│ (Next.js) │ +└────────┬───────────────────────────────────┬────────────────┘ + │ │ + │ Step 1: Upload │ Step 2: Confirm + │ POST /documents/upload │ POST /documents/confirm + ▼ ▼ 
┌─────────────────────────────────────────────────────────────┐ -│ Step 2: 上传文件 │ -│ - 并发上传到临时存储(最多3个并发) │ -│ - 实时显示上传进度(0-100%) │ -│ - 每个文件独立状态:uploading → success/failed │ -│ - 后端返回 document_id(状态:UPLOADED) │ -└────────────────────────┬────────────────────────────────────┘ - │ - ▼ +│ View Layer: aperag/views/collections.py │ +│ - HTTP请求处理 │ +│ - JWT身份验证 │ +│ - 参数验证 │ +└────────┬───────────────────────────────────┬────────────────┘ + │ │ + │ document_service.upload_document() │ document_service.confirm_documents() + ▼ ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Service Layer: aperag/service/document_service.py │ +│ - 业务逻辑编排 │ +│ - 文件验证(类型、大小) │ +│ - SHA-256 哈希去重 │ +│ - Quota 检查 │ +│ - 事务管理 │ +└────────┬───────────────────────────────────┬────────────────┘ + │ │ + │ Step 1 │ Step 2 + ▼ ▼ +┌────────────────────────┐ ┌────────────────────────────┐ +│ 1. 创建 Document 记录 │ │ 1. 更新 Document 状态 │ +│ status=UPLOADED │ │ UPLOADED → PENDING │ +│ 2. 保存到 ObjectStore │ │ 2. 创建 DocumentIndex 记录│ +│ 3. 计算 content_hash │ │ 3. 
触发索引构建任务 │ +└────────┬───────────────┘ └────────┬───────────────────┘ + │ │ + ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ -│ Step 3: 确认添加 │ -│ - 所有文件上传成功后进入此步骤 │ -│ - 用户可以选择性确认部分文件 │ -│ - 点击"保存到知识库"触发确认API │ -│ - 后端开始索引构建,文档状态变为 PENDING │ -└────────────────────────┬────────────────────────────────────┘ +│ Storage Layer │ +│ │ +│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ +│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ +│ │ │ │ │ │ │ │ +│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ +│ │ - document_ │ │ - 原始文件 │ │ - 向量索引 │ │ +│ │ index │ │ - 转换后的文件 │ │ │ │ +│ └───────────────┘ └──────────────────┘ └─────────────┘ │ +│ │ +│ ┌───────────────┐ ┌──────────────────┐ │ +│ │ Elasticsearch │ │ Neo4j/PG │ │ +│ │ │ │ │ │ +│ │ - 全文索引 │ │ - 知识图谱 │ │ +│ └───────────────┘ └──────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ │ ▼ - 跳转到文档列表页面 -``` - -## 组件架构 - -### 核心组件: DocumentUpload - -**文件路径**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - -**组件结构**: - -```tsx -DocumentUpload -├── FileUpload (文件上传区域) -│ ├── FileUploadDropzone (拖拽上传) -│ └── FileUploadTrigger (点击选择) -│ -├── Progress Indicators (进度指示器) -│ ├── Step 1: 选择文件 -│ ├── Step 2: 上传文件 -│ └── Step 3: 保存到集合 -│ -├── DataGrid (文件列表表格) -│ ├── Checkbox (批量选择) -│ ├── FileIcon (文件类型图标) -│ ├── Progress Bar (上传进度) -│ └── Actions (操作菜单) -│ -└── Action Buttons - ├── Upload Button (开始上传) - ├── Stop Upload Button (停止上传) - ├── Clear All (清空列表) - └── Save to Collection (保存到知识库) -``` - -## 数据结构 - -### DocumentsWithFile 类型 - -```typescript -type DocumentsWithFile = { - // 前端文件对象 - file: File; - - // 上传进度 (0-100) - progress: number; - - // 上传状态 - progress_status: 'pending' | 'uploading' | 'success' | 'failed'; - - // 后端返回的数据(上传成功后填充) - document_id?: string; // 文档ID - filename?: string; // 文件名 - size?: number; // 文件大小 - status?: UploadDocumentResponseStatusEnum; // 文档状态(UPLOADED) -}; -``` - -### 状态管理 - -```typescript -const 
[documents, setDocuments] = useState([]); // 文件列表 -const [step, setStep] = useState(1); // 当前步骤 -const [rowSelection, setRowSelection] = useState({}); // 选中的行 -const [isUploading, setIsUploading] = useState(false); // 上传中标志 -const [pagination, setPagination] = useState({ // 分页状态 - pageIndex: 0, - pageSize: 20, -}); - -// 上传中的文件集合(用于避免重复上传) -const uploadingFilesRef = useRef<Set<string>>(new Set()); -``` - -## 核心功能实现 - -### 1. 文件选择和验证 - -**文件验证逻辑**: - -```typescript -const onFileValidate = useCallback( - (file: File): string | null => { - // 检查是否已存在相同文件 - const doc = documents.some( - (doc) => - doc.file.name === file.name && - doc.file.size === file.size && - doc.file.lastModified === file.lastModified && - doc.file.type === file.type, - ); - if (doc) { - return 'File already exists.'; - } - return null; - }, - [documents], -); -``` - -**文件拒绝处理**: - -```typescript -const onFileReject = useCallback((file: File, message: string) => { - toast.error(message, { - description: `"${file.name.length > 20 ? `${file.name.slice(0, 20)}...` : file.name}" has been rejected`, - }); -}, []); -``` - -**重复检测策略**: - -| 检查项 | 说明 | 用途 | -|--------|------|------| -| `file.name` | 文件名 | 基础去重 | -| `file.size` | 文件大小(字节) | 精确匹配 | -| `file.lastModified` | 最后修改时间戳 | 区分同名文件 | -| `file.type` | MIME类型 | 确保完全一致 | - -### 2. 并发上传控制 - -**使用 async.eachLimit 控制并发**: - -```typescript -import async from 'async'; - -const startUpload = useCallback((docs: DocumentsWithFile[]) => { - // 1. 过滤出待上传的文件 - const filesToUpload = docs.filter((doc) => { - const fileKey = `${doc.file.name}-${doc.file.size}-${doc.file.lastModified}`; - return ( - doc.progress_status === 'pending' && - !doc.document_id && - !uploadingFilesRef.current.has(fileKey) // 避免重复上传 - ); - }); - - // 2. 标记为上传中 - filesToUpload.forEach((doc) => { - const fileKey = `${doc.file.name}-${doc.file.size}-${doc.file.lastModified}`; - uploadingFilesRef.current.add(fileKey); - }); - - // 3. 
创建上传任务 - const tasks: AsyncTask[] = filesToUpload.map((_doc) => async (callback) => { - // ... 上传逻辑 - }); - - // 4. 并发执行(最多3个并发) - async.eachLimit( - tasks, - 3, // 并发数 - (task, callback) => { - if (uploadController?.signal.aborted) { - callback(new Error('stop upload')); - } else { - task(callback); - } - }, - (err) => { - setIsUploading(false); - }, - ); -}, [collection.id]); -``` - -**并发控制优势**: - -- ✅ 限制浏览器同时请求数,避免资源耗尽 -- ✅ 避免后端过载 -- ✅ 支持中途取消所有上传 -- ✅ 更好的进度追踪 - -### 3. 上传进度追踪 - -**模拟进度显示**(实际上传 + 进度动画): - -```typescript -const networkSimulation = async () => { - const totalChunks = 100; - let uploadedChunks = 0; - - for (let i = 0; i < totalChunks; i++) { - // 每5-10ms更新一次进度 - await new Promise((resolve) => - setTimeout(resolve, Math.random() * 5 + 5), - ); - - uploadedChunks++; - const progress = (uploadedChunks / totalChunks) * 99; // 最多到99% - - // 更新特定文件的进度 - setDocuments((docs) => { - const doc = docs.find((doc) => _.isEqual(doc.file, file)); - if (doc) { - doc.progress = Number(progress.toFixed(0)); - doc.progress_status = 'uploading'; - } - return [...docs]; - }); - } -}; - -// 并行执行上传和进度动画 -const [res] = await Promise.all([ - apiClient.defaultApi.collectionsCollectionIdDocumentsUploadPost({ - collectionId: collection.id, - file: _doc.file, - }), - networkSimulation(), // 进度动画 -]); - -// 上传成功,进度设为100% -setDocuments((docs) => { - const doc = docs.find((doc) => _.isEqual(doc.file, file)); - if (doc && res.data.document_id) { - Object.assign(doc, { - ...res.data, - progress: 100, - progress_status: 'success', - }); - } - return [...docs]; -}); + ┌───────────────────┐ + │ Celery Workers │ + │ │ + │ - 文档解析 │ + │ - 格式转换 │ + │ - 内容提取 │ + │ - 文档分块 │ + │ - 索引构建 │ + └───────────────────┘ ``` -**为什么模拟进度?** +### 分层架构 -1. HTTP上传无法获取实时进度(浏览器限制) -2. 提供更好的用户体验,避免长时间无反馈 -3. 
视觉上更流畅,用户感知更好 +``` +┌─────────────────────────────────────────────┐ +│ View Layer (views/collections.py) │ HTTP 处理、认证、参数验证 +└─────────────────┬───────────────────────────┘ + │ 调用 +┌─────────────────▼───────────────────────────┐ +│ Service Layer (service/document_service.py)│ 业务逻辑、事务编排、权限控制 +└─────────────────┬───────────────────────────┘ + │ 调用 +┌─────────────────▼───────────────────────────┐ +│ Repository Layer (db/ops.py, objectstore/) │ 数据访问抽象、对象存储接口 +└─────────────────┬───────────────────────────┘ + │ 访问 +┌─────────────────▼───────────────────────────┐ +│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ 数据持久化 +└─────────────────────────────────────────────┘ +``` -### 4. 取消上传 +## 核心流程详解 -**使用 AbortController**: +### 阶段 0: API 接口定义 -```typescript -let uploadController: AbortController | undefined; +系统提供三个主要接口: -// 停止上传 -const stopUpload = useCallback(() => { - setIsUploading(false); - uploadController?.abort(); // 中止所有正在进行的请求 -}, []); +1. **上传文件**(两阶段模式 - 第一步) + - 接口:`POST /api/v1/collections/{collection_id}/documents/upload` + - 功能:上传文件到临时存储,状态为 `UPLOADED` + - 返回:`document_id`、`filename`、`size`、`status` -// 页面卸载时自动停止 -useEffect(() => stopUpload, [stopUpload]); +2. **确认文档**(两阶段模式 - 第二步) + - 接口:`POST /api/v1/collections/{collection_id}/documents/confirm` + - 功能:确认已上传的文档,触发索引构建 + - 参数:`document_ids` 数组 + - 返回:`confirmed_count`、`failed_count`、`failed_documents` -// 开始上传时创建新的 controller -const startUpload = () => { - uploadController = new AbortController(); - // ... -}; -``` +3. **一步上传**(传统模式,兼容旧版) + - 接口:`POST /api/v1/collections/{collection_id}/documents` + - 功能:上传并直接添加到知识库,状态直接为 `PENDING` + - 支持批量上传 -### 5. 
确认添加到知识库 +### 阶段 1: 文件上传与临时存储 -**Step 3: 保存到集合**: +#### 1.1 上传流程 -```typescript -const handleSaveToCollection = useCallback(async () => { - if (!collection.id) return; - - // 调用确认API - const res = await apiClient.defaultApi.collectionsCollectionIdDocumentsConfirmPost({ - collectionId: collection.id, - confirmDocumentsRequest: { - document_ids: documents - .map((doc) => doc.document_id || '') - .filter((id) => !_.isEmpty(id)), - }, - }); - - if (res.status === 200) { - toast.success('Document added successfully'); - // 跳转回文档列表 - router.push(`/workspace/collections/${collection.id}/documents`); - } -}, [collection.id, documents, router]); +``` +用户选择文件 + │ + ▼ +前端调用 upload API + │ + ▼ +View 层验证身份和参数 + │ + ▼ +Service 层处理业务逻辑: + │ + ├─► 验证集合存在且激活 + │ + ├─► 验证文件类型和大小 + │ + ├─► 读取文件内容 + │ + ├─► 计算 SHA-256 哈希 + │ + └─► 事务处理: + │ + ├─► 重复检测(按文件名+哈希) + │ ├─ 完全相同:返回已存在文档(幂等) + │ ├─ 同名不同内容:抛出冲突异常 + │ └─ 新文档:继续创建 + │ + ├─► 创建 Document 记录(status=UPLOADED) + │ + ├─► 上传到对象存储 + │ └─ 路径:user-{user_id}/{collection_id}/{document_id}/original{suffix} + │ + └─► 更新文档元数据(object_path) ``` -## API集成 +#### 1.2 文件验证 -### 1. 
上传文件 API +**支持的文件类型**: +- 文档:`.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx` +- 文本:`.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv` +- 图片:`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` +- 音频:`.mp3`, `.wav`, `.m4a` +- 压缩包:`.zip`, `.tar`, `.gz`, `.tgz` -**接口**: `POST /api/v1/collections/{collectionId}/documents/upload` +**大小限制**: +- 默认:100 MB(可通过 `MAX_DOCUMENT_SIZE` 环境变量配置) +- 解压后总大小:5 GB(`MAX_EXTRACTED_SIZE`) -**请求**: +#### 1.3 重复检测机制 -```typescript -apiClient.defaultApi.collectionsCollectionIdDocumentsUploadPost({ - collectionId: collection.id, - file: file, // File对象 -}, { - timeout: 1000 * 30, // 30秒超时 -}); +采用**文件名 + SHA-256 哈希**双重检测: + +| 场景 | 文件名 | 哈希值 | 系统行为 | +|------|--------|--------|----------| +| 完全相同 | 相同 | 相同 | 返回已存在文档(幂等操作) | +| 文件名冲突 | 相同 | 不同 | 抛出 `DocumentNameConflictException` | +| 新文档 | 不同 | - | 创建新文档记录 | + +**优势**: +- ✅ 支持幂等上传:网络重传不会创建重复文档 +- ✅ 避免内容冲突:同名不同内容会提示用户 +- ✅ 节省存储空间:相同内容只存储一次 + +### 阶段 2: 临时存储配置 + +#### 2.1 对象存储类型 + +系统支持两种对象存储后端,可通过环境变量切换: + +**1. Local 存储(本地文件系统)** + +适用场景: +- 开发测试环境 +- 小规模部署 +- 单机部署 + +配置方式: +```bash +# 开发环境 +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=.objects + +# Docker 环境 +OBJECT_STORE_TYPE=local +OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects ``` -**响应**: +存储路径示例: +``` +.objects/ +└── user-google-oauth2-123456/ + └── col_abc123/ + └── doc_xyz789/ + ├── original.pdf # 原始文件 + ├── converted.pdf # 转换后的 PDF + ├── processed_content.md # 解析后的 Markdown + ├── chunks/ # 分块数据 + │ ├── chunk_0.json + │ └── chunk_1.json + └── images/ # 提取的图片 + ├── page_0.png + └── page_1.png +``` -```typescript -{ - document_id: "doc_xyz789", - filename: "example.pdf", - size: 2048576, - status: "UPLOADED" -} +**2. 
S3 存储(兼容 AWS S3/MinIO/OSS 等)** + +适用场景: +- 生产环境 +- 大规模部署 +- 分布式部署 +- 需要高可用和容灾 + +配置方式: +```bash +OBJECT_STORE_TYPE=s3 +OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 地址 +OBJECT_STORE_S3_REGION=us-east-1 # AWS Region +OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key +OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key +OBJECT_STORE_S3_BUCKET=aperag # Bucket 名称 +OBJECT_STORE_S3_PREFIX_PATH=dev/ # 可选的路径前缀 +OBJECT_STORE_S3_USE_PATH_STYLE=true # MinIO 需要设置为 true ``` -### 2. 确认文档 API +#### 2.2 对象存储路径规则 -**接口**: `POST /api/v1/collections/{collectionId}/documents/confirm` +**路径格式**: +``` +{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename} +``` -**请求**: +**组成部分**: +- `prefix`:可选的全局前缀(仅 S3) +- `user_id`:用户 ID(`|` 替换为 `-`) +- `collection_id`:集合 ID +- `document_id`:文档 ID +- `filename`:文件名(如 `original.pdf`、`page_0.png`) -```typescript -apiClient.defaultApi.collectionsCollectionIdDocumentsConfirmPost({ - collectionId: collection.id, - confirmDocumentsRequest: { - document_ids: ["doc_xyz789", "doc_abc123", ...] - } -}); +**多租户隔离**: +- 每个用户有独立的命名空间 +- 每个集合有独立的存储目录 +- 每个文档有独立的文件夹 + +### 阶段 3: 文档确认与索引构建 + +#### 3.1 确认流程 + +``` +用户点击"保存到集合" + │ + ▼ +前端调用 confirm API + │ + ▼ +Service 层处理: + │ + ├─► 验证集合配置 + │ + ├─► 检查 Quota(确认阶段才扣除配额) + │ + └─► 对每个 document_id: + │ + ├─► 验证文档状态为 UPLOADED + │ + ├─► 更新文档状态:UPLOADED → PENDING + │ + ├─► 根据集合配置创建索引记录: + │ ├─ VECTOR(向量索引,必选) + │ ├─ FULLTEXT(全文索引,必选) + │ ├─ GRAPH(知识图谱,可选) + │ ├─ SUMMARY(文档摘要,可选) + │ └─ VISION(视觉索引,可选) + │ + └─► 返回确认结果 + │ + ▼ +触发 Celery 任务:reconcile_document_indexes + │ + ▼ +后台异步处理索引构建 +``` + +#### 3.2 Quota(配额)管理 + +**检查时机**: +- ❌ 不在上传阶段检查(临时存储不占用配额) +- ✅ 在确认阶段检查(正式添加才消耗配额) + +**配额类型**: + +1. **用户全局配额** + - `max_document_count`:用户总文档数量限制 + - 默认:1000(可通过 `MAX_DOCUMENT_COUNT` 配置) + +2. 
**单集合配额** + - `max_document_count_per_collection`:单个集合文档数量限制 + - 不计入 `UPLOADED` 和 `DELETED` 状态的文档 + +**配额超限处理**: +- 抛出 `QuotaExceededException` +- 返回 HTTP 400 错误 +- 包含当前用量和配额上限信息 + +### 阶段 4: 文档解析与格式转换 + +#### 4.1 Parser 架构 + +系统采用**多 Parser 链式调用**架构,每个 Parser 负责特定类型的文件解析: + +``` +DocParser(主控制器) + │ + ├─► MinerUParser + │ └─ 功能:高精度 PDF 解析(商业 API) + │ └─ 支持:.pdf + │ + ├─► DocRayParser + │ └─ 功能:文档布局分析和内容提取 + │ └─ 支持:.pdf, .docx, .pptx, .xlsx + │ + ├─► ImageParser + │ └─ 功能:图片内容识别(OCR + 视觉理解) + │ └─ 支持:.jpg, .png, .gif, .bmp, .tiff + │ + ├─► AudioParser + │ └─ 功能:音频转录(Speech-to-Text) + │ └─ 支持:.mp3, .wav, .m4a + │ + └─► MarkItDownParser(兜底) + └─ 功能:通用文档转 Markdown + └─ 支持:几乎所有常见格式 ``` -**响应**: +#### 4.2 Parser 配置 -```typescript +**配置方式**:通过集合配置(Collection Config)动态控制 + +```json { - confirmed_count: 3, - failed_count: 1, - failed_documents: [ - { - document_id: "doc_fail123", - name: "corrupted.pdf", - error: "CONFIRMATION_FAILED" - } - ] + "parser_config": { + "use_mineru": false, // 是否启用 MinerU(需要 API Token) + "use_doc_ray": false, // 是否启用 DocRay + "use_markitdown": true, // 是否启用 MarkItDown(默认) + "mineru_api_token": "xxx" // MinerU API Token(可选) + } } ``` -## UI组件详解 - -### 1. 文件上传区域 - -```tsx - doc.file)} - onValueChange={(files) => { - const newFilesToUpload: DocumentsWithFile[] = []; - files.forEach((file) => { - if ( - !documents.some( - (doc) => - doc.file.name === file.name && - doc.file.size === file.size && - doc.file.lastModified === file.lastModified && - doc.file.type === file.type, - ) - ) { - newFilesToUpload.push({ - file, - progress: 0, - progress_status: 'pending', - }); - } - }); - if (newFilesToUpload.length > 0) { - setDocuments((docs) => [...docs, ...newFilesToUpload]); - } - }} - onFileReject={onFileReject} - onFileValidate={onFileValidate} -> - -
- -
- {page_documents('drag_and_drop_files_here')} -
-
- {page_documents('or')} -
- - - -
-
-
-``` - -**特性**: -- 支持拖拽上传 -- 支持点击选择文件 -- 自动文件验证 -- 重复文件检测 - -### 2. 进度指示器 - -```tsx -
- {/* Step 1 */} -
- -
{page_documents('step1_select_files')}
-
- - - - {/* Step 2 */} -
- -
{page_documents('step2_upload_files')}
-
- - - - {/* Step 3 */} -
- -
{page_documents('step3_save_to_collection')}
-
-
-``` - -**步骤自动切换逻辑**: - -```typescript -useEffect(() => { - if (documents.length === 0) { - setStep(1); // 无文件 → Step 1 - } else if ( - documents.filter((doc) => doc.progress_status === 'success').length !== - documents.length - ) { - setStep(2); // 有未完成上传 → Step 2 - } else { - setStep(3); // 全部上传完成 → Step 3 - } -}, [documents]); -``` - -### 3. 文件列表表格 - -使用 `@tanstack/react-table` 实现: - -```typescript -const columns: ColumnDef[] = [ - { - id: 'select', - header: ({ table }) => ( - table.toggleAllPageRowsSelected(!!value)} - /> - ), - cell: ({ row }) => ( - row.toggleSelected(!!value)} - /> - ), - }, - { - accessorKey: 'filename', - header: 'Filename', - cell: ({ row }) => { - const file = row.original.file; - const extension = _.last(file.type.split('/')) || ''; - return ( -
- -
-
{file.name}
-
- {(file.size / 1000).toFixed(0)} KB -
-
-
- ); - }, - }, - { - header: 'Upload Progress', - cell: ({ row }) => ( -
- -
-
{row.original.progress}%
-
- {row.original.progress_status} -
-
-
- ), - }, - { - id: 'actions', - cell: ({ row }) => ( - - - - - - handleRemoveFile(row.original)} - > - Remove - - - - ), - }, -]; -``` - -**表格特性**: -- ✅ 复选框批量选择 -- ✅ 文件类型图标显示 -- ✅ 实时进度条 -- ✅ 状态颜色标识 -- ✅ 分页支持(每页20条) -- ✅ 删除操作 - -### 4. 操作按钮 - -```tsx -
- {/* 清空所有 */} - - - {/* 开始上传 */} - - - {/* 停止上传 */} - {isUploading && ( - - )} - - {/* 保存到集合 */} - -
-``` - -## 状态管理流程 - -``` -初始状态 -├── documents: [] -├── step: 1 -├── isUploading: false -└── uploadingFilesRef.current: Set() - -↓ 用户选择文件 - -Step 1: 文件选择完成 -├── documents: [{file, progress: 0, progress_status: 'pending'}, ...] -├── step: 1 -├── isUploading: false -└── uploadingFilesRef.current: Set() - -↓ 点击"开始上传" - -Step 2: 上传中 -├── documents: [{..., progress: 45, progress_status: 'uploading'}, ...] -├── step: 2 -├── isUploading: true -└── uploadingFilesRef.current: Set('file1-key', 'file2-key', ...) - -↓ 上传完成 - -Step 3: 等待确认 -├── documents: [{..., progress: 100, progress_status: 'success', document_id: 'doc_xyz'}, ...] -├── step: 3 -├── isUploading: false -└── uploadingFilesRef.current: Set() - -↓ 点击"保存到集合" - -跳转到文档列表页面 +**环境变量配置**: +```bash +USE_MINERU_API=false # 全局启用 MinerU +MINERU_API_TOKEN=your_token # MinerU API Token ``` -## 错误处理 +#### 4.3 解析流程 -### 1. 上传失败 - -```typescript -catch (err) { - setDocuments((docs) => { - const doc = docs.find((doc) => _.isEqual(doc.file, file)); - if (doc) { - Object.assign(doc, { - progress: 0, - progress_status: 'failed', - }); - } - return [...docs]; - }); -} +``` +Celery Worker 收到索引任务 + │ + ▼ +1. 从对象存储下载原始文件 + │ + ▼ +2. 根据文件扩展名选择 Parser + │ + ├─► 尝试第一个匹配的 Parser + │ ├─ 成功:返回解析结果 + │ └─ 失败:FallbackError → 尝试下一个 Parser + │ + └─► 最终兜底:MarkItDownParser + │ + ▼ +3. 解析结果(Parts): + │ + ├─► MarkdownPart:文本内容 + │ └─ 包含:标题、段落、列表、表格等 + │ + ├─► PdfPart:PDF 文件 + │ └─ 用于:线性化、页面渲染 + │ + └─► AssetBinPart:二进制资源 + └─ 包含:图片、嵌入的文件等 + │ + ▼ +4. 后处理(Post-processing): + │ + ├─► PDF 页面转图片(Vision 索引需要) + │ └─ 每页渲染为 PNG 图片 + │ └─ 保存到 {document_path}/images/page_N.png + │ + ├─► PDF 线性化(加速浏览器加载) + │ └─ 使用 pikepdf 优化 PDF 结构 + │ └─ 保存到 {document_path}/converted.pdf + │ + └─► 提取文本内容(纯文本) + └─ 合并所有 MarkdownPart 内容 + └─ 保存到 {document_path}/processed_content.md + │ + ▼ +5. 保存到对象存储 ``` -**失败后的操作**: -- 进度重置为0 -- 状态标记为 `failed` -- 可以重新点击"开始上传"重试 -- 可以删除失败的文件 +#### 4.4 格式转换示例 -### 2. 
文件验证失败 +**示例 1:PDF 文档** +``` +输入:user_manual.pdf (5 MB) + │ + ▼ +解析器选择:DocRayParser / MarkItDownParser + │ + ▼ +输出 Parts: + ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..." + └─ PdfPart: <原始 PDF 数据> + │ + ▼ +后处理: + ├─ 渲染 50 页为图片 → images/page_0.png ~ page_49.png + ├─ 线性化 PDF → converted.pdf + └─ 提取文本 → processed_content.md +``` -```typescript -// 在 onFileValidate 中返回错误信息 -return 'File already exists.'; +**示例 2:图片文件** +``` +输入:screenshot.png (2 MB) + │ + ▼ +解析器选择:ImageParser + │ + ▼ +输出 Parts: + ├─ MarkdownPart: "[OCR 提取的文字内容]" + └─ AssetBinPart: <原始图片数据> (vision_index=true) + │ + ▼ +后处理: + └─ 保存原图副本 → images/file.png +``` -// 或在 onFileReject 中处理 -onFileReject={(file, message) => { - toast.error(message, { - description: `"${file.name}" has been rejected`, - }); -}} +**示例 3:音频文件** ``` +输入:meeting_record.mp3 (50 MB) + │ + ▼ +解析器选择:AudioParser + │ + ▼ +输出 Parts: + └─ MarkdownPart: "[转录的会议内容文本]" + │ + ▼ +后处理: + └─ 保存转录文本 → processed_content.md +``` + +### 阶段 5: 索引构建 -### 3. 网络中断 +#### 5.1 索引类型与功能 -```typescript -// 用户可以点击"停止上传" -const stopUpload = () => { - uploadController?.abort(); // 中止所有请求 - setIsUploading(false); -}; +| 索引类型 | 是否必选 | 功能描述 | 存储位置 | +|---------|---------|----------|----------| +| **VECTOR** | ✅ 必选 | 向量化检索,支持语义搜索 | Qdrant / Elasticsearch | +| **FULLTEXT** | ✅ 必选 | 全文检索,支持关键词搜索 | Elasticsearch | +| **GRAPH** | ❌ 可选 | 知识图谱,提取实体和关系 | Neo4j / PostgreSQL | +| **SUMMARY** | ❌ 可选 | 文档摘要,LLM 生成 | PostgreSQL (index_data) | +| **VISION** | ❌ 可选 | 视觉理解,图片内容分析 | Qdrant (向量) + PG (metadata) | -// 页面卸载时自动停止 -useEffect(() => stopUpload, [stopUpload]); +#### 5.2 索引构建流程 + +``` +Celery Worker: reconcile_document_indexes 任务 + │ + ▼ +1. 扫描 DocumentIndex 表,找到需要处理的索引 + │ + ├─► PENDING 状态 + observed_version < version + │ └─ 需要创建或更新索引 + │ + └─► DELETING 状态 + └─ 需要删除索引 + │ + ▼ +2. 按文档分组,逐个处理 + │ + ▼ +3. 
对每个文档: + │ + ├─► parse_document(解析文档) + │ ├─ 从对象存储下载原始文件 + │ ├─ 调用 DocParser 解析 + │ └─ 返回 ParsedDocumentData + │ + └─► 对每个索引类型: + │ + ├─► create_index (创建/更新索引) + │ │ + │ ├─ VECTOR 索引: + │ │ ├─ 文档分块(Chunking) + │ │ ├─ Embedding 模型生成向量 + │ │ └─ 写入 Qdrant + │ │ + │ ├─ FULLTEXT 索引: + │ │ ├─ 提取纯文本内容 + │ │ ├─ 按段落/章节分块 + │ │ └─ 写入 Elasticsearch + │ │ + │ ├─ GRAPH 索引: + │ │ ├─ 使用 LightRAG 提取实体 + │ │ ├─ 提取实体间关系 + │ │ └─ 写入 Neo4j/PostgreSQL + │ │ + │ ├─ SUMMARY 索引: + │ │ ├─ 调用 LLM 生成摘要 + │ │ └─ 保存到 DocumentIndex.index_data + │ │ + │ └─ VISION 索引: + │ ├─ 提取图片 Assets + │ ├─ Vision LLM 理解图片内容 + │ ├─ 生成图片描述向量 + │ └─ 写入 Qdrant + │ + └─► 更新索引状态 + ├─ 成功:CREATING → ACTIVE + └─ 失败:CREATING → FAILED + │ + ▼ +4. 更新文档总体状态 + │ + ├─ 所有索引都 ACTIVE → Document.status = COMPLETE + ├─ 任一索引 FAILED → Document.status = FAILED + └─ 部分索引仍在处理 → Document.status = RUNNING ``` -## 性能优化 +#### 5.3 文档分块(Chunking) -### 1. 防抖和节流 +**分块策略**: +- 递归字符分割(RecursiveCharacterTextSplitter) +- 按自然段落、章节优先切分 +- 保留上下文重叠(Overlap) -```typescript -// 使用 lodash 进行文件比较(高效) -_.isEqual(doc.file, file) +**分块参数**: +```json +{ + "chunk_size": 1000, // 每块最大字符数 + "chunk_overlap": 200, // 重叠字符数 + "separators": ["\n\n", "\n", " ", ""] // 分隔符优先级 +} +``` -// 文件key生成(快速查找) -const fileKey = `${file.name}-${file.size}-${file.lastModified}`; +**分块结果存储**: +``` +{document_path}/chunks/ + ├─ chunk_0.json: {"text": "...", "metadata": {...}} + ├─ chunk_1.json: {"text": "...", "metadata": {...}} + └─ ... ``` -### 2. 
状态更新优化 +## 数据库设计 + +### 表 1: document(文档元数据) + +**表结构**: + +| 字段名 | 类型 | 说明 | 索引 | +|--------|------|------|------| +| `id` | String(24) | 文档 ID,主键,格式:`doc{random_id}` | PK | +| `name` | String(1024) | 文件名 | - | +| `user` | String(256) | 用户 ID(支持多种 IDP) | ✅ Index | +| `collection_id` | String(24) | 所属集合 ID | ✅ Index | +| `status` | Enum | 文档状态(见下表) | ✅ Index | +| `size` | BigInteger | 文件大小(字节) | - | +| `content_hash` | String(64) | SHA-256 哈希(用于去重) | ✅ Index | +| `object_path` | Text | 对象存储路径(已废弃,用 doc_metadata) | - | +| `doc_metadata` | Text | 文档元数据(JSON 字符串) | - | +| `gmt_created` | DateTime(tz) | 创建时间(UTC) | - | +| `gmt_updated` | DateTime(tz) | 更新时间(UTC) | - | +| `gmt_deleted` | DateTime(tz) | 删除时间(软删除) | ✅ Index | + +**唯一约束**: +```sql +CREATE UNIQUE INDEX uq_document_collection_name_active + ON document (collection_id, name) + WHERE gmt_deleted IS NULL; +``` +- 同一集合内,活跃文档的名称不能重复 +- 已删除的文档不参与唯一性检查 + +**文档状态枚举**(`DocumentStatus`): + +| 状态 | 说明 | 何时设置 | 可见性 | +|------|------|----------|--------| +| `UPLOADED` | 已上传到临时存储 | `upload_document` 接口 | 前端文件选择界面 | +| `PENDING` | 等待索引构建 | `confirm_documents` 接口 | 文档列表(处理中) | +| `RUNNING` | 索引构建中 | Celery 任务开始处理 | 文档列表(处理中) | +| `COMPLETE` | 所有索引完成 | 所有索引变为 ACTIVE | 文档列表(可用) | +| `FAILED` | 索引构建失败 | 任一索引失败 | 文档列表(失败) | +| `DELETED` | 已删除 | `delete_document` 接口 | 不可见(软删除) | +| `EXPIRED` | 临时文档过期 | 定时清理任务 | 不可见 | + +**文档元数据示例**(`doc_metadata` JSON 字段): +```json +{ + "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf", + "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf", + "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md", + "images": [ + "user-xxx/col_xxx/doc_xxx/images/page_0.png", + "user-xxx/col_xxx/doc_xxx/images/page_1.png" + ], + "parser_used": "DocRayParser", + "parse_duration_ms": 5420, + "page_count": 50, + "custom_field": "value" +} +``` -```typescript -// 使用函数式更新,避免闭包陷阱 -setDocuments((docs) => { - const doc = docs.find(...); - // 修改 - return [...docs]; // 返回新数组触发更新 -}); +### 表 2: 
document_index(索引状态管理) + +**表结构**: + +| 字段名 | 类型 | 说明 | 索引 | +|--------|------|------|------| +| `id` | Integer | 自增 ID,主键 | PK | +| `document_id` | String(24) | 关联的文档 ID | ✅ Index | +| `index_type` | Enum | 索引类型(见下表) | ✅ Index | +| `status` | Enum | 索引状态(见下表) | ✅ Index | +| `version` | Integer | 索引版本号 | - | +| `observed_version` | Integer | 已处理的版本号 | - | +| `index_data` | Text | 索引数据(JSON),如摘要内容 | - | +| `error_message` | Text | 错误信息(失败时) | - | +| `gmt_created` | DateTime(tz) | 创建时间 | - | +| `gmt_updated` | DateTime(tz) | 更新时间 | - | +| `gmt_last_reconciled` | DateTime(tz) | 最后协调时间 | - | + +**唯一约束**: +```sql +ALTER TABLE document_index + ADD CONSTRAINT uq_document_index UNIQUE (document_id, index_type); +``` +- 每个文档的每种索引类型只有一条记录 + +**索引类型枚举**(`DocumentIndexType`): + +| 类型 | 值 | 说明 | 外部存储 | +|------|-----|------|----------| +| `VECTOR` | "VECTOR" | 向量索引 | Qdrant / Elasticsearch | +| `FULLTEXT` | "FULLTEXT" | 全文索引 | Elasticsearch | +| `GRAPH` | "GRAPH" | 知识图谱 | Neo4j / PostgreSQL | +| `SUMMARY` | "SUMMARY" | 文档摘要 | PostgreSQL (index_data) | +| `VISION` | "VISION" | 视觉索引 | Qdrant + PostgreSQL | + +**索引状态枚举**(`DocumentIndexStatus`): + +| 状态 | 说明 | 何时设置 | +|------|------|----------| +| `PENDING` | 等待处理 | `confirm_documents` 创建索引记录 | +| `CREATING` | 创建中 | Celery Worker 开始处理 | +| `ACTIVE` | 就绪可用 | 索引构建成功 | +| `DELETING` | 标记删除 | `delete_document` 接口 | +| `DELETION_IN_PROGRESS` | 删除中 | Celery Worker 正在删除 | +| `FAILED` | 失败 | 索引构建失败 | + +**版本控制机制**: +- `version`:期望的索引版本(每次文档更新时 +1) +- `observed_version`:已处理的版本号 +- `version > observed_version` 时,触发索引更新 + +**协调器(Reconciler)**: +```sql +-- 查询需要处理的索引 +SELECT * FROM document_index +WHERE status = 'PENDING' + AND observed_version < version; + +-- 处理后更新 +UPDATE document_index +SET status = 'ACTIVE', + observed_version = version, + gmt_last_reconciled = NOW() +WHERE id = ?; ``` -### 3. 
Pagination
+### Table Relationships
-```typescript
-// Default to 20 rows per page to avoid sluggish rendering of large lists
-const [pagination, setPagination] = useState({
-  pageIndex: 0,
-  pageSize: 20,
-});
+```
+┌─────────────────────────────────┐
+│ collection                      │
+│ ─────────────────────────────   │
+│ id (PK)                         │
+│ name                            │
+│ config (JSON)                   │
+│ status                          │
+│ ...                             │
+└────────────┬────────────────────┘
+             │ 1:N
+             ▼
+┌─────────────────────────────────┐
+│ document                        │
+│ ─────────────────────────────   │
+│ id (PK)                         │
+│ collection_id (FK)              │◄──── Unique constraint: (collection_id, name)
+│ name                            │
+│ user                            │
+│ status (Enum)                   │
+│ size                            │
+│ content_hash (SHA-256)          │
+│ doc_metadata (JSON)             │
+│ gmt_created                     │
+│ gmt_deleted                     │
+│ ...                             │
+└────────────┬────────────────────┘
+             │ 1:N
+             ▼
+┌─────────────────────────────────┐
+│ document_index                  │
+│ ─────────────────────────────   │
+│ id (PK)                         │
+│ document_id (FK)                │◄──── Unique constraint: (document_id, index_type)
+│ index_type (Enum)               │
+│ status (Enum)                   │
+│ version                         │
+│ observed_version                │
+│ index_data (JSON)               │
+│ error_message                   │
+│ gmt_last_reconciled             │
+│ ...                             │
+└─────────────────────────────────┘
 ```
-### 4. Virtual Scrolling (not implemented; possible optimization)
+## State Machine and Lifecycle
-For very large file lists (1000+), virtual scrolling can be used:
+### Document State Transitions
-```typescript
-import { useVirtualizer } from '@tanstack/react-virtual';
+```
+                       ┌────────────────────────────────────────────────┐
+                       │                                                │
+                       │                                                ▼
+ [Upload file] ──► UPLOADED ──► [Confirm] ──► PENDING ──► RUNNING ──► COMPLETE
+                       │                                     │
+                       │                                     ▼
+                       │                                  FAILED
+                       │                                     │
+                       │                                     ▼
+                       └──────► [Delete] ───────────────► DELETED
+                                                             │
+                       ┌─────────────────────────────────────┘
+                       │
+                       ▼
+                    EXPIRED (scheduled cleanup of unconfirmed documents)
 ```
-## User Experience Design
+**Key transitions**:
+1. **UPLOADED → PENDING**: the user clicks "Save to collection"
+2. **PENDING → RUNNING**: a Celery worker starts processing
+3. **RUNNING → COMPLETE**: all indexes succeed
+4. **RUNNING → FAILED**: any index fails
+5. **Any state → DELETED**: the user deletes the document
-### 1.
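Immediate Feedback
+### Index State Transitions
-- ✅ Highlight the drop zone while dragging
-- ✅ Show an animated icon while uploading
-- ✅ Update the progress bar in real time
-- ✅ Distinguish statuses by color (pending/uploading/success/failed)
+```
+ [Create index record] ──► PENDING ──► CREATING ──► ACTIVE
+                                          │
+                                          ▼
+                                        FAILED
+                                          │
+                                          ▼
+                        ┌──────────► PENDING (retry)
+                        │
+ [Delete request] ──────┼──────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted)
+                        │
+                        └──────────► (delete the record directly if PENDING/FAILED)
+```
-### 2. Error Messages
+## Asynchronous Task Scheduling (Celery)
-- ✅ File validation failure: toast notification
-- ✅ Upload failure: status shown in red
-- ✅ Confirmation failure: show the specific error message
+### Task Definitions
-### 3. Operation Guidance
+**Main task**: `reconcile_document_indexes`
+- Triggered:
+  - After the `confirm_documents` API is called
+  - By a scheduled task (every 30 seconds)
+  - Manually (admin interface)
+- Purpose: scan the `document_index` table and process indexes that need reconciliation
-- ✅ Three-step progress indicator
-- ✅ Buttons enabled/disabled according to state
-- ✅ Empty-state hint
-- ✅ Automatic redirect after a successful operation
+**Subtasks**:
+- `parse_document_task`: parse document content
+- `create_vector_index_task`: create the vector index
+- `create_fulltext_index_task`: create the full-text index
+- `create_graph_index_task`: create the knowledge graph index
+- `create_summary_index_task`: create the summary index
+- `create_vision_index_task`: create the vision index
-### 4. Responsive Design
+### Task Scheduling Strategy
-- ✅ Table adapts to small screens
-- ✅ Action buttons stack on mobile
-- ✅ Long file names are truncated
+**Concurrency control**:
+- Each worker processes at most N documents concurrently (default 4)
+- The multiple indexes of one document can be built in parallel
+- Celery's `task_acks_late=True` is used so that tasks are acknowledged only after they finish and are not lost
-## Internationalization Support
+**Retry on failure**:
+- At most 3 retries
+- Increasing backoff between attempts (1 minute → 5 minutes → 15 minutes)
+- Marked as `FAILED` after 3 failures
-Internationalization uses `next-intl`:
+**Idempotency**:
+- All tasks can be executed repeatedly
+- The `observed_version` mechanism avoids duplicate processing
+- The same input produces the same output
-```typescript
-const page_documents = useTranslations('page_documents');
+## Design Features and Advantages
-// Usage
-page_documents('filename')
-page_documents('upload_progress')
-page_documents('drag_and_drop_files_here')
-page_documents('step1_select_files')
-page_documents('step2_upload_files')
-page_documents('step3_save_to_collection')
+### 1. Two-Phase Commit Design
+
+**Advantages**:
+- ✅ **Better user experience**: uploads respond quickly and do not block the user
+- ✅ **Selective addition**: after a batch upload, the user can confirm only some of the files
+- ✅ **Sensible resource control**: unconfirmed documents are not indexed and do not consume quota
+- ✅ **Recovery-friendly**: temporary documents can be cleaned up periodically without affecting the business
+
+**State isolation**:
+```
+Temporary state (UPLOADED):
+  - Not counted against quota
+  - Does not trigger indexing
+  - May be cleaned up automatically
+
+Formal states (PENDING/RUNNING/COMPLETE):
+  - Counted against quota
+  - Trigger index construction
+  - Never cleaned up automatically
 ```
-**Translation file locations**:
-- `web/src/locales/en-US/page_documents.json`
-- `web/src/locales/zh-CN/page_documents.json`
+### 2.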
幂等性设计 -## 最佳实践 +**文件级别幂等**: +- SHA-256 哈希去重 +- 相同文件多次上传返回同一 `document_id` +- 避免存储空间浪费 -### 1. 文件大小限制 +**接口级别幂等**: +- `upload_document`:重复上传返回已存在文档 +- `confirm_documents`:重复确认不会创建重复索引 +- `delete_document`:重复删除返回成功(软删除) -```typescript -// 前端检查(可选) -const MAX_FILE_SIZE = 100 * 1024 * 1024; // 100MB +### 3. 多租户隔离 -if (file.size > MAX_FILE_SIZE) { - return 'File size exceeds 100MB'; -} +**存储隔离**: +``` +user-{user_A}/... # 用户 A 的文件 +user-{user_B}/... # 用户 B 的文件 ``` -### 2. 支持的文件类型 +**数据库隔离**: +- 所有查询都带 `user` 字段过滤 +- 集合级别的权限控制(`collection.user`) +- 软删除支持(`gmt_deleted`) -前端可以限制文件类型,但最终验证在后端: +### 4. 灵活的存储后端 -```typescript -const ALLOWED_TYPES = [ - 'application/pdf', - 'application/msword', - 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', - 'text/plain', - // ... -]; +**统一接口**: +```python +AsyncObjectStore: + - put(path, data) + - get(path) + - delete_objects_by_prefix(prefix) +``` -if (!ALLOWED_TYPES.includes(file.type)) { - return 'File type not supported'; -} +**运行时切换**: +- 通过环境变量切换 Local/S3 +- 无需修改业务代码 +- 支持自定义存储后端(实现接口即可) + +### 5. 事务一致性 + +**数据库 + 对象存储的两阶段提交**: +```python +async with transaction: + # 1. 创建数据库记录 + document = create_document_record() + + # 2. 上传到对象存储 + await object_store.put(path, data) + + # 3. 更新元数据 + document.doc_metadata = json.dumps(metadata) + + # 所有操作成功才提交,任一失败则回滚 +``` + +**失败处理**: +- 数据库记录创建失败:不上传文件 +- 文件上传失败:回滚数据库记录 +- 元数据更新失败:回滚前面的操作 + +### 6. 可观测性 + +**审计日志**: +- `@audit` 装饰器记录所有文档操作 +- 包含:用户、时间、操作类型、资源 ID + +**任务追踪**: +- `gmt_last_reconciled`:最后处理时间 +- `error_message`:失败原因 +- Celery 任务 ID:关联日志追踪 + +**监控指标**: +- 文档上传速率 +- 索引构建耗时 +- 失败率统计 + +## 性能优化 + +### 1. 异步处理 + +**上传不阻塞**: +- 文件上传到对象存储后立即返回 +- 索引构建在 Celery 中异步执行 +- 前端通过轮询或 WebSocket 获取进度 + +### 2. 批量操作 + +**批量确认**: +```python +confirm_documents(document_ids=[id1, id2, ..., idN]) ``` +- 一次事务处理多个文档 +- 批量创建索引记录 +- 减少数据库往返 + +### 3. 
Caching Strategy
+
+**Parse result cache**:
+- Parsed content is saved to `processed_content.md`
+- Later index rebuilds can read it directly without re-parsing
+
+**Chunking result cache**:
+- Chunking results are saved to the `chunks/` directory
+- Vector index rebuilds can reuse the chunking results
+
+### 4. Parallel Index Construction
+
+**Multiple indexes in parallel**:
+```python
+# VECTOR, FULLTEXT and GRAPH can be built in parallel
+await asyncio.gather(
+    create_vector_index(),
+    create_fulltext_index(),
+    create_graph_index()
+)
+```
+
+## Error Handling
+
+### Common Exceptions
-### 3. Automatic Retry (not implemented; suggested)
+| Exception type | HTTP status code | Trigger | Suggested handling |
+|----------------|------------------|---------|--------------------|
+| `ResourceNotFoundException` | 404 | Collection/document does not exist | Check that the ID is correct |
+| `CollectionInactiveException` | 400 | Collection is not active | Wait for collection initialization to finish |
+| `DocumentNameConflictException` | 409 | Same name, different content | Rename the file or delete the old document |
+| `QuotaExceededException` | 429 | Quota exceeded | Upgrade the plan or delete old documents |
+| `InvalidFileTypeException` | 400 | Unsupported file type | Check the list of supported file types |
+| `FileSizeTooLargeException` | 413 | File too large | Split or compress the file |
-```typescript
-const uploadWithRetry = async (file: File, retries = 3) => {
-  for (let i = 0; i < retries; i++) {
-    try {
-      return await apiClient.upload(file);
-    } catch (err) {
-      if (i === retries - 1) throw err;
-      await new Promise(resolve => setTimeout(resolve, 1000 * Math.pow(2, i)));
-    }
+### Exception Propagation
+
+```
+Service Layer raises an exception
+      │
+      ▼
+View Layer catches and converts it
+      │
+      ▼
+Exception Handler processes it uniformly
+      │
+      ▼
+Returns a standard JSON response:
+{
+  "error_code": "QUOTA_EXCEEDED",
+  "message": "Document count limit exceeded",
+  "details": {
+    "limit": 1000,
+    "current": 1000
   }
-};
+}
 ```
-## Related Files
+## Related File Index
-### Frontend Components
-- `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - Main upload component
-- `web/src/app/workspace/collections/[collectionId]/documents/upload/page.tsx` - Upload page
-- `web/src/components/ui/file-upload.tsx` - File upload UI component
-- `web/src/components/ui/progress.tsx` - Progress bar component
-- `web/src/components/data-grid.tsx` - Data grid component
+### Core Implementation
-### API Client
-- `web/src/lib/api/client.ts` - API client configuration
-- `web/src/api/` - Auto-generated API interfaces
+- **View layer**: `aperag/views/collections.py` - HTTP endpoint definitions
+- **Service layer**: `aperag/service/document_service.py` - Business logic
+- **Database models**: `aperag/db/models.py` - Document,
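DocumentIndex table definitions
+- **Database operations**: `aperag/db/ops.py` - CRUD operation wrappers
-### Internationalization
-- `web/src/locales/en-US/page_documents.json` - English translations
-- `web/src/locales/zh-CN/page_documents.json` - Chinese translations
+### Object Storage
-## Summary
+- **Interface definition**: `aperag/objectstore/base.py` - AsyncObjectStore abstract class
+- **Local implementation**: `aperag/objectstore/local.py` - Local file system storage
+- **S3 implementation**: `aperag/objectstore/s3.py` - S3-compatible storage
+
+### Document Parsing
+
+- **Main controller**: `aperag/docparser/doc_parser.py` - DocParser
+- **Parser implementations**:
+  - `aperag/docparser/mineru_parser.py` - MinerU PDF parsing
+  - `aperag/docparser/docray_parser.py` - DocRay document parsing
+  - `aperag/docparser/markitdown_parser.py` - MarkItDown general-purpose parsing
+  - `aperag/docparser/image_parser.py` - Image OCR
+  - `aperag/docparser/audio_parser.py` - Audio transcription
+- **Document processing**: `aperag/index/document_parser.py` - Parsing pipeline orchestration
-ApeRAG's document upload feature provides an intuitive and reliable user experience through a **three-step guided flow**:
+### Index Construction
-1. **Step 1 - Select files**: drag and drop or click to select, with immediate frontend validation
-2. **Step 2 - Upload files**: concurrent upload to temporary storage with real-time progress tracking
-3. **Step 3 - Confirm addition**: the user selectively confirms, which triggers index construction
+- **Index management**: `aperag/index/manager.py` - DocumentIndexManager
+- **Vector index**: `aperag/index/vector_index.py` - VectorIndexer
+- **Full-text index**: `aperag/index/fulltext_index.py` - FulltextIndexer
+- **Knowledge graph**: `aperag/index/graph_index.py` - GraphIndexer
+- **Document summary**: `aperag/index/summary_index.py` - SummaryIndexer
+- **Vision index**: `aperag/index/vision_index.py` - VisionIndexer
-**Core advantages**:
-- 🎯 **User-friendly**: a clear three-step flow with explicit guidance
-- ⚡ **Performance**: concurrency control, pagination, optimized state management
-- 🔒 **Reliability**: duplicate detection, error handling, support for cancelling midway
-- 🌍 **Internationalization**: complete multi-language support
-- 📱 **Responsive**: works on both mobile and desktop
+### Task Scheduling
-This design delivers an excellent user experience and system stability while keeping the feature set complete.
+- **Task definitions**: `config/celery_tasks.py` - Celery task registration
+- **Reconciler**: `aperag/tasks/reconciler.py` - DocumentIndexReconciler
+- **Document tasks**: `aperag/tasks/document.py` - DocumentIndexTask
+### Frontend Implementation
+
+- **Document list**: `web/src/app/workspace/collections/[collectionId]/documents/page.tsx`
+- **Document upload**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx`
+
+## Summary
+ApeRAG's document upload module adopts an architecture of **two-phase commit + chained multi-parser invocation + parallel multi-index construction**:
+
+**Core features**:
+1. ✅ **Two-phase commit**: upload (temporary storage) → confirm (formal addition), for a better user experience
+2.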
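✅ **SHA-256 deduplication**: avoids duplicate documents and supports idempotent uploads
+3. ✅ **Flexible storage backends**: Local/S3 switchable by configuration behind a unified interface
+4. ✅ **Multi-parser architecture**: supports MinerU, DocRay, MarkItDown and other parsers
+5. ✅ **Automatic format conversion**: PDF → images, audio → text, images → OCR text
+6. ✅ **Multi-index coordination**: five index types (vector, full-text, graph, summary, vision)
+7. ✅ **Quota management**: quota is deducted only at the confirmation stage, keeping resource usage under control
+8. ✅ **Asynchronous processing**: a Celery task queue that does not block user operations
+9. ✅ **Transactional consistency**: two-phase commit across the database and object storage
+10. ✅ **Observability**: audit logs, task tracking and error messages are fully recorded
+
+This design provides high performance and scalability, supports complex document processing scenarios (multi-format, multi-language, multi-modal), and offers good fault tolerance and user experience.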
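
The SHA-256 deduplication and idempotent upload behaviour summarized above can be illustrated with a small in-memory sketch. This is not ApeRAG's implementation: `InMemoryDocumentStore`, `DocumentNameConflict` and the `doc{n}` ID scheme are hypothetical names introduced for illustration, and the sketch assumes that deduplication is scoped to `(collection_id, name)` with the content hash deciding between "same document" and "name conflict".

```python
import hashlib


class DocumentNameConflict(Exception):
    """Same name, different content (hypothetical stand-in for DocumentNameConflictException)."""


class InMemoryDocumentStore:
    """Toy stand-in for the document table; an assumption, not ApeRAG's data layer."""

    def __init__(self):
        # (collection_id, name) -> (document_id, content_hash)
        self._by_key = {}
        self._next_id = 1

    def upload(self, collection_id: str, name: str, data: bytes) -> str:
        # Hash the raw bytes; the hex digest is 64 characters, matching String(64).
        content_hash = hashlib.sha256(data).hexdigest()
        key = (collection_id, name)
        existing = self._by_key.get(key)
        if existing is not None:
            document_id, stored_hash = existing
            if stored_hash == content_hash:
                # Idempotent: the same file uploaded again returns the same ID.
                return document_id
            # Same name but different content is treated as a conflict.
            raise DocumentNameConflict(name)
        document_id = f"doc{self._next_id}"
        self._next_id += 1
        self._by_key[key] = (document_id, content_hash)
        return document_id
```

Uploading identical bytes under the same name twice yields one record, while a different payload under that name raises the conflict, which mirrors the 409 case in the exception table.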