Commit 6c53621

feat: support download file (#1424)
1 parent 54d7819 commit 6c53621

11 files changed: 675 additions, 15 deletions

aperag/api/openapi.merged.yaml

Whitespace-only changes.

aperag/api/openapi.yaml

Lines changed: 2 additions & 0 deletions
@@ -67,6 +67,8 @@ paths:
     $ref: './paths/collections.yaml#/document_preview'
   /collections/{collection_id}/documents/{document_id}/object:
     $ref: './paths/collections.yaml#/document_object'
+  /collections/{collection_id}/documents/{document_id}/download:
+    $ref: './paths/collections.yaml#/document_download'
   /collections/{collection_id}/documents/upload:
     $ref: './paths/collections.yaml#/upload_document'
   /collections/{collection_id}/documents/confirm:

aperag/api/paths/collections.yaml

Lines changed: 88 additions & 0 deletions
@@ -398,6 +398,94 @@ document_object:
             schema:
               $ref: '../components/schemas/common.yaml#/failResponse'

+document_download:
+  get:
+    summary: Download document file
+    description: |
+      Download the original document file.
+      Returns the file as a streaming response with appropriate Content-Type and Content-Disposition headers.
+      The file is streamed through the backend to support internal network deployments and maintain access control.
+
+      **Document Lifecycle and Download Availability:**
+      - UPLOADED: Downloadable (temporary status, auto-deleted after 24 hours if not confirmed)
+      - PENDING/RUNNING/COMPLETE/FAILED: Downloadable (permanent, not auto-deleted)
+      - EXPIRED: Not downloadable (file deleted by cleanup task)
+      - DELETED: Not downloadable (soft-deleted by user)
+
+      **Auto-Cleanup Mechanism:**
+      A scheduled task runs every 10 minutes to clean up documents in UPLOADED status that are older than 24 hours.
+      Once confirmed, documents will never be auto-deleted.
+    operationId: download_document
+    tags:
+      - documents
+    security:
+      - BearerAuth: []
+    parameters:
+      - name: collection_id
+        in: path
+        required: true
+        schema:
+          type: string
+        description: Collection ID
+      - name: document_id
+        in: path
+        required: true
+        schema:
+          type: string
+        description: Document ID
+    responses:
+      '200':
+        description: Document file stream
+        content:
+          application/octet-stream:
+            schema:
+              type: string
+              format: binary
+        headers:
+          Content-Type:
+            description: MIME type of the document (e.g., application/pdf, text/plain)
+            schema:
+              type: string
+          Content-Disposition:
+            description: Attachment header with original filename
+            schema:
+              type: string
+              example: 'attachment; filename="document.pdf"'
+          Content-Length:
+            description: Size of the file in bytes
+            schema:
+              type: integer
+      '400':
+        description: Bad request - document status does not allow download (EXPIRED or DELETED)
+        content:
+          application/json:
+            schema:
+              $ref: '../components/schemas/common.yaml#/failResponse'
+      '401':
+        description: Unauthorized
+        content:
+          application/json:
+            schema:
+              $ref: '../components/schemas/common.yaml#/failResponse'
+      '403':
+        description: Forbidden - user does not have access to this document
+        content:
+          application/json:
+            schema:
+              $ref: '../components/schemas/common.yaml#/failResponse'
+      '404':
+        description: Document not found or file not found in storage
+        content:
+          application/json:
+            schema:
+              $ref: '../components/schemas/common.yaml#/failResponse'
+      '500':
+        description: Internal server error - failed to download from storage
+        content:
+          application/json:
+            schema:
+              $ref: '../components/schemas/common.yaml#/failResponse'
+
 rebuild_indexes:
   post:
     summary: Rebuild document indexes
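
The lifecycle rules in the `document_download` description reduce to a small status check. The sketch below is illustrative only: the status names come from the description, while the `DocumentStatus` enum and the `is_downloadable` helper are stand-ins invented here, not the project's `db_models`.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    # Stand-in enum; names mirror the statuses listed in the endpoint description.
    UPLOADED = "UPLOADED"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    EXPIRED = "EXPIRED"
    DELETED = "DELETED"


# Per the description, only these two statuses block a download.
NON_DOWNLOADABLE = frozenset({DocumentStatus.EXPIRED, DocumentStatus.DELETED})


def is_downloadable(status: DocumentStatus) -> bool:
    """Return True for every status except EXPIRED and DELETED."""
    return status not in NON_DOWNLOADABLE
```

Expressing the rule as a deny-list means any status added later is downloadable by default, which matches the "only disallow EXPIRED/DELETED" wording.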

aperag/schema/view_models.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@

 # generated by datamodel-codegen:
 # filename: openapi.merged.yaml
-# timestamp: 2025-11-11T06:17:00+00:00
+# timestamp: 2026-01-13T12:50:23+00:00

 from __future__ import annotations


aperag/service/document_service.py

Lines changed: 74 additions & 0 deletions
@@ -1004,6 +1004,80 @@ async def _get_document_preview(session: AsyncSession):
         # Execute query with proper session management
         return await self.db_ops._execute_query(_get_document_preview)

+    async def download_document(self, user_id: str, collection_id: str, document_id: str):
+        """
+        Download the original document file.
+        Returns a StreamingResponse with the file content.
+        """
+
+        async def _download_document(session):
+            # 1. Verify user has access to the document
+            stmt = select(db_models.Document).filter(
+                db_models.Document.id == document_id,
+                db_models.Document.collection_id == collection_id,
+                db_models.Document.user == user_id,
+                db_models.Document.gmt_deleted.is_(None),  # Only allow downloading non-deleted documents
+            )
+            result = await session.execute(stmt)
+            document = result.scalars().first()
+            if not document:
+                raise DocumentNotFoundException(document_id)
+
+            # 2. Check document status - only disallow downloading expired/deleted documents
+            # UPLOADED documents can be downloaded (before confirmation, within 24 hours)
+            # Once expired or deleted, files may no longer exist in storage
+            if document.status in [db_models.DocumentStatus.EXPIRED, db_models.DocumentStatus.DELETED]:
+                raise HTTPException(
+                    status_code=400, detail=f"Document status is {document.status.value}, cannot download"
+                )
+
+            # 3. Get object path from doc_metadata
+            try:
+                metadata = json.loads(document.doc_metadata) if document.doc_metadata else {}
+                object_path = metadata.get("object_path")
+                if not object_path:
+                    raise HTTPException(status_code=404, detail="Document file not found in storage")
+            except json.JSONDecodeError:
+                logger.error(f"Invalid JSON in doc_metadata for document {document_id}")
+                raise HTTPException(status_code=500, detail="Document metadata is corrupted")
+
+            # 4. Stream file from object store
+            try:
+                async_obj_store = get_async_object_store()
+
+                # Get file stream and size
+                get_result = await async_obj_store.get(object_path)
+                if not get_result:
+                    raise HTTPException(status_code=404, detail="Document file not found in object store")
+
+                data_stream, file_size = get_result
+
+                # Determine content type from filename
+                content_type, _ = mimetypes.guess_type(document.name)
+                if content_type is None:
+                    content_type = "application/octet-stream"
+
+                # Set headers for file download
+                headers = {
+                    "Content-Type": content_type,
+                    "Content-Disposition": f'attachment; filename="{document.name}"',
+                    "Content-Length": str(file_size),
+                }
+
+                logger.info(
+                    f"User {user_id} downloading document {document_id} ({document.name}) "
+                    f"from collection {collection_id}, size: {file_size} bytes"
+                )
+
+                return StreamingResponse(data_stream, headers=headers)
+
+            except Exception as e:
+                logger.error(f"Failed to download document {document_id} from path {object_path}: {e}", exc_info=True)
+                raise HTTPException(status_code=500, detail="Failed to download document from storage")
+
+        # Execute query with proper session management
+        return await self.db_ops._execute_query(_download_document)
+
     async def get_document_object(
         self, user_id: str, collection_id: str, document_id: str, path: str, range_header: str = None
     ):
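
One point worth checking in review: in step 4 the 404 `HTTPException` is raised inside a `try` whose handler is `except Exception`. Since `HTTPException` is itself an `Exception` subclass, the 404 would probably be caught and re-raised as a 500, which differs from the 404 documented in the OpenAPI spec. The sketch below reproduces the pattern with a stand-in `HTTPError` class (hypothetical, so the example runs without FastAPI) and a plain `dict` in place of the object store. It shows one possible fix, not the project's actual code.

```python
class HTTPError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def fetch_broad(store: dict, path: str) -> bytes:
    """Same shape as step 4: the 404 is raised inside a broad try block."""
    try:
        data = store.get(path)
        if not data:
            raise HTTPError(404, "Document file not found in object store")
        return data
    except Exception:
        # The 404 above is an Exception too, so it is caught and replaced here.
        raise HTTPError(500, "Failed to download document from storage")


def fetch_narrow(store: dict, path: str) -> bytes:
    """Same flow, but deliberate HTTP errors pass through unchanged."""
    try:
        data = store.get(path)
        if not data:
            raise HTTPError(404, "Document file not found in object store")
        return data
    except HTTPError:
        raise  # keep the intended status code
    except Exception:
        raise HTTPError(500, "Failed to download document from storage")


def status_of(fetch, store: dict, path: str) -> int:
    """Return the status code a caller would observe."""
    try:
        fetch(store, path)
    except HTTPError as exc:
        return exc.status_code
    return 200
```

With an empty store, `status_of(fetch_broad, {}, "x")` yields 500 while `status_of(fetch_narrow, {}, "x")` yields 404. Moving the not-found check outside the `try` block would be an equally valid alternative.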

aperag/views/collections.py

Lines changed: 15 additions & 0 deletions
@@ -269,6 +269,21 @@ async def get_document_view(
     return await document_service.get_document(str(user.id), collection_id, document_id)


+@router.get("/collections/{collection_id}/documents/{document_id}/download", tags=["documents"])
+@audit(resource_type="document", api_name="DownloadDocument")
+async def download_document_view(
+    request: Request,
+    collection_id: str,
+    document_id: str,
+    user: User = Depends(required_user),
+):
+    """
+    Download the original document file.
+    Returns the file as a streaming response with appropriate headers.
+    """
+    return await document_service.download_document(str(user.id), collection_id, document_id)
+
+
 @router.delete("/collections/{collection_id}/documents/{document_id}", tags=["documents"])
 @audit(resource_type="document", api_name="DeleteDocument")
 async def delete_document_view(
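
The "appropriate headers" mentioned in the docstring can be sketched as a standalone helper. The committed service code interpolates `document.name` directly into `Content-Disposition`. Header values are conventionally limited to Latin-1, so a non-ASCII filename (for example a Chinese one) may fail to encode or arrive garbled. The helper below, `build_download_headers`, is a hypothetical name introduced here. It adds an RFC 6266 / RFC 5987 `filename*` parameter alongside an ASCII fallback, which is one common way to handle this and is not something the commit does.

```python
import mimetypes
from typing import Dict
from urllib.parse import quote


def build_download_headers(filename: str, file_size: int) -> Dict[str, str]:
    """Build download headers; hypothetical helper, not part of the commit."""
    # Guess the MIME type from the extension, as the service code does.
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is None:
        content_type = "application/octet-stream"

    # ASCII fallback: non-ASCII characters, control characters, quotes and
    # backslashes are replaced with "_" so the quoted-string stays valid.
    fallback = "".join(
        c if c.isascii() and c.isprintable() and c not in '"\\' else "_"
        for c in filename
    )
    # Extended parameter: the UTF-8 bytes of the name, percent-encoded.
    encoded = quote(filename, safe="")

    return {
        "Content-Type": content_type,
        "Content-Disposition": f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}",
        "Content-Length": str(file_size),
    }
```

Clients that understand `filename*` should show the original name; older clients fall back to the sanitized `filename` value.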

docs/design/document_export_design_zh.md

Lines changed: 75 additions & 14 deletions
@@ -25,8 +25,8 @@
 │  └──────────────────┘  └──────────────────┘                 │
 └─────────┬──────────────────────────┬────────────────────────┘
           │                          │
-          │ GET /documents/{id}/download (synchronous, streamed response)
-          │ POST /collections/{id}/export (asynchronous, generates a download link)
+          │ GET /collections/{id}/documents/{id}/download (synchronous, streamed response)
+          │ POST /collections/{id}/export (asynchronous, generates a download link)
           ▼                          ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                         View Layer                          │
@@ -111,7 +111,7 @@

 | Scenario | API | Mode | Notes |
 |------|-----|------|------|
-| **Single document download** | `GET /documents/{id}/download` | Synchronous streaming | Returns the file stream directly |
+| **Single document download** | `GET /collections/{collection_id}/documents/{id}/download` | Synchronous streaming | Returns the file stream directly |
 | **Knowledge base export** | `POST /collections/{id}/export` | Asynchronous | Generates a backend download URL |

 ## Core Flows in Detail
@@ -124,26 +124,32 @@
 User clicks the "Download" button


-GET /api/v1/documents/{document_id}/download
+GET /api/v1/collections/{collection_id}/documents/{document_id}/download


 Backend processing:

 ├─► Verify user identity (JWT)

-├─► Verify document access permission
+├─► Verify document access permission (user and collection_id must match)

-├─► Query the Document record
+├─► Query the Document record (soft-deleted documents are filtered out)

-├─► Get object_path from doc_metadata
+├─► Check document status (only EXPIRED/DELETED are rejected)
+│   ├─ UPLOADED: ✅ download allowed (within 24 hours of upload)
+│   ├─ PENDING/RUNNING/COMPLETE/FAILED: ✅ download allowed (permanently)
+│   ├─ EXPIRED: ❌ rejected (file already cleaned up)
+│   └─ DELETED: ❌ rejected (deleted by the user)
+
+├─► Get object_path from the doc_metadata JSON

 ├─► Read the file from object storage (streamed)
-│   └─ Path: user-{user_id}/{collection_id}/{doc_id}/original.pdf
+│   └─ Path: user-{user_id}/{collection_id}/{doc_id}/original.xxx

 └─► Return StreamingResponse
-    ├─ Content-Type: application/octet-stream
-    ├─ Content-Disposition: attachment; filename="xxx.pdf"
-    └─ Transfer-Encoding: chunked (streamed transfer)
+    ├─ Content-Type: derived from the file extension (default application/octet-stream)
+    ├─ Content-Disposition: attachment; filename="<original filename>"
+    └─ Content-Length: file size (obtained from object storage)


 The file is streamed to the client through the backend
@@ -156,7 +162,7 @@ GET /api/v1/documents/{document_id}/download

 **Request**
 ```http
-GET /api/v1/documents/{document_id}/download
+GET /api/v1/collections/{collection_id}/documents/{document_id}/download
 Authorization: Bearer {token}
 ```

@@ -166,11 +172,14 @@ HTTP/1.1 200 OK
 Content-Type: application/octet-stream
 Content-Disposition: attachment; filename="user_manual.pdf"
 Content-Length: 5242880
-Transfer-Encoding: chunked

 [binary file stream]
 ```

+**Notes**
+- The actual implementation does not set a `Transfer-Encoding: chunked` response header; streaming is handled automatically by FastAPI's `StreamingResponse`
+- `Content-Length` is set after the file size has been obtained from object storage
+
 #### 1.3 Key Features

 - **Streaming reads**: read from object storage in chunks (chunk size = 64KB)
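@@ -179,6 +188,58 @@ Transfer-Encoding: chunked
 - **Timeout control**: set a reasonable read timeout (e.g., 30 minutes)
 - **Access control**: user permissions are verified on every download
 - **Audit logging**: every download is recorded (user, time, document)
+- **Status check**: only documents in EXPIRED/DELETED status are blocked from download
+
+#### 1.4 Document Lifecycle and Download Availability
+
+**Document statuses**
+
+| Status | Description | Downloadable | Auto-cleanup | Trigger |
+|------|------|--------|----------|----------|
+| `UPLOADED` | Uploaded, not yet confirmed | Yes | Yes (after 24 hours) | User uploads a file |
+| `PENDING` | Confirmed, waiting for processing | Yes | No | User confirms the document |
+| `RUNNING` | Indexing in progress | Yes | No | Background task starts processing |
+| `COMPLETE` | Processing finished | Yes | No | Index created successfully |
+| `FAILED` | Processing failed | Yes | No | Index creation failed |
+| `EXPIRED` | Expired | No | - | Auto-cleanup task |
+| `DELETED` | Deleted | No | - | User delete operation |
+
+**Auto-cleanup mechanism**
+
+```
+Scheduled task: runs every 10 minutes
+Cleanup target: documents in UPLOADED status created more than 24 hours ago
+Cleanup actions:
+  1. Delete the files from object storage (including all related files)
+  2. Update the document status to EXPIRED
+  3. Write a cleanup log entry
+
+Configuration: config/celery.py
+Task name: cleanup_expired_documents_task
+Interval: 600 seconds (10 minutes)
+```
+
+**Design rationale**
+- **User-friendly**: a file can be downloaded for preview right after upload, without waiting for confirmation
+- **Resource-efficient**: unconfirmed temporary files are cleaned up automatically to save storage
+- **Data safety**: confirmed documents are kept permanently and are never auto-deleted
+- **Clear feedback**: documents in EXPIRED status return an explicit error message
+
+**Typical usage flow**
+
+```
+1. User uploads a document
+   └─► Status: UPLOADED (downloadable, valid for 24 hours)
+
+2. Scenario A: user confirms in time (< 24 hours)
+   └─► Status: PENDING → RUNNING → COMPLETE
+   └─► Downloadable permanently, never cleaned up ✅
+
+3. Scenario B: user does not confirm in time (> 24 hours)
+   └─► Auto-cleanup task runs
+   └─► Status: EXPIRED (cannot be downloaded) ❌
+   └─► User needs to upload again
+```

 ### Scenario 2: Knowledge Base Export (Asynchronous Packaging)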
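
The expiry rule in the auto-cleanup section (UPLOADED status and created more than 24 hours ago) can be written as a small predicate. This is only a sketch of the selection logic: `should_expire` and `UPLOAD_TTL` are hypothetical names, and the real `cleanup_expired_documents_task` is not shown in this commit.

```python
from datetime import datetime, timedelta, timezone

# 24-hour window for unconfirmed uploads, as stated in the design doc.
UPLOAD_TTL = timedelta(hours=24)


def should_expire(status: str, created_at: datetime, now: datetime) -> bool:
    """True when an unconfirmed upload has outlived the 24-hour window."""
    return status == "UPLOADED" and (now - created_at) > UPLOAD_TTL


# Example: a document uploaded 25 hours ago is due for cleanup.
now = datetime(2026, 1, 13, 12, 0, tzinfo=timezone.utc)
stale = should_expire("UPLOADED", now - timedelta(hours=25), now)
```

Because the task is described as running every 10 minutes, a file may survive up to roughly 10 minutes past the 24-hour mark before it is removed.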

@@ -617,7 +678,7 @@ Celery task timeout:

 | Method | Path | Description | Mode |
 |------|------|------|------|
-| GET | `/documents/{id}/download` | Download a single document | Synchronous streaming |
+| GET | `/collections/{collection_id}/documents/{id}/download` | Download a single document | Synchronous streaming |
 | POST | `/collections/{id}/export` | Export a knowledge base | Asynchronous |
 | GET | `/export-tasks/{id}` | Query export task status | - |
 | GET | `/export-tasks/{id}/download` | Download the export result | Synchronous streaming |
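
From the client side, the single-document endpoint can be called with a Bearer token and the body saved in fixed-size chunks. The sketch below uses only the standard library. The `/api/v1` prefix and the 64KB chunk size come from the design doc, while `build_download_request`, `copy_stream`, and the example host are assumptions made for illustration.

```python
import urllib.request

CHUNK_SIZE = 64 * 1024  # 64KB, matching the chunk size in the design doc


def build_download_request(
    base_url: str, collection_id: str, document_id: str, token: str
) -> urllib.request.Request:
    """Build an authenticated GET request for the download endpoint."""
    url = f"{base_url.rstrip('/')}/collections/{collection_id}/documents/{document_id}/download"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def copy_stream(src, dst, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst chunk by chunk; returns the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```

A call would look roughly like `with urllib.request.urlopen(req) as resp, open("out.pdf", "wb") as f: copy_stream(resp, f)`, where `req` comes from `build_download_request("http://localhost:8000/api/v1", ...)`. Reading in chunks keeps memory use flat regardless of file size.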
