Commit d70a3ed

feat(generation_service): add document filtering to remove short documents based on chunk size (#180)
* fix(chart): update Helm chart helpers and values for improved configuration
* feat(SynthesisTaskTab): enhance task table with tooltip support and improved column widths
* feat(CreateTask, SynthFileTask): improve task creation and detail view with enhanced payload handling and UI updates
* feat(SynthFileTask): enhance file display with progress tracking and delete action
* feat(SynthFileTask): enhance file display with progress tracking and delete action
* feat(SynthDataDetail): add delete action for chunks with confirmation prompt
* feat(SynthDataDetail): update edit and delete buttons to icon-only format
* feat(SynthDataDetail): add confirmation modals for chunk and synthesis data deletion
* feat(DocumentSplitter): add enhanced document splitting functionality with CJK support and metadata detection
* feat(DataSynthesis): refactor data synthesis models and update task handling logic
* feat(DataSynthesis): streamline synthesis task handling and enhance chunk processing logic
* feat(DataSynthesis): refactor data synthesis models and update task handling logic
* fix(generation_service): ensure processed chunks are incremented regardless of question generation success
* feat(CreateTask): enhance task creation with new synthesis templates and improved configuration options
* feat(CreateTask): enhance task creation with new synthesis templates and improved configuration options
* feat(CreateTask): enhance task creation with new synthesis templates and improved configuration options
* feat(CreateTask): enhance task creation with new synthesis templates and improved configuration options
* feat(model_chat): enhance JSON parsing by removing additional thought tags and improving fallback logic
* feat(generation_service): add document filtering to remove short documents based on chunk size
1 parent be87508 commit d70a3ed

1 file changed, +12 -1 lines changed

runtime/datamate-python/app/module/generation/service/generation_service.py

Lines changed: 12 additions & 1 deletion
@@ -25,6 +25,17 @@
 from app.module.system.service.common_service import chat, get_model_by_id, get_chat_client
 
 
+def _filter_docs(split_docs, chunk_size):
+    """
+    Filter documents, removing those shorter than 70% of chunk_size.
+    """
+    filtered_docs = []
+    for doc in split_docs:
+        if len(doc.page_content) >= chunk_size * 0.7:
+            filtered_docs.append(doc)
+    return filtered_docs
+
+
 class GenerationService:
     def __init__(self, db: AsyncSession):
         self.db = db
@@ -464,7 +475,7 @@ def _load_and_split(file_path: str, chunk_size: int, chunk_overlap: int):
         try:
             docs = load_documents(file_path)
             split_docs = DocumentSplitter.auto_split(docs, chunk_size, chunk_overlap)
-            return split_docs
+            return _filter_docs(split_docs, chunk_size)
         except Exception as e:
             logger.error(f"Error loading or splitting file {file_path}: {e}")
             raise
