
Commit 8a003af

✨[Request] #694 implement document-level vectorization and K-means clustering to summary knowledge
2 parents 86f4185 + 65fc9a5 commit 8a003af

21 files changed: +3292 −255 lines
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+system_prompt: |-
+  You are a professional knowledge summarization assistant. Your task is to generate a concise summary of a document cluster based on multiple documents.
+
+  **Summary Requirements:**
+  1. The input contains multiple documents (each document has title and content snippets)
+  2. You need to extract the common themes and key topics from these documents
+  3. Generate a summary that represents the collective content of the cluster
+  4. The summary should be accurate, coherent, and written in natural language
+  5. Keep the summary within the specified word limit
+
+  **Guidelines:**
+  - Focus on identifying shared themes and topics across documents
+  - Highlight key concepts, domains, or subject matter
+  - Use clear and concise language
+  - Avoid listing individual document titles unless necessary
+  - The summary should help users understand what this group of documents covers
+
+user_prompt: |
+  Please generate a concise summary of the following document cluster:
+
+  {{ cluster_content }}
+
+  Summary ({{ max_words }} words):
+
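The template above exposes two placeholders, {{ cluster_content }} and {{ max_words }}. As a minimal sketch of how such a YAML template could be loaded and rendered, the snippet below mirrors the Jinja2 usage removed from elasticsearch_service.py later in this diff; the file path and helper name are assumptions for illustration, not part of this commit:

```python
# Illustrative only: load a prompt-template YAML like the one above and fill
# in its placeholders. The path and helper name are hypothetical.
from typing import Tuple

import yaml
from jinja2 import StrictUndefined, Template


def render_cluster_summary_prompt(template_path: str, cluster_content: str, max_words: int = 150) -> Tuple[str, str]:
    """Load a prompt template YAML and render its system/user prompts."""
    with open(template_path, "r", encoding="utf-8") as f:
        prompts = yaml.safe_load(f)

    # StrictUndefined makes a missing placeholder fail loudly instead of rendering empty text
    system_prompt = Template(prompts["system_prompt"], undefined=StrictUndefined).render({})
    user_prompt = Template(prompts["user_prompt"], undefined=StrictUndefined).render(
        {"cluster_content": cluster_content, "max_words": max_words}
    )
    return system_prompt, user_prompt
```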
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+system_prompt: |-
+  You are a professional cluster summarization assistant. Your task is to merge multiple document summaries into a cohesive cluster summary.
+
+  **Summary Requirements:**
+  1. The input contains summaries of multiple documents that belong to the same cluster
+  2. These documents share similar themes or topics (grouped by clustering)
+  3. You need to synthesize a unified summary that captures the collective content
+  4. The summary should highlight common themes and key information across documents
+  5. Keep the summary within the specified word limit
+
+  **Guidelines:**
+  - Identify shared themes and topics across documents
+  - Highlight common concepts and subject matter
+  - Use clear and concise language
+  - Avoid listing individual document titles unless necessary
+  - Focus on what this group of documents collectively covers
+  - The summary should be coherent and represent the cluster's unified content
+  - **Important: Do not use any separators (like ---, ***, etc.), generate plain text summary only**
+
+user_prompt: |
+  Please generate a unified summary of the following document cluster based on individual document summaries:
+
+  {{ document_summaries }}
+
+  **Important Reminders:**
+  - Do not use any separators (like ---, ***, ===, etc.)
+  - Do not include document titles or filenames
+  - Generate plain text summary content only
+
+  Cluster Summary ({{ max_words }} words):
+
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+system_prompt: |-
+  你是一个专业的簇总结助手。你的任务是将多个文档总结合并为一个连贯的簇总结。
+
+  **总结要求:**
+  1. 输入包含属于同一簇的多个文档的总结
+  2. 这些文档共享相似的主题或话题(通过聚类分组)
+  3. 你需要综合成一个统一的总结,捕捉集合内容
+  4. 总结应突出文档间的共同主题和关键信息
+  5. 保持在指定的字数限制内
+
+  **指导原则:**
+  - 识别文档间的共同主题和话题
+  - 突出共同概念和主题内容
+  - 使用清晰简洁的语言
+  - 除非必要,避免列出单个文档标题
+  - 专注于这组文档共同涵盖的内容
+  - 总结应连贯且代表簇的统一内容
+  - 确保准确、全面,明确关键实体,不要遗漏重要信息
+  - **重要:不要使用任何分隔符(如---、***等),直接生成纯文本总结**
+
+user_prompt: |
+  请根据以下文档总结生成统一的簇总结:
+
+  {{ document_summaries }}
+
+  **重要提醒:**
+  - 不要使用任何分隔符(如---、***、===等)
+  - 不要包含文档标题或文件名
+  - 直接生成纯文本总结内容
+
+  簇总结({{ max_words }}字):
+
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+system_prompt: |-
+  You are a professional document summarization assistant. Your task is to generate a concise summary of a document based on its key content snippets.
+
+  **Summary Requirements:**
+  1. The input contains key snippets from a document (typically from beginning, middle, and end sections)
+  2. You need to extract the main themes, topics, and key information
+  3. Generate a summary that represents the document's core content
+  4. The summary should be accurate, coherent, and concise
+  5. Keep the summary within the specified word limit
+
+  **Guidelines:**
+  - Focus on identifying main themes and key topics
+  - Highlight important concepts and information
+  - Use clear and concise language
+  - Avoid redundancy and unnecessary details
+  - The summary should help users understand what the document covers
+  - **Important: Do not use any separators (like ---, ***, etc.), generate plain text summary only**
+
+user_prompt: |
+  Please generate a concise summary of the following document:
+
+  Document name: {{ filename }}
+
+  Content snippets:
+  {{ content }}
+
+  Summary ({{ max_words }} words):
+
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+system_prompt: |-
+  你是一个专业的文档总结助手。你的任务是根据文档的关键内容片段生成简洁的总结。
+
+  **总结要求:**
+  1. 输入包含文档的关键片段(通常来自开头、中间和结尾部分)
+  2. 你需要提取主要主题、话题和关键信息
+  3. 生成能代表文档核心内容的总结
+  4. 总结应准确、连贯且简洁
+  5. 保持在指定的字数限制内
+
+  **指导原则:**
+  - 专注于识别主要主题和关键话题
+  - 突出重要概念和信息
+  - 使用清晰简洁的语言
+  - 避免冗余和不必要的细节
+  - 总结应帮助用户理解文档涵盖的内容
+  - 确保总结准确、全面,不要遗漏关键实体和信息
+  - **重要:不要使用任何分隔符(如---、***等),直接生成纯文本总结**
+
+user_prompt: |
+  请为以下文档生成简洁的总结:
+
+  文档名称:{{ filename }}
+
+  内容片段:
+  {{ content }}
+
+  总结({{ max_words }}字):
+
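Both document-level templates expect a {{ content }} payload built from key snippets taken from the beginning, middle, and end of a document. The sampling code itself lives in utils/document_vector_utils.py and is not part of this diff, so the helper below is only a hedged sketch of one plausible way to assemble that payload:

```python
# Illustrative sketch: assemble the "content snippets" payload expected by the
# document-level prompt templates. The helper name and limits are assumptions.
from typing import List


def build_content_snippets(chunks: List[str], max_chars_per_snippet: int = 500) -> str:
    """Pick representative chunks (head, middle, tail) and join them for the prompt."""
    if not chunks:
        return ""
    # A set de-duplicates the indices for very short documents
    picks = {0, len(chunks) // 2, len(chunks) - 1}
    snippets = [chunks[i][:max_chars_per_snippet] for i in sorted(picks)]
    return "\n\n".join(snippets)
```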

backend/pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -14,7 +14,9 @@ dependencies = [
     "pyyaml>=6.0.2",
     "redis>=5.0.0",
     "fastmcp==2.12.0",
-    "langchain>=0.3.26"
+    "langchain>=0.3.26",
+    "scikit-learn>=1.0.0",
+    "numpy>=1.24.0"
 ]
 
 [project.optional-dependencies]
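The two new dependencies back the clustering step named in the commit title: document-level embedding vectors clustered with K-means. The sketch below shows the general shape of such a step with scikit-learn; the function name, the heuristic choice of k, and the parameters are illustrative assumptions, not the code that ships in utils/document_vector_utils.py:

```python
# Illustrative sketch of K-means over document-level embeddings using the
# newly added scikit-learn and numpy dependencies.
from typing import Dict, List, Optional

import numpy as np
from sklearn.cluster import KMeans


def cluster_document_embeddings(doc_embeddings: np.ndarray, k: Optional[int] = None) -> Dict[int, List[int]]:
    """Group document embeddings into k clusters; pick a heuristic k when none is given."""
    n_docs = doc_embeddings.shape[0]
    if k is None:
        # Heuristic: roughly the square root of the corpus size, at least 1
        k = max(1, min(int(np.sqrt(n_docs)), n_docs))

    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(doc_embeddings)

    clusters: Dict[int, List[int]] = {}
    for doc_idx, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(doc_idx)
    return clusters
```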

backend/services/elasticsearch_service.py

Lines changed: 61 additions & 123 deletions
@@ -18,14 +18,11 @@
 
 from fastapi import Body, Depends, Path, Query
 from fastapi.responses import StreamingResponse
-from jinja2 import Template, StrictUndefined
 from nexent.core.models.embedding_model import OpenAICompatibleEmbedding, JinaEmbedding, BaseEmbedding
 from nexent.core.nlp.tokenizer import calculate_term_weights
 from nexent.vector_database.elasticsearch_core import ElasticSearchCore
-from openai import OpenAI
-from openai.types.chat import ChatCompletionMessageParam
 
-from consts.const import ES_API_KEY, ES_HOST, LANGUAGE, MODEL_CONFIG_MAPPING, MESSAGE_ROLE, KNOWLEDGE_SUMMARY_MAX_TOKENS_ZH, KNOWLEDGE_SUMMARY_MAX_TOKENS_EN
+from consts.const import ES_API_KEY, ES_HOST, LANGUAGE
 from database.attachment_db import delete_file
 from database.knowledge_db import (
     create_knowledge_record,
@@ -36,97 +33,15 @@
 from services.redis_service import get_redis_service
 from utils.config_utils import tenant_config_manager, get_model_name_from_config
 from utils.file_management_utils import get_all_files_status, get_file_size
-from utils.prompt_template_utils import get_knowledge_summary_prompt_template
 
 # Configure logging
 logger = logging.getLogger("elasticsearch_service")
 
 
 
 
-def generate_knowledge_summary_stream(keywords: str, language: str, tenant_id: str, model_id: Optional[int] = None) -> Generator:
-    """
-    Generate a knowledge base summary based on keywords
-
-    Args:
-        keywords: Keywords that frequently appear in the knowledge base content
-        language: Language of the knowledge base content
-        tenant_id: The tenant ID for configuration
-
-    Returns:
-        str: Generate a knowledge base summary
-    """
-    # Load prompt words based on language
-    prompts = get_knowledge_summary_prompt_template(language)
-
-    # Render templates using Jinja2
-    system_prompt = Template(
-        prompts['system_prompt'], undefined=StrictUndefined).render({})
-    user_prompt = Template(prompts['user_prompt'], undefined=StrictUndefined).render(
-        {'content': keywords})
-
-    # Build messages
-    messages: List[ChatCompletionMessageParam] = [
-        {"role": MESSAGE_ROLE["SYSTEM"], "content": system_prompt},
-        {"role": MESSAGE_ROLE["USER"], "content": user_prompt}
-    ]
-
-    # Get model configuration
-    if model_id:
-        try:
-            from database.model_management_db import get_model_by_model_id
-            model_info = get_model_by_model_id(model_id, tenant_id)
-            if model_info:
-                model_config = {
-                    'api_key': model_info.get('api_key', ''),
-                    'base_url': model_info.get('base_url', ''),
-                    'model_name': model_info.get('model_name', ''),
-                    'model_repo': model_info.get('model_repo', '')
-                }
-            else:
-                # Fallback to default model if specified model not found
-                logger.warning(f"Specified model {model_id} not found, falling back to default LLM.")
-                model_config = tenant_config_manager.get_model_config(
-                    key=MODEL_CONFIG_MAPPING["llm"], tenant_id=tenant_id)
-        except Exception as e:
-            logger.warning(f"Failed to get model {model_id}, using default model: {e}")
-            model_config = tenant_config_manager.get_model_config(
-                key=MODEL_CONFIG_MAPPING["llm"], tenant_id=tenant_id)
-    else:
-        # Use default model configuration
-        model_config = tenant_config_manager.get_model_config(
-            key=MODEL_CONFIG_MAPPING["llm"], tenant_id=tenant_id)
-
-    # initialize OpenAI client
-    client = OpenAI(api_key=model_config.get('api_key', ""),
-                    base_url=model_config.get('base_url', ""))
-
-    try:
-        # Create stream chat completion request
-        max_tokens = KNOWLEDGE_SUMMARY_MAX_TOKENS_ZH if language == LANGUAGE[
-            "ZH"] else KNOWLEDGE_SUMMARY_MAX_TOKENS_EN
-        # Get model name for the request
-        model_name_for_request = model_config.get("model_name", "")
-        if model_config.get("model_repo"):
-            model_name_for_request = f"{model_config['model_repo']}/{model_name_for_request}"
-
-        stream = client.chat.completions.create(
-            model=model_name_for_request,
-            messages=messages,
-            max_tokens=max_tokens,  # add max_tokens limit
-            stream=True  # enable stream output
-        )
-
-        # Iterate through stream response
-        for chunk in stream:
-            new_token = chunk.choices[0].delta.content
-            if new_token is not None:
-                yield new_token
-        yield "END"
-
-    except Exception as e:
-        logger.error(f"Error occurred: {str(e)}")
-        yield f"Error: {str(e)}"
+# Old keyword-based summary method removed - replaced with Map-Reduce approach
+# See utils/document_vector_utils.py for new implementation
 
 
 # Initialize ElasticSearchCore instance with HTTPS support
@@ -871,62 +786,85 @@ async def summary_index_name(self,
                                  model_id: Optional[int] = None
                                  ):
         """
-        Generate a summary for the specified index based on its content
+        Generate a summary for the specified index using advanced Map-Reduce approach
+
+        New implementation:
+        1. Get documents and cluster them by semantic similarity
+        2. Map: Summarize each document individually
+        3. Reduce: Merge document summaries into cluster summaries
+        4. Return: Combined knowledge base summary
 
         Args:
             index_name: Name of the index to summarize
-            batch_size: Number of documents to process per batch
+            batch_size: Number of documents to sample (default: 1000)
            es_core: ElasticSearchCore instance
            tenant_id: ID of the tenant
            language: Language of the summary (default: 'zh')
+            model_id: Model ID for LLM summarization
 
         Returns:
             StreamingResponse containing the generated summary
         """
         try:
-            # Get all documents
+            from utils.document_vector_utils import (
+                process_documents_for_clustering,
+                kmeans_cluster_documents,
+                summarize_clusters_map_reduce,
+                merge_cluster_summaries
+            )
+
             if not tenant_id:
-                raise Exception(
-                    "Tenant ID is required for summary generation.")
-            all_documents = ElasticSearchService.get_random_documents(
-                index_name, batch_size, es_core)
-            all_chunks = self._clean_chunks_for_summary(all_documents)
-            keywords_dict = calculate_term_weights(all_chunks)
-            keywords_for_summary = ""
-            for _, key in enumerate(keywords_dict):
-                keywords_for_summary = keywords_for_summary + ", " + key
-
+                raise Exception("Tenant ID is required for summary generation.")
+
+            # Use new Map-Reduce approach
+            sample_count = min(batch_size // 5, 200)  # Sample reasonable number of documents
+
+            # Step 1: Get documents and calculate embeddings
+            document_samples, doc_embeddings = process_documents_for_clustering(
+                index_name=index_name,
+                es_core=es_core,
+                sample_doc_count=sample_count
+            )
+
+            if not document_samples:
+                raise Exception("No documents found in index.")
+
+            # Step 2: Cluster documents
+            clusters = kmeans_cluster_documents(doc_embeddings, k=None)
+
+            # Step 3: Map-Reduce summarization
+            cluster_summaries = summarize_clusters_map_reduce(
+                document_samples=document_samples,
+                clusters=clusters,
+                language=language,
+                doc_max_words=100,
+                cluster_max_words=150,
+                model_id=model_id,
+                tenant_id=tenant_id
+            )
+
+            # Step 4: Merge into final summary
+            final_summary = merge_cluster_summaries(cluster_summaries)
+
+            # Stream the result
             async def generate_summary():
-                token_join = []
                 try:
-                    for new_token in generate_knowledge_summary_stream(keywords_for_summary, language, tenant_id, model_id):
-                        if new_token == "END":
-                            break
-                        else:
-                            token_join.append(new_token)
-                            yield f"data: {{\"status\": \"success\", \"message\": \"{new_token}\"}}\n\n"
-                            await asyncio.sleep(0.1)
+                    # Stream the summary character by character
+                    for char in final_summary:
+                        yield f"data: {{\"status\": \"success\", \"message\": \"{char}\"}}\n\n"
+                        await asyncio.sleep(0.01)
+                    yield f"data: {{\"status\": \"completed\"}}\n\n"
                 except Exception as e:
                     yield f"data: {{\"status\": \"error\", \"message\": \"{e}\"}}\n\n"
-
-            # Return the flow response
+
             return StreamingResponse(
                 generate_summary(),
                 media_type="text/event-stream"
             )
-
+
         except Exception as e:
-            raise Exception(f"{str(e)}")
-
-    @staticmethod
-    def _clean_chunks_for_summary(all_documents):
-        # Only use these three fields for summarization
-        all_chunks = ""
-        for _, chunk in enumerate(all_documents['documents']):
-            all_chunks = all_chunks + "\n" + \
-                chunk["title"] + "\n" + chunk["filename"] + \
-                "\n" + chunk["content"]
-        return all_chunks
+            logger.error(f"Knowledge base summary generation failed: {str(e)}", exc_info=True)
+            raise Exception(f"Failed to generate summary: {str(e)}")
 
     @staticmethod
     def get_random_documents(
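For orientation, the sketch below shows how the map and reduce steps invoked above might fit together. The real helpers live in utils/document_vector_utils.py, which is not part of this diff, so the signatures, the summarize_text callable, and the joining strategy are illustrative assumptions rather than the committed code:

```python
# Illustrative sketch of the Map-Reduce summarization flow used by the
# refactored summary endpoint. All names and signatures are assumptions.
from typing import Callable, Dict, List


def summarize_clusters_map_reduce_sketch(
    document_samples: List[dict],                 # e.g. {"filename": ..., "content": ...}
    clusters: Dict[int, List[int]],               # cluster id -> document indices
    summarize_text: Callable[[str, int], str],    # wraps the LLM call: (text, max_words) -> summary
    doc_max_words: int = 100,
    cluster_max_words: int = 150,
) -> Dict[int, str]:
    """Map: summarize each document; Reduce: merge per-document summaries per cluster."""
    cluster_summaries: Dict[int, str] = {}
    for cluster_id, doc_indices in clusters.items():
        # Map step: one short summary per document in the cluster
        doc_summaries = [
            summarize_text(document_samples[i]["content"], doc_max_words) for i in doc_indices
        ]
        # Reduce step: merge the document summaries into one cluster summary
        merged_input = "\n".join(doc_summaries)
        cluster_summaries[cluster_id] = summarize_text(merged_input, cluster_max_words)
    return cluster_summaries


def merge_cluster_summaries_sketch(cluster_summaries: Dict[int, str]) -> str:
    """Final step: join cluster summaries into the knowledge-base level summary."""
    return "\n".join(cluster_summaries[cid] for cid in sorted(cluster_summaries))
```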
