property embeddings #5
base: main
Conversation
Change-Id: I31e9532990c2e7e09ddf518ac9c35cdf996a5a25
doc: update repo & url link for inner repo
Change-Id: Ieac00904c4ab793de87c1cae7324e3d2d7946884
doc: update repo & url link for inner repo (GraphPlatform-4190)
doc: update the link info
…ph/hugegraph-ai into master-icode
sync
Change-Id: Id2138dfecfbd377dd39ec51fa9d46b630e9ea88f
Change-Id: I8cb2139be06f63378f87bccd9035357a7cb7a91e
Change-Id: I07414cd8b418afabcb1f8a40454a679f10760bf8
Change-Id: I909b71df67bef316b8cef02010bcae9e91168f4e
Change-Id: I0d4ce8e5a227941ecb3e5972bec35a79e25a6263
Change-Id: I803b36fde6d0b0291dc6fe5cc40c37a946b354f0
Change-Id: I1e3f97afeaa853250966a66c096582339f3c51d6
Change-Id: I6107a5d63c9949ac5e47f93521aa248d0a8ec121
Change-Id: I56329c6aa219fd04198519ee1d1220e395fe82f2
Change-Id: Id93b59ca9a325f79623741d919a18d8f64c12ccd
Change-Id: I81c24fdabd84b92b7cc6e79e19ce3286700806f6
Change-Id: I799844222f194d54990b27d71770543ad6c707ca
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (1)
30-45: ⚠️ Potential issue: The new `properties` parameter lacks validation and formatting, which can easily break template rendering.

- `properties` is currently passed into `str.format` as-is; if `gremlin_prompt` does not handle the `{properties}` placeholder, a `KeyError` is raised.
- `properties` is a `List[tuple]`, so inserting it directly into the prompt yields Python's stringified form, which the LLM is unlikely to parse. Suggested fix:

```diff
+def _format_properties(self, props: Optional[List[tuple]]) -> Optional[str]:
+    if not props:
+        return None
+    return "\n".join(f"- {k}: {v}" for k, v in props)
```

Then use `properties=self._format_properties(self.properties)` in the `.format()` call, and either add a default placeholder to the template or check that the placeholder exists before formatting to avoid runtime exceptions.
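To illustrate the suggestion above, here is a minimal, self-contained sketch of both pieces: the tuple formatting and a placeholder-tolerant formatter. The names `format_properties` and `safe_format` are illustrative, not from the PR.

```python
from typing import List, Optional, Tuple

def format_properties(props: Optional[List[Tuple[str, str]]]) -> Optional[str]:
    # Render (key, value) tuples as a bullet list instead of Python's repr
    if not props:
        return None
    return "\n".join(f"- {k}: {v}" for k, v in props)

class _KeepMissing(dict):
    # Leave unknown placeholders untouched instead of raising KeyError
    def __missing__(self, key):
        return "{" + key + "}"

def safe_format(template: str, **kwargs) -> str:
    # Tolerates both extra kwargs and placeholders absent from kwargs
    return template.format_map(_KeepMissing(**kwargs))
```

With `safe_format`, a prompt template that lacks `{properties}` simply ignores the extra argument instead of failing at request time.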
♻️ Duplicate comments (7)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py (2)

26-29: ⚠️ Potential issue: `self.graph.schema()` still returns the class object; the instantiation issue is unresolved. A previous review noted that `PyHugeClient.schema()` returns the `SchemaManager` class rather than an instance; this change still assigns it directly, so later calls to instance methods will raise `AttributeError`. Please confirm the SDK behavior and, if necessary, instantiate explicitly here or fix the SDK.

54-63: ⚠️ Potential issue: Method objects are still written into the dict, breaking serialization. `baseType()`, `fields()` and similar are accessor methods; without the `()` you get `<bound method …>`. The earlier correction has not been applied, and JSON serialization will fail. Add parentheses to all accessor calls.

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)

25-25: Remove the unnecessary pylint-disable comment. As noted in a previous review, the `# pylint: disable=too-many-statements` comment is unnecessary for this simple function, which contains only a handful of statements.

23-23: ⚠️ Potential issue: The thread-safety problem is still unresolved. As noted in a previous review, the global dict `API_CALL_TRACKER` is not thread-safe; in a concurrent environment multiple requests can mutate it simultaneously, causing data races.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (2)
56-64: The cached schema is not reused, adding avoidable network overhead. The constructor already caches `self.schema` at line 56, yet `_exact_match_vids` calls `self._client.schema()` again. The repeated request adds noticeable latency; use `vertex_label_num = len(self.schema.getVertexLabels())` directly.

88-101: Exact-match on properties still has a wrong field access and a performance problem.
- `key.name` should be the method call `key.name()` (or the attribute `.name`; confirm against the SDK). The current form may return a bound method or an unexpected value, making the query fail.
- The O(n×m) Gremlin pattern sends one request per `property_key` × `keyword` pair, which can easily overwhelm the backend on large graphs. As suggested before: pre-filter to `Text`-typed properties and merge all keywords into a single query, or run one batch query and filter locally.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

88-105: `diff_property_sets` still lacks a docstring, and the semantics of its return values are unclear. This was raised before; without an explanation of what the four returned lists mean, callers can easily misuse them and corrupt the index. Add a complete docstring (Google/NumPy style).
🧹 Nitpick comments (7)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py (1)

54-67: Iterating over all IndexLabels in one pass may hurt performance on large graphs. If the graph has many IndexLabels, the full scan in `self.schema.getIndexLabels()` can be slow. Consider:
- paginating on demand or capping the number of results;
- paginating the display on the frontend/caller side;
- adding a `?detail=false` parameter that returns only the count.
This is an optional optimization, but important for large production graphs.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (1)

26-37: Confirm whether the long-running work behind the new vector endpoint needs async handling. `update_vid_embedding` may trigger vector computation and index writes, which take a long time. It currently runs on the HTTP request thread; FastAPI's default thread pool is limited, so other requests may be blocked. Consider:
- `BackgroundTasks` or an async queue such as Celery;
- or `asyncio.to_thread` inside `vector_http_api`.
This keeps API latency low and helps horizontal scaling.
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

30-45: The endpoint implementation is basically correct but inflexible. The rate-limit logic and error handling are implemented correctly; however, as noted in a previous review, the endpoint accepts no parameters, which limits future extensibility. Consider adding parameters:

```diff
-def update_embedding_api(daily_limit: int = 2):
+def update_embedding_api(request_data: dict = None, daily_limit: int = 2):
     # ... existing rate-limit logic ...
-    result = update_embedding_func()
+    result = update_embedding_func(request_data) if request_data else update_embedding_func()
```
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (1)

70-76: Extract the duplicated code into a shared helper to reduce maintenance cost. `raw_prompt` and `init_prompt` are nearly identical, differing only in `example`. A `_build_prompt(example)` helper would improve readability and prevent divergent edits.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (1)
113-125: The nested `if` can be merged. Ruff (SIM102) suggests a single condition, which slightly improves readability:

```diff
-elif isinstance(item, tuple):
-    if len(item) == 2:
-        unique_tuples.add(item)
+elif isinstance(item, tuple) and len(item) == 2:
+    unique_tuples.add(item)
```

🧰 Tools
🪛 Ruff (0.11.9)
122-123: Use a single `if` statement instead of nested `if` statements; combine `if` statements using `and` (SIM102)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

196-203: The hard-coded 100 000 threshold is not configurable, and the early `return` drops the vertex-index update results.
- The magic number 100 000 should live in a config file or environment variable so operators can tune it.
- When the property count exceeds the threshold the code returns immediately, but the vertex-index statistics were already written to `context`. If later callers rely on more fields (e.g. `match_props`) they will be missing; consider continuing the flow and skipping only the vectorization step.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (1)
216-235: Consider stronger error handling and performance tuning. The current property-query logic has no error handling, and executing property queries one by one may hurt performance. Suggestions:
- add exception handling around each Gremlin query
- consider batching queries to improve performance
- validate query results

```diff
-paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]
-log.debug("paths: %s", paths)
+try:
+    paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]
+    log.debug("paths: %s", paths)
+except Exception as e:
+    log.error("Failed to execute property query for %s=%s: %s", prop_name, prop_value, e)
+    continue
```
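As a sketch of the batching idea: all matched (name, value) pairs can be folded into a single traversal with an `or(...)` step, so only one round-trip and one failure handler are needed. The query shape below is illustrative, not the PR's actual `PROPERTY_QUERY_NEIGHBOR_TPL`.

```python
def build_batched_property_query(matched_props, edge_labels, max_deep, max_items):
    """Build one Gremlin query covering every (prop_name, prop_value) pair,
    instead of issuing a separate query per pair."""
    # One has() predicate per pair, OR-ed together at the start vertices
    predicates = ", ".join(
        f"has('{name}', '{value}')" for name, value in matched_props
    )
    return (
        f"g.V().or({predicates})"
        f".repeat(bothE({edge_labels}).otherV().simplePath())"
        f".times({max_deep}).path().limit({max_items})"
    )
```

A single try/except around the one `exec` call then replaces the per-property handler suggested in the diff above.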
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (6 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py
122-123: Use a single `if` statement instead of nested `if` statements; combine `if` statements using `and` (SIM102)
🔇 Additional comments (12)
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2)

50-53: The property-index cleanup is implemented correctly. Support for clearing the "graph_props" index directory is added consistently with the existing vertex-ID and Gremlin-example index cleanup, and the log and UI messages were updated to reflect the new capability.

115-120: Details

✅ Verification successful

Verify context-key consistency. The function now conditionally returns property-vector update details based on the presence of `context["index_labels"]`. Make sure the related context keys (such as `removed_props_num` and `added_props_vector_num`) are set correctly everywhere in the codebase.

🏁 Script executed:

```shell
#!/bin/bash
# Verify where the property-related context keys are set and used
echo "Searching for assignments to removed_props_num:"
rg "removed_props_num.*=" -A 2 -B 2
echo -e "\nSearching for assignments to added_props_vector_num:"
rg "added_props_vector_num.*=" -A 2 -B 2
echo -e "\nSearching for assignments to index_labels:"
rg "index_labels.*=" -A 2 -B 2
```

Length of output: 4427
🏁 Script executed:

```shell
#!/bin/bash
# Inspect the relevant context keys and function definitions in graph_index_utils.py
echo "=== graph_index_utils.py: lines 1-200 ==="
sed -n '1,200p' hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
echo -e "\n=== graph_index_utils.py: lines 100-140 ==="
sed -n '100,140p' hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
# Find references to GraphIndexUtils across the codebase
echo -e "\n=== References to GraphIndexUtils ==="
rg "GraphIndexUtils" -n
```

Length of output: 8385

🏁 Script executed:

```shell
#!/bin/bash
# Verify that build_semantic_index.py writes removed_props_num and added_props_vector_num into context
echo "Find context writes of removed_props_num:"
rg 'context\["removed_props_num"\]' -n hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
echo -e "\nFind context writes of added_props_vector_num:"
rg 'context\["added_props_vector_num"\]' -n hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
echo -e "\nShow code around prop_index.remove:"
rg -n "prop_index.remove" -A5 -B5 hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
echo -e "\nShow code around to_index_file:"
rg -n "to_index_file" -A5 -B5 hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
```

Length of output: 5045

Context-key consistency confirmed; no change needed.
- `build_semantic_index.py` correctly writes `removed_props_num` and `added_props_vector_num` via `context.update({"removed_props_num": …, "added_props_vector_num": …})`.
- `index_labels` is set by the `fetch_graph_data` operator and injected into `context`.
No omissions or inconsistencies found; closing this suggestion.
hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1)

62-69: The method-signature extension is correct. The optional `properties` parameter for property-based query generation is added correctly; the `None` default preserves backward compatibility, and the parameter is passed to the `GremlinGenerateSynthesize` constructor properly.

hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (2)

221-222: The property-utilization instruction is a reasonable addition. The new instruction correctly directs the model to use specific property names or values mentioned in the user query, e.g. filtering with `has('propertyName', 'propertyValue')` or projecting specific values in the generated Gremlin.

236-237: The output-format extension is consistent. The new "Referenced Extracted Properties Related to the Query" section matches the format and style of the existing vertex-ID reference section. The tuple format `[('property_name', 'property_value'), ...]` is clear and easy for the model to understand and use.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)
208-210: Calling `asyncio.run` from a synchronous context may cause nested-event-loop errors. If an upper-level framework (e.g. FastAPI or Jupyter) already runs an event loop, `asyncio.run()` raises `RuntimeError: asyncio.run() cannot be called from a running event loop`.
Suggestions:
- make `run()` an async function and let the caller decide when to `await`;
- or use `await` / `asyncio.create_task` and require an outer coroutine context;
- if it must stay synchronous, detect `asyncio.get_running_loop()` and fall back to `loop.run_until_complete`.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (6)
56-56: The property-query template update is reasonable. The new template filters with `has('{current_prop_name}', '{current_prop_value}')` and projects `.values('{current_prop_name}')` instead of `.id()`, consistent with property-based matching. Also applies to: 68-69

132-132: The Gremlin-generation enhancement is correct. `match_props` is extracted from the context and passed to `gremlin_generate_synthesize`, enabling property-based Gremlin query generation. Also applies to: 138-138

164-164: The property-match initialization is reasonable. Matched properties are extracted from the context and `_prop_to_match` is correctly set to the first property name; the debug log helps trace the matching process. Also applies to: 169-171

240-240: Variable-usage check. The pylint comment on this line shows the developer is aware that `knowledge_with_degree` may be undefined; fix the variable-initialization issue in the code block above.

349-349: The node string formatting is correct. The conditional formatting handles both ID-based and property-based matches, using the vertex ID for ID matches and the full property map for property matches.

213-235: ⚠️ Potential issue: fix the type inconsistency and the undefined variables. The review flags two problems in this block:
- `graph_chain_knowledge` is initialized as `set()` and later mutated via `.update()`; the review claims this causes a type error because `set.update()` expects an iterable while `temp_graph_chain_knowledge` is itself a set.
- `vertex_degree_list` and `knowledge_with_degree` are never defined in this branch but are used at line 240, which can raise `NameError`.

Apply this fix:

```diff
 elif matched_props:
     # WARN: When will the query enter here?
     graph_chain_knowledge = set()
+    vertex_degree_list = []
+    knowledge_with_degree = {}
     for prop_name, prop_value in matched_props:
         self._prop_to_match = prop_name
         gremlin_query = PROPERTY_QUERY_NEIGHBOR_TPL.format(
             current_prop_name=prop_name,
             current_prop_value=prop_value,
             edge_labels=edge_labels_str,
             edge_limit=edge_limit_amount,
             max_deep=self._max_deep,
             max_items=self._max_items
         )
         log.warning("Unable to find vid, downgraded to property query, please confirm if it meets expectation.")
         log.debug("property gremlin: %s", gremlin_query)
         paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]
         log.debug("paths: %s", paths)
-        temp_graph_chain_knowledge, vertex_degree_list, knowledge_with_degree = self._format_graph_query_result(
+        temp_graph_chain_knowledge, temp_vertex_degree_list, temp_knowledge_with_degree = self._format_graph_query_result(
             query_paths=paths
         )
         graph_chain_knowledge.update(temp_graph_chain_knowledge)
+        vertex_degree_list.extend(temp_vertex_degree_list)
+        knowledge_with_degree.update(temp_knowledge_with_degree)
```

Likely an incorrect or invalid review comment.
```diff
 chunk_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "chunks")))
 graph_vid_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_vids")))
+graph_prop_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_props")))
 return json.dumps({
     "embed_dim": chunk_vector_index.index.d,
     "vector_info": {
         "chunk_vector_num": chunk_vector_index.index.ntotal,
         "graph_vid_vector_num": graph_vid_vector_index.index.ntotal,
-        "graph_properties_vector_num": len(chunk_vector_index.properties)
+        "graph_properties_vector_num": graph_prop_vector_index.index.ntotal,
     }
```
💡 Verification agent
🧩 Analysis chain

Add error handling and lazy loading to cope with missing index files and high memory usage.

The function currently loads three faiss index files synchronously on every call; if the graph_props file has not been generated yet, `VectorIndex.from_index_file()` raises and the endpoint returns 500. Loading all three indexes into memory also amplifies the one-off memory peak.

Suggestions:
- catch `FileNotFoundError` / `IOError` before loading and return 0, so the endpoint degrades gracefully;
- when only metadata is needed, consider reading a `*.json` descriptor file or a lightweight header read (e.g. something like `faiss.read_index_header`) to reduce I/O and memory.

Example fix (core idea; adapt to the VectorIndex implementation):
```diff
-chunk_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "chunks")))
-graph_vid_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_vids")))
-graph_prop_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_props")))
+def safe_load(name: str):
+    try:
+        return VectorIndex.from_index_file(
+            str(os.path.join(resource_path, huge_settings.graph_name, name))
+        )
+    except FileNotFoundError:
+        return None
+
+chunk_vector_index = safe_load("chunks")
+graph_vid_vector_index = safe_load("graph_vids")
+graph_prop_vector_index = safe_load("graph_props")
```
+ graph_prop_vector_index = safe_load("graph_props")同时在 JSON 组装处注意 None 判空。
让我们先定位 VectorIndex 的实现,检视 from_index_file 方法的行为,以确认缺失文件时的异常类型及加载开销:
🏁 Script executed:

```shell
#!/bin/bash
# Find the VectorIndex class definition
rg -n "class VectorIndex" .
# Find the from_index_file implementation
rg -n "def from_index_file" .
```

Length of output: 361
To confirm the concrete implementation and exception behavior of `from_index_file`, inspect the code:

🏁 Script executed:

```shell
#!/bin/bash
# Inspect the implementation of VectorIndex.from_index_file
sed -n '1,200p' hugegraph-llm/src/hugegraph_llm/indices/vector_index.py
```

Length of output: 4337
Replace the full loads with a lightweight metadata API to avoid unnecessary I/O and memory use.

`VectorIndex.from_index_file` already returns an empty instance (`ntotal=0`) when the index file is missing, so there is no need to catch `FileNotFoundError` at the call site. However, every API call currently loads three full FAISS indexes into memory just to return `embed_dim` and the entry counts. Suggestions:

- Add a `read_metadata(dir_path: str) -> Tuple[int, int]` (or similarly named) method to `VectorIndex` that:
  - checks file existence with `os.path.exists` and returns `(0, 0)` when missing;
  - reads only the index header to obtain `d` and `ntotal`;
  - optionally deserializes `properties.pkl` or a companion JSON to get the property count.
- In `vector_index_utils.py`, replace the three `from_index_file` calls:

```diff
-chunk_index = VectorIndex.from_index_file(chunks_path)
-vid_index = VectorIndex.from_index_file(graph_vids_path)
-prop_index = VectorIndex.from_index_file(graph_props_path)
+chunk_meta = VectorIndex.read_metadata(chunks_path)
+vid_meta = VectorIndex.read_metadata(graph_vids_path)
+prop_meta = VectorIndex.read_metadata(graph_props_path)
```

- Then build the response from the metadata directly:

```python
return json.dumps({
    "embed_dim": chunk_meta.d,
    "vector_info": {
        "chunk_vector_num": chunk_meta.ntotal,
        "graph_vid_vector_num": vid_meta.ntotal,
        "graph_properties_vector_num": prop_meta.ntotal,
    }
})
```

This avoids three full index loads and significantly reduces endpoint latency, I/O pressure, and peak memory.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py around lines 60
to 69, the code loads three FAISS index files using VectorIndex.from_index_file,
which can be expensive and may cause issues if files are missing or large. Since
from_index_file already handles missing files gracefully by returning an index
with ntotal=0, avoid reloading full indexes; instead, implement a lightweight
method to read only index metadata such as dimension and total count, reducing
I/O and memory usage. Replace the three full index loads with this new method,
and use the retrieved metadata for JSON response construction, ensuring
efficient and fault-tolerant operation.
Actionable comments posted: 3
♻️ Duplicate comments (3)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)

25-25: The global dict is not thread-safe. The global `API_CALL_TRACKER` dict has a thread-safety problem in concurrent environments, the same issue raised in the previous review.
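A minimal sketch of one way to make the tracker thread-safe (class and method names are illustrative, not from the PR): guard every read-modify-write of the per-day counter with a single lock.

```python
import threading
from datetime import date

class DailyCallTracker:
    """Thread-safe replacement for a bare module-level dict:
    all reads and writes happen under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def try_acquire(self, daily_limit: int) -> bool:
        """Return True and count the call if under today's limit, else False."""
        today = date.today()
        with self._lock:
            # Drop stale days so the dict never grows unbounded
            for day in [d for d in self._counts if d != today]:
                del self._counts[day]
            if self._counts.get(today, 0) >= daily_limit:
                return False
            self._counts[today] = self._counts.get(today, 0) + 1
            return True
```

The endpoint would then call `tracker.try_acquire(daily_limit)` once per request instead of touching the dict directly.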
28-28: Remove the unnecessary pylint-disable comment. The `# pylint: disable=too-many-statements` comment is unnecessary for this simple function; same issue as noted in the previous review.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

147-219: 🛠️ Refactor suggestion: the `run` method still needs refactoring for maintainability. The new property-index logic is implemented correctly, but `run` is now 70+ lines and violates the single-responsibility principle; this was raised in a previous review and remains unresolved. Strongly consider splitting it into:
- `_update_vertex_id_index()`: vertex-ID index update (lines 148-171)
- `_update_property_index()`: property index update (lines 173-218)
This would significantly improve readability, testability, and maintainability.
🧹 Nitpick comments (3)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

38-48: Optimize the rate-limit cleanup logic. The current cleanup walks the whole dict on every API call to delete stale entries, which can hurt performance under frequent calls. A more efficient strategy:

```diff
-for call_date in list(API_CALL_TRACKER.keys()):
-    if call_date != today:
-        del API_CALL_TRACKER[call_date]
+# Only clean up yesterday's record instead of walking the whole dict
+yesterday = today - timedelta(days=1)
+API_CALL_TRACKER.pop(yesterday, None)
```

This requires an import at the top of the file:

```python
from datetime import date, timedelta
```

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (2)
33-40: Make the query limit configurable. The Gremlin query template is well designed, but the hard-coded `100000` limit may not be flexible enough. Consider making the limit a configuration parameter so it can be tuned per environment:

```diff
 INDEX_PROPERTY_GREMLIN = """
 g.V().hasLabel('{label}')
-    .limit(100000)
+    .limit({limit})
     .project('vid', 'properties')
     .by(id())
     .by(valueMap({fields}))
     .toList()
 """
```

Then pass the configured limit when formatting the query.

89-106: The logic is correct, but inline comments would help. `diff_property_sets` correctly identifies the property sets to add, update, and remove, but given the method's complexity, inline comments would improve readability:

```diff
 def diff_property_sets(
     self, present_prop_value_to_propset: dict, past_prop_value_to_propset: dict
 ):
     to_add = []
     to_update = []
     to_update_remove = []
+    # Values present in the past but absent now must be removed
     to_remove_keys = set(past_prop_value_to_propset) - set(present_prop_value_to_propset)
     to_remove = [past_prop_value_to_propset[k] for k in to_remove_keys]
+    # Walk current values to determine additions and updates
     for prop_value, present_propset in present_prop_value_to_propset.items():
         past_propset = past_prop_value_to_propset.get(prop_value)
         if past_propset is None:
+            # New property value -> add
             to_add.append((prop_value, present_propset))
         elif present_propset != past_propset:
+            # Property set changed -> update
             to_update.append((prop_value, present_propset))
             to_update_remove.append((prop_value, past_propset))
     return to_add, to_update, to_remove, to_update_remove
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (5)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)

1-17: The license header is correct. The Apache license header is standard and complete.

19-24: Imports check out. All imported modules are used in the code; there are no unused imports.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (3)

21-22: The import changes are reasonable. The new imports are all required for the property-vector-index feature, including `defaultdict` for the data structures and `PyHugeClient` for graph-database queries. Also applies to: 31-31

45-45: The constructor changes are correct. The new property-index directory, property vector index, and client initialization all support the property-vector-index feature and are implemented correctly. Also applies to: 47-47, 50-56

136-145: `get_past_props` is concise and efficient. Extracting historical property information from the existing property index is implemented clearly and correctly.
```python
        del API_CALL_TRACKER[call_date]
call_count = API_CALL_TRACKER.get(today, 0)
if call_count >= daily_limit:
    log.error("Rate limit exceeded for update_vid_embedding. Maximum %d calls per day.", daily_limit)
```
Fix the function name in the log message.

The function name `update_vid_embedding` in the log message does not match the actual endpoint.

Apply this change to correct it:

```diff
-log.error("Rate limit exceeded for update_vid_embedding. Maximum %d calls per day.", daily_limit)
+log.error("Rate limit exceeded for update_embedding_api. Maximum %d calls per day.", daily_limit)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/vector_api.py at line 43, the log error
message references the function name update_vid_embedding, which does not match
the actual endpoint function name. Update the log message to reflect the correct
function name to improve clarity and debugging accuracy. Ensure the message
accurately describes the function where the rate limit is exceeded.
```python
# pylint: disable=too-many-statements
def vector_http_api(router: APIRouter, update_embedding_func):
    @router.post("/vector/embedding", status_code=status.HTTP_200_OK)
    def update_embedding_api(daily_limit: int = 2):
```
🛠️ Refactor suggestion

Add input validation for the `daily_limit` parameter.

The `daily_limit` parameter has no input validation, so negative or zero values can slip through.

Apply this change to add validation:

```diff
 def update_embedding_api(daily_limit: int = 2):
+    if daily_limit <= 0:
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail="daily_limit must be a positive integer"
+        )
     # existing implementation continues below
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/vector_api.py at line 31, the function
update_embedding_api has a daily_limit parameter without input validation, which
could lead to invalid values such as negative numbers or zero. Add validation
logic to ensure daily_limit is a positive integer, raising an appropriate
exception if the validation fails. This will prevent invalid input values from
causing issues downstream.
```python
def get_present_props(self, context: dict[str, Any]) -> dict[str, frozenset[tuple[str, str]]]:
    results = []
    for item in context["index_labels"]:
        label = item["base_value"]
        fields = item["fields"]
        fields_str_list = [f"'{field}'" for field in fields]
        fields_for_query = ", ".join(fields_str_list)
        gremlin_query = INDEX_PROPERTY_GREMLIN.format(
            label=label,
            fields=fields_for_query
        )
        log.debug("gremlin_query: %s", gremlin_query)
        result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
        results.extend(result)
    orig_present_prop_value_to_propset = defaultdict(set)
    for item in results:
        properties = item["properties"]
        for prop_key, values in properties.items():
            if not values:
                continue
            prop_value = str(values[0])
            orig_present_prop_value_to_propset[prop_value].add((prop_key, prop_value))
    present_prop_value_to_propset = {
        k: frozenset(v)
        for k, v in orig_present_prop_value_to_propset.items()
    }
    return present_prop_value_to_propset
```
🛠️ Refactor suggestion

Add error handling around the Gremlin query.

The logic of `get_present_props` is correct, but failures of the Gremlin query are unhandled. Add exception capture to improve robustness:

```diff
         log.debug("gremlin_query: %s", gremlin_query)
-        result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
-        results.extend(result)
+        try:
+            result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
+            results.extend(result)
+        except Exception as e:
+            log.error("Failed to execute Gremlin query for label %s: %s", label, e)
+            raise
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
around lines 108 to 134, the code executes Gremlin queries without handling
potential errors or exceptions that may occur during the execution. Add a
try-except block around the gremlin().exec() call to catch exceptions, log an
error message with relevant details, and optionally handle the error gracefully
to prevent the method from crashing unexpectedly.
Change-Id: Ie99469b659abe5efd53d8a52581c4ad91622ef6f
Change-Id: I04b1a71a5eda13c8369cd9c4b99e83f6c1373405
Change-Id: I8d9d5e3a46e16162c45b1b07a05bd187ebe0f4b1
Change-Id: Ibf83629dde9373398dac75b945a7fa19ae029a08
Change-Id: I8923df363da0ddc301b4a3c4833cec478a6c83f9
Change-Id: I84b2fdbfc0745d1556d5d29059ce5f9dfa311352
Change-Id: I79ef4dc7852788823b249a80d7979e5917d2a8c0
…arhugegraph/hugegraph-ai into property_embedding
Actionable comments posted: 17
🔭 Outside diff range comments (1)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

79-87: ⚠️ Potential issue: collecting async results out of order can misalign embeddings and object IDs.

`asyncio.as_completed(tasks)` yields task results in completion order, not submission order; the subsequent `embeddings.extend(batch_embeddings)` therefore produces an `embeddings` list whose order no longer matches `vid_batches` (nor `added_vids` / `add_prop_values`).

`VectorIndex.add(embeddings, ids)` relies on positional one-to-one correspondence, so wrong vectors may be inserted and retrieval accuracy degrades badly. Use `await asyncio.gather(*tasks)` to preserve order, or record batch indices explicitly:

```diff
-embeddings = []
-with tqdm(total=len(tasks)) as pbar:
-    for future in asyncio.as_completed(tasks):
-        batch_embeddings = await future
-        embeddings.extend(batch_embeddings)
-        pbar.update(1)
+embeddings: list[Any] = []
+batch_results = await asyncio.gather(*tasks)  # collect in submission order
+for batch_emb in batch_results:
+    embeddings.extend(batch_emb)
```
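A small demonstration of the ordering difference (the `embed_batch` coroutine is a stand-in for the real embedding call): `asyncio.gather` returns results in submission order even when later-submitted batches finish first.

```python
import asyncio

async def embed_batch(batch_id, items):
    # Stand-in for an embedding call; later batches finish first on purpose
    await asyncio.sleep(0.01 * (3 - batch_id))
    return [f"vec-{batch_id}-{i}" for i in items]

async def collect_in_order(batches):
    # gather() returns results in submission order, regardless of
    # completion order -- unlike asyncio.as_completed()
    results = await asyncio.gather(
        *(embed_batch(b, items) for b, items in enumerate(batches))
    )
    embeddings = []
    for batch in results:
        embeddings.extend(batch)
    return embeddings

embeddings = asyncio.run(collect_in_order([[0, 1], [0], [0, 1]]))
```

Even though batch 2 completes before batch 0 here, `embeddings` still lines up positionally with the input batches, which is exactly what `VectorIndex.add(embeddings, ids)` requires.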
♻️ Duplicate comments (1)

hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1)

90-106: The documented return-value types of `diff_property_sets` still do not match the implementation. The same issue was flagged in the main file; fix it consistently or remove the copied code to avoid duplicated effort later.
🧹 Nitpick comments (13)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (2)

58-60: The string split should handle the no-separator case to avoid `IndexError`. `_extract_names` assumes every vid contains ":"; if upstream switches to plain numeric/UUID ids, this raises. Use `partition` or `split(":", 1)` with a fallback:

```diff
-return [v.split(":")[1] for v in vertices]
+names = []
+for v in vertices:
+    _, sep, tail = v.partition(":")
+    names.append(tail if sep else v)
+return names
```
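The same fallback as a runnable sketch (the function name `extract_names` mirrors `_extract_names` but is illustrative): `str.partition` never raises and reports via its middle element whether the separator was found.

```python
def extract_names(vertices):
    """Split 'label:name' vids, falling back to the raw vid when ':' is absent."""
    names = []
    for v in vertices:
        _, sep, tail = v.partition(":")  # sep is "" when ":" is missing
        names.append(tail if sep else v)
    return names
```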
167-177: Mixing `int | str` in `added_props_vector_num` complicates callers. When the property count exceeds the limit the function returns the string `"0 (because of exceeding limit)"`, otherwise an `int`. The mixed types force downstream logging and statistics code into `isinstance` checks. Return a consistent `int` and flag the overflow via a separate boolean field or a log message:

```diff
-if len(add_prop_values) > 100000:
-    log.warning("The number of props > 100000, please select which properties to vectorize.")
-    return removed_props_num, "0 (because of exceeding limit)"
+if len(add_prop_values) > 100000:
+    log.warning("The number of props > 100000, skip building property index.")
+    return removed_props_num, 0
```

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1)
57-59: The external doc link points to the internal baidu-int repository, which open-source users cannot reach. Public documentation should stay publicly accessible to avoid 404s. If the template lives only in the internal repository, consider:
- syncing the content to a public repository and updating the link;
- or noting in the docs that the link is internal and accessible only within Baidu.

hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
184-206: The new endpoint is basically correct; suggested improvements:
- Endpoint naming: `/text2gremlin` does not follow REST conventions; consider `/gremlin/generate` or similar.
- Logging consistency: line 202 uses an f-string; keep the format consistent with the other log calls.

Suggested changes:

```diff
-@router.post("/text2gremlin", status_code=status.HTTP_200_OK)
+@router.post("/gremlin/generate", status_code=status.HTTP_200_OK)
```

```diff
-log.error(f"Error in text2gremlin_api: {e}")
+log.error("Error in text2gremlin_api: %s", e)
```

README.md (1)
36-37: Fix the Markdown list indentation. Static analysis flags incorrect list-item indentation; items should start at column 0, not column 1:

```diff
- - 如流 HugeGraph Team/Group
- - GraphRAG DevOps Team (🚧)
+- 如流 HugeGraph Team/Group
+- GraphRAG DevOps Team (🚧)
```

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
36-36: Unordered list indentation — Expected: 0; Actual: 1 (MD007, ul-indent)
37-37: Unordered list indentation — Expected: 0; Actual: 1 (MD007, ul-indent)
vermeer-python-client/src/pyvermeer/client/client.py (2)
38-41: Keep module key names consistent. The module dict uses `"graph"` and `"tasks"`; for naming consistency, if GraphModule maps to `"graph"`, then TaskModule should probably map to `"task"` rather than `"tasks"`.

44-48: Strengthen error handling. The current `__getattr__` simply raises `AttributeError`; a friendlier message would list the available modules:

```diff
 def __getattr__(self, name):
     """Access modules as attributes"""
     if name in self._modules:
         return self._modules[name]
-    raise AttributeError(f"Module {name} not found")
+    available_modules = list(self._modules.keys())
+    raise AttributeError(f"Module '{name}' not found. Available modules: {available_modules}")
```

vermeer-python-client/src/pyvermeer/structure/master_data.py (1)
56-56: Make the time formatting check more explicit
The formatting logic in `to_dict` already handles the `None` case, which is good, but an explicit check is clearer.

```diff
-"launch_time": self.__launch_time.strftime("%Y-%m-%d %H:%M:%S") if self.__launch_time else ''
+"launch_time": self.__launch_time.strftime("%Y-%m-%d %H:%M:%S") if self.__launch_time is not None else ''
```
38-47: Consider keeping the file-logging option
The commented-out file handler could be useful in production. Consider keeping it behind a configuration flag.

Add a parameter to control whether file logging is enabled:

```diff
-    def _initialize(self, name: str):
+    def _initialize(self, name: str, enable_file_log: bool = False):
         """初始化日志配置"""
         self.logger = logging.getLogger(name)
         self.logger.setLevel(logging.INFO)
         if not self.logger.handlers:
             # ... console handler code ...
-            # file_handler = logging.FileHandler('api_client.log')
-            # file_handler.setLevel(logging.DEBUG)
-            # file_handler.setFormatter(
-            #     logging.Formatter(
-            #         '[%(asctime)s] [%(levelname)s] [%(threadName)s] %(name)s - %(message)s'
-            #     )
-            # )
+            if enable_file_log:
+                file_handler = logging.FileHandler('api_client.log')
+                file_handler.setLevel(logging.DEBUG)
+                file_handler.setFormatter(
+                    logging.Formatter(
+                        '[%(asctime)s] [%(levelname)s] [%(threadName)s] %(name)s - %(message)s'
+                    )
+                )
+                self.logger.addHandler(file_handler)
             self.logger.addHandler(console_handler)
-            # self.logger.addHandler(file_handler)
```
96-96: Logging the full response body may be too large
Recording the complete response content in debug logs can bloat log files and hurt performance.

Limit the size of the logged response content:

```diff
-        log.debug(f"Response code:{response.status_code}, received: {response.text}")
+        response_preview = response.text[:500] + "..." if len(response.text) > 500 else response.text
+        log.debug(f"Response code:{response.status_code}, received: {response_preview}")
```
134-150: Consider consistent attribute-access patterns
TaskCreateRequest uses public attributes while the other classes use private attributes with property accessors. For a consistent code style, consider adopting the same pattern:

```diff
 class TaskCreateRequest:
     """task create request"""

     def __init__(self, task_type, graph_name, params):
         """init"""
-        self.task_type = task_type
-        self.graph_name = graph_name
-        self.params = params
+        self.__task_type = task_type
+        self.__graph_name = graph_name
+        self.__params = params
+
+    @property
+    def task_type(self):
+        """task type"""
+        return self.__task_type
+
+    @property
+    def graph_name(self):
+        """graph name"""
+        return self.__graph_name
+
+    @property
+    def params(self):
+        """params"""
+        return self.__params

     def to_dict(self) -> dict:
         """to dict"""
         return {
             'task_type': self.task_type,
             'graph': self.graph_name,
             'params': self.params
         }
```
18-18: Simplify the dictionary lookup
As the static-analysis tool suggests, the `dic.get()` call can be simplified because the default return value is already `None`.

```diff
-        self.__vertex_data_backend = dic.get('vertex_data_backend', None)
+        self.__vertex_data_backend = dic.get('vertex_data_backend')
```
🪛 Ruff (0.11.9)
18-18: Use `dic.get('vertex_data_backend')` instead of `dic.get('vertex_data_backend', None)`
Replace `dic.get('vertex_data_backend', None)` with `dic.get('vertex_data_backend')` (SIM910)
66-66: Improve the comment wording
"Nonsense" is unprofessional in a docstring; provide a more descriptive explanation.

```diff
-        """is self worker. Nonsense """
+        """indicates whether this is the local/self worker"""
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (38)
- README.md (2 hunks)
- hugegraph-llm/README.md (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (5 hunks)
- hugegraph-python-client/README.md (0 hunks)
- vermeer-python-client/README.md (1 hunks)
- vermeer-python-client/requirements.txt (1 hunks)
- vermeer-python-client/setup.py (1 hunks)
- vermeer-python-client/src/pyvermeer/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/base.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/graph.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/master.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/task.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/worker.py (1 hunks)
- vermeer-python-client/src/pyvermeer/client/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/client/client.py (1 hunks)
- vermeer-python-client/src/pyvermeer/demo/task_demo.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/base_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/graph_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/master_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/task_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/worker_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/exception.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/log.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/vermeer_config.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py (1 hunks)
💤 Files with no reviewable changes (1)
- hugegraph-python-client/README.md
✅ Files skipped from review due to trivial changes (13)
- vermeer-python-client/src/pyvermeer/client/__init__.py
- vermeer-python-client/src/pyvermeer/utils/__init__.py
- vermeer-python-client/src/pyvermeer/structure/__init__.py
- vermeer-python-client/src/pyvermeer/api/master.py
- vermeer-python-client/src/pyvermeer/api/worker.py
- vermeer-python-client/src/pyvermeer/__init__.py
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py
- vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py
- vermeer-python-client/requirements.txt
- vermeer-python-client/README.md
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py
- vermeer-python-client/setup.py
- vermeer-python-client/src/pyvermeer/utils/exception.py
🚧 Files skipped from review as they are similar to previous changes (3)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py
🧰 Additional context used
🧬 Code Graph Analysis (5)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (1)
GremlinGenerateRequest(127-152)
vermeer-python-client/src/pyvermeer/api/base.py (3)
vermeer-python-client/src/pyvermeer/structure/task_data.py (1)
params(106-108)hugegraph-ml/src/hugegraph_ml/models/seal.py (1)
debug(790-791)vermeer-python-client/src/pyvermeer/client/client.py (1)
send_request(50-52)
vermeer-python-client/src/pyvermeer/api/task.py (2)
vermeer-python-client/src/pyvermeer/api/base.py (2)
BaseModule(10-29)_send_request(22-29)vermeer-python-client/src/pyvermeer/structure/task_data.py (11)
TasksResponse(174-193)TaskCreateRequest(134-149)TaskCreateResponse(152-171)TaskResponse(196-215)params(106-108)to_dict(31-33)to_dict(115-131)to_dict(143-149)to_dict(165-171)to_dict(187-193)to_dict(209-215)
vermeer-python-client/src/pyvermeer/structure/task_data.py (2)
vermeer-python-client/src/pyvermeer/structure/base_data.py (3)
BaseResponse(12-39)errcode(26-31)message(34-39)vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py (1)
parse_vermeer_time(12-17)
vermeer-python-client/src/pyvermeer/structure/graph_data.py (2)
vermeer-python-client/src/pyvermeer/structure/base_data.py (3)
BaseResponse(12-39)errcode(26-31)message(34-39)vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py (1)
parse_vermeer_time(12-17)
🪛 Gitleaks (8.26.0)
vermeer-python-client/src/pyvermeer/demo/task_demo.py
16-16: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 Ruff (0.11.9)
vermeer-python-client/src/pyvermeer/structure/graph_data.py
18-18: Use dic.get('vertex_data_backend') instead of dic.get('vertex_data_backend', None)
Replace dic.get('vertex_data_backend', None) with dic.get('vertex_data_backend')
(SIM910)
🪛 markdownlint-cli2 (0.17.2)
README.md
36-36: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
37-37: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
🪛 LanguageTool
hugegraph-llm/README.md
[misspelling] ~88-~88: Did you mean “except”?
Context: ...c/clients/restful-api-v3/schema.html): (Accept 2 types) - User-defined Schema (J...
(ACCEPT_EXCEPT)
🔇 Additional comments (23)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (2)
29-29: Import statement is correct
The new `GremlinGenerateRequest` import matches the needs of the new API endpoint.

45-45: Function parameter addition is reasonable
The new `gremlin_generate_selective_func` parameter provides the selective generation capability the new endpoint requires.

README.md (1)

6-7: Documentation simplification is reasonable
Removing the hyperlink formatting keeps the document concise without hurting readability.
vermeer-python-client/src/pyvermeer/structure/base_data.py (2)
7-9: Response status constants are well defined
The constant names are clear and the values are reasonable, making status checks and error handling straightforward.

12-39: BaseResponse class is well designed
The implementation follows solid object-oriented principles:

- private attributes for proper encapsulation
- read-only access via property decorators
- graceful handling of missing dictionary keys
- complete docstrings
vermeer-python-client/src/pyvermeer/demo/task_demo.py (1)
11-42: Demo code is well structured
Aside from the security issue, the demo is clear and effectively showcases the client's features:

- uses structured request objects
- demonstrates both fetch and create operations
- sensible code organization
🧰 Tools
🪛 Gitleaks (8.26.0)
16-16: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
hugegraph-llm/README.md (3)
41-41: Documentation update: repository address changed to an internal one
The clone address was changed from the public Apache GitHub repository to a Baidu-internal repository address, with a username placeholder added. Please confirm this change complies with the project's access-control policy.

88-88: Fix: more accurate documentation wording
Changing "Except 2 types" to "Accept 2 types" makes the description accurate and clear.
🧰 Tools
🪛 LanguageTool
[misspelling] ~88-~88: Did you mean “except”?
Context: ...c/clients/restful-api-v3/schema.html): (Accept 2 types) - User-defined Schema (J...(ACCEPT_EXCEPT)
89-89: Documentation link update: template link now points to an internal repository
The template link now points to the Baidu-internal DevOps repository, and the line reference changed from 125 to 173. Please verify that the new line number still points at the relevant template content.
vermeer-python-client/src/pyvermeer/api/base.py (1)
10-29: Good foundational design: dependency injection and unified request handling
This base module class is well designed:

- receives the client instance via dependency injection
- sets up a hierarchical logger
- provides a single entry point for request handling
- follows Python naming conventions, using an underscore prefix for protected methods

The implementation is concise with clear responsibilities, giving subclasses a solid foundation.
vermeer-python-client/src/pyvermeer/api/task.py (1)
1-39: Good implementation!
The TaskModule class is concise and clear: it correctly inherits from BaseModule, implements the task-related API operations through the parent's unified request method, carries complete type annotations, and follows Python coding conventions.
vermeer-python-client/src/pyvermeer/client/client.py (1)
26-26: Details

✅ Verification successful

Check Python version compatibility

The `tuple[float, float]` annotation is not supported before Python 3.9. For backward compatibility, consider `Tuple[float, float]` or adding `from __future__ import annotations`.

Run the following script to check the project's Python version requirement:

🏁 Script executed:

```shell
#!/bin/bash
# Check the project's Python version requirement
fd -e py -e txt setup.py pyproject.toml requirements.txt | xargs grep -l "python_requires\|Programming Language :: Python" | head -5
```

Length of output: 299

To find the project's explicit Python version constraints, re-run the search with:

🏁 Script executed:

```shell
#!/bin/bash
# Search the repository for Python version constraint declarations
rg -n "python_requires" .
rg -n "Programming Language :: Python" .
```

Length of output: 498

Python version compatibility confirmed

The project already declares a minimum Python version via `python_requires=">=3.9"` in `vermeer-python-client/setup.py`. Python 3.9+ natively supports the `tuple[float, float]` annotation, so there is no need to switch to `Tuple[float, float]` or add `from __future__ import annotations`.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (1)
230-236: Details

✅ Verification successful

Verify the completeness of the output keys

It is recommended to verify that the defined output keys match the actual return values of the `gremlin_generate` function.

🏁 Script executed:

```shell
#!/bin/bash
# Inspect the return structure of gremlin_generate
ast-grep --pattern $'def gremlin_generate($$$) -> $_:
  $$$
  return (
    $$$
  )'
```

Length of output: 94

To confirm the definition of `gremlin_generate` and its return structure, run:

🏁 Script executed:

```shell
#!/bin/bash
rg -n "def gremlin_generate" -A10 hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py
```

Length of output: 935

Inspect the return statements inside `gremlin_generate` to confirm the order and number of returned elements:

🏁 Script executed:

```shell
#!/bin/bash
rg -n "def gremlin_generate" -A150 -B0 hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py
```

Length of output: 10031

Confirmed: `output_keys` match the return values of `gremlin_generate`

On inspection, `gremlin_generate` always returns a 5-tuple in the normal flow:

- match_result
- context["result"] (i.e. template_gremlin)
- context["raw_result"] (i.e. raw_gremlin)
- context["template_exec_res"] (corresponding to template_execution_result)
- context["raw_exec_res"] (corresponding to raw_execution_result)

The `output_keys = ["match_result", "template_gremlin", "raw_gremlin", "template_execution_result", "raw_execution_result"]` defined in `gremlin_generate_selective` map one-to-one to these return values, so no adjustment is needed.
vermeer-python-client/src/pyvermeer/utils/log.py (1)
58-58: Global logger instance is well created
Creating the global logger instance this way follows best practice and makes it easy to use throughout the client.
vermeer-python-client/src/pyvermeer/structure/worker_data.py (1)
97-99: Good use of modern type annotations
Using the modern `list[Worker]` annotation syntax is good practice.

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2)

145-152: Sensible default for the output type
Defaulting to the template Gremlin is reasonable: it provides useful default behavior while still letting users customize the output type.

120-125: Enum definition is clear and complete
The `GremlinOutputType` enum covers every stage of Gremlin query generation with clear, readable names.

vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py (2)
38-42: Good token validation logic
Validating the token's presence at initialization is good security practice and surfaces configuration problems early.

97-104: Exception handling is thorough and correct
The exception handling covers the main error scenarios, uses appropriate custom exception types, and preserves the exception chain.
vermeer-python-client/src/pyvermeer/structure/task_data.py (2)
13-34: TaskWorker class is implemented correctly
The class is concise and clear, properly encapsulating the basic task-worker information; the property accessors and dictionary serialization behave as expected.

152-216: Response classes are implemented correctly
TaskCreateResponse, TasksResponse, and TaskResponse correctly inherit from BaseResponse and follow a consistent initialization and serialization pattern, with a clear structure that meets expectations.
vermeer-python-client/src/pyvermeer/structure/graph_data.py (2)
86-201: VermeerGraph class is implemented correctly
The class properly encapsulates the graph's metadata, handles time parsing and formatting appropriately, and uses boolean flag properties to clearly represent the graph's various characteristics.

203-245: Graph response classes are implemented correctly
GraphsResponse and GraphResponse correctly inherit from BaseResponse, handle single graph objects and lists of graphs with a consistent pattern, and provide complete, correct serialization methods.
```python
class BuildSemanticIndex:
    def __init__(self, embedding: BaseEmbedding):
        self.index_dir = str(os.path.join(resource_path, huge_settings.graph_name, "graph_vids"))
        self.index_dir_prop = str(os.path.join(resource_path, huge_settings.graph_name, "graph_props"))
        self.vid_index = VectorIndex.from_index_file(self.index_dir)
        self.prop_index = VectorIndex.from_index_file(self.index_dir_prop)
        self.embedding = embedding
        self.sm = SchemaManager(huge_settings.graph_name)
        self.client = PyHugeClient(
            url=huge_settings.graph_url,
            graph=huge_settings.graph_name,
            user=huge_settings.graph_user,
            pwd=huge_settings.graph_pwd,
            graphspace=huge_settings.graph_space,
        )
```
The test file duplicates the full business implementation, causing duplicate definitions and package import conflicts

BuildSemanticIndex already exists in the production module build_semantic_index.py; re-implementing the same class in full here leads to:

- Duplicate-code maintenance burden: any divergence between the two implementations causes behavioral drift that is hard to keep in sync.
- Package import conflicts: with `import hugegraph_llm.operators.index_op.test as ...`, this class may shadow the real implementation and trigger hard-to-detect runtime errors.
- Missing test intent: the file lives at operators/index_op/test.py but is not a unit test — it is a copy of business code, which violates test-directory conventions.

Recommendation: delete this copy of the business implementation immediately and keep only genuine unit-test code, or simply `from ...build_semantic_index import BuildSemanticIndex` in the tests.
```diff
-# 整个 BuildSemanticIndex 复制体
-class BuildSemanticIndex:
-    ...
+# TODO: 编写针对 BuildSemanticIndex 的单元测试,而不是复制实现
+import pytest
+from hugegraph_llm.operators.index_op.build_semantic_index import BuildSemanticIndex
+
+
+def test_diff_property_sets_addition():
+    # 示例:验证新增属性集能被正确识别
+    build = BuildSemanticIndex(embedding=None)  # 可使用 mock Embedding
+    to_add, *_ = build.diff_property_sets(
+        {"v1": frozenset({("k", "v1")})},
+        {},
+    )
+    assert to_add == [("v1", frozenset({("k", "v1")}))]
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py between lines 42
and 57, the BuildSemanticIndex class is fully reimplemented, duplicating the
business logic already defined in build_semantic_index.py. This causes code
duplication, import conflicts, and breaks testing conventions. To fix this,
remove the entire BuildSemanticIndex class implementation from this test file
and instead import BuildSemanticIndex from the original build_semantic_index
module. Then, keep only proper unit test code in this test.py file.
```python
client = PyVermeerClient(
    ip="10.41.57.139",
    port=8688,
    token="Q7svB13nYvREB4bDCj7kQnwJEMvLgrgfDimu4h1Fp7CUzQLk758ya1EYwycn1kjbgskiHiKzDni9jEkJcssgTy7rZJdt4gYEkfvjeowZGzSebgiSEU86dgFPXzUUtwrA81vWKm1xfioBcS9GmXjGQoM6C",
    log_level="DEBUG",
)
```
Severe security issue: hard-coded API token

The code contains a hard-coded API token, which is a serious security risk. Real credentials should never be exposed, even in demo code.

Apply the following fix:
def main():
"""main"""
+ # 从环境变量或配置文件读取敏感信息
+ import os
client = PyVermeerClient(
- ip="10.41.57.139",
- port=8688,
- token="Q7svB13nYvREB4bDCj7kQnwJEMvLgrgfDimu4h1Fp7CUzQLk758ya1EYwycn1kjbgskiHiKzDni9jEkJcssgTy7rZJdt4gYEkfvjeowZGzSebgiSEU86dgFPXzUUtwrA81vWKm1xfioBcS9GmXjGQoM6C",
+ ip=os.getenv("VERMEER_IP", "localhost"),
+ port=int(os.getenv("VERMEER_PORT", "8688")),
+ token=os.getenv("VERMEER_TOKEN", "your-token-here"),
log_level="DEBUG",
)🧰 Tools
🪛 Gitleaks (8.26.0)
16-16: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/demo/task_demo.py around lines 13 to 18,
the API token is hardcoded, posing a serious security risk. Remove the hardcoded
token and instead load it securely from environment variables or a configuration
file. Update the code to read the token dynamically at runtime, ensuring no
sensitive credentials are exposed in the source code.
```python
class VermeerConfig:
    """The configuration of a Vermeer instance."""
    ip: str
    port: int
    token: str
    factor: str
    username: str
    graph_space: str

    def __init__(self,
                 ip: str,
                 port: int,
                 token: str,
                 timeout: tuple[float, float] = (0.5, 15.0)):
        """Initialize the configuration for a Vermeer instance."""
        self.ip = ip
        self.port = port
        self.token = token
        self.timeout = timeout
```
🛠️ Refactor suggestion
Design improvement: ensure every declared attribute is actually initialized

The current implementation has several problems:

- The `factor`, `username`, and `graph_space` attributes are declared but never initialized in the constructor, which can lead to `AttributeError`.
- The `timeout` attribute is set in the constructor but not declared at class level.

Suggested improvement:
class VermeerConfig:
"""The configuration of a Vermeer instance."""
ip: str
port: int
token: str
+ timeout: tuple[float, float]
factor: str
username: str
graph_space: str
def __init__(self,
ip: str,
port: int,
token: str,
+ factor: str = "",
+ username: str = "",
+ graph_space: str = "",
timeout: tuple[float, float] = (0.5, 15.0)):
"""Initialize the configuration for a Vermeer instance."""
self.ip = ip
self.port = port
self.token = token
+ self.factor = factor
+ self.username = username
+ self.graph_space = graph_space
self.timeout = timeout📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class VermeerConfig: | |
| """The configuration of a Vermeer instance.""" | |
| ip: str | |
| port: int | |
| token: str | |
| factor: str | |
| username: str | |
| graph_space: str | |
| def __init__(self, | |
| ip: str, | |
| port: int, | |
| token: str, | |
| timeout: tuple[float, float] = (0.5, 15.0)): | |
| """Initialize the configuration for a Vermeer instance.""" | |
| self.ip = ip | |
| self.port = port | |
| self.token = token | |
| self.timeout = timeout | |
| class VermeerConfig: | |
| """The configuration of a Vermeer instance.""" | |
| ip: str | |
| port: int | |
| token: str | |
| timeout: tuple[float, float] | |
| factor: str | |
| username: str | |
| graph_space: str | |
| def __init__(self, | |
| ip: str, | |
| port: int, | |
| token: str, | |
| factor: str = "", | |
| username: str = "", | |
| graph_space: str = "", | |
| timeout: tuple[float, float] = (0.5, 15.0)): | |
| """Initialize the configuration for a Vermeer instance.""" | |
| self.ip = ip | |
| self.port = port | |
| self.token = token | |
| self.factor = factor | |
| self.username = username | |
| self.graph_space = graph_space | |
| self.timeout = timeout |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/utils/vermeer_config.py lines 8 to 26,
the class VermeerConfig declares attributes factor, username, and graph_space
but does not initialize them in the constructor, and timeout is initialized in
the constructor but not declared at the class level. To fix this, add factor,
username, graph_space, and timeout to the class attribute declarations and
initialize all of them properly in the __init__ method, either by adding
parameters for them or setting default values to avoid AttributeError.
```python
def get_graph(self, graph_name: str) -> GraphResponse:
    """获取任务列表"""
    response = self._send_request(
        "GET",
        f"/graphs/{graph_name}"
    )
    return GraphResponse(response)

def get_graphs(self) -> GraphsResponse:
    """获取任务列表"""
    response = self._send_request(
        "GET",
        "/graphs",
    )
    return GraphsResponse(response)
```
Fix: incorrect method docstrings

Both methods' docstrings incorrectly say "获取任务列表" (get task list), but these methods actually retrieve graph-related information.

Apply the following changes to correct the docstrings:
def get_graph(self, graph_name: str) -> GraphResponse:
- """获取任务列表"""
+ """获取指定图的信息"""
response = self._send_request(
"GET",
f"/graphs/{graph_name}"
)
return GraphResponse(response)
def get_graphs(self) -> GraphsResponse:
- """获取任务列表"""
+ """获取图列表"""
response = self._send_request(
"GET",
"/graphs",
)
return GraphsResponse(response)📝 Committable suggestion
| def get_graph(self, graph_name: str) -> GraphResponse: | |
| """获取任务列表""" | |
| response = self._send_request( | |
| "GET", | |
| f"/graphs/{graph_name}" | |
| ) | |
| return GraphResponse(response) | |
| def get_graphs(self) -> GraphsResponse: | |
| """获取任务列表""" | |
| response = self._send_request( | |
| "GET", | |
| "/graphs", | |
| ) | |
| return GraphsResponse(response) | |
| def get_graph(self, graph_name: str) -> GraphResponse: | |
| """获取指定图的信息""" | |
| response = self._send_request( | |
| "GET", | |
| f"/graphs/{graph_name}" | |
| ) | |
| return GraphResponse(response) | |
| def get_graphs(self) -> GraphsResponse: | |
| """获取图列表""" | |
| response = self._send_request( | |
| "GET", | |
| "/graphs", | |
| ) | |
| return GraphsResponse(response) |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/api/graph.py around lines 14 to 28, the
docstrings for both get_graph and get_graphs methods incorrectly state "获取任务列表"
which means "get task list". Update the docstrings to accurately describe their
functionality: for get_graph, change it to "获取指定名称的图信息" (get graph information
by name), and for get_graphs, change it to "获取所有图的信息列表" (get list of all graph
information).
```python
return outputs_dict  # Early exit if gremlin_generate returned an error tuple

match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results
```
🛠️ Refactor suggestion
Add a safety check before unpacking

Unpacking the tuple directly can fail in error cases. Add a length check to be safe.
+ # Ensure we have the expected 5-tuple result
+ if not isinstance(original_results, tuple) or len(original_results) != 5:
+ return {
+ "error": True,
+ "error_message": f"Unexpected result format from gremlin_generate: {type(original_results)}",
+ "requested_outputs": list(requested_outputs_set)
+ }
+
match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results📝 Committable suggestion
| match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results | |
| # Ensure we have the expected 5-tuple result | |
| if not isinstance(original_results, tuple) or len(original_results) != 5: | |
| return { | |
| "error": True, | |
| "error_message": f"Unexpected result format from gremlin_generate: {type(original_results)}", | |
| "requested_outputs": list(requested_outputs_set) | |
| } | |
| match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results |
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py at line
262, the code directly unpacks the tuple original_results into multiple
variables without verifying its length, which can cause errors if the tuple is
shorter than expected. Add a length check before unpacking to ensure
original_results contains the expected number of elements, and handle the case
where it does not to prevent runtime exceptions.
```python
graphspae: Optional[str] = Query(
    huge_settings.graph_space,
    description="graph space."
)
```
Misspelled field name

The field name `graphspae` should be `graphspace`.

Correct the field name:
- graphspae: Optional[str] = Query(
+ graphspace: Optional[str] = Query(
huge_settings.graph_space,
description="graph space."
)📝 Committable suggestion
| graphspae: Optional[str] = Query( | |
| huge_settings.graph_space, | |
| description="graph space." | |
| ) | |
| graphspace: Optional[str] = Query( | |
| huge_settings.graph_space, | |
| description="graph space." | |
| ) |
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py around lines 133
to 136, the field name is misspelled as 'graphspae'. Rename the field to
'graphspace' to correct the spelling error.
```python
url = f"http://{self._cfg.ip}:{self._cfg.port}/"
return urljoin(url, path).strip("/")
```
🛠️ Refactor suggestion
URL construction carries a security risk

The combination of `urljoin` and `strip` can produce unexpected URLs. If `path` starts with '/', `urljoin` discards the path portion of the base URL.

Improve the URL construction logic:
def resolve(self, path: str):
"""
Resolve the path to a full URL.
"""
url = f"http://{self._cfg.ip}:{self._cfg.port}/"
- return urljoin(url, path).strip("/")
+ # Ensure path doesn't start with '/' to avoid urljoin issues
+ path = path.lstrip('/')
+ return urljoin(url, path)📝 Committable suggestion
| url = f"http://{self._cfg.ip}:{self._cfg.port}/" | |
| return urljoin(url, path).strip("/") | |
| def resolve(self, path: str): | |
| """ | |
| Resolve the path to a full URL. | |
| """ | |
| url = f"http://{self._cfg.ip}:{self._cfg.port}/" | |
| # Ensure path doesn't start with '/' to avoid urljoin issues | |
| path = path.lstrip('/') | |
| return urljoin(url, path) |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py around lines 72
to 73, the current URL construction using urljoin combined with strip("/") can
cause incorrect URLs if path starts with a slash, as urljoin will ignore the
base URL's path. To fix this, avoid using urljoin with paths starting with '/',
and instead safely concatenate the base URL and path ensuring exactly one '/'
between them without stripping characters, preserving the intended URL
structure.
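The `urljoin` pitfall the comment describes is easy to reproduce with a standalone snippet (the base URL below is hypothetical; the project's actual base has no path segment, but one may be added later):

```python
from urllib.parse import urljoin

base = "http://10.41.57.139:8688/api/"  # hypothetical base with a path segment

# A relative path is appended to the base path, as intended.
assert urljoin(base, "tasks") == "http://10.41.57.139:8688/api/tasks"

# A leading '/' makes urljoin discard the base's '/api/' segment.
assert urljoin(base, "/tasks") == "http://10.41.57.139:8688/tasks"

# Stripping the leading slash first keeps the base path intact.
assert urljoin(base, "/tasks".lstrip("/")) == "http://10.41.57.139:8688/api/tasks"
```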
```python
) -> dict:
    """request"""
    try:
        log.debug(f"Request made to {path} with params {json.dumps(params)}")
```
Risk of leaking sensitive information in logs

Logging request parameters at debug level may expose sensitive data such as authentication tokens or personal information.

Avoid logging sensitive parameters:

```diff
-        log.debug(f"Request made to {path} with params {json.dumps(params)}")
+        log.debug(f"Request made to {path}")
```

Or implement parameter filtering:
+ def _safe_log_params(self, params):
+ """Remove sensitive information from params for logging"""
+ if not params:
+ return params
+ safe_params = {k: v for k, v in params.items()
+ if k not in ['token', 'password', 'api_key']}
+ return safe_params
+
- log.debug(f"Request made to {path} with params {json.dumps(params)}")
+ log.debug(f"Request made to {path} with params {json.dumps(self._safe_log_params(params))}")📝 Committable suggestion
| log.debug(f"Request made to {path} with params {json.dumps(params)}") | |
| # Add this helper in the class (e.g. just above the method that makes the request) | |
| def _safe_log_params(self, params): | |
| """Remove sensitive information from params for logging""" | |
| if not params: | |
| return params | |
| safe_params = { | |
| k: v | |
| for k, v in params.items() | |
| if k not in ["token", "password", "api_key"] | |
| } | |
| return safe_params | |
| # …later in the same class, replace the logging call: | |
| log.debug(f"Request made to {path} with params {json.dumps(self._safe_log_params(params))}") |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py at line 89, the
debug log outputs request parameters which may contain sensitive information
like authentication tokens or personal data. Modify the logging to either
exclude sensitive parameters entirely or implement a filtering mechanism that
redacts or removes sensitive fields before logging. This ensures sensitive data
is not exposed in logs while still providing useful debug information.
```python
'state': self.__state,
'create_user': self.__create_user,
'create_type': self.__create_type,
'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__start_time else '',
```
Fix the incorrect condition in the time formatting

When formatting `create_time`, the condition mistakenly checks `self.__start_time` instead of `self.__create_time`.
- 'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__start_time else '',
+ 'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__create_time else '',📝 Committable suggestion
```diff
-            'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__start_time else '',
+            'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__create_time else '',
```
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/structure/task_data.py at line 123, the
condition for formatting 'create_time' incorrectly checks 'self.__start_time'
instead of 'self.__create_time'. Fix this by changing the condition to check
'self.__create_time' before calling strftime to ensure the correct attribute is
used for formatting.
Actionable comments posted: 0

♻️ Duplicate comments (4)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (4)

25-25: Thread-safety issue still unresolved

The global dictionary API_CALL_TRACKER still has a thread-safety problem, which was pointed out in a previous review but has not been addressed. In a concurrent environment, multiple requests may modify this dictionary at the same time, causing a data race.

28-28: Remove the unnecessary pylint comment

The function vector_http_api contains only a few statements and does not need the # pylint: disable=too-many-statements comment. It appears to have been copied from another file and should be removed. Apply this change:

```diff
-# pylint: disable=too-many-statements
 def vector_http_api(router: APIRouter, update_embedding_func):
```

31-31: Add input validation to improve API robustness

The daily_limit parameter lacks input validation; this was pointed out in a previous review but has not been addressed. The parameter should be validated as a positive integer to prevent problems caused by negative or zero values.

43-43: Correct the function name in the log message

The log message uses the wrong function name update_vid_embedding; this was pointed out in a previous review but has not been addressed. The correct name update_embedding_api should be used.

🧹 Nitpick comments (1)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

49-51: Make the API design more flexible

The current API design lacks flexibility: update_embedding_func() is called without any arguments. If request data needs to be passed in the future, the current design will be hard to extend. Consider the following change:

```diff
 @router.post("/vector/embedding", status_code=status.HTTP_200_OK)
-def update_embedding_api(daily_limit: int = 50):
+def update_embedding_api(request_data: dict = None, daily_limit: int = 50):
     """
     Updates the vector embedding.
-    This endpoint is rate-limited. By default, it allows 2 calls per day. (Note: Not Thread-Safe!)
+    This endpoint is rate-limited. By default, it allows 50 calls per day. (Note: Not Thread-Safe!)
     The rate limit is tracked per day and resets at midnight.
     """
     # ... rate limiting logic ...
-    result = update_embedding_func()
+    result = update_embedding_func(request_data) if request_data else update_embedding_func()
     result = {"detail": result}
     return result
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-python-client/src/pyhugegraph/client.py (2 hunks)
- hugegraph-python-client/src/pyhugegraph/utils/huge_config.py (1 hunks)
- hugegraph-python-client/src/pyhugegraph/utils/huge_requests.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hugegraph-python-client/src/pyhugegraph/client.py (1)
hugegraph-python-client/src/pyhugegraph/utils/huge_config.py (1)
HGraphConfig(30-75)
🔇 Additional comments (8)
hugegraph-python-client/src/pyhugegraph/utils/huge_config.py (1)
33-35: 身份验证字段更新合理。新增的token字段和可选的username/password字段设计得当,支持了灵活的身份验证方式。字段类型定义正确,默认值设置合适。
hugegraph-python-client/src/pyhugegraph/utils/huge_requests.py (2)
53-56: Token初始化逻辑正确。简洁明了地从配置中初始化token属性,逻辑清晰易懂。
147-155: HTTP请求身份验证实现得当。正确地实现了Bearer token和基本认证的切换逻辑:
- 复制headers避免修改类属性
- 按HTTP标准添加Authorization Bearer头
- 使用token时正确地将auth设为None避免冲突
- 符合REST API身份验证最佳实践
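The header-selection logic the bullets above describe can be sketched as follows; the function name and tuple-based basic-auth return value are illustrative assumptions, not the actual huge_requests API:

```python
def build_auth(headers, token=None, user=None, pwd=None):
    """Return (headers, auth) for a request; a Bearer token wins over basic auth."""
    headers = dict(headers)  # copy so the shared class-level headers are not mutated
    if token:
        headers["Authorization"] = f"Bearer {token}"
        return headers, None           # auth=None avoids sending conflicting credentials
    return headers, (user, pwd)        # fall back to HTTP basic auth
```

Returning `auth=None` alongside the Bearer header mirrors the behavior the review praises: a token and basic-auth credentials are never sent together.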
hugegraph-python-client/src/pyhugegraph/client.py (4)

32-32: Logging import added correctly. The import supports the authentication warning message and matches the code's needs.

54-56: The constructor parameter updates are reasonable. Making user/pwd optional and adding the token parameter supports flexible authentication. The type annotations are correct and the parameter design follows API best practices.

60-67: The authentication parameter validation is thorough:

- Ensures either a token or a username/password pair is provided
- Warns when both authentication methods are supplied
- The error messages are clear, giving a good user experience

69-77: The configuration object initialization is clean. Passing explicit keyword arguments to HGraphConfig improves readability and maintainability. All parameters are mapped correctly and backward compatibility is preserved.

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

34-34: ⚠️ Potential issue — Correct the default value in the docstring

The docstring says it "allows 2 calls per day" by default, but the actual default is 50. This misleads API users.

Apply this change to fix the docstring:

```diff
-    This endpoint is rate-limited. By default, it allows 2 calls per day. (Note: Not Thread-Safe!)
+    This endpoint is rate-limited. By default, it allows 50 calls per day. (Note: Not Thread-Safe!)
```

Likely an incorrect or invalid review comment.
Actionable comments posted: 2

♻️ Duplicate comments (4)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

121-122: Gremlin query execution lacks error handling

As pointed out in a previous comment, the Gremlin query execution has no exception handling. Adding a try/except block would improve robustness.

```diff
-        result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
-        results.extend(result)
+        try:
+            result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
+            results.extend(result)
+        except Exception as e:
+            log.error("Gremlin query failed for label %s: %s", label, e)
+            raise
```

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

30-32: Fix the misused pylint disable comment

The function contains only a few statements; the # pylint: disable=too-many-statements comment is unnecessary and may have been copied from another file.

hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1)

42-243: ⚠️ Potential issue — Avoid duplicating business logic in a test file

This file lives in a test directory but contains a full business implementation of the BuildSemanticIndex class, duplicating the implementation in build_semantic_index.py. This practice causes several problems:

- Code duplication: maintaining two similar implementations easily leads to inconsistency
- Unclear test intent: a test file should contain unit tests, not business logic
- Import conflict risk: it may shadow the real business implementation

Delete the business implementation here and write real unit tests instead:

```diff
-class BuildSemanticIndex:
-    # delete the entire class implementation
-    ...
+import pytest
+from hugegraph_llm.operators.index_op.build_semantic_index import BuildSemanticIndex
+
+def test_build_semantic_index():
+    # write unit tests against BuildSemanticIndex
+    pass
```

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (1)

89-102: Exact property matching still has a performance problem

Although the earlier correctness issues are resolved (cached schema, correct property-name extraction, hashable tuples), the O(n×m) query inefficiency remains: a separate Gremlin query is issued for every property key and every keyword, producing a large number of network requests when there are many property keys.

Suggested query-strategy optimizations:

- Pre-filter property keys that can plausibly match (e.g. properties with a Text value type)
- Merge multiple keywords into a single query
- Consider batched queries to reduce network overhead

```diff
-        for key in property_keys:
-            for keyword in list(unmatched_keywords):
-                gremlin_query = f"g.V().has('{key.name}', '{keyword}').limit(1)"
+        # Filter text-typed property keys
+        text_property_keys = [key for key in property_keys if key.dataType == 'TEXT']
+        for key in text_property_keys:
+            # Query multiple keywords at once
+            keywords_str = "', '".join(list(unmatched_keywords))
+            gremlin_query = f"g.V().has('{key.name}', within('{keywords_str}')).valueMap('{key.name}').limit(10)"
```
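The merged-keyword query in the diff above can be sketched as a small query builder; the quote escaping shown is an assumption (a real client may offer safer parameter bindings):

```python
def build_within_query(prop_name, keywords, limit=10):
    """Build one Gremlin query matching any of the keywords on a single property."""
    # Escape single quotes so keyword values cannot break out of the string literal.
    escaped = [k.replace("'", "\\'") for k in keywords]
    keywords_str = "', '".join(escaped)
    return (
        f"g.V().has('{prop_name}', within('{keywords_str}'))"
        f".valueMap('{prop_name}').limit({limit})"
    )
```

This collapses the per-keyword round trips into one request per property key, which is the core of the optimization the comment asks for.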
🧹 Nitpick comments (2)

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (1)

24-24: Remove the unused import

Static analysis detected that huge_settings is imported but never used in this file; remove the unused import to keep the code clean.

Apply this change:

```diff
 from hugegraph_llm.config import prompt
-from hugegraph_llm.config import huge_settings
```

🧰 Tools
🪛 Ruff (0.11.9)
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (1)

114-126: The data-reorganization method works well. It correctly handles mixed data types and deduplicates, with clear logic. Per the static-analysis hint, the nested if statements can be simplified.

```diff
-        elif isinstance(item, tuple):
-            if len(item) == 2:
-                unique_tuples.add(item)
+        elif isinstance(item, tuple) and len(item) == 2:
+            unique_tuples.add(item)
```

🧰 Tools
🪛 Ruff (0.11.9)
123-124: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (7 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1)
hugegraph-python-client/src/pyhugegraph/client.py (2)
PyHugeClient(49-120)graph(88-89)
🪛 Ruff (0.11.9)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py
123-124: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
🔇 Additional comments (19)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (6)

33-40: The Gremlin query template is correct and includes the necessary safety limit. The INDEX_PROPERTY_GREMLIN template is well designed: limit(100000) prevents querying an oversized dataset, and the project structure cleanly separates vertex IDs from property data. The query logic matches the requirements.

45-47: The property-index directory and index object are initialized correctly. The new graph_props directory and its VectorIndex instance follow the same naming and structural pattern as the existing vertex-ID index.

50-57: HugeGraph client initialization supports token authentication. The client initialization correctly adds the token parameter, consistent with the token-based authentication mechanism across the codebase. The parameters are passed completely and follow the new authentication pattern.

195-197: Safety guard on the property count. The code correctly checks the property count against the 100,000 limit, an important safeguard against performance problems from processing too many properties. The returned warning message is clear.

203-203: The async embedding-generation call is correct. Following the fix from earlier comments, asyncio.run() is now used correctly to invoke the async _get_embeddings_parallel method, resolving the previous async-call issue.

214-231: The refactored run method is well structured. The refactor splits the complex logic into the two helpers _update_vid_index and _update_property_index, greatly improving readability and maintainability. The control flow is clear and the conditions are reasonable.

hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py (1)

46-46: Token authentication parameter added correctly. The get_hg_client() function correctly adds token=huge_settings.graph_token, consistent with PyHugeClient initialization elsewhere in the project and supporting the new token-based authentication.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1)

31-31: Token authentication added correctly in SchemaManager. The SchemaManager constructor correctly adds token=huge_settings.graph_token, consistent with other PyHugeClient instantiations across the codebase and keeping the authentication mechanism unified.

hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (1)

30-30: Token configuration field defined correctly. The new graph_token: Optional[str] = None field in HugeGraphConfig has the right type annotation and default. It provides the configuration basis for token authentication across the system and matches the pattern of the other optional authentication parameters.

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2)

29-32: Token authentication support is implemented well. Making user and pwd optional with empty-string defaults while adding a token field cleanly supports token-based authentication, consistent with the authentication upgrade across the codebase.

121-147: The new Gremlin-generation request model is well designed. The GremlinOutputType enum and GremlinGenerateRequest model provide flexible options for selective Gremlin generation, supporting combinations of output types; the API design is thorough.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (8)

20-20: The extended type imports are good. The new imports support the property-matching implementation, with clear type definitions.

41-43: Property vector index initialized correctly. A dedicated vector index for property semantic search, separate from the vertex-ID index, is a sound design.

54-54: Token authentication improves security. Adding the token parameter strengthens the security of the graph-database connection.

57-57: Schema caching improves performance. Caching the schema object avoids repeated network calls, a significant performance win.

82-83: Batched embedding calls improve performance. Switching from per-text embedding calls to a batch call reduces the number of model invocations and improves efficiency.

104-112: Property fuzzy matching implemented correctly. The logic mirrors vertex-ID fuzzy matching, using the dedicated property vector index for semantic search.

132-132: Query-vector generation optimized. Consistent with the keyword handling, the batch embedding call improves performance.

146-157: Property-match integration is complete. Property-matching results are correctly integrated into the context, supporting combined exact and fuzzy matches with clear logic. Checking for the presence of the index label decides whether property matching runs at all, avoiding unnecessary computation.
```diff
         graph_token: Optional[str] = huge_settings.graph_token,
         graph_space: Optional[str] = huge_settings.graph_space,
     ):
         self.client = PyHugeClient(url=graph_url, graph=graph_name, user=graph_user, pwd=graph_pwd,
-                                   graphspace=graph_space)
+                                   token=huge_settings.graph_token, graphspace=graph_space)
```
Fix the inconsistent parameter usage

The constructor defines a graph_token parameter, but when creating the PyHugeClient instance it uses huge_settings.graph_token directly, so the passed-in argument is ignored.

Apply this change to fix the parameter usage:

```diff
         self.client = PyHugeClient(url=graph_url, graph=graph_name, user=graph_user, pwd=graph_pwd,
-                                   token=huge_settings.graph_token, graphspace=graph_space)
+                                   token=graph_token, graphspace=graph_space)
```

📝 Committable suggestion

```suggestion
        self.client = PyHugeClient(url=graph_url, graph=graph_name, user=graph_user, pwd=graph_pwd,
                                   token=graph_token, graphspace=graph_space)
```
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/indices/graph_index.py around lines 32 to 36,
the constructor defines a graph_token parameter but the PyHugeClient instance is
created using huge_settings.graph_token directly, ignoring the passed parameter.
Fix this by replacing huge_settings.graph_token with the graph_token parameter
when calling PyHugeClient, ensuring the passed argument is used consistently.
```python
@router.post("/vector/embedding", status_code=status.HTTP_200_OK)
def update_embedding_api(
    daily_limit: int = 50,
    graph_config: Optional[GraphConfigRequest] = Body(None)
```
Fix the mutable-default problem in the function parameter

Using Body(None) as a parameter default can cause unexpected behavior, because the default is evaluated once at function-definition time rather than on each call.

Apply this change to fix the problem:

```diff
 def update_embedding_api(
     daily_limit: int = 50,
-    graph_config: Optional[GraphConfigRequest] = Body(None)
+    graph_config: Optional[GraphConfigRequest] = None
 ):
+    if graph_config is None:
+        graph_config = Body(None)
```

Or, better, declare the default explicitly in the signature:

```diff
 def update_embedding_api(
     daily_limit: int = 50,
-    graph_config: Optional[GraphConfigRequest] = Body(None)
+    graph_config: Optional[GraphConfigRequest] = Body(default=None)
 ):
```

📝 Committable suggestion
🧰 Tools
🪛 Ruff (0.11.9)
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/vector_api.py at line 36, the function
parameter uses Body(None) as a default value, which is mutable and evaluated
only once at function definition, potentially causing unexpected behavior. To
fix this, remove the default value from the function signature so that the
parameter does not have a mutable default, or alternatively, use None as the
default and assign Body() inside the function body if needed.
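The "evaluated once at definition time" pitfall described here is general Python behavior rather than anything FastAPI-specific (FastAPI actually treats `Body(...)` defaults as declarative metadata, so the Ruff B008 warning is mainly a linting concern); a minimal plain-Python demonstration:

```python
def append_to(value, bucket=[]):   # anti-pattern: the list default is created ONCE
    bucket.append(value)
    return bucket

def append_to_fixed(value, bucket=None):   # the usual fix: a None sentinel
    if bucket is None:
        bucket = []                # a fresh list on every call
    bucket.append(value)
    return bucket
```

Successive calls to `append_to` keep appending to the same shared list, while `append_to_fixed` starts from an empty list each time.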
Actionable comments posted: 0

♻️ Duplicate comments (3)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (3)

49-49: [Duplicate issue] Correct the function name in the log message

The function name in the log message does not match the actual endpoint, consistent with the previous review comment.

```diff
-    log.error("Rate limit exceeded for update_vid_embedding. Maximum %d calls per day.", daily_limit)
+    log.error("Rate limit exceeded for update_embedding_api. Maximum %d calls per day.", daily_limit)
```

28-28: ⚠️ Potential issue — [Duplicate issue] Make the rate limiting thread-safe

The global dictionary API_CALL_TRACKER has a thread-safety problem in concurrent environments, consistent with the previous review comment. A thread-safe implementation is recommended:

```diff
+import threading
+from collections import defaultdict
+
-API_CALL_TRACKER = {}
+API_CALL_TRACKER = defaultdict(int)
+TRACKER_LOCK = threading.Lock()
```

Then protect the rate-limit logic with the lock:

```diff
+    with TRACKER_LOCK:
         today = date.today()
         for call_date in list(API_CALL_TRACKER.keys()):
             if call_date != today:
                 del API_CALL_TRACKER[call_date]
         call_count = API_CALL_TRACKER.get(today, 0)
         if call_count >= daily_limit:
             # ... error handling
         API_CALL_TRACKER[today] = call_count + 1
```
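The lock-protected tracker can also be packaged as a small self-contained helper; the class name and method below are illustrative, not part of the reviewed code:

```python
import threading
from datetime import date
from typing import Optional

class DailyRateLimiter:
    """Thread-safe per-day call counter; the count resets when the date changes."""

    def __init__(self, daily_limit: int):
        self._daily_limit = daily_limit
        self._lock = threading.Lock()
        self._day = date.today()
        self._count = 0

    def try_acquire(self, today: Optional[date] = None) -> bool:
        """Count the call and return True if under today's limit, else False."""
        today = today or date.today()
        with self._lock:
            if today != self._day:      # new day: reset the counter
                self._day = today
                self._count = 0
            if self._count >= self._daily_limit:
                return False
            self._count += 1
            return True
```

Encapsulating the lock and counter together avoids scattering `with TRACKER_LOCK:` blocks through the endpoint code; the `today` parameter exists only to make the reset behavior easy to test.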
36-36: ⚠️ Potential issue — [Duplicate issue] Fix the mutable default in the function parameter

Body(None) as a default causes unexpected behavior, because the default is evaluated once at function-definition time.

```diff
-    graph_config: Optional[GraphConfigRequest] = Body(None)
+    graph_config: Optional[GraphConfigRequest] = Body(default=None)
```

🧰 Tools
🪛 Ruff (0.11.9)
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)

🧹 Nitpick comments (2)

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (1)

24-24: Remove the unused import

Static analysis detected that huge_settings is imported but not used in this file; remove the unused import to keep the code clean.

```diff
-from hugegraph_llm.config import huge_settings
```

🧰 Tools
🪛 Ruff (0.11.9)
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

38-42: Correct the default limit in the docstring

The docstring says it "allows 2 calls per day" by default, but the actual default is 50. Update the docstring for consistency.

```diff
-    This endpoint is rate-limited. By default, it allows 2 calls per day. (Note: Not Thread-Safe!)
+    This endpoint is rate-limited. By default, it allows 50 calls per day. (Note: Not Thread-Safe!)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py
🧰 Additional context used
🪛 Ruff (0.11.9)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
🔇 Additional comments (6)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (4)
18-19: LGTM - 新导入项支持新功能新增的
List和Enum导入支持了后续定义的GremlinOutputType枚举和GremlinGenerateRequest模型,导入合理。
28-32: GraphConfigRequest 模型字段更新合理模型字段的更新支持了更灵活的认证方式:
name重命名为graph更符合语义user和pwd设为可选,默认为空字符串- 新增
token字段支持基于令牌的认证这些更改与代码库中其他地方的令牌认证支持保持一致。
121-127: GremlinOutputType 枚举定义清晰枚举很好地定义了 Gremlin 生成结果的不同输出类型,命名清晰且具有描述性。这将为选择性 Gremlin 生成提供良好的类型安全性。
128-146: GremlinGenerateRequest 模型设计良好模型设计合理,包含了所有必要的参数:
- 查询文本和示例数量
- 可选的提示模板和客户端配置
- 输出类型列表支持选择性结果返回
默认值设置合理,文档描述清晰。
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)
55-61: 图配置更新逻辑正确配置更新逻辑正确地将请求中的图配置参数映射到全局设置,包括新增的令牌字段。这与代码库中其他地方的令牌认证支持保持一致。
63-65: 嵌入更新和响应包装合理函数调用嵌入更新功能并将结果包装在
detail字段中返回,这符合 API 响应的常见模式。
6a76d6c to fe94881 Compare
Summary by CodeRabbit

- New features
- Enhancements
- Documentation updates
- New client SDK