property embeddings #5
base: main
Conversation
Change-Id: I31e9532990c2e7e09ddf518ac9c35cdf996a5a25
doc: update repo & url link for inner repo
Change-Id: Ieac00904c4ab793de87c1cae7324e3d2d7946884
doc: update repo & url link for inner repo (GraphPlatform-4190)
doc: update the link info
…ph/hugegraph-ai into master-icode
sync
Change-Id: Id2138dfecfbd377dd39ec51fa9d46b630e9ea88f
Change-Id: I8cb2139be06f63378f87bccd9035357a7cb7a91e
Change-Id: I07414cd8b418afabcb1f8a40454a679f10760bf8
Change-Id: I909b71df67bef316b8cef02010bcae9e91168f4e
Change-Id: I0d4ce8e5a227941ecb3e5972bec35a79e25a6263
Change-Id: I803b36fde6d0b0291dc6fe5cc40c37a946b354f0
Change-Id: I1e3f97afeaa853250966a66c096582339f3c51d6
Change-Id: I6107a5d63c9949ac5e47f93521aa248d0a8ec121
Change-Id: I56329c6aa219fd04198519ee1d1220e395fe82f2
Change-Id: Id93b59ca9a325f79623741d919a18d8f64c12ccd
Change-Id: I81c24fdabd84b92b7cc6e79e19ce3286700806f6
Change-Id: I799844222f194d54990b27d71770543ad6c707ca
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (1)
30-45: ⚠️ Potential issue: The new `properties` parameter lacks validation and formatting, which can easily break template rendering.

- `properties` is currently passed into `str.format` as-is; if `gremlin_prompt` does not handle the `{properties}` placeholder, a `KeyError` is raised.
- `properties` is a `List[tuple]`, so inserting it directly into the prompt yields Python's stringified form, which the LLM is unlikely to parse. Suggested fix:

```diff
+def _format_properties(self, props: Optional[List[tuple]]) -> Optional[str]:
+    if not props:
+        return None
+    return "\n".join(f"- {k}: {v}" for k, v in props)
```

Then use `properties=self._format_properties(self.properties)` in the `.format()` call, and either add a default placeholder to the template or check that the placeholder exists before formatting to avoid runtime exceptions.
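To illustrate the suggestion above, here is a minimal, self-contained sketch of both pieces: the tuple formatting and a placeholder-tolerant formatter. The names `format_properties` and `safe_format` are illustrative, not from the PR.

```python
from typing import List, Optional, Tuple

def format_properties(props: Optional[List[Tuple[str, str]]]) -> Optional[str]:
    # Render (key, value) tuples as a bullet list instead of Python's repr
    if not props:
        return None
    return "\n".join(f"- {k}: {v}" for k, v in props)

class _KeepMissing(dict):
    # Leave unknown placeholders untouched instead of raising KeyError
    def __missing__(self, key):
        return "{" + key + "}"

def safe_format(template: str, **kwargs) -> str:
    # Tolerates both extra kwargs and placeholders absent from kwargs
    return template.format_map(_KeepMissing(**kwargs))
```

With `safe_format`, a prompt template that lacks `{properties}` simply ignores the extra argument instead of failing at request time.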
♻️ Duplicate comments (7)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py (2)

26-29: ⚠️ Potential issue: `self.graph.schema()` still returns the class object; the instantiation issue is unresolved. A previous review noted that `PyHugeClient.schema()` returns the `SchemaManager` class rather than an instance; this change still assigns it directly, so later calls to instance methods will raise `AttributeError`. Please confirm the SDK behavior and, if necessary, instantiate explicitly here or fix the SDK.

54-63: ⚠️ Potential issue: Method objects are still written into the dict, breaking serialization. `baseType()`, `fields()` and similar are accessor methods; without the `()` you get `<bound method …>`. The earlier correction has not been applied, and JSON serialization will fail. Add parentheses to all accessor calls.

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)

25-25: Remove the unnecessary pylint-disable comment. As noted in a previous review, the `# pylint: disable=too-many-statements` comment is unnecessary for this simple function, which contains only a handful of statements.

23-23: ⚠️ Potential issue: The thread-safety problem is still unresolved. As noted in a previous review, the global dict `API_CALL_TRACKER` is not thread-safe; in a concurrent environment multiple requests can mutate it simultaneously, causing data races.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (2)
56-64: The cached schema is not reused, adding avoidable network overhead. The constructor already caches `self.schema` at line 56, yet `_exact_match_vids` calls `self._client.schema()` again. The repeated request adds noticeable latency; use `vertex_label_num = len(self.schema.getVertexLabels())` directly.

88-101: Exact-match on properties still has a wrong field access and a performance problem.
- `key.name` should be the method call `key.name()` (or the attribute `.name`; confirm against the SDK). The current form may return a bound method or an unexpected value, making the query fail.
- The O(n×m) Gremlin pattern sends one request per `property_key` × `keyword` pair, which can easily overwhelm the backend on large graphs. As suggested before: pre-filter to `Text`-typed properties and merge all keywords into a single query, or run one batch query and filter locally.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

88-105: `diff_property_sets` still lacks a docstring, and the semantics of its return values are unclear. This was raised before; without an explanation of what the four returned lists mean, callers can easily misuse them and corrupt the index. Add a complete docstring (Google/NumPy style).
🧹 Nitpick comments (7)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py (1)

54-67: Iterating over all IndexLabels in one pass may hurt performance on large graphs. If the graph has many IndexLabels, the full scan in `self.schema.getIndexLabels()` can be slow. Consider:
- paginating on demand or capping the number of results;
- paginating the display on the frontend/caller side;
- adding a `?detail=false` parameter that returns only the count.
This is an optional optimization, but important for large production graphs.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (1)

26-37: Confirm whether the long-running work behind the new vector endpoint needs async handling. `update_vid_embedding` may trigger vector computation and index writes, which take a long time. It currently runs on the HTTP request thread; FastAPI's default thread pool is limited, so other requests may be blocked. Consider:
- `BackgroundTasks` or an async queue such as Celery;
- or `asyncio.to_thread` inside `vector_http_api`.
This keeps API latency low and helps horizontal scaling.
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

30-45: The endpoint implementation is basically correct but inflexible. The rate-limit logic and error handling are implemented correctly; however, as noted in a previous review, the endpoint accepts no parameters, which limits future extensibility. Consider adding parameters:

```diff
-def update_embedding_api(daily_limit: int = 2):
+def update_embedding_api(request_data: dict = None, daily_limit: int = 2):
     # ... existing rate-limit logic ...
-    result = update_embedding_func()
+    result = update_embedding_func(request_data) if request_data else update_embedding_func()
```
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (1)

70-76: Extract the duplicated code into a shared helper to reduce maintenance cost. `raw_prompt` and `init_prompt` are nearly identical, differing only in `example`. A `_build_prompt(example)` helper would improve readability and prevent divergent edits.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (1)
113-125: The nested `if` can be merged. Ruff (SIM102) suggests a single condition, which slightly improves readability:

```diff
-elif isinstance(item, tuple):
-    if len(item) == 2:
-        unique_tuples.add(item)
+elif isinstance(item, tuple) and len(item) == 2:
+    unique_tuples.add(item)
```

🧰 Tools
🪛 Ruff (0.11.9)
122-123: Use a single `if` statement instead of nested `if` statements; combine `if` statements using `and` (SIM102)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

196-203: The hard-coded 100 000 threshold is not configurable, and the early `return` drops the vertex-index update results.
- The magic number 100 000 should live in a config file or environment variable so operators can tune it.
- When the property count exceeds the threshold the code returns immediately, but the vertex-index statistics were already written to `context`. If later callers rely on more fields (e.g. `match_props`) they will be missing; consider continuing the flow and skipping only the vectorization step.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (1)
216-235: Consider stronger error handling and performance tuning. The current property-query logic has no error handling, and executing property queries one by one may hurt performance. Suggestions:
- add exception handling around each Gremlin query
- consider batching queries to improve performance
- validate query results

```diff
-paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]
-log.debug("paths: %s", paths)
+try:
+    paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]
+    log.debug("paths: %s", paths)
+except Exception as e:
+    log.error("Failed to execute property query for %s=%s: %s", prop_name, prop_value, e)
+    continue
```
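As a sketch of the batching idea: all matched (name, value) pairs can be folded into a single traversal with an `or(...)` step, so only one round-trip and one failure handler are needed. The query shape below is illustrative, not the PR's actual `PROPERTY_QUERY_NEIGHBOR_TPL`.

```python
def build_batched_property_query(matched_props, edge_labels, max_deep, max_items):
    """Build one Gremlin query covering every (prop_name, prop_value) pair,
    instead of issuing a separate query per pair."""
    # One has() predicate per pair, OR-ed together at the start vertices
    predicates = ", ".join(
        f"has('{name}', '{value}')" for name, value in matched_props
    )
    return (
        f"g.V().or({predicates})"
        f".repeat(bothE({edge_labels}).otherV().simplePath())"
        f".times({max_deep}).path().limit({max_items})"
    )
```

A single try/except around the one `exec` call then replaces the per-property handler suggested in the diff above.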
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/fetch_graph_data.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (6 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py
122-123: Use a single `if` statement instead of nested `if` statements; combine `if` statements using `and` (SIM102)
🔇 Additional comments (12)
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2)

50-53: The property-index cleanup is implemented correctly. Support for clearing the "graph_props" index directory is added consistently with the existing vertex-ID and Gremlin-example index cleanup, and the log and UI messages were updated to reflect the new capability.

115-120: Details

✅ Verification successful

Verify context-key consistency. The function now conditionally returns property-vector update details based on the presence of `context["index_labels"]`. Make sure the related context keys (such as `removed_props_num` and `added_props_vector_num`) are set correctly everywhere in the codebase.

🏁 Script executed:

```shell
#!/bin/bash
# Verify where the property-related context keys are set and used
echo "Searching for assignments to removed_props_num:"
rg "removed_props_num.*=" -A 2 -B 2
echo -e "\nSearching for assignments to added_props_vector_num:"
rg "added_props_vector_num.*=" -A 2 -B 2
echo -e "\nSearching for assignments to index_labels:"
rg "index_labels.*=" -A 2 -B 2
```

Length of output: 4427
🏁 Script executed:

```shell
#!/bin/bash
# Inspect the relevant context keys and function definitions in graph_index_utils.py
echo "=== graph_index_utils.py: lines 1-200 ==="
sed -n '1,200p' hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
echo -e "\n=== graph_index_utils.py: lines 100-140 ==="
sed -n '100,140p' hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
# Find references to GraphIndexUtils across the codebase
echo -e "\n=== References to GraphIndexUtils ==="
rg "GraphIndexUtils" -n
```

Length of output: 8385

🏁 Script executed:

```shell
#!/bin/bash
# Verify that build_semantic_index.py writes removed_props_num and added_props_vector_num into context
echo "Find context writes of removed_props_num:"
rg 'context\["removed_props_num"\]' -n hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
echo -e "\nFind context writes of added_props_vector_num:"
rg 'context\["added_props_vector_num"\]' -n hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
echo -e "\nShow code around prop_index.remove:"
rg -n "prop_index.remove" -A5 -B5 hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
echo -e "\nShow code around to_index_file:"
rg -n "to_index_file" -A5 -B5 hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
```

Length of output: 5045

Context-key consistency confirmed; no change needed.
- `build_semantic_index.py` correctly writes `removed_props_num` and `added_props_vector_num` via `context.update({"removed_props_num": …, "added_props_vector_num": …})`.
- `index_labels` is set by the `fetch_graph_data` operator and injected into `context`.
No omissions or inconsistencies found; closing this suggestion.
hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1)

62-69: The method-signature extension is correct. The optional `properties` parameter for property-based query generation is added correctly; the `None` default preserves backward compatibility, and the parameter is passed to the `GremlinGenerateSynthesize` constructor properly.

hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (2)

221-222: The property-utilization instruction is a reasonable addition. The new instruction correctly directs the model to use specific property names or values mentioned in the user query, e.g. filtering with `has('propertyName', 'propertyValue')` or projecting specific values in the generated Gremlin.

236-237: The output-format extension is consistent. The new "Referenced Extracted Properties Related to the Query" section matches the format and style of the existing vertex-ID reference section. The tuple format `[('property_name', 'property_value'), ...]` is clear and easy for the model to understand and use.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)
208-210: Calling `asyncio.run` from a synchronous context may cause nested-event-loop errors. If an upper-level framework (e.g. FastAPI or Jupyter) already runs an event loop, `asyncio.run()` raises `RuntimeError: asyncio.run() cannot be called from a running event loop`.
Suggestions:
- make `run()` an async function and let the caller decide when to `await`;
- or use `await` / `asyncio.create_task` and require an outer coroutine context;
- if it must stay synchronous, detect `asyncio.get_running_loop()` and fall back to `loop.run_until_complete`.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (6)
56-56: The property-query template update is reasonable. The new template filters with `has('{current_prop_name}', '{current_prop_value}')` and projects `.values('{current_prop_name}')` instead of `.id()`, consistent with property-based matching. Also applies to: 68-69

132-132: The Gremlin-generation enhancement is correct. `match_props` is extracted from the context and passed to `gremlin_generate_synthesize`, enabling property-based Gremlin query generation. Also applies to: 138-138

164-164: The property-match initialization is reasonable. Matched properties are extracted from the context and `_prop_to_match` is correctly set to the first property name; the debug log helps trace the matching process. Also applies to: 169-171

240-240: Variable-usage check. The pylint comment on this line shows the developer is aware that `knowledge_with_degree` may be undefined; fix the variable-initialization issue in the code block above.

349-349: The node string formatting is correct. The conditional formatting handles both ID-based and property-based matches, using the vertex ID for ID matches and the full property map for property matches.

213-235: ⚠️ Potential issue: fix the type inconsistency and the undefined variables. The review flags two problems in this block:
- `graph_chain_knowledge` is initialized as `set()` and later mutated via `.update()`; the review claims this causes a type error because `set.update()` expects an iterable while `temp_graph_chain_knowledge` is itself a set.
- `vertex_degree_list` and `knowledge_with_degree` are never defined in this branch but are used at line 240, which can raise `NameError`.

Apply this fix:

```diff
 elif matched_props:
     # WARN: When will the query enter here?
     graph_chain_knowledge = set()
+    vertex_degree_list = []
+    knowledge_with_degree = {}
     for prop_name, prop_value in matched_props:
         self._prop_to_match = prop_name
         gremlin_query = PROPERTY_QUERY_NEIGHBOR_TPL.format(
             current_prop_name=prop_name,
             current_prop_value=prop_value,
             edge_labels=edge_labels_str,
             edge_limit=edge_limit_amount,
             max_deep=self._max_deep,
             max_items=self._max_items
         )
         log.warning("Unable to find vid, downgraded to property query, please confirm if it meets expectation.")
         log.debug("property gremlin: %s", gremlin_query)
         paths: List[Any] = self._client.gremlin().exec(gremlin=gremlin_query)["data"]
         log.debug("paths: %s", paths)
-        temp_graph_chain_knowledge, vertex_degree_list, knowledge_with_degree = self._format_graph_query_result(
+        temp_graph_chain_knowledge, temp_vertex_degree_list, temp_knowledge_with_degree = self._format_graph_query_result(
             query_paths=paths
         )
         graph_chain_knowledge.update(temp_graph_chain_knowledge)
+        vertex_degree_list.extend(temp_vertex_degree_list)
+        knowledge_with_degree.update(temp_knowledge_with_degree)
```

Likely an incorrect or invalid review comment.
```diff
 chunk_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "chunks")))
 graph_vid_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_vids")))
+graph_prop_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_props")))
 return json.dumps({
     "embed_dim": chunk_vector_index.index.d,
     "vector_info": {
         "chunk_vector_num": chunk_vector_index.index.ntotal,
         "graph_vid_vector_num": graph_vid_vector_index.index.ntotal,
-        "graph_properties_vector_num": len(chunk_vector_index.properties)
+        "graph_properties_vector_num": graph_prop_vector_index.index.ntotal,
     }
```
💡 Verification agent
🧩 Analysis chain

Add error handling and lazy loading to cope with missing index files and high memory usage.

The function currently loads three faiss index files synchronously on every call; if the graph_props file has not been generated yet, `VectorIndex.from_index_file()` raises and the endpoint returns 500. Loading all three indexes into memory also amplifies the one-off memory peak.

Suggestions:
- catch `FileNotFoundError` / `IOError` before loading and return 0, so the endpoint degrades gracefully;
- when only metadata is needed, consider reading a `*.json` descriptor file or a lightweight header read (e.g. something like `faiss.read_index_header`) to reduce I/O and memory.

Example fix (core idea; adapt to the VectorIndex implementation):
```diff
-chunk_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "chunks")))
-graph_vid_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_vids")))
-graph_prop_vector_index = VectorIndex.from_index_file(str(os.path.join(resource_path, huge_settings.graph_name, "graph_props")))
+def safe_load(name: str):
+    try:
+        return VectorIndex.from_index_file(
+            str(os.path.join(resource_path, huge_settings.graph_name, name))
+        )
+    except FileNotFoundError:
+        return None
+
+chunk_vector_index = safe_load("chunks")
+graph_vid_vector_index = safe_load("graph_vids")
+graph_prop_vector_index = safe_load("graph_props")
```
+ graph_prop_vector_index = safe_load("graph_props")同时在 JSON 组装处注意 None 判空。
让我们先定位 VectorIndex 的实现,检视 from_index_file 方法的行为,以确认缺失文件时的异常类型及加载开销:
🏁 Script executed:

```shell
#!/bin/bash
# Find the VectorIndex class definition
rg -n "class VectorIndex" .
# Find the from_index_file implementation
rg -n "def from_index_file" .
```

Length of output: 361
To confirm the concrete implementation and exception behavior of `from_index_file`, inspect the code:

🏁 Script executed:

```shell
#!/bin/bash
# Inspect the implementation of VectorIndex.from_index_file
sed -n '1,200p' hugegraph-llm/src/hugegraph_llm/indices/vector_index.py
```

Length of output: 4337
Replace the full loads with a lightweight metadata API to avoid unnecessary I/O and memory use.

`VectorIndex.from_index_file` already returns an empty instance (`ntotal=0`) when the index file is missing, so there is no need to catch `FileNotFoundError` at the call site. However, every API call currently loads three full FAISS indexes into memory just to return `embed_dim` and the entry counts. Suggestions:

- Add a `read_metadata(dir_path: str) -> Tuple[int, int]` (or similarly named) method to `VectorIndex` that:
  - checks file existence with `os.path.exists` and returns `(0, 0)` when missing;
  - reads only the index header to obtain `d` and `ntotal`;
  - optionally deserializes `properties.pkl` or a companion JSON to get the property count.
- In `vector_index_utils.py`, replace the three `from_index_file` calls:

```diff
-chunk_index = VectorIndex.from_index_file(chunks_path)
-vid_index = VectorIndex.from_index_file(graph_vids_path)
-prop_index = VectorIndex.from_index_file(graph_props_path)
+chunk_meta = VectorIndex.read_metadata(chunks_path)
+vid_meta = VectorIndex.read_metadata(graph_vids_path)
+prop_meta = VectorIndex.read_metadata(graph_props_path)
```

- Then build the response from the metadata directly:

```python
return json.dumps({
    "embed_dim": chunk_meta.d,
    "vector_info": {
        "chunk_vector_num": chunk_meta.ntotal,
        "graph_vid_vector_num": vid_meta.ntotal,
        "graph_properties_vector_num": prop_meta.ntotal,
    }
})
```

This avoids three full index loads and significantly reduces endpoint latency, I/O pressure, and peak memory.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py around lines 60
to 69, the code loads three FAISS index files using VectorIndex.from_index_file,
which can be expensive and may cause issues if files are missing or large. Since
from_index_file already handles missing files gracefully by returning an index
with ntotal=0, avoid reloading full indexes; instead, implement a lightweight
method to read only index metadata such as dimension and total count, reducing
I/O and memory usage. Replace the three full index loads with this new method,
and use the retrieved metadata for JSON response construction, ensuring
efficient and fault-tolerant operation.
Actionable comments posted: 3
♻️ Duplicate comments (3)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)

25-25: The global dict is not thread-safe. The global `API_CALL_TRACKER` dict has a thread-safety problem in concurrent environments, the same issue raised in the previous review.
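A minimal sketch of one way to make the tracker thread-safe (class and method names are illustrative, not from the PR): guard every read-modify-write of the per-day counter with a single lock.

```python
import threading
from datetime import date

class DailyCallTracker:
    """Thread-safe replacement for a bare module-level dict:
    all reads and writes happen under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def try_acquire(self, daily_limit: int) -> bool:
        """Return True and count the call if under today's limit, else False."""
        today = date.today()
        with self._lock:
            # Drop stale days so the dict never grows unbounded
            for day in [d for d in self._counts if d != today]:
                del self._counts[day]
            if self._counts.get(today, 0) >= daily_limit:
                return False
            self._counts[today] = self._counts.get(today, 0) + 1
            return True
```

The endpoint would then call `tracker.try_acquire(daily_limit)` once per request instead of touching the dict directly.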
28-28: Remove the unnecessary pylint-disable comment. The `# pylint: disable=too-many-statements` comment is unnecessary for this simple function; same issue as noted in the previous review.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

147-219: 🛠️ Refactor suggestion: the `run` method still needs refactoring for maintainability. The new property-index logic is implemented correctly, but `run` is now 70+ lines and violates the single-responsibility principle; this was raised in a previous review and remains unresolved. Strongly consider splitting it into:
- `_update_vertex_id_index()`: vertex-ID index update (lines 148-171)
- `_update_property_index()`: property index update (lines 173-218)
This would significantly improve readability, testability, and maintainability.
🧹 Nitpick comments (3)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

38-48: Optimize the rate-limit cleanup logic. The current cleanup walks the whole dict on every API call to delete stale entries, which can hurt performance under frequent calls. A more efficient strategy:

```diff
-for call_date in list(API_CALL_TRACKER.keys()):
-    if call_date != today:
-        del API_CALL_TRACKER[call_date]
+# Only clean up yesterday's record instead of walking the whole dict
+yesterday = today - timedelta(days=1)
+API_CALL_TRACKER.pop(yesterday, None)
```

This requires an import at the top of the file:

```python
from datetime import date, timedelta
```

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (2)
33-40: Make the query limit configurable. The Gremlin query template is well designed, but the hard-coded `100000` limit may not be flexible enough. Consider making the limit a configuration parameter so it can be tuned per environment:

```diff
 INDEX_PROPERTY_GREMLIN = """
 g.V().hasLabel('{label}')
-    .limit(100000)
+    .limit({limit})
     .project('vid', 'properties')
     .by(id())
     .by(valueMap({fields}))
     .toList()
 """
```

Then pass the configured limit when formatting the query.

89-106: The logic is correct, but inline comments would help. `diff_property_sets` correctly identifies the property sets to add, update, and remove, but given the method's complexity, inline comments would improve readability:

```diff
 def diff_property_sets(
     self, present_prop_value_to_propset: dict, past_prop_value_to_propset: dict
 ):
     to_add = []
     to_update = []
     to_update_remove = []
+    # Values present in the past but absent now must be removed
     to_remove_keys = set(past_prop_value_to_propset) - set(present_prop_value_to_propset)
     to_remove = [past_prop_value_to_propset[k] for k in to_remove_keys]
+    # Walk current values to determine additions and updates
     for prop_value, present_propset in present_prop_value_to_propset.items():
         past_propset = past_prop_value_to_propset.get(prop_value)
         if past_propset is None:
+            # New property value -> add
             to_add.append((prop_value, present_propset))
         elif present_propset != past_propset:
+            # Property set changed -> update
             to_update.append((prop_value, present_propset))
             to_update_remove.append((prop_value, past_propset))
     return to_add, to_update, to_remove, to_update_remove
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (5)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)

1-17: The license header is correct. The Apache license header is standard and complete.

19-24: Imports check out. All imported modules are used in the code; there are no unused imports.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (3)

21-22: The import changes are reasonable. The new imports are all required for the property-vector-index feature, including `defaultdict` for the data structures and `PyHugeClient` for graph-database queries. Also applies to: 31-31

45-45: The constructor changes are correct. The new property-index directory, property vector index, and client initialization all support the property-vector-index feature and are implemented correctly. Also applies to: 47-47, 50-56

136-145: `get_past_props` is concise and efficient. Extracting historical property information from the existing property index is implemented clearly and correctly.
```python
        del API_CALL_TRACKER[call_date]
call_count = API_CALL_TRACKER.get(today, 0)
if call_count >= daily_limit:
    log.error("Rate limit exceeded for update_vid_embedding. Maximum %d calls per day.", daily_limit)
```
Fix the function name in the log message.

The function name `update_vid_embedding` in the log message does not match the actual endpoint.

Apply this change to correct it:

```diff
-log.error("Rate limit exceeded for update_vid_embedding. Maximum %d calls per day.", daily_limit)
+log.error("Rate limit exceeded for update_embedding_api. Maximum %d calls per day.", daily_limit)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/vector_api.py at line 43, the log error
message references the function name update_vid_embedding, which does not match
the actual endpoint function name. Update the log message to reflect the correct
function name to improve clarity and debugging accuracy. Ensure the message
accurately describes the function where the rate limit is exceeded.
```python
# pylint: disable=too-many-statements
def vector_http_api(router: APIRouter, update_embedding_func):
    @router.post("/vector/embedding", status_code=status.HTTP_200_OK)
    def update_embedding_api(daily_limit: int = 2):
```
🛠️ Refactor suggestion

Add input validation for the `daily_limit` parameter.

The `daily_limit` parameter has no input validation, so negative or zero values can slip through.

Apply this change to add validation:

```diff
 def update_embedding_api(daily_limit: int = 2):
+    if daily_limit <= 0:
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail="daily_limit must be a positive integer"
+        )
     # existing implementation continues below
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/vector_api.py at line 31, the function
update_embedding_api has a daily_limit parameter without input validation, which
could lead to invalid values such as negative numbers or zero. Add validation
logic to ensure daily_limit is a positive integer, raising an appropriate
exception if the validation fails. This will prevent invalid input values from
causing issues downstream.
```python
def get_present_props(self, context: dict[str, Any]) -> dict[str, frozenset[tuple[str, str]]]:
    results = []
    for item in context["index_labels"]:
        label = item["base_value"]
        fields = item["fields"]
        fields_str_list = [f"'{field}'" for field in fields]
        fields_for_query = ", ".join(fields_str_list)
        gremlin_query = INDEX_PROPERTY_GREMLIN.format(
            label=label,
            fields=fields_for_query
        )
        log.debug("gremlin_query: %s", gremlin_query)
        result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
        results.extend(result)
    orig_present_prop_value_to_propset = defaultdict(set)
    for item in results:
        properties = item["properties"]
        for prop_key, values in properties.items():
            if not values:
                continue
            prop_value = str(values[0])
            orig_present_prop_value_to_propset[prop_value].add((prop_key, prop_value))
    present_prop_value_to_propset = {
        k: frozenset(v)
        for k, v in orig_present_prop_value_to_propset.items()
    }
    return present_prop_value_to_propset
```
🛠️ Refactor suggestion

Add error handling around the Gremlin query.

The logic of `get_present_props` is correct, but failures of the Gremlin query are unhandled. Add exception capture to improve robustness:

```diff
         log.debug("gremlin_query: %s", gremlin_query)
-        result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
-        results.extend(result)
+        try:
+            result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
+            results.extend(result)
+        except Exception as e:
+            log.error("Failed to execute Gremlin query for label %s: %s", label, e)
+            raise
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py
around lines 108 to 134, the code executes Gremlin queries without handling
potential errors or exceptions that may occur during the execution. Add a
try-except block around the gremlin().exec() call to catch exceptions, log an
error message with relevant details, and optionally handle the error gracefully
to prevent the method from crashing unexpectedly.
Change-Id: Ie99469b659abe5efd53d8a52581c4ad91622ef6f
Change-Id: I04b1a71a5eda13c8369cd9c4b99e83f6c1373405
Change-Id: I8d9d5e3a46e16162c45b1b07a05bd187ebe0f4b1
Change-Id: Ibf83629dde9373398dac75b945a7fa19ae029a08
Change-Id: I8923df363da0ddc301b4a3c4833cec478a6c83f9
Change-Id: I84b2fdbfc0745d1556d5d29059ce5f9dfa311352
Change-Id: I79ef4dc7852788823b249a80d7979e5917d2a8c0
…arhugegraph/hugegraph-ai into property_embedding
Actionable comments posted: 17
🔭 Outside diff range comments (1)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

79-87: ⚠️ Potential issue: collecting async results out of order can misalign embeddings and object IDs.

`asyncio.as_completed(tasks)` yields task results in completion order, not submission order; the subsequent `embeddings.extend(batch_embeddings)` therefore produces an `embeddings` list whose order no longer matches `vid_batches` (nor `added_vids` / `add_prop_values`).

`VectorIndex.add(embeddings, ids)` relies on positional one-to-one correspondence, so wrong vectors may be inserted and retrieval accuracy degrades badly. Use `await asyncio.gather(*tasks)` to preserve order, or record batch indices explicitly:

```diff
-embeddings = []
-with tqdm(total=len(tasks)) as pbar:
-    for future in asyncio.as_completed(tasks):
-        batch_embeddings = await future
-        embeddings.extend(batch_embeddings)
-        pbar.update(1)
+embeddings: list[Any] = []
+batch_results = await asyncio.gather(*tasks)  # collect in submission order
+for batch_emb in batch_results:
+    embeddings.extend(batch_emb)
```
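A small demonstration of the ordering difference (the `embed_batch` coroutine is a stand-in for the real embedding call): `asyncio.gather` returns results in submission order even when later-submitted batches finish first.

```python
import asyncio

async def embed_batch(batch_id, items):
    # Stand-in for an embedding call; later batches finish first on purpose
    await asyncio.sleep(0.01 * (3 - batch_id))
    return [f"vec-{batch_id}-{i}" for i in items]

async def collect_in_order(batches):
    # gather() returns results in submission order, regardless of
    # completion order -- unlike asyncio.as_completed()
    results = await asyncio.gather(
        *(embed_batch(b, items) for b, items in enumerate(batches))
    )
    embeddings = []
    for batch in results:
        embeddings.extend(batch)
    return embeddings

embeddings = asyncio.run(collect_in_order([[0, 1], [0], [0, 1]]))
```

Even though batch 2 completes before batch 0 here, `embeddings` still lines up positionally with the input batches, which is exactly what `VectorIndex.add(embeddings, ids)` requires.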
♻️ Duplicate comments (1)

hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1)

90-106: The documented return-value types of `diff_property_sets` still do not match the implementation. The same issue was flagged in the main file; fix it consistently or remove the copied code to avoid duplicated effort later.
🧹 Nitpick comments (13)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (2)

58-60: The string split should handle the no-separator case to avoid `IndexError`. `_extract_names` assumes every vid contains ":"; if upstream switches to plain numeric/UUID ids, this raises. Use `partition` or `split(":", 1)` with a fallback:

```diff
-return [v.split(":")[1] for v in vertices]
+names = []
+for v in vertices:
+    _, sep, tail = v.partition(":")
+    names.append(tail if sep else v)
+return names
```
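The same fallback as a runnable sketch (the function name `extract_names` mirrors `_extract_names` but is illustrative): `str.partition` never raises and reports via its middle element whether the separator was found.

```python
def extract_names(vertices):
    """Split 'label:name' vids, falling back to the raw vid when ':' is absent."""
    names = []
    for v in vertices:
        _, sep, tail = v.partition(":")  # sep is "" when ":" is missing
        names.append(tail if sep else v)
    return names
```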
167-177: Mixing `int | str` in `added_props_vector_num` complicates callers. When the property count exceeds the limit the function returns the string `"0 (because of exceeding limit)"`, otherwise an `int`. The mixed types force downstream logging and statistics code into `isinstance` checks. Return a consistent `int` and flag the overflow via a separate boolean field or a log message:

```diff
-if len(add_prop_values) > 100000:
-    log.warning("The number of props > 100000, please select which properties to vectorize.")
-    return removed_props_num, "0 (because of exceeding limit)"
+if len(add_prop_values) > 100000:
+    log.warning("The number of props > 100000, skip building property index.")
+    return removed_props_num, 0
```

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1)
57-59: The external doc link points to the internal baidu-int repository, which open-source users cannot reach. Public documentation should stay publicly accessible to avoid 404s. If the template lives only in the internal repository, consider:
- syncing the content to a public repository and updating the link;
- or noting in the docs that the link is internal and accessible only within Baidu.

hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
184-206: The new endpoint is basically correct; suggested improvements:
- Endpoint naming: `/text2gremlin` does not follow REST conventions; consider `/gremlin/generate` or similar.
- Logging consistency: line 202 uses an f-string; keep the format consistent with the other log calls.

Suggested changes:

```diff
-@router.post("/text2gremlin", status_code=status.HTTP_200_OK)
+@router.post("/gremlin/generate", status_code=status.HTTP_200_OK)
```

```diff
-log.error(f"Error in text2gremlin_api: {e}")
+log.error("Error in text2gremlin_api: %s", e)
```

README.md (1)
36-37: Fix the Markdown list indentation. Static analysis flags incorrect list-item indentation; items should start at column 0, not column 1:

```diff
- - 如流 HugeGraph Team/Group
- - GraphRAG DevOps Team (🚧)
+- 如流 HugeGraph Team/Group
+- GraphRAG DevOps Team (🚧)
```

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
36-36: Unordered list indentation — Expected: 0; Actual: 1 (MD007, ul-indent)
37-37: Unordered list indentation — Expected: 0; Actual: 1 (MD007, ul-indent)
vermeer-python-client/src/pyvermeer/client/client.py (2)
38-41: Keep module key names consistent. The module dict uses `"graph"` and `"tasks"`; for naming consistency, if GraphModule maps to `"graph"`, then TaskModule should probably map to `"task"` rather than `"tasks"`.

44-48: Strengthen error handling. The current `__getattr__` simply raises `AttributeError`; a friendlier message would list the available modules:

```diff
 def __getattr__(self, name):
     """Access modules as attributes"""
     if name in self._modules:
         return self._modules[name]
-    raise AttributeError(f"Module {name} not found")
+    available_modules = list(self._modules.keys())
+    raise AttributeError(f"Module '{name}' not found. Available modules: {available_modules}")
```

vermeer-python-client/src/pyvermeer/structure/master_data.py (1)
56-56: Make the time formatting check more explicit
The formatting logic in `to_dict` already handles the `None` case, which is good, but an explicit check is clearer.

```diff
-"launch_time": self.__launch_time.strftime("%Y-%m-%d %H:%M:%S") if self.__launch_time else ''
+"launch_time": self.__launch_time.strftime("%Y-%m-%d %H:%M:%S") if self.__launch_time is not None else ''
```
38-47: Consider keeping the file-logging option
The commented-out file handler could be useful in production. Consider keeping it behind a configuration flag.

Add a parameter to control whether file logging is enabled:

```diff
-    def _initialize(self, name: str):
+    def _initialize(self, name: str, enable_file_log: bool = False):
         """初始化日志配置"""
         self.logger = logging.getLogger(name)
         self.logger.setLevel(logging.INFO)
         if not self.logger.handlers:
             # ... console handler code ...
-            # file_handler = logging.FileHandler('api_client.log')
-            # file_handler.setLevel(logging.DEBUG)
-            # file_handler.setFormatter(
-            #     logging.Formatter(
-            #         '[%(asctime)s] [%(levelname)s] [%(threadName)s] %(name)s - %(message)s'
-            #     )
-            # )
+            if enable_file_log:
+                file_handler = logging.FileHandler('api_client.log')
+                file_handler.setLevel(logging.DEBUG)
+                file_handler.setFormatter(
+                    logging.Formatter(
+                        '[%(asctime)s] [%(levelname)s] [%(threadName)s] %(name)s - %(message)s'
+                    )
+                )
+                self.logger.addHandler(file_handler)
             self.logger.addHandler(console_handler)
-            # self.logger.addHandler(file_handler)
```
96-96: Logging the full response body may be too large
Recording the complete response content in debug logs can bloat log files and hurt performance.

Limit the size of the logged response content:

```diff
-        log.debug(f"Response code:{response.status_code}, received: {response.text}")
+        response_preview = response.text[:500] + "..." if len(response.text) > 500 else response.text
+        log.debug(f"Response code:{response.status_code}, received: {response_preview}")
```
134-150: Consider consistent attribute-access patterns
TaskCreateRequest uses public attributes while the other classes use private attributes with property accessors. For a consistent code style, consider adopting the same pattern:

```diff
 class TaskCreateRequest:
     """task create request"""

     def __init__(self, task_type, graph_name, params):
         """init"""
-        self.task_type = task_type
-        self.graph_name = graph_name
-        self.params = params
+        self.__task_type = task_type
+        self.__graph_name = graph_name
+        self.__params = params
+
+    @property
+    def task_type(self):
+        """task type"""
+        return self.__task_type
+
+    @property
+    def graph_name(self):
+        """graph name"""
+        return self.__graph_name
+
+    @property
+    def params(self):
+        """params"""
+        return self.__params

     def to_dict(self) -> dict:
         """to dict"""
         return {
             'task_type': self.task_type,
             'graph': self.graph_name,
             'params': self.params
         }
```
18-18: Simplify the dictionary lookup
As the static-analysis tool suggests, the `dic.get()` call can be simplified because the default return value is already `None`.

```diff
-        self.__vertex_data_backend = dic.get('vertex_data_backend', None)
+        self.__vertex_data_backend = dic.get('vertex_data_backend')
```
🪛 Ruff (0.11.9)
18-18: Use `dic.get('vertex_data_backend')` instead of `dic.get('vertex_data_backend', None)`
Replace `dic.get('vertex_data_backend', None)` with `dic.get('vertex_data_backend')` (SIM910)
66-66: Improve the comment wording
"Nonsense" is unprofessional in a docstring; provide a more descriptive explanation.

```diff
-        """is self worker. Nonsense """
+        """indicates whether this is the local/self worker"""
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (38)
- README.md (2 hunks)
- hugegraph-llm/README.md (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (5 hunks)
- hugegraph-python-client/README.md (0 hunks)
- vermeer-python-client/README.md (1 hunks)
- vermeer-python-client/requirements.txt (1 hunks)
- vermeer-python-client/setup.py (1 hunks)
- vermeer-python-client/src/pyvermeer/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/base.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/graph.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/master.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/task.py (1 hunks)
- vermeer-python-client/src/pyvermeer/api/worker.py (1 hunks)
- vermeer-python-client/src/pyvermeer/client/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/client/client.py (1 hunks)
- vermeer-python-client/src/pyvermeer/demo/task_demo.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/base_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/graph_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/master_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/task_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/structure/worker_data.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/__init__.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/exception.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/log.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/vermeer_config.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py (1 hunks)
- vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py (1 hunks)
💤 Files with no reviewable changes (1)
- hugegraph-python-client/README.md
✅ Files skipped from review due to trivial changes (13)
- vermeer-python-client/src/pyvermeer/client/__init__.py
- vermeer-python-client/src/pyvermeer/utils/__init__.py
- vermeer-python-client/src/pyvermeer/structure/__init__.py
- vermeer-python-client/src/pyvermeer/api/master.py
- vermeer-python-client/src/pyvermeer/api/worker.py
- vermeer-python-client/src/pyvermeer/__init__.py
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py
- vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py
- vermeer-python-client/requirements.txt
- vermeer-python-client/README.md
- hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py
- vermeer-python-client/setup.py
- vermeer-python-client/src/pyvermeer/utils/exception.py
🚧 Files skipped from review as they are similar to previous changes (3)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py
🧰 Additional context used
🧬 Code Graph Analysis (5)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (1)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (1)
GremlinGenerateRequest(127-152)
vermeer-python-client/src/pyvermeer/api/base.py (3)
vermeer-python-client/src/pyvermeer/structure/task_data.py (1)
params(106-108)hugegraph-ml/src/hugegraph_ml/models/seal.py (1)
debug(790-791)vermeer-python-client/src/pyvermeer/client/client.py (1)
send_request(50-52)
vermeer-python-client/src/pyvermeer/api/task.py (2)
vermeer-python-client/src/pyvermeer/api/base.py (2)
BaseModule(10-29)_send_request(22-29)vermeer-python-client/src/pyvermeer/structure/task_data.py (11)
TasksResponse(174-193)TaskCreateRequest(134-149)TaskCreateResponse(152-171)TaskResponse(196-215)params(106-108)to_dict(31-33)to_dict(115-131)to_dict(143-149)to_dict(165-171)to_dict(187-193)to_dict(209-215)
vermeer-python-client/src/pyvermeer/structure/task_data.py (2)
vermeer-python-client/src/pyvermeer/structure/base_data.py (3)
BaseResponse(12-39)errcode(26-31)message(34-39)vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py (1)
parse_vermeer_time(12-17)
vermeer-python-client/src/pyvermeer/structure/graph_data.py (2)
vermeer-python-client/src/pyvermeer/structure/base_data.py (3)
BaseResponse(12-39)errcode(26-31)message(34-39)vermeer-python-client/src/pyvermeer/utils/vermeer_datetime.py (1)
parse_vermeer_time(12-17)
🪛 Gitleaks (8.26.0)
vermeer-python-client/src/pyvermeer/demo/task_demo.py
16-16: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 Ruff (0.11.9)
vermeer-python-client/src/pyvermeer/structure/graph_data.py
18-18: Use dic.get('vertex_data_backend') instead of dic.get('vertex_data_backend', None)
Replace dic.get('vertex_data_backend', None) with dic.get('vertex_data_backend')
(SIM910)
🪛 markdownlint-cli2 (0.17.2)
README.md
36-36: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
37-37: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
🪛 LanguageTool
hugegraph-llm/README.md
[misspelling] ~88-~88: Did you mean “except”?
Context: ...c/clients/restful-api-v3/schema.html): (Accept 2 types) - User-defined Schema (J...
(ACCEPT_EXCEPT)
🔇 Additional comments (23)
hugegraph-llm/src/hugegraph_llm/api/rag_api.py (2)
29-29: Import statement is correct
The new `GremlinGenerateRequest` import matches the needs of the new API endpoint.

45-45: Function parameter addition is reasonable
The new `gremlin_generate_selective_func` parameter provides the selective generation capability the new endpoint requires.

README.md (1)

6-7: Documentation simplification is reasonable
Removing the hyperlink formatting keeps the document concise without hurting readability.
vermeer-python-client/src/pyvermeer/structure/base_data.py (2)
7-9: Response status constants are well defined
The constant names are clear and the values are reasonable, making status checks and error handling straightforward.

12-39: BaseResponse class is well designed
The implementation follows solid object-oriented principles:

- private attributes for proper encapsulation
- read-only access via property decorators
- graceful handling of missing dictionary keys
- complete docstrings
vermeer-python-client/src/pyvermeer/demo/task_demo.py (1)
11-42: Demo code is well structured
Aside from the security issue, the demo is clear and effectively showcases the client's features:

- uses structured request objects
- demonstrates both fetch and create operations
- sensible code organization
🧰 Tools
🪛 Gitleaks (8.26.0)
16-16: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
hugegraph-llm/README.md (3)
41-41: Documentation update: repository address changed to an internal one
The clone address was changed from the public Apache GitHub repository to a Baidu-internal repository address, with a username placeholder added. Please confirm this change complies with the project's access-control policy.

88-88: Fix: more accurate documentation wording
Changing "Except 2 types" to "Accept 2 types" makes the description accurate and clear.
🧰 Tools
🪛 LanguageTool
[misspelling] ~88-~88: Did you mean “except”?
Context: ...c/clients/restful-api-v3/schema.html): (Accept 2 types) - User-defined Schema (J...(ACCEPT_EXCEPT)
89-89: Documentation link update: template link now points to an internal repository
The template link now points to the Baidu-internal DevOps repository, and the line reference changed from 125 to 173. Please verify that the new line number still points at the relevant template content.
vermeer-python-client/src/pyvermeer/api/base.py (1)
10-29: Good foundational design: dependency injection and unified request handling
This base module class is well designed:

- receives the client instance via dependency injection
- sets up a hierarchical logger
- provides a single entry point for request handling
- follows Python naming conventions, using an underscore prefix for protected methods

The implementation is concise with clear responsibilities, giving subclasses a solid foundation.
vermeer-python-client/src/pyvermeer/api/task.py (1)
1-39: Good implementation!
The TaskModule class is concise and clear: it correctly inherits from BaseModule, implements the task-related API operations through the parent's unified request method, carries complete type annotations, and follows Python coding conventions.
vermeer-python-client/src/pyvermeer/client/client.py (1)
26-26: Details

✅ Verification successful

Check Python version compatibility

The `tuple[float, float]` annotation is not supported before Python 3.9. For backward compatibility, consider `Tuple[float, float]` or adding `from __future__ import annotations`.

Run the following script to check the project's Python version requirement:

🏁 Script executed:

```shell
#!/bin/bash
# Check the project's Python version requirement
fd -e py -e txt setup.py pyproject.toml requirements.txt | xargs grep -l "python_requires\|Programming Language :: Python" | head -5
```

Length of output: 299

To find the project's explicit Python version constraints, re-run the search with:

🏁 Script executed:

```shell
#!/bin/bash
# Search the repository for Python version constraint declarations
rg -n "python_requires" .
rg -n "Programming Language :: Python" .
```

Length of output: 498

Python version compatibility confirmed

The project already declares a minimum Python version via `python_requires=">=3.9"` in `vermeer-python-client/setup.py`. Python 3.9+ natively supports the `tuple[float, float]` annotation, so there is no need to switch to `Tuple[float, float]` or add `from __future__ import annotations`.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (1)
230-236: Details

✅ Verification successful

Verify the completeness of the output keys

It is recommended to verify that the defined output keys match the actual return values of the `gremlin_generate` function.

🏁 Script executed:

```shell
#!/bin/bash
# Inspect the return structure of gremlin_generate
ast-grep --pattern $'def gremlin_generate($$$) -> $_:
  $$$
  return (
    $$$
  )'
```

Length of output: 94

To confirm the definition of `gremlin_generate` and its return structure, run:

🏁 Script executed:

```shell
#!/bin/bash
rg -n "def gremlin_generate" -A10 hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py
```

Length of output: 935

Inspect the return statements inside `gremlin_generate` to confirm the order and number of returned elements:

🏁 Script executed:

```shell
#!/bin/bash
rg -n "def gremlin_generate" -A150 -B0 hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py
```

Length of output: 10031

Confirmed: `output_keys` match the return values of `gremlin_generate`

On inspection, `gremlin_generate` always returns a 5-tuple in the normal flow:

- match_result
- context["result"] (i.e. template_gremlin)
- context["raw_result"] (i.e. raw_gremlin)
- context["template_exec_res"] (corresponding to template_execution_result)
- context["raw_exec_res"] (corresponding to raw_execution_result)

The `output_keys = ["match_result", "template_gremlin", "raw_gremlin", "template_execution_result", "raw_execution_result"]` defined in `gremlin_generate_selective` map one-to-one to these return values, so no adjustment is needed.
vermeer-python-client/src/pyvermeer/utils/log.py (1)
58-58: Global logger instance is well created
Creating the global logger instance this way follows best practice and makes it easy to use throughout the client.
vermeer-python-client/src/pyvermeer/structure/worker_data.py (1)
97-99: Good use of modern type annotations
Using the modern `list[Worker]` annotation syntax is good practice.

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2)

145-152: Sensible default for the output type
Defaulting to the template Gremlin is reasonable: it provides useful default behavior while still letting users customize the output type.

120-125: Enum definition is clear and complete
The `GremlinOutputType` enum covers every stage of Gremlin query generation with clear, readable names.

vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py (2)
38-42: Good token validation logic
Validating the token's presence at initialization is good security practice and surfaces configuration problems early.

97-104: Exception handling is thorough and correct
The exception handling covers the main error scenarios, uses appropriate custom exception types, and preserves the exception chain.
vermeer-python-client/src/pyvermeer/structure/task_data.py (2)
13-34: TaskWorker class is implemented correctly
The class is concise and clear, properly encapsulating the basic task-worker information; the property accessors and dictionary serialization behave as expected.

152-216: Response classes are implemented correctly
TaskCreateResponse, TasksResponse, and TaskResponse correctly inherit from BaseResponse and follow a consistent initialization and serialization pattern, with a clear structure that meets expectations.
vermeer-python-client/src/pyvermeer/structure/graph_data.py (2)
86-201: VermeerGraph class is implemented correctly
The class properly encapsulates the graph's metadata, handles time parsing and formatting appropriately, and uses boolean flag properties to clearly represent the graph's various characteristics.

203-245: Graph response classes are implemented correctly
GraphsResponse and GraphResponse correctly inherit from BaseResponse, handle single graph objects and lists of graphs with a consistent pattern, and provide complete, correct serialization methods.
```python
class BuildSemanticIndex:
    def __init__(self, embedding: BaseEmbedding):
        self.index_dir = str(os.path.join(resource_path, huge_settings.graph_name, "graph_vids"))
        self.index_dir_prop = str(os.path.join(resource_path, huge_settings.graph_name, "graph_props"))
        self.vid_index = VectorIndex.from_index_file(self.index_dir)
        self.prop_index = VectorIndex.from_index_file(self.index_dir_prop)
        self.embedding = embedding
        self.sm = SchemaManager(huge_settings.graph_name)
        self.client = PyHugeClient(
            url=huge_settings.graph_url,
            graph=huge_settings.graph_name,
            user=huge_settings.graph_user,
            pwd=huge_settings.graph_pwd,
            graphspace=huge_settings.graph_space,
        )
```
The test file duplicates the full business implementation, causing duplicate definitions and package import conflicts

BuildSemanticIndex already exists in the production module build_semantic_index.py; re-implementing the same class in full here leads to:

- Duplicate-code maintenance burden: any divergence between the two implementations causes behavioral drift that is hard to keep in sync.
- Package import conflicts: with `import hugegraph_llm.operators.index_op.test as ...`, this class may shadow the real implementation and trigger hard-to-detect runtime errors.
- Missing test intent: the file lives at operators/index_op/test.py but is not a unit test — it is a copy of business code, which violates test-directory conventions.

Recommendation: delete this copy of the business implementation immediately and keep only genuine unit-test code, or simply `from ...build_semantic_index import BuildSemanticIndex` in the tests.
```diff
-# 整个 BuildSemanticIndex 复制体
-class BuildSemanticIndex:
-    ...
+# TODO: 编写针对 BuildSemanticIndex 的单元测试,而不是复制实现
+import pytest
+from hugegraph_llm.operators.index_op.build_semantic_index import BuildSemanticIndex
+
+
+def test_diff_property_sets_addition():
+    # 示例:验证新增属性集能被正确识别
+    build = BuildSemanticIndex(embedding=None)  # 可使用 mock Embedding
+    to_add, *_ = build.diff_property_sets(
+        {"v1": frozenset({("k", "v1")})},
+        {},
+    )
+    assert to_add == [("v1", frozenset({("k", "v1")}))]
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py between lines 42
and 57, the BuildSemanticIndex class is fully reimplemented, duplicating the
business logic already defined in build_semantic_index.py. This causes code
duplication, import conflicts, and breaks testing conventions. To fix this,
remove the entire BuildSemanticIndex class implementation from this test file
and instead import BuildSemanticIndex from the original build_semantic_index
module. Then, keep only proper unit test code in this test.py file.
```python
client = PyVermeerClient(
    ip="10.41.57.139",
    port=8688,
    token="Q7svB13nYvREB4bDCj7kQnwJEMvLgrgfDimu4h1Fp7CUzQLk758ya1EYwycn1kjbgskiHiKzDni9jEkJcssgTy7rZJdt4gYEkfvjeowZGzSebgiSEU86dgFPXzUUtwrA81vWKm1xfioBcS9GmXjGQoM6C",
    log_level="DEBUG",
)
```
Severe security issue: hard-coded API token

The code contains a hard-coded API token, which is a serious security risk. Real credentials should never be exposed, even in demo code.

Apply the following fix:
def main():
"""main"""
+ # 从环境变量或配置文件读取敏感信息
+ import os
client = PyVermeerClient(
- ip="10.41.57.139",
- port=8688,
- token="Q7svB13nYvREB4bDCj7kQnwJEMvLgrgfDimu4h1Fp7CUzQLk758ya1EYwycn1kjbgskiHiKzDni9jEkJcssgTy7rZJdt4gYEkfvjeowZGzSebgiSEU86dgFPXzUUtwrA81vWKm1xfioBcS9GmXjGQoM6C",
+ ip=os.getenv("VERMEER_IP", "localhost"),
+ port=int(os.getenv("VERMEER_PORT", "8688")),
+ token=os.getenv("VERMEER_TOKEN", "your-token-here"),
log_level="DEBUG",
)🧰 Tools
🪛 Gitleaks (8.26.0)
16-16: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/demo/task_demo.py around lines 13 to 18,
the API token is hardcoded, posing a serious security risk. Remove the hardcoded
token and instead load it securely from environment variables or a configuration
file. Update the code to read the token dynamically at runtime, ensuring no
sensitive credentials are exposed in the source code.
```python
class VermeerConfig:
    """The configuration of a Vermeer instance."""
    ip: str
    port: int
    token: str
    factor: str
    username: str
    graph_space: str

    def __init__(self,
                 ip: str,
                 port: int,
                 token: str,
                 timeout: tuple[float, float] = (0.5, 15.0)):
        """Initialize the configuration for a Vermeer instance."""
        self.ip = ip
        self.port = port
        self.token = token
        self.timeout = timeout
```
🛠️ Refactor suggestion
Design improvement: ensure every declared attribute is actually initialized

The current implementation has several problems:

- The `factor`, `username`, and `graph_space` attributes are declared but never initialized in the constructor, which can lead to `AttributeError`.
- The `timeout` attribute is set in the constructor but not declared at class level.

Suggested improvement:
class VermeerConfig:
"""The configuration of a Vermeer instance."""
ip: str
port: int
token: str
+ timeout: tuple[float, float]
factor: str
username: str
graph_space: str
def __init__(self,
ip: str,
port: int,
token: str,
+ factor: str = "",
+ username: str = "",
+ graph_space: str = "",
timeout: tuple[float, float] = (0.5, 15.0)):
"""Initialize the configuration for a Vermeer instance."""
self.ip = ip
self.port = port
self.token = token
+ self.factor = factor
+ self.username = username
+ self.graph_space = graph_space
self.timeout = timeout📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class VermeerConfig: | |
| """The configuration of a Vermeer instance.""" | |
| ip: str | |
| port: int | |
| token: str | |
| factor: str | |
| username: str | |
| graph_space: str | |
| def __init__(self, | |
| ip: str, | |
| port: int, | |
| token: str, | |
| timeout: tuple[float, float] = (0.5, 15.0)): | |
| """Initialize the configuration for a Vermeer instance.""" | |
| self.ip = ip | |
| self.port = port | |
| self.token = token | |
| self.timeout = timeout | |
| class VermeerConfig: | |
| """The configuration of a Vermeer instance.""" | |
| ip: str | |
| port: int | |
| token: str | |
| timeout: tuple[float, float] | |
| factor: str | |
| username: str | |
| graph_space: str | |
| def __init__(self, | |
| ip: str, | |
| port: int, | |
| token: str, | |
| factor: str = "", | |
| username: str = "", | |
| graph_space: str = "", | |
| timeout: tuple[float, float] = (0.5, 15.0)): | |
| """Initialize the configuration for a Vermeer instance.""" | |
| self.ip = ip | |
| self.port = port | |
| self.token = token | |
| self.factor = factor | |
| self.username = username | |
| self.graph_space = graph_space | |
| self.timeout = timeout |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/utils/vermeer_config.py lines 8 to 26,
the class VermeerConfig declares attributes factor, username, and graph_space
but does not initialize them in the constructor, and timeout is initialized in
the constructor but not declared at the class level. To fix this, add factor,
username, graph_space, and timeout to the class attribute declarations and
initialize all of them properly in the __init__ method, either by adding
parameters for them or setting default values to avoid AttributeError.
```python
def get_graph(self, graph_name: str) -> GraphResponse:
    """获取任务列表"""
    response = self._send_request(
        "GET",
        f"/graphs/{graph_name}"
    )
    return GraphResponse(response)

def get_graphs(self) -> GraphsResponse:
    """获取任务列表"""
    response = self._send_request(
        "GET",
        "/graphs",
    )
    return GraphsResponse(response)
```
Fix: incorrect method docstrings

Both methods' docstrings incorrectly say "获取任务列表" (get task list), but these methods actually retrieve graph-related information.

Apply the following changes to correct the docstrings:
def get_graph(self, graph_name: str) -> GraphResponse:
- """获取任务列表"""
+ """获取指定图的信息"""
response = self._send_request(
"GET",
f"/graphs/{graph_name}"
)
return GraphResponse(response)
def get_graphs(self) -> GraphsResponse:
- """获取任务列表"""
+ """获取图列表"""
response = self._send_request(
"GET",
"/graphs",
)
return GraphsResponse(response)📝 Committable suggestion
| def get_graph(self, graph_name: str) -> GraphResponse: | |
| """获取任务列表""" | |
| response = self._send_request( | |
| "GET", | |
| f"/graphs/{graph_name}" | |
| ) | |
| return GraphResponse(response) | |
| def get_graphs(self) -> GraphsResponse: | |
| """获取任务列表""" | |
| response = self._send_request( | |
| "GET", | |
| "/graphs", | |
| ) | |
| return GraphsResponse(response) | |
| def get_graph(self, graph_name: str) -> GraphResponse: | |
| """获取指定图的信息""" | |
| response = self._send_request( | |
| "GET", | |
| f"/graphs/{graph_name}" | |
| ) | |
| return GraphResponse(response) | |
| def get_graphs(self) -> GraphsResponse: | |
| """获取图列表""" | |
| response = self._send_request( | |
| "GET", | |
| "/graphs", | |
| ) | |
| return GraphsResponse(response) |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/api/graph.py around lines 14 to 28, the
docstrings for both get_graph and get_graphs methods incorrectly state "获取任务列表"
which means "get task list". Update the docstrings to accurately describe their
functionality: for get_graph, change it to "获取指定名称的图信息" (get graph information
by name), and for get_graphs, change it to "获取所有图的信息列表" (get list of all graph
information).
```python
return outputs_dict  # Early exit if gremlin_generate returned an error tuple

match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results
```
🛠️ Refactor suggestion
Add a safety check before unpacking

Unpacking the tuple directly can fail in error cases. Add a length check to be safe.
+ # Ensure we have the expected 5-tuple result
+ if not isinstance(original_results, tuple) or len(original_results) != 5:
+ return {
+ "error": True,
+ "error_message": f"Unexpected result format from gremlin_generate: {type(original_results)}",
+ "requested_outputs": list(requested_outputs_set)
+ }
+
match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results📝 Committable suggestion
| match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results | |
| # Ensure we have the expected 5-tuple result | |
| if not isinstance(original_results, tuple) or len(original_results) != 5: | |
| return { | |
| "error": True, | |
| "error_message": f"Unexpected result format from gremlin_generate: {type(original_results)}", | |
| "requested_outputs": list(requested_outputs_set) | |
| } | |
| match_res_orig, template_gremlin_orig, raw_gremlin_orig, template_exec_res_orig, raw_exec_res_orig = original_results |
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py at line
262, the code directly unpacks the tuple original_results into multiple
variables without verifying its length, which can cause errors if the tuple is
shorter than expected. Add a length check before unpacking to ensure
original_results contains the expected number of elements, and handle the case
where it does not to prevent runtime exceptions.
```python
graphspae: Optional[str] = Query(
    huge_settings.graph_space,
    description="graph space."
)
```
Misspelled field name

The field name `graphspae` should be `graphspace`.

Correct the field name:
- graphspae: Optional[str] = Query(
+ graphspace: Optional[str] = Query(
huge_settings.graph_space,
description="graph space."
)📝 Committable suggestion
| graphspae: Optional[str] = Query( | |
| huge_settings.graph_space, | |
| description="graph space." | |
| ) | |
| graphspace: Optional[str] = Query( | |
| huge_settings.graph_space, | |
| description="graph space." | |
| ) |
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py around lines 133
to 136, the field name is misspelled as 'graphspae'. Rename the field to
'graphspace' to correct the spelling error.
```python
url = f"http://{self._cfg.ip}:{self._cfg.port}/"
return urljoin(url, path).strip("/")
```
🛠️ Refactor suggestion
URL construction carries a security risk

The combination of `urljoin` and `strip` can produce unexpected URLs. If `path` starts with '/', `urljoin` discards the path portion of the base URL.

Improve the URL construction logic:
def resolve(self, path: str):
"""
Resolve the path to a full URL.
"""
url = f"http://{self._cfg.ip}:{self._cfg.port}/"
- return urljoin(url, path).strip("/")
+ # Ensure path doesn't start with '/' to avoid urljoin issues
+ path = path.lstrip('/')
+ return urljoin(url, path)📝 Committable suggestion
| url = f"http://{self._cfg.ip}:{self._cfg.port}/" | |
| return urljoin(url, path).strip("/") | |
| def resolve(self, path: str): | |
| """ | |
| Resolve the path to a full URL. | |
| """ | |
| url = f"http://{self._cfg.ip}:{self._cfg.port}/" | |
| # Ensure path doesn't start with '/' to avoid urljoin issues | |
| path = path.lstrip('/') | |
| return urljoin(url, path) |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py around lines 72
to 73, the current URL construction using urljoin combined with strip("/") can
cause incorrect URLs if path starts with a slash, as urljoin will ignore the
base URL's path. To fix this, avoid using urljoin with paths starting with '/',
and instead safely concatenate the base URL and path ensuring exactly one '/'
between them without stripping characters, preserving the intended URL
structure.
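The `urljoin` pitfall the comment describes is easy to reproduce with a standalone snippet (the base URL below is hypothetical; the project's actual base has no path segment, but one may be added later):

```python
from urllib.parse import urljoin

base = "http://10.41.57.139:8688/api/"  # hypothetical base with a path segment

# A relative path is appended to the base path, as intended.
assert urljoin(base, "tasks") == "http://10.41.57.139:8688/api/tasks"

# A leading '/' makes urljoin discard the base's '/api/' segment.
assert urljoin(base, "/tasks") == "http://10.41.57.139:8688/tasks"

# Stripping the leading slash first keeps the base path intact.
assert urljoin(base, "/tasks".lstrip("/")) == "http://10.41.57.139:8688/api/tasks"
```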
```python
) -> dict:
    """request"""
    try:
        log.debug(f"Request made to {path} with params {json.dumps(params)}")
```
Risk of leaking sensitive information in logs

Logging request parameters at debug level may expose sensitive data such as authentication tokens or personal information.

Avoid logging sensitive parameters:

```diff
-        log.debug(f"Request made to {path} with params {json.dumps(params)}")
+        log.debug(f"Request made to {path}")
```

Or implement parameter filtering:
+ def _safe_log_params(self, params):
+ """Remove sensitive information from params for logging"""
+ if not params:
+ return params
+ safe_params = {k: v for k, v in params.items()
+ if k not in ['token', 'password', 'api_key']}
+ return safe_params
+
- log.debug(f"Request made to {path} with params {json.dumps(params)}")
+ log.debug(f"Request made to {path} with params {json.dumps(self._safe_log_params(params))}")📝 Committable suggestion
| log.debug(f"Request made to {path} with params {json.dumps(params)}") | |
| # Add this helper in the class (e.g. just above the method that makes the request) | |
| def _safe_log_params(self, params): | |
| """Remove sensitive information from params for logging""" | |
| if not params: | |
| return params | |
| safe_params = { | |
| k: v | |
| for k, v in params.items() | |
| if k not in ["token", "password", "api_key"] | |
| } | |
| return safe_params | |
| # …later in the same class, replace the logging call: | |
| log.debug(f"Request made to {path} with params {json.dumps(self._safe_log_params(params))}") |
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/utils/vermeer_requests.py at line 89, the
debug log outputs request parameters which may contain sensitive information
like authentication tokens or personal data. Modify the logging to either
exclude sensitive parameters entirely or implement a filtering mechanism that
redacts or removes sensitive fields before logging. This ensures sensitive data
is not exposed in logs while still providing useful debug information.
```python
'state': self.__state,
'create_user': self.__create_user,
'create_type': self.__create_type,
'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__start_time else '',
```
Fix the incorrect condition in the time formatting

When formatting `create_time`, the condition mistakenly checks `self.__start_time` instead of `self.__create_time`.
- 'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__start_time else '',
+ 'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__create_time else '',📝 Committable suggestion
```diff
-            'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__start_time else '',
+            'create_time': self.__create_time.strftime("%Y-%m-%d %H:%M:%S") if self.__create_time else '',
```
🤖 Prompt for AI Agents
In vermeer-python-client/src/pyvermeer/structure/task_data.py at line 123, the
condition for formatting 'create_time' incorrectly checks 'self.__start_time'
instead of 'self.__create_time'. Fix this by changing the condition to check
'self.__create_time' before calling strftime to ensure the correct attribute is
used for formatting.
Actionable comments posted: 0

♻️ Duplicate comments (4)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (4)

25-25: Thread-safety issue still unresolved

The global dictionary API_CALL_TRACKER still has a thread-safety problem, which was pointed out in a previous review but has not been addressed. In a concurrent environment, multiple requests may modify this dictionary at the same time, causing a data race.

28-28: Remove the unnecessary pylint comment

The function vector_http_api contains only a few statements and does not need the # pylint: disable=too-many-statements comment. It appears to have been copied from another file and should be removed. Apply this change:

```diff
-# pylint: disable=too-many-statements
 def vector_http_api(router: APIRouter, update_embedding_func):
```

31-31: Add input validation to improve API robustness

The daily_limit parameter lacks input validation; this was pointed out in a previous review but has not been addressed. The parameter should be validated as a positive integer to prevent problems caused by negative or zero values.

43-43: Correct the function name in the log message

The log message uses the wrong function name update_vid_embedding; this was pointed out in a previous review but has not been addressed. The correct name update_embedding_api should be used.

🧹 Nitpick comments (1)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

49-51: Make the API design more flexible

The current API design lacks flexibility: update_embedding_func() is called without any arguments. If request data needs to be passed in the future, the current design will be hard to extend. Consider the following change:

```diff
 @router.post("/vector/embedding", status_code=status.HTTP_200_OK)
-def update_embedding_api(daily_limit: int = 50):
+def update_embedding_api(request_data: dict = None, daily_limit: int = 50):
     """
     Updates the vector embedding.
-    This endpoint is rate-limited. By default, it allows 2 calls per day. (Note: Not Thread-Safe!)
+    This endpoint is rate-limited. By default, it allows 50 calls per day. (Note: Not Thread-Safe!)
     The rate limit is tracked per day and resets at midnight.
     """
     # ... rate limiting logic ...
-    result = update_embedding_func()
+    result = update_embedding_func(request_data) if request_data else update_embedding_func()
     result = {"detail": result}
     return result
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-python-client/src/pyhugegraph/client.py (2 hunks)
- hugegraph-python-client/src/pyhugegraph/utils/huge_config.py (1 hunks)
- hugegraph-python-client/src/pyhugegraph/utils/huge_requests.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hugegraph-python-client/src/pyhugegraph/client.py (1)
hugegraph-python-client/src/pyhugegraph/utils/huge_config.py (1)
HGraphConfig(30-75)
🔇 Additional comments (8)
hugegraph-python-client/src/pyhugegraph/utils/huge_config.py (1)
33-35: 身份验证字段更新合理。新增的token字段和可选的username/password字段设计得当,支持了灵活的身份验证方式。字段类型定义正确,默认值设置合适。
hugegraph-python-client/src/pyhugegraph/utils/huge_requests.py (2)
53-56: Token初始化逻辑正确。简洁明了地从配置中初始化token属性,逻辑清晰易懂。
147-155: HTTP请求身份验证实现得当。正确地实现了Bearer token和基本认证的切换逻辑:
- 复制headers避免修改类属性
- 按HTTP标准添加Authorization Bearer头
- 使用token时正确地将auth设为None避免冲突
- 符合REST API身份验证最佳实践
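The header-selection logic the bullets above describe can be sketched as follows; the function name and tuple-based basic-auth return value are illustrative assumptions, not the actual huge_requests API:

```python
def build_auth(headers, token=None, user=None, pwd=None):
    """Return (headers, auth) for a request; a Bearer token wins over basic auth."""
    headers = dict(headers)  # copy so the shared class-level headers are not mutated
    if token:
        headers["Authorization"] = f"Bearer {token}"
        return headers, None           # auth=None avoids sending conflicting credentials
    return headers, (user, pwd)        # fall back to HTTP basic auth
```

Returning `auth=None` alongside the Bearer header mirrors the behavior the review praises: a token and basic-auth credentials are never sent together.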
hugegraph-python-client/src/pyhugegraph/client.py (4)

32-32: Logging import added correctly. The import supports the authentication warning message and matches the code's needs.

54-56: The constructor parameter updates are reasonable. Making user/pwd optional and adding the token parameter supports flexible authentication. The type annotations are correct and the parameter design follows API best practices.

60-67: The authentication parameter validation is thorough:

- Ensures either a token or a username/password pair is provided
- Warns when both authentication methods are supplied
- The error messages are clear, giving a good user experience

69-77: The configuration object initialization is clean. Passing explicit keyword arguments to HGraphConfig improves readability and maintainability. All parameters are mapped correctly and backward compatibility is preserved.

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

34-34: ⚠️ Potential issue — Correct the default value in the docstring

The docstring says it "allows 2 calls per day" by default, but the actual default is 50. This misleads API users.

Apply this change to fix the docstring:

```diff
-    This endpoint is rate-limited. By default, it allows 2 calls per day. (Note: Not Thread-Safe!)
+    This endpoint is rate-limited. By default, it allows 50 calls per day. (Note: Not Thread-Safe!)
```

Likely an incorrect or invalid review comment.
Actionable comments posted: 2

♻️ Duplicate comments (4)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

121-122: Gremlin query execution lacks error handling

As pointed out in a previous comment, the Gremlin query execution has no exception handling. Adding a try/except block would improve robustness.

```diff
-        result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
-        results.extend(result)
+        try:
+            result = self.client.gremlin().exec(gremlin=gremlin_query)["data"]
+            results.extend(result)
+        except Exception as e:
+            log.error("Gremlin query failed for label %s: %s", label, e)
+            raise
```

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

30-32: Fix the misused pylint disable comment

The function contains only a few statements; the # pylint: disable=too-many-statements comment is unnecessary and may have been copied from another file.

hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1)

42-243: ⚠️ Potential issue — Avoid duplicating business logic in a test file

This file lives in a test directory but contains a full business implementation of the BuildSemanticIndex class, duplicating the implementation in build_semantic_index.py. This practice causes several problems:

- Code duplication: maintaining two similar implementations easily leads to inconsistency
- Unclear test intent: a test file should contain unit tests, not business logic
- Import conflict risk: it may shadow the real business implementation

Delete the business implementation here and write real unit tests instead:

```diff
-class BuildSemanticIndex:
-    # delete the entire class implementation
-    ...
+import pytest
+from hugegraph_llm.operators.index_op.build_semantic_index import BuildSemanticIndex
+
+def test_build_semantic_index():
+    # write unit tests against BuildSemanticIndex
+    pass
```

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (1)

89-102: Exact property matching still has a performance problem

Although the earlier correctness issues are resolved (cached schema, correct property-name extraction, hashable tuples), the O(n×m) query inefficiency remains: a separate Gremlin query is issued for every property key and every keyword, producing a large number of network requests when there are many property keys.

Suggested query-strategy optimizations:

- Pre-filter property keys that can plausibly match (e.g. properties with a Text value type)
- Merge multiple keywords into a single query
- Consider batched queries to reduce network overhead

```diff
-        for key in property_keys:
-            for keyword in list(unmatched_keywords):
-                gremlin_query = f"g.V().has('{key.name}', '{keyword}').limit(1)"
+        # Filter text-typed property keys
+        text_property_keys = [key for key in property_keys if key.dataType == 'TEXT']
+        for key in text_property_keys:
+            # Query multiple keywords at once
+            keywords_str = "', '".join(list(unmatched_keywords))
+            gremlin_query = f"g.V().has('{key.name}', within('{keywords_str}')).valueMap('{key.name}').limit(10)"
```
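The merged-keyword query in the diff above can be sketched as a small query builder; the quote escaping shown is an assumption (a real client may offer safer parameter bindings):

```python
def build_within_query(prop_name, keywords, limit=10):
    """Build one Gremlin query matching any of the keywords on a single property."""
    # Escape single quotes so keyword values cannot break out of the string literal.
    escaped = [k.replace("'", "\\'") for k in keywords]
    keywords_str = "', '".join(escaped)
    return (
        f"g.V().has('{prop_name}', within('{keywords_str}'))"
        f".valueMap('{prop_name}').limit({limit})"
    )
```

This collapses the per-keyword round trips into one request per property key, which is the core of the optimization the comment asks for.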
🧹 Nitpick comments (2)

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (1)

24-24: Remove the unused import

Static analysis detected that huge_settings is imported but never used in this file; remove the unused import to keep the code clean.

Apply this change:

```diff
 from hugegraph_llm.config import prompt
-from hugegraph_llm.config import huge_settings
```

🧰 Tools
🪛 Ruff (0.11.9)
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (1)

114-126: The data-reorganization method works well. It correctly handles mixed data types and deduplicates, with clear logic. Per the static-analysis hint, the nested if statements can be simplified.

```diff
-        elif isinstance(item, tuple):
-            if len(item) == 2:
-                unique_tuples.add(item)
+        elif isinstance(item, tuple) and len(item) == 2:
+            unique_tuples.add(item)
```

🧰 Tools
🪛 Ruff (0.11.9)
123-124: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py (7 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (4 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/index_op/test.py (1 hunks)
- hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/graph_rag_query.py
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1)
hugegraph-python-client/src/pyhugegraph/client.py (2)
PyHugeClient(49-120)graph(88-89)
🪛 Ruff (0.11.9)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)
hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py
123-124: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
🔇 Additional comments (19)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (6)

33-40: The Gremlin query template is correct and includes the necessary safety limit. The INDEX_PROPERTY_GREMLIN template is well designed: limit(100000) prevents querying an oversized dataset, and the project structure cleanly separates vertex IDs from property data. The query logic matches the requirements.

45-47: The property-index directory and index object are initialized correctly. The new graph_props directory and its VectorIndex instance follow the same naming and structural pattern as the existing vertex-ID index.

50-57: HugeGraph client initialization supports token authentication. The client initialization correctly adds the token parameter, consistent with the token-based authentication mechanism across the codebase. The parameters are passed completely and follow the new authentication pattern.

195-197: Safety guard on the property count. The code correctly checks the property count against the 100,000 limit, an important safeguard against performance problems from processing too many properties. The returned warning message is clear.

203-203: The async embedding-generation call is correct. Following the fix from earlier comments, asyncio.run() is now used correctly to invoke the async _get_embeddings_parallel method, resolving the previous async-call issue.

214-231: The refactored run method is well structured. The refactor splits the complex logic into the two helpers _update_vid_index and _update_property_index, greatly improving readability and maintainability. The control flow is clear and the conditions are reasonable.

hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py (1)

46-46: Token authentication parameter added correctly. The get_hg_client() function correctly adds token=huge_settings.graph_token, consistent with PyHugeClient initialization elsewhere in the project and supporting the new token-based authentication.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1)

31-31: Token authentication added correctly in SchemaManager. The SchemaManager constructor correctly adds token=huge_settings.graph_token, consistent with other PyHugeClient instantiations across the codebase and keeping the authentication mechanism unified.

hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (1)

30-30: Token configuration field defined correctly. The new graph_token: Optional[str] = None field in HugeGraphConfig has the right type annotation and default. It provides the configuration basis for token authentication across the system and matches the pattern of the other optional authentication parameters.

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2)

29-32: Token authentication support is implemented well. Making user and pwd optional with empty-string defaults while adding a token field cleanly supports token-based authentication, consistent with the authentication upgrade across the codebase.

121-147: The new Gremlin-generation request model is well designed. The GremlinOutputType enum and GremlinGenerateRequest model provide flexible options for selective Gremlin generation, supporting combinations of output types; the API design is thorough.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (8)

20-20: The extended type imports are good. The new imports support the property-matching implementation, with clear type definitions.

41-43: Property vector index initialized correctly. A dedicated vector index for property semantic search, separate from the vertex-ID index, is a sound design.

54-54: Token authentication improves security. Adding the token parameter strengthens the security of the graph-database connection.

57-57: Schema caching improves performance. Caching the schema object avoids repeated network calls, a significant performance win.

82-83: Batched embedding calls improve performance. Switching from per-text embedding calls to a batch call reduces the number of model invocations and improves efficiency.

104-112: Property fuzzy matching implemented correctly. The logic mirrors vertex-ID fuzzy matching, using the dedicated property vector index for semantic search.

132-132: Query-vector generation optimized. Consistent with the keyword handling, the batch embedding call improves performance.

146-157: Property-match integration is complete. Property-matching results are correctly integrated into the context, supporting combined exact and fuzzy matches with clear logic. Checking for the presence of the index label decides whether property matching runs at all, avoiding unnecessary computation.
```diff
         graph_token: Optional[str] = huge_settings.graph_token,
         graph_space: Optional[str] = huge_settings.graph_space,
     ):
         self.client = PyHugeClient(url=graph_url, graph=graph_name, user=graph_user, pwd=graph_pwd,
-                                   graphspace=graph_space)
+                                   token=huge_settings.graph_token, graphspace=graph_space)
```
Fix the inconsistent parameter usage

The constructor defines a graph_token parameter, but when creating the PyHugeClient instance it uses huge_settings.graph_token directly, so the passed-in argument is ignored.

Apply this change to fix the parameter usage:

```diff
         self.client = PyHugeClient(url=graph_url, graph=graph_name, user=graph_user, pwd=graph_pwd,
-                                   token=huge_settings.graph_token, graphspace=graph_space)
+                                   token=graph_token, graphspace=graph_space)
```

📝 Committable suggestion

```suggestion
        self.client = PyHugeClient(url=graph_url, graph=graph_name, user=graph_user, pwd=graph_pwd,
                                   token=graph_token, graphspace=graph_space)
```
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/indices/graph_index.py around lines 32 to 36,
the constructor defines a graph_token parameter but the PyHugeClient instance is
created using huge_settings.graph_token directly, ignoring the passed parameter.
Fix this by replacing huge_settings.graph_token with the graph_token parameter
when calling PyHugeClient, ensuring the passed argument is used consistently.
```python
@router.post("/vector/embedding", status_code=status.HTTP_200_OK)
def update_embedding_api(
    daily_limit: int = 50,
    graph_config: Optional[GraphConfigRequest] = Body(None)
```
Fix the mutable-default problem in the function parameter

Using Body(None) as a parameter default can cause unexpected behavior, because the default is evaluated once at function-definition time rather than on each call.

Apply this change to fix the problem:

```diff
 def update_embedding_api(
     daily_limit: int = 50,
-    graph_config: Optional[GraphConfigRequest] = Body(None)
+    graph_config: Optional[GraphConfigRequest] = None
 ):
+    if graph_config is None:
+        graph_config = Body(None)
```

Or, better, declare the default explicitly in the signature:

```diff
 def update_embedding_api(
     daily_limit: int = 50,
-    graph_config: Optional[GraphConfigRequest] = Body(None)
+    graph_config: Optional[GraphConfigRequest] = Body(default=None)
 ):
```

📝 Committable suggestion
🧰 Tools
🪛 Ruff (0.11.9)
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/api/vector_api.py at line 36, the function
parameter uses Body(None) as a default value, which is mutable and evaluated
only once at function definition, potentially causing unexpected behavior. To
fix this, remove the default value from the function signature so that the
parameter does not have a mutable default, or alternatively, use None as the
default and assign Body() inside the function body if needed.
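The "evaluated once at definition time" pitfall described here is general Python behavior rather than anything FastAPI-specific (FastAPI actually treats `Body(...)` defaults as declarative metadata, so the Ruff B008 warning is mainly a linting concern); a minimal plain-Python demonstration:

```python
def append_to(value, bucket=[]):   # anti-pattern: the list default is created ONCE
    bucket.append(value)
    return bucket

def append_to_fixed(value, bucket=None):   # the usual fix: a None sentinel
    if bucket is None:
        bucket = []                # a fresh list on every call
    bucket.append(value)
    return bucket
```

Successive calls to `append_to` keep appending to the same shared list, while `append_to_fixed` starts from an empty list each time.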
Actionable comments posted: 0

♻️ Duplicate comments (3)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (3)

49-49: [Duplicate issue] Correct the function name in the log message

The function name in the log message does not match the actual endpoint, consistent with the previous review comment.

```diff
-    log.error("Rate limit exceeded for update_vid_embedding. Maximum %d calls per day.", daily_limit)
+    log.error("Rate limit exceeded for update_embedding_api. Maximum %d calls per day.", daily_limit)
```

28-28: ⚠️ Potential issue — [Duplicate issue] Make the rate limiting thread-safe

The global dictionary API_CALL_TRACKER has a thread-safety problem in concurrent environments, consistent with the previous review comment. A thread-safe implementation is recommended:

```diff
+import threading
+from collections import defaultdict
+
-API_CALL_TRACKER = {}
+API_CALL_TRACKER = defaultdict(int)
+TRACKER_LOCK = threading.Lock()
```

Then protect the rate-limit logic with the lock:

```diff
+    with TRACKER_LOCK:
         today = date.today()
         for call_date in list(API_CALL_TRACKER.keys()):
             if call_date != today:
                 del API_CALL_TRACKER[call_date]
         call_count = API_CALL_TRACKER.get(today, 0)
         if call_count >= daily_limit:
             # ... error handling
         API_CALL_TRACKER[today] = call_count + 1
```
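The lock-protected tracker can also be packaged as a small self-contained helper; the class name and method below are illustrative, not part of the reviewed code:

```python
import threading
from datetime import date
from typing import Optional

class DailyRateLimiter:
    """Thread-safe per-day call counter; the count resets when the date changes."""

    def __init__(self, daily_limit: int):
        self._daily_limit = daily_limit
        self._lock = threading.Lock()
        self._day = date.today()
        self._count = 0

    def try_acquire(self, today: Optional[date] = None) -> bool:
        """Count the call and return True if under today's limit, else False."""
        today = today or date.today()
        with self._lock:
            if today != self._day:      # new day: reset the counter
                self._day = today
                self._count = 0
            if self._count >= self._daily_limit:
                return False
            self._count += 1
            return True
```

Encapsulating the lock and counter together avoids scattering `with TRACKER_LOCK:` blocks through the endpoint code; the `today` parameter exists only to make the reset behavior easy to test.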
36-36: ⚠️ Potential issue — [Duplicate issue] Fix the mutable default in the function parameter

Body(None) as a default causes unexpected behavior, because the default is evaluated once at function-definition time.

```diff
-    graph_config: Optional[GraphConfigRequest] = Body(None)
+    graph_config: Optional[GraphConfigRequest] = Body(default=None)
```

🧰 Tools
🪛 Ruff (0.11.9)
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)

🧹 Nitpick comments (2)

hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (1)

24-24: Remove the unused import

Static analysis detected that huge_settings is imported but not used in this file; remove the unused import to keep the code clean.

```diff
-from hugegraph_llm.config import huge_settings
```

🧰 Tools
🪛 Ruff (0.11.9)
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)

hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1)

38-42: Correct the default limit in the docstring

The docstring says it "allows 2 calls per day" by default, but the actual default is 50. Update the docstring for consistency.

```diff
-    This endpoint is rate-limited. By default, it allows 2 calls per day. (Note: Not Thread-Safe!)
+    This endpoint is rate-limited. By default, it allows 50 calls per day. (Note: Not Thread-Safe!)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (2 hunks)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/api/vector_api.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/api/rag_api.py
🧰 Additional context used
🪛 Ruff (0.11.9)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py
24-24: hugegraph_llm.config.huge_settings imported but unused
Remove unused import: hugegraph_llm.config.huge_settings
(F401)
hugegraph-llm/src/hugegraph_llm/api/vector_api.py
36-36: Do not perform function call Body in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
🔇 Additional comments (6)
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py (4)
18-19: LGTM - 新导入项支持新功能新增的
List和Enum导入支持了后续定义的GremlinOutputType枚举和GremlinGenerateRequest模型,导入合理。
28-32: GraphConfigRequest 模型字段更新合理模型字段的更新支持了更灵活的认证方式:
name重命名为graph更符合语义user和pwd设为可选,默认为空字符串- 新增
token字段支持基于令牌的认证这些更改与代码库中其他地方的令牌认证支持保持一致。
121-127: GremlinOutputType 枚举定义清晰枚举很好地定义了 Gremlin 生成结果的不同输出类型,命名清晰且具有描述性。这将为选择性 Gremlin 生成提供良好的类型安全性。
128-146: GremlinGenerateRequest 模型设计良好模型设计合理,包含了所有必要的参数:
- 查询文本和示例数量
- 可选的提示模板和客户端配置
- 输出类型列表支持选择性结果返回
默认值设置合理,文档描述清晰。
hugegraph-llm/src/hugegraph_llm/api/vector_api.py (2)
55-61: 图配置更新逻辑正确配置更新逻辑正确地将请求中的图配置参数映射到全局设置,包括新增的令牌字段。这与代码库中其他地方的令牌认证支持保持一致。
63-65: 嵌入更新和响应包装合理函数调用嵌入更新功能并将结果包装在
detail字段中返回,这符合 API 响应的常见模式。
6a76d6c to fe94881 Compare
Summary by CodeRabbit

- New features
- Enhancements
- Documentation updates
- New client SDK