@hahahahbenny hahahahbenny commented Dec 19, 2025

Purpose of the PR

This PR implements the vector-index runtime management framework (VectorIndexManager), which coordinates data synchronization between the RocksDB storage layer and the JVector in-memory index, and supports incremental vector updates, ANN search, and index persistence.

Architecture

Overall Architecture

flowchart TB
    %% --- Node style definitions (ClassDefs) ---
    classDef manager fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#000;
    classDef memoryComp fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000;
    classDef memoryStruct fill:#e1bee7,stroke:#4a148c,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
    classDef disk fill:#ffe0b2,stroke:#e65100,stroke-width:2px,shape:cylinder,color:#000;
    classDef file fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,shape:note,color:#000;

    %% ================= Main Architecture =================

    %% --- 1. Top: Orchestration Layer (green background) ---
    subgraph Orchestration ["Orchestration Layer"]
        direction TB
        M["VectorIndexManager<br/>(Main Coordinator)"]:::manager
    end

    %% --- 2. Middle: Memory Layer (blue background) ---
    subgraph MemoryLayer ["Memory Space"]
        direction TB
        
        %% Three core component containers (white background to highlight inner components)
        subgraph Components ["Core Components"]
            direction LR
            
            subgraph Components_vector ["Vector-Related Components"]
               %% StateStore
                SS["VectorStateStore<br/>(KV Data Abstraction)"]:::memoryComp

                %% Runtime
                RT["VectorRuntime<br/>(JVector In-Memory Graph)"]:::memoryComp
            end
            
            %% Scheduler & EventHub
            subgraph Scheduler_Wrap ["Async Scheduling"]
                style Scheduler_Wrap fill:none
                SC["VectorTaskScheduler<br/>(Task Scheduler)"]:::memoryComp
                EH[("EventHub<br/>(In-Memory Queue/RingBuffer)")]:::memoryStruct
            end
        end
    end

    %% --- 3. Bottom: Disk Persistence Layer (orange background) ---
    subgraph DiskLayer ["Disk Space"]
        direction LR
        
        %% RocksDB
        ROCKS[("RocksDB<br/>(WAL / SSTables)")]:::disk
        
        %% JVector Files
        subgraph JVectorFiles ["JVector Persistence"]
            JV_IDX["Index File<br/>(Vector-Graph Data)"]:::file
            JV_META["Meta Data<br/>(Sequence / VectorID)"]:::file
        end
    end

    %% ================= Connection Logic =================

    %% Manager interactions
    M -->|Interactive| SS
    M -->|Interactive| RT

    %% Inside scheduler
    SC -- "push/poll" --> EH

    %% Update coordination flow (dashed for async/data flow)
    M -.->|Submit Update Task| SC
    SC -.->|1. Async: Scan Deltas| SS
    SC -.->|2. Async: Build/Update| RT

    %% Persistence interactions
    SS -- "scanDeltas / getVertex" --- ROCKS
    RT -- "Load / Flush" --- JV_IDX
    RT -- "Read / Write" --- JV_META

    %% Force alignment
    SS ~~~ SC ~~~ RT

    %% ================= Subgraph color styling =================
    %% 1. Orchestration Layer: light green
    style Orchestration fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,stroke-dasharray: 5 5

    %% 2. Memory Layer: light blue
    style MemoryLayer fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,stroke-dasharray: 5 5
    
    %% 3. Core Components area: pure white (makes blue nodes stand out)
    style Components fill:#ffffff,stroke:#90caf9,stroke-width:1px

    %% 4. Disk Layer: light orange
    style DiskLayer fill:#fff3e0,stroke:#ff9800,stroke-width:2px,stroke-dasharray: 5 5
    
    %% 5. JVector file area: transparent or light yellow
    style JVectorFiles fill:#fffde7,stroke:#fbc02d,stroke-width:1px,stroke-dasharray: 3 3

Data Flow

sequenceDiagram
    participant GIT as GraphIndexTransaction
    participant M as VectorIndexManager
    participant SC as Scheduler
    participant SS as StateStore
    participant RT as Runtime
    participant JV as JVector

    Note over GIT,JV: Write Flow (async)
    GIT->>M: signal(indexLabelId)
    M->>SC: execute(task)
    SC->>SS: scanDeltas(indexLabelId, fromSeq)
    SS-->>SC: Iterator VectorRecord
    SC->>RT: update(indexLabelId, records)
    RT->>JV: addGraphNode / markNodeDeleted

    Note over GIT,JV: Search Flow (sync)
    GIT->>M: searchVector(indexLabelId, vector, topK)
    M->>RT: search(indexLabelId, vector, topK)
    RT->>JV: GraphSearcher.search()
    JV-->>RT: Iterator vectorId
    RT-->>M: Iterator vectorId
    M->>SS: getVertex(indexLabelId, vectorIds)
    SS-->>M: Set vertexId
    M-->>GIT: Set Id

Main Changes

hugegraph-common (abstraction layer)

| File | Responsibility |
|---|---|
| VectorIndexManager | Coordinator; manages the lifecycle and interaction of the three components |
| VectorIndexRuntime | Runtime interface; defines operations such as update/search/flush |
| AbstractVectorRuntime | Abstract runtime implementation; manages IndexContext and versioned persistence |
| VectorIndexStateStore | State storage interface; defines the scanDeltas/getVertex operations |
| VectorTaskScheduler | Task scheduling interface; supports async task execution |
| VectorRecord | Vector record DTO; contains vectorId/vector/deleted/sequence |

hugegraph-core (server-side implementation)

| File | Responsibility |
|---|---|
| ServerVectorRuntime | JVector runtime implementation; supports COSINE/EUCLIDEAN/DOT_PRODUCT |
| ServerVectorStateStore | RocksDB state storage implementation; scans increments based on IdPrefixQuery |
| ServerVectorScheduler | Event-driven scheduling implementation based on EventHub |

Core Design

1. Incremental Sync Mechanism

Uses sequence watermarks to track and sync only newly added/modified vector records to the JVector in-memory index.
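
The watermark idea can be sketched as follows. This is a minimal illustration, not the PR's code: `SketchRecord` and `WatermarkSync` are hypothetical names, the in-memory list stands in for RocksDB, and treating `fromSeq` as an exclusive lower bound is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the PR's VectorRecord DTO.
final class SketchRecord {
    final long sequence;
    final int vectorId;
    final boolean deleted;

    SketchRecord(long sequence, int vectorId, boolean deleted) {
        this.sequence = sequence;
        this.vectorId = vectorId;
        this.deleted = deleted;
    }
}

// Tracks the highest sequence already applied to the in-memory index.
final class WatermarkSync {
    private long watermark = 0L;

    // Stand-in for scanDeltas(indexLabelId, fromSeq): keeps only records
    // whose sequence is strictly greater than fromSeq (assumed semantics).
    static List<SketchRecord> scanDeltas(List<SketchRecord> log, long fromSeq) {
        List<SketchRecord> deltas = new ArrayList<>();
        for (SketchRecord r : log) {
            if (r.sequence > fromSeq) {
                deltas.add(r);
            }
        }
        return deltas;
    }

    // Applies unseen records and advances the watermark; returns how many
    // records were applied in this pass.
    int syncOnce(List<SketchRecord> log) {
        List<SketchRecord> deltas = scanDeltas(log, this.watermark);
        for (SketchRecord r : deltas) {
            // A real runtime would add or soft-delete the graph node here.
            this.watermark = Math.max(this.watermark, r.sequence);
        }
        return deltas.size();
    }

    long watermark() {
        return this.watermark;
    }
}
```

Calling `syncOnce` twice on an unchanged log applies nothing the second time, which is the property that keeps repeated `signal(indexLabelId)` calls cheap.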

2. IndexContext Management

Each IndexLabel corresponds to one IndexContext, which encapsulates vector data, JVector builder, and metadata.
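
One way to keep exactly one context per IndexLabel is a concurrent map keyed by the label ID. The sketch below assumes that approach; `SketchIndexContext`, `ContextRegistry`, and the fields shown are hypothetical and far simpler than what the PR's IndexContext holds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, reduced IndexContext: only the label ID, the vector
// dimension, and the sync watermark are modeled here.
final class SketchIndexContext {
    final String indexLabelId;
    final int dimension;
    volatile long watermark = 0L;

    SketchIndexContext(String indexLabelId, int dimension) {
        this.indexLabelId = indexLabelId;
        this.dimension = dimension;
    }
}

// Holds one context per index label and creates it lazily.
final class ContextRegistry {
    private final Map<String, SketchIndexContext> contexts = new ConcurrentHashMap<>();

    // computeIfAbsent guarantees a single context per label even when
    // several threads ask for the same label at once.
    SketchIndexContext getOrCreate(String indexLabelId, int dimension) {
        return this.contexts.computeIfAbsent(
                indexLabelId, id -> new SketchIndexContext(id, dimension));
    }

    int size() {
        return this.contexts.size();
    }
}
```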

3. Versioned Persistence

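Each flush writes a new version directory, and a `current` symbolic link is switched to point at it. This makes version updates atomic and allows rollback to an earlier version by pointing the link back.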

flowchart LR
    subgraph Dir["Directory Structure"]
        BASE["{basePath}/{indexLabelId}/"]
        BASE --> CUR["current → version_xxx (symlink)"]
        BASE --> V1["version_20250101_120000/"]
        BASE --> V2["version_20250101_110000/"]
        V1 --> IDX1["index.inline"]
        V1 --> META1["vector_meta.json"]
    end
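
A symlink switch of this kind is commonly done by creating a temporary link and renaming it over `current`. The sketch below assumes that technique and a POSIX file system; `VersionSwitcher` and the `current.tmp` name are hypothetical, and the PR may implement the switch differently. Note that the Javadoc for `Files.move` with `ATOMIC_MOVE` leaves replacement of an existing target implementation-specific; on Linux it maps to `rename(2)`, which replaces the old link.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class VersionSwitcher {

    // Points {indexDir}/current at the given version directory name.
    // Rolling back is the same call with an older version name.
    static void switchCurrent(Path indexDir, String versionDirName) throws IOException {
        Path tmp = indexDir.resolve("current.tmp");
        Path current = indexDir.resolve("current");

        // Remove a leftover temporary link from an interrupted switch.
        Files.deleteIfExists(tmp);

        // Relative target, so the index directory can be moved as a whole.
        Files.createSymbolicLink(tmp, Paths.get(versionDirName));

        // Moving a symlink moves the link itself, not its target.
        Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Readers that resolve `current` see either the old or the new version directory, never a partially written one, provided the new directory is fully written before the switch.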

4. Soft Delete Strategy

Deletion operations only mark nodes as deleted; actual cleanup occurs during flush.
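
The mark-then-clean pattern can be sketched with a tombstone set. This is an illustration under assumptions: `SoftDeleteStore` is a hypothetical class, and a plain map stands in for the JVector graph, whose own deletion and cleanup calls are not shown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SoftDeleteStore {
    private final Map<Integer, float[]> vectors = new HashMap<>();
    private final Set<Integer> tombstones = new HashSet<>();

    // Inserting or re-inserting a vector clears any earlier tombstone.
    void put(int vectorId, float[] vector) {
        this.vectors.put(vectorId, vector);
        this.tombstones.remove(vectorId);
    }

    // Deletion only records a tombstone; the data stays in place.
    void markDeleted(int vectorId) {
        if (this.vectors.containsKey(vectorId)) {
            this.tombstones.add(vectorId);
        }
    }

    // Searches should skip tombstoned entries.
    boolean isVisible(int vectorId) {
        return this.vectors.containsKey(vectorId) && !this.tombstones.contains(vectorId);
    }

    int storedCount() {
        return this.vectors.size();
    }

    // Physical cleanup happens here; returns the number of removed entries.
    int flush() {
        int removed = this.tombstones.size();
        this.vectors.keySet().removeAll(this.tombstones);
        this.tombstones.clear();
        return removed;
    }
}
```

Deferring the physical removal keeps the delete path cheap and batches the expensive cleanup with the flush that is already rewriting the index files.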

Search Flow

Search returns Set<Id> (vertexId), which can be directly used to build FixedIdHolder for IdHolderList, seamlessly integrating with the existing index query framework.
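
The two-step shape of the search path (ANN search yields vector IDs, the state store resolves them to vertex IDs) can be sketched as below. Everything here is hypothetical: `SearchSketch` uses a brute-force cosine ranking in place of JVector's graph search, and plain maps and strings in place of the state store and HugeGraph `Id` values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SearchSketch {

    // Cosine similarity; returns 0 when either vector has zero length.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Brute-force stand-in for the ANN step: ranks every stored vector.
    static List<Integer> topK(Map<Integer, float[]> vectors, float[] query, int k) {
        List<Integer> ids = new ArrayList<>(vectors.keySet());
        ids.sort(Comparator.comparingDouble(
                (Integer id) -> -cosine(vectors.get(id), query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    // Stand-in for getVertex(indexLabelId, vectorIds): maps vector IDs to
    // vertex IDs, preserving rank order and skipping unknown IDs.
    static Set<String> resolveVertices(List<Integer> vectorIds,
                                       Map<Integer, String> vectorToVertex) {
        Set<String> vertexIds = new LinkedHashSet<>();
        for (Integer vectorId : vectorIds) {
            String vertexId = vectorToVertex.get(vectorId);
            if (vertexId != null) {
                vertexIds.add(vertexId);
            }
        }
        return vertexIds;
    }
}
```

The resulting set plays the role of the `Set<Id>` described above, which the caller would then wrap for the existing index query path.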

New Dependencies

| Dependency | Version | Purpose |
|---|---|---|
| jvector | 3.0.0 | HNSW vector index implementation |

Follow-up Work

To be completed

  • Integrate into GraphIndexTransaction.queryByUserprop() query path
  • Implement doVectorIndex() method
  • REST API / Gremlin Step support for vector search syntax
  • Old version file cleanup during stop()

Tests to be added

| Test Type | Test Content |
|---|---|
| Unit test | VectorIndexManager lifecycle |
| Unit test | ServerVectorRuntime incremental update and search |
| Unit test | AbstractVectorRuntime versioned persistence |
| Integration test | End-to-end search with RocksDB + JVector |
| Performance test | Search latency under different vector scales |

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows:
    • xxx

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need


JisoLya and others added 22 commits October 31, 2025 19:03
…che#2893)

* docs(pd): update test commands and improve documentation clarity

* Update README.md

---------

Co-authored-by: imbajin <jin@apache.org>
* update(store): fix some problem and clean up code

- chore(store): clean some comments
- chore(store): using Slf4j instead of System.out to print log
- update(store): update more reasonable timeout setting
- update(store): add close method for CopyOnWriteCache to avoid potential memory leak
- update(store): delete duplicated beginTx() statement
- update(store): extract parameter for compaction thread pool(move to configuration file in the future)
- update(store): add default logic in AggregationFunctions
- update(store): fix potential concurrency problem in QueryExecutor

* Update hugegraph-store/hg-store-common/src/main/java/org/apache/hugegraph/store/query/func/AggregationFunctions.java

---------

Co-authored-by: Peng Junzhi <78788603+Pengzna@users.noreply.github.com>
* fix(store): fix duplicated definition log root
…p ci & remove duplicate module (apache#2910)

* add missing license and remove binary license.txt

* remove dist in commons

* fix tinkerpop test open graph panic and other bugs

* empty commit to trigger ci
# This is the 1st commit message:

add Licensed to files

# This is the commit message apache#2:

feat(server): support vector index in graphdb  (apache#2856)

* feat(server): Add the vector index type and the detection of related fields to the index label.

* fix code format

* add annsearch API

* add doc to explain the plan

delete redundancy in vertexapi
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.), dependencies (Incompatible dependencies of package), and feature (New feature) labels Dec 19, 2025
codecov bot commented Dec 30, 2025

Codecov Report

❌ Patch coverage is 0% with 587 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (vector-index@c92710c). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...java/org/apache/hugegraph/api/auth/ManagerAPI.java | 0.00% | 105 Missing ⚠️ |
| ...g/apache/hugegraph/vector/ServerVectorRuntime.java | 0.00% | 76 Missing ⚠️ |
| ...pache/hugegraph/vector/ServerVectorStateStore.java | 0.00% | 60 Missing ⚠️ |
| ...apache/hugegraph/structure/HugeVectorIndexMap.java | 0.00% | 46 Missing ⚠️ |
| ...hugegraph/backend/serializer/BinarySerializer.java | 0.00% | 39 Missing ⚠️ |
| ...n/java/org/apache/hugegraph/core/GraphManager.java | 0.00% | 33 Missing ⚠️ |
| ...va/org/apache/hugegraph/api/filter/PathFilter.java | 0.00% | 22 Missing ⚠️ |
| ...apache/hugegraph/type/define/IndexVectorState.java | 0.00% | 20 Missing ⚠️ |
| ...he/hugegraph/store/client/query/QueryExecutor.java | 0.00% | 15 Missing ⚠️ |
| ...he/hugegraph/backend/tx/GraphIndexTransaction.java | 0.00% | 14 Missing ⚠️ |

... and 37 more
Additional details and impacted files
@@              Coverage Diff               @@
##             vector-index   #2922   +/-   ##
==============================================
  Coverage                ?   0.07%           
  Complexity              ?      22           
==============================================
  Files                   ?     785           
  Lines                   ?   65385           
  Branches                ?    8367           
==============================================
  Hits                    ?      51           
  Misses                  ?   65332           
  Partials                ?       2           

