chore: introduce VectorIndexManager runtime framework with incremental sync, ANN search and versioned persistence #2922
Open
hahahahbenny wants to merge 24 commits into apache:vector-index from hahahahbenny:vector-manager
…che#2893) * docs(pd): update test commands and improve documentation clarity * Update README.md --------- Co-authored-by: imbajin <jin@apache.org>
* update(store): fix some problem and clean up code - chore(store): clean some comments - chore(store): using Slf4j instead of System.out to print log - update(store): update more reasonable timeout setting - update(store): add close method for CopyOnWriteCache to avoid potential memory leak - update(store): delete duplicated beginTx() statement - update(store): extract parameter for compaction thread pool(move to configuration file in the future) - update(store): add default logic in AggregationFunctions - update(store): fix potential concurrency problem in QueryExecutor * Update hugegraph-store/hg-store-common/src/main/java/org/apache/hugegraph/store/query/func/AggregationFunctions.java --------- Co-authored-by: Peng Junzhi <78788603+Pengzna@users.noreply.github.com>
* fix(store): fix duplicated definition log root
…p ci & remove duplicate module (apache#2910) * add missing license and remove binary license.txt * remove dist in commons * fix tinkerpop test open graph panic and other bugs * empty commit to trigger ci
…fields to the index label.
# This is the 1st commit message: add Licensed to files # This is the commit message apache#2: feat(server): support vector index in graphdb (apache#2856) * feat(server): Add the vector index type and the detection of related fields to the index label. * fix code format * add annsearch API * add doc to explain the plan delete redundency in vertexapi
… ANN search and versioned persistence
…uenceAllocator interface and the VectorIdAllocator class.
Labels
dependencies
Incompatible dependencies of package
feature
New feature
size:XXL
This PR changes 1000+ lines, ignoring generated files.
Purpose of the PR
This PR implements the vector-index runtime management framework (VectorIndexManager), which coordinates data synchronization between the RocksDB storage layer and the JVector in-memory index, and supports incremental vector updates, ANN search, and index persistence.
Architecture
Overall Architecture
flowchart TB
    %% --- Node style definitions (ClassDefs) ---
    classDef manager fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px,color:#000;
    classDef memoryComp fill:#bbdefb,stroke:#0d47a1,stroke-width:2px,color:#000;
    classDef memoryStruct fill:#e1bee7,stroke:#4a148c,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
    classDef disk fill:#ffe0b2,stroke:#e65100,stroke-width:2px,shape:cylinder,color:#000;
    classDef file fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,shape:note,color:#000;

    %% ================= Main Architecture =================
    %% --- 1. Top: Orchestration Layer (green background) ---
    subgraph Orchestration ["Orchestration Layer"]
        direction TB
        M["VectorIndexManager<br/>(Main Coordinator)"]:::manager
    end

    %% --- 2. Middle: Memory Layer (blue background) ---
    subgraph MemoryLayer ["Memory Space"]
        direction TB
        %% Three core component containers (white background to highlight inner components)
        subgraph Components ["Core Components"]
            direction LR
            subgraph Components_vector ["Vector-Related Components"]
                %% StateStore
                SS["VectorStateStore<br/>(KV Data Abstraction)"]:::memoryComp
                %% Runtime
                RT["VectorRuntime<br/>(JVector In-Memory Graph)"]:::memoryComp
            end
            %% Scheduler & EventHub
            subgraph Scheduler_Wrap ["Async Scheduling"]
                style Scheduler_Wrap fill:none
                SC["VectorTaskScheduler<br/>(Task Scheduler)"]:::memoryComp
                EH[("EventHub<br/>(In-Memory Queue/RingBuffer)")]:::memoryStruct
            end
        end
    end

    %% --- 3. Bottom: Disk Persistence Layer (orange background) ---
    subgraph DiskLayer ["Disk Space"]
        direction LR
        %% RocksDB
        ROCKS[("RocksDB<br/>(WAL / SSTables)")]:::disk
        %% JVector Files
        subgraph JVectorFiles ["JVector Persistence"]
            JV_IDX["Index File<br/>(Vector-Graph Data)"]:::file
            JV_META["Meta Data<br/>(Sequence / VectorID)"]:::file
        end
    end

    %% ================= Connection Logic =================
    %% Manager interactions
    M -->|Interactive| SS
    M -->|Interactive| RT
    %% Inside scheduler
    SC -- "push/poll" --> EH
    %% Update coordination flow (dashed for async/data flow)
    M -.->|Submit Update Task| SC
    SC -.->|1. Async: Scan Deltas| SS
    SC -.->|2. Async: Build/Update| RT
    %% Persistence interactions
    SS -- "scanDeltas / getVertex" --- ROCKS
    RT -- "Load / Flush" --- JV_IDX
    RT -- "Read / Write" --- JV_META
    %% Force alignment
    SS ~~~ SC ~~~ RT

    %% ================= Subgraph color styling =================
    %% 1. Orchestration Layer: light green
    style Orchestration fill:#e8f5e9,stroke:#4caf50,stroke-width:2px,stroke-dasharray: 5 5
    %% 2. Memory Layer: light blue
    style MemoryLayer fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,stroke-dasharray: 5 5
    %% 3. Core Components area: pure white (makes blue nodes stand out)
    style Components fill:#ffffff,stroke:#90caf9,stroke-width:1px
    %% 4. Disk Layer: light orange
    style DiskLayer fill:#fff3e0,stroke:#ff9800,stroke-width:2px,stroke-dasharray: 5 5
    %% 5. JVector file area: transparent or light yellow
    style JVectorFiles fill:#fffde7,stroke:#fbc02d,stroke-width:1px,stroke-dasharray: 3 3
Data Flow
sequenceDiagram
    participant GIT as GraphIndexTransaction
    participant M as VectorIndexManager
    participant SC as Scheduler
    participant SS as StateStore
    participant RT as Runtime
    participant JV as JVector

    Note over GIT,JV: Write Flow (async)
    GIT->>M: signal(indexLabelId)
    M->>SC: execute(task)
    SC->>SS: scanDeltas(indexLabelId, fromSeq)
    SS-->>SC: Iterator VectorRecord
    SC->>RT: update(indexLabelId, records)
    RT->>JV: addGraphNode / markNodeDeleted

    Note over GIT,JV: Search Flow (sync)
    GIT->>M: searchVector(indexLabelId, vector, topK)
    M->>RT: search(indexLabelId, vector, topK)
    RT->>JV: GraphSearcher.search()
    JV-->>RT: Iterator vectorId
    RT-->>M: Iterator vectorId
    M->>SS: getVertex(indexLabelId, vectorIds)
    SS-->>M: Set vertexId
    M-->>GIT: Set Id
Main Changes
hugegraph-common (abstraction layer)
VectorIndexManager
VectorIndexRuntime
AbstractVectorRuntime
VectorIndexStateStore
VectorTaskScheduler
VectorRecord
hugegraph-core (server-side implementation)
ServerVectorRuntime
ServerVectorStateStore
ServerVectorScheduler
Core Design
1. Incremental Sync Mechanism
Uses sequence watermarks to track and sync only newly added or modified vector records to the JVector in-memory index.
2. IndexContext Management
Each IndexLabel corresponds to one IndexContext, which encapsulates vector data, JVector builder, and metadata.
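As a rough illustration of how points 1 and 2 fit together, the sketch below keeps a per-index context that holds a sequence watermark and applies only records newer than it. DeltaRecord and IndexContextSketch are hypothetical names, not classes from this PR. A plain map stands in for the JVector builder, and deltas are assumed to arrive in ascending sequence order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical delta record: one change to a vector, stamped with a sequence number.
final class DeltaRecord {
    final long sequence;
    final int vectorId;
    final float[] vector; // null marks a deletion (assumption of this sketch)

    DeltaRecord(long sequence, int vectorId, float[] vector) {
        this.sequence = sequence;
        this.vectorId = vectorId;
        this.vector = vector;
    }
}

// Hypothetical per-IndexLabel context: vector data plus the sync watermark.
final class IndexContextSketch {
    private final Map<Integer, float[]> vectors = new HashMap<>();
    private long watermark = 0L; // highest sequence already applied

    long watermark() {
        return watermark;
    }

    int size() {
        return vectors.size();
    }

    // Applies only records whose sequence is above the watermark and
    // advances the watermark as it goes. Returns the number applied.
    int applyDeltas(List<DeltaRecord> deltas) {
        int applied = 0;
        for (DeltaRecord r : deltas) {
            if (r.sequence <= watermark) {
                continue; // already synced in an earlier pass
            }
            if (r.vector == null) {
                vectors.remove(r.vectorId);
            } else {
                vectors.put(r.vectorId, r.vector);
            }
            watermark = r.sequence;
            applied++;
        }
        return applied;
    }
}
```

Replaying a batch that was already applied is a no-op in this sketch, which is the property that makes a watermark-based scan safe to retry.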
3. Versioned Persistence
Employs symbolic link switching to support atomic version updates and rollback of old versions.
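One way such a link switch could be done with java.nio is sketched below. VersionSwitchSketch, switchCurrent and the current.tmp name are hypothetical and not taken from this PR. The sketch creates a temporary symlink and renames it over current. Whether ATOMIC_MOVE replaces an existing target is implementation-specific according to the Java documentation, and the sketch assumes a POSIX file system where the move maps to rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: repoints {indexDir}/current at a version directory.
final class VersionSwitchSketch {

    static void switchCurrent(Path indexDir, String versionDirName) throws IOException {
        Path current = indexDir.resolve("current");
        Path tmp = indexDir.resolve("current.tmp");

        // Remove a leftover temp link from an earlier, interrupted switch.
        Files.deleteIfExists(tmp);

        // Relative target, so the index directory can be relocated as a whole.
        Files.createSymbolicLink(tmp, Paths.get(versionDirName));

        // Moving the link moves the link itself, not the directory it points to.
        Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Rolling back is the same call with an older directory name, since the old version directories stay on disk until they are cleaned up.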
flowchart LR
    subgraph Dir["Directory Structure"]
        BASE["{basePath}/{indexLabelId}/"]
        BASE --> CUR["current → version_xxx (symlink)"]
        BASE --> V1["version_20250101_120000/"]
        BASE --> V2["version_20250101_110000/"]
        V1 --> IDX1["index.inline"]
        V1 --> META1["vector_meta.json"]
    end
4. Soft Delete Strategy
Deletion operations only mark nodes as deleted; actual cleanup occurs during flush.
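The mark-then-clean pattern can be illustrated with plain collections. The class below is a hypothetical sketch and does not use the JVector API: deletion only records a tombstone, reads skip tombstoned ids, and flush() is the single point where entries are physically removed.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an in-memory index with soft deletes.
final class SoftDeleteSketch {
    private final Map<Integer, float[]> nodes = new LinkedHashMap<>();
    private final Set<Integer> deleted = new HashSet<>();

    void add(int id, float[] vector) {
        nodes.put(id, vector);
        deleted.remove(id); // re-adding an id revives it
    }

    // Soft delete: the node stays stored and is only marked.
    void markDeleted(int id) {
        if (nodes.containsKey(id)) {
            deleted.add(id);
        }
    }

    boolean isLive(int id) {
        return nodes.containsKey(id) && !deleted.contains(id);
    }

    int storedCount() {
        return nodes.size();
    }

    // Physical cleanup happens here; returns how many nodes were dropped.
    int flush() {
        int removed = deleted.size();
        nodes.keySet().removeAll(deleted);
        deleted.clear();
        return removed;
    }
}
```

Deferring the removal keeps deletes cheap on the write path, at the cost of holding dead entries in memory until the next flush.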
Search Flow
Search returns Set<Id> (vertexId), which can be used directly to build a FixedIdHolder for an IdHolderList, so the result plugs into the existing index query framework.
New Dependencies
Follow-up Work
To be completed
GraphIndexTransaction.queryByUserprop() query path
doVectorIndex() method
Cleanup of old version files on stop()
Tests to be added
Verifying these changes
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO
Doc - Done
Doc - No Need