feat(ai): Adding Lucene & Embedding-Based Search Operators to Apache GeaFlow (incubating) for Lightweight Context Memory #716

Leomrlin · 2025-12-17T13:01:43Z

We're excited to introduce initial support for context-aware memory operations in Apache GeaFlow (incubating) through the integration of two key retrieval operators: Lucene-powered keyword search and embedding-based semantic search. This enhancement lays the foundational layer for building dynamic, AI-driven graph memory systems — enabling real-time, hybrid querying over structured graph data and unstructured semantic intent.

✅ Key Features Implemented

KeywordVector + Lucene Indexing: Enables fast, full-text retrieval of entities using BM25-style keyword matching. Ideal for surfacing exact or near-exact matches from entity attributes (e.g., names, emails, titles).
EmbeddingVector + Vector Index Store: Supports semantic search via high-dimensional embeddings. Queries are encoded using a configured embedding model and matched against pre-indexed node representations.
Hybrid VectorSearch Interface: Combines multiple vector types (keyword, embedding, traversal hints) into a single search context, paving the way for multimodal retrieval.
End-to-End Query Pipeline: From query ingestion → hybrid indexing → graph retrieval → context verbalization, demonstrated with LDBC-scale data.
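
To illustrate how keyword and embedding results could be combined into one ranking, here is a minimal sketch of weighted score fusion. It is an assumption for illustration only: the class `HybridScoreSketch`, its methods, and the idea that both scores are already normalized to [0, 1] are hypothetical and are not taken from the GeaFlow code in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of keyword + embedding score fusion; not part of GeaFlow. */
final class HybridScoreSketch {

    /**
     * Weighted sum of two score maps keyed by entity id.
     * Assumes both inputs are normalized to [0, 1]; a missing score counts as 0.
     */
    static Map<String, Double> fuse(Map<String, Double> keywordScores,
                                    Map<String, Double> embeddingScores,
                                    double keywordWeight) {
        Set<String> ids = new HashSet<>(keywordScores.keySet());
        ids.addAll(embeddingScores.keySet());
        Map<String, Double> fused = new HashMap<>();
        for (String id : ids) {
            double k = keywordScores.getOrDefault(id, 0.0);
            double e = embeddingScores.getOrDefault(id, 0.0);
            fused.put(id, keywordWeight * k + (1.0 - keywordWeight) * e);
        }
        return fused;
    }

    /** Entity ids ordered by descending fused score, ties broken by id. */
    static List<String> rank(Map<String, Double> fused) {
        List<String> ids = new ArrayList<>(fused.keySet());
        ids.sort((a, b) -> {
            int byScore = Double.compare(fused.get(b), fused.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }
}
```

An entity that is only a moderate keyword match can still rank first when its embedding score is high, which is the behavior a hybrid search context is meant to allow.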

🧪 Validated Use Cases

Our GraphMemoryTest suite demonstrates:

Resolving ambiguous queries like "Chaim Azriel" into multiple candidate persons using keyword + embedding fusion.
Traversing relationships (e.g., Comment_hasCreator_Person) in follow-up rounds via contextual refinement.
Iterative context expansion across multiple search cycles — mimicking agent memory evolution.

🔮 Why This Matters

This work represents the first step toward Graphiti-inspired, relationship-aware AI memory within GeaFlow:

Instead of treating context as static text, we model it as a dynamic, evolving subgraph, enriched by both semantic similarity and topological structure.

By leveraging GeaFlow’s native streaming graph engine, we aim to go beyond batch RAG — supporting incremental updates, temporal reasoning, and multi-hop inference at low latency.

Next Steps:
We propose incubating this as the GeaFlow Memory Engine, with upcoming support for:

Graph traversal-guided re-ranking
Agent session management with episodic memory
Integration with LLM agents for autonomous reasoning

This PR sets the stage: from graph analytics to graph-native AI memory.

Let’s build the future of contextual intelligence — on streaming graphs. 🚀

Appointat

test

Appointat

Thanks for your PR. Left some comments.

Appointat · 2025-12-30T09:11:57Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/common/model/EmbeddingService.java

+        context.userSay(sentence);
+        return model.chat(context);
+    }
+


The method name singleSentence() is not accurate; it actually sends a single message and retrieves a reply.
Suggestion: rename it to chat() or sendMessage().

It has been renamed to chat().

Appointat · 2025-12-30T09:12:10Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/common/model/EmbeddingService.java

+            builder.append(json);
+        }
+        return builder.toString();
+    }


ChatRobot handles both chat and embedding, so it mixes two responsibilities.

Suggestion: consider splitting it into ChatService and EmbeddingService.

The two functions have been separated into ChatService and EmbeddingService.
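
A rough sketch of what such a split can look like is shown below. The interface names carry a `Sketch` suffix because the real `ChatService` and `EmbeddingService` signatures are not shown in this thread; the method shapes and the two toy implementations are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chat-only contract; the real ChatService may differ. */
interface ChatServiceSketch {
    String chat(String message);
}

/** Hypothetical embedding-only contract; the real EmbeddingService may differ. */
interface EmbeddingServiceSketch {
    List<double[]> embed(List<String> inputs);
}

/** Toy chat implementation that echoes the message, for tests without a model. */
final class EchoChat implements ChatServiceSketch {
    @Override
    public String chat(String message) {
        return "echo: " + message;
    }
}

/** Toy embedding implementation: one dimension holding the input length. */
final class LengthEmbedding implements EmbeddingServiceSketch {
    @Override
    public List<double[]> embed(List<String> inputs) {
        List<double[]> vectors = new ArrayList<>();
        for (String input : inputs) {
            vectors.add(new double[] {input.length()});
        }
        return vectors;
    }
}
```

With separate contracts, a caller such as an index store only needs the embedding interface and can be tested with a stub that never touches a chat model.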

Appointat · 2025-12-30T09:12:29Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/common/model/EmbeddingService.java

+        public EmbeddingResult(String input, double[] embedding) {
+            this.input = input;
+            this.embedding = embedding;
+        }


What about merging it into "EmbeddingResponse"?

EmbeddingResponse and EmbeddingResult have no fields in common and serve different purposes, so they are not suitable for merging.

Appointat · 2025-12-30T09:34:17Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/common/model/ModelInfo.java

+
+package org.apache.geaflow.ai.common.model;
+
+public class ModelInfo {


Renaming this to ModelConfig may be better; that is the more conventional name.

Yes, it's now renamed to ModelConfig.

Appointat · 2025-12-30T09:39:21Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/common/model/Response.java

+import java.util.List;
+
+
+public class Response {


If you need an API response that is compatible with the OpenAI/Gemini API, please import these variables. One point in this code confuses me: why is choice needed? If Response is meant to be a generic class (not only for LLM responses), then I think usage and choice may not be necessary; this information could be stored as metadata attributes of the message.

choice is part of the interface and must be retained. Currently, Response is not yet a generic class, so it only converts the model's response output.

Appointat · 2025-12-30T11:20:25Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/index/vector/TraversalVector.java

+    @Override
+    public String toString() {
+
+        StringBuilder builder = new StringBuilder();


    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("TraversalVector{vec=");
        for (int i = 0; i < vec.length; i++) {
            if (i > 0) {
                sb.append(i % 3 == 0 ? "; " : "-");
            }
            sb.append(vec[i]);
            if (i % 3 == 2) {
                sb.append(">");
            }
        }
        return sb.append('}').toString();
    }

It has been modified accordingly.

Appointat · 2025-12-30T11:23:17Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/index/EmbeddingIndexStore.java

+        ChatRobot chatRobot = new ChatRobot();
+        chatRobot.setModelInfo(modelInfo);
+
+        final int BATCH_SIZE = 32;


Magic numbers should be extracted into constants, configuration entries, or environment variables. The other magic numbers should be changed as well.

Currently, model.Constants and GraphMemoryConfigKeys have been added, and all necessary literal values have been moved and declared within them.
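
As a sketch of the pattern discussed here, the batch size can be read from configuration with a named default instead of a literal. The class `BatchConfigSketch` and the key string below are hypothetical; only the default value 32 comes from the diff, and the actual entries in `GraphMemoryConfigKeys` are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a named default plus a config override for the batch size. */
final class BatchConfigSketch {

    /** Hypothetical config key; the real key name is not shown in this PR thread. */
    static final String BATCH_SIZE_KEY = "geaflow.ai.embedding.batch.size";

    /** Default taken from the literal 32 in the reviewed diff. */
    static final int DEFAULT_BATCH_SIZE = 32;

    /** Returns the configured batch size, or the default when the key is absent. */
    static int batchSize(Map<String, String> config) {
        String raw = config.get(BATCH_SIZE_KEY);
        if (raw == null) {
            return DEFAULT_BATCH_SIZE;
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException("batch size must be positive: " + value);
        }
        return value;
    }

    /** Splits items into consecutive batches of at most the given size. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return batches;
    }
}
```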

Appointat · 2025-12-30T11:25:04Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/graph/io/CsvFileReader.java

@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one


I have not reviewed the io folder yet.

Appointat · 2025-12-30T11:27:25Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/operator/SessionOperator.java

+
+    private GraphSearchStore initSearchStore(Map<GraphEntity, List<IVector>> entityIndexMap) {
+        GraphSearchStore searchStore = new GraphSearchStore();
+        for (Map.Entry<GraphEntity, List<IVector>> entry : entityIndexMap.entrySet()) {


Do we rebuild the in-memory index on every search (GraphSearchStore includes Lucene)? Is there a better approach, for example clearing and reinitializing the GraphSearchStore instead of rebuilding it?

Unfortunately, there is no better solution at the moment. The algorithm requires that during each iteration, the Operator searches within the potential traversal area. Using the previous GraphSearchStore—whether based on the previous iteration or the entire graph—would inevitably lead to an expanding search scope. Fortunately, though, the retrieval area is not particularly large in the graph.
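
The scoping argument can be pictured with a toy index that is rebuilt over the current candidate set only. This is a simplified stand-in that does not use Lucene; `ScopedIndexSketch` and its whitespace tokenization are assumptions for illustration and do not reflect how `GraphSearchStore` is implemented.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a per-iteration search store; it does not use Lucene. */
final class ScopedIndexSketch {

    /** Builds a token -> entity-id index over the given candidates only. */
    static Map<String, Set<String>> build(Map<String, String> candidateTexts) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, String> entry : candidateTexts.entrySet()) {
            String[] tokens = entry.getValue().toLowerCase(Locale.ROOT).split("\\s+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, t -> new TreeSet<>()).add(entry.getKey());
                }
            }
        }
        return index;
    }

    /** Returns the ids whose text contains the token; empty when nothing matches. */
    static Set<String> search(Map<String, Set<String>> index, String token) {
        return index.getOrDefault(token.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}
```

Because the index holds only the entities passed to `build`, a match outside the current traversal area cannot be returned, which is the property that reusing a wider store would lose.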

Appointat · 2025-12-30T11:29:05Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/operator/EmbeddingOperator.java

+    public EmbeddingOperator(GraphAccessor accessor, IndexStore store) {
+        this.graphAccessor = Objects.requireNonNull(accessor);
+        this.indexStore = Objects.requireNonNull(store);
+        this.threshold = 0.50;


Should this be placed in the configuration or, as another option, exposed as a configurable parameter?

Ditto; this has already been changed.
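
A minimal sketch of a configurable threshold follows. The default of 0.50 is the literal from the diff; the class `SimilarityThresholdSketch`, the use of cosine similarity as the score, and the accepted range are assumptions, since the thread does not show how `EmbeddingOperator` computes or compares its scores.

```java
/** Hypothetical sketch of a similarity filter with a configurable threshold. */
final class SimilarityThresholdSketch {

    /** Default taken from the literal 0.50 in the reviewed diff. */
    static final double DEFAULT_THRESHOLD = 0.50;

    private final double threshold;

    SimilarityThresholdSketch() {
        this(DEFAULT_THRESHOLD);
    }

    SimilarityThresholdSketch(double threshold) {
        // Cosine similarity lies in [-1, 1], so reject values outside that range.
        if (Double.isNaN(threshold) || threshold < -1.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold out of range: " + threshold);
        }
        this.threshold = threshold;
    }

    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** True when the candidate is at least as similar as the threshold requires. */
    boolean accepts(double[] query, double[] candidate) {
        return cosine(query, candidate) >= threshold;
    }
}
```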

kitalkuyo-gita · 2026-01-14T04:00:30Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/session/SessionManagement.java

+    }
+
+    public void setSubGraph(String sessionId, List<SubGraph> subGraphs) {
+        this.session2Graphs.put(sessionId, subGraphs);


SessionManagement.createSession(String) only writes the time to session2ActiveTime but does not create the corresponding empty list in session2Graphs. GraphMemoryServer.verbalize directly calls sessionManagement.getSubGraph(sessionId); if subGraphList is null when calling new ArrayList<>(subGraphList.size()), it will throw a NullPointerException.

It is recommended to call session2Graphs.put(sessionId, new ArrayList<>()) when createSession(String) and createSession() are successful.
Also, change getSubGraph to return a non-null value.

Sample:

    // Switch the maps to concurrent implementations
    private final ConcurrentMap<String, Long> session2ActiveTime = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, List<SubGraph>> session2Graphs = new ConcurrentHashMap<>();

    public boolean createSession(String sessionId) {
        if (sessionId == null) {
            return false;
        }
        Long prev = session2ActiveTime.putIfAbsent(sessionId, System.nanoTime());
        if (prev != null) {
            return false;
        }
        // Initialize the subgraphs as a mutable empty list to avoid an NPE
        session2Graphs.putIfAbsent(sessionId, new ArrayList<>());
        return true;
    }

    public String createSession() {
        String sessionId = Constants.PREFIX_TMP_SESSION + System.nanoTime()
            + UUID.randomUUID().toString().replace("-", "").substring(0, 8);
        return createSession(sessionId) ? sessionId : null;
    }

    // Return a List that is never null (protects callers from an NPE)
    public List<SubGraph> getSubGraph(String sessionId) {
        List<SubGraph> l = this.session2Graphs.get(sessionId);
        return l == null ? new ArrayList<>() : l;
    }

    public void setSubGraph(String sessionId, List<SubGraph> subGraphs) {
        // Safety: make sure the key exists in the map
        this.session2Graphs.put(sessionId, subGraphs == null ? new ArrayList<>() : subGraphs);
    }

kitalkuyo-gita · 2026-01-14T05:49:28Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/operator/SearchUtils.java

+     * @return an unmodifiable set of ignored characters
+     */
+    private static Set<Character> buildIgnoredChars() {
+        Set<Character> ignored = new HashSet<>(EXCLUDED_CHARS);


It is recommended not to include EXCLUDED_CHARS in IGNORE_CHARS, as this may cause errors in SearchStore query string construction and semantic filtering.

For example, SubgraphSemanticPromptFunction.verbalize filters strings using .filter(str -> !SearchUtils.isAllAllowedChars(str)). An incorrect set of allowed characters will lead to incorrect filtering behavior (strings that should be kept are removed, and strings that should be removed are kept).

Sample:

    // SearchUtils.java: fix buildIgnoredChars()
    private static final Set<Character> EXCLUDED_CHARS = new HashSet<>(Arrays.asList(
        '*', '#', '-', '?', '`', '{', '}', '[', ']', '(', ')', '>', '<', ':', '/', '.'
    ));
    private static final Set<Character> IGNORE_CHARS = buildIgnoredChars();

    private static Set<Character> buildIgnoredChars() {
        Set<Character> allowed = new HashSet<>();
        // Add English letters (lower and upper case)
        for (char c = 'a'; c <= 'z'; c++) {
            allowed.add(c);
        }
        for (char c = 'A'; c <= 'Z'; c++) {
            allowed.add(c);
        }
        // Add digits
        for (char c = '0'; c <= '9'; c++) {
            allowed.add(c);
        }
        // Add common safe characters (space, underscore, etc.);
        // '-' is left out because it is in EXCLUDED_CHARS
        allowed.add(' ');
        allowed.add('_');
        allowed.add('@');
        allowed.add('+');
        allowed.add('!');
        allowed.add('$');
        allowed.add('%');
        allowed.add('&');
        allowed.add('=');
        allowed.add('~');
        // Do not add EXCLUDED_CHARS!
        return Collections.unmodifiableSet(allowed);
    }

kitalkuyo-gita · 2026-01-14T05:50:31Z

geaflow-ai/src/main/java/org/apache/geaflow/ai/operator/SessionOperator.java

+                }
+            }
+            //recall compute
+            GraphSearchStore searchStore = initSearchStore(extendEntityIndexMap);


Perhaps it would be better to explicitly call writer.commit() (or close()) in initSearchStore after all the addDoc operations are completed?

Leomrlin added 18 commits November 19, 2025 19:18

init dcp code

7d8832f

Merge remote-tracking branch 'origin/master' into dev_init_dcp

1b10d23

add lucene search

73b97d1

add prompt formatter

0da962e

add test case

18b359f

handle ldbc id conflict

3bd80f0

support llm

4c1aa15

support embedding index store

e0e983a

add embedding op

b945221

refine test case

3253a0e

delete test data

1b2fe59

add MockChatRobot

5ee48a1

fix checkstyle

a127c4b

Merge remote-tracking branch 'origin/master' into dev_init_dcp

ce2fde1

fix pom

8ccc524

fix finishReason

453dfd9

fix ci tests

a80cc46

fix ci tests

bb12777

yaozhongq requested a review from cbqiao December 29, 2025 07:52

Appointat reviewed Dec 30, 2025

Appointat suggested changes Dec 30, 2025

Leomrlin added 2 commits January 6, 2026 16:11

fix comments

3975294

fix codestyle

adffdd9

Leomrlin changed the title ~~feat(dsl): Adding Lucene & Embedding-Based Search Operators to Apache GeaFlow (incubating) for Lightweight Context Memory~~ feat(ai): Adding Lucene & Embedding-Based Search Operators to Apache GeaFlow (incubating) for Lightweight Context Memory Jan 6, 2026

Leomrlin added 2 commits January 7, 2026 15:16

support mutable graph

65835ce

Merge remote-tracking branch 'origin/master' into dev_init_dcp

9f48bf1

kitalkuyo-gita reviewed Jan 14, 2026
