
(WIP): Refactor Knowledge#12340

Draft
eeee0717 wants to merge 94 commits into CherryHQ:v2 from eeee0717:v2-knowledge-backend

Conversation

@eeee0717
Collaborator

@eeee0717 eeee0717 commented Jan 7, 2026

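Task progress

  • Knowledge table design & Data API implementation
  • Embedjs -> Vectorstores (basic functionality tested)
  • Switch the knowledge base UI to the Data API
  • Knowledge base queue design & implementation
  • Some renderer components/hooks still use Redux
  • Replace embedding ✅ and rerank model calls with the AI SDK (requires updating to AI SDK v6)
  • Migrate model/provider retrieval
  • Remove the knowledge adapter
  • Document preprocessing -> obtain from the preprocess provider
  • Testing
  • Expose the API service
    • Documentation design (Knowledge Facade)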

Add SQLite schema and TypeScript types for the knowledge module migration
from Redux/Dexie to the v2 Data API architecture:

- Add knowledge_base and knowledge_item table schemas with proper indexes
- Add KnowledgeItemData discriminated union types for type-safe item data
- Add FileMetadata type to shared package
- Include design documentation for the migration approach

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
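
For readers unfamiliar with the pattern, the "discriminated union" mentioned in the commit message can be sketched roughly as follows. This is only an illustration: the variant names (`file`, `url`, `note`), the `FileMetadata` fields, and the `describeItem` helper are assumptions made for the example, not the types actually added in this PR.

```typescript
// Assumed fields for illustration; the real FileMetadata in the shared package may differ.
interface FileMetadata {
  id: string
  name: string
  path: string
  size: number
}

// Hypothetical variants; the `type` field is the discriminant.
type KnowledgeItemData =
  | { type: 'file'; file: FileMetadata }
  | { type: 'url'; url: string }
  | { type: 'note'; content: string }

// Hypothetical helper: the compiler narrows `data` inside each case,
// and the `never` assignment fails to compile if a variant is left unhandled.
function describeItem(data: KnowledgeItemData): string {
  switch (data.type) {
    case 'file':
      return `file:${data.file.name}`
    case 'url':
      return `url:${data.url}`
    case 'note':
      return `note:${data.content.length} chars`
    default: {
      const unreachable: never = data
      return unreachable
    }
  }
}
```

The benefit of this shape is that adding a new item kind forces every `switch` over `data.type` to be updated before the code compiles.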
@eeee0717 eeee0717 changed the title from "(WIP): Refactor Knowledge Backend" to "(WIP): Refactor Knowledge" Jan 7, 2026
@CherryHQ CherryHQ deleted a comment from eeee0717 Jan 7, 2026
@0xfullex
Collaborator

0xfullex commented Jan 7, 2026

Note

This comment was translated by Claude.

  • File comments should all be in English
  • I see the documentation is placed in the knowledge directory instead of the docs directory? What's the reason?

knowledgeBase table

  • modelId is TBD for now, we can define it like this first and migrate later if needed
  • What is preprocessProvider for?
  • config field: In database design, we generally don't store data as JSON unless it is complex and highly variable, or unimportant. config holds core parameters and is relatively simple, right? So it should be split into separate columns
  • Why create an index on updatedAt? Do we frequently need to sort by updatedAt in business logic?

knowledgeItem table

  • id doesn't seem to need uuidv7; this isn't an entity as time-sensitive as messages or orders (when the IDs of such entities are in chronological order, lookups by ID are faster)
  • progress/stage: What kind of state is this? Does our processing of knowledge depend on this state in order to continue? If not, it may not belong in the database
  • .on(t.baseId, t.updatedAt): will we frequently sort by these two? Also, with a composite index, the leading column baseId doesn't need an additional separate index


@DeJeune
Collaborator

DeJeune commented Jan 7, 2026

Note

This comment was translated by Claude.

Is the vectorstore migration part of this PR?



@eeee0717
Collaborator Author

eeee0717 commented Jan 8, 2026

Note

This comment was translated by Claude.

@0xfullex Updated

  • All comments have been changed to English
  • Documentation moved to v2-refactor-temp/docs/knowledge/ directory

knowledge_base table:

  • config JSON field split into independent columns: chunkSize, chunkOverlap, threshold
  • Removed knowledge_base_updated_at_idx index
  • preprocessProviderId is used to look up the document preprocessing provider; this part needs input from @EurFelux

knowledge_item table:

  • id changed to uuidPrimaryKey(), no longer using uuidv7
  • Removed stage and progress fields, merged into status
    • New ItemStatus: 'idle' | 'pending' | 'preprocessing' | 'embedding' | 'completed' | 'failed'
    • Progress info is pushed to UI in real-time via IPC events, not persisted to database
  • Removed all indexes (including redundant base_id_idx), only keeping check constraints
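
A minimal sketch of the status/progress split described above, assuming a hypothetical progress event shape. Only the `ItemStatus` values come from this comment; `ItemProgressEvent`, `toPersistedStatus`, and `isTerminal` are names invented for illustration.

```typescript
// ItemStatus values are taken from the comment above.
type ItemStatus = 'idle' | 'pending' | 'preprocessing' | 'embedding' | 'completed' | 'failed'

// Hypothetical shape of a transient progress event sent to the UI over IPC.
interface ItemProgressEvent {
  itemId: string
  status: ItemStatus
  progress: number // 0..100, UI-only, never written to the database
}

// Only the coarse status is persisted; the fine-grained progress number is dropped.
function toPersistedStatus(event: ItemProgressEvent): { id: string; status: ItemStatus } {
  return { id: event.itemId, status: event.status }
}

// Hypothetical helper: a terminal status means no further processing is pending.
function isTerminal(status: ItemStatus): boolean {
  return status === 'completed' || status === 'failed'
}
```

Keeping progress out of the row means a crash mid-embedding leaves at most a stale `status`, which is simpler to recover from than a stale `stage`/`progress` pair.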


@eeee0717
Collaborator Author

eeee0717 commented Jan 8, 2026

Note

This comment was translated by Claude.

@DeJeune Yes, it will be submitted later, still testing now



@0xfullex 0xfullex added the v2 label Jan 8, 2026
@0xfullex 0xfullex added this to the v2.0.0 milestone Jan 8, 2026
eeee0717 and others added 30 commits January 16, 2026 14:07
- Replace `store.getState().knowledge.bases` with `dataApiService.get('/knowledge-bases')`
- Replace `window.api.knowledgeBase.search/rerank` with `dataApiService.get('/knowledge-bases/:id/search')`
- Move threshold filtering and documentCount limiting from renderer to KnowledgeBaseService (server-side)
- Add `threshold` and `documentCount` params to KnowledgeSearchRequest schema
- Remove unused imports (DEFAULT_KNOWLEDGE_DOCUMENT_COUNT, estimateTextTokens)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
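
The server-side threshold filtering and documentCount limiting mentioned in this commit message could look roughly like the sketch below. The `SearchResult` shape and the `applySearchLimits` name are assumptions for illustration, as is the convention that a higher `score` means a more relevant result; the actual KnowledgeBaseService code may differ.

```typescript
// Hypothetical result shape; the real search result type may carry more fields.
interface SearchResult {
  pageContent: string
  score: number // assumed: higher means more relevant
}

// Drop results below `threshold`, rank by score, then keep at most `documentCount`.
// Both parameters are optional, mirroring optional request params.
function applySearchLimits(
  results: SearchResult[],
  threshold?: number,
  documentCount?: number
): SearchResult[] {
  const filtered = threshold === undefined ? results : results.filter((r) => r.score >= threshold)
  // Copy before sorting so the caller's array is not mutated.
  const ranked = [...filtered].sort((a, b) => b.score - a.score)
  return documentCount === undefined ? ranked : ranked.slice(0, documentCount)
}
```

Doing this in the service rather than in the renderer means every caller of the search endpoint gets the same filtering, and fewer results cross the process boundary.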


3 participants