
v2 file processing db schema#13328

Open
eeee0717 wants to merge 6 commits into CherryHQ:v2 from eeee0717:v2-fileprocessing-db

Conversation

@eeee0717
Collaborator

@eeee0717 eeee0717 commented Mar 9, 2026

@EurFelux Regarding the earlier discussion about different features having different endpoints: I checked paddle's documentation, and there is now asynchronous parsing with a single unified URL for multiple models: https://paddleocr.aistudio-app.com/api/v2/ocr/jobs. You only need to change the model to call a different parsing model. So far I haven't seen any other provider that uses different URLs for different endpoints.

@eeee0717 eeee0717 requested a review from a team March 9, 2026 07:57
@eeee0717 eeee0717 requested a review from 0xfullex as a code owner March 9, 2026 07:57
@eeee0717 eeee0717 requested a review from DeJeune March 9, 2026 07:58
@eeee0717 eeee0717 changed the title feat: add file processing db schema and handler v2 file processing db schema Mar 9, 2026
@eeee0717 eeee0717 added this to the v2.0.0 milestone Mar 9, 2026
@eeee0717 eeee0717 added the v2 label Mar 9, 2026
@EurFelux
Collaborator

EurFelux commented Mar 9, 2026


Should we consider compatibility with paddle's old version?



@eeee0717
Collaborator Author

eeee0717 commented Mar 9, 2026

Should we consider compatibility with paddle's old version?


The keys are consistent between the new and old versions. When migrating, you only need to migrate the keys; there's no need to migrate the URL.



@EurFelux
Collaborator

EurFelux commented Mar 9, 2026


I mean the old version of the paddleOCR service probably still uses different URLs for different services, right? Of course, we can also only support the new version, which would be more convenient for design.



@eeee0717
Collaborator Author

eeee0717 commented Mar 9, 2026

I mean the old version of the paddleOCR service probably still uses different URLs for different services, right? Of course, we can also only support the new version, which would be more convenient for design.


This doesn't affect anything, the key is the same. Previously, URLs were used to distinguish which model was being used, but now models are distinguished by modelId.



@eeee0717
Collaborator Author

eeee0717 commented Mar 12, 2026


@EurFelux Are there any other places in this design that need to be changed?



@DeJeune
Collaborator

DeJeune commented Mar 12, 2026


This needs some cleanup; it looks like there are a lot of merge commits.



@EurFelux
Collaborator

EurFelux commented Mar 12, 2026


The issue with commit 8934fbf is that it introduced many unrelated changes.



Collaborator

@DeJeune DeJeune left a comment


Review Summary

Well-structured PR that establishes the file processing data layer for v2. The schema design (templates + overrides), migration logic, and service layer are well thought out. Good test coverage across all new modules.

Critical / Bug (1)

  • updateProcessor validation order: getPresetById() is called after preferenceService.set(), meaning invalid processor IDs corrupt preferences before the error is thrown.

Significant (2)

  • Runtime parse at import time: void FileProcessorTemplatesSchema.parse(PRESETS_FILE_PROCESSORS) will crash the app at startup if validation fails. There's already a test for this; consider making the import-time assertion softer or at least documenting the intent.
  • Merged schema strict mode mismatch: FileProcessorMergedSchema inherits strict capability schemas from the template, but mergeProcessorConfig spreads override fields that aren't in the strict schema. This would fail re-validation.

Minor / Nit (3)

  • Empty options: {} on every merge. mergeProcessorOverrides always creates an empty options object even when neither side has options.
  • Metadata schema inconsistency: FileProcessorMetadataSchema (z.never() values) vs CapabilityMetadataSchema (z.unknown() values) is confusing without explanation.
  • Dual source of truth — Plain TS types in preferenceTypes.ts duplicate Zod schemas in file-processing.ts; consider deriving one from the other.

Positives

  • Clean separation of concerns: templates (read-only presets) vs overrides (user config) vs merged (API responses)
  • Thorough migration logic with good edge case handling (paddleocr special-casing, preset default skipping, empty pruning)
  • Solid test coverage for migration, service, and schema validation
  • Good use of discriminated unions for capability types
  • CI is green ✅

Comment on lines +73 to +82
public async updateProcessor(id: FileProcessorId, updates: FileProcessorOverride): Promise<FileProcessorMerged> {
  const overrides = this.getOverrides()
  const nextOverrides: FileProcessorOverrides = {
    ...overrides,
    [id]: mergeProcessorOverrides(overrides[id], updates)
  }

  this.getPresetById(id)

  await preferenceService.set('file_processing.overrides', nextOverrides)
Collaborator


Bug: getPresetById(id) is called after preferenceService.set(). If the processor ID doesn't exist in presets, the preference is already written with invalid data before the notFound error is thrown.

Move the validation before the write:

public async updateProcessor(id: FileProcessorId, updates: FileProcessorOverride): Promise<FileProcessorMerged> {
  this.getPresetById(id) // validate first
  const overrides = this.getOverrides()
  const nextOverrides: FileProcessorOverrides = {
    ...overrides,
    [id]: mergeProcessorOverrides(overrides[id], updates)
  }

  await preferenceService.set('file_processing.overrides', nextOverrides)
  // ...
}

Comment on lines +35 to +47
function mergeProcessorOverrides(
  current?: FileProcessorOverride,
  updates?: FileProcessorOverride
): FileProcessorOverride {
  return {
    ...current,
    ...updates,
    capabilities: mergeCapabilityOverrides(current?.capabilities, updates?.capabilities),
    options: {
      ...current?.options,
      ...updates?.options
    }
  }
Collaborator


Nit: mergeProcessorOverrides always produces options: {} even when both current and updates have no options, since spreading two undefineds into an object yields {}. This means every update will persist an empty options field unnecessarily.

Consider guarding:

function mergeProcessorOverrides(
  current?: FileProcessorOverride,
  updates?: FileProcessorOverride
): FileProcessorOverride {
  const mergedOptions = current?.options || updates?.options
    ? { ...current?.options, ...updates?.options }
    : undefined

  return {
    ...current,
    ...updates,
    capabilities: mergeCapabilityOverrides(current?.capabilities, updates?.capabilities),
    ...(mergedOptions !== undefined && { options: mergedOptions })
  }
}

export type FileProcessorInput = FeatureCapability['inputs'][number]

/**
* Output type
Collaborator


Nit: FileProcessorMetadataSchema uses z.record(z.string(), z.never()) which only allows {}, while CapabilityMetadataSchema on line 113 uses z.record(z.string(), z.unknown()) which allows anything. This inconsistency is confusing — the comment says "reserved for future use" but one schema is locked down and the other is wide open. Consider aligning them or adding a brief comment explaining the intentional difference.

}
] as const satisfies readonly FileProcessorTemplate[]

void FileProcessorTemplatesSchema.parse(PRESETS_FILE_PROCESSORS)
Collaborator


Significant: This void FileProcessorTemplatesSchema.parse(...) runs at module import time and will crash the entire app at startup if validation fails. While the intent (catching preset data bugs early) is good, this is risky in production. Consider:

  1. Wrapping in a try/catch that logs the error but doesn't crash, or
  2. Moving this to a test-only assertion (there's already a test for this: 'validates built-in presets'), or
  3. At minimum, adding a comment explaining the intentional crash-on-import behavior so future contributors don't accidentally break startup.
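
A rough sketch of option 1: validate at import time but log instead of crashing. The stub schema and preset data here are hypothetical stand-ins for the PR's real FileProcessorTemplatesSchema and PRESETS_FILE_PROCESSORS:

```typescript
// Hypothetical stand-in for the real Zod schema: parse() throws on invalid data.
const FileProcessorTemplatesSchemaStub = {
  parse(data: unknown): unknown {
    if (!Array.isArray(data)) throw new Error('presets must be an array')
    return data
  }
}

// Hypothetical preset data standing in for PRESETS_FILE_PROCESSORS.
const PRESETS_STUB: unknown = [{ id: 'tesseract' }]

// Validate presets at import time, but log the failure instead of crashing the app.
let presetValidationError: Error | undefined
try {
  FileProcessorTemplatesSchemaStub.parse(PRESETS_STUB)
} catch (error) {
  presetValidationError = error as Error
  console.error('Built-in preset validation failed:', error)
}
```

This keeps the early signal (the error still surfaces in logs and in the existing test) without turning a bad preset into a startup crash.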

Comment on lines +259 to +261
output: 'markdown',
apiHost: 'https://paddleocr.aistudio-app.com/',
modelId: 'PaddleOCR-VL-1.5',
Collaborator


Significant: The FileProcessorMergedSchema extends FileProcessorTemplateSchema which inherits the .strict() constraint. However, the merged type adds apiKeys and options via .extend() — that's fine. But the capabilities array still uses the strict FeatureCapabilitySchema from the template.

When mergeProcessorConfig in FileProcessingService spreads override fields into capabilities:

capabilities: preset.capabilities.map((capability) => ({
  ...capability,
  ...override?.capabilities?.[capability.feature]
}))

The spread can add extra fields (from CapabilityOverride) that aren't defined in the strict capability schema. This means the merged result would fail if re-validated against FileProcessorMergedSchema. Since you're not re-validating the merged output, this works at runtime, but it's a type-level inconsistency worth noting. Consider either:

  • Relaxing the capability schema in the merged variant, or
  • Explicitly picking only known fields from the override when merging
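
A minimal sketch of the second option, explicit field picking. The types here are hypothetical stand-ins for the PR's CapabilityOverride and capability shapes:

```typescript
// Hypothetical stand-ins for the PR's override and capability types.
type CapabilityOverride = {
  apiHost?: string
  modelId?: string
  metadata?: Record<string, unknown>
}

type Capability = {
  feature: string
  apiHost: string
  modelId: string
}

// Copy each known field explicitly instead of spreading the whole override,
// so extra override fields can never leak into the strict capability shape.
function applyCapabilityOverride(capability: Capability, override?: CapabilityOverride): Capability {
  return {
    ...capability,
    ...(override?.apiHost !== undefined && { apiHost: override.apiHost }),
    ...(override?.modelId !== undefined && { modelId: override.modelId })
  }
}
```

With this shape, the merged capability remains a valid instance of the strict schema by construction.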

Comment on lines +108 to +145
export const FILE_PROCESSOR_TYPES = ['api', 'builtin'] as const

export type FileProcessorType = (typeof FILE_PROCESSOR_TYPES)[number]

export const FILE_PROCESSOR_FEATURES = ['text_extraction', 'markdown_conversion'] as const

export type FileProcessorFeature = (typeof FILE_PROCESSOR_FEATURES)[number]

export const FILE_PROCESSOR_IDS = [
  'tesseract',
  'system',
  'paddleocr',
  'ovocr',
  'mineru',
  'doc2x',
  'mistral',
  'open-mineru'
] as const

export type FileProcessorId = (typeof FILE_PROCESSOR_IDS)[number]

export type FileProcessorOptions = Record<string, unknown>

export type CapabilityOverride = {
  apiHost?: string
  modelId?: string
  metadata?: Record<string, unknown>
}

export type FileProcessorCapabilityOverrides = Partial<Record<FileProcessorFeature, CapabilityOverride>>

export type FileProcessorOverride = {
  apiKeys?: string[]
  capabilities?: FileProcessorCapabilityOverrides
  options?: FileProcessorOptions
}

export type FileProcessorOverrides = Partial<Record<FileProcessorId, FileProcessorOverride>>
Collaborator


Nit: Agreeing with @EurFelux's earlier comment: since there are already Zod schemas defined in file-processing.ts for these same types (e.g., FileProcessorOverrideSchema, CapabilityOverrideSchema), having plain TypeScript type definitions here creates a dual-source-of-truth risk. If either side drifts, the types will silently diverge. Consider deriving these types from the Zod schemas using `z.infer` (or vice versa) to keep them in sync, similar to how `FileProcessorTemplate` is already derived.

