Conversation
src/main/data/migration/v2/migrators/mappings/FileProcessingOverrideMappings.ts
Show resolved
Hide resolved
|
Note This comment was translated by Claude. Should we consider compatibility with paddle's old version? Original Content是否需要考虑兼容paddle旧版本? |
Note This comment was translated by Claude. The keys are consistent between the new and old versions. When migrating, you only need to migrate the keys; there's no need to migrate the URL. Original Content新旧版本key是一致的,迁移的时候只需要迁移key,不用迁移url即可 |
|
Note This comment was translated by Claude. I mean the old version of the paddleOCR service probably still uses different URLs for different services, right? Of course, we can also only support the new version, which would be more convenient for design. Original Content我的意思是paddleOCR服务的旧版本,应该还是给不同的服务配置了不同的URL吧。当然我们也可以只支持新版本,这样设计起来更方便一些。 |
Note This comment was translated by Claude. This doesn't affect anything, the key is the same. Previously, URLs were used to distinguish which model was being used, but now models are distinguished by modelId. Original Content这个不影响,key是同一个,原先是用url来区分使用的是什么模型,现在模型用modelId来区分 |
|
Note This comment was translated by Claude. It needs to be cleaned up a bit, it seems there are many merged commits. Original Content需要清理一下,看来有很多合并过的提交 |
Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
DeJeune
left a comment
There was a problem hiding this comment.
Review Summary
Well-structured PR that establishes the file processing data layer for v2. The schema design (templates + overrides), migration logic, and service layer are well thought out. Good test coverage across all new modules.
Critical / Bug (1)
updateProcessorvalidation order —getPresetById()is called afterpreferenceService.set(), meaning invalid processor IDs corrupt preferences before the error is thrown.
Significant (2)
- Runtime parse at import time —
void FileProcessorTemplatesSchema.parse(PRESETS_FILE_PROCESSORS)will crash the app at startup if validation fails. There's already a test for this; consider making the import-time assertion softer or at least documenting the intent. - Merged schema strict mode mismatch —
FileProcessorMergedSchemainherits strict capability schemas from the template, butmergeProcessorConfigspreads override fields that aren't in the strict schema. This would fail re-validation.
Minor / Nit (3)
- Empty
options: {}on every merge —mergeProcessorOverridesalways creates an empty options object even when neither side has options. - Metadata schema inconsistency —
FileProcessorMetadataSchema(z.never()values) vsCapabilityMetadataSchema(z.unknown()values) is confusing without explanation. - Dual source of truth — Plain TS types in
preferenceTypes.tsduplicate Zod schemas infile-processing.ts; consider deriving one from the other.
Positives
- Clean separation of concerns: templates (read-only presets) vs overrides (user config) vs merged (API responses)
- Thorough migration logic with good edge case handling (paddleocr special-casing, preset default skipping, empty pruning)
- Solid test coverage for migration, service, and schema validation
- Good use of discriminated unions for capability types
- CI is green ✅
| public async updateProcessor(id: FileProcessorId, updates: FileProcessorOverride): Promise<FileProcessorMerged> { | ||
| const overrides = this.getOverrides() | ||
| const nextOverrides: FileProcessorOverrides = { | ||
| ...overrides, | ||
| [id]: mergeProcessorOverrides(overrides[id], updates) | ||
| } | ||
|
|
||
| this.getPresetById(id) | ||
|
|
||
| await preferenceService.set('file_processing.overrides', nextOverrides) |
There was a problem hiding this comment.
Bug: getPresetById(id) is called after preferenceService.set(). If the processor ID doesn't exist in presets, the preference is already written with invalid data before the notFound error is thrown.
Move the validation before the write:
public async updateProcessor(id: FileProcessorId, updates: FileProcessorOverride): Promise<FileProcessorMerged> {
this.getPresetById(id) // validate first
const overrides = this.getOverrides()
const nextOverrides: FileProcessorOverrides = {
...overrides,
[id]: mergeProcessorOverrides(overrides[id], updates)
}
await preferenceService.set('file_processing.overrides', nextOverrides)
// ...
}| function mergeProcessorOverrides( | ||
| current?: FileProcessorOverride, | ||
| updates?: FileProcessorOverride | ||
| ): FileProcessorOverride { | ||
| return { | ||
| ...current, | ||
| ...updates, | ||
| capabilities: mergeCapabilityOverrides(current?.capabilities, updates?.capabilities), | ||
| options: { | ||
| ...current?.options, | ||
| ...updates?.options | ||
| } | ||
| } |
There was a problem hiding this comment.
Nit: mergeProcessorOverrides always produces options: {} even when both current and updates have no options, since spreading two undefineds into an object yields {}. This means every update will persist an empty options field unnecessarily.
Consider guarding:
function mergeProcessorOverrides(
current?: FileProcessorOverride,
updates?: FileProcessorOverride
): FileProcessorOverride {
const mergedOptions = current?.options || updates?.options
? { ...current?.options, ...updates?.options }
: undefined
return {
...current,
...updates,
capabilities: mergeCapabilityOverrides(current?.capabilities, updates?.capabilities),
...(mergedOptions !== undefined && { options: mergedOptions })
}
}| export type FileProcessorInput = FeatureCapability['inputs'][number] | ||
|
|
||
| /** | ||
| * Output type |
There was a problem hiding this comment.
Nit: FileProcessorMetadataSchema uses z.record(z.string(), z.never()) which only allows {}, while CapabilityMetadataSchema on line 113 uses z.record(z.string(), z.unknown()) which allows anything. This inconsistency is confusing — the comment says "reserved for future use" but one schema is locked down and the other is wide open. Consider aligning them or adding a brief comment explaining the intentional difference.
| } | ||
| ] as const satisfies readonly FileProcessorTemplate[] | ||
|
|
||
| void FileProcessorTemplatesSchema.parse(PRESETS_FILE_PROCESSORS) |
There was a problem hiding this comment.
Significant: This void FileProcessorTemplatesSchema.parse(...) runs at module import time and will crash the entire app at startup if validation fails. While the intent (catching preset data bugs early) is good, this is risky in production. Consider:
- Wrapping in a try/catch that logs the error but doesn't crash, or
- Moving this to a test-only assertion (there's already a test for this:
'validates built-in presets'), or - At minimum, adding a comment explaining the intentional crash-on-import behavior so future contributors don't accidentally break startup.
| output: 'markdown', | ||
| apiHost: 'https://paddleocr.aistudio-app.com/', | ||
| modelId: 'PaddleOCR-VL-1.5', |
There was a problem hiding this comment.
Significant: The FileProcessorMergedSchema extends FileProcessorTemplateSchema which inherits the .strict() constraint. However, the merged type adds apiKeys and options via .extend() — that's fine. But the capabilities array still uses the strict FeatureCapabilitySchema from the template.
When mergeProcessorConfig in FileProcessingService spreads override fields into capabilities:
capabilities: preset.capabilities.map((capability) => ({
...capability,
...override?.capabilities?.[capability.feature]
}))The spread can add extra fields (from CapabilityOverride) that aren't defined in the strict capability schema. This means the merged result would fail if re-validated against FileProcessorMergedSchema. Since you're not re-validating the merged output this works at runtime, but it's a type-level inconsistency worth noting. Consider either:
- Relaxing the capability schema in the merged variant, or
- Explicitly picking only known fields from the override when merging
| export const FILE_PROCESSOR_TYPES = ['api', 'builtin'] as const | ||
|
|
||
| export type FileProcessorType = (typeof FILE_PROCESSOR_TYPES)[number] | ||
|
|
||
| export const FILE_PROCESSOR_FEATURES = ['text_extraction', 'markdown_conversion'] as const | ||
|
|
||
| export type FileProcessorFeature = (typeof FILE_PROCESSOR_FEATURES)[number] | ||
|
|
||
| export const FILE_PROCESSOR_IDS = [ | ||
| 'tesseract', | ||
| 'system', | ||
| 'paddleocr', | ||
| 'ovocr', | ||
| 'mineru', | ||
| 'doc2x', | ||
| 'mistral', | ||
| 'open-mineru' | ||
| ] as const | ||
|
|
||
| export type FileProcessorId = (typeof FILE_PROCESSOR_IDS)[number] | ||
|
|
||
| export type FileProcessorOptions = Record<string, unknown> | ||
|
|
||
| export type CapabilityOverride = { | ||
| apiHost?: string | ||
| modelId?: string | ||
| metadata?: Record<string, unknown> | ||
| } | ||
|
|
||
| export type FileProcessorCapabilityOverrides = Partial<Record<FileProcessorFeature, CapabilityOverride>> | ||
|
|
||
| export type FileProcessorOverride = { | ||
| apiKeys?: string[] | ||
| capabilities?: FileProcessorCapabilityOverrides | ||
| options?: FileProcessorOptions | ||
| } | ||
|
|
||
| export type FileProcessorOverrides = Partial<Record<FileProcessorId, FileProcessorOverride>> |
There was a problem hiding this comment.
Nit: Agreeing with @EurFelux's earlier comment — since there are already Zod schemas defined in file-processing.ts for these same types (e.g., FileProcessorOverrideSchema, CapabilityOverrideSchema), having plain TypeScript type definitions here creates a dual-source-of-truth risk. If either side drifts, the types will silently diverge. Consider deriving these types from the Zod schemas using `z.infer` (or vice versa) to keep them in sync, similar to how `FileProcessorTemplate` is already done.
@EurFelux 之前讨论的不同的feature有不同的endpoint,我看了paddle的文档,现在有异步解析,多个模型都已经统一url:
https://paddleocr.aistudio-app.com/api/v2/ocr/jobs, 只需要修改model即可调用不同的解析模型,目前没看到有其他的服务商是不同endpoint有不同的url