fix(avro): use standard SchemaNormalization for canonicalization and recompute stale hashes#7550
Conversation
- …EnhancedAvroContentCanonicalizer to AvroContentCanonicalizer
- …ization Replace 130 lines of custom normalization logic in `AvroContentCanonicalizer` with Apache Avro's `SchemaNormalization.toParsingForm()`, which fixes known bugs (nullable unions without default, self-referencing maps, enum defaults) and reduces maintenance burden. Add a DB upgrader (version 102) that recomputes canonical hashes for all existing Avro content to match the new canonicalization output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- …n' into feat/avro-standard-canonicalization
- …icalHashUpgraderTest
@AndreaScarselliAx do you mind addressing the conflicts first? You'll need to bump the version of your upgrade.

Hi @carlesarnal, thanks for the feedback. I have completed the rebase.
```java
ContentWrapperDto data = ContentWrapperDto.builder()
        .content(ContentHandle.create(entity.contentEntity.contentBytes))
        .contentType(entity.contentEntity.contentType).references(references)
        .artifactType(entity.contentEntity.contentType).build();
```
This is wrong, the content type is not the same as the artifact type. The artifact type would be something like JSON or AVRO whereas the content type is the actual MIME type.
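A sketch of the fix this comment suggests, assuming the entity exposes a separate artifact-type field (the field name `artifactType` on the entity is an assumption, not confirmed by the diff):

```java
// Hypothetical corrected builder call: contentType carries the MIME type,
// artifactType carries the logical type; they must not be conflated.
ContentWrapperDto data = ContentWrapperDto.builder()
        .content(ContentHandle.create(entity.contentEntity.contentBytes))
        .contentType(entity.contentEntity.contentType)     // e.g. "application/json"
        .references(references)
        .artifactType(entity.contentEntity.artifactType)   // e.g. "AVRO" (assumed field name)
        .build();
```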
```java
return ContentWrapperDto.builder().content(ContentHandle.create(entity.contentBytes))
        .contentType(entity.contentType)
        .references(RegistryContentUtils.deserializeReferences(entity.serializedReferences))
```
Same here regarding contentType/artifactType.
```java
private final ArtifactTypeUtilProviderFactory factory = new DefaultArtifactTypeUtilProviderImpl(true);
```

```java
@Override
```
@EricWittmann what do you think about transactionality here? The db version has already been bumped, so if anything fails mid-process, some rows will be upgraded and some won't. This should be shielded behind a transaction, no?
All DB operations done during DB upgrade should be part of the same Tx - so if anything fails the entire upgrade rolls back. Is that not how it always works?
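A minimal sketch of the single-transaction shape being discussed (JDBC-style illustration only, not the actual upgrader code; `dataSource` is an assumed handle):

```java
// Run the whole hash recompute atomically: either every row is upgraded
// or none are, so a mid-process failure cannot leave mixed hashes behind.
try (Connection conn = dataSource.getConnection()) {
    conn.setAutoCommit(false);
    try {
        // for each Avro content row: recompute canonical hash, UPDATE the row
        conn.commit();
    } catch (Exception e) {
        conn.rollback();
        throw e;
    }
}
```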
@AndreaScarselliAx I'm a bit surprised that no test failed due to the error mentioned in the review. Would you be able to add a new test to cover that case? If you need some assistance, just let me know!
Summary
Replace Apicurio Registry's custom Avro schema normalization logic with Apache Avro's standard
SchemaNormalization.toParsingForm(), and add a database upgrader to recompute stale canonical hashes for existing data.
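To see why existing hashes go stale, consider that a stored canonical hash is a digest of the canonicalized schema text, so changing the canonical form changes every hash. A minimal self-contained sketch (SHA-256 and hex encoding here are illustrative; the exact hash Apicurio stores may differ):

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class StaleHashDemo {
    // Hex-encoded SHA-256 of a string: the usual shape of a content hash.
    static String sha256Hex(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(s.getBytes("UTF-8"));
        return String.format("%064x", new BigInteger(1, d));
    }

    public static void main(String[] args) throws Exception {
        // Same schema, two different canonical forms (old custom vs. new standard):
        String oldForm = "{\"type\":\"record\",\"name\":\"Application\",\"fields\":[]}";
        String newForm = "{\"name\":\"Application\",\"type\":\"record\",\"fields\":[]}";
        // Different canonical text means a different stored hash, so lookups
        // miss until the upgrader recomputes every existing row.
        System.out.println(sha256Hex(oldForm).equals(sha256Hex(newForm))); // false
    }
}
```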
Problem
In an earlier contribution, I added a custom Avro canonicalizer. Later, while running v2, we found a bug: a ccompat lookup between the following two schemas returned a 40403.
```json
{
  "type": "record",
  "name": "Application",
  "namespace": "nl.andrea.general",
  "doc": "Identification of an application",
  "fields": [
    { "name": "name", "type": "string", "doc": "The name of the application" },
    { "name": "version", "type": ["null", "string"], "doc": "(Optional) The application version" }
  ]
}
```

```json
{
  "type": "record",
  "name": "Application",
  "namespace": "nl.andrea.general",
  "doc": "Identification of an application",
  "fields": [
    { "name": "name", "type": "string", "doc": "The name of the application" },
    { "name": "version", "type": ["null", "string"], "doc": "(Optional) The application version", "default": null }
  ]
}
```

These two schemas are semantically equivalent — the only difference is the explicit `"default": null`. The custom canonicalizer was incorrectly producing different hashes for them. We found that `SchemaNormalization.toParsingForm()` from the Avro library handles this correctly out of the box, and switching to it prevents this class of bug entirely.

Changes
- `AvroContentCanonicalizer` — replaced ~130 lines of custom normalization with `SchemaNormalization.toParsingForm()`. This is the standard Avro canonical form.
- `AvroCanonicalHashUpgrader` — new DB upgrader (version 102) that recomputes canonical hashes for all existing Avro content. Without this, schemas registered before this change would retain stale hashes and still fail ccompat lookups against newly registered equivalent schemas.
- Test fixes — `AvroCompatibilityTest` was incorrectly canonicalizing schemas before passing them to the compatibility checker, which stripped aliases and defaults needed for correct compatibility evaluation. Tests now parse schemas directly, matching production behavior. One assertion was also corrected: renaming a field with an alias is backward compatible per the Avro spec.
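The new canonicalization path can be sketched as follows (a minimal illustration assuming the `org.apache.avro` dependency on the classpath; not the actual Apicurio wiring). Per the Avro spec, Parsing Canonical Form keeps only the attributes that affect decoding (name, type, fields, symbols, items, values, size), so `doc` and `"default": null` are stripped and the two schemas from the Problem section normalize identically:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class ParsingFormDemo {
    public static void main(String[] args) {
        // Trimmed versions of the two schemas above, differing only in "default".
        String withoutDefault = "{\"type\":\"record\",\"name\":\"Application\","
                + "\"namespace\":\"nl.andrea.general\",\"fields\":["
                + "{\"name\":\"version\",\"type\":[\"null\",\"string\"]}]}";
        String withDefault = "{\"type\":\"record\",\"name\":\"Application\","
                + "\"namespace\":\"nl.andrea.general\",\"fields\":["
                + "{\"name\":\"version\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

        String a = SchemaNormalization.toParsingForm(new Schema.Parser().parse(withoutDefault));
        String b = SchemaNormalization.toParsingForm(new Schema.Parser().parse(withDefault));
        System.out.println(a.equals(b)); // true: same parsing form, same hash
    }
}
```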