Problem
We currently use a hash-derived schema_id and rely on equality of that id to decide whether a schema already exists in a batch. Hashes can collide, so two different schemas can be mistaken for the same one. Also, metadata.schema_ids is emitted as semicolon-separated MD5 (hex) values, which isn’t what we want going forward.
What we want
- Use a local, per-batch auto-incrementing schema_id (0,1,2,…) for each unique schema shape inside the batch.
- Deduplicate schemas by exact schema equality, not by hash.
- Emit metadata.schema_ids as a comma-separated list of those local ids, sorted ascending. Example: 0,1,2.
- Keep the CentralBlob wire format the same otherwise.
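
A minimal Python sketch of the id assignment described above, for illustration only. The BatchSchemaRegistry class, its method names, and the representation of a schema as a tuple of (field name, type name) pairs are all assumptions and are not taken from the codebase. The real implementation should use whatever schema type and equality the batch builder already has.

```python
from typing import Dict, Hashable


class BatchSchemaRegistry:
    """Hypothetical per-batch registry mapping each unique schema to a local id."""

    def __init__(self) -> None:
        # Insertion-ordered map from schema to its local id (0, 1, 2, ...).
        self._ids: Dict[Hashable, int] = {}

    def id_for(self, schema: Hashable) -> int:
        # A dict lookup uses the hash only to find a bucket and then confirms
        # the match with ==, so two schemas whose hashes collide still get
        # distinct ids as long as they are not equal.
        existing = self._ids.get(schema)
        if existing is not None:
            return existing
        new_id = len(self._ids)  # next unused id in this batch
        self._ids[schema] = new_id
        return new_id

    def metadata_schema_ids(self) -> str:
        # Comma-separated local ids, sorted ascending, e.g. "0,1,2".
        return ",".join(str(i) for i in sorted(self._ids.values()))


# Assumed schema representation: a tuple of (field name, type name) pairs.
registry = BatchSchemaRegistry()
users = (("id", "int64"), ("name", "string"))
events = (("id", "int64"), ("ts", "timestamp"))

first = registry.id_for(users)    # new schema, gets the first id
second = registry.id_for(events)  # different schema, gets the next id
repeat = registry.id_for(users)   # same schema again, reuses its id
schema_ids = registry.metadata_schema_ids()
```

A new registry would be created for each batch, so ids always restart at 0 and are only meaningful inside that batch. In this sketch field order is part of schema equality; whether two schemas with the same fields in a different order count as the same shape is a decision this sketch does not settle.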