Investigate: Schema-automator generated schemas fail validation on mixed string/integer columns #232

@amc-corey-cox

Problem

When using schemauto generalize-tsvs to create schemas from TSV data, columns containing mixed string and integer values cause validation failures. The generated schema models these columns as string enums, but the integer values in the data are loaded as integers and fail the string type check.

Example Errors

[ERROR] 13111001 is not of type 'string', 'null' in /phv00108617
[ERROR] 25 is not of type 'string' in /phv00108343
[ERROR] 25 is not one of ['WED', 'TUESDAY', 'FRIDAY', ... '25', ...] in /phv00108343
[ERROR] 1994 is not of type 'string' in /phv00108344
[ERROR] 1994 is not one of ['APRIL', 'XMAS', ... '1994', ...] in /phv00108344

Note: In some cases the value exists in the enum as a string (e.g., '25', '1994'), but the data contains the integer form, causing a type mismatch.

Analysis

This issue spans multiple components in the validation chain:

  1. Schema-automator reads TSV → creates LinkML schema with string enums (correct - LinkML enums must be strings)
  2. LinkML generates JSON Schema from LinkML (enums are type: string with enum: [...])
  3. Validator loads TSV data → values like 25 may be parsed as integers
  4. JSON Schema validation rejects integer 25 for a string enum (correct per JSON Schema spec)
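
The chain above can be reproduced in miniature with only the standard library. This is a rough sketch, not the actual schema-automator or LinkML code: `infer_scalar` and `check_string_enum` are hypothetical helpers, the TSV content is made up, and the assumption that the loader turns numeric-looking cells into integers is taken from step 3.

```python
import csv
import io

# Made-up TSV content; the column name is borrowed from the errors above.
TSV = "phv00108343\nWED\nTUESDAY\n25\n"

def infer_scalar(cell: str):
    """Assumed loader behaviour: numeric-looking cells become ints."""
    try:
        return int(cell)
    except ValueError:
        return cell

def check_string_enum(value, permitted):
    """Mimic the two JSON Schema keywords involved: type: string, then enum."""
    errors = []
    if not isinstance(value, str):
        errors.append(f"{value!r} is not of type 'string'")
    if value not in permitted:
        errors.append(f"{value!r} is not one of {permitted!r}")
    return errors

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))

# Step 1: the schema generator sees raw text, so every enum value is a string.
permitted = sorted({row["phv00108343"] for row in rows})

# Step 3: the loader infers types, so the cell '25' arrives as the integer 25.
loaded = [infer_scalar(row["phv00108343"]) for row in rows]

# Step 4: the integer fails both the type check and the enum check.
report = {value: check_string_enum(value, permitted) for value in loaded}
for value, errors in report.items():
    print(repr(value), errors)
```

The integer `25` fails twice even though `'25'` is a permitted value, which matches the paired errors reported for /phv00108343 and /phv00108344.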

Potential Fix Points

| Component | Possible Fix | Trade-offs |
| --- | --- | --- |
| schema-automator | Detect numeric-looking strings and avoid creating enums | Can't know intent; '25' is valid in an enum |
| schema-automator | Use union types for mixed data columns | Adds schema complexity |
| linkml validator | Coerce types before enum comparison | Deviates from JSON Schema spec |
| Data loader | Coerce all values to strings based on schema | May mask real type errors |
| dm-bip preprocessing | Ensure consistent types before validation | Adds pipeline complexity |
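
As an illustration of the "Data loader" option, here is a minimal sketch of schema-driven coercion. The function name `coerce_to_declared_strings` and the `string_slots` argument are hypothetical; in practice the set of string-typed columns would have to be derived from the generated schema, which is not shown here.

```python
def coerce_to_declared_strings(row: dict, string_slots: set) -> dict:
    """Return a copy of row with non-null values in string slots cast to str.

    string_slots is assumed to hold the names of columns the schema
    declares as strings or string enums.
    """
    coerced = {}
    for name, value in row.items():
        if name in string_slots and value is not None and not isinstance(value, str):
            coerced[name] = str(value)  # e.g. 25 -> '25'
        else:
            coerced[name] = value  # leave nulls and other columns untouched
    return coerced


# Column names are taken from the example errors; "age" and all values are made up.
row = {"phv00108343": 25, "phv00108344": 1994, "phv00108617": None, "age": 42}
string_slots = {"phv00108343", "phv00108344", "phv00108617"}
fixed = coerce_to_declared_strings(row, string_slots)
```

One caveat beyond the trade-off listed in the table: if the loader has already parsed a cell such as `007` into the integer `7`, casting back yields `'7'`, so the original text is not recoverable at this stage. Reading every cell as a string in the first place avoids that loss.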

Tasks

  • Investigate where type coercion should occur
  • Check if there are existing upstream issues in linkml/schema-automator or linkml/linkml
  • Open upstream issue(s) as appropriate
  • Determine if dm-bip needs a workaround in the interim

Related

Impact

This affects validation of dbGaP datasets which commonly have columns with mixed string/integer values.
