Open
Labels
Audit Logs (Human Readable logs and trans specs)
Description
Problem
When using schemauto generalize-tsvs to create schemas from TSV data, columns containing mixed string and integer values result in validation failures. The generated schema creates string enums, but integer values in the data fail type validation.
Example Errors
[ERROR] 13111001 is not of type 'string', 'null' in /phv00108617
[ERROR] 25 is not of type 'string' in /phv00108343
[ERROR] 25 is not one of ['WED', 'TUESDAY', 'FRIDAY', ... '25', ...] in /phv00108343
[ERROR] 1994 is not of type 'string' in /phv00108344
[ERROR] 1994 is not one of ['APRIL', 'XMAS', ... '1994', ...] in /phv00108344
Note: In some cases the value exists in the enum as a string (e.g., '25', '1994'), but the data contains the integer form, causing a type mismatch.
Analysis
This issue spans multiple components in the validation chain:
- Schema-automator reads TSV → creates LinkML schema with string enums (correct - LinkML enums must be strings)
- LinkML generates JSON Schema from LinkML (enums are `type: string` with `enum: [...]`)
- Validator loads TSV data → values like `25` may be parsed as integers
- JSON Schema validation rejects integer `25` for a string enum (correct per JSON Schema spec)
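The chain above can be reproduced in miniature with the standard library alone. This is only a sketch: `infer_value` is a hypothetical stand-in for whatever type inference the data loader performs, and `check_string_enum` is a simplified imitation of a JSON Schema `type: string` plus `enum` check, not the actual LinkML or JSON Schema validator code.

```python
import csv
import io

# A one-column TSV mixing text and a numeric-looking value.
TSV = "phv00108343\nWED\nTUESDAY\n25\n"


def infer_value(raw: str):
    """Hypothetical loader behaviour: numeric-looking cells become ints."""
    try:
        return int(raw)
    except ValueError:
        return raw


def check_string_enum(value, permitted):
    """Simplified stand-in for JSON Schema `type: string` plus `enum`."""
    errors = []
    if not isinstance(value, str):
        errors.append(f"{value!r} is not of type 'string'")
    if value not in permitted:
        errors.append(f"{value!r} is not one of {permitted}")
    return errors


rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))

# Schema generation sees the raw text, so every permissible value is a string.
permitted = sorted({row["phv00108343"] for row in rows})

# Validation sees the inferred values, so 25 arrives as an int.
errors = []
for row in rows:
    errors.extend(check_string_enum(infer_value(row["phv00108343"]), permitted))

for message in errors:
    print("[ERROR]", message)
```

Both checks fail for the same cell: the integer `25` is not a string, and it is not equal to the enum member `'25'`, which matches the paired errors listed under Example Errors.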
Potential Fix Points
| Component | Possible Fix | Trade-offs |
|---|---|---|
| schema-automator | Detect numeric-looking strings and avoid creating enums | Can't know intent; '25' is valid in an enum |
| schema-automator | Use union types for mixed data columns | Adds schema complexity |
| linkml validator | Coerce types before enum comparison | Deviates from JSON Schema spec |
| Data loader | Coerce all values to strings based on schema | May mask real type errors |
| dm-bip preprocessing | Ensure consistent types before validation | Adds pipeline complexity |
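As a rough illustration of the "Data loader" and "dm-bip preprocessing" rows, the sketch below coerces values to strings only for columns the schema declares as string or enum typed. The function name `coerce_row_to_schema`, the `string_columns` set, and the `age` column are assumptions made for the example; none of them are existing LinkML or dm-bip APIs.

```python
def coerce_row_to_schema(row: dict, string_columns: set) -> dict:
    """Cast non-string values to str for schema-declared string columns.

    None is left alone so that nullable columns still validate as null.
    """
    coerced = {}
    for column, value in row.items():
        needs_cast = (
            column in string_columns
            and value is not None
            and not isinstance(value, str)
        )
        coerced[column] = str(value) if needs_cast else value
    return coerced


# Hypothetical row as a loader might produce it after type inference.
row = {"phv00108343": 25, "phv00108344": 1994, "phv00108617": None, "age": 42}

# Columns the generated schema treats as string or string enum.
string_columns = {"phv00108343", "phv00108344", "phv00108617"}

fixed = coerce_row_to_schema(row, string_columns)
print(fixed)
```

The trade-off from the table still applies: a value that was inferred as a float would become `'25.0'` rather than `'25'`, so coercion after inference can still miss the enum. Reading every cell as a string in the first place avoids that, at the cost of pushing numeric parsing elsewhere.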
Tasks
- Investigate where type coercion should occur
- Check if there are existing upstream issues in linkml/schema-automator or linkml/linkml
- Open upstream issue(s) as appropriate
- Determine if dm-bip needs a workaround in the interim
Related
- schema-automator#169 - Generalize-tsvs creating malformed enums causes crash
- schema-automator#93 - Type inference for boolean-like values (can any of the importers or generalizers infer that 0/1/True/False etc. are of type boolean?)
Impact
This affects validation of dbGaP datasets which commonly have columns with mixed string/integer values.