
Package schema #4

Merged
jbragg merged 3 commits into main from jbragg/package-schema
May 8, 2025

Collaborator

@jbragg commented May 7, 2025

Resolves a problem we hit with scored data. A field was set to its default value (number of tokens = 0), so it was dropped during serialization (because of exclude_defaults=True in model_dump_json). The HuggingFace dataset then inferred a null value for it, and deserialization in the leaderboard app failed because the number of tokens field expected an int.
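The failure chain can be reproduced in miniature. This is only a sketch using a standard-library dataclass as a stand-in for the real model: ScoredRecord, num_tokens, dump_json, and load_strict are hypothetical names, and the exclude_defaults flag here merely imitates the behavior described above rather than calling the actual serializer.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class ScoredRecord:
    # Hypothetical stand-in for the scored-data model; field names are assumptions.
    task: str
    num_tokens: int = 0


def dump_json(record, exclude_defaults):
    """Serialize a record, optionally dropping fields equal to their default."""
    data = asdict(record)
    if exclude_defaults:
        defaults = {f.name: f.default for f in fields(record)}
        data = {k: v for k, v in data.items() if defaults[k] != v}
    return json.dumps(data)


def load_strict(payload):
    """Deserialize the way a strict consumer would: num_tokens must be an int."""
    row = json.loads(payload)
    # A missing key is read back as null (None), as with an inferred schema.
    num_tokens = row.get("num_tokens")
    if not isinstance(num_tokens, int):
        raise TypeError(f"num_tokens expected int, got {num_tokens!r}")
    return ScoredRecord(task=row["task"], num_tokens=num_tokens)


record = ScoredRecord(task="qa", num_tokens=0)
lossy = dump_json(record, exclude_defaults=True)   # num_tokens is dropped
full = dump_json(record, exclude_defaults=False)   # num_tokens is kept as 0
```

Calling load_strict(full) round-trips the record, while load_strict(lossy) raises a TypeError because the dropped field comes back as None.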

This PR

  • Adds functionality to produce a fixed schema file (dataset_infos.json), which, once uploaded to the root of the results HuggingFace repo, should obviate the need for HuggingFace schema inference.
  • Removes the exclusion of default and None values during serialization. These exclusions were put in place to avoid automatic schema inference problems, which are no longer relevant (and which caused problems like the one described above).
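As a rough illustration of the fixed-schema idea, the column types can be derived from the model definition instead of from the data. The sketch below uses only the standard library; ScoredRecord, its fields, the DTYPES mapping, and fixed_schema are all hypothetical and do not reflect the actual layout this PR generates.

```python
from dataclasses import dataclass, fields

# Hypothetical mapping from Python types to dataset dtype names.
DTYPES = {str: "string", int: "int64", float: "float64", bool: "bool"}


@dataclass
class ScoredRecord:
    # Hypothetical stand-in for the scored-data model.
    task: str
    score: float
    num_tokens: int = 0


def fixed_schema(model):
    """Build a column -> dtype mapping from the model's declared field types."""
    return {f.name: DTYPES[f.type] for f in fields(model)}


schema = fixed_schema(ScoredRecord)
```

Because the mapping comes from the declared types, num_tokens is always an int64 column, whether or not any particular record carries a non-default value.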

@jbragg force-pushed the jbragg/package-schema branch 17 times, most recently from bacb5bc to c5ebed5 on May 8, 2025 07:06
@jbragg force-pushed the jbragg/package-schema branch from c5ebed5 to 9502370 on May 8, 2025 07:12
@jbragg marked this pull request as ready for review on May 8, 2025 07:25
@jbragg requested review from AmberRose2 and rodneykinney on May 8, 2025 07:26
@jbragg marked this pull request as draft on May 8, 2025 16:26
Member

@rodneykinney left a comment

LGTM

Collaborator

@AmberRose2 left a comment

Looks good, thank you! I think this also makes it a lot clearer for users!

@jbragg force-pushed the jbragg/package-schema branch from f672ad7 to c45c522 on May 8, 2025 23:08
@jbragg force-pushed the jbragg/package-schema branch from c45c522 to 53a56b9 on May 8, 2025 23:17
@jbragg marked this pull request as ready for review on May 8, 2025 23:18
Collaborator Author

jbragg commented May 8, 2025

It turned out that HF wasn't reading from dataset_infos.json, so I switched to adding the schema info to the README, which seems to work.
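For illustration, schema info in a README could be rendered along these lines. This is a sketch, not the code from this PR: it assumes the dataset_info / features keys of the Hugging Face dataset card YAML front matter, and render_front_matter and the example columns are hypothetical.

```python
def render_front_matter(features):
    """Render a column -> dtype mapping as YAML front matter for a README."""
    lines = ["---", "dataset_info:", "  features:"]
    for name, dtype in features.items():
        lines.append(f"  - name: {name}")
        lines.append(f"    dtype: {dtype}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Hypothetical columns; the real results schema is not shown in this thread.
front_matter = render_front_matter({"task": "string", "num_tokens": "int64"})
```

The resulting block would sit at the very top of README.md in the results repo, ahead of any prose.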

@jbragg merged commit fde83e3 into main on May 8, 2025
3 checks passed
@jbragg deleted the jbragg/package-schema branch on May 8, 2025 23:32

3 participants