
Support JSONL in Dataset and @DatasetSource annotation #7 #21

@fkapsahili


Is your feature request related to a problem? Please describe.

The Dataset class currently supports JSON and CSV formats via fromJson() and fromCsv() methods, and the @DatasetSource annotation supports inline JSON via the json() attribute. However, there is no support for JSONL (JSON Lines), which is a common format for datasets where each line is a separate JSON object.

Describe the solution you'd like

  1. Add Dataset.fromJsonl(Path path) and Dataset.fromJsonl(String jsonl) methods to parse JSONL format.
  2. Update ClasspathDatasetResolver and FileDatasetResolver to detect the .jsonl extension and route to the new parser.
  3. Add a jsonl() attribute to @DatasetSource for inline JSONL in tests
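For point 2, extension-based routing could look roughly like the sketch below. DatasetFormatSketch, Format, and detect are hypothetical names used only for illustration; the actual ClasspathDatasetResolver and FileDatasetResolver may structure this differently.

```java
import java.util.Locale;

public class DatasetFormatSketch {
    // Hypothetical enum; the real resolvers may model formats differently.
    enum Format { JSON, JSONL, CSV }

    // Picks a format from the file name's extension, case-insensitively.
    static Format detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".jsonl")) {
            return Format.JSONL;
        }
        if (lower.endsWith(".json")) {
            return Format.JSON;
        }
        if (lower.endsWith(".csv")) {
            return Format.CSV;
        }
        throw new IllegalArgumentException("Unsupported dataset format: " + fileName);
    }
}
```

Each resolver could then switch on the detected format and call Dataset.fromJson(), Dataset.fromJsonl(), or Dataset.fromCsv() accordingly.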

Expected JSONL format:

{"input": "What is 2+2?", "expectedOutput": "4"}
{"input": "What is 2+3?", "expectedOutput": "5"}

Or with the full structure:

{"inputs": {"question": "..."}, "expectedOutputs": {"answer": "..."}, "metadata": {"source": "..."}}

JSONL is streaming-friendly and doesn't require loading the entire file into memory for parsing. The existing logic for parsing individual JSON examples can be reused since each line in JSONL is a valid JSON object, which matches the current example structure.
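A minimal sketch of how the line-by-line parsing could reuse the existing per-example logic. JsonlSketch and the parseExample parameter are hypothetical: parseExample stands in for whatever the library already uses to turn one JSON object into an example, and the generic type T stands in for the example type.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class JsonlSketch {
    // Reads one line at a time and passes each non-blank line to parseExample.
    static <T> List<T> parse(BufferedReader reader, Function<String, T> parseExample)
            throws IOException {
        List<T> examples = new ArrayList<>();
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines, e.g. a trailing newline
            }
            try {
                examples.add(parseExample.apply(line));
            } catch (RuntimeException e) {
                // Report which line failed so a bad record is easy to locate.
                throw new IllegalArgumentException("Invalid JSONL at line " + lineNumber, e);
            }
        }
        return examples;
    }

    // Inline variant, e.g. for a jsonl() attribute value.
    static <T> List<T> fromJsonl(String jsonl, Function<String, T> parseExample) {
        try (BufferedReader reader = new BufferedReader(new StringReader(jsonl))) {
            return parse(reader, parseExample);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // File variant; the raw file is read line by line, not loaded as one string.
    static <T> List<T> fromJsonl(Path path, Function<String, T> parseExample)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            return parse(reader, parseExample);
        }
    }
}
```

Note that this sketch still collects the parsed examples into a list; only the raw input is consumed line by line. A fully streaming variant would need to expose an iterator or stream instead.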

Describe alternatives you've considered

An alternative would be to stick with the existing JSON implementation only, but I think offering jsonl() as an additional option makes sense given how common the format is for datasets.
