Description
Is your feature request related to a problem? Please describe.
The Dataset class currently supports JSON and CSV formats via fromJson() and fromCsv() methods, and the @DatasetSource annotation supports inline JSON via the json() attribute. However, there is no support for JSONL (JSON Lines), which is a common format for datasets where each line is a separate JSON object.
Describe the solution you'd like
- Add Dataset.fromJsonl(Path path) and Dataset.fromJsonl(String jsonl) methods to parse JSONL format.
- Update ClasspathDatasetResolver and FileDatasetResolver to detect the .jsonl extension and route to the new parser.
- Add a jsonl() attribute to @DatasetSource for inline JSONL in tests.
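To make the proposal concrete, here is a minimal sketch of the line-splitting part that both fromJsonl overloads could share. It is not the library's code: the class JsonlSketch, the method names, and the lineParser parameter are hypothetical, and lineParser stands in for whatever existing logic already turns one JSON object into an example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of the existing code base.
final class JsonlSketch {
    private JsonlSketch() {}

    // Parses inline JSONL; lineParser is assumed to wrap the existing per-example JSON parsing.
    static <T> List<T> parseJsonl(String jsonl, Function<String, T> lineParser) {
        try (BufferedReader reader = new BufferedReader(new StringReader(jsonl))) {
            return parseLines(reader, lineParser);
        } catch (IOException e) {
            // Not expected for an in-memory reader, but readLine() declares it.
            throw new UncheckedIOException(e);
        }
    }

    // Parses a JSONL file, reading it line by line as UTF-8.
    static <T> List<T> parseJsonl(Path path, Function<String, T> lineParser) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return parseLines(reader, lineParser);
        }
    }

    private static <T> List<T> parseLines(BufferedReader reader, Function<String, T> lineParser)
            throws IOException {
        List<T> examples = new ArrayList<>();
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines, e.g. a trailing newline
            }
            try {
                examples.add(lineParser.apply(line));
            } catch (RuntimeException e) {
                // Report the offending line so malformed datasets are easy to fix.
                throw new IllegalArgumentException("Invalid JSONL at line " + lineNumber, e);
            }
        }
        return examples;
    }
}
```

Passing the per-line parser as a function keeps the sketch independent of the JSON library in use. Skipping blank lines is a design choice here, not something the JSONL format requires.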
Expected JSONL format:
{"input": "What is 2+2?", "expectedOutput": "4"}
{"input": "What is 2+3?", "expectedOutput": "5"}
Or with the full structure:
{"inputs": {"question": "..."}, "expectedOutputs": {"answer": "..."}, "metadata": {"source": "..."}}
JSONL is streaming-friendly and doesn't require loading the entire file into memory for parsing. The existing logic for parsing individual JSON examples can be reused since each line in JSONL is a valid JSON object, which matches the current example structure.
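The streaming point could be sketched as follows. This is an illustration only: JsonlStreamSketch, streamJsonl, and lineParser are hypothetical names, and lineParser again stands in for the existing per-example JSON parsing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical helper, not part of the existing code base.
final class JsonlStreamSketch {
    private JsonlStreamSketch() {}

    // Returns a lazy stream: lines are read and parsed only as the caller consumes them.
    // The caller must close the stream (try-with-resources) to release the file handle.
    static <T> Stream<T> streamJsonl(Path path, Function<String, T> lineParser) throws IOException {
        return Files.lines(path, StandardCharsets.UTF_8)
                .filter(line -> !line.trim().isEmpty()) // ignore blank lines
                .map(lineParser);
    }
}
```

Because Files.lines reads lazily, only the lines currently being processed need to be held in memory, which is the advantage over parsing one large JSON array.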
Describe alternatives you've considered
An alternative would be to stick with the existing JSON implementation, but I think offering JSONL as an additional option (including a jsonl() attribute) would make sense.