@andreatgretel andreatgretel commented Nov 11, 2025

This PR implements a new processor that uses a Jinja template to automatically generate JSONL files that can be readily consumed by fine-tuning APIs.

Example usage

See examples/example.py for a fully functional example. Specifically, this is how the new processor is used:

jsonl_entry_template = {
    "messages": [
        {
            "role": "system",
            "content": (
                "You are an expert ER triage nurse. Your task is to classify the following triage note into one of the five Emergency Severity Index (ESI) levels."
                f" The possible levels are: {', '.join([repr(level) for level in ESI_LEVELS])}."
                " Carefully analyze the clinical details in the triage note, focusing on patient acuity, resource needs, and risk of rapid deterioration."
                " Respond with only the selected ESI level description, exactly matching one of the listed possibilities. Do not provide extra text or explanation."
            ),
        },
        {
            "role": "user",
            "content": (
                "Triage Note: {{ content }}\n"
                "Classify the ESI level for this note based on the provided definitions."
                ' Respond in JSON format only: { "esi_level_description": "..." }'
            ),
        },
        {"role": "assistant", "content": ('{ "esi_level_description": "{{ esi_level_description }}" }')},
    ],
}

config_builder.add_processor(
    AncillaryDatasetProcessorConfig(
        name="jsonl_output",
        template=jsonl_entry_template,
    )
)

results = dd.create(config_builder, num_records=20)
path_to_processor_artifacts = results.get_path_to_processor_artifacts("jsonl_output")

import pandas as pd
pd.read_parquet(path_to_processor_artifacts).to_json("./output.jsonl", orient="records", lines=True)
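
To make the rendering step concrete, here is a minimal sketch of how a nested template like the one above could be filled in for each record and serialized as JSONL. It is not the processor's actual implementation: the regex substitution is a simplified stand-in for Jinja2 that only handles plain `{{ name }}` placeholders, and the `render` helper, the sample records, and the "ESI 2"/"ESI 4" labels are all hypothetical.

```python
import json
import re

# Simplified stand-in for Jinja2: only handles plain `{{ name }}` placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(node, record):
    """Recursively substitute placeholders in every string of a nested template."""
    if isinstance(node, str):
        return PLACEHOLDER.sub(lambda m: str(record[m.group(1)]), node)
    if isinstance(node, list):
        return [render(item, record) for item in node]
    if isinstance(node, dict):
        return {key: render(value, record) for key, value in node.items()}
    return node


template = {
    "messages": [
        {"role": "user", "content": "Triage Note: {{ content }}"},
        {"role": "assistant", "content": "{{ esi_level_description }}"},
    ]
}

# Hypothetical records standing in for rows of the generated dataset.
records = [
    {"content": "Chest pain, diaphoretic.", "esi_level_description": "ESI 2"},
    {"content": "Minor ankle sprain.", "esi_level_description": "ESI 4"},
]

# One JSON object per line, which is the JSONL layout fine-tuning APIs expect.
jsonl_lines = [json.dumps(render(template, record)) for record in records]
print("\n".join(jsonl_lines))
```

Rendering the nested object first and calling `json.dumps` afterwards means quotes or newlines inside a record value are escaped by the serializer rather than by the template.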

Closes #25

github-actions bot commented Nov 11, 2025

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@andreatgretel andreatgretel changed the title 🧩 Implement chat template transform within general Processor framework feat: 🧩 Implement chat template transform within general Processor framework Nov 11, 2025
@andreatgretel

I have read the DCO document and I hereby sign the DCO.

@andreatgretel andreatgretel force-pushed the andreatgretel/processor-for-jsonl-output branch 2 times, most recently from c30bd73 to 37fffe4 Compare November 12, 2025 13:47
@andreatgretel andreatgretel marked this pull request as ready for review November 12, 2025 13:58
@andreatgretel andreatgretel force-pushed the andreatgretel/processor-for-jsonl-output branch 2 times, most recently from 21600b9 to db6a276 Compare November 20, 2025 14:28
@andreatgretel

Made changes to the PR following our offline discussion:

  • This is now an OutputFormatProcessor: the user provides a Jinja2 template (it could be JSONL, CSV, or any other format); we render it and add the result to a Parquet file at the end of each batch; finally, after generation, we collect all Parquet files and write each cell as a row of a text file. This file has the extension the user picked when configuring the processor (template and extension are essentially the only two parameters).
  • It is still a single Jinja2 template, i.e., a string; we could pass an object as discussed, but I agree with @nabinchha that this is very little work for the user -- it's just a call to json.dumps.
  • Processors can now implement a write_outputs_to_disk method, which tells the results object how all artifacts should be collected, combined and saved to disk. The artifacts for this processor are Parquet files, but I think more generically they could be anything. In summary, the processor must now have two methods: process, which might change the dataframe and save artifacts to disk; and write_outputs_to_disk, which contains the logic for collecting artifacts.
  • I don't much like how write_outputs_to_disk is called from DatasetCreationResults, but at least this way the processor is self-contained. I also had to add a cache to ArtifactStorage to be able to preview processor artifacts. I added comments in the code about these two points.
  • There's no splitting of any kind; we can add a helper function for that.
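
The two-method flow described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `OutputFormatSketch` class and the `formatted_output` file name are hypothetical, `string.Template` stands in for Jinja2, and per-batch JSON files stand in for the Parquet artifacts so that the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path
from string import Template


class OutputFormatSketch:
    """Hypothetical two-phase processor: per-batch artifacts, then one combined file."""

    def __init__(self, template: str, extension: str, artifact_dir: Path):
        self.template = Template(template)  # string.Template stands in for Jinja2
        self.extension = extension
        self.artifact_dir = artifact_dir
        self._batch_index = 0

    def process(self, batch: list[dict]) -> list[dict]:
        """Format each record and save the batch as an artifact; the batch is returned unchanged."""
        rendered = [self.template.substitute(record) for record in batch]
        path = self.artifact_dir / f"batch_{self._batch_index:05d}.json"
        path.write_text(json.dumps(rendered))
        self._batch_index += 1
        return batch

    def write_outputs_to_disk(self, output_dir: Path) -> Path:
        """Collect every batch artifact and write one rendered cell per line."""
        out_path = output_dir / f"formatted_output.{self.extension}"
        with out_path.open("w") as out:
            for artifact in sorted(self.artifact_dir.glob("batch_*.json")):
                for cell in json.loads(artifact.read_text()):
                    out.write(cell + "\n")
        return out_path


with tempfile.TemporaryDirectory() as tmp:
    artifact_dir = Path(tmp) / "artifacts"
    artifact_dir.mkdir()
    # The template is a plain string; json.dumps turns an object into one.
    template = json.dumps({"prompt": "$question", "completion": "$answer"})
    processor = OutputFormatSketch(template, "jsonl", artifact_dir)
    processor.process([{"question": "2+2?", "answer": "4"}])
    processor.process([{"question": "Capital of France?", "answer": "Paris"}])
    output_lines = processor.write_outputs_to_disk(Path(tmp)).read_text().splitlines()

print(output_lines)
```

Keeping the collection logic inside the processor is what makes it self-contained: the results object only needs to call `write_outputs_to_disk` and does not have to know what the artifacts look like.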

@andreatgretel andreatgretel force-pushed the andreatgretel/processor-for-jsonl-output branch 2 times, most recently from f3b8aa1 to 457a583 Compare December 4, 2025 18:22
@andreatgretel andreatgretel changed the title feat: 🧩 Implement chat template transform within general Processor framework feat: implement processor allowing part of dataset to be easily exported to JSONL Dec 5, 2025
@andreatgretel andreatgretel changed the title feat: implement processor allowing part of dataset to be easily exported to JSONL feat: processor to easily export part of dataset to JSONL Dec 5, 2025
@andreatgretel andreatgretel marked this pull request as draft December 5, 2025 18:51
@andreatgretel

I've updated the PR following what we discussed last week: the goal of this processor is now to generate an auxiliary ("ancillary"?) dataset, which is saved in Parquet files separately from the main dataset. With this, one can, for instance, produce prompt/completion columns, or a messages column with the proper JSON template (with role/content, etc.).

Two points that I still need to address following Nabin's comments above, will do it asap:

  • Jinja2 templates should be validated beforehand.
  • For preview, we shouldn't store anything in memory, but rather in temporary files.
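
A rough sketch of what these two follow-ups could look like, assuming nothing about how they were eventually implemented: `missing_columns` and `write_preview` are hypothetical helpers, and the regex check is a simplified stand-in that only recognizes plain `{{ name }}` placeholders. Real validation would more likely parse the template with Jinja2 itself.

```python
import json
import re
import tempfile
from pathlib import Path

# Simplified stand-in: matches plain `{{ name }}` placeholders only.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_columns(template: str, columns: set[str]) -> set[str]:
    """Return placeholder names in the template that match no dataset column."""
    return set(PLACEHOLDER.findall(template)) - columns


def write_preview(rows: list[dict], directory: Path) -> Path:
    """Write preview rows to a file on disk instead of caching them in memory."""
    path = directory / "preview.jsonl"
    with path.open("w") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return path


columns = {"content", "esi_level_description"}
good = "Triage Note: {{ content }}"
bad = "Triage Note: {{ contents }}"  # typo: no such column

# Checking before generation surfaces the typo without producing any records.
problems_good = missing_columns(good, columns)
problems_bad = missing_columns(bad, columns)

# The temporary directory is deleted on exit, so the preview leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    preview_path = write_preview([{"content": "example"}], Path(tmp))
    preview_rows = [json.loads(line) for line in preview_path.read_text().splitlines()]

print(problems_good, problems_bad, preview_rows)
```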

@andreatgretel andreatgretel force-pushed the andreatgretel/processor-for-jsonl-output branch 2 times, most recently from 16dc9fd to f1c1ec8 Compare December 8, 2025 20:48
@andreatgretel andreatgretel marked this pull request as ready for review December 9, 2025 19:08
@andreatgretel andreatgretel force-pushed the andreatgretel/processor-for-jsonl-output branch from c3df2f1 to 8e6bc66 Compare December 10, 2025 21:49
johnnygreco previously approved these changes Dec 10, 2025

@johnnygreco johnnygreco left a comment

thanks for pushing this through @andreatgretel 🚀

Co-authored-by: Nabin Mulepati <[email protected]>

@nabinchha nabinchha left a comment

🚢 🚢 🚢 🚢 🚢 🚢 🚢

@johnnygreco johnnygreco left a comment

🛸

@andreatgretel andreatgretel merged commit f55211c into main Dec 10, 2025
13 checks passed
Development

Successfully merging this pull request may close these issues.

🧩 Implement chat template transform within general Processor framework

5 participants