-
Notifications
You must be signed in to change notification settings - Fork 51
feat: processor to easily export part of dataset to JSONL #26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: processor to easily export part of dataset to JSONL #26
Conversation
|
All contributors have signed the DCO ✍️ ✅ |
|
I have read the DCO document and I hereby sign the DCO. |
c30bd73 to
37fffe4
Compare
21600b9 to
db6a276
Compare
|
Made changes to the PR following our offline discussion:
|
src/data_designer/engine/processing/processors/output_format.py
Outdated
Show resolved
Hide resolved
f3b8aa1 to
457a583
Compare
|
I've updated the PR following what we discussed last week - the goal of this processor now is generating an auxiliary ("ancillary"?) dataset, which is saved in parquets separately from the main dataset. With this, one can, for instance, do prompt/completion columns, or a messages column with the proper JSON template (with role/content etc.) Two points that I still need to address following Nabin's comments above, will do it asap:
|
16dc9fd to
f1c1ec8
Compare
Co-authored-by: Nabin Mulepati <[email protected]>
c3df2f1 to
8e6bc66
Compare
johnnygreco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pushing this through @andreatgretel 🚀
Co-authored-by: Nabin Mulepati <[email protected]>
nabinchha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢 🚢 🚢 🚢 🚢 🚢 🚢
johnnygreco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛸
This PR implements a new processor that uses a Jinja template to automatically generate JSONL files that can be readily consumed by fine-tuning APIs.
Example usage
See
examples/example.pyfor a fully functional example. Specifically, this is how the new processor is used:Closes #25