Support system prompt in preprocess_conversations. #181
Ximingwang-09 wants to merge 4 commits into sgl-project:main
Conversation
Summary of Changes
Hello @Ximingwang-09, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces support for system prompts within datasets, allowing for more flexible training of draft models, especially when fine-tuning models that use specific system prompts. This change aligns the data processing with the OpenAI Chat API protocol, enabling datasets to explicitly define a system message at the beginning of a conversation rather than relying solely on a default template.
Highlights
- **Dataset Schema Update**: `README.md` has been updated to reflect that the `role` field within the `conversations` array in the `jsonl` dataset format now accepts `"system"` as a valid value, alongside `"user"` and `"assistant"`.
- **System Prompt Handling in Preprocessing**: The `preprocess_conversations` function in `specforge/data/preprocessing.py` has been modified to check whether the first message in a conversation source has the `"system"` role. If it does, that message's content is used as the system prompt; otherwise, the default system prompt from the chat template is applied.
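The branching described above can be illustrated with a minimal sketch. Note that `extract_system_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names for illustration only; the actual `preprocess_conversations` implementation in SpecForge may structure this differently.

```python
# Illustrative sketch of the system-prompt handling described in this PR.
# extract_system_prompt and DEFAULT_SYSTEM_PROMPT are hypothetical names,
# not the actual SpecForge API.

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."  # stand-in for the chat template's default


def extract_system_prompt(source):
    """Return (system_prompt, remaining_messages) for one conversation.

    If the first message uses the "system" role, its content overrides
    the default system prompt from the chat template; otherwise the
    default is used and the message list is left untouched.
    """
    if source and source[0]["role"] == "system":
        return source[0]["content"], source[1:]
    return DEFAULT_SYSTEM_PROMPT, source


# Example conversation in the extended jsonl schema, where the first
# message may carry the "system" role:
conversation = [
    {"role": "system", "content": "You are a terse math tutor."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4."},
]

system_prompt, messages = extract_system_prompt(conversation)
print(system_prompt)  # -> You are a terse math tutor.
print(len(messages))  # -> 2
```

This mirrors the OpenAI Chat API convention, where an optional system message leads the conversation and any downstream templating consumes only the user/assistant turns.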
Code Review
This pull request adds support for using a system prompt from the dataset itself, instead of always relying on the default system prompt from the chat template. This is achieved by checking if the first message in a conversation has the 'system' role. The change is implemented in specforge/data/preprocessing.py and the documentation in README.md is updated accordingly.
The logic for handling the system prompt is correct. I've suggested a minor refactoring in `specforge/data/preprocessing.py` to make the code more concise and readable. Overall, this is a good addition that increases flexibility in dataset formatting.
I think in the newest version of the main branch, this issue has already been fixed.
Motivation
As mentioned in #169, the community has provided a series of EAGLE3 models for popular base models. However, in many use cases we want to train a draft model on the same dataset that was used to fine-tune the target model, and that dataset often carries its own system prompt. This PR provides a simple way to supply a system prompt from the dataset, which is more in line with the OpenAI Chat API protocol.
Modifications
Related Issues
Accuracy Test
Benchmark & Profiling
Checklist