Data Format

This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information.

Hugging Face Chat Datasets

Hugging Face chat datasets are expected to have the following structure: each example in the dataset should be a dictionary with a messages key. The value of messages should be a list of dictionaries, each with a role and a content key. The role typically takes one of the following values: system, user, or assistant. For example:

{
    "messages": [
        {
            "role": "system",
            "content": "This is a helpful system message."
        },
        {
            "role": "user",
            "content": "This is a user's question"
        },
        {
            "role": "assistant",
            "content": "This is the assistant's response."
        }
    ]
}
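
To make the expected layout concrete, a dataset with this structure can be assembled in memory with the datasets library. The following is a minimal sketch; the example list and variable names are illustrative, not part of any required API:

from datasets import Dataset

# A tiny in-memory chat dataset following the messages structure above.
examples = [
    {
        "messages": [
            {"role": "system", "content": "This is a helpful system message."},
            {"role": "user", "content": "This is a user's question"},
            {"role": "assistant", "content": "This is the assistant's response."},
        ]
    }
]

chat_dataset = Dataset.from_list(examples)
print(chat_dataset[0]["messages"][0]["role"])  # "system"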

Chat Templates

Formatting the data in this way allows us to take advantage of the Hugging Face tokenizers' apply_chat_template functionality to combine the messages into a single formatted sequence. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the Hugging Face apply_chat_template documentation for details.
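
For example, when a tokenizer already ships with a chat_template, the messages can be combined without any extra arguments. The snippet below is a brief sketch; the checkpoint name is only an illustrative choice of a chat-tuned tokenizer and is not prescribed by this guide:

from transformers import AutoTokenizer

# Assumes this checkpoint defines its own chat_template; substitute any chat-tuned tokenizer.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "This is a user's question"},
    {"role": "assistant", "content": "This is the assistant's response."},
]

# With no chat_template argument, the tokenizer's built-in template is applied.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)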

By default, apply_chat_template applies the chat_template associated with the tokenizer. However, in some cases users may want to supply their own template, and many tokenizers do not ship with a chat_template at all, in which case an explicit one is required. An explicit chat template is written as a Jinja-format string and passed to apply_chat_template via the chat_template argument. The following example uses a simple template that prepends a role header to each turn:

from transformers import AutoTokenizer

# A simple Jinja template that wraps each message in role-header and end-of-turn tokens.
example_template = "{% for message in messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{{ content }}{% endfor %}"

example_input = [
    {
        'role': 'user',
        'content': 'Hello!'
    },
    {
        'role': 'assistant',
        'content': 'Hi there!'
    }
]
# Any tokenizer can be used here; the explicit chat_template argument overrides its built-in template.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
output = tokenizer.apply_chat_template(example_input, chat_template=example_template, tokenize=False)

# This is the output string we expect.
expected_output = '<|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there!<|eot_id|>'
assert output == expected_output
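
If token IDs are needed rather than a formatted string, the same call can tokenize the rendered conversation directly:

# Returns a list of token IDs; pass return_tensors="pt" for a PyTorch tensor instead.
token_ids = tokenizer.apply_chat_template(example_input, chat_template=example_template, tokenize=True)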

For more details on creating chat templates, refer to the Hugging Face documentation.