| license | apache-2.0 |
|---|---|
| dataset_tags | |
| task_categories | |
| language | |
| size_categories | |
| dataset_info | |
| configs | |

An instruction-following dataset for Danish.
The dataset contains 5 million multi-turn conversations in Danish, designed for training instruction-following models, and is released under a license that permits commercial use.
All examples in the dataset are structured as follows:

    {
      "messages": [
        {
          "role": "user",
          "content": "(...)"
        },
        {
          "role": "assistant",
          "content": "(...)"
        },
        {
          "role": "user",
          "content": "(...)"
        },
        (...)
        {
          "role": "assistant",
          "content": "(...)"
        }
      ]
    }

The dataset was created in several steps, each of which is described in detail below. The code base used for the dataset generation can be found here. We used the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model for all steps (except the manual Step 1).
We started by manually creating a set of 176 Danish seed prompts and answers, adapted from the English Self-Instruct seed prompts as well as from prompts crowdsourced as part of the EU Horizon project TrustLLM (grant agreement number 101135671). These seed prompts can be found here.
With the seed prompts in hand, we used an improved version of the Alpaca recipe to generate an initial instruction dataset with 1 million examples. The main differences were that we used structured generation to ensure that the outputs had the correct format, and that we used MinHash deduplication rather than ROUGE computations, as this was many orders of magnitude faster. The generation used the seed prompts from the previous step as few-shot examples. The generated examples were then filtered to check that they were not too short or too long, were not too similar to existing instructions, and did not contain prompt words.
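To illustrate the MinHash-based similarity filter mentioned above, here is a minimal, standard-library-only sketch. It is not the code used to build the dataset: the function names, the word 3-gram shingling, the 64 hash functions and the 0.7 threshold are all assumptions chosen for illustration. For clarity it compares each candidate against every kept signature, whereas a production pipeline would typically add locality-sensitive hashing to avoid pairwise comparisons.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams (shingles) of a lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for s in items
            )
        )
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(instructions: list[str], threshold: float = 0.7) -> list[str]:
    """Keep an instruction only if it is not too similar to any instruction kept so far."""
    kept: list[str] = []
    kept_signatures: list[list[int]] = []
    for text in instructions:
        signature = minhash_signature(text)
        if all(estimated_jaccard(signature, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(signature)
    return kept
```

Because signatures have a fixed length, comparing two instructions costs the same regardless of how long the texts are, which is one reason this kind of check scales better than computing ROUGE between every pair of texts.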
The generated dataset was then grammar-corrected, which included translation into Danish where necessary. If an instruction was specifically about translation into a non-Danish language, we did not translate the output. In some cases, however, the model had generated non-Danish instructions or outputs without being asked to, and these were translated into Danish at this stage.
A number of the generated examples were nonsensical or generally of low quality, so we ran the generated instructions through the model again, this time asking it to rewrite any low-quality instructions to improve them.
We next used the Evol-Instruct recipe to evolve the dataset for 4 generations. This process makes the examples both more complex and more diverse. All the newly evolved examples were added to the dataset and shuffled with the previous examples.
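The bookkeeping of this step can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `evolve_dataset` and `toy_evolve` are hypothetical names, the real `evolve` step would be a call to the model, and we assume that each generation rewrites the previous generation's output and yields exactly one new example per input. Under that assumption, 1 million initial examples and 4 generations give 5 million examples in total.

```python
import random
from typing import Callable


def evolve_dataset(
    initial_examples: list[str],
    evolve: Callable[[str], str],
    generations: int = 4,
    rng_seed: int = 0,
) -> list[str]:
    """Evolve every example once per generation and keep all generations."""
    dataset = list(initial_examples)
    current = list(initial_examples)
    for _ in range(generations):
        # Assumption: each generation rewrites the previous generation's output.
        current = [evolve(example) for example in current]
        dataset.extend(current)
    # Shuffle evolved and original examples together, with a fixed seed for reproducibility.
    random.Random(rng_seed).shuffle(dataset)
    return dataset


def toy_evolve(instruction: str) -> str:
    """Hypothetical stand-in for a model call that makes an instruction more complex."""
    return instruction + " Begrund dit svar."
```

With 2 initial examples and 4 generations, `evolve_dataset` returns 2 × (1 + 4) = 10 examples.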
Finally, we added 3 follow-up queries and answers to each of the examples in the dataset.
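As a sketch of how the follow-up turns relate to the `messages` structure, the snippet below extends a single-turn example with 3 follow-up exchanges and checks that the result alternates between `user` and `assistant`. The names `add_follow_ups`, `is_well_formed`, `ask` and `answer` are hypothetical; in the actual pipeline, the follow-up queries and answers would come from the model rather than from the placeholder callables used here.

```python
from typing import Callable

Message = dict[str, str]


def add_follow_ups(
    messages: list[Message],
    ask: Callable[[list[Message]], str],
    answer: Callable[[list[Message]], str],
    num_follow_ups: int = 3,
) -> list[Message]:
    """Return a copy of the conversation extended with follow-up query/answer pairs."""
    conversation = list(messages)
    for _ in range(num_follow_ups):
        # Each follow-up is generated with the full conversation so far as context.
        conversation.append({"role": "user", "content": ask(conversation)})
        conversation.append({"role": "assistant", "content": answer(conversation)})
    return conversation


def is_well_formed(messages: list[Message]) -> bool:
    """Check that roles alternate, starting with user and ending with assistant."""
    if not messages or len(messages) % 2 != 0:
        return False
    expected_roles = ["user", "assistant"] * (len(messages) // 2)
    roles_ok = [m.get("role") for m in messages] == expected_roles
    contents_ok = all(isinstance(m.get("content"), str) for m in messages)
    return roles_ok and contents_ok
```

Starting from one instruction and one answer, adding 3 follow-ups gives a conversation of 8 messages.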
This dataset is licensed under the Apache 2.0 license, which allows the dataset to be used for any purpose, including commercial purposes. The model that we used was also released under this license.
This dataset was created by Dan Saattrup Smart and Sofie Helene Bruun from the Alexandra Institute as part of the Danish Foundation Models project. The project is funded by the Danish Research Reserve as part of the national budget of Denmark for 2025, and consists of the following partners: