Skip to content

Latest commit

 

History

History
174 lines (149 loc) · 5.22 KB

File metadata and controls

174 lines (149 loc) · 5.22 KB
license apache-2.0
dataset_tags
instruction-following
task_categories
text-generation
language
da
size_categories
1M<n<10M
dataset_info
config_name features splits download_size dataset_size
base
name dtype
instruction
string
name dtype
output
string
name num_bytes num_examples
train
1047470220
1005163
656907545
1047470220
config_name features splits download_size dataset_size
grammar_corrected
name dtype
instruction
string
name dtype
output
string
name dtype
hash
string
name num_bytes num_examples
train
624398370
1002111
406049978
624398370
config_name features splits download_size dataset_size
quality_corrected
name dtype
instruction
string
name dtype
output
string
name dtype
hash
string
name num_bytes num_examples
train
779474387
999588
499250083
779474387
configs
config_name data_files
base
split path
train
base/train-*
config_name data_files
grammar_corrected
split path
train
grammar_corrected/train-*
config_name data_files
quality_corrected
split path
train
quality_corrected/train-*

Lærebogen

An instruction-following dataset for Danish.

This dataset features 5 million examples of multi-turn conversations in Danish, designed to train instruction-following models, with a commercially usable license.

Dataset Structure

All examples in the dataset are structured as follows:

{
  "messages": [
 {
  "role": "user",
  "content": "(...)"
 },
 {
  "role": "assistant",
  "content": "(...)"
 },
 {
  "role": "user",
  "content": "(...)"
 },
 (...)
 {
  "role": "assistant",
  "content": "(...)"
 }
}

Dataset Generation Process

The dataset was created using several steps, each of which is described in detail in the subsections below. The code base used for the dataset generation can be found here. We used the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 model for all steps (except the manual Step 1).

Step 1: Seed Generation

We started by generating a set of 176 Danish seed prompts and answers manually, adapted from the English Self-Instruct seed prompts as well as prompts crowdsourced as part of the EU Horizon project TrustLLM (grant agreement number 101135671). These seed prompts can be found here.

Step 2: Base Dataset Generation

With the seed prompts in hand, we used an improved version of the Alpaca recipe to generate an initial instruction dataset with 1 million examples. The main differences were that we used structured generation to ensure the correct outputs, and that we used MinHash deduplication rather than ROUGE computations, as this was many orders of magnitude faster. This used the seed prompts from the previous step as few-shot examples, and were filtered using filters that checked that the generated examples were not too short or too long, were not too similar to existing instructions, and did not contain prompt words.

Step 3: Grammar Correction

The generated dataset was then grammar-corrected, which includes translation to Danish in case this was necessary. I.e., if the instruction was specifically about translation to a non-Danish language, then we don't translate the output, but in some cases the model ended up generating non-Danish instructions/outputs, so these were translated to Danish here.

Step 4: Quality Improvement

A number of the generated examples were non-sensical or generally of low quality, so we run the generated instructions through the model again, this time asking it to rewrite the instructions to improve their quality, in case they were of low quality.

Step 5: Evolving the Dataset

We next used the Evol-Instruct recipe to evolve the dataset for 4 generations. This process both makes the examples more complex and diverse. All the new evolved examples were added to the dataset and shuffled with the previous examples.

Step 6: Adding Follow-Up Questions

Finally, we added 3 follow-up queries and answers to each of the examples in the dataset.

License

This dataset is licensed under the Apache 2.0 license, allowing the dataset to be used for any purpose, including commercial purposes. The model that we used was also released with this license.

Creators and Funders

This dataset was created by Dan Saattrup Smart and Sofie Helene Bruun from the Alexandra Institute as part of the Danish Foundation Models project. The project is funded by the Danish Research Reserve as part of the national budget of Denmark for 2025, and consists of the following partners: