# Offline Data Preprocessing

Our library provides a [script](../scripts/offline_data_processing.py) that lets users perform standalone data preprocessing, independent of tuning/training. The script processes raw datasets, applies basic or advanced data preprocessing, and saves the resulting train and validation datasets in Parquet format inside the specified `output_dir`. When the `--num_dataset_shards` argument is specified, the datasets are divided and saved as multiple shards.

Users can pass any data config to this script. Its goal is to take the provided data config and generate a dataset that can be used directly for training, without requiring any online processing. As an example, see the data config below:

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
datasets:
  - name: dataset_1
    data_paths:
      - tests/artifacts/testdata/jsonl/twitter_complaints_input_output.jsonl
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            input_field_name: input
            output_field_name: output
```
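
The `tokenize_and_apply_input_masking` handler referenced above tokenizes each example and masks the prompt portion out of the training loss. As a rough illustration of that idea (this is not the library's implementation; the `gpt2` tokenizer and the exact masking logic below are placeholders), input masking typically looks like:

```python
# Illustrative sketch of input masking for instruction tuning.
# Not the library's handler; "gpt2" is only a stand-in tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_and_mask(example, max_length=4096):
    """Tokenize input + output together and exclude the input tokens from the loss."""
    prompt_ids = tokenizer(example["input"], add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(
        example["input"] + example["output"],
        add_special_tokens=False,
        truncation=True,
        max_length=max_length,
    )["input_ids"]
    labels = list(full_ids)
    # Prompt tokens get label -100 so the loss is computed only on the output tokens.
    for i in range(min(len(prompt_ids), len(labels))):
        labels[i] = -100
    return {"input_ids": full_ids, "attention_mask": [1] * len(full_ids), "labels": labels}

print(tokenize_and_mask({"input": "Tweet text: my flight got cancelled. Label: ", "output": "complaint"}))
```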

After preparing the data configuration YAML file, run the script with the following example command to perform offline data preprocessing:

```bash
python scripts/offline_data_processing.py \
--data_config_path /path/to/data_config.yaml \
--model_name_or_path "model_name" \
--max_seq_length 4096 \
--output_dir /path/to/output/directory \
--log_level info \
--num_dataset_shards 3
```

Once offline data processing is complete, users can leverage the shards stored in `output_dir` for tuning, either by passing the directory through the `--training_data_path` flag or via the `data_paths` argument in a data config YAML, provided they find the sharded datasets beneficial for training.
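
For example, a quick way to sanity-check the processed output before training is to load the Parquet shards back with the `datasets` library (the path and glob below are illustrative; adjust them to the actual layout of your `output_dir`):

```python
# Load the offline-processed Parquet shards for inspection.
# The directory and glob pattern are assumptions about the output layout.
from datasets import load_dataset

processed = load_dataset(
    "parquet",
    data_files={"train": "/path/to/output/directory/**/*.parquet"},
)["train"]

print(processed)      # number of rows and column names
print(processed[0])   # one processed record, e.g. input_ids / labels after tokenization
```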

## Example Usage
### Applying Chat Template

This is a sample use case in which the offline processing script is applied to a dataset with a chat template, and the offline-processed dataset is then used to train a model.

In this use case, the chat template is applied to the dataset using the `apply_tokenizer_chat_template` handler, followed by additional data transformation handlers.

**NOTE**: Streaming of the dataset is not supported when running the offline data preprocessing script. Therefore, in the data config, the `streaming` argument should either be set to `False` or left unassigned.
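
For intuition, applying a chat template turns a list of role/content turns into a single formatted training string. The sketch below shows that step through the generic Hugging Face `tokenizer.apply_chat_template` API rather than the library's handler; the model path and messages are placeholders:

```python
# Generic illustration of chat-template rendering with a Hugging Face tokenizer.
# Not the library's handler; model path and messages are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

messages = [
    {"role": "user", "content": "What does offline preprocessing mean?"},
    {"role": "assistant", "content": "The dataset is fully processed before training starts."},
]

# Render the conversation into one string using the tokenizer's built-in template.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)  # Granite templates wrap each turn in <|start_of_role|>...<|end_of_role|> markers
```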

```yaml
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
  streaming: False
  chat_template: |
    {%- for message in messages['messages'] %}
      {%- if message['role'] == 'system' %}
        {{ '<|start_of_role|>system<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'user' %}
        {{ '<|start_of_role|>user<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'assistant' %}
        {{ '<|start_of_role|>assistant<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'tools' %}
        {{ '<|start_of_role|>tools<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'tool' %}
        {{ '<|start_of_role|>tool<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- elif message['role'] == 'documents' %}
        {{ '<|start_of_role|>documents<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- else %}
        {{ '<|start_of_role|>unknown<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}
      {%- endif %}
    {%- endfor %}
datasets:
  - name: dataset_1
    retain_columns:
      - "formatted_chat"
    data_paths:
      - "/app/arb30_100.jsonl"
    data_handlers:
      - name: apply_tokenizer_chat_template
        arguments:
          fn_kwargs:
            dataset_text_field: "formatted_chat"
      - name: tokenize
        arguments:
          batched: false
          fn_kwargs:
            dataset_text_field: "formatted_chat"
            truncation: False
            max_length: 4096
      - name: skip_large_text
        arguments:
          fn_kwargs:
            column_name: "input_ids"
            max_length: 4096
      - name: retain_columns
        arguments:
          columns:
            - "formatted_chat"
```
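
Since the custom `chat_template` above iterates over `messages['messages']`, each raw record in the input JSONL is expected to carry a `messages` list of role/content turns. The exact schema of `/app/arb30_100.jsonl` is not shown here, so the record below is only a hypothetical illustration of that shape:

```python
# Hypothetical input record matching the structure the chat template iterates over.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the meeting notes."},
        {"role": "assistant", "content": "The team agreed to ship on Friday."},
    ]
}

# One line of the input JSONL file would then look like:
print(json.dumps(record))
```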

Command to run the offline data processing script:

```bash
python scripts/offline_data_processing.py \
--data_config_path "data_config.yaml" \
--instruction_template "<|start_of_role|>user<|end_of_role|>" \
--max_seq_length "8192" \
--model_name_or_path "/test/models/granite-3.1-8b-instruct" \
--output_dir "/test/data/offline_processing_shards" \
--packing "False" \
--response_template "<|start_of_role|>assistant<|end_of_role|>" \
--split_batches "true" \
--use_flash_attn "true" \
--num_dataset_shards "10"
```

The resulting shards are saved in the directory `/test/data/offline_processing_shards`, as specified by the `--output_dir` argument. These shards can then be used for tuning the model by pointing the `training_data_path` argument to the directory where the shards are stored, in this example `/test/data/offline_processing_shards`.
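
As a quick check that the expected number of shards was written (the path and file layout below are assumptions carried over from the example command above), you can count the Parquet files in the output directory:

```python
# Count the Parquet shards produced by the offline processing run.
# Path and glob are illustrative; adjust to your actual output_dir layout.
import glob

shards = glob.glob("/test/data/offline_processing_shards/**/*.parquet", recursive=True)
print(f"Found {len(shards)} Parquet shard files")  # compare against --num_dataset_shards (10 here)
```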

Command to run the tuning:

```bash
accelerate launch \
  --num_processes=8 \
  --dynamo_backend="no" \
  --fsdp_auto_wrap_policy="TRANSFORMER_BASED_WRAP" \
  --fsdp_cpu_ram_efficient_loading="true" \
  --fsdp_forward_prefetch="false" \
  --fsdp_offload_params="false" \
  --fsdp_sharding_strategy="HYBRID_SHARD" \
  --fsdp_state_dict_type="FULL_STATE_DICT" \
  --fsdp_sync_module_states="true" \
  --machine_rank="${RANK}" \
  --main_process_ip="${MASTER_ADDR}" \
  --main_process_port="${MASTER_PORT}" \
  --mixed_precision="no" \
  --num_machines="${WORLD_SIZE}" \
  --rdzv_backend="static" \
  --same_network \
  --use_fsdp \
  -m tuning.sft_trainer \
  --training_data_path "/test/data/offline_processing_shards" \
  --adam_beta1="0.9" \
  --adam_beta2="0.98" \
  --adam_epsilon="1e-10" \
  --aim_repo="${AIMSTACK_DB}" \
  --dataloader_drop_last="true" \
  --dataset_text_field="random" \
  --evaluation_strategy="no" \
  --experiment="train-nb-g8b-r26-e0e88b40-dbd8-41ae-a744-c853959495f2" \
  --gradient_accumulation_steps="1" \
  --gradient_checkpointing="true" \
  --include_tokens_per_second="false" \
  --instruction_template="<|start_of_role|>user<|end_of_role|>" \
  --learning_rate="1e-06" \
  --logging_steps="1" \
  --logging_strategy="steps" \
  --lr_scheduler_type="cosine" \
  --max_seq_length="8192" \
  --max_steps="12400" \
  --model_name_or_path="/test/models/granite-3.1-8b-instruct" \
  --num_train_epochs="3" \
  --optim="adamw_torch" \
  --output_dir="/hfcache/data_mixing/data_mixing/wca_summ/run26_rb_mix" \
  --packing="False" \
  --per_device_train_batch_size="32" \
  --response_template="<|start_of_role|>assistant<|end_of_role|>" \
  --save_steps="100" \
  --save_strategy="steps" \
  --split_batches="true" \
  --torch_dtype="bfloat16" \
  --use_flash_attn="true" \
  --use_reentrant="true" \
  --warmup_ratio="0.1" \
  --warmup_steps="200" \
  --weight_decay="0.1"
```