* support LLaMA-3
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Run pre-commit
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [2024/4] Support continual pre-training and supervised fine-tuning of LLaMA-3.
* [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
* Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple files in `jsonl` format.
* Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
-* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally.
-* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format.
-* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly.
+* Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
+  * `cache`: Directory to store Hugging Face data cache.
+  * `jsonl`: Output directory to store converted dataset in jsonl format.
+  * `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
* Max length: `max_length`. Max length of spliced samples. Default value is 4096.
* Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
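
The added lines describe a single output directory holding three sub-directories. A minimal sketch of that layout in Python follows. The helper name `prepare_output_dirs` is hypothetical and not taken from the repository; only the sub-directory names `cache`, `jsonl`, and `arrow` come from the list above, and the actual preprocessing script may build them differently.

```python
import os
import tempfile


def prepare_output_dirs(data_output_dirs: str) -> dict:
    """Create cache/jsonl/arrow sub-directories under one output directory.

    Hypothetical helper for illustration only.
    """
    sub_dirs = {}
    for name in ("cache", "jsonl", "arrow"):
        path = os.path.join(data_output_dirs, name)
        os.makedirs(path, exist_ok=True)  # idempotent: safe to re-run
        sub_dirs[name] = path
    return sub_dirs


if __name__ == "__main__":
    # Use a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as root:
        for name, path in prepare_output_dirs(root).items():
            print(name, "->", path)
```

Passing one root directory instead of three separate paths keeps the cache, the intermediate `jsonl` files, and the training-ready `arrow` files together, which appears to be the motivation for the change shown in this diff.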
@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3