
Commit 9b89591

Feat: Add doc on loading datasets and support for Azure/OCI (axolotl-ai-cloud#2482)
* fix: remove unused config
* feat: add doc on dataset loading
* feat: enable azure and oci remote file system
* feat: add adlfs and ocifs to requirements
* fix: add links between dataset formats and dataset loading
* fix: remove unused condition
* Revert "fix: remove unused condition"

  This reverts commit 5fe13be.
1 parent 31498d0 commit 9b89591

File tree

7 files changed: +328 −31 lines changed

_quarto.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -231,6 +231,7 @@ website:
 - docs/reward_modelling.qmd
 - docs/lr_groups.qmd
 - docs/lora_optims.qmd
+- docs/dataset_loading.qmd

 - section: "Core Concepts"
   contents:
```

docs/config.qmd

Lines changed: 1 addition & 1 deletion
```diff
@@ -109,7 +109,7 @@ datasets:
   preprocess_shards: # Optional[int] process dataset in N sequential chunks for memory efficiency (exclusive with `shards`)

   name: # Optional[str] name of dataset configuration to load
-  train_on_split: train # Optional[str] name of dataset split to load from
+  split: train # Optional[str] name of dataset split to load from
   revision: # Optional[str] The specific revision of the dataset to use when loading from the Hugging Face Hub. This can be a commit hash, tag, or branch name. If not specified, the latest version will be used. This parameter is ignored for local datasets.
   trust_remote_code: # Optional[bool] Trust remote code for untrusted source
```

docs/dataset-formats/index.qmd

Lines changed: 7 additions & 0 deletions
```diff
@@ -13,6 +13,13 @@ As there are a lot of available options in Axolotl, this guide aims to provide a

 Axolotl supports 3 kinds of training methods: pre-training, supervised fine-tuning, and preference-based post-training (e.g. DPO, ORPO, PRMs). Each method has its own dataset format, described below.

+::: {.callout-tip}
+
+This guide will mainly use JSONL as an introduction. Please refer to the [dataset loading docs](../dataset_loading.qmd) to understand how to load datasets from other sources.
+
+For `pretraining_dataset:` specifically, please refer to the [Pre-training section](#pre-training).
+
+:::
+
 ## Pre-training

 When aiming to train on large corpora of text, pre-training is your go-to choice. Due to the size of these datasets, downloading the entire dataset before beginning training would be prohibitively time-consuming. Axolotl supports [streaming](https://huggingface.co/docs/datasets/en/stream) to load only batches into memory as needed.
```

docs/dataset_loading.qmd

Lines changed: 276 additions & 0 deletions

---
title: Dataset Loading
description: Understanding how to load datasets from different sources
back-to-top-navigation: true
toc: true
toc-depth: 5
---

## Overview

Datasets can be loaded in a number of different ways, depending on how they are saved (the file extension) and where they are stored.

## Loading Datasets

We use the `datasets` library, with a mix of `load_dataset` and `load_from_disk`, to load datasets.

You may recognize that the configs under the `datasets` section of the config file share names with the arguments of `load_dataset`.

```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

::: {.callout-tip}

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most commonly used config is `path`, sometimes together with `data_files`.

:::

This matches the API of [`datasets.load_dataset`](https://github.com/huggingface/datasets/blob/0b5998ac62f08e358f8dcc17ec6e2f2a5e9450b6/src/datasets/load.py#L1838-L1858), so if you're familiar with that, you will feel right at home.

For HuggingFace's guide to loading different dataset types, see [here](https://huggingface.co/docs/datasets/loading).

For full details on the config, see [config.qmd](config.qmd).

::: {.callout-note}

You can load multiple datasets by adding more than one entry under `datasets`:

```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

:::

### Local dataset

#### Files

Usually, to load a JSON file, you would do something like this:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```

This translates to the following config:

```yaml
datasets:
  - path: json
    data_files: /path/to/your/file.jsonl
```

However, to make things easier, we have added a few shortcuts for loading local dataset files.

You can simply point `path` at the file or directory, along with a `ds_type` specifying the dataset type. The example below shows this for a JSONL file:

```yaml
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
```

This works for CSV, JSON, Parquet, and Arrow files.

::: {.callout-tip}

If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you can omit `ds_type` if you'd like.

:::
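For example, a minimal sketch relying on extension inference (the file path is a placeholder):

```yaml
datasets:
  # ds_type is omitted and inferred as parquet from the .parquet extension
  - path: /path/to/your/file.parquet
```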
#### Directory

If you're loading from a directory, point `path` at the directory.

Then, you have two options:

##### Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:

- a dataset saved with `datasets.save_to_disk`
- the entire directory of files (such as Parquet/Arrow files)

```yaml
datasets:
  - path: /path/to/your/directory
```

##### Loading specific files in directory

Provide `data_files` with a list of files to load.

```yaml
datasets:
  # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

  # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

  # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet
```

### HuggingFace Hub

The method used to load the dataset depends on how it was created: whether a folder of files was uploaded directly or a HuggingFace Dataset was pushed to the Hub.

::: {.callout-note}

If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag at the root level of the config file.

:::
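For example, a minimal sketch for a hypothetical private dataset (`your-org/private-dataset` is a placeholder; this assumes a valid HuggingFace token is available, e.g. via `huggingface-cli login`):

```yaml
# root-level flag enabling authenticated access to the Hub
hf_use_auth_token: true

datasets:
  - path: your-org/private-dataset
```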
#### Folder uploaded

This means the dataset is one or more files uploaded directly to the Hub.

```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```

#### HuggingFace Dataset

This means the dataset was created as a HuggingFace Dataset and pushed to the Hub via `Dataset.push_to_hub`.

```yaml
datasets:
  - path: org/dataset-name
```

::: {.callout-note}

Depending on the dataset, some other configs may be required, such as `name`, `split`, `revision`, or `trust_remote_code`.

:::
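For instance, a hypothetical sketch combining these optional configs (all values are placeholders):

```yaml
datasets:
  - path: org/dataset-name
    name: default            # dataset configuration name
    split: train             # dataset split to load
    revision: main           # commit hash, tag, or branch
    trust_remote_code: true  # only enable for sources you trust
```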
### Remote Filesystems

Via the `storage_options` config of `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

::: {.callout-warning}

This is currently experimental. Please let us know if you run into any issues!

:::

The only difference between the providers is that you need to prefix the path with the respective protocol.

```yaml
datasets:
  # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

  # Directory
  - path: s3://bucket-name/path/to/your/directory
```

For directories, we load via `load_from_disk`.

#### S3

Prefix the path with `s3://`.

The credentials are pulled in the following order:

- the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider

::: {.callout-note}

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

:::

Other environment variables that can be set can be found in the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables).

#### GCS

Prefix the path with `gs://` or `gcs://`.

The credentials are loaded in the following order:

- gcloud credentials
- for nodes on GCP, the Google metadata service
- anonymous access
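For example, a minimal sketch (bucket and path are placeholders; this assumes gcloud credentials are already configured):

```yaml
datasets:
  - path: gs://bucket-name/path/to/your/file.jsonl
```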
#### Azure

##### Gen 1

Prefix the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`

##### Gen 2

Prefix the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set can be found in the [adlfs docs](https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials).
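For example, minimal sketches for both generations (store, container, and file paths are placeholders):

```yaml
datasets:
  # Gen 1
  - path: adl://store-name/path/to/your/file.jsonl

  # Gen 2
  - path: abfs://container-name/path/to/your/file.jsonl
```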
#### OCI

Prefix the path with `oci://`.

Credentials are read in the following order:

- the `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- when on an OCI resource, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the [ocifs docs](https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables).
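For example, a minimal sketch (bucket, namespace, and path are placeholders; ocifs typically addresses objects as `bucket@namespace`):

```yaml
datasets:
  - path: oci://bucket-name@namespace/path/to/your/file.jsonl
```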
### HTTPS

The path should start with `https://`.

```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```

The dataset must be publicly accessible.

## Next steps

Now that you know how to load datasets, head over to the [dataset formats docs](dataset-formats) to learn how to transform your dataset into your target output format.

requirements.txt

Lines changed: 2 additions & 1 deletion
```diff
@@ -49,7 +49,8 @@ python-dotenv==1.0.1
 # remote filesystems
 s3fs>=2024.5.0
 gcsfs>=2024.5.0
-# adlfs
+adlfs>=2024.5.0
+ocifs==1.3.2

 zstandard==0.22.0
 fastcore
```
