---
title: Dataset Loading
description: Understanding how to load datasets from different sources
back-to-top-navigation: true
toc: true
toc-depth: 5
---

## Overview

Datasets can be loaded in a number of different ways depending on how the dataset is saved (the file extension) and where it is stored.

## Loading Datasets

We use the `datasets` library to load datasets, calling a mix of `load_dataset` and `load_from_disk` depending on the source.

You may recognize that the options under the `datasets` section of the config file share their names with the arguments of `load_dataset`.

```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```

::: {.callout-tip}

Do not feel overwhelmed by the number of options here; most of them are optional. In practice, the most commonly used options are `path` and sometimes `data_files`.

:::

This matches the API of [`datasets.load_dataset`](https://github.com/huggingface/datasets/blob/0b5998ac62f08e358f8dcc17ec6e2f2a5e9450b6/src/datasets/load.py#L1838-L1858), so if you're familiar with that, you will feel right at home.
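
As a rough illustration of that correspondence (all values below are placeholders, not defaults), a single `datasets` entry maps onto a call like:

```python
from datasets import load_dataset

# Hypothetical values, shown only to illustrate the config-to-argument mapping
dataset = load_dataset(
    "org/dataset-name",          # path
    name="subset-name",          # name
    data_files=["file1.jsonl"],  # data_files
    split="train",               # split
    revision="main",             # revision
    trust_remote_code=False,     # trust_remote_code
)
```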

For HuggingFace's guide to loading different dataset types, see [here](https://huggingface.co/docs/datasets/loading).

For full details on the config, see [config.qmd](config.qmd).

::: {.callout-note}

You can set multiple datasets in the config file by adding more than one entry under `datasets`.

```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```

:::

### Local dataset

#### Files

Usually, to load a JSON file, you would do something like this:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="/path/to/your/file.jsonl")
```

Which translates to the following config:

```yaml
datasets:
  - path: json
    data_files: /path/to/your/file.jsonl
```

However, to make things easier, we have added a few shortcuts for loading local dataset files.

You can simply point the `path` to the file or directory, along with the `ds_type`, to load the dataset. The example below is for a JSON Lines (`.jsonl`) file:

```yaml
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
```

This works for CSV, JSON, Parquet, and Arrow files.

::: {.callout-tip}

If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you could omit `ds_type` if you'd like.

:::
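
As a loose sketch of the kind of inference described above (illustrative only, not the actual implementation), the file extension simply selects the matching `datasets` loader:

```python
from pathlib import Path

# Hypothetical extension-to-loader table, for illustration only
EXTENSION_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_ds_type(file_path: str) -> str:
    """Guess the `ds_type` from a file's extension."""
    return EXTENSION_TO_DS_TYPE[Path(file_path).suffix.lower()]
```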

#### Directory

If you're loading a directory, you can point the `path` to the directory.

Then, you have two options:

##### Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:

- datasets saved with `Dataset.save_to_disk`
- loading entire directory of files (such as with parquet/arrow files)

```yaml
datasets:
  - path: /path/to/your/directory
```
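
A minimal sketch of that fallback order (assuming a local directory path; the real logic may differ):

```python
from datasets import load_dataset, load_from_disk

def load_directory(path: str):
    # First, try a dataset saved with `Dataset.save_to_disk`
    try:
        return load_from_disk(path)
    except FileNotFoundError:
        # Otherwise, let `load_dataset` scan the raw files in the directory
        return load_dataset(path)
```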

##### Loading specific files in directory

Provide `data_files` with the file (or list of files) to load.

```yaml
datasets:
  # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

  # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

  # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet
```
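
Roughly speaking, each of these entries corresponds to pointing `load_dataset` at the chosen files yourself (the paths below are placeholders):

```python
from datasets import load_dataset

# Single CSV file inside the directory
dataset = load_dataset("csv", data_files="/path/to/your/directory/file1.csv")

# Several JSON Lines files
dataset = load_dataset(
    "json",
    data_files=[
        "/path/to/your/directory/file1.jsonl",
        "/path/to/your/directory/file2.jsonl",
    ],
)
```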

### HuggingFace Hub

The method used to load the dataset depends on how it was created: whether the files were uploaded to the Hub directly, or a HuggingFace Dataset was pushed.

::: {.callout-note}

If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag at the root level of the config file (i.e. `hf_use_auth_token: true`).

:::
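
For reference, enabling that flag is roughly equivalent to passing your Hub token to `load_dataset` yourself (the dataset name and environment variable below are placeholders; this assumes you have a token available, e.g. via `huggingface-cli login` or `HF_TOKEN`):

```python
import os

from datasets import load_dataset

# Hypothetical private dataset; here the token is read from the environment
dataset = load_dataset("org/private-dataset", token=os.environ.get("HF_TOKEN"))
```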

#### Folder uploaded

This means the dataset consists of one or more files uploaded directly to the Hub.

```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```
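
This is roughly the same as selecting those files from the Hub repository with `load_dataset` directly (the names here are placeholders):

```python
from datasets import load_dataset

dataset = load_dataset(
    "org/dataset-name",
    data_files=["file1.jsonl", "file2.jsonl"],
)
```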

#### HuggingFace Dataset

This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via `Dataset.push_to_hub`.

```yaml
datasets:
  - path: org/dataset-name
```

::: {.callout-note}

Depending on the dataset, other options such as `name`, `split`, `revision`, or `trust_remote_code` may also be required.

:::

### Remote Filesystems

Via the `storage_options` passed to `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

::: {.callout-warning}

This is currently experimental. Please let us know if you run into any issues!

:::

The only difference between the providers is that you need to prepend the path with the respective protocol.

```yaml
datasets:
  # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

  # Directory
  - path: s3://bucket-name/path/to/your/directory
```

For directories, we load via `load_from_disk`.
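
Under the hood this is roughly equivalent to the following (the bucket name and credential keys are placeholders; `storage_options` is forwarded to the underlying fsspec filesystem):

```python
from datasets import load_dataset, load_from_disk

# Illustrative credentials only; normally these come from the environment
storage_options = {"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"}

# Single remote file
dataset = load_dataset(
    "json",
    data_files="s3://bucket-name/path/to/your/file.jsonl",
    storage_options=storage_options,
)

# Remote directory saved with `Dataset.save_to_disk`
dataset = load_from_disk(
    "s3://bucket-name/path/to/your/directory",
    storage_options=storage_options,
)
```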

#### S3

Prepend the path with `s3://`.

The credentials are pulled in the following order:

- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- from the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider

::: {.callout-note}

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

:::

Other environment variables that can be set can be found in the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables).

#### GCS

Prepend the path with `gs://` or `gcs://`.

The credentials are loaded in the following order:

- gcloud credentials
- for nodes on GCP, the Google metadata service
- anonymous access

#### Azure

##### Gen 1

Prepend the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`

##### Gen 2

Prepend the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set can be found in the [adlfs docs](https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials).

#### OCI

Prepend the path with `oci://`.

Credentials are read in the following order:

- `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- for nodes on OCI, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the [ocifs docs](https://ocifs.readthedocs.io/en/latest/getting-connected.html#Using-Environment-Variables).

### HTTPS

The path should start with `https://`.

```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```

This must be publicly accessible.
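
As a rough equivalent (placeholder URL), this is like handing the URL to `load_dataset` as a data file:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="https://path/to/your/dataset/file.jsonl")
```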

## Next steps

Now that you know how to load datasets, you can learn how to map your specific dataset format to your target output format in the [dataset formats docs](dataset-formats).