Commit 1070319

Add general example of split/subset on the main "Data Files Configuration" page (#1538)
* add general example of split/subset
* minor
1 parent 859b2d7 commit 1070319

2 files changed: +37 -14 lines changed

docs/hub/datasets-data-files-configuration.md

Lines changed: 22 additions & 2 deletions
@@ -11,17 +11,37 @@ Machine learning datasets typically have splits and may also have subsets. A dat
 
 ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)
 
-## File names and splits
+## Automatic splits detection
+
+Splits are automatically detected based on file and directory names. For example, this is a dataset with `train`, `test`, and `validation` splits:
+
+```
+my_dataset_repository/
+├── README.md
+├── train.csv
+├── test.csv
+└── validation.csv
+```
 
 To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).
 
-## Manual configuration
+## Manual splits and subsets configuration
 
 You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
 It is useful if you want to specify which file goes into which split manually.
 
 You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).
 
+Here is an example of a configuration defining a subset called "benchmark" with a `test` split.
+
+```yaml
+configs:
+- config_name: benchmark
+  data_files:
+  - split: test
+    path: benchmark.csv
+```
+
 See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. Look also to the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).
 
 ## Supported file formats
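
As a rough illustration of the "Automatic splits detection" section added in this file, the sketch below groups file paths by the split keyword found in their names. It is a deliberate simplification and not the Hub's actual matching logic: the keyword list is limited to the three splits shown in the example, and `infer_split` and `group_by_split` are hypothetical helper names. The authoritative rules are on the "File names and splits" page.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: only the three split names from the example are recognised here.
# The real detection rules are described in the "File names and splits" docs.
SPLIT_KEYWORDS = ("train", "test", "validation")


def infer_split(path):
    """Return the split keyword found in a file or directory name, or None."""
    parts = PurePosixPath(path).parts
    # Look at the file name first, then at parent directories, nearest first.
    for part in reversed(parts):
        stem = part.split(".")[0].lower()
        # Split on anything that is not a letter, so "train-00000" yields "train".
        tokens = re.split(r"[^a-z]+", stem)
        for keyword in SPLIT_KEYWORDS:
            if keyword in tokens:
                return keyword
    return None


def group_by_split(paths):
    """Map each detected split name to the list of paths assigned to it."""
    splits = defaultdict(list)
    for path in paths:
        split = infer_split(path)
        if split is not None:
            splits[split].append(path)
    return dict(splits)


files = ["README.md", "train.csv", "test.csv", "validation.csv"]
print(group_by_split(files))
```

In this sketch `README.md` is ignored because none of its name tokens match a split keyword, while a path such as `test/images.csv` would be assigned to `test` through its directory name.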

docs/hub/datasets-manual-configuration.md

Lines changed: 15 additions & 12 deletions
@@ -103,6 +103,21 @@ configs:
 ---
 ```
 
+<Tip>
+
+You can set a default subset using `default: true`
+
+```yaml
+- config_name: main_data
+  data_files: "main_data.csv"
+  default: true
+```
+
+This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.
+
+</Tip>
+
+
 ## Builder parameters
 
 Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:
@@ -120,15 +135,3 @@ configs:
 ```
 
 Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.
-
-<Tip>
-
-You can set a default subset using `default: true`
-
-```yaml
-- config_name: main_data
-  data_files: "main_data.csv"
-  default: true
-```
-
-</Tip>
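
To make the shape of the `configs` field concrete, here is a minimal sketch of how the two YAML examples in this commit could be read once parsed into Python objects. The list below is written by hand rather than loaded from a README, and `default_subset` and `files_for` are hypothetical helpers, not functions from any Hugging Face library. Treating a bare `data_files` string as the `train` split is an assumption of this sketch.

```python
# Assumption: this mirrors the parsed form of the YAML `configs` field,
# combining the "benchmark" and "main_data" examples from the diff above.
configs = [
    {
        "config_name": "benchmark",
        "data_files": [{"split": "test", "path": "benchmark.csv"}],
    },
    {"config_name": "main_data", "data_files": "main_data.csv", "default": True},
]


def default_subset(configs):
    """Return the name of the subset marked `default: true`, or None."""
    for config in configs:
        if config.get("default", False):
            return config["config_name"]
    return None


def files_for(configs, subset, split):
    """Return the paths declared for `split` inside `subset` (hypothetical helper)."""
    for config in configs:
        if config["config_name"] != subset:
            continue
        data_files = config["data_files"]
        if isinstance(data_files, str):
            # Assumption: a bare string names no split, so treat it as "train".
            return [data_files] if split == "train" else []
        # Otherwise expect a list of {"split": ..., "path": ...} entries.
        return [entry["path"] for entry in data_files if entry["split"] == split]
    return []


print(default_subset(configs))
print(files_for(configs, "benchmark", "test"))
```

The sketch returns `None` when no subset is marked as default instead of guessing one, since the diff only states what `default: true` does when it is present.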
