- """Load a dataset builder from the Hugging Face Hub, or a local dataset. A dataset builder can be used to inspect general information that is required to build a dataset (cache directory, config, dataset info, etc.)
- without downloading the dataset itself.
-
- You can find the list of datasets on the [Hub](https://huggingface.co/datasets) or with [`huggingface_hub.list_datasets`].
+ """Load a dataset builder which can be used to:

- A dataset is a directory that contains:
+ - Inspect general information that is required to build a dataset (cache directory, config, dataset info, features, data files, etc.)
+ - Download and prepare the dataset as Arrow files in the cache
+ - Get a streaming dataset without downloading or caching anything

- - some data files in generic formats (JSON, CSV, Parquet, text, etc.)
- - and optionally a dataset script, if it requires some code to read the data files. This is used to load any kind of formats or structures.
+ You can find the list of datasets on the [Hub](https://huggingface.co/datasets) or with [`huggingface_hub.list_datasets`].

- Note that dataset scripts can also download and read data files from anywhere - in case your data files already exist online.
+ A dataset is a directory that contains some data files in generic formats (JSON, CSV, Parquet, etc.) and possibly
+ in a generic structure (Webdataset, ImageFolder, AudioFolder, VideoFolder, etc.)

  Args:

  path (`str`):
  Path or name of the dataset.
- Depending on `path`, the dataset builder that is used comes from a generic dataset script (JSON, CSV, Parquet, text etc.) or from the dataset script (a python file) inside the dataset directory.

- For local datasets:
+ - if `path` is a dataset repository on the HF hub (list all available datasets with [`huggingface_hub.list_datasets`])
+ -> load the dataset builder from supported files in the repository (csv, json, parquet, etc.)
+ e.g. `'username/dataset_name'`, a dataset repository on the HF hub containing the data files.

- - if `path` is a local directory (containing data files only)
- -> load a generic dataset builder (csv, json, text etc.) based on the content of the directory
+ - if `path` is a local directory
+ -> load the dataset builder from supported files in the directory (csv, json, parquet, etc.)
  e.g. `'./path/to/directory/with/my/csv/data'`.
- - if `path` is a local dataset script or a directory containing a local dataset script (if the script has the same name as the directory)
- -> load the dataset builder from the dataset script
- e.g. `'./dataset/squad'` or `'./dataset/squad/squad.py'`.
-
- For datasets on the Hugging Face Hub (list all available datasets with [`huggingface_hub.list_datasets`])

- - if `path` is a dataset repository on the HF hub (containing data files only)
- -> load a generic dataset builder (csv, text etc.) based on the content of the repository
- e.g. `'username/dataset_name'`, a dataset repository on the HF hub containing your data files.
- - if `path` is a dataset repository on the HF hub with a dataset script (if the script has the same name as the directory)
- -> load the dataset builder from the dataset script in the dataset repository
- e.g. `glue`, `squad`, `'username/dataset_name'`, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.
+ - if `path` is the name of a dataset builder and `data_files` or `data_dir` is specified
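
The three `path` cases that the added lines describe (Hub repository, local directory, builder name plus `data_files` or `data_dir`) can be sketched as a small classifier. This is only an illustration of the documented cases, not the library's resolution logic: the function `describe_path`, the set `PACKAGED_BUILDERS`, and the order of the checks are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical builder names for illustration; the library keeps its own registry.
PACKAGED_BUILDERS = {"csv", "json", "parquet", "text"}


def describe_path(path, data_files=None, data_dir=None):
    """Map a `path` argument to one of the three cases in the new docstring."""
    has_explicit_files = data_files is not None or data_dir is not None
    if path in PACKAGED_BUILDERS and has_explicit_files:
        return "builder name with explicit data files"
    if os.path.isdir(path):
        return "local directory"
    # Assumption: anything else is treated as a Hub repository id.
    return "hub repository"


# Usage: one call per documented case.
with tempfile.TemporaryDirectory() as tmp_dir:
    print(describe_path(tmp_dir))
print(describe_path("csv", data_files="my_file.csv"))
print(describe_path("username/dataset_name"))
```

Note that none of the branches looks for a dataset script, which mirrors the removal of the script-based cases from this part of the docstring.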
  You can find the list of datasets on the [Hub](https://huggingface.co/datasets) or with [`huggingface_hub.list_datasets`].

- A dataset is a directory that contains:
-
- - some data files in generic formats (JSON, CSV, Parquet, text, etc.).
- - and optionally a dataset script, if it requires some code to read the data files. This is used to load any kind of formats or structures.
-
- Note that dataset scripts can also download and read data files from anywhere - in case your data files already exist online.
+ A dataset is a directory that contains some data files in generic formats (JSON, CSV, Parquet, etc.) and possibly
+ in a generic structure (Webdataset, ImageFolder, AudioFolder, VideoFolder, etc.)

  This function does the following under the hood:

- 1. Download and import in the library the dataset script from `path` if it's not already cached inside the library.
+ 1. Load a dataset builder:

- If the dataset has no dataset script, then a generic dataset script is imported instead (JSON, CSV, Parquet, text, etc.)
+ * Find the most common data format in the dataset and pick its associated builder (JSON, CSV, Parquet, Webdataset, ImageFolder, AudioFolder, etc.)
+ * Find which file goes into which split (e.g. train/test) based on file and directory names or on the YAML configuration
+ * It is also possible to specify `data_files` manually, and which dataset builder to use (e.g. "parquet").

- Dataset scripts are small python scripts that define dataset builders. They define the citation, info and format of the dataset,
- contain the path or URL to the original data files and the code to load examples from the original data files.
+ 2. Run the dataset builder:

- You can find the complete list of datasets in the Datasets [Hub](https://huggingface.co/datasets).
+ In the general case:

- 2. Run the dataset script which will:
-
- * Download the dataset file from the original URL (see the script) if it's not already available locally or cached.
+ * Download the data files from the dataset if they are not already available locally or cached.
  * Process and cache the dataset in typed Arrow tables for caching.

  Arrow table are arbitrarily long, typed tables which can store nested objects and be mapped to numpy/pandas/python generic types.
  They can be directly accessed from disk, loaded in RAM or even streamed over the web.

+ In the streaming case:
+
+ * Don't download or cache anything. Instead, the dataset is lazily loaded and will be streamed on-the-fly when iterating on it.
+
  3. Return a dataset built from the requested splits in `split` (default: all).

- It also allows to load a dataset from a local directory or a dataset repository on the Hugging Face Hub without dataset script.
- In this case, it automatically loads all the data files from the directory or the dataset repository.
+ It can also use a custom dataset builder if the dataset contains a dataset script, but this feature is mostly for backward compatibility.
+ In this case the dataset script file must be named after the dataset repository or directory and end with ".py".

  Args:

  path (`str`):
  Path or name of the dataset.
- Depending on `path`, the dataset builder that is used comes from a generic dataset script (JSON, CSV, Parquet, text etc.) or from the dataset script (a python file) inside the dataset directory.

- For local datasets:
+ - if `path` is a dataset repository on the HF hub (list all available datasets with [`huggingface_hub.list_datasets`])
+ -> load the dataset from supported files in the repository (csv, json, parquet, etc.)
+ e.g. `'username/dataset_name'`, a dataset repository on the HF hub containing the data files.

- - if `path` is a local directory (containing data files only)
- -> load a generic dataset builder (csv, json, text etc.) based on the content of the directory
+ - if `path` is a local directory
+ -> load the dataset from supported files in the directory (csv, json, parquet, etc.)
  e.g. `'./path/to/directory/with/my/csv/data'`.
- - if `path` is a local dataset script or a directory containing a local dataset script (if the script has the same name as the directory)
- -> load the dataset builder from the dataset script
- e.g. `'./dataset/squad'` or `'./dataset/squad/squad.py'`.

- For datasets on the Hugging Face Hub (list all available datasets with [`huggingface_hub.list_datasets`])
-
- - if `path` is a dataset repository on the HF hub (containing data files only)
- -> load a generic dataset builder (csv, text etc.) based on the content of the repository
- e.g. `'username/dataset_name'`, a dataset repository on the HF hub containing your data files.
- - if `path` is a dataset repository on the HF hub with a dataset script (if the script has the same name as the directory)
- -> load the dataset builder from the dataset script in the dataset repository
- e.g. `glue`, `squad`, `'username/dataset_name'`, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.
+ - if `path` is the name of a dataset builder and `data_files` or `data_dir` is specified
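
Step 1 of the rewritten "under the hood" list (pick the builder for the most common data format, then decide which file goes into which split from file and directory names) can be illustrated with a toy version. This is a minimal sketch under stated assumptions, not the library's implementation: the tables `EXTENSION_TO_BUILDER` and `SPLIT_KEYWORDS`, the helpers `infer_builder` and `assign_splits`, and the fallback to `train` are all hypothetical, and the YAML configuration path is not covered.

```python
import re
from collections import Counter, defaultdict
from pathlib import PurePosixPath

# Hypothetical lookup tables, for illustration only.
EXTENSION_TO_BUILDER = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".txt": "text",
}
SPLIT_KEYWORDS = {
    "train": "train",
    "test": "test",
    "validation": "validation",
    "valid": "validation",
    "dev": "validation",
}


def infer_builder(files):
    """Return the builder name associated with the most common supported extension."""
    counts = Counter()
    for name in files:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in EXTENSION_TO_BUILDER:
            counts[EXTENSION_TO_BUILDER[suffix]] += 1
    if not counts:
        raise ValueError("no supported data files found")
    return counts.most_common(1)[0][0]


def assign_splits(files):
    """Group files by the first split keyword found in their path (default: train)."""
    splits = defaultdict(list)
    for name in files:
        # Break the path into lowercase alphabetic tokens, e.g. "data", "train", "parquet".
        tokens = re.split(r"[^a-z]+", name.lower())
        split = next((SPLIT_KEYWORDS[t] for t in tokens if t in SPLIT_KEYWORDS), "train")
        splits[split].append(name)
    return dict(splits)


# Usage: unsupported files such as README.md are ignored when choosing the builder.
files = [
    "data/train-00000.parquet",
    "data/train-00001.parquet",
    "data/test-00000.parquet",
    "README.md",
]
builder = infer_builder(files)
data_files = [
    f for f in files
    if EXTENSION_TO_BUILDER.get(PurePosixPath(f).suffix.lower()) == builder
]
print(builder, assign_splits(data_files))
```

Passing `data_files` and a builder name explicitly, as the third bullet of step 1 allows, would bypass both helpers in this sketch.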