diff --git a/README.md b/README.md
index 2ed27078cba..d4162b9e761 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
 🤗 Datasets is a lightweight library providing **two** main features:
 
 - **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
-- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
+- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
 
 [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
 
diff --git a/docs/source/loading.mdx b/docs/source/loading.mdx
index 23cd9aa5de0..443148babed 100644
--- a/docs/source/loading.mdx
+++ b/docs/source/loading.mdx
@@ -178,6 +178,29 @@ The cache directory to store intermediate processing results will be the Arrow f
 For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.
 
+### HDF5
+
+[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("hdf5", data_files="data.h5")
+```
+
+Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that every dataset in the file has the same number of rows along its first dimension.
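+
+For example, a file with a compatible layout could be written with [h5py](https://www.h5py.org/) (a separate library, used here only for illustration; the column names are made up):
+
+```py
+>>> import h5py
+>>> import numpy as np
+>>> with h5py.File("data.h5", "w") as f:
+...     # every dataset shares the same number of rows (1000) along its first dimension,
+...     # so each one can be loaded as a column
+...     f["id"] = np.arange(1000)
+...     f["embedding"] = np.random.rand(1000, 128)
+```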
+
 ### SQL
 
 Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:
 
diff --git a/docs/source/tabular_load.mdx b/docs/source/tabular_load.mdx
index 9a49d9505fc..1d5a54831d6 100644
--- a/docs/source/tabular_load.mdx
+++ b/docs/source/tabular_load.mdx
@@ -4,6 +4,7 @@ A tabular dataset is a generic dataset used to describe any data stored in rows
 
 - CSV files
 - Pandas DataFrames
+- HDF5 files
 - Databases
 
 ## CSV files
@@ -63,6 +64,24 @@ Use the `splits` parameter to specify the name of the dataset split:
 If the dataset doesn't look as expected, you should explicitly [specify your dataset features](loading#specify-features). A [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or if the Series only contains `None/NaN` objects, the type is set to `null`.
 
+## HDF5 files
+
+[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("hdf5", data_files="data.h5")
+```
+
+Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that every dataset in the file has the same number of rows along its first dimension.
+
 ## Databases
 
 Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.
 
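+For example, a minimal sketch of [`~datasets.Dataset.from_sql`] with a local SQLite file (the database URI and table name below are placeholders):
+
+```py
+>>> from datasets import Dataset
+>>> # pass a table name to read a whole table, or a SQL query to read a subset
+>>> dataset = Dataset.from_sql("reviews", con="sqlite:///my_database.db")
+```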