2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -90,6 +90,8 @@
title: Create a document dataset
- local: nifti_dataset
title: Create a medical imaging dataset
- local: dicom_dataset
title: Create a medical dataset containing images, signals, or videos and additional metadata
title: "Vision"
- sections:
- local: nlp_load
@@ -1,26 +1,34 @@
## Medical Imaging Dataset Guide

Two formats are commonly used for medical imaging data: NIfTI and DICOM. This guide covers how to create and share datasets in both formats using the `datasets` library.

These are typically used as follows:
- NIfTI (Neuroimaging Informatics Technology Initiative): storing MRI, fMRI, CT, and PET scans in research settings.
- DICOM (Digital Imaging and Communications in Medicine): storing medical images in clinical settings, including metadata about patients and imaging procedures.

### Create a NIfTI or DICOM dataset

This page shows how to create and share a dataset of medical images in NIfTI format (.nii / .nii.gz) or DICOM format (.dcm) using the `datasets` library.

You can share a dataset with your team or with anyone in the community by creating a dataset repository on the Hugging Face Hub:

```py
from datasets import load_dataset

dataset = load_dataset("<username>/my_nifti_or_dicom_dataset")
```

There are two common ways to create a NIfTI or DICOM dataset:

- Create a dataset from local files in Python and upload it with `Dataset.push_to_hub`.
- Use a folder-based convention (one file per example) and a small helper to convert it into a `Dataset`.

> [!TIP]
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information.

### Local files

If you already have a list of file paths to medical imaging files, the easiest workflow is to create a `Dataset` from that list and cast the column to the `Nifti` (or `Dicom`) feature.

```py
from datasets import Dataset
@@ -35,7 +43,17 @@ ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
# or a dict {'bytes': None, 'path': '...'} when decode=False
```

For DICOM, use:

```py
from datasets import Dataset, Dicom

# simple example: create a dataset from file paths
files = ["/path/to/file_001.dcm", "/path/to/file_002.dcm"]
ds = Dataset.from_dict({"dicom": files}).cast_column("dicom", Dicom())
```

The `Nifti` and `Dicom` features support a `decode` parameter. When `decode=True` (the default), a NIfTI file is loaded into a `nibabel.nifti1.Nifti1Image` object and a DICOM file into a `pydicom.dataset.FileDataset` object. For NIfTI files you can access the image data as a numpy array with `img.get_fdata()`; for DICOM files use `img.pixel_array`.
When `decode=False`, the feature returns a dict with the file path and bytes.

```py
from datasets import Dataset, Nifti
@@ -45,15 +63,23 @@ img = ds[0]["nifti"] # instance of: nibabel.nifti1.Nifti1Image
arr = img.get_fdata()
```

```py
from datasets import Dataset, Dicom

ds = Dataset.from_dict({"dicom": ["/path/to/file_without_meta.dcm"]}).cast_column("dicom", Dicom(decode=True))
img = ds[0]["dicom"]
arr = img.pixel_array
```

After preparing the dataset you can push it to the Hub:

```py
ds.push_to_hub("<username>/my_nifti_or_dicom_dataset")
```

This will create a dataset repository containing your medical imaging dataset with a `data/` folder of parquet shards.

### Folder conventions and metadata

If you organize your dataset in folders you can create splits automatically (train/test/validation) by following a structure like:

@@ -64,7 +90,7 @@ dataset/validation/scan_1001.nii
dataset/test/scan_2001.nii
```

If you have labels or other metadata, provide a `metadata.csv`, `metadata.jsonl`, or `metadata.parquet` in the folder so files can be linked to metadata rows. The metadata must contain a `file_name` (or `*_file_name`) field with the relative path to the NIfTI/DICOM file next to the metadata file.

Example `metadata.csv`:

@@ -74,7 +100,7 @@ scan_0001.nii.gz,P001,45,healthy
scan_0002.nii.gz,P002,59,disease_x
```
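If you generate the metadata programmatically, the standard-library `csv` module is enough; here is a minimal sketch (all column names other than the required `file_name` are illustrative):

```python
import csv

# illustrative metadata rows; only the "file_name" column is required by `datasets`
rows = [
    {"file_name": "scan_0001.nii.gz", "patient_id": "P001", "age": 45, "diagnosis": "healthy"},
    {"file_name": "scan_0002.nii.gz", "patient_id": "P002", "age": 59, "diagnosis": "disease_x"},
]

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "patient_id", "age", "diagnosis"])
    writer.writeheader()
    writer.writerows(rows)
```

Place the resulting `metadata.csv` next to the files it describes (for example inside `dataset/train/`).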

The `Nifti` feature works with zipped datasets too — each zip can contain NIfTI files and a metadata file. This is useful when uploading large datasets as archives. Note that zipped archives are not supported for the `Dicom` feature.
This means your dataset structure could look like this (mixed compressed and uncompressed files):
```
dataset/train/scan_0001.nii.gz
@@ -83,7 +109,7 @@ dataset/validation/scan_1001.nii.gz
dataset/test/scan_2001.nii
```

### Converting to PyTorch tensors

Use the [`~Dataset.set_transform`] function to apply the transformation on-the-fly to batches of the dataset:

@@ -99,10 +125,23 @@ def transform_to_pytorch(example):
ds.set_transform(transform_to_pytorch)

```
For DICOM, the equivalent transform looks like this:

```py
import torch

def transform_to_pytorch(example):
    example["dicom_torch"] = [torch.tensor(ex.pixel_array) for ex in example["dicom"]]
    return example

ds.set_transform(transform_to_pytorch)
```
Accessing elements now (e.g. `ds[0]`) will yield torch tensors in the `"nifti_torch"` and `"dicom_torch"` keys.


### Usage of Nifti1Image

NIfTI is a format for storing the results of 3-dimensional (or even 4-dimensional) brain scans. This includes 3 spatial dimensions (x, y, z)
and optionally a time dimension (t). Furthermore, the given positions here are only relative to the scanner, therefore
@@ -127,4 +166,33 @@ for epi_img in nifti_ds:
```

For further reading we refer to the [nibabel documentation](https://nipy.org/nibabel/index.html) and especially [this nibabel tutorial](https://nipy.org/nibabel/coordinate_systems.html).

### Usage of Pydicom

DICOM files are loaded using the [pydicom](https://pydicom.github.io/) library, so you can use all of pydicom's functionality to access metadata and pixel data.

```py
from datasets import load_dataset

dicom_ds = load_dataset("<username>/my_dicom_dataset")
# load_dataset returns a DatasetDict, so iterate over a split
for dicom_img in dicom_ds["train"]:
    dicom_object = dicom_img["dicom"]
    print(dicom_object.PatientID)
    print(dicom_object.StudyDate)
    pixel_array = dicom_object.pixel_array
    print(pixel_array.shape)
```

You can visualize the DICOM images using matplotlib as follows:

```py
import matplotlib.pyplot as plt
from datasets import load_dataset

dicom_ds = load_dataset("<username>/my_dicom_dataset")
for dicom_img in dicom_ds["train"]:
    dicom_object = dicom_img["dicom"]
    plt.imshow(dicom_object.pixel_array, cmap=plt.cm.gray)
    plt.show()
```

For further reading we refer to the [pydicom documentation](https://pydicom.github.io/pydicom/stable/) and the [tutorials](https://pydicom.github.io/pydicom/stable/tutorials/index.html).
6 changes: 6 additions & 0 deletions docs/source/package_reference/loading_methods.mdx
@@ -109,6 +109,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

[[autodoc]] datasets.packaged_modules.niftifolder.NiftiFolder

### Dicom

[[autodoc]] datasets.packaged_modules.dicomfolder.DicomFolderConfig

[[autodoc]] datasets.packaged_modules.dicomfolder.DicomFolder

### WebDataset

[[autodoc]] datasets.packaged_modules.webdataset.WebDataset
4 changes: 4 additions & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -275,6 +275,10 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable

[[autodoc]] datasets.Nifti

### Dicom

[[autodoc]] datasets.Dicom

## Filesystems

[[autodoc]] datasets.filesystems.is_remote_filesystem
3 changes: 3 additions & 0 deletions setup.py
@@ -210,6 +210,8 @@

NIBABEL_REQUIRE = ["nibabel>=5.3.2"]

PYDICOM_REQUIRE = ["pydicom>=3.0.1"]

EXTRAS_REQUIRE = {
"audio": AUDIO_REQUIRE,
"vision": VISION_REQUIRE,
@@ -228,6 +230,7 @@
"docs": DOCS_REQUIRE,
"pdfs": PDFS_REQUIRE,
"nibabel": NIBABEL_REQUIRE,
"pydicom": PYDICOM_REQUIRE,
}

setup(
1 change: 1 addition & 0 deletions src/datasets/config.py
@@ -140,6 +140,7 @@
TORCHVISION_AVAILABLE = importlib.util.find_spec("torchvision") is not None
PDFPLUMBER_AVAILABLE = importlib.util.find_spec("pdfplumber") is not None
NIBABEL_AVAILABLE = importlib.util.find_spec("nibabel") is not None
PYDICOM_AVAILABLE = importlib.util.find_spec("pydicom") is not None

# Optional compression tools
RARFILE_AVAILABLE = importlib.util.find_spec("rarfile") is not None
2 changes: 2 additions & 0 deletions src/datasets/features/__init__.py
@@ -16,8 +16,10 @@
"Video",
"Pdf",
"Nifti",
"Dicom",
]
from .audio import Audio
from .dicom import Dicom
from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, LargeList, List, Sequence, Value
from .image import Image
from .nifti import Nifti