load_dataset revision param not respected when fetching from cache #7928

@Scott-Simmons

Describe the bug

datasets.load_dataset handles the revision argument inconsistently when the dataset is not found on the Hugging Face Hub. When it falls back to the latest cached version of the dataset, the revision argument is ignored, as long as any cached version of the dataset already exists in the HF cache.

Steps to reproduce the bug

import datasets
datasets.load_dataset(
    "sentientfutures/ahb",
    "dimensions",
    split="train",
    revision="main"
)
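# Expected: the call below raises, because 'invalid_revision' does not exist.
# Actual: the revision argument is ignored and the cached dataset is returned.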
datasets.load_dataset(
    "sentientfutures/ahb",
    "dimensions",
    split="train",
    revision="invalid_revision"
)

Expected behavior

The second call to datasets.load_dataset in the reproduction steps above should raise an error, for example:

raise DatasetNotFoundError(
datasets.exceptions.DatasetNotFoundError: Revision 'invalid_revision' doesn't exist for dataset 'sentientfutures/ahb' on the Hub.
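
Until the cache fallback respects revision, one possible workaround is to validate the revision against the Hub before calling load_dataset. The sketch below is only an illustration, not how datasets works internally: load_dataset_checked, hub_revision_exists and RevisionCheckError are hypothetical names introduced here. It assumes network access and relies on huggingface_hub.HfApi.dataset_info raising RevisionNotFoundError for an unknown revision.

```python
from typing import Any, Callable, Optional


class RevisionCheckError(ValueError):
    """Hypothetical error type used only by this sketch."""


def hub_revision_exists(repo_id: str, revision: str) -> bool:
    """Return True if `revision` exists for the dataset repo on the Hub.

    Requires network access. Other failures (for example a missing repo
    or a connection error) are deliberately left to propagate.
    """
    # Imported lazily so the wrapper below can be exercised with fakes.
    from huggingface_hub import HfApi
    from huggingface_hub.errors import RevisionNotFoundError

    try:
        HfApi().dataset_info(repo_id, revision=revision)
    except RevisionNotFoundError:
        return False
    return True


def load_dataset_checked(
    repo_id: str,
    *args: Any,
    revision: Optional[str] = None,
    revision_exists: Callable[[str, str], bool] = hub_revision_exists,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Validate `revision` explicitly, then delegate to load_dataset."""
    if revision is not None and not revision_exists(repo_id, revision):
        raise RevisionCheckError(
            f"Revision {revision!r} doesn't exist for dataset {repo_id!r} on the Hub."
        )
    if loader is None:
        # Default to the real loader; injectable for testing.
        from datasets import load_dataset as loader
    return loader(repo_id, *args, revision=revision, **kwargs)
```

With this wrapper, load_dataset_checked("sentientfutures/ahb", "dimensions", split="train", revision="invalid_revision") would be expected to raise RevisionCheckError before the cache is consulted. The trade-off is one extra Hub request per call, and the check cannot run offline.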

Environment info

  • datasets version: 4.4.1
  • Platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
  • Python version: 3.12.12
  • huggingface_hub version: 0.36.0
  • PyArrow version: 22.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2025.9.0
