
Help needed: fixing non-standard resource.path in vega-datasets #3946

@dsmedia

What is your suggestion?

I'm working on improving Data Package compliance in vega-datasets (PR #755). One blocker is that the current datapackage.json uses non-standard paths that fail validation.

This is a design oversight on my part in vega-datasets, not an Altair issue, but I need Altair's help to fix it without breaking Altair's dataset URLs.

The Problem

vega-datasets/
├── datapackage.json      # resource.path = "airports.csv"  ← wrong
└── data/
    └── airports.csv

Current datapackage.json:

{
  "name": "vega-datasets",
  "resources": [
    {
      "name": "airports",
      "path": "airports.csv",
      "format": "csv",
      "bytes": 21176,
      "hash": "sha256:608ba6d..."
    }
  ]
}

Per the Data Package spec, a relative resource path is resolved against the location of the descriptor. Since datapackage.json is at the repo root and the files live in data/, the correct path is data/airports.csv.
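To make the resolution rule concrete, here is a minimal sketch that rebuilds the layout above in a temporary directory and resolves both path values against the descriptor's directory. The helper name resolve_resource is hypothetical and only for illustration; it is not part of any validator or of Altair.

```python
import tempfile
from pathlib import Path


def resolve_resource(descriptor: Path, resource_path: str) -> Path:
    """Resolve a relative resource path against the descriptor's directory.

    Hypothetical helper for illustration only.
    """
    return descriptor.parent / resource_path


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the repo layout: descriptor at the root, files under data/.
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "airports.csv").write_text("iata,name\n")
    descriptor = root / "datapackage.json"
    descriptor.write_text("{}")

    # Current value: resolves to <root>/airports.csv, which does not exist.
    current_exists = resolve_resource(descriptor, "airports.csv").exists()
    # Fixed value: resolves to <root>/data/airports.csv, which does exist.
    fixed_exists = resolve_resource(descriptor, "data/airports.csv").exists()

print(current_exists, fixed_exists)  # False True
```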

Altair correctly compensates for this oversight by hardcoding data/ in the base URL (npm.py:62):

def dataset_base_url(self, version: BranchOrTag, /) -> LiteralString:
    return f"{self._prefix(version)}data/"

If I fix vega-datasets now, Altair's URL construction breaks: the hardcoded data/ prefix plus the corrected path yields .../data/data/airports.csv.

Request

Could Altair handle both path formats? This would let me fix vega-datasets without a coordinated release.

Format    resource.path       Expected URL
Current   airports.csv        .../data/airports.csv
Fixed     data/airports.csv   .../data/airports.csv
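The table above can be read as a small behavioral spec. Here is a rough plain-Python sketch of it, independent of Altair's internals: normalize_path, build_url, and the BASE placeholder URL are all hypothetical names chosen for this example, and the base URL is assumed to no longer include data/.

```python
DATA_DIR = "data/"


def normalize_path(path: str) -> str:
    """Return the path with a single leading 'data/' prefix.

    Hypothetical helper: accepts both the current and the fixed format.
    """
    return path if path.startswith(DATA_DIR) else f"{DATA_DIR}{path}"


def build_url(base_url: str, path: str) -> str:
    """Join a base URL (assumed to end with '/') and a normalized path."""
    return f"{base_url}{normalize_path(path)}"


# Placeholder base URL, not the real CDN prefix.
BASE = "https://example.invalid/vega-datasets/"

current = build_url(BASE, "airports.csv")
fixed = build_url(BASE, "data/airports.csv")

# Both formats map to the same URL, with no doubled data/data/ segment.
assert current == fixed == "https://example.invalid/vega-datasets/data/airports.csv"
```

A prefix check like this assumes every dataset lives directly under data/; nested directories or a different folder name would need a different rule.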

Suggested Approach

# npm.py - remove hardcoded "data/"
def dataset_base_url(self, version: BranchOrTag, /) -> LiteralString:
    return f"{self._prefix(version)}"

# datapackage.py - normalize paths for backwards compatibility
@property
def _url(self) -> Column:
    path_col = col("path")
    normalized = pl.when(path_col.str.starts_with("data/")).then(
        path_col
    ).otherwise(
        pl.concat_str(pl.lit("data/"), path_col)
    )
    expr = pl.concat_str(pl.lit(self._base_url), normalized)
    return Column("url", expr, "Remote url used to access dataset.")

file_name would also need to extract just the filename from the path:

Column("file_name", col("path").str.split("/").list.last(), ...)
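For reference, the intended result of that expression in plain Python, as a sketch rather than Altair code. The function name file_name_from_path is hypothetical; the split-based variant mirrors the split-and-take-last idea, and PurePosixPath is shown as an equivalent for these inputs.

```python
from pathlib import PurePosixPath


def file_name_from_path(path: str) -> str:
    """Return the last '/'-separated component of a resource path."""
    return path.split("/")[-1]


for path in ("airports.csv", "data/airports.csv"):
    # Both formats should yield the bare filename.
    assert file_name_from_path(path) == "airports.csv"
    # PurePosixPath gives the same answer for these inputs.
    assert PurePosixPath(path).name == "airports.csv"
```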

Migration

  1. Altair adds backwards-compatible path handling
  2. vega-datasets fixes paths to data/airports.csv
  3. Standard validators pass, Altair continues working

Happy to help with the implementation if useful.

Related

Have you considered any alternative solutions?

No response
