Fixing and refactoring `parse_dataset_uri()` #1352

ilongin · 2025-09-20T23:45:35Z

Fixes `parse_dataset_uri()` so it works with namespace names that start with `@v`, which is the same string as the version separator; this is where the bug was.
Also refactors the function to return the namespace name and project name as well. It now uses a regex instead of the complicated parsing logic, which also simplifies the code.
Added the missing tests for `parse_dataset_uri()` as well.
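
To make the ambiguity concrete, here is a minimal sketch. It assumes a naive parser that splits on the last `@v`; that is an illustration only and not necessarily what the old `parse_dataset_uri()` did. The helper names `naive_split` and `anchored_split` are hypothetical.

```python
import re
from typing import Optional


def naive_split(body: str) -> tuple[str, Optional[str]]:
    """Hypothetical naive parser: whatever follows the last '@v' is the version."""
    name, sep, version = body.rpartition("@v")
    return (name, version) if sep else (body, None)


# A version suffix only counts if it is a full MAJOR.MINOR.PATCH at the very end.
_VERSION_SUFFIX = re.compile(r"@v(\d+\.\d+\.\d+)$")


def anchored_split(body: str) -> tuple[str, Optional[str]]:
    """Hypothetical fix: only strip '@v<semver>' when it terminates the string."""
    if m := _VERSION_SUFFIX.search(body):
        return body[: m.start()], m[1]
    return body, None


# The namespace '@vlad' starts with '@v', so the naive split eats it.
print(naive_split("@vlad.proj.name"))            # ('', 'lad.proj.name')
print(anchored_split("@vlad.proj.name"))         # ('@vlad.proj.name', None)
print(anchored_split("@vlad.proj.name@v1.0.0"))  # ('@vlad.proj.name', '1.0.0')
```

Requiring a complete semver at the end of the string is what removes the ambiguity in this sketch: a namespace such as `@vlad` can never satisfy the digit pattern.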

Summary by Sourcery

Refactor dataset URI parsing to extract fully qualified identifiers using regex, fix namespace-vs-version ambiguity, update dependent code paths, and enhance test coverage.

New Features:

parse_dataset_uri now returns namespace, project, name, and optional version from ds:// URIs

Bug Fixes:

correctly handle namespaces starting with "@v" without misinterpreting them as version suffix

Enhancements:

use regex-based parsing and remove redundant parse_dataset_name calls in catalog processing
update catalog instantiation and node conversion to unpack full URI components

Tests:

add unit tests for valid and invalid parse_dataset_uri scenarios
update functional pull_dataset test to expect full semver format

cloudflare-workers-and-pages · 2025-09-20T23:45:48Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`4011829`
Status:	✅ Deploy successful!
Preview URL:	https://0ed10c6d.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-1351-fix-parsing-dat.datachain-documentation.pages.dev

sourcery-ai · 2025-09-20T23:52:22Z

Reviewer's Guide

Refactor parse_dataset_uri to use a regex-based parser that extracts namespace, project, name, and optional semantic version, update its signature and error handling, adjust call sites to the new return values, and add comprehensive unit tests while correcting a functional test version assertion.

Class diagram for updated parse_dataset_uri() return structure

classDiagram
class parse_dataset_uri {
  +str namespace
  +str project
  +str name
  +str|None version
}

Flow diagram for new parse_dataset_uri() logic

flowchart TD
    A["Input: ds://<namespace>.<project>.<name>[@v<semver>]"] --> B["Check prefix is 'ds://'"]
    B -->|Valid| C["Remove 'ds://' prefix"]
    C --> D["Apply regex to extract namespace, project, name, version"]
    D -->|Match| E["Return (namespace, project, name, version)"]
    D -->|No match| F["Raise ValueError: Invalid dataset URI format"]
    B -->|Invalid| G["Raise ValueError: Invalid dataset URI"]

File-Level Changes

Change	Details	Files
Reworked parse_dataset_uri function to use regex and return detailed components	Expanded return signature to (namespace, project, name, version); replaced urlparse and manual splitting with a verbose regex for validation; improved prefix check and raised ValueError on invalid URIs	`src/datachain/dataset.py`
Added unit tests for valid and invalid dataset URIs	Parametrized tests for URIs with and without @v<version>, including namespaces starting with '@v'; tests asserting ValueError messages on malformed URIs	`tests/unit/test_dataset.py`
Updated catalog code to unpack the new parse_dataset_uri output	Replaced separate parse_dataset_uri and parse_dataset_name calls with a single unpack; removed redundant namespace/project extraction and assertions	`src/datachain/catalog/catalog.py`
Corrected functional test expectation for version formatting	Changed test_pull to use full semantic version '5.0.0' instead of '5'; updated expected error message to include '5.0.0'	`tests/func/test_pull.py`

Possibly linked issues

remove docstring from DataModel.__pydantic__init_subclass__ #123: The PR fixes parse_dataset_uri to correctly handle namespaces starting with @v, which was the exact bug described in the issue.
remove docstring from DataModel.__pydantic__init_subclass__ #123: PR refactors parse_dataset_uri() to extract namespace and project names, enabling Catalog.get_dataset() to accept them directly.

codecov · 2025-09-20T23:53:00Z

Codecov Report

❌ Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.60%. Comparing base (51bb4b4) to head (4011829).

Files with missing lines	Patch %	Lines
src/datachain/dataset.py	83.33%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1352      +/-   ##
==========================================
- Coverage   87.60%   87.60%   -0.01%     
==========================================
  Files         157      157              
  Lines       14627    14623       -4     
  Branches     2107     2106       -1     
==========================================
- Hits        12814    12810       -4     
  Misses       1334     1334              
  Partials      479      479

Flag	Coverage Δ
datachain	`87.53% <85.71%> (-0.01%)`	⬇️

Files with missing lines	Coverage Δ
src/datachain/catalog/catalog.py	`84.11% <100.00%> (-0.08%)`	⬇️
src/datachain/dataset.py	`86.99% <83.33%> (ø)`

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Precompile the URI parsing regex at module level rather than inside the function to avoid recompiling it on every call
Consider extending the regex pattern to allow hyphens or other valid characters in namespace, project, and name instead of \w only
In _instantiate you catch all Exceptions when parsing the URI – narrow this to ValueError to avoid masking other errors
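
The three points above could look roughly like the sketch below. It is not the code in the PR: the `[\w-]` character class assumes hyphens are valid in all three name parts (the thread does not confirm this), `DataChainError` is a local stand-in for the project's exception, and `parse_remote_uri` is a hypothetical wrapper standing in for the call site in `_instantiate`.

```python
import re
from typing import Optional

# Compiled once at import time rather than on every call.
# Assumption: '-' is allowed in namespace, project, and dataset names.
_DATASET_URI_BODY_RE = re.compile(
    r"""
    ^(?P<namespace>@?[\w-]+)   # namespace, may start with '@'
    \.(?P<project>[\w-]+)      # project
    \.(?P<name>[\w-]+)         # dataset name
    (?:@v(?P<version>\d+\.\d+\.\d+))?$   # optional version suffix
    """,
    re.VERBOSE,
)


class DataChainError(Exception):
    """Stand-in for the project's own exception type."""


def parse_dataset_uri(uri: str) -> tuple[str, str, str, Optional[str]]:
    if not uri.startswith("ds://"):
        raise ValueError(f"Invalid dataset URI: {uri}")
    match = _DATASET_URI_BODY_RE.match(uri[len("ds://"):])
    if match is None:
        raise ValueError(f"Invalid dataset URI format: {uri}")
    return match["namespace"], match["project"], match["name"], match["version"]


def parse_remote_uri(uri: str) -> tuple[str, str, str, Optional[str]]:
    """Hypothetical call site: only ValueError is translated, nothing else is masked."""
    try:
        return parse_dataset_uri(uri)
    except ValueError as e:
        raise DataChainError("Error when parsing dataset uri") from e
```

With the narrower `except`, an unrelated failure such as a `TypeError` from passing `None` would surface as itself instead of being reported as a URI parsing error.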

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `src/datachain/catalog/catalog.py:1514-1519` </location>
<code_context>

         try:
-            remote_ds_name, version = parse_dataset_uri(remote_ds_uri)
+            (remote_namespace, remote_project, remote_ds_name, version) = (
+                parse_dataset_uri(remote_ds_uri)
+            )
         except Exception as e:
</code_context>

<issue_to_address>
**suggestion:** Exception handling could be more specific than a generic Exception.

Catching Exception may hide unrelated errors. Catch ValueError or a custom exception from parse_dataset_uri instead.

```suggestion
        try:
            (remote_namespace, remote_project, remote_ds_name, version) = (
                parse_dataset_uri(remote_ds_uri)
            )
        except ValueError as e:
            raise DataChainError("Error when parsing dataset uri") from e
```
</issue_to_address>

### Comment 2
<location> `src/datachain/dataset.py:48` </location>
<code_context>
def parse_dataset_uri(uri: str) -> tuple[str, str, str, str | None]:
    """
    Parse a dataset URI of the form:

        ds://<namespace>.<project>.<name>[@v<semver>]

    Components:
    - `ds://`        : required prefix identifying dataset URIs.
    - `namespace`    : required namespace, may start with '@' (e.g., "@user").
    - `project`      : required project name inside the namespace.
    - `name`         : required dataset name.
    - `@v<semver>`   : optional version suffix. Must start with '@v' and
                       be a semantic version string MAJOR.MINOR.PATCH
                       (e.g., "1.0.4").

    Returns:
        tuple[str, str, str, str | None]:
            (namespace, project, name, version) where version is None
            if not provided.

    Raises:
        ValueError: if the URI does not start with 'ds://' or does not
                    match the expected format.
    """

    if not uri.startswith("ds://"):
        raise ValueError(f"Invalid dataset URI: {uri}")

    body = uri[len("ds://") :]

    pattern = re.compile(
        r"""
        ^(?P<namespace>@?\w+)      # namespace, may start with '@'
        \. (?P<project>\w+)        # project
        \. (?P<name>\w+)           # dataset name
        (?:@v                      # optional version prefix must be '@v'
            (?P<version>\d+\.\d+\.\d+)
        )?$                        # end of string
        """,
        re.VERBOSE,
    )

    match = pattern.match(body)
    if not match:
        raise ValueError(f"Invalid dataset URI format: {uri}")

    return (
        match.group("namespace"),
        match.group("project"),
        match.group("name"),
        match.group("version"),
    )

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
- Replace m.group(x) with m[x] for re.Match objects [×4] ([`use-getitem-for-re-match-groups`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/use-getitem-for-re-match-groups/))
</issue_to_address>

sourcery-ai · 2025-09-20T23:53:15Z

src/datachain/catalog/catalog.py

        try:
-            remote_ds_name, version = parse_dataset_uri(remote_ds_uri)
+            (remote_namespace, remote_project, remote_ds_name, version) = (
+                parse_dataset_uri(remote_ds_uri)
+            )
        except Exception as e:
            raise DataChainError("Error when parsing dataset uri") from e


suggestion: Exception handling could be more specific than a generic Exception.

Catching Exception may hide unrelated errors. Catch ValueError or a custom exception from parse_dataset_uri instead.

Suggested change

         try:
             (remote_namespace, remote_project, remote_ds_name, version) = (
                 parse_dataset_uri(remote_ds_uri)
             )
-        except Exception as e:
+        except ValueError as e:
             raise DataChainError("Error when parsing dataset uri") from e

shcheklein · 2025-09-21T00:25:59Z

src/datachain/dataset.py

+        ds://<namespace>.<project>.<name>[@v<semver>]
+
+    Components:
+    - `ds://`        : required prefix identifying dataset URIs.


can we drop the ds:// by now? do you remember why it was needed at all?

That goes way back ... It was part of the requirements but I don't remember all the reasoning behind it. We can discuss separately as I don't want to remove anything like that in this PR

dreadatour · 2025-09-21T15:26:23Z

src/datachain/dataset.py

+    pattern = re.compile(
+        r"""
+        ^(?P<namespace>@?\w+)      # namespace, may start with '@'
+        \. (?P<project>\w+)        # project
+        \. (?P<name>\w+)           # dataset name
+        (?:@v                      # optional version prefix must be '@v'
+            (?P<version>\d+\.\d+\.\d+)
+        )?$                        # end of string
+        """,
+        re.VERBOSE,
+    )


Instead of using a regexp here, we can use parse_dataset_name first to get the optional namespace_name, optional project_name, and dataset_name (including version), and then parse dataset_name to split it into name and version 🤔

This way we will not split the logic of parsing a full dataset name (including namespace and project) between two different methods (parse_dataset_uri and parse_dataset_name). It stays in one place, so the next time we need to change anything related, we will not forget to do so, and only one place will require changes.

Refactored to set namespace and project as optional and to use parse_dataset_name().
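
A minimal sketch of that shape, under stated assumptions: `split_full_name` is a hypothetical stand-in for `parse_dataset_name()` (whose real signature is not shown in this thread), and a two-part name is assumed to mean `project.name`.

```python
import re
from typing import Optional

_VERSION_SUFFIX = re.compile(r"@v(?P<version>\d+\.\d+\.\d+)$")


def split_full_name(full_name: str) -> tuple[Optional[str], Optional[str], str]:
    """Hypothetical stand-in for parse_dataset_name().

    Splits on dots and pads missing leading parts with None, so both
    namespace and project are optional.
    """
    parts = full_name.split(".")
    if len(parts) > 3 or not all(parts):
        raise ValueError(f"Invalid dataset name: {full_name}")
    namespace, project, name = [None] * (3 - len(parts)) + parts
    return namespace, project, name


def parse_dataset_uri(
    uri: str,
) -> tuple[Optional[str], Optional[str], str, Optional[str]]:
    if not uri.startswith("ds://"):
        raise ValueError(f"Invalid dataset URI: {uri}")
    body = uri[len("ds://"):]

    # Strip the version first: it contains dots that must not be split on.
    version = None
    if m := _VERSION_SUFFIX.search(body):
        body, version = body[: m.start()], m["version"]

    namespace, project, name = split_full_name(body)
    return namespace, project, name, version
```

In this sketch all dot-splitting lives in one helper, and `parse_dataset_uri()` only handles the `ds://` prefix and the version suffix.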

dreadatour · 2025-09-21T15:27:40Z

src/datachain/dataset.py

+    pattern = re.compile(
+        r"""
+        ^(?P<namespace>@?\w+)      # namespace, may start with '@'
+        \. (?P<project>\w+)        # project


I bet we do allow @ in project name too. Let's check? 🙏

Yes, we do: https://github.com/iterative/datachain/blob/main/src/datachain/project.py#L10

@ is not allowed in project names or dataset names. You also cannot create a namespace name containing it, but we do automatically create namespaces named @<username>, so it can appear here. Regardless, I think you are right to assume it can be present, as it's safer that way.

dreadatour · 2025-09-21T15:28:48Z

src/datachain/dataset.py

+        \. (?P<project>\w+)        # project
+        \. (?P<name>\w+)           # dataset name
+        (?:@v                      # optional version prefix must be '@v'
+            (?P<version>\d+\.\d+\.\d+)


If I am not mistaken, we still do support single-digit versions (not semver) 🤔

We only allow integer versions in dc.read_dataset(), to stay backward compatible with users' old scripts. In that method we convert the integer version to a correct semver string and continue with that.
parse_dataset_uri() and other internal methods expect a full, valid semver.
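
As an illustration of that boundary, a sketch of the normalization step might look like this. The helper name `normalize_version` is hypothetical, and the mapping of integer `N` to `N.0.0` is an assumption based on the test change from '5' to '5.0.0' in this PR; the actual conversion in `dc.read_dataset()` is not shown here.

```python
import re
from typing import Union

_SEMVER = re.compile(r"\d+\.\d+\.\d+")


def normalize_version(version: Union[int, str]) -> str:
    """Hypothetical helper: accept a legacy integer or a full semver string."""
    if isinstance(version, int):
        # Assumption: legacy integer N corresponds to N.0.0.
        return f"{version}.0.0"
    if _SEMVER.fullmatch(version):
        return version
    raise ValueError(f"Invalid version: {version!r}")


print(normalize_version(5))        # 5.0.0
print(normalize_version("1.0.4"))  # 1.0.4
```

Internal functions can then assume a full semver and reject anything else, including the string "5".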

dreadatour · 2025-09-21T15:31:09Z

src/datachain/dataset.py

-        Output: (zalando, 3.0.1)
+    Parse a dataset URI of the form:
+
+        ds://<namespace>.<project>.<name>[@v<semver>]


Do we always require both namespace and project now? That looks like a breaking change 🤔

This method was only used in places that expected both, but you are right that it is better to make it more generic and leave them optional.

dreadatour · 2025-09-21T15:34:49Z

src/datachain/dataset.py

+        r"""
+        ^(?P<namespace>@?\w+)      # namespace, may start with '@'
+        \. (?P<project>\w+)        # project
+        \. (?P<name>\w+)           # dataset name


We also allow @ in dataset name: https://github.com/iterative/datachain/blob/main/src/datachain/dataset.py#L35

We don't; those are reserved characters that are not allowed.

dreadatour

I would prefer not to use a regexp here, but rather use parse_dataset_name to get the namespace and project name (as described here).

It also looks like this is not only a basic fix and refactoring: the PR also introduces a breaking change, since namespace and project are now required. Or am I wrong? If this is OK, then the PR description should state these changes more clearly. What do you think?

I would also love to see more cases in the added tests; maybe ask AI to help with that?

dreadatour

I'm approving this so we don’t block the fix for the issue, but I believe we could achieve a stronger solution with proper refactoring and better test coverage.

dreadatour · 2025-09-23T03:54:39Z

src/datachain/dataset.py

-
-    match = pattern.match(body)
+    # Split off optional @v<version>
+    match = re.match(r"^(?P<name>.+?)(?:@v(?P<version>\d+\.\d+\.\d+))?$", body)


We still have quite a lot of cases where this regexp will not work (and we have no tests for them). I would prefer to move the parsing of dataset name, version, namespace, project, etc. into a separate module and cover it all with tests, so we can reuse it across the code. At the same time, that looks like a separate refactoring task, and this is good enough for me for now.
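
One such case can be shown with the pattern copied from the diff above. The wrapper name `split_version` is hypothetical; the behaviour follows from the lazy `.+?` group combined with the optional version group.

```python
import re
from typing import Optional

# Pattern copied verbatim from the diff under review.
_PATTERN = re.compile(r"^(?P<name>.+?)(?:@v(?P<version>\d+\.\d+\.\d+))?$")


def split_version(body: str) -> tuple[str, Optional[str]]:
    """Hypothetical wrapper around the pattern, for experimenting."""
    m = _PATTERN.match(body)
    if m is None:
        raise ValueError(f"Invalid dataset URI format: {body}")
    return m["name"], m["version"]


# A well-formed semver suffix is split off as expected.
print(split_version("ns.proj.name@v1.0.0"))    # ('ns.proj.name', '1.0.0')
print(split_version("@v1.proj.name@v2.0.0"))   # ('@v1.proj.name', '2.0.0')

# A malformed suffix is not rejected: because the version group is optional,
# the lazy name group simply grows to swallow '@v1'.
print(split_version("ns.proj.name@v1"))        # ('ns.proj.name@v1', None)
```

Whether the later name parsing rejects `name@v1` depends on code outside this pattern, which is the kind of gap that negative tests would pin down.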

dreadatour · 2025-09-23T03:55:17Z

tests/unit/test_dataset.py

-@pytest.mark.parametrize(
-    "uri",
-    [
-        "ds://result",
-        "ds://[email protected]",
-        "ds://@[email protected]",
-        "ds://@[email protected]",
-    ],
-)
-def test_parse_dataset_uri_invalid_format(uri):
-    with pytest.raises(ValueError) as excinfo:
-        parse_dataset_uri(uri)
-
-    assert str(excinfo.value) == f"Invalid dataset URI format: {uri}"


Not sure why we removed these tests, which catch all the failures, and left only the "happy path" tests :(

shcheklein · 2025-10-18T04:27:07Z

@ilongin can you wrap it all up?

fixing and refactoring parsing dataset uri

d143342

ilongin linked an issue Sep 20, 2025 that may be closed by this pull request

Parsing dataset uri is broken if namespace starts with @v #1351

Open

ilongin requested review from dreadatour and shcheklein September 20, 2025 23:46

sourcery-ai bot reviewed Sep 20, 2025

shcheklein reviewed Sep 21, 2025

dreadatour reviewed Sep 21, 2025

dreadatour requested changes Sep 21, 2025

fixin parsing uri

4011829

ilongin requested review from dreadatour and shcheklein September 22, 2025 10:04

dreadatour approved these changes Sep 23, 2025
