Skip to content

Conversation

@ilongin
Copy link
Contributor

@ilongin ilongin commented Sep 20, 2025

Fixing function to work with namespace names that start with @v which is the same as version separator - this is where the bug was.
Also refactoring function to return namespace name and project name as well. Using regex instead of complicated parsing logic which also simplifies.
Added missing tests for parse_dataset_uri() as well.

Summary by Sourcery

Refactor dataset URI parsing to extract fully qualified identifiers using regex, fix namespace-vs-version ambiguity, update dependent code paths, and enhance test coverage.

New Features:

  • parse_dataset_uri now returns namespace, project, name, and optional version from ds:// URIs

Bug Fixes:

  • correctly handle namespaces starting with "@v" without misinterpreting them as version suffix

Enhancements:

  • use regex-based parsing and remove redundant parse_dataset_name calls in catalog processing
  • update catalog instantiation and node conversion to unpack full URI components

Tests:

  • add unit tests for valid and invalid parse_dataset_uri scenarios
  • update functional pull_dataset test to expect full semver format

@ilongin ilongin linked an issue Sep 20, 2025 that may be closed by this pull request
@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Sep 20, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 4011829
Status: ✅  Deploy successful!
Preview URL: https://0ed10c6d.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1351-fix-parsing-dat.datachain-documentation.pages.dev

View logs

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Sep 20, 2025

Reviewer's Guide

Refactor parse_dataset_uri to use a regex-based parser that extracts namespace, project, name, and optional semantic version, update its signature and error handling, adjust call sites to the new return values, and add comprehensive unit tests while correcting a functional test version assertion.

Class diagram for updated parse_dataset_uri() return structure

classDiagram
class parse_dataset_uri {
  +str namespace
  +str project
  +str name
  +str|None version
}
Loading

Flow diagram for new parse_dataset_uri() logic

flowchart TD
    A["Input: ds://<namespace>.<project>.<name>[@v<semver>]"] --> B["Check prefix is 'ds://'"]
    B -->|Valid| C["Remove 'ds://' prefix"]
    C --> D["Apply regex to extract namespace, project, name, version"]
    D -->|Match| E["Return (namespace, project, name, version)"]
    D -->|No match| F["Raise ValueError: Invalid dataset URI format"]
    B -->|Invalid| G["Raise ValueError: Invalid dataset URI"]
Loading

File-Level Changes

Change Details Files
Reworked parse_dataset_uri function to use regex and return detailed components
  • Expanded return signature to (namespace, project, name, version)
  • Replaced urlparse and manual splitting with a verbose regex for validation
  • Improved prefix check and raised ValueError on invalid URIs
src/datachain/dataset.py
Added unit tests for valid and invalid dataset URIs
  • Parametrized tests for URIs with and without @vversion, including namespaces starting with '@v'
  • Tests asserting ValueError messages on malformed URIs
tests/unit/test_dataset.py
Updated catalog code to unpack the new parse_dataset_uri output
  • Replaced separate parse_dataset_uri and parse_dataset_name calls with a single unpack
  • Removed redundant namespace/project extraction and assertions
src/datachain/catalog/catalog.py
Corrected functional test expectation for version formatting
  • Changed test_pull to use full semantic version '5.0.0' instead of '5'
  • Updated expected error message to include '5.0.0'
tests/func/test_pull.py

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@codecov
Copy link

codecov bot commented Sep 20, 2025

Codecov Report

❌ Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.60%. Comparing base (51bb4b4) to head (4011829).

Files with missing lines Patch % Lines
src/datachain/dataset.py 83.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1352      +/-   ##
==========================================
- Coverage   87.60%   87.60%   -0.01%     
==========================================
  Files         157      157              
  Lines       14627    14623       -4     
  Branches     2107     2106       -1     
==========================================
- Hits        12814    12810       -4     
  Misses       1334     1334              
  Partials      479      479              
Flag Coverage Δ
datachain 87.53% <85.71%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/datachain/catalog/catalog.py 84.11% <100.00%> (-0.08%) ⬇️
src/datachain/dataset.py 86.99% <83.33%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • Precompile the URI parsing regex at module level rather than inside the function to avoid recompiling it on every call
  • Consider extending the regex pattern to allow hyphens or other valid characters in namespace, project, and name instead of \w only
  • In _instantiate you catch all Exceptions when parsing the URI – narrow this to ValueError to avoid masking other errors
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Precompile the URI parsing regex at module level rather than inside the function to avoid recompiling it on every call
- Consider extending the regex pattern to allow hyphens or other valid characters in namespace, project, and name instead of \w only
- In _instantiate you catch all Exceptions when parsing the URI – narrow this to ValueError to avoid masking other errors

## Individual Comments

### Comment 1
<location> `src/datachain/catalog/catalog.py:1514-1519` </location>
<code_context>

         try:
-            remote_ds_name, version = parse_dataset_uri(remote_ds_uri)
+            (remote_namespace, remote_project, remote_ds_name, version) = (
+                parse_dataset_uri(remote_ds_uri)
+            )
         except Exception as e:
</code_context>

<issue_to_address>
**suggestion:** Exception handling could be more specific than a generic Exception.

Catching Exception may hide unrelated errors. Catch ValueError or a custom exception from parse_dataset_uri instead.

```suggestion
        try:
            (remote_namespace, remote_project, remote_ds_name, version) = (
                parse_dataset_uri(remote_ds_uri)
            )
        except ValueError as e:
            raise DataChainError("Error when parsing dataset uri") from e
```
</issue_to_address>

### Comment 2
<location> `src/datachain/dataset.py:48` </location>
<code_context>
def parse_dataset_uri(uri: str) -> tuple[str, str, str, str | None]:
    """
    Parse a dataset URI of the form:

        ds://<namespace>.<project>.<name>[@v<semver>]

    Components:
    - `ds://`        : required prefix identifying dataset URIs.
    - `namespace`    : required namespace, may start with '@' (e.g., "@user").
    - `project`      : required project name inside the namespace.
    - `name`         : required dataset name.
    - `@v<semver>`   : optional version suffix. Must start with '@v' and
                       be a semantic version string MAJOR.MINOR.PATCH
                       (e.g., "1.0.4").

    Returns:
        tuple[str, str, str, str | None]:
            (namespace, project, name, version) where version is None
            if not provided.

    Raises:
        ValueError: if the URI does not start with 'ds://' or does not
                    match the expected format.
    """

    if not uri.startswith("ds://"):
        raise ValueError(f"Invalid dataset URI: {uri}")

    body = uri[len("ds://") :]

    pattern = re.compile(
        r"""
        ^(?P<namespace>@?\w+)      # namespace, may start with '@'
        \. (?P<project>\w+)        # project
        \. (?P<name>\w+)           # dataset name
        (?:@v                      # optional version prefix must be '@v'
            (?P<version>\d+\.\d+\.\d+)
        )?$                        # end of string
        """,
        re.VERBOSE,
    )

    match = pattern.match(body)
    if not match:
        raise ValueError(f"Invalid dataset URI format: {uri}")

    return (
        match.group("namespace"),
        match.group("project"),
        match.group("name"),
        match.group("version"),
    )

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
- Replace m.group(x) with m[x] for re.Match objects [×4] ([`use-getitem-for-re-match-groups`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/use-getitem-for-re-match-groups/))
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 1514 to 1519
try:
remote_ds_name, version = parse_dataset_uri(remote_ds_uri)
(remote_namespace, remote_project, remote_ds_name, version) = (
parse_dataset_uri(remote_ds_uri)
)
except Exception as e:
raise DataChainError("Error when parsing dataset uri") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Exception handling could be more specific than a generic Exception.

Catching Exception may hide unrelated errors. Catch ValueError or a custom exception from parse_dataset_uri instead.

Suggested change
try:
remote_ds_name, version = parse_dataset_uri(remote_ds_uri)
(remote_namespace, remote_project, remote_ds_name, version) = (
parse_dataset_uri(remote_ds_uri)
)
except Exception as e:
raise DataChainError("Error when parsing dataset uri") from e
try:
(remote_namespace, remote_project, remote_ds_name, version) = (
parse_dataset_uri(remote_ds_uri)
)
except ValueError as e:
raise DataChainError("Error when parsing dataset uri") from e

ds://<namespace>.<project>.<name>[@v<semver>]
Components:
- `ds://` : required prefix identifying dataset URIs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we drop the ds:// by now? do you remember why it was needed at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That goes way back ... It was part of the requirements but I don't remember all the reasoning behind it. We can discuss separately as I don't want to remove anything like that in this PR

Comment on lines 76 to 86
pattern = re.compile(
r"""
^(?P<namespace>@?\w+) # namespace, may start with '@'
\. (?P<project>\w+) # project
\. (?P<name>\w+) # dataset name
(?:@v # optional version prefix must be '@v'
(?P<version>\d+\.\d+\.\d+)
)?$ # end of string
""",
re.VERBOSE,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using regexp, here we can use parse_dataset_name first, to get optional namespace_name, optional project_name, and dataset_name (including version) and then parse dataset_name to split it into name and version 🤔

This way we will not split the logic of parsing full dataset name (including namespace and project) between two different methods (parse_dataset_uri and parse_dataset_name), and will keep it in one place, so next time we will need to change anything related, we will not forget to do so and it will be only one place where changes are required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to set namespace and project as optional and to use parse_dataset_name().

pattern = re.compile(
r"""
^(?P<namespace>@?\w+) # namespace, may start with '@'
\. (?P<project>\w+) # project
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet we do allow @ in project name too. Let's check? 🙏

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@ilongin ilongin Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ is not allowed in project name and dataset name. Also, you cannot create namespace name with it, but we do create automatically namespaces with @<username> so it can appear here. Regardless, I think I think you are right to assume it can be present, as it's safer that way.

\. (?P<project>\w+) # project
\. (?P<name>\w+) # dataset name
(?:@v # optional version prefix must be '@v'
(?P<version>\d+\.\d+\.\d+)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, we still do support single-digit versions (not semver) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only allow integer versions in dc.read_dataset() to be backward compatible with old user's scripts. In that method we convert integer version to correct string semver and continue with that.
parse_dataset_uri() and other internal methods expect full valid semver.

Output: (zalando, 3.0.1)
Parse a dataset URI of the form:
ds://<namespace>.<project>.<name>[@v<semver>]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we always require both namespace and project now? It looks like breaking changes 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was used in places where it was expected to have both, but you are right to make this more generic and leave it as optional.

r"""
^(?P<namespace>@?\w+) # namespace, may start with '@'
\. (?P<project>\w+) # project
\. (?P<name>\w+) # dataset name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, those are reserved characters that are not allowed

Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to do not use regexp here, but rather use parse_dataset_name to get namespace and project name (as described here).

It also looks like it is not only basic fix and refactoring, but also this PR introduce breaking changes: namespace and project are required now. Or am I wrong? If this is ok, then we should have more clear message about these changes in PR description, what do you think?

I would also love to see more cases in added tests, may be ask AI to help with that?

Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving this so we don’t block the fix for the issue, but I believe we could achieve a stronger solution with proper refactoring and better test coverage.


match = pattern.match(body)
# Split off optional @v<version>
match = re.match(r"^(?P<name>.+?)(?:@v(?P<version>\d+\.\d+\.\d+))?$", body)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have quite a lot of cases where this regexp will not works (and we have no tests for all these cases), I would prefer to move parsing dataset name, version, namespace, project, etc to separate module and cover it all with tests, so we can reuse it all over the code, but same time it looks like a separate task to refactor this part and it is good enough for now for me.

Comment on lines 201 to 214
@pytest.mark.parametrize(
"uri",
[
"ds://result",
"ds://[email protected]",
"ds://@[email protected]",
"ds://@[email protected]",
],
)
def test_parse_dataset_uri_invalid_format(uri):
with pytest.raises(ValueError) as excinfo:
parse_dataset_uri(uri)

assert str(excinfo.value) == f"Invalid dataset URI format: {uri}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why did we remove these tests to catch all the failures and left only "happy path" tests :(

@shcheklein
Copy link
Member

@ilongin can you wrap it all up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parsing dataset uri is broken if namespace starts with @v

4 participants