-
Notifications
You must be signed in to change notification settings - Fork 132
Fixing and refactoring parse_dataset_uri()
#1352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Deploying datachain-documentation with
|
| Latest commit: |
4011829
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0ed10c6d.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-1351-fix-parsing-dat.datachain-documentation.pages.dev |
Reviewer's GuideRefactor parse_dataset_uri to use a regex-based parser that extracts namespace, project, name, and optional semantic version, update its signature and error handling, adjust call sites to the new return values, and add comprehensive unit tests while correcting a functional test version assertion. Class diagram for updated parse_dataset_uri() return structureclassDiagram
class parse_dataset_uri {
+str namespace
+str project
+str name
+str|None version
}
Flow diagram for new parse_dataset_uri() logicflowchart TD
A["Input: ds://<namespace>.<project>.<name>[@v<semver>]"] --> B["Check prefix is 'ds://'"]
B -->|Valid| C["Remove 'ds://' prefix"]
C --> D["Apply regex to extract namespace, project, name, version"]
D -->|Match| E["Return (namespace, project, name, version)"]
D -->|No match| F["Raise ValueError: Invalid dataset URI format"]
B -->|Invalid| G["Raise ValueError: Invalid dataset URI"]
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1352 +/- ##
==========================================
- Coverage 87.60% 87.60% -0.01%
==========================================
Files 157 157
Lines 14627 14623 -4
Branches 2107 2106 -1
==========================================
- Hits 12814 12810 -4
Misses 1334 1334
Partials 479 479
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Precompile the URI parsing regex at module level rather than inside the function to avoid recompiling it on every call
- Consider extending the regex pattern to allow hyphens or other valid characters in namespace, project, and name instead of \w only
- In _instantiate you catch all Exceptions when parsing the URI – narrow this to ValueError to avoid masking other errors
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Precompile the URI parsing regex at module level rather than inside the function to avoid recompiling it on every call
- Consider extending the regex pattern to allow hyphens or other valid characters in namespace, project, and name instead of \w only
- In _instantiate you catch all Exceptions when parsing the URI – narrow this to ValueError to avoid masking other errors
## Individual Comments
### Comment 1
<location> `src/datachain/catalog/catalog.py:1514-1519` </location>
<code_context>
try:
- remote_ds_name, version = parse_dataset_uri(remote_ds_uri)
+ (remote_namespace, remote_project, remote_ds_name, version) = (
+ parse_dataset_uri(remote_ds_uri)
+ )
except Exception as e:
</code_context>
<issue_to_address>
**suggestion:** Exception handling could be more specific than a generic Exception.
Catching Exception may hide unrelated errors. Catch ValueError or a custom exception from parse_dataset_uri instead.
```suggestion
try:
(remote_namespace, remote_project, remote_ds_name, version) = (
parse_dataset_uri(remote_ds_uri)
)
except ValueError as e:
raise DataChainError("Error when parsing dataset uri") from e
```
</issue_to_address>
### Comment 2
<location> `src/datachain/dataset.py:48` </location>
<code_context>
def parse_dataset_uri(uri: str) -> tuple[str, str, str, str | None]:
"""
Parse a dataset URI of the form:
ds://<namespace>.<project>.<name>[@v<semver>]
Components:
- `ds://` : required prefix identifying dataset URIs.
- `namespace` : required namespace, may start with '@' (e.g., "@user").
- `project` : required project name inside the namespace.
- `name` : required dataset name.
- `@v<semver>` : optional version suffix. Must start with '@v' and
be a semantic version string MAJOR.MINOR.PATCH
(e.g., "1.0.4").
Returns:
tuple[str, str, str, str | None]:
(namespace, project, name, version) where version is None
if not provided.
Raises:
ValueError: if the URI does not start with 'ds://' or does not
match the expected format.
"""
if not uri.startswith("ds://"):
raise ValueError(f"Invalid dataset URI: {uri}")
body = uri[len("ds://") :]
pattern = re.compile(
r"""
^(?P<namespace>@?\w+) # namespace, may start with '@'
\. (?P<project>\w+) # project
\. (?P<name>\w+) # dataset name
(?:@v # optional version prefix must be '@v'
(?P<version>\d+\.\d+\.\d+)
)?$ # end of string
""",
re.VERBOSE,
)
match = pattern.match(body)
if not match:
raise ValueError(f"Invalid dataset URI format: {uri}")
return (
match.group("namespace"),
match.group("project"),
match.group("name"),
match.group("version"),
)
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
- Replace m.group(x) with m[x] for re.Match objects [×4] ([`use-getitem-for-re-match-groups`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/use-getitem-for-re-match-groups/))
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| try: | ||
| remote_ds_name, version = parse_dataset_uri(remote_ds_uri) | ||
| (remote_namespace, remote_project, remote_ds_name, version) = ( | ||
| parse_dataset_uri(remote_ds_uri) | ||
| ) | ||
| except Exception as e: | ||
| raise DataChainError("Error when parsing dataset uri") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Exception handling could be more specific than a generic Exception.
Catching Exception may hide unrelated errors. Catch ValueError or a custom exception from parse_dataset_uri instead.
| try: | |
| remote_ds_name, version = parse_dataset_uri(remote_ds_uri) | |
| (remote_namespace, remote_project, remote_ds_name, version) = ( | |
| parse_dataset_uri(remote_ds_uri) | |
| ) | |
| except Exception as e: | |
| raise DataChainError("Error when parsing dataset uri") from e | |
| try: | |
| (remote_namespace, remote_project, remote_ds_name, version) = ( | |
| parse_dataset_uri(remote_ds_uri) | |
| ) | |
| except ValueError as e: | |
| raise DataChainError("Error when parsing dataset uri") from e |
src/datachain/dataset.py
Outdated
| ds://<namespace>.<project>.<name>[@v<semver>] | ||
| Components: | ||
| - `ds://` : required prefix identifying dataset URIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we drop the ds:// by now? do you remember why it was needed at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That goes way back ... It was part of the requirements but I don't remember all the reasoning behind it. We can discuss separately as I don't want to remove anything like that in this PR
src/datachain/dataset.py
Outdated
| pattern = re.compile( | ||
| r""" | ||
| ^(?P<namespace>@?\w+) # namespace, may start with '@' | ||
| \. (?P<project>\w+) # project | ||
| \. (?P<name>\w+) # dataset name | ||
| (?:@v # optional version prefix must be '@v' | ||
| (?P<version>\d+\.\d+\.\d+) | ||
| )?$ # end of string | ||
| """, | ||
| re.VERBOSE, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using regexp, here we can use parse_dataset_name first, to get optional namespace_name, optional project_name, and dataset_name (including version) and then parse dataset_name to split it into name and version 🤔
This way we will not split the logic of parsing full dataset name (including namespace and project) between two different methods (parse_dataset_uri and parse_dataset_name), and will keep it in one place, so next time we will need to change anything related, we will not forget to do so and it will be only one place where changes are required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to set namespace and project as optional and to use parse_dataset_name().
src/datachain/dataset.py
Outdated
| pattern = re.compile( | ||
| r""" | ||
| ^(?P<namespace>@?\w+) # namespace, may start with '@' | ||
| \. (?P<project>\w+) # project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet we do allow @ in project name too. Let's check? 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ is not allowed in project name and dataset name. Also, you cannot create namespace name with it, but we do create automatically namespaces with @<username> so it can appear here. Regardless, I think I think you are right to assume it can be present, as it's safer that way.
src/datachain/dataset.py
Outdated
| \. (?P<project>\w+) # project | ||
| \. (?P<name>\w+) # dataset name | ||
| (?:@v # optional version prefix must be '@v' | ||
| (?P<version>\d+\.\d+\.\d+) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am not mistaken, we still do support single-digit versions (not semver) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only allow integer versions in dc.read_dataset() to be backward compatible with old user's scripts. In that method we convert integer version to correct string semver and continue with that.
parse_dataset_uri() and other internal methods expect full valid semver.
src/datachain/dataset.py
Outdated
| Output: (zalando, 3.0.1) | ||
| Parse a dataset URI of the form: | ||
| ds://<namespace>.<project>.<name>[@v<semver>] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we always require both namespace and project now? It looks like breaking changes 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was used in places where it was expected to have both, but you are right to make this more generic and leave it as optional.
src/datachain/dataset.py
Outdated
| r""" | ||
| ^(?P<namespace>@?\w+) # namespace, may start with '@' | ||
| \. (?P<project>\w+) # project | ||
| \. (?P<name>\w+) # dataset name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also allow @ in dataset name: https://github.com/iterative/datachain/blob/main/src/datachain/dataset.py#L35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, those are reserved characters that are not allowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to do not use regexp here, but rather use parse_dataset_name to get namespace and project name (as described here).
It also looks like it is not only basic fix and refactoring, but also this PR introduce breaking changes: namespace and project are required now. Or am I wrong? If this is ok, then we should have more clear message about these changes in PR description, what do you think?
I would also love to see more cases in added tests, may be ask AI to help with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving this so we don’t block the fix for the issue, but I believe we could achieve a stronger solution with proper refactoring and better test coverage.
|
|
||
| match = pattern.match(body) | ||
| # Split off optional @v<version> | ||
| match = re.match(r"^(?P<name>.+?)(?:@v(?P<version>\d+\.\d+\.\d+))?$", body) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still have quite a lot of cases where this regexp will not works (and we have no tests for all these cases), I would prefer to move parsing dataset name, version, namespace, project, etc to separate module and cover it all with tests, so we can reuse it all over the code, but same time it looks like a separate task to refactor this part and it is good enough for now for me.
tests/unit/test_dataset.py
Outdated
| @pytest.mark.parametrize( | ||
| "uri", | ||
| [ | ||
| "ds://result", | ||
| "ds://[email protected]", | ||
| "ds://@[email protected]", | ||
| "ds://@[email protected]", | ||
| ], | ||
| ) | ||
| def test_parse_dataset_uri_invalid_format(uri): | ||
| with pytest.raises(ValueError) as excinfo: | ||
| parse_dataset_uri(uri) | ||
|
|
||
| assert str(excinfo.value) == f"Invalid dataset URI format: {uri}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why did we remove these tests to catch all the failures and left only "happy path" tests :(
|
@ilongin can you wrap it all up? |
Fixing function to work with namespace names that start with
@vwhich is the same as version separator - this is where the bug was.Also refactoring function to return namespace name and project name as well. Using regex instead of complicated parsing logic which also simplifies.
Added missing tests for
parse_dataset_uri()as well.Summary by Sourcery
Refactor dataset URI parsing to extract fully qualified identifiers using regex, fix namespace-vs-version ambiguity, update dependent code paths, and enhance test coverage.
New Features:
Bug Fixes:
Enhancements:
Tests: