Add dependency tracking to hash_callable()
#1386
base: main
Reviewer's Guide
Enhance hash_callable to include user-defined helper functions from the same module in the computed hash by adding a recursion guard, detecting dependencies via bytecode inspection, and incorporating their hashes into the final SHA256. A new unit test verifies that changes in helper functions affect the overall hash.
Class diagram for updated hash_callable dependency tracking
classDiagram
class hash_callable {
+hash_callable(func, _visited=None)
- Recursively hashes helper functions from the same module
- Tracks visited functions to avoid infinite recursion
- Inspects func.__code__.co_names for dependencies
- Only includes functions from same file
- Handles circular dependencies
}
class Function {
<<callable>>
}
hash_callable --> Function : hashes
Flow diagram for recursive dependency hashing in hash_callable
flowchart TD
A["Start hash_callable(func)"] --> B["Check if func is callable"]
B --> C["Initialize _visited set"]
C --> D["Check if func already visited"]
D -- Yes --> E["Return recursive hash marker"]
D -- No --> F["Add func to _visited"]
F --> G["Determine if lambda or named function"]
G --> H["Build payload and extras"]
H --> I["Inspect func.__code__.co_names for referenced names"]
I --> J["For each referenced name in func.__globals__"]
J --> K["If user-defined function from same module"]
K -- Yes --> L["Recursively hash dependency"]
L --> M["Add dependency hash to dependencies"]
K -- No --> N["Skip"]
M --> O["Include dependency hashes in SHA256"]
N --> O
O --> P["Return final hash"]
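The dependency-detection step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: find_same_module_helpers, add_one, and double_after_add are hypothetical names, and the sketch compares __module__ attributes directly instead of calling inspect.getmodule.

```python
import types


def find_same_module_helpers(func):
    """Return {name: function} for globals that func references and that
    are plain Python functions defined in the same module as func."""
    helpers = {}
    for name in func.__code__.co_names:
        # co_names also contains attribute names and builtins such as
        # "len"; those are usually absent from __globals__ and are skipped.
        obj = func.__globals__.get(name)
        if (
            isinstance(obj, types.FunctionType)
            and obj.__module__ == func.__module__
        ):
            helpers[name] = obj
    return helpers


# Hypothetical module-level helper and a function that calls it
def add_one(x):
    return x + 1


def double_after_add(x):
    return add_one(x) * 2 + len(str(x))


print(sorted(find_same_module_helpers(double_after_add)))  # ['add_one']
```

Builtins and classes fail the isinstance check, which plays the same role as the "Skip" branch in the diagram.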
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tests/unit/test_hash_utils.py:112-121` </location>
<code_context>
     assert len({h1, h2, h3}) == 3
+
+
+def test_hash_callable_with_dependencies():
+    """Test that hash_callable includes dependencies from the same module."""
+
+    # Define helper and function that uses it
+    def helper(x):
+        return x + 1
+
+    def func_with_helper(x):
+        return helper(x) * 2
+
+    hash1 = hash_callable(func_with_helper)
+    assert hash1 == "5b2dbae7cca8695acd62ea2ee2226277962c1c59a098ab948ff1b2e73b3d822c"
+
+    # Redefine helper with different implementation (same name, different code)
+    def helper(x):  # noqa: F811
+        return x + 10
+
+    def func_with_helper(x):
+        return helper(x) * 2
+
+    hash2 = hash_callable(func_with_helper)
+    assert hash2 == "099b86b464fb5a901393b28f073b7701f22a31775b5ce8402b4ea1116a50064e"
+
+    # Hashes should be different because helper changed
+    assert hash1 != hash2
</code_context>
<issue_to_address>
**suggestion (testing):** Missing tests for circular dependencies and multiple helpers.
Please add tests for functions with multiple helpers and for circular dependencies between helpers to fully validate the recursive hashing logic.
</issue_to_address>
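The two missing cases could be tested along these lines. This is a hedged sketch: sketch_hash is a simplified, hypothetical stand-in for hash_callable (it hashes bytecode and constants rather than source, so it runs without access to a source file), and scale, shift, pipeline, is_even, and is_odd are illustrative names. It checks inequality and termination only, not fixed hash values.

```python
import hashlib
import json
import types


def sketch_hash(func, _visited=None):
    """Simplified stand-in: bytecode + constants + same-module dependencies."""
    if _visited is None:
        _visited = set()
    if id(func) in _visited:
        # Cycle guard: return a marker instead of recursing forever
        return hashlib.sha256(f"recursive:{func.__name__}".encode()).hexdigest()
    _visited.add(id(func))

    code = func.__code__
    consts = [repr(c) for c in code.co_consts if not isinstance(c, types.CodeType)]
    deps = {}
    for name in code.co_names:
        obj = func.__globals__.get(name)
        if isinstance(obj, types.FunctionType) and obj.__module__ == func.__module__:
            deps[name] = sketch_hash(obj, _visited)

    h = hashlib.sha256()
    h.update(code.co_code)
    h.update(json.dumps([func.__name__, consts, deps], sort_keys=True).encode())
    return h.hexdigest()


# One function that depends on two helpers
def scale(x):
    return x * 3


def shift(x):
    return x + 1


def pipeline(x):
    return shift(scale(x))


# Mutually recursive helpers (a dependency cycle)
def is_even(n):
    return True if n == 0 else is_odd(n - 1)


def is_odd(n):
    return False if n == 0 else is_even(n - 1)


def test_changing_one_of_several_helpers_changes_hash():
    before = sketch_hash(pipeline)
    original = pipeline.__globals__["shift"]

    def shift(x):  # same name, different constant
        return x + 10

    pipeline.__globals__["shift"] = shift
    try:
        after = sketch_hash(pipeline)
    finally:
        pipeline.__globals__["shift"] = original
    assert before != after


def test_circular_helpers_terminate_and_are_stable():
    first = sketch_hash(is_even)
    assert first == sketch_hash(is_even)
    assert first != sketch_hash(is_odd)
```

The helpers live at module level on purpose, so that the lookup through __globals__ can find them.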
### Comment 2
<location> `src/datachain/hash_utils.py:115` </location>
<code_context>
def hash_callable(func, _visited=None):
    """
    Calculate a hash from a callable, including its dependencies.
    Rules:
    - Named functions (def) → use source code for stable, cross-version hashing
    - Lambdas → use bytecode (deterministic in same Python runtime)
    - Recursively hashes helper functions from the same module
    """
    if not callable(func):
        raise TypeError("Expected a callable")
    # Track visited functions to avoid infinite recursion
    if _visited is None:
        _visited = set()
    # Use id(func) to track which functions we've visited
    func_id = id(func)
    if func_id in _visited:
        return hashlib.sha256(f"recursive:{func.__name__}".encode()).hexdigest()
    _visited.add(func_id)
    # Determine if it is a lambda
    is_lambda = func.__name__ == "<lambda>"
    if not is_lambda:
        # Try to get exact source of named function
        try:
            lines, _ = inspect.getsourcelines(func)
            payload = textwrap.dedent("".join(lines)).strip()
        except (OSError, TypeError):
            # Fallback: bytecode if source not available
            payload = func.__code__.co_code
    else:
        # For lambdas, fall back directly to bytecode
        payload = func.__code__.co_code
    # Normalize annotations
    annotations = {
        k: getattr(v, "__name__", str(v)) for k, v in func.__annotations__.items()
    }
    # Extras to distinguish functions with same code but different metadata
    extras = {
        "name": func.__name__,
        "defaults": func.__defaults__,
        "annotations": annotations,
    }
    # Find helper functions that this function depends on
    dependencies = {}
    if hasattr(func, "__code__") and hasattr(func, "__globals__"):
        # Get all names referenced in the function's code
        referenced_names = func.__code__.co_names
        func_module = inspect.getmodule(func)
        for name in referenced_names:
            # Look up the name in the function's global namespace
            if name in func.__globals__:
                obj = func.__globals__[name]
                # Only hash user-defined functions from the same module
                # Skip built-ins, imported functions from other modules, and classes
                if (
                    callable(obj)
                    and hasattr(obj, "__module__")
                    and func_module is not None
                    and obj.__module__ == func_module.__name__
                    and not inspect.isclass(obj)
                    and not inspect.isbuiltin(obj)
                ):
                    # Recursively hash the dependency
                    try:
                        dependencies[name] = hash_callable(obj, _visited)
                    except (TypeError, OSError):
                        # If we can't hash it, skip it
                        pass
    # Compute SHA256
    h = hashlib.sha256()
    h.update(str(payload).encode() if isinstance(payload, str) else payload)
    h.update(str(extras).encode())
    # Include dependency hashes in sorted order for determinism
    if dependencies:
        deps_str = json.dumps(dependencies, sort_keys=True)
        h.update(deps_str.encode())
    return h.hexdigest()
</code_context>
<issue_to_address>
**issue (code-quality):** Low code quality found in hash\_callable - 21% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
Let's deprioritize this, or let me know why it is a priority. Even hashing the function is questionable: we actually want to be able to change it and run the same job again, and it will be very hard to make this robust and cover all cases. But even besides that, we have a way bigger priority: making the basic case work e2e, specifically a single chain with UDF restart. Let's please make that work first; then we can discuss improvements like this.
@ilongin can this be closed for now?
Currently
hash_callable() only hashed the function itself. Changes to helper functions in the same script didn't change the hash.
Example:
Solution
Now recursively tracks and hashes user-defined helper functions from the same module:
- func.__code__.co_names to find dependencies
Limitations
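A related CPython detail worth keeping in mind (an observation about code objects, not an item taken from the PR's own list): co_names only holds names resolved as globals or attributes. A helper defined in an enclosing function is captured as a closure variable and shows up in co_freevars and __closure__ instead. A minimal sketch, with hypothetical names make_pair, helper, and uses_helper:

```python
def make_pair():
    """Build a nested helper and a nested function that calls it."""

    def helper(x):
        return x + 1

    def uses_helper(x):
        return helper(x) * 2

    return helper, uses_helper


helper, uses_helper = make_pair()

# Names resolved as globals or attributes: empty for this function
print(uses_helper.__code__.co_names)  # ()

# Names captured from the enclosing scope
print(uses_helper.__code__.co_freevars)  # ('helper',)

# The captured function object is reachable through the closure cell
print(uses_helper.__closure__[0].cell_contents is helper)  # True
```

A dependency scan that reads only co_names and __globals__ would therefore not see helper in this case. Whether that matters depends on how UDF helpers are typically defined.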
Summary by Sourcery
Add recursive dependency tracking to hash_callable so that changes in user-defined helper functions within the same module affect the hash, with cycle detection to prevent infinite recursion.
New Features:
Enhancements:
Tests: