@ilongin
Contributor

@ilongin ilongin commented Oct 6, 2025

Previously, hash_callable() hashed only the function itself, so changes to helper functions in the same script did not change the hash.

Example:

def helper(x):
    return x + 1

def my_udf(x):
    return helper(x) * 2

hash1 = hash_callable(my_udf)

# Change helper
def helper(x):
    return x + 10

hash2 = hash_callable(my_udf)  # hash1 == hash2 -> wrong!

Solution

Now recursively tracks and hashes user-defined helper functions from the same module:

  • Inspects func.__code__.co_names to find dependencies
  • Only includes functions from same file (ignores imports, built-ins, classes)
  • Handles circular dependencies
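The lookup described in the bullets above can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: `find_same_module_helpers` is a hypothetical name, and it compares `__module__` strings directly instead of going through `inspect.getmodule`.

```python
import inspect


def find_same_module_helpers(func):
    """Return {name: function} for plain functions that `func` references by
    global name and that were defined in the same module as `func`."""
    helpers = {}
    for name in func.__code__.co_names:
        obj = func.__globals__.get(name)
        # Keep only plain Python functions; this skips classes, built-ins,
        # constants, and imported modules.
        if not inspect.isfunction(obj):
            continue
        # Functions imported from elsewhere carry a different __module__.
        if obj.__module__ != func.__module__:
            continue
        helpers[name] = obj
    return helpers


def helper(x):
    return x + 1


def my_udf(x):
    # `len` and `str` are built-ins and must not be reported as helpers.
    return helper(x) * 2 + len(str(x))


print(sorted(find_same_module_helpers(my_udf)))
```

Note that `co_names` also lists attribute names, so an attribute access such as `obj.helper` would match a global function called `helper` even though that function is never called. A scan like this can therefore over-report dependencies.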

Limitations

  1. Global variables/constants are not tracked - e.g. changing THRESHOLD = 100 to THRESHOLD = 200 won't change the hash, even if the function uses THRESHOLD
  2. Cross-module imports are ignored - changes to a helper brought in via from utils import helper are not tracked (by design)
  3. Closures are not tracked - values captured in closures are not reflected in the hash
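A small sketch of why limitations 1 and 3 follow from scanning `co_names` against `__globals__`. The names used here (`THRESHOLD`, `uses_constant`, `make_udf`, `inner_helper`) are made up for illustration.

```python
THRESHOLD = 100


def uses_constant(x):
    return x > THRESHOLD


def make_udf():
    def inner_helper(x):
        return x + 1

    def udf(x):
        # `inner_helper` is captured from the enclosing scope,
        # not looked up in the module globals.
        return inner_helper(x) * 2

    return udf


udf = make_udf()

# The constant is referenced by name, but its value is not a function,
# so a scan that only follows functions never sees 100 vs. 200.
print(uses_constant.__code__.co_names)

# The captured helper is a free variable: it is listed in co_freevars
# and is absent from both co_names and the module globals.
print(udf.__code__.co_freevars)
print("inner_helper" in udf.__code__.co_names, "inner_helper" in udf.__globals__)
```

Covering closures would presumably mean walking `co_freevars` together with `func.__closure__`, which is outside what this PR describes.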

Summary by Sourcery

Add recursive dependency tracking to hash_callable so that changes in user-defined helper functions within the same module affect the hash, with cycle detection to prevent infinite recursion.

New Features:

  • Recursively include user-defined helper functions from the same module in hash_callable
  • Handle circular dependencies by tracking visited functions

Enhancements:

  • Inspect function code names and globals to identify same-module dependencies and ignore imports, builtins, and classes
  • Include sorted dependency hashes in the hash payload for determinism

Tests:

  • Add unit test to verify that hash_callable changes when a helper function implementation is modified

@sourcery-ai
Contributor

sourcery-ai bot commented Oct 6, 2025

Reviewer's Guide

Enhance hash_callable to include user-defined helper functions from the same module in the computed hash by adding a recursion guard, detecting dependencies via bytecode inspection, and incorporating their hashes into the final SHA256. A new unit test verifies that changes in helper functions affect the overall hash.
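How a recursion guard of this kind keeps mutually recursive helpers from looping forever can be sketched like this. It is a simplified stand-in that hashes bytecode only; `sketch_hash` is a hypothetical name, and its digests are not the ones `hash_callable` produces.

```python
import hashlib
import inspect
import json


def sketch_hash(func, _visited=None):
    """Hash a function's bytecode plus the hashes of same-module helpers."""
    if _visited is None:
        _visited = set()
    # Already seen during this call: return a marker instead of recursing.
    if id(func) in _visited:
        return hashlib.sha256(f"recursive:{func.__name__}".encode()).hexdigest()
    _visited.add(id(func))

    deps = {}
    for name in func.__code__.co_names:
        obj = func.__globals__.get(name)
        if inspect.isfunction(obj) and obj.__module__ == func.__module__:
            deps[name] = sketch_hash(obj, _visited)

    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(func.__name__.encode())
    h.update(json.dumps(deps, sort_keys=True).encode())
    return h.hexdigest()


def is_even(n):
    return True if n == 0 else is_odd(n - 1)


def is_odd(n):
    return False if n == 0 else is_even(n - 1)


print(len(sketch_hash(is_even)))  # a 64-character hex digest
```

Because the visited set is shared for the whole call and never shrinks, a helper reached a second time through a different path also gets the marker hash, not only a helper that is part of a true cycle. The digest of a helper therefore depends on which function the walk started from.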

Class diagram for updated hash_callable dependency tracking

classDiagram
    class hash_callable {
        +hash_callable(func, _visited=None)
        - Recursively hashes helper functions from the same module
        - Tracks visited functions to avoid infinite recursion
        - Inspects func.__code__.co_names for dependencies
        - Only includes functions from same file
        - Handles circular dependencies
    }
    class Function {
        <<callable>>
    }
    hash_callable --> Function : hashes

Flow diagram for recursive dependency hashing in hash_callable

flowchart TD
    A["Start hash_callable(func)"] --> B["Check if func is callable"]
    B --> C["Initialize _visited set"]
    C --> D["Check if func already visited"]
    D -- Yes --> E["Return recursive hash marker"]
    D -- No --> F["Add func to _visited"]
    F --> G["Determine if lambda or named function"]
    G --> H["Build payload and extras"]
    H --> I["Inspect func.__code__.co_names for referenced names"]
    I --> J["For each referenced name in func.__globals__"]
    J --> K["If user-defined function from same module"]
    K -- Yes --> L["Recursively hash dependency"]
    L --> M["Add dependency hash to dependencies"]
    K -- No --> N["Skip"]
    M --> O["Include dependency hashes in SHA256"]
    N --> O
    O --> P["Return final hash"]

File-Level Changes

Change: Added recursion guard to hash_callable
Details:
  • Introduce optional _visited parameter
  • Track visited function IDs to prevent infinite loops
  • Return a special hash for already visited functions
Files: src/datachain/hash_utils.py

Change: Implemented dependency detection and recursive hashing
Details:
  • Inspect func.__code__.co_names for referenced names
  • Filter to user-defined functions in the same module (exclude imports, built-ins, classes)
  • Recursively hash each eligible helper and collect in dependencies dict
Files: src/datachain/hash_utils.py

Change: Integrated dependency hashes into final hash computation
Details:
  • Serialize dependencies dict with sorted keys
  • Include serialized dependency string in SHA256 update
Files: src/datachain/hash_utils.py

Change: Added unit test for dependency tracking in hash_callable
Details:
  • Define helper and user function, assert hash changes when helper code is modified
  • Verify that different helper implementations produce different hashes
Files: tests/unit/test_hash_utils.py
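The sorted-keys serialization listed in the changes above makes the serialized dependency string independent of dict insertion order. A tiny sketch, with made-up digests standing in for real helper hashes (`combine` is a hypothetical name):

```python
import hashlib
import json


def combine(payload: bytes, dependencies: dict) -> str:
    """Fold a payload and a {name: digest} mapping into one SHA256 digest."""
    h = hashlib.sha256()
    h.update(payload)
    if dependencies:
        # sort_keys=True gives the same string regardless of insertion order.
        h.update(json.dumps(dependencies, sort_keys=True).encode())
    return h.hexdigest()


a = combine(b"body", {"helper_a": "aa11", "helper_b": "bb22"})
b = combine(b"body", {"helper_b": "bb22", "helper_a": "aa11"})
c = combine(b"body", {"helper_a": "aa11", "helper_b": "CHANGED"})

print(a == b)  # same dependencies, different insertion order
print(a == c)  # one dependency digest changed
```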

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `tests/unit/test_hash_utils.py:112-121` </location>
<code_context>
     assert len({h1, h2, h3}) == 3
+
+
+def test_hash_callable_with_dependencies():
+    """Test that hash_callable includes dependencies from the same module."""
+
+    # Define helper and function that uses it
+    def helper(x):
+        return x + 1
+
+    def func_with_helper(x):
+        return helper(x) * 2
+
+    hash1 = hash_callable(func_with_helper)
+    assert hash1 == "5b2dbae7cca8695acd62ea2ee2226277962c1c59a098ab948ff1b2e73b3d822c"
+
+    # Redefine helper with different implementation (same name, different code)
+    def helper(x):  # noqa: F811
+        return x + 10
+
+    def func_with_helper(x):
+        return helper(x) * 2
+
+    hash2 = hash_callable(func_with_helper)
+    assert hash2 == "099b86b464fb5a901393b28f073b7701f22a31775b5ce8402b4ea1116a50064e"
+
+    # Hashes should be different because helper changed
+    assert hash1 != hash2
</code_context>

<issue_to_address>
**suggestion (testing):** Missing tests for circular dependencies and multiple helpers.

Please add tests for functions with multiple helpers and for circular dependencies between helpers to fully validate the recursive hashing logic.
</issue_to_address>
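A rough sketch of what the requested checks could look like. Since the helpers must be module-level for a `co_names` plus `__globals__` scan to find them, the checks are written as plain module-level asserts. `sketch_hash` is a simplified, hypothetical stand-in and not the datachain `hash_callable`, so the real tests would call `hash_callable` instead.

```python
import hashlib
import inspect
import json


def sketch_hash(func, _visited=None):
    """Simplified stand-in for hash_callable (NOT the datachain implementation)."""
    _visited = set() if _visited is None else _visited
    if id(func) in _visited:
        return hashlib.sha256(f"recursive:{func.__name__}".encode()).hexdigest()
    _visited.add(id(func))
    code = func.__code__
    deps = {}
    for name in code.co_names:
        obj = func.__globals__.get(name)
        if inspect.isfunction(obj) and obj.__module__ == func.__module__:
            deps[name] = sketch_hash(obj, _visited)
    # co_consts is included because constants such as 2 vs. 3 are not
    # necessarily part of co_code.
    payload = code.co_code + repr(code.co_consts).encode() + func.__name__.encode()
    deps_bytes = json.dumps(deps, sort_keys=True).encode()
    return hashlib.sha256(payload + deps_bytes).hexdigest()


def add_one(x):
    return x + 1


def double(x):
    return x * 2


def pipeline(x):
    return double(add_one(x))


def ping(n):
    return n if n <= 0 else pong(n - 1)


def pong(n):
    return n if n <= 0 else ping(n - 1)


# Multiple helpers: changing either one should change the caller's hash.
hash_before = sketch_hash(pipeline)


def double(x):  # noqa: F811 - redefinition simulates editing the helper
    return x * 3


hash_after = sketch_hash(pipeline)
assert hash_before != hash_after

# Circular helpers: hashing must terminate and be repeatable.
cycle_hash = sketch_hash(ping)
assert sketch_hash(ping) == cycle_hash


def pong(n):  # noqa: F811 - edit one side of the cycle
    return n if n <= 0 else ping(n - 2)


assert sketch_hash(ping) != cycle_hash
```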

### Comment 2
<location> `src/datachain/hash_utils.py:115` </location>
<code_context>
def hash_callable(func, _visited=None):
    """
    Calculate a hash from a callable, including its dependencies.
    Rules:
    - Named functions (def) → use source code for stable, cross-version hashing
    - Lambdas → use bytecode (deterministic in same Python runtime)
    - Recursively hashes helper functions from the same module
    """
    if not callable(func):
        raise TypeError("Expected a callable")

    # Track visited functions to avoid infinite recursion
    if _visited is None:
        _visited = set()

    # Use id(func) to track which functions we've visited
    func_id = id(func)
    if func_id in _visited:
        return hashlib.sha256(f"recursive:{func.__name__}".encode()).hexdigest()
    _visited.add(func_id)

    # Determine if it is a lambda
    is_lambda = func.__name__ == "<lambda>"

    if not is_lambda:
        # Try to get exact source of named function
        try:
            lines, _ = inspect.getsourcelines(func)
            payload = textwrap.dedent("".join(lines)).strip()
        except (OSError, TypeError):
            # Fallback: bytecode if source not available
            payload = func.__code__.co_code
    else:
        # For lambdas, fall back directly to bytecode
        payload = func.__code__.co_code

    # Normalize annotations
    annotations = {
        k: getattr(v, "__name__", str(v)) for k, v in func.__annotations__.items()
    }

    # Extras to distinguish functions with same code but different metadata
    extras = {
        "name": func.__name__,
        "defaults": func.__defaults__,
        "annotations": annotations,
    }

    # Find helper functions that this function depends on
    dependencies = {}
    if hasattr(func, "__code__") and hasattr(func, "__globals__"):
        # Get all names referenced in the function's code
        referenced_names = func.__code__.co_names
        func_module = inspect.getmodule(func)

        for name in referenced_names:
            # Look up the name in the function's global namespace
            if name in func.__globals__:
                obj = func.__globals__[name]

                # Only hash user-defined functions from the same module
                # Skip built-ins, imported functions from other modules, and classes
                if (
                    callable(obj)
                    and hasattr(obj, "__module__")
                    and func_module is not None
                    and obj.__module__ == func_module.__name__
                    and not inspect.isclass(obj)
                    and not inspect.isbuiltin(obj)
                ):
                    # Recursively hash the dependency
                    try:
                        dependencies[name] = hash_callable(obj, _visited)
                    except (TypeError, OSError):
                        # If we can't hash it, skip it
                        pass

    # Compute SHA256
    h = hashlib.sha256()
    h.update(str(payload).encode() if isinstance(payload, str) else payload)
    h.update(str(extras).encode())
    # Include dependency hashes in sorted order for determinism
    if dependencies:
        deps_str = json.dumps(dependencies, sort_keys=True)
        h.update(deps_str.encode())
    return h.hexdigest()

</code_context>

<issue_to_address>
**issue (code-quality):** Low code quality found in hash\_callable - 21% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>



@ilongin ilongin marked this pull request as draft October 6, 2025 13:45
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Oct 6, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 2fdc2f1
Status: 🚫 Deploy failed.

@shcheklein
Member

Let's deprioritize this, or let me know why this is a priority. Even hashing the function is questionable (we actually want to be able to change it and run the same job again), and it will be very hard to make it robust and cover all cases. But even besides that, we have a much bigger priority: make the basic case work e2e, specifically a single chain with UDF restart. Let's please make that work, then we can discuss improvements like this.

@shcheklein
Member

@ilongin can be closed for now?
