@shcheklein shcheklein commented Oct 19, 2025

Continuation of https://github.com/iterative/datachain/pull/1412/files

Since we use union in delta, we now set the schema in read_dataset properly so that it does not include sys columns.

Summary by Sourcery

Simplify and correct delta processing by filtering out system columns at read time, removing a redundant regeneration step, propagating delta parameters in storage reads, and adding tests to validate proper behavior

Bug Fixes:

  • Exclude system columns from the schema during delta reads to prevent incorrect sys_id unions

Enhancements:

  • Remove the manual _RegenerateSystemColumnsStep hack and streamline delta logic by passing delta parameters through read_storage
  • Adjust read_dataset to drop system signals when delta is enabled

Tests:

  • Add functional tests to verify that storage-based delta replays correctly regenerate system columns and include new records

Summary by Sourcery

Improve delta handling by excluding system columns at read time, removing the hacky regeneration step, consolidating delta logic in storage reads, and expanding functional tests

Bug Fixes:

  • Exclude system columns from the schema when reading datasets in delta mode to fix sys_id handling

Enhancements:

  • Remove the _RegenerateSystemColumnsStep hack and redundant _as_delta invocation
  • Propagate delta parameters directly in read_storage for unified delta logic

Tests:

  • Update delta functional test to use the new save signature and add a storage delta replay test for system column regeneration
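
To make the schema change concrete, here is a minimal sketch in plain Python. It does not use datachain's `SignalSchema`. The `drop_sys_columns` and `read_schema` helpers are hypothetical, and the `sys__` prefix is an assumption based on the `sys__id` name mentioned in this PR.

```python
SYS_PREFIX = "sys__"  # assumed naming convention for system columns


def drop_sys_columns(schema: dict[str, type]) -> dict[str, type]:
    """Return a copy of the schema without system columns."""
    return {
        name: typ
        for name, typ in schema.items()
        if not name.startswith(SYS_PREFIX)
    }


def read_schema(stored: dict[str, type], delta: bool = False) -> dict[str, type]:
    """Hypothetical reader: hide system columns only when delta is enabled."""
    return drop_sys_columns(stored) if delta else dict(stored)


# Illustrative stored schema; only sys__id is named in the PR itself.
stored = {"sys__id": int, "path": str, "size": int}

regular = read_schema(stored)
delta_view = read_schema(stored, delta=True)
```

In this sketch the stored schema is never mutated; the delta read simply works with a filtered copy, so a later union only sees user-facing columns.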

sourcery-ai bot commented Oct 19, 2025

Reviewer's Guide

This PR streamlines delta handling by filtering out system columns at read time, removing a redundant system-column regeneration step, propagating delta parameters through storage reads, and adding functional tests to ensure correct delta replay behavior.

Sequence diagram for delta read with system column filtering

sequenceDiagram
    participant "read_dataset()"
    participant "SignalSchema"
    participant "DataChain"
    participant "read_storage()"
    "read_dataset()"->>"SignalSchema": deserialize or from_column_types
    "read_dataset()"->>"SignalSchema": clone_without_sys_signals (if delta)
    "read_dataset()"->>"DataChain": create with filtered signals_schema
    "DataChain"->>"read_storage()": propagate delta parameters

Entity relationship diagram for signals schema changes in delta reads

erDiagram
    SIGNAL_SCHEMA {
        id int
        name string
        type string
        is_system bool
    }
    DATA_CHAIN {
        id int
        signals_schema_id int
    }
    DATA_CHAIN ||--o| SIGNAL_SCHEMA : uses
    %% When delta is enabled, DATA_CHAIN uses SIGNAL_SCHEMA with is_system = false

Class diagram for delta handling and system column changes

classDiagram
    class DataChain {
        +clone()
        +_query: Query
        +signals_schema: SignalSchema
    }
    class SignalSchema {
        +deserialize()
        +from_column_types()
        +mutate()
        +clone_without_sys_signals()
    }
    class read_dataset {
        +delta: bool
        +signals_schema: SignalSchema
    }
    class read_storage {
        +delta: bool
        +delta_on
        +delta_result_on
        +delta_compare
        +delta_retry
        +delta_unsafe
    }
    DataChain --> SignalSchema
    read_dataset --> SignalSchema
    read_storage --> DataChain
    read_storage --> read_dataset

    %% Removed class
    class _RegenerateSystemColumnsStep {
        -catalog: Catalog
        -hash_inputs()
        -apply()
    }
    %% Indicate removal
    _RegenerateSystemColumnsStep --|> Step
    %% Mark as removed
    class _RegenerateSystemColumnsStep {
        <<removed>>
    }

File-Level Changes

Change: Removed manual system-column regeneration hack
Details:
  • Deleted _RegenerateSystemColumnsStep class
  • Removed its injection in delta chain appending
  • Cleaned up related imports
Files: src/datachain/delta.py

Change: Propagated delta parameters directly through read_storage
Details:
  • Extended read_storage signature to accept delta arguments
  • Passed delta, on, result_on, compare, retry, unsafe flags into underlying reader
  • Eliminated separate _as_delta call in storage listing
Files: src/datachain/lib/dc/storage.py

Change: Filtered out system signals in read_dataset when delta is enabled
Details:
  • Inserted schema.clone_without_sys_signals() under delta branch
  • Adjusted signals_schema construction for delta reads
Files: src/datachain/lib/dc/datasets.py

Change: Enhanced functional tests for storage-based delta replay
Details:
  • Revised existing build_chain delta invocation to omit redundant params
  • Added test_storage_delta_replay_regenerates_system_columns
  • Verified system-column regeneration and new record inclusion
Files: tests/func/test_delta.py
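
The parameter propagation described above can be sketched as plain keyword forwarding. This is not datachain's code: `DeltaOptions`, `read_dataset_sketch`, `read_storage_sketch`, and the `listing:` naming are hypothetical, and the field types are guesses. Only the parameter names come from the class diagram in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeltaOptions:
    """Hypothetical bundle of the delta arguments named in the class diagram."""

    delta: bool = False
    delta_on: Optional[str] = None         # type is an assumption
    delta_result_on: Optional[str] = None  # type is an assumption
    delta_compare: Optional[str] = None    # type is an assumption
    delta_retry: Optional[bool] = None     # type is an assumption
    delta_unsafe: bool = False


def read_dataset_sketch(name: str, options: DeltaOptions) -> dict:
    """Stand-in for the underlying reader; it records what it was given."""
    return {"name": name, "options": options}


def read_storage_sketch(uri: str, **delta_kwargs) -> dict:
    """Stand-in storage reader that forwards delta arguments unchanged."""
    options = DeltaOptions(**delta_kwargs)
    listing_name = f"listing:{uri}"  # hypothetical listing dataset name
    return read_dataset_sketch(listing_name, options)


result = read_storage_sketch(
    "s3://bucket/images/", delta=True, delta_on="file.path"
)
```

Forwarding the options in one place means there is no separate step that has to re-apply delta settings after the read, which is the kind of duplication the PR removes.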

Possibly linked issues

  • #issue: The PR resolves the KeyError by correctly filtering out system columns like 'sys__id' in delta processing.


codecov bot commented Oct 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.79%. Comparing base (7258b2e) to head (9e7340b).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1413      +/-   ##
==========================================
- Coverage   87.80%   87.79%   -0.02%     
==========================================
  Files         160      160              
  Lines       15207    15192      -15     
  Branches     2178     2178              
==========================================
- Hits        13353    13338      -15     
  Misses       1350     1350              
  Partials      504      504              
Flag Coverage Δ
datachain 87.75% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/delta.py 97.53% <ø> (+0.65%) ⬆️
src/datachain/lib/dc/datasets.py 95.23% <100.00%> (+0.11%) ⬆️
src/datachain/lib/dc/storage.py 100.00% <ø> (ø)

... and 1 file with indirect coverage changes


Deploying datachain-documentation with Cloudflare Pages

Latest commit: 9e7340b
Status: ✅  Deploy successful!
Preview URL: https://d03748bf.datachain-documentation.pages.dev
Branch Preview URL: https://fix-sys-id-delta-2.datachain-documentation.pages.dev

@shcheklein shcheklein requested a review from dreadatour October 26, 2025 02:10
@shcheklein shcheklein marked this pull request as ready for review October 26, 2025 02:10
@shcheklein shcheklein requested a review from a team October 26, 2025 02:10

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `tests/func/test_delta.py:338-343` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>
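
As one possible illustration of the advice above, the sketch below splits a branching test into two linear tests and moves the shared extraction into a helper. The function names and the row layout are hypothetical and are not taken from tests/func/test_delta.py.

```python
def collect_ids(rows: list[dict]) -> list[int]:
    """Helper: return the sorted ids from a list of row dictionaries."""
    return sorted(row["id"] for row in rows)


def test_initial_run_sees_existing_records():
    # No branching: the expected value is a literal that can be read off directly.
    rows = [{"id": 3}, {"id": 1}, {"id": 2}]
    assert collect_ids(rows) == [1, 2, 3]


def test_replay_includes_new_record():
    # The replay case gets its own test instead of an `if` inside a shared one.
    rows = [{"id": 3}, {"id": 1}, {"id": 2}, {"id": 4}]
    assert collect_ids(rows) == [1, 2, 3, 4]
```

Each test states its input and expected output literally, so a reader can check it by inspection without evaluating any condition.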


@shcheklein shcheklein closed this Oct 26, 2025
@shcheklein shcheklein reopened this Oct 26, 2025

@dreadatour dreadatour left a comment

👀

@shcheklein shcheklein merged commit efe1202 into main Oct 26, 2025
72 of 74 checks passed
@shcheklein shcheklein deleted the fix-sys-id-delta-2 branch October 26, 2025 17:12
3 participants