better fix for sys_id in delta, add tests #1413

shcheklein · 2025-10-19T20:54:44Z

Continuation of https://github.com/iterative/datachain/pull/1412/files

Since delta relies on a union, the proper fix is to set the schema in read_dataset so that it does not include sys columns.
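A toy sketch of the idea, using plain dictionaries rather than DataChain's API. The `sys__` prefix is an assumption based on the `sys__id` name mentioned in this PR, and `without_sys_columns` and `union` are hypothetical helpers. It shows why the schema should carry only user columns into a union: system ids from both sides would otherwise collide.

```python
SYS_PREFIX = "sys__"  # assumption: system columns share this prefix


def without_sys_columns(schema):
    """Return a copy of the schema with system columns dropped."""
    return {
        name: typ
        for name, typ in schema.items()
        if not name.startswith(SYS_PREFIX)
    }


def union(left_rows, right_rows, columns):
    """Concatenate rows from both sides, keeping only the given columns."""
    return [
        {col: row[col] for col in columns}
        for row in (*left_rows, *right_rows)
    ]


stored_schema = {"sys__id": int, "sys__rand": int, "path": str, "size": int}
previous = [{"sys__id": 1, "sys__rand": 7, "path": "a.txt", "size": 10}]
new = [{"sys__id": 1, "sys__rand": 3, "path": "b.txt", "size": 20}]

# Drop system columns at read time, then union on user columns only.
user_schema = without_sys_columns(stored_schema)
merged = union(previous, new, user_schema)

# In this toy model, fresh ids are assigned after the union.
result = [{"sys__id": i, **row} for i, row in enumerate(merged, start=1)]
```

Both inputs carry `sys__id == 1`, so a union that kept the system columns would produce duplicate ids. Filtering first avoids that in this simplified model.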

Summary by Sourcery

Simplify and correct delta processing by filtering out system columns at read time, removing a redundant regeneration step, propagating delta parameters in storage reads, and adding tests to validate proper behavior

Bug Fixes:

Exclude system columns from the schema during delta reads to prevent incorrect sys_id unions

Enhancements:

Remove the manual _RegenerateSystemColumnsStep hack and streamline delta logic by passing delta parameters through read_storage
Adjust read_dataset to drop system signals when delta is enabled

Tests:

Add functional tests to verify that storage-based delta replays correctly regenerate system columns and include new records

Summary by Sourcery

Improve delta handling by excluding system columns at read time, removing the hacky regeneration step, consolidating delta logic in storage reads, and expanding functional tests

Bug Fixes:

Exclude system columns from the schema when reading datasets in delta mode to fix sys_id handling

Enhancements:

Remove the _RegenerateSystemColumnsStep hack and redundant _as_delta invocation
Propagate delta parameters directly in read_storage for unified delta logic

Tests:

Update delta functional test to use the new save signature and add a storage delta replay test for system column regeneration

sourcery-ai · 2025-10-19T20:54:49Z

Reviewer's Guide

This PR streamlines delta handling by filtering out system columns at read time, removing a redundant system-column regeneration step, propagating delta parameters through storage reads, and adding functional tests to ensure correct delta replay behavior.

Sequence diagram for delta read with system column filtering

sequenceDiagram
    participant "read_dataset()"
    participant "SignalSchema"
    participant "DataChain"
    participant "read_storage()"
    "read_dataset()"->>"SignalSchema": deserialize or from_column_types
    "read_dataset()"->>"SignalSchema": clone_without_sys_signals (if delta)
    "read_dataset()"->>"DataChain": create with filtered signals_schema
    "DataChain"->>"read_storage()": propagate delta parameters

Entity relationship diagram for signals schema changes in delta reads

erDiagram
    SIGNAL_SCHEMA {
        id int
        name string
        type string
        is_system bool
    }
    DATA_CHAIN {
        id int
        signals_schema_id int
    }
    DATA_CHAIN ||--o| SIGNAL_SCHEMA : uses
    %% When delta is enabled, DATA_CHAIN uses SIGNAL_SCHEMA with is_system = false

Class diagram for delta handling and system column changes

classDiagram
    class DataChain {
        +clone()
        +_query: Query
        +signals_schema: SignalSchema
    }
    class SignalSchema {
        +deserialize()
        +from_column_types()
        +mutate()
        +clone_without_sys_signals()
    }
    class read_dataset {
        +delta: bool
        +signals_schema: SignalSchema
    }
    class read_storage {
        +delta: bool
        +delta_on
        +delta_result_on
        +delta_compare
        +delta_retry
        +delta_unsafe
    }
    DataChain --> SignalSchema
    read_dataset --> SignalSchema
    read_storage --> DataChain
    read_storage --> read_dataset

    %% Removed class
    class _RegenerateSystemColumnsStep {
        -catalog: Catalog
        -hash_inputs()
        -apply()
    }
    %% Indicate removal
    _RegenerateSystemColumnsStep --|> Step
    %% Mark as removed
    class _RegenerateSystemColumnsStep {
        <<removed>>
    }

File-Level Changes

Change	Details	Files
Removed manual system-column regeneration hack	Deleted _RegenerateSystemColumnsStep class; removed its injection in delta chain appending; cleaned up related imports	`src/datachain/delta.py`
Propagated delta parameters directly through read_storage	Extended read_storage signature to accept delta arguments; passed delta, on, result_on, compare, retry, unsafe flags into underlying reader; eliminated separate _as_delta call in storage listing	`src/datachain/lib/dc/storage.py`
Filtered out system signals in read_dataset when delta is enabled	Inserted schema.clone_without_sys_signals() under delta branch; adjusted signals_schema construction for delta reads	`src/datachain/lib/dc/datasets.py`
Enhanced functional tests for storage-based delta replay	Revised existing build_chain delta invocation to omit redundant params; added test_storage_delta_replay_regenerates_system_columns; verified system-column regeneration and new record inclusion	`tests/func/test_delta.py`
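The parameter propagation described in the table can be pictured with a minimal sketch. The two functions below are toy stand-ins that only borrow the names `read_storage` and `read_dataset` and the delta option names from the class diagram; their bodies, signatures, and the listing naming scheme are assumptions, not DataChain's actual code.

```python
# Toy stand-ins, not DataChain's real functions or signatures.
def read_dataset(name, *, delta=False, delta_on=None, delta_result_on=None,
                 delta_compare=None, delta_retry=None, delta_unsafe=False):
    """Pretend reader: records which delta options it was given."""
    return {
        "name": name,
        "delta": delta,
        "delta_on": delta_on,
        "delta_result_on": delta_result_on,
        "delta_compare": delta_compare,
        "delta_retry": delta_retry,
        "delta_unsafe": delta_unsafe,
    }


def read_storage(uri, **delta_options):
    """Pretend storage reader: forwards delta options unchanged."""
    listing_name = f"listing:{uri}"  # hypothetical naming scheme
    return read_dataset(listing_name, **delta_options)


chain = read_storage("s3://bucket/data/", delta=True, delta_on="file.path")
```

Forwarding the options in one call means there is a single place where delta behavior is configured, instead of a separate follow-up step applied to the result.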

Possibly linked issues

#issue: The PR resolves the KeyError by correctly filtering out system columns like 'sys__id' in delta processing.


codecov · 2025-10-19T21:02:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.79%. Comparing base (7258b2e) to head (9e7340b).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1413      +/-   ##
==========================================
- Coverage   87.80%   87.79%   -0.02%     
==========================================
  Files         160      160              
  Lines       15207    15192      -15     
  Branches     2178     2178              
==========================================
- Hits        13353    13338      -15     
  Misses       1350     1350              
  Partials      504      504

Flag	Coverage Δ
datachain	`87.75% <100.00%> (-0.02%)`	⬇️


Files with missing lines	Coverage Δ
src/datachain/delta.py	`97.53% <ø> (+0.65%)`	⬆️
src/datachain/lib/dc/datasets.py	`95.23% <100.00%> (+0.11%)`	⬆️
src/datachain/lib/dc/storage.py	`100.00% <ø> (ø)`

... and 1 file with indirect coverage changes


cloudflare-workers-and-pages · 2025-10-25T05:52:01Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`9e7340b`
Status:	✅ Deploy successful!
Preview URL:	https://d03748bf.datachain-documentation.pages.dev
Branch Preview URL:	https://fix-sys-id-delta-2.datachain-documentation.pages.dev


sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `tests/func/test_delta.py:338-343` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>
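One way to follow the "move the complex logic into helpers" suggestion is sketched below. The flagged code context is empty in this comment, so the example is generic rather than a rewrite of the actual test: `has_fresh_sys_ids` and the sample rows are hypothetical, and the `sys__id` key is taken from the issue description.

```python
def has_fresh_sys_ids(rows):
    """True when every row has a sys id and all sys ids are unique."""
    ids = [row.get("sys__id") for row in rows]
    return None not in ids and len(set(ids)) == len(ids)


def test_replay_has_fresh_sys_ids():
    # The test body is a single assertion; the logic lives in the helper.
    rows = [
        {"sys__id": 1, "path": "a.txt"},
        {"sys__id": 2, "path": "b.txt"},
    ]
    assert has_fresh_sys_ids(rows)


test_replay_has_fresh_sys_ids()
```

The helper can be tested on its own, and the test function stays free of branches and loops.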


dreadatour

👀

better fix for sys_id in delta, add tests

9e7340b

shcheklein force-pushed the fix-sys-id-delta-2 branch from af8762a to 9e7340b Compare October 25, 2025 05:50

shcheklein requested a review from dreadatour October 26, 2025 02:10

shcheklein marked this pull request as ready for review October 26, 2025 02:10

shcheklein requested a review from a team October 26, 2025 02:10

sourcery-ai bot reviewed Oct 26, 2025


shcheklein closed this Oct 26, 2025

shcheklein reopened this Oct 26, 2025

dreadatour approved these changes Oct 26, 2025


shcheklein merged commit efe1202 into main Oct 26, 2025
72 of 74 checks passed

shcheklein deleted the fix-sys-id-delta-2 branch October 26, 2025 17:12
