Implement Record Lineage Tracking and Finalization Callback #435
base: 0.15
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main     #435      +/-   ##
==========================================
+ Coverage   98.26%   98.30%   +0.03%
==========================================
  Files         152      152
  Lines        6171     6247      +76
==========================================
+ Hits         6064     6141      +77
+ Misses        107      106       -1

Flags with carried forward coverage won't be shown.
I'm really excited to see how all this is coming together. Just a few notes and questions.
nodestream/pipeline/step.py
Outdated
tracks_lineage: bool = False
Would there be any developer benefit to encapsulating this as a new subclass of Step? Then finalize_record could only be declared on that interface? I feel like this could avoid confusion for Step developers that don't need this finalizing behaviour, since if that boolean is false then finalize_record is never called.
class FinalizingStep(Step):
    async def finalize_record(self, callback_token: object):
        """Finalize a record.

        This method is called when a record produced by this step has been
        fully processed by all downstream steps. It is not called for records
        that are not produced by this step.
        """
        pass
Interesting idea... thought about it a bit. Right now we have class hierarchies that look like this:
graph LR
A[Step] --> B(Transformer)
B --> C[MyAwesomeTransformer]
Let's assume that we want to add finalization to our MyAwesomeTransformer by inheriting from FinalizingStep. We'd need to have a class hierarchy like this:
graph LR
A[Step] --> B(Transformer)
B --> C[MyAwesomeTransformer]
D[FinalizingStep] --> C
A --> D
This creates a... confusing class hierarchy and can lead to weird MRO issues.
Then imagine we have an ApronSpringsStep that gets notified every time we operate on a given record.
graph LR
A[Step] --> B(Transformer)
B --> C[MyAwesomeTransformer]
D[FinalizingStep] --> C
A --> D
E[ApronSpringsStep] --> C
A --> E
This violates my personal rule for relatively shallow, flat class hierarchies. The more cases we add to this example, the more it feels that it's really the same case, with the implementer of Step choosing to do something or not depending on the case.
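A minimal Python sketch of the shape being described, using the class names from the diagrams above (the multiple-inheritance composition is hypothetical):

class Step: ...
class Transformer(Step): ...
class FinalizingStep(Step): ...
class ApronSpringsStep(Step): ...


# MyAwesomeTransformer now reaches Step through three separate paths, and its
# behaviour depends on how Python linearizes the method resolution order.
class MyAwesomeTransformer(Transformer, FinalizingStep, ApronSpringsStep): ...


print(MyAwesomeTransformer.__mro__)
# resolves to: MyAwesomeTransformer, Transformer, FinalizingStep,
#              ApronSpringsStep, Step, object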
Yep, that's a really good point. As I look at this, it feels like Finalizing is more of a Protocol than a subclass. Would that feel any better?
It's a bit of an abuse because utilizing the protocol changes the framework's treatment of Step outputs, so maybe it's still not a great idea. I think I'm just trying to address the bad feeling of a boolean behavior flag and a method that is unimportant to most use cases. I leave it to your judgement on how you want that interface and experience to work.
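For reference, a rough sketch of what a Protocol version could look like (the Finalizing name and the wants_finalization helper are hypothetical, not part of the PR):

from typing import Protocol, runtime_checkable


@runtime_checkable
class Finalizing(Protocol):
    async def finalize_record(self, callback_token: object) -> None:
        """Called once a record produced by this step has been fully processed."""
        ...


# The framework could then branch on the protocol rather than a boolean flag.
def wants_finalization(step: object) -> bool:
    return isinstance(step, Finalizing)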
I think I've come down to: there isn't really a need to distinguish a protocol or subclass to avoid the bools. Not having it in this case is the same as doing nothing. For Step there is always a reasonable default implementation that we can rely on; in this case it's just pass.
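In other words, roughly this default on Step itself (a sketch of the idea rather than the PR's exact code):

class Step:
    async def finalize_record(self, callback_token: object) -> None:
        # Default is a harmless no-op; only steps that care about lineage override it.
        pass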
@dataclass(slots=True)
class Record:
    """A `Record` is a unit of data that is processed by a pipeline."""
This is only a framework-facing class, so maybe not critical, but I wonder if naming this Record will cause some confusion. As I'm reading through the code, I'm realizing that throughout the system record is used to refer to the input and output of Steps. But now we have a new Record class where the object that is often called a record is the data property of this class. Could get confusing for folks not deeply entrenched in the framework?
That's a good point... I'll workshop a better name.
I've gone with RecordContext
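For illustration, the renamed class might look something like this, with fields inferred from the tests later in this review (not necessarily the exact definition in the PR):

from dataclasses import dataclass
from typing import Any


@dataclass(slots=True)
class RecordContext:
    """Wraps the record (`data`) flowing through the pipeline with its lineage metadata."""

    data: Any
    originating_step: Any  # the Step that emitted this record
    callback_token: Any = None  # handed back to the originating step on finalization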
nodestream/pipeline/pipeline.py
Outdated
data = callback_token = emission
if isinstance(emission, tuple) and step.tracks_lineage:
    data, callback_token = emission
Would it be beneficial for callback_token to be None when step.tracks_lineage is False? That would make it explicitly clear that no token was actually communicated out from the step.
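i.e., something along these lines, as a drop-in variant of the snippet above (a sketch of the suggestion, not the code in the PR):

data, callback_token = emission, None
if isinstance(emission, tuple) and step.tracks_lineage:
    data, callback_token = emission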
My pendulum has swung the other way on this. I think always calling it and not having a flag is the most predictable pattern.
Make sure to update docs and advertise the breaking change that any tuple returned from a step will have the last element stripped off.
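For the docs, a tiny illustration of the behaviour change might help (illustrative only, assuming the 2-tuple convention described in this thread):

my_token = object()  # whatever the step wants handed back on finalization

# A step that emits a 2-tuple...
emission = ({"name": "ada"}, my_token)

# ...previously sent the whole tuple downstream. With this change, the pipeline
# unpacks it: {"name": "ada"} flows to the next step, and my_token is held back
# and later passed to the originating step's finalize_record.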
    self.step_outbox_size, current_output_name, current_input_name
)
pipeline_output = PipelineOutput(current_input, reporter)
executors.append(Executor.pipeline_output(current_input, reporter))
I might be misremembering, but I thought we needed the reporter to be first in the list in order to fix the race?
I think it is? On line 484 we create a blank list, and line 497 is the first place we append to it, so it will have position 0. Do you think it's better if we create it there to be a touch more clear about that? Something like:
executors = [Executor.pipeline_output(current_input, reporter)]
Does that seem more appropriate?
I see. I was completely misinterpreting this section. I think I'm not fully grokking all of the Executor abstractions and channel management, and that was clouding my read of it.
All good... this PR is doing a lot.
tests/unit/pipeline/test_record.py
Outdated
async def test_record_from_step_emission_tuple_data():
    """Test Record.from_step_emission with tuple (data, token)."""
    step = Mock(spec=Step)
    data = {"test": "data"}
    token = "callback_token"

    record = Record.from_step_emission(step, (data, token))

    assert record.data == data
    assert record.callback_token == token
    assert record.originating_step == step
How does this test pass if tracks_lineage isn't set on the Step? This seems like an issue with a number of tests in this file. Maybe Mock defaults bools to True? If so, I feel like it would be clearer to be explicit about that value.
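For example, being explicit might look like this (import path assumed from the PR's diff); note that an attribute auto-created on a Mock is itself a Mock, which is truthy, which is likely why these tests pass today:

from unittest.mock import Mock

from nodestream.pipeline.step import Step  # import path assumed

step = Mock(spec=Step)
step.tracks_lineage = True  # explicit, instead of relying on a truthy child Mock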
tests/unit/pipeline/test_step.py
Outdated
    # Should not raise any exceptions and should do nothing


@pytest.mark.asyncio
If I'm not mistaken, I think this test just tests that you can call a function. The mock_finalize is defined, attached to the step object, and then called. I'm not sure this is testing anything.
Probably, the tests are pretty bad. I'll go through them.
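For what it's worth, a version that exercises the real default rather than a mock might look like this (a sketch, assuming Step is directly instantiable and finalize_record defaults to a no-op; import path assumed):

import pytest

from nodestream.pipeline.step import Step  # import path assumed


@pytest.mark.asyncio
async def test_default_finalize_record_is_a_noop():
    step = Step()
    # Should complete without raising and return nothing.
    assert await step.finalize_record(object()) is None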
@pytest.mark.asyncio
async def test_finalize_record_async_behavior():
Is this actually testing that Record.drop() calls the finalize with await?
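If the goal is to prove the await actually happens, AsyncMock can assert it directly (a sketch; the RecordContext import path, constructor, and drop signature are assumptions based on this thread):

from unittest.mock import AsyncMock, Mock

import pytest

from nodestream.pipeline.record import RecordContext  # import path assumed


@pytest.mark.asyncio
async def test_drop_awaits_finalize_record():
    step = Mock()
    step.finalize_record = AsyncMock()
    token = object()

    record = RecordContext(data={"x": 1}, originating_step=step, callback_token=token)
    await record.drop()

    step.finalize_record.assert_awaited_once_with(token)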
I think due to the complexity and nuance in this change, we are due for merging this into a new breaking release. I see this as a good opportunity to make some other small changes we need to make, à la #407.
This PR focuses on refactoring the pipeline code in order to better accommodate and implement a record lineage tracking system. The goal is to enable steps, often extractors, to free up resources associated with a yielded record and to acknowledge back to systems that a record was truly and completely processed.
Refactor
The current pipeline code is fragile and hard to introduce change into because it is implemented as a series of procedures and is not well encapsulated. To support this feature, we need to make some cross-cutting changes to refactor the pipeline code.
The refactor builds on the observation that the previous code was written as a procedure so that transitions from one step state to another could be handled carefully. To remove those procedures, the pipeline was refactored to run a state machine for each step in the pipeline, as well as for the pipeline output.
Steps transition through a state flow that looks like this:

StartStepState → ProcessRecordsState → EmitOutstandingRecordsState → StopStepExecution

and the pipeline output progresses via a state flow like this:

PipelineOutputStartState → PipelineOutputProcessRecordsState → PipelineOutputStopState
This means that both steps and the pipeline output use the same executor pattern, simplifying the overall architecture.
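A rough sketch of that executor/state pattern (class and method names here are illustrative, not the PR's actual API):

from typing import Optional


class State:
    """One lifecycle state; run() does its work and returns the next state, or None when finished."""

    async def run(self) -> Optional["State"]:
        raise NotImplementedError


class StateMachineExecutor:
    """Drives a state machine from an initial state until it terminates."""

    def __init__(self, initial_state: State) -> None:
        self._state: Optional[State] = initial_state

    async def run(self) -> None:
        while self._state is not None:
            self._state = await self._state.run()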
Lineage Tracking
The highlight of this PR is record lineage tracking. This, in short, builds a tree of the intermediary and final records produced from every single output record. In other words, we track parent-child relationships for every record in and out of every step. With this information, we are able to know when nodestream is 'done' processing a record and trigger a callback to the originating step when it is appropriate. All steps can use this following the same pattern; as an example, an extractor is shown below.
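A minimal sketch of that extractor pattern (the QueueExtractor class, its client, and the import path are hypothetical; the extract_records and finalize_record method names are assumed from the framework's extractor interface and this PR's discussion):

from nodestream.pipeline.extractors import Extractor  # import path assumed


class QueueExtractor(Extractor):
    """Hypothetical extractor that acknowledges messages only after full processing."""

    def __init__(self, client):
        self.client = client  # e.g. a message-queue client with ack semantics

    async def extract_records(self):
        async for message in self.client.poll():
            # Yield the record together with a callback token; the pipeline hands the
            # token back once every downstream step has fully processed the record.
            yield message.body, message.delivery_tag

    async def finalize_record(self, callback_token):
        # Invoked by the pipeline when the record's entire lineage is complete;
        # acknowledge it back to the source system.
        await self.client.ack(callback_token)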
Bug Fixes
As miscellany, this PR also fixes two minor issues with exception handling in the pipeline logic:
- Before the on_start callback is triggered, the CLI may not have started the progress spinner that reports error messages. This leads to a separate error on a crash that obscures the original error. The fix is to ensure that on_start is explicitly executed before actual pipeline processing begins.
- The pipeline's finalization callback (on_finish) had a try/catch block around it. That try/catch was not required because, by the time the pipeline executes this code, there is nothing else to do, so there is no reason to swallow the exception in order to protect the integrity of the pipeline. The fix was to simply remove that try/catch.

Due to this behavior, we'll likely need to make a 0.15 release.