Implement Record Lineage Tracking and Finalization Callback #435

zprobst · 2025-08-26T21:40:22Z

This PR introduces the concept of record lineage tracking. The core value proposition is to expose another hook on Step so that it can be called back once a record it emitted has been fully processed.

class SomeStepThatIsProbablyAnExtractor:
    def process_record(self, record):
        results = self.actually_process_record(record)
        # Emit each result together with its index as the callback token.
        for i, result in enumerate(results):
            yield result, i

    def finalize_record(self, token: int):
        # Do something with the token you emitted alongside the record.
        pass

NOTE: This is currently a draft for comment purposes. Tests will not work yet, I still need to add tests to cover this case, and (I think) there are a few edge cases to work out, but this is the crux of the solution.

cbadke · 2025-08-27T15:27:01Z

nodestream/pipeline/pipeline.py

+        The `emission` can either be a single value or a tuple of two values.
+        If it is a single value, then it is assumed to be the data for the
+        record. If it is a tuple of two values, then the first value is
+        assumed to be the data for the record and the second value is assumed
+        to be the callback token for the record. If any other value is
+        provided, the data and callback token are both set to the value
+        provided.


This is how I figured it would work. The documentation will have to state clearly that, after this change, a tuple is not an acceptable record type unless the step also provides some kind of callback token. That is to say, tuple record types will only work if the emission type is tuple[tuple, Any].
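To make the rule in the docstring above concrete, here is a minimal sketch of the normalization it describes. The helper name `split_emission` is hypothetical and is not taken from the PR; it only mirrors the documented behaviour, including the case where a two-element tuple meant as a record is read as data plus token.

```python
from typing import Any, Tuple


def split_emission(emission: Any) -> Tuple[Any, Any]:
    """Return (data, callback_token) following the docstring's rules.

    Hypothetical helper for illustration only.
    """
    # A tuple of exactly two values is read as (data, callback_token).
    if isinstance(emission, tuple) and len(emission) == 2:
        data, callback_token = emission
        return data, callback_token
    # Any other value (including tuples of other lengths) is used for both.
    return emission, emission


# A plain record: data and token are the same value.
plain = split_emission({"id": 1})
# A record with an explicit token.
with_token = split_emission(({"id": 1}, 42))
# The ambiguity: a 2-tuple intended as a record is split apart.
misread = split_emission(("a", "b"))
# Wrapping it as (record, token) keeps the tuple record intact.
wrapped = split_emission((("a", "b"), None))
```

Under these assumptions, a step whose records are two-element tuples would have to emit `(record, token)` explicitly to keep working.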

Probably worth cutting a breaking change here.

cbadke · 2025-08-27T15:30:58Z

nodestream/pipeline/pipeline.py

+        # If we are being told to drop, then we need to run our callback so
+        # that the step that created us can clean up any resources it has
+        # allocated for this record.
+        await self.originating_step.finalize_record(self.callback_token)


Should the call to finalize_record be skipped if data == callback_token? From lines 58-60, I figured that was the intent. Or is finalize() always called, with the callback_token just being a way for the step author to have the callback invoked with different data than the original record?

Yeah, I think the code is showing that I am of two minds here.

At first, I was thinking finalize_record would only be called when you provide a token, but there is some chance you'd want to operate on the record itself when handling finalize_record. One may argue that it's speculative generality, but it actually (assuming the comments get cleaned up) produces cleaner code to not treat passing a token as a special case and instead always call finalize_record, with a default implementation of pass.

Do you have a feeling one way or another? I think for most cases, it is functionally immaterial.
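For illustration, the "always call it" option could look roughly like the sketch below. `Step` here is a stand-in class, and `TrackingStep` and `drop` are hypothetical names, not the PR's actual code; the only point is that a default no-op removes the special case.

```python
import asyncio
from typing import Any, List


class Step:
    """Stand-in base class; only the hook under discussion is shown."""

    async def finalize_record(self, token: Any) -> None:
        # Default implementation does nothing, so callers can always invoke it.
        pass


class TrackingStep(Step):
    """Hypothetical step that remembers which tokens were finalized."""

    def __init__(self) -> None:
        self.finalized: List[Any] = []

    async def finalize_record(self, token: Any) -> None:
        self.finalized.append(token)


async def drop(step: Step, callback_token: Any) -> None:
    # No branching: the hook runs whether or not a token was supplied.
    await step.finalize_record(callback_token)


async def main() -> List[Any]:
    plain, tracking = Step(), TrackingStep()
    await drop(plain, {"id": 1})  # hits the default no-op
    await drop(tracking, 7)       # hits the overridden hook
    return tracking.finalized


result = asyncio.run(main())
```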

I can see the argument and temptation to always call finalize_record but to change what is passed.

The concern I have with calling finalize in either case is the inconsistency and element of surprise. If I return one thing from process_record I get one behaviour, if I return a magic tuple, I get a different behaviour. This is technically true in either scenario...

My feeling is that there should generally be one way for things to work. If you only call finalize when they return a tuple with a callback object, the system is explicitly forcing the developer to say "let me know when this is finished processing". If they want the original message, they can return (record, record).
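A rough sketch of that opt-in behaviour, under the assumption that a sentinel marks "no callback requested". `Record`, `wrap`, `RecordingStep`, and `_NO_TOKEN` are hypothetical names chosen for this example, not the PR's implementation.

```python
import asyncio
from typing import Any, List

# Sentinel meaning "the step did not ask for a callback".
# Using a sentinel (rather than None) lets None itself be a valid token.
_NO_TOKEN = object()


class RecordingStep:
    """Hypothetical step that remembers the tokens it was called back with."""

    def __init__(self) -> None:
        self.seen: List[Any] = []

    async def finalize_record(self, token: Any) -> None:
        self.seen.append(token)


class Record:
    def __init__(self, data: Any, step: RecordingStep, token: Any = _NO_TOKEN):
        self.data = data
        self.step = step
        self.token = token

    async def drop(self) -> None:
        # Only steps that explicitly supplied a token are notified.
        if self.token is not _NO_TOKEN:
            await self.step.finalize_record(self.token)


def wrap(emission: Any, step: RecordingStep) -> Record:
    # A 2-tuple is the explicit "let me know when this is finished" request.
    if isinstance(emission, tuple) and len(emission) == 2:
        data, token = emission
        return Record(data, step, token)
    return Record(emission, step)


async def main() -> List[Any]:
    step = RecordingStep()
    record = {"id": 1}
    await wrap(record, step).drop()            # no callback requested
    await wrap((record, record), step).drop()  # callback receives the record
    await wrap((record, "tok"), step).drop()   # callback receives the token
    return step.seen


seen = asyncio.run(main())
```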

Maybe it should even be a flaggable feature but that would complicate matters I imagine.

No matter which way you choose, the tuple vs not tuple behaviour is going to confuse someone at some point. :-/

I agree with your thought process here. I've solicited @jbristow to get another perspective. I definitely see this both ways.

Thinking about this as if I were writing Haskell or F# code, it feels like there should be a "current-context" struct passed from step to step that contains the result, rather than a bare result.

That way you could bind things to the context that would be called by specific lifecycle events. Maybe make some Global hooks like "after-successful-ingest" or whatever.

I apologize for bringing up monad-adjacent thoughts while discussing Python code.

What's the difference between passing:

Context { data: dict[str, Any] }
FinalizingContext(Context) { finalize_fn: Callable[context, None] }

and what we do now other than one level of wrapping?

I mean, we could probably make it a bit more elegant by modeling it after the old state-hook approach from Javaland:

Have a dict[NodestreamState, list[HookFn]], so that a state handler can pick up the context object and say "I AM STATE X! EXECUTE ANY HOOKS YOU HAVE FOR ME".
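As a sketch of what that hook registry might look like in Python: the thread only names `NodestreamState` and `HookFn`, so the specific states, the `on` and `fire` methods, and the `Context` layout below are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Callable, Dict, List, Tuple


class NodestreamState(Enum):
    # Hypothetical lifecycle states; the real set is not specified in the thread.
    EXTRACTED = auto()
    INGESTED = auto()
    DROPPED = auto()


# A hook receives the context it was registered on.
HookFn = Callable[["Context"], None]


@dataclass
class Context:
    data: Dict[str, Any]
    hooks: Dict[NodestreamState, List[HookFn]] = field(default_factory=dict)

    def on(self, state: NodestreamState, hook: HookFn) -> None:
        # Bind a hook to a lifecycle state.
        self.hooks.setdefault(state, []).append(hook)

    def fire(self, state: NodestreamState) -> None:
        # "I am state X, execute any hooks you have for me."
        for hook in self.hooks.get(state, []):
            hook(self)


log: List[Tuple[str, Any]] = []
ctx = Context(data={"id": 1})
ctx.on(NodestreamState.INGESTED, lambda c: log.append(("ingested", c.data["id"])))
ctx.fire(NodestreamState.EXTRACTED)  # nothing registered for this state
ctx.fire(NodestreamState.INGESTED)   # runs the registered hook
```

Compared with a single finalize callback, this trades one level of wrapping for the ability to attach behaviour to several lifecycle events.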

That really got me thinking... a potential compromise is if we force extractors to look like this:

class SomeExtractor(Extractor):
    async def extract_records(self):
        for i in range(1000):
            yield self.record(data=i)  # if you do not want callbacks
            yield self.managed_record(data=i, callback_token=i)  # if you do want callbacks

Then we have some polymorphism on the Record type. Thoughts?
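One way the polymorphic Record idea could be fleshed out is sketched below. Only `record`, `managed_record`, and `finalize_record` come from the thread; the `Record` and `ManagedRecord` classes, the `finalize` method, and the stand-in `Extractor` base are assumptions for this example.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Record:
    data: Any

    async def finalize(self) -> None:
        # Plain records have nobody to notify.
        pass


@dataclass
class ManagedRecord(Record):
    callback_token: Any
    owner: "Extractor"

    async def finalize(self) -> None:
        # Managed records call back into the step that created them.
        await self.owner.finalize_record(self.callback_token)


class Extractor:
    """Stand-in base class providing the two factory methods."""

    def record(self, data: Any) -> Record:
        return Record(data)

    def managed_record(self, data: Any, callback_token: Any) -> ManagedRecord:
        return ManagedRecord(data, callback_token, self)

    async def finalize_record(self, token: Any) -> None:
        pass


class SomeExtractor(Extractor):
    def __init__(self) -> None:
        self.finalized: List[Any] = []

    async def extract_records(self):
        for i in range(3):
            yield self.record(data=i)
            yield self.managed_record(data=i, callback_token=i)

    async def finalize_record(self, token: Any) -> None:
        self.finalized.append(token)


async def main() -> List[Any]:
    extractor = SomeExtractor()
    async for record in extractor.extract_records():
        await record.finalize()  # the pipeline needs no tuple inspection
    return extractor.finalized


finalized = asyncio.run(main())
```

With an explicit wrapper like this, tuple-valued data stays unambiguous, since `Record(data=("a", "b"))` is never mistaken for a data and token pair.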

I had a similar thought early on but wasn't sure on the appetite to change the interface. I think it could work.

I'm not sure we should be singularly focused on Extractors for this feature, there might be cases for other constructs (like Transforms?) that could want to access this mechanism once it's available.

It could feel strange to have components return this record object but receive just the data element from the upstream step. Should all steps receive this Record wrapper instead? I would guess the interface would use the base class.

cbadke · 2025-08-27T15:53:29Z

nodestream/pipeline/pipeline.py


-        self.call_handling_errors(self.reporter.on_finish_callback, metrics)
+class Exectutor:


should be Executor 😄

codecov · 2025-08-28T14:43:08Z

Codecov Report

❌ Patch coverage is 97.05882% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.17%. Comparing base (466466a) to head (a4f0925).

Files with missing lines	Patch %	Lines
nodestream/pipeline/pipeline.py	97.01%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #435      +/-   ##
==========================================
- Coverage   98.26%   98.17%   -0.10%     
==========================================
  Files         152      152              
  Lines        6171     6248      +77     
==========================================
+ Hits         6064     6134      +70     
- Misses        107      114       +7

Flag	Coverage Δ
3.10-macos-latest	`?`
3.10-ubuntu-latest	`?`
3.10-windows-latest	`98.15% <97.05%> (-0.08%)`	⬇️
3.11-macos-latest	`?`
3.11-ubuntu-latest	`?`
3.11-windows-latest	`98.15% <97.05%> (-0.08%)`	⬇️
3.12-macos-latest	`?`
3.12-ubuntu-latest	`?`
3.12-windows-latest	`98.15% <97.05%> (-0.08%)`	⬇️
3.13-macos-latest	`?`
3.13-ubuntu-latest	`?`
3.13-windows-latest	`98.15% <97.05%> (-0.08%)`	⬇️

Flags with carried forward coverage won't be shown.


zprobst added 3 commits August 26, 2025 17:13

Implment Record Linage Tracking and Finalization Callback

d20a212

clarify comment language in Record

3dcd450

Keep downstream pipeline failure ripple effect

e82b14a

zprobst requested a review from ccloes as a code owner August 26, 2025 21:40

zprobst marked this pull request as draft August 26, 2025 21:42

zprobst added 2 commits August 27, 2025 10:02

Straight Refactor of Pipeline code

874d00b

clean up docs and boolean signals with a nice type

f15ee05

cbadke reviewed Aug 27, 2025

Introduce tests and fix up existing ones

a4f0925


zprobst commented Aug 26, 2025

cbadke Aug 27, 2025

zprobst Aug 27, 2025

zprobst Aug 27, 2025

cbadke Aug 27, 2025

zprobst Aug 27, 2025 • edited

cbadke Aug 27, 2025 • edited

cbadke Aug 27, 2025

zprobst Aug 28, 2025

jbristow Aug 28, 2025

jbristow Aug 28, 2025 • edited

jbristow Aug 28, 2025

zprobst Aug 29, 2025

cbadke Aug 29, 2025

cbadke Aug 27, 2025

codecov bot commented Aug 28, 2025 • edited

