Implement Record Lineage Tracking and Finalization Callback #435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open: zprobst wants to merge 23 commits into 0.15 from feature/record-linage-tracking.
Commits (23, all by zprobst):
d20a212  Implment Record Linage Tracking and Finalization Callback
3dcd450  clarify comment language in Record
e82b14a  Keep downstream pipeline failure ripple effect
874d00b  Straight Refactor of Pipeline code
f15ee05  clean up docs and boolean signals with a nice type
a4f0925  Introduce tests and fix up existing ones
0005062  Tiny cleanup
699b7d0  fix typo
d80cfe9  add tests
0968fea  Fix Status code Error handling
aae9098  Merge branch 'fix/status-code-an-error-handling' into feature/record-…
912d345  isort test files
c417562  Adds a feature flag for the tuple magic escpae hatch
45c6a99  Fix up lint errors
71ccebe  correctly capture step emit failure case
91d61fe  Fix tests to opt into needed feature flag
e79c5a5  Fix tests to opt into needed feature flag
51f550b  Merge branch 'main' into feature/record-linage-tracking
0dcb02c  Revert back to implementing being the way we care about tracking
be6f8c9  Change Record to RecordContext
deb0739  Clean up tests
27c8f4e  Trim useless tests
f3a3aa5  one more useless test
@@ -0,0 +1,315 @@

from unittest.mock import Mock

import pytest

from nodestream.pipeline.extractors import Extractor
from nodestream.pipeline.object_storage import ObjectStore
from nodestream.pipeline.pipeline import Pipeline
from nodestream.pipeline.progress_reporter import PipelineProgressReporter
from nodestream.pipeline.transformers import Transformer
from nodestream.pipeline.writers import Writer


class ResourceTrackingExtractor(Extractor):
    """Extractor that tracks resource allocation and cleanup."""

    tracks_lineage: bool = True

    def __init__(self, data_items):
        self.data_items = data_items
        self.allocated_resources = {}
        self.finalized_tokens = []

    async def extract_records(self):
        for i, item in enumerate(self.data_items):
            # Simulate resource allocation
            token = f"extractor_resource_{i}"
            self.allocated_resources[token] = f"resource_for_{item}"
            yield (item, token)  # Emit tuple with callback token

    async def finalize_record(self, record_token):
        """Clean up resources allocated for this record."""
        if record_token in self.allocated_resources:
            del self.allocated_resources[record_token]
            self.finalized_tokens.append(record_token)


class ResourceTrackingTransformer(Transformer):
    """Transformer that tracks resource allocation and cleanup."""

    tracks_lineage: bool = True

    def __init__(self):
        self.allocated_resources = {}
        self.finalized_tokens = []
        self.processed_records = []

    async def process_record(self, record, context):
        # Track the record we processed
        self.processed_records.append(record)

        # Simulate resource allocation for transformation
        token = f"transformer_resource_{id(record)}"
        self.allocated_resources[token] = f"transform_resource_for_{record}"

        # Transform the record
        transformed = f"transformed_{record}"
        yield (transformed, token)  # Emit with callback token

    async def finalize_record(self, record_token):
        """Clean up transformation resources."""
        if record_token in self.allocated_resources:
            del self.allocated_resources[record_token]
            self.finalized_tokens.append(record_token)


class ResourceTrackingWriter(Writer):
    """Writer that tracks resource allocation and cleanup."""

    tracks_lineage: bool = True

    def __init__(self):
        self.allocated_resources = {}
        self.finalized_tokens = []
        self.written_records = []

    async def write_record(self, record):
        # Track what we wrote
        self.written_records.append(record)

        # Simulate resource allocation for writing
        token = f"writer_resource_{id(record)}"
        self.allocated_resources[token] = f"write_resource_for_{record}"
        return token  # Return token for cleanup

    async def process_record(self, record, context):
        # Write the record and get cleanup token
        token = await self.write_record(record)
        yield (record, token)  # Pass through with cleanup token

    async def finalize_record(self, record_token):
        """Clean up writing resources."""
        if record_token in self.allocated_resources:
            del self.allocated_resources[record_token]
            self.finalized_tokens.append(record_token)


@pytest.mark.asyncio
async def test_end_to_end_cleanup_flow():
    """Test complete cleanup flow through extractor -> transformer -> writer."""
    # Create steps with resource tracking
    extractor = ResourceTrackingExtractor(["item1", "item2", "item3"])
    transformer = ResourceTrackingTransformer()
    writer = ResourceTrackingWriter()

    # Create pipeline
    steps = (extractor, transformer, writer)
    object_store = Mock(spec=ObjectStore)
    object_store.namespaced = Mock(return_value=Mock())

    pipeline = Pipeline(steps, step_outbox_size=10, object_store=object_store)

    # Create progress reporter
    reporter = PipelineProgressReporter.for_testing([])

    # Run pipeline
    await pipeline.run(reporter)

    # Verify all records were processed
    assert len(transformer.processed_records) == 3
    assert len(writer.written_records) == 3

    # Verify finalize_record was called for writer (final step)
    # Note: In multi-step pipelines, only the final step gets cleanup calls
    # because intermediate records are transformed, not dropped
    assert len(writer.finalized_tokens) == 3

    # Writer resources should be cleaned up
    assert len(writer.allocated_resources) == 0


@pytest.mark.asyncio
async def test_cleanup_flow_with_filtering():
    """Test cleanup flow when some records are filtered out."""

    class FilteringTransformer(Transformer):
        tracks_lineage: bool = True

        def __init__(self):
            self.allocated_resources = {}
            self.finalized_tokens = []

        async def process_record(self, record, context):
            # Allocate resource for processing
            token = f"filter_resource_{id(record)}"
            self.allocated_resources[token] = f"resource_for_{record}"

            # Only pass through records containing "keep"
            if "keep" in str(record):
                yield (f"filtered_{record}", token)
            # If we don't yield, the record will be dropped and finalized

        async def finalize_record(self, record_token):
            if record_token in self.allocated_resources:
                del self.allocated_resources[record_token]
                self.finalized_tokens.append(record_token)

    # Create steps
    extractor = ResourceTrackingExtractor(["keep1", "drop1", "keep2", "drop2"])
    filter_transformer = FilteringTransformer()
    writer = ResourceTrackingWriter()

    steps = (extractor, filter_transformer, writer)
    object_store = Mock(spec=ObjectStore)
    object_store.namespaced = Mock(return_value=Mock())

    pipeline = Pipeline(steps, step_outbox_size=10, object_store=object_store)
    reporter = PipelineProgressReporter.for_testing([])

    await pipeline.run(reporter)

    # Verify only "keep" records made it to writer
    assert len(writer.written_records) == 2
    assert all("keep" in str(record) for record in writer.written_records)

    # Verify writer resources were cleaned up
    assert len(writer.allocated_resources) == 0

    # Verify finalize_record was called for writer (final step)
    assert len(writer.finalized_tokens) == 2  # Only 2 kept records


@pytest.mark.asyncio
async def test_cleanup_flow_with_record_multiplication():
    """Test cleanup flow when one record generates multiple records."""

    class MultiplyingTransformer(Transformer):
        tracks_lineage: bool = True

        def __init__(self):
            self.allocated_resources = {}
            self.finalized_tokens = []

        async def process_record(self, record, context):
            # Allocate resource for processing
            token = f"multiply_resource_{id(record)}"
            self.allocated_resources[token] = f"resource_for_{record}"

            # Generate multiple records from one input
            for i in range(3):
                yield (f"{record}_copy_{i}", token)

        async def finalize_record(self, record_token):
            if record_token in self.allocated_resources:
                del self.allocated_resources[record_token]
                self.finalized_tokens.append(record_token)

    # Create steps
    extractor = ResourceTrackingExtractor(["item1", "item2"])
    multiplier = MultiplyingTransformer()
    writer = ResourceTrackingWriter()

    steps = (extractor, multiplier, writer)
    object_store = Mock(spec=ObjectStore)
    object_store.namespaced = Mock(return_value=Mock())

    pipeline = Pipeline(steps, step_outbox_size=10, object_store=object_store)
    reporter = PipelineProgressReporter.for_testing([])

    await pipeline.run(reporter)

    # Verify multiplication worked
    assert len(writer.written_records) == 6  # 2 inputs * 3 copies each

    # Verify writer resources were cleaned up
    assert len(writer.allocated_resources) == 0

    # Verify finalize_record calls for writer (final step)
    assert len(writer.finalized_tokens) == 6  # 6 output records


@pytest.mark.asyncio
async def test_cleanup_flow_with_exception():
    """Test cleanup flow when an exception occurs during processing."""

    class FailingTransformer(Transformer):
        tracks_lineage: bool = True

        def __init__(self):
            self.allocated_resources = {}
            self.finalized_tokens = []

        async def process_record(self, record, context):
            # Allocate resource
            token = f"failing_resource_{id(record)}"
            self.allocated_resources[token] = f"resource_for_{record}"

            if "fail" in str(record):
                raise ValueError(f"Processing failed for {record}")

            yield (f"processed_{record}", token)

        async def finalize_record(self, record_token):
            if record_token in self.allocated_resources:
                del self.allocated_resources[record_token]
                self.finalized_tokens.append(record_token)

    # Create steps
    extractor = ResourceTrackingExtractor(["good1", "fail1", "good2"])
    failing_transformer = FailingTransformer()
    writer = ResourceTrackingWriter()

    steps = (extractor, failing_transformer, writer)
    object_store = Mock(spec=ObjectStore)
    object_store.namespaced = Mock(return_value=Mock())

    pipeline = Pipeline(steps, step_outbox_size=10, object_store=object_store)

    # Use a reporter that doesn't raise on fatal errors for this test
    reporter = PipelineProgressReporter(
        on_fatal_error_callback=lambda ex: None  # Don't raise
    )

    await pipeline.run(reporter)

    # The pipeline should handle the exception and stop processing.
    # Writer should have processed at least one successful record before failure.
    assert len(writer.written_records) >= 1  # At least one successful record


@pytest.mark.asyncio
async def test_cleanup_flow_performance():
    """Test cleanup flow performance with many records."""
    # Create a large number of records to test performance
    large_dataset = [f"item_{i}" for i in range(100)]

    extractor = ResourceTrackingExtractor(large_dataset)
    transformer = ResourceTrackingTransformer()
    writer = ResourceTrackingWriter()

    steps = (extractor, transformer, writer)
    object_store = Mock(spec=ObjectStore)
    object_store.namespaced = Mock(return_value=Mock())

    pipeline = Pipeline(steps, step_outbox_size=10, object_store=object_store)
    reporter = PipelineProgressReporter.for_testing([])

    # Measure execution time
    import time

    start_time = time.time()

    await pipeline.run(reporter)

    end_time = time.time()
    execution_time = end_time - start_time

    # Verify all records were processed and cleaned up
    assert len(writer.written_records) == 100
    assert len(writer.allocated_resources) == 0

    # Verify cleanup calls were made for writer (final step)
    assert len(writer.finalized_tokens) == 100

    # Performance should be reasonable (adjust threshold as needed)
    assert execution_time < 5.0  # Should complete within 5 seconds
Conversations
Would there be any developer benefit to encapsulating this as a new subclass of Step? Then finalize_record could only be declared on that interface. I feel like this could avoid confusion for Step developers that don't need this finalizing behaviour, since if that boolean is false then finalize_record is never called.
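A minimal sketch of the shape this suggestion seems to describe (FinalizingStep, and the idea that finalize_record lives only on it, are hypothetical here, not nodestream API):

from nodestream.pipeline import Step  # assumed import location for the base class


class FinalizingStep(Step):
    """Hypothetical opt-in base: finalize_record is declared only here,
    so Step implementers that don't need finalization never see it."""

    async def finalize_record(self, record_token):
        raise NotImplementedError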
Interesting idea... I thought about it a bit. Right now we have class hierarchies that look like this (diagram not shown). Let's assume that we want to add finalization to our MyAwesomeTransformer by inheriting from FinalizingStep. We'd need to have a class hierarchy like this (diagram not shown). This creates a confusing class hierarchy and can lead to weird MRO issues.

Then imagine we have an ApronSpringsStep that gets notified every time we operate on a given record. This violates my personal rule of keeping class hierarchies relatively shallow and flat. The more cases we add to this example, the more it feels like it's really the same case, with the implementer of Step choosing to do something or not depending on the circumstances.
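A hypothetical reconstruction of the hierarchy problem described above (the original diagrams did not survive; Step and Transformer are stand-ins for the real nodestream classes):

class Step:  # stand-in for the base step interface
    pass


class FinalizingStep(Step):  # the proposed opt-in capability class
    async def finalize_record(self, record_token):
        raise NotImplementedError


class Transformer(Step):  # stand-in for an existing role class
    pass


# Opting a transformer into finalization forces a diamond over Step:
class MyAwesomeTransformer(Transformer, FinalizingStep):
    async def finalize_record(self, record_token):
        pass


# Each extra capability (say, an ApronSpringsStep-style notification hook)
# multiplies the mix-in combinations and deepens the resolution order:
print([c.__name__ for c in MyAwesomeTransformer.__mro__])
# -> ['MyAwesomeTransformer', 'Transformer', 'FinalizingStep', 'Step', 'object']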
Yep, that's a really good point. As I look at this, it feels like Finalizing is more of a Protocol than a subclass. Would that feel any better?

It's a bit of an abuse, because utilizing the protocol changes the framework's treatment of Step outputs, so maybe it's still not a great idea. I think I'm just trying to address the bad feeling of a boolean behavior flag and a method that is unimportant to most use cases.

I leave it to your judgement on how you want that interface and experience to work.
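A sketch of the Protocol variant (hypothetical as well; a runtime_checkable Protocol would let the framework detect the capability structurally rather than through a boolean flag):

from typing import Protocol, runtime_checkable


@runtime_checkable
class Finalizing(Protocol):
    """Hypothetical protocol: any step that defines this method opts in."""

    async def finalize_record(self, record_token) -> None: ...


def step_wants_finalization(step: object) -> bool:
    # A structural check like this could replace the tracks_lineage boolean.
    return isinstance(step, Finalizing)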
I think I've come down to: there isn't really a need to distinguish a protocol or subclass to avoid the bools. Not having it in this case is the same as doing nothing. For Step there is always a reasonable default implementation that we can rely on. In this case it's just pass.
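In other words, something like this on the base class (assumed shape, not the actual nodestream source):

class Step:
    async def finalize_record(self, record_token):
        # Assumed default: lineage-tracking steps override this;
        # for every other step, finalization is a no-op.
        pass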