Commit 7c66375
Streaming client improvements and Snowflake loader features (#15)
* loaders: Add label management system for CSV-based enrichment
- Load labels from CSV files with automatic type detection
- Support hex string to binary conversion for Ethereum addresses
- Thread-safe label storage and retrieval
- Add LabelJoinConfig type for configuring joins
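A minimal sketch of what such a label store could look like, using only the standard library. The class and method names here are illustrative, not the actual amp API; the real implementation loads CSVs via pyarrow.

```python
import csv
import io
import threading


class LabelStore:
    """Thread-safe label storage with hex-to-binary address conversion (sketch)."""

    def __init__(self):
        self._labels = {}
        self._lock = threading.Lock()

    @staticmethod
    def _to_binary(address: str) -> bytes:
        # Convert a 0x-prefixed hex Ethereum address to its 20 raw bytes.
        return bytes.fromhex(address.removeprefix("0x"))

    def load_csv(self, text: str) -> None:
        # Parse CSV rows of the form "address,label" and index them by binary key.
        reader = csv.DictReader(io.StringIO(text))
        with self._lock:
            for row in reader:
                self._labels[self._to_binary(row["address"])] = row["label"]

    def get(self, address: bytes):
        with self._lock:
            return self._labels.get(address)
```

Storing keys as raw bytes lets label lookups join directly against binary address columns without per-row hex decoding.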
* streaming: Add unified stream state management for resume and dedup
- StreamStateStore interface with in-memory, null, and DB-backed
implementations
- Block range tracking with gap detection
- Reorg invalidation support
Key features:
- Resume from last processed position after crashes
- Exactly-once semantics via batch deduplication
- Gap detection and intelligent backfill
- Support for multiple networks and tables
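The in-memory variant of the state store can be sketched as follows; the type and method names are assumptions for illustration, not the real StreamStateStore interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRange:
    start: int  # inclusive
    end: int    # inclusive


class InMemoryStateStore:
    """In-memory stream state with block-range tracking and gap detection (sketch)."""

    def __init__(self):
        self._ranges = []

    def record(self, r: BlockRange) -> None:
        self._ranges.append(r)

    def max_processed_block(self):
        # Resume position: the highest block seen so far, or None on a fresh start.
        return max((r.end for r in self._ranges), default=None)

    def find_gaps(self):
        # A gap is any span of blocks between two recorded ranges.
        gaps, prev_end = [], None
        for r in sorted(self._ranges, key=lambda r: r.start):
            if prev_end is not None and r.start > prev_end + 1:
                gaps.append(BlockRange(prev_end + 1, r.start - 1))
            prev_end = r.end if prev_end is None else max(prev_end, r.end)
        return gaps
```

A DB-backed implementation would persist the same ranges in a table keyed by network and table name, so one store can serve multiple streams.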
* streaming: Add resilience features
- Exponential backoff with jitter for transient failures
- Adaptive rate limiting with automatic adjustment
- Back pressure detection and mitigation
- Error classification (transient vs permanent)
- Configurable retry policies
Features:
- Auto-detects rate limits and slows down requests
- Detects timeouts and adjusts batch sizes
- Production-tested configurations included
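Exponential backoff with jitter and transient/permanent error classification are standard patterns; a compact sketch (function names are illustrative, not the library's API):

```python
import random


def backoff_delays(base=0.5, factor=2.0, max_delay=30.0, attempts=5):
    """Full-jitter backoff: each delay is uniform in [0, min(cap, base * factor**n)]."""
    for n in range(attempts):
        yield random.uniform(0.0, min(max_delay, base * factor ** n))


# Exception types treated as retryable; anything else fails fast.
TRANSIENT = (TimeoutError, ConnectionError)


def is_transient(exc: Exception) -> bool:
    return isinstance(exc, TRANSIENT)
```

Jitter spreads retries from concurrent workers over time, which avoids the thundering-herd effect against a rate-limited endpoint.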
* *: Major base loader improvements for streaming and resilience
- Integrate state management for resume and deduplication
- Add label joining support with automatic type conversion
- Implement resilience features (retry, backpressure, rate limiting)
- Add metadata columns (_amp_batch_id) for reorg handling
- Support streaming with block ranges and reorg detection
- Separate _try_load_batch() for better error handling
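One plausible scheme for a deterministic batch ID, hashed from the stream identity and block range so the same batch always gets the same ID across retries (the real `_amp_batch_id` format may differ):

```python
import hashlib


def batch_id(network: str, table: str, start: int, end: int) -> str:
    """Deterministic batch ID for dedup and reorg invalidation (illustrative scheme)."""
    key = f"{network}:{table}:{start}:{end}".encode()
    return hashlib.sha256(key).hexdigest()[:16]
```

Determinism is what makes exactly-once semantics work: a retried batch produces the same ID, so the sink can recognize and skip the duplicate, and a reorg handler can delete all rows tagged with an invalidated ID.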
* streaming: Enhance parallel execution; resumability & gap detection
- Add resume optimization that adjusts min_block based on persistent
state
- Implement gap-aware partitioning for intelligent backfill
- Add pre-flight table creation to avoid locking issues
- Improve error handling and logging for state operations
- Support label joining in parallel workers
Key features:
- Auto-detects processed ranges and skips already-loaded partitions
- Prioritizes gap filling before processing new data
- Efficient partition creation avoiding redundant work
- Visible logging for resume operations and adjustments
Resume workflow:
1. Query state store for max processed block
2. Adjust min_block to skip processed ranges
3. Detect gaps in processed data
4. Create partitions prioritizing gaps first
5. Process remaining historical data
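The workflow above can be condensed into a partition planner; this is a sketch under assumed names, not the actual implementation:

```python
def plan_partitions(min_block, max_block, step, processed_max, gaps):
    """Resume workflow sketch: gaps first, then remaining data past the watermark.

    processed_max: highest block recorded in state (None on a fresh start).
    gaps: (start, end) spans missing from the processed data.
    """
    parts = list(gaps)  # step 4: prioritize gap filling
    # steps 1-2: adjust min_block past the already-processed range
    start = min_block if processed_max is None else max(min_block, processed_max + 1)
    while start <= max_block:  # step 5: remaining historical data
        end = min(start + step - 1, max_block)
        parts.append((start, end))
        start = end + 1
    return parts
```

With a watermark at block 299 and a gap at 100-199, the planner emits the gap first and then only the unprocessed tail, skipping ranges 0-99 and 200-299 entirely.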
* client: Integrate label manager into Client for enriched streaming
Add label management to Client class:
- Initialize LabelManager with configurable label directory
- Support loading labels from CSV files
- Pass label_manager to all loader instances
- Enable label joining in streaming queries via load() method
Updates:
- Client now supports label enrichment out of the box
- Loaders inherit label_manager from client
- Add pyarrow.csv dependency for label loading
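The wiring could look roughly like this; every class and parameter name below is a stand-in for illustration, not the real amp Client API:

```python
class LabelManager:
    """Stand-in for the real label manager."""
    def __init__(self, label_dir):
        self.label_dir = label_dir


class Loader:
    """Stand-in for a loader that accepts an injected label manager."""
    def __init__(self, label_manager=None):
        self.label_manager = label_manager


class Client:
    """Sketch: the client builds one LabelManager and shares it with every loader."""

    def __init__(self, label_dir=None):
        self.label_manager = LabelManager(label_dir) if label_dir else None

    def get_loader(self, loader_cls=Loader, **kwargs):
        # Loaders inherit the client's label manager unless explicitly overridden.
        kwargs.setdefault("label_manager", self.label_manager)
        return loader_cls(**kwargs)
```

Sharing a single manager instance means labels are loaded once and every loader joins against the same in-memory store.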
* loaders: Update all impls for new base class interface
- PostgreSQL: Add reorg support with DELETE/UPDATE, metadata columns
- Redis: Add streaming metadata and batch ID support
- DeltaLake: Support new metadata columns
- Iceberg: Update for base class changes
- LMDB: Add metadata column support
All loaders now support:
- State-backed resume and deduplication
- Label joining via base class
- Resilience features (retry, backpressure)
- Reorg-aware streaming with metadata tracking
* test: Add comprehensive unit tests for streaming features
Add unit tests for all new streaming features:
- test_label_joining.py - Label enrichment with type conversion
- test_label_manager.py - CSV loading and label storage
- test_resilience.py - Retry, backoff, rate limiting
- test_resume_optimization.py - Resume position calculation
- test_stream_state.py - State store implementations
- test_streaming_helpers.py - Utility functions and batch ID generation
- test_streaming_types.py - BlockRange, ResumeWatermark types
* snowflake_loader: Major improvements with state management
- Add Snowflake-backed persistent state store (amp_stream_state table)
- Implement SnowflakeStreamStateStore with overlap detection
- Support multiple loading methods: stage, insert, pandas,
snowpipe_streaming
- Add connection pooling for parallel workers
- Implement reorg history tracking with simplified schema
- Support Parquet stage loading for better performance
State management features:
- Block-level overlap detection for different partition sizes
- MERGE-based upsert to prevent duplicate state entries
- Resume position calculation with gap detection
- Deduplication across runs
Performance improvements:
- Parallel stage loading with connection pool
- Optimized Parquet format for stage loads
- Efficient batch processing with metadata columns
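Block-level overlap detection matters because a restarted run may use a different partition size than the one recorded in state. One way to decide whether a candidate partition is fully covered by processed ranges (a sketch, not the SnowflakeStreamStateStore's actual query):

```python
def already_processed(candidate, processed):
    """True if the candidate (start, end) range is fully covered by processed ranges."""
    covered = candidate[0]  # first block not yet known to be covered
    for s, e in sorted(processed):
        if s > covered:
            break  # a hole before this range: candidate is not fully covered
        covered = max(covered, e + 1)
    return covered > candidate[1]
```

In the real store this check runs against the `amp_stream_state` table, and MERGE-based upserts keep one row per range so reruns never insert duplicate state entries.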
* apps: Add Snowflake parallel loading applications
Add comprehensive demo applications for Snowflake loading:
1. snowflake_parallel_loader.py - Full-featured parallel loader
- Configurable block ranges, workers, and partition sizes
- Label joining with CSV files
- State management with resume capability
- Support for all Snowflake loading methods
- Reorg history tracking
- Clean formatted output with progress indicators
2. test_erc20_parallel_load.py - Simple ERC20 transfer loader
- Basic parallel loading example
- Good starting point for new users
3. test_erc20_labeled_parallel.py - Label-enriched example
- Demonstrates label joining with token metadata
- Shows how to enrich blockchain data
4. Query templates in apps/queries/
- erc20_transfers.sql - Decode ERC20 Transfer events
- README.md - Query documentation
* test: Add integration tests for loaders and streaming features
New tests:
- test_resilient_streaming.py - Resilience with real databases
- Enhanced Snowflake loader tests with state management
- Enhanced PostgreSQL tests with reorg handling
- Updated Redis, DeltaLake, Iceberg, LMDB loader tests
Integration test features:
- Real database containers (PostgreSQL, Redis, Snowflake)
- State persistence and resume testing
- Label joining with actual data
- Reorg detection and invalidation
- Parallel loading with multiple workers
- Error injection and recovery
Tests require Docker for database containers.
* infra: Add Docker and Kubernetes deployment configurations
Add containerization and orchestration support:
- General-purpose Dockerfile for amp-python
- Snowflake-specific Dockerfile with parallel loader
- GitHub Actions workflow for automated Docker publishing to ghcr.io
- Kubernetes deployment manifest for GKE with resource limits
- Comprehensive .dockerignore and .gitignore
Docker images:
- amp-python: Base image with all loaders
- amp-snowflake: Optimized for Snowflake parallel loading
- Includes snowflake_parallel_loader.py as entrypoint
- Pre-configured with Snowflake connector and dependencies
* docs: Add comprehensive documentation for new features
- All loading methods comparison (stage, insert, pandas, streaming)
- State management and resume capability
- Label joining for data enrichment
- Performance tuning and optimization
- Parallel loading configuration
- Reorg handling strategies
- Troubleshooting common issues
* data: Save performance benchmarks
* Formatting
* Linting fixes
* label manager: Remove data directory and document how to add label files
Users should now mount label CSV files at runtime using volume mounts
(Docker) or init containers with cloud storage (Kubernetes).
Changes:
- Removed COPY data/ line from both Dockerfiles
- The /data directory is still created (mkdir -p /app /data) but empty
- Updated .gitignore to ignore entire data/ directory
- Removed data/** trigger from docker-publish workflow
- Added comprehensive docs/label_manager.md with:
* Docker volume mount examples
* Kubernetes init container pattern (recommended for large files)
* ConfigMap examples (for small files <1MB)
* PersistentVolume examples (for shared access)
* Performance considerations and troubleshooting
* redis loader: Fix reorg handling when using string data structure
When data_structure='string', batch IDs are stored inside JSON values
rather than as hash fields. The reorg handler now checks the data
structure and uses GET+JSON parse for strings, HGET for hashes.

1 parent 68e1d60 · commit 7c66375
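The string-vs-hash branch described in the last fix can be sketched like this, with an in-memory stand-in for the Redis client so the logic is runnable without a server (names other than GET/HGET are illustrative):

```python
import json


class FakeRedis:
    """Minimal in-memory stand-in exposing the two Redis commands used below."""

    def __init__(self):
        self._kv, self._hashes = {}, {}

    def get(self, key):
        return self._kv.get(key)

    def set(self, key, value):
        self._kv[key] = value

    def hget(self, key, field):
        return self._hashes.get(key, {}).get(field)

    def hset(self, key, field, value):
        self._hashes.setdefault(key, {})[field] = value


def batch_id_for(client, key, data_structure):
    """For strings the batch ID lives inside the JSON value; for hashes it is a field."""
    if data_structure == "string":
        raw = client.get(key)
        return json.loads(raw)["_amp_batch_id"] if raw else None
    return client.hget(key, "_amp_batch_id")
```

Before the fix, the reorg handler used HGET unconditionally, which silently returned nothing for keys written with `data_structure='string'`.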
File tree: 50 files changed, +10207 −1060 lines

- .github/workflows
- apps
  - queries
- docs
- k8s
- sql
- src/amp
  - config
  - loaders
    - implementations
  - streaming
- tests
  - integration
  - unit