fix: use collection_id for database queries instead of allocation_id #835

tmigone · 2025-09-11T16:07:17Z

No description provided.

Signed-off-by: Tomás Migone <[email protected]>

…835)
* fix: use collection_id for database queries instead of allocation_id
  Signed-off-by: Tomás Migone <[email protected]>
* ci: make cargo fmt happy
  Signed-off-by: Tomás Migone <[email protected]>
---------
Signed-off-by: Tomás Migone <[email protected]>

* test: add database checker to inspect receipts, values and ravs for V1 and V2
* feat(monitor): add escrow accounts utility methods
* fix(tap-agent): fix allocation type mapping based on horizon detection
  Fix the TODO comment about mapping allocation types: the agent now properly detects Horizon contracts and maps allocations accordingly instead of always using the Legacy type.
* fix(tap-agent): fix signer address encoding for receipt queries
* chore(tap-agent): improve RAV request error messages with database-table context
  Enhance RAV request error messages to indicate which receipt table (V1/Legacy vs V2/Horizon) was queried, helping with Horizon migration debugging.
* test(tracker): improve logging for testing and debugging
* feat: add comprehensive logging for escrow accounts and allocation tracking
  Add comprehensive debug and info logging throughout the allocation and escrow account systems to improve troubleshooting of receipt processing, RAV generation, and Horizon migration issues.
* test: optimize TAP configuration for development testing
  - Reduce timestamp buffer from 1000s to 30s for faster RAV generation
  - Add trigger_value_divisor configuration for fine-tuned receipt batching
  - Set max_receipt_value_grt to 0.001 for controlled testing
  - Update tap-aggregator endpoint port configuration
* build: update tap_core and tap_aggregator deps
* feat(tap): add V2 domain separator support across core components
  Add dual domain separator support for V1 (TAP) and V2 (GraphTally) protocols. This enables simultaneous processing of both legacy and Horizon receipt types with version-appropriate EIP712 domains.
* feat(service): implement dual domain separator architecture
  Update service components to support both V1 and V2 domain separators simultaneously. Includes IndexerTapContext, ServiceRouter, and middleware updates for version-aware receipt processing.
* fix(tap): use correct domain separator for V2 receipt storage
  Critical fix for V2 receipt processing failures. Store receipts using version-appropriate domain separators to ensure correct signer recovery and escrow account lookup. Resolves "signer not found in V2 escrow accounts" errors.
* feat(auth): add dual TAP receipt authorization middleware
  Implement version-aware receipt authorization that routes V1 and V2 receipts to appropriate TAP managers. Includes new dual_tap_receipt_authorize function for simultaneous protocol support.
* feat(middleware): add V2 domain separator to sender middleware
  Update sender middleware to use version-appropriate domain separators for signer recovery. V1 receipts use legacy domain, V2 receipts use Horizon domain for correct escrow account lookups.
* feat(tap-agent): integrate V2 domain separator throughout agent
  Add V2 domain separator support to sender account management, allocation handling, and receipt processing. Maintains single allocation type invariant while enabling version-appropriate domain usage.
* test: update TAP validation checks for dual domain support
  Update receipt validation checks and test infrastructure to support dual domain separators. Ensures proper EIP712 domain usage for both V1 and V2 receipt types in testing scenarios.
* test: add dual domain support to integration and service tests
  Update integration tests, TAP agent tests, and service router tests to support V1/V2 dual domain architecture. Ensures comprehensive testing coverage for both protocol versions.
* test: fix struct name in escrow regex
* test: add database handler to check for new receipts, RAV
* test: add new horizon tests
* feat(config): add receipts_verifier_address_v2 configuration field
  - Add optional receipts_verifier_address_v2 to BlockchainConfig
  - Update configuration examples with V2 verifier documentation
  - Prepare infrastructure for dual domain separator support
* feat(config): implement fail-safe dual domain separators for V2 receipts
  - Update indexer-service and tap-agent to use correct V2 verifier addresses
  - Add strict validation: V2 address required when horizon.enabled=true
  - Implement graceful fallback when Horizon is disabled
* refactor(config): remove no-longer used configuration files
* refactor(docker): remove static config files and use dynamic generation
  - Remove static config.toml files from indexer-service and tap-agent
  - Update Dockerfiles to rely on start.sh for dynamic config generation
  - Clean up docker-compose.dev.yml by removing commented static config mounts
  - Ensure consistent contract file mounting across all Docker configurations
* fix: cargo fmt
* fix(clippy): fix unused function and other clippy warnings
* ci: add missing license
* test: update error snapshot
* test: fix start script to use tap verifier address
* test: fix type in command that retrieves tap escrow contract address
* feat(config): add new subgraph_service field to config.horizon
  feat(sender_account): Propagate new subgraph_service_address to receipt layers
  fix(horizon): use subgraph_service in query for V2 receipts
* chore: improve logging
* test: add new setting to services configuration
* fix: update config examples
* fix(test): Some fixes to funding script for V2
* fix(tests): include subgraph_service address book
* build: use latest tap_aggregator version
* ci: Update rust version in containers
* ci: update Rust version in workflows
* fix(tap-agent): derive allocation_id from V2 collection_id in AllocationId check
  - Accept Horizon (V2) receipts that only provide collection_id by mapping the 32-byte value to an Address (last 20 bytes, right-aligned).
  - Preserve V1 behavior (use allocation_id() when present).
  - Validate collection_id length and return a clear error when invalid.
  - Add a small debug log comparing the resolved allocation_id to the expected one.
* fix(tap-agent): Use subgraph service address in last_rav query
* chore(service): better log any error while parsing receipts headers
* chore: improve logging
* chore: remove testing gateway from workspace
* test(tap): reduce amount willing to lose to improve testing
* feat(integration-tests): add centralized environment configuration
  - Add TestConfig struct with env-backed configuration
  - Add env_loader module for loading test environment variables
  This centralizes test configuration and makes it easier to run tests against different environments without hardcoding values.
* refactor(integration-tests): migrate RAV tests to use database checker
  - Replace metrics-based checking with direct database state verification
  - Update test_tap_rav_v2 to use TestConfig and DatabaseChecker
  - Add detailed state tracking with pending receipt values
  - Add timeout-based RAV creation waiting with diagnostic output
* chore(integration-tests): remove unused test function and improve logging
  - Remove test_direct_service_rav_v2_simplified (299 lines)
  - Clean up debug logging in find_allocation function
  - Remove commented code and update function references in main.rs
* fix(test-setup): improve service synchronization and reliability
  - Extend Docker health check timeouts from 30s to 120s
  - Add HTTP endpoint verification for indexer-service and tap-agent
  - Override allocation ID with known test value for consistency
  - Add 10s wait for indexer chain synchronization
  - Add optional pgAdmin startup for debugging
* feat(tap-agent): fail fast when Horizon is enabled but not deployed; return Result from start_agent
  - Enforce hard startup error when horizon.enabled = true but Horizon isn’t active in the Network Subgraph (no silent fallback to V1).
  - Error out if the Horizon readiness query fails (subgraph/network error) instead of proceeding.
  - Convert start_agent to return anyhow::Result<(ActorRef<…>, JoinHandle<…>)> and propagate errors in main.
  - Add contextual errors for watcher initialization and actor spawn (replace .expect(...) with .with_context(...)?).
* test(tap-agent): fix missing field in config types used in tests
* feat(service): fail fast when Horizon is enabled but contracts not deployed
  Previously, both tap-agent and indexer-service would silently fall back to legacy (V1) mode when horizon.enabled=true but Horizon contracts weren't found in the network subgraph. This could lead to silent misconfigurations where operators expected V2 functionality but got V1.
  Changes:
  - tap-agent: Replace silent fallback with anyhow::bail! on missing contracts
  - indexer-service: Replace silent fallback with anyhow::bail! on missing contracts
  - Both services now fail at startup with clear error messages explaining:
    - How to disable Horizon (set horizon.enabled=false)
    - How to fix it (deploy Horizon contracts to network)
  - Replace .expect() calls with .with_context() for better error handling
* ci(clippy): fix unused warnings across the code
* ci: fix clippy warnings
* ci: make cargo fmt happy
* feat(metrics): add TAP version-specific metrics for V1/V2 protocol support
  - Add version-specific counters and gauges for unaggregated fees, RAV creation/failures, and response times
  - Introduce `*_by_version` metrics with TAP version labels (v1/v2)
  - Maintain backward compatibility by preserving legacy metrics for V1 only
  - Add helper function `get_tap_version()` using type introspection to determine TAP version
  - Support monitoring of dual V1/V2 protocol operations during migration
* refactor: replace boolean horizon config with TapMode enum
  Replace horizon_enabled boolean and optional subgraph_service_address fields with a TapMode enum that encapsulates TAP version modes.
  - Add TapMode enum with Legacy and Horizon variants
  - Legacy mode: V1 TAP receipts only
  - Horizon mode: hybrid V1/V2 support with SubgraphService address
  - Update all references from horizon_enabled to tap_mode.is_horizon()
  - Replace direct subgraph_service_address access with pattern matching
  - Add helper methods: is_legacy(), supports_v2(), v1_only()
  - Update TapAgentContext to use Option<Address> for SubgraphService
  - Add comprehensive documentation with usage examples
  This improves type safety by ensuring SubgraphService address is only accessible when actually configured for Horizon mode.
* test(tap-agent): add missing subgraph_service_address to Horizon test contexts
  The Horizon (V2) TAP receipt and RAV database queries require a subgraph_service_address to filter records by the data_service field. Test contexts were not providing this address, causing Horizon tests to fail with "SubgraphService address not available" errors.
  Changes:
  - Set subgraph_service_address in all Horizon TapAgentContext builders for tests
  - Update Horizon test receipt/RAV creation to use consistent data_service addresses
  - Ensure V2 receipts use TAP_SENDER.1 as data_service to match query expectations
* fix(horizon-detection): activate on schema availability
  - Treat Horizon as active when the Network Subgraph schema is available, not only when PaymentsEscrow accounts exist.
  - Prevent false negatives at startup before any accounts are created.
  - Adjust monitor logs to reflect schema availability and account count.
  - Clarify TAP Agent logs and note V2 watcher auto-detects new accounts.
* fix: closed allocations check to use .address() for proper address extraction
  Ensures both Legacy and Horizon allocation types are handled correctly when checking for allocations.
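The TapMode refactor described in the log above can be pictured with a minimal sketch. This is an illustration only, not the repository's actual definition: the variant payload, the `Address` stand-in, and the `subgraph_service_address()` accessor are assumptions; only the variant names and the helper names `is_horizon()`, `is_legacy()`, `supports_v2()` and `v1_only()` come from the commit message.

```rust
/// Stand-in for a 20-byte Ethereum address (assumption: the real code
/// most likely uses a dedicated Address type from a crypto crate).
type Address = [u8; 20];

/// Hypothetical shape of the enum: the SubgraphService address exists
/// only inside the Horizon variant, so Legacy code cannot reach it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TapMode {
    /// V1 TAP receipts only.
    Legacy,
    /// Hybrid V1/V2 support, configured with a SubgraphService address.
    Horizon { subgraph_service_address: Address },
}

impl TapMode {
    fn is_legacy(&self) -> bool {
        matches!(self, TapMode::Legacy)
    }

    fn is_horizon(&self) -> bool {
        matches!(self, TapMode::Horizon { .. })
    }

    /// V2 receipts are only supported in Horizon mode.
    fn supports_v2(&self) -> bool {
        self.is_horizon()
    }

    fn v1_only(&self) -> bool {
        self.is_legacy()
    }

    /// Assumed accessor: pattern matching yields the address only when
    /// the mode actually carries one.
    fn subgraph_service_address(&self) -> Option<Address> {
        match self {
            TapMode::Legacy => None,
            TapMode::Horizon { subgraph_service_address } => Some(*subgraph_service_address),
        }
    }
}
```

Compared with a boolean plus an optional address, this shape makes the "Horizon enabled but no address" state unrepresentable, which is the type-safety gain the commit message refers to.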
* ci: fix clippy warnings
* fix(tap-agent): resolve V2 RAV tracking silent failures
  This commit fixes critical bugs in Horizon (V2) allocation handling where V2 RAVs were silently dropped due to format conversion issues:
  - Normalize database queries to use AllocationId enum wrapper instead of raw strings, properly distinguishing Legacy vs Horizon variants
  - Use consistent .address() method for 20-byte address normalization in subgraph queries and internal tracking
  - Fix HashMap key format mismatch between local RAVs and subgraph responses
  - Add proper collection_id to address conversion for redemption comparison
* ci: make cargo fmt happy
* fix(tap-agent): preserve allocation variant in ActorTerminated handler
  The ActorTerminated handler was hardcoding AllocationId::Legacy variant instead of looking up the correct variant from state. This caused Horizon allocations to be incorrectly processed as Legacy when their actors terminated, leading to inconsistent variant handling compared to the ActorFailed handler.
  Now both supervision event handlers properly preserve the original allocation variant by looking it up from state.allocation_ids.
* fix(tap-agent): handle missing allocation in ActorTerminated with sender type fallback
  The ActorTerminated handler now gracefully handles cases where an allocation cannot be found in state.allocation_ids by falling back to the sender's type.
  The issue occurs when:
  - Allocation is removed from state.allocation_ids (due to closure)
  - Actor termination event fires shortly after
  - Handler cannot determine the correct AllocationId variant
  Since sender accounts are type-specific (Legacy or Horizon), we can safely infer the allocation variant from state.sender_type when the allocation is not found in state. This handles both legitimate race conditions in production and test scenarios with manually spawned allocations.
  The fallback logs a warning to help identify when this condition occurs, while ensuring the retry mechanism continues to function correctly.
* ci: make cargo fmt happy
* fix(tap-agent): prevent post-RAV underflows in fee tracker; add tests
  - Use saturating_sub in SenderFeeStats::ravable_count to avoid u64 underflow when buffer entries exceed the current counter after a successful RAV reset.
  - Use saturating_sub in GenericTracker::get_ravable_total_fee to avoid underflow when requesting or buffered fees exceed total; add diagnostic warning log.
  - Add tests to cover both edge cases:
    - test_ravable_count_no_underflow_after_rav_success
    - test_get_ravable_total_fee_no_underflow
  This fixes the “attempt to subtract with overflow” crash observed after RAV success and prevents actor restarts caused by counter/buffer mismatch. It also adds TODO comments to track these changes for further discussion and investigation regarding the validity of such edge cases.
* test: fix path to pg_admin script
* build: Update tap dependencies: tap_aggregator, tap_graph, tap_core
* fix: use collection_id for database queries instead of allocation_id (#835)
  * fix: use collection_id for database queries instead of allocation_id
    Signed-off-by: Tomás Migone <[email protected]>
  * ci: make cargo fmt happy
    Signed-off-by: Tomás Migone <[email protected]>
  ---------
  Signed-off-by: Tomás Migone <[email protected]>
* feat(contrib): Add indexer-cli container for allocation management
  Adds indexer-cli Docker container built from graphprotocol/indexer horizon branch to enable allocation management via indexer-agent API for integration testing.
  - Builds from horizon branch with ESM compatibility fix
  - Auto-connects to indexer-agent on startup
  - Integrated into docker-compose and setup-test-network.sh
  - Enables allocation operations: list, close with POI, etc.
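The underflow fix mentioned in the log above relies on `saturating_sub`, which clamps at zero instead of panicking. The sketch below shows the idea with hypothetical, simplified counters; the struct, its fields, and the free function are assumptions for illustration and do not mirror the real SenderFeeStats or GenericTracker code.

```rust
/// Hypothetical, simplified counters (not the actual SenderFeeStats fields).
struct FeeStats {
    /// Receipts counted since the last successful RAV.
    count: u64,
    /// Receipts still inside the timestamp buffer.
    buffered: u64,
}

impl FeeStats {
    /// With plain `-`, a debug build panics with "attempt to subtract
    /// with overflow" whenever `buffered > count`. `saturating_sub`
    /// returns 0 in that case instead.
    fn ravable_count(&self) -> u64 {
        self.count.saturating_sub(self.buffered)
    }
}

/// Same idea for fee totals: subtract the fees already being requested
/// and the buffered fees, clamping at zero at each step.
/// (Assumed signature, for illustration only.)
fn ravable_total_fee(total: u128, buffered: u128, requesting: u128) -> u128 {
    total.saturating_sub(requesting).saturating_sub(buffered)
}
```

Clamping hides the counter/buffer mismatch rather than explaining it, which is presumably why the commit also adds a warning log and TODO comments for follow-up.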
* refactor(config): group Horizon V2 addrs under [blockchain]
* chore: add an interruptible wait time helper function
* test: use the right local-network branch
* fix: remove typo in setup script
* test: fix missing subgraph_service_address field
* ci: fix cargo fmt warnings
* fix: use canonical encode_hex implementation for allocationId
* refactor: improve structured logging in sender_accounts_manager
  - Replace verbose debug logs with structured fields
  - Use consistent sender_type_str formatting (V1/V2)
  - Simplify error messages while preserving key information
  - Remove excessive escrow account mapping dumps
  - Consolidate allocation address handling logic
* test: centralize test config defaults using constants
  - Replace hardcoded values with imports from constants module
  - Use SUBGRAPH_ID constant to match local-network/.env SUBGRAPH value
  - Consolidate import statements for better organization
  - Remove duplicate env var lookup in test_subgraph_deployment
  - Maintain env variable override functionality while standardizing defaults
* test: add IndexerCli wrapper for allocation management
  Add minimal Docker-based wrapper around indexer-cli service to enable programmatic allocation operations in integration tests.
  Features:
  - List active allocations via graph indexer allocations get
  - Close allocations with zero POI for testing
  - Configurable container name via environment variables
  - Ethereum address parsing with deduplication
  Note: Integration with test suite pending further investigation.
* test: migrate integration tests from metrics to database checker
  - Replace MetricsChecker with DatabaseChecker for more reliable state verification
  - Add comprehensive V1 TAP test implementation with database validation
  - Enable early exit on successful RAV creation in V2 tests
  - Add detailed progress tracking with trigger threshold calculations
  - Implement delayed RAV creation detection with timeout handling
  - Remove commented-out legacy metrics code and unused imports
  - Disable invalid_chain_id test temporarily
  - Consolidate import statements across test modules
  The database checker provides direct access to TAP state rather than relying on metrics parsing, which is less reliable.
* chore: remove TODOs
* feat: add signer authorization to escrow funding script
  - Add V1 TAP escrow signer authorization with proof generation
  - Add V2 PaymentsEscrow signer authorization
  - Use separate gateway signer accounts (ACCOUNT1) for authorization
  - Add verification checks for both V1 and V2 signer authorization
  - Improve logging with clearer role descriptions and error handling
* test: start pg_admin visor for testing
* test: use edge-node local network and some minor improvements in services detection
* refactor(logging): switch to structured tracing; fix typos and minor renames
  • Replace string-interpolated logs with structured fields (tracing) for machine-parsable, consistent logs and lower formatting overhead.
  • main, metrics, routes, service: convert logs to structured fields (error, deployment, addr, config_path, etc.); fix typo “occoured” -> “occurred”.
  • tap-agent:
    • sender_account, sender_accounts_manager, sender_allocation, metrics, tracker/, tap/: widespread conversion to structured logs; clarify messages and add useful fields (allocation_id, sender, fees, counts, channel, code, etc.).
    • Derive Debug for DenyListVersion to improve log output.
  • No functional changes intended; only logging structure, readability, and typo fixes.
* test: fix some typos
  test: remove commented code; fix typos
  build: move deps to workspace
* ci: make cargo fmt happy
* test: pass GATEWAY_COMMIT when building gateway container
  test: use our custom local-network to pick up the subgraph-deploy patch, which consists of a simple loop that waits for the TAP subgraph to be ready
  test: fix gateway path
* fix: v1 processing fix allocation type
* feat(query): add isLegacy to network schema for allocations
* fix(tap-agent): stop global allocation typing; normalize per-sender with isLegacy + sender_type
  - Remove global allocation typing in Manager; keep a single raw watcher
  - Add per-sender normalization using Network Subgraph Allocation.isLegacy + escrow-derived sender_type
  - Pass typed Receiver<HashSet<AllocationId>> to SenderAccount; remove in-actor retyping
* feat(allocation): update allocation type to align with the updated schema
* test: update test allocations
* test: DatabaseChecker normalize API; fix V1/V2 DB logic
  - Make DatabaseChecker hold TestConfig
  - improve receipts and rav checks using tap-agent config
  - V1 queries: fix table/column usage to scalar_tap_receipts/scalar_tap_ravs and signer_address
    - has_rav_for_identifier (V1)
    - get_pending_receipt_value (V1)
  - V2 pending value: compute receipts newer than last RAV and older than buffer cutoff (remove rav IS NULL join)
* ci: make clippy happy
* test: add additional script to deploy services
  This script leverages local-network setup by adding some extra overrides to compile from source
* fix(tap-agent): add data_service field to fix Horizon RAV mark_rav_last
  - Add data_service: Option<Address> field to SenderAllocationState struct
  - Fix mark_rav_last query to include data_service in WHERE clause
  - Use expect() with descriptive message for Horizon mode access
  - remove redundant conversion from CollectionId to CollectionId
* build: update offline query for offline compilation
* ci: use rust nightly toolchain
* feat(service): enforce V2 data_service via DataServiceCheck
  - Add tap/checks/data_service_check.rs to validate V2 receipt.message.data_service against an allowlist (from configuration)
  - Return CheckError::Failed with clear message; maps to HTTP 400
  - Extend IndexerTapContext::get_checks(..) to accept allowed_data_services and append the check when provided
  - Wire ServiceRouter to pass blockchain.subgraph_service_address (Horizon mode), activating the check only when configured
  - V1 receipts unaffected; persistence logic unchanged
* fix(tap-agent): enforce Horizon bucket isolation for RAV/receipt queries
  - Use full Horizon identity (payer, data_service, service_provider, collection_id) in reads to prevent cross-bucket reads.
  - sender_account.rs: last non-final RAVs lookup now filters by service_provider and data_service.
  - sender_accounts_manager.rs: non-final RAV aggregation constrained to the configured bucket via WHERE data_service = $1 AND service_provider = $2.
  - sender_allocation.rs (Horizon):
    - calculate_invalid_receipts_fee() now filters by collection_id, payer, service_provider, and data_service.
    - calculate_fee_until_last_id() now filters by collection_id, payer, service_provider, and data_service (and updated SQLx arg indices).
  - Rationale: Aligns with tap_horizon_ravs PK, matches Horizon’s identity model, avoids mixing state across different data services/providers, and complements the service-side DataServiceCheck which only guards ingestion.
* fix(tap-agent): include data_service in queries
  - remove old queries
  - add new offline database query artifacts
* ci: fix formatting issues
* ci: add missing sqlx artifacts for offline compilation
* ci: fix cargo fmt
---------
Signed-off-by: Tomás Migone <[email protected]>
Co-authored-by: Tomás Migone <[email protected]>
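Several entries in the log above depend on the relationship between a 32-byte Horizon collection_id and a 20-byte allocation address ("last 20 bytes, right-aligned", with a length check and a clear error). A minimal sketch of that mapping follows. The function name, the raw byte-array types, and the error text are assumptions made for illustration; the actual code presumably uses dedicated CollectionId and Address types.

```rust
/// Derive a 20-byte address from a 32-byte collection_id by taking the
/// last 20 bytes (the address is right-aligned, so the first 12 bytes
/// are padding). Hypothetical helper, not the repository's function.
fn collection_id_to_address(collection_id: &[u8]) -> Result<[u8; 20], String> {
    // Reject anything that is not exactly 32 bytes with a clear error.
    if collection_id.len() != 32 {
        return Err(format!(
            "invalid collection_id length: expected 32 bytes, got {}",
            collection_id.len()
        ));
    }
    let mut address = [0u8; 20];
    // Bytes 12..32 hold the right-aligned address.
    address.copy_from_slice(&collection_id[12..]);
    Ok(address)
}
```

The mapping is lossy in one direction only: an address can be padded back to a collection_id, but a database column keyed by collection_id must be queried with the full 32-byte value, which is consistent with this PR's switch from allocation_id to collection_id in database queries.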

tmigone added 2 commits September 11, 2025 13:06

fix: use collection_id for database queries instead of allocation_id

d577471

Signed-off-by: Tomás Migone <[email protected]>

ci: make cargo fmt happy

c0c0109

Signed-off-by: Tomás Migone <[email protected]>

tmigone requested a review from neithanmo September 11, 2025 17:49

neithanmo merged commit 61b3c27 into feat/horizon Sep 11, 2025
8 checks passed

neithanmo deleted the tmigone/fix-collection-id branch September 11, 2025 17:50

2 participants
