Merged
This implements Plan 2 for WebSocket IP rotation:
- One IP handles ALL streams at a time (active IP)
- Other IPs stay "hot" as standby with fresh sessions
- Automatic rotation before connections age out (default 20 min)
- Zero-downtime switching with overlap during transition
- Hot standby sessions refreshed every 15 min to stay fresh
- Automatic failover to standby on unexpected active IP failure

Key components:
- IPState enum: ACTIVE, SPINNING_UP, STANDBY, TEARING_DOWN
- _rotation_controller: background task monitoring connection ages
- _refresh_standby_sessions: keeps standby IPs fresh
- _spinup_ip / _complete_rotation / _teardown_ip: rotation lifecycle
- _handle_unexpected_failure: automatic failover

Configuration (exchange config):
- websocket_ip_pool: list of IPs to use
- websocket_active_max_age: max seconds before rotating (default 1200)
- websocket_standby_refresh: standby session refresh interval (default 900)
- websocket_spinup_lead_time: start spinup this early (default 120)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
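The rotation timing described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `IPState` names and config keys are taken from the commit message, while `ROTATION_DEFAULTS` and `should_start_spinup` are hypothetical helpers showing how the spinup lead time interacts with the max connection age.

```python
from enum import Enum, auto

class IPState(Enum):
    ACTIVE = auto()
    SPINNING_UP = auto()
    STANDBY = auto()
    TEARING_DOWN = auto()

# Hypothetical defaults mirroring the config keys above
ROTATION_DEFAULTS = {
    "websocket_active_max_age": 1200,   # rotate before connections age out (20 min)
    "websocket_standby_refresh": 900,   # refresh standby sessions every 15 min
    "websocket_spinup_lead_time": 120,  # begin spinup 2 min before rotation
}

def should_start_spinup(active_age_s, cfg=ROTATION_DEFAULTS):
    """True once the active connection is within the spinup lead time of
    its maximum allowed age, so the standby overlaps before cutover."""
    return active_age_s >= cfg["websocket_active_max_age"] - cfg["websocket_spinup_lead_time"]
```

With the defaults, spinup begins at 1200 − 120 = 1080 seconds of connection age, leaving a two-minute overlap window for the zero-downtime switch.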
- Add HOT_BACKUP state for IPs with pre-loaded data ready for instant failover
- Replace rotation_controller with danger_zone_controller
- Spin up backup IPs before candle close (:45 and :53) so they have cached data
- Cascade through all IPs with data in ohlcvs() before falling back to REST
- Determine survivor at :02 based on which IP has fresh data
- New config: ws_danger_zone_start, ws_post_danger_zone, ws_spinup_schedule, ws_freshness_threshold

This prevents REST fallback at hourly candle boundaries when Hyperliquid kills connections, as backup IPs already have ~15 min of cached data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ache

The previous implementation called watch_ohlcv() once per pair during spinup, which only initiates the connection but doesn't populate the ohlcvs cache. Data only comes from continuous watch loops.

Changes:
- Add _backup_tasks dict to track background tasks per backup IP
- Replace one-time watch_ohlcv calls with _continuously_watch_backup loops
- Each backup IP now runs parallel watch tasks that populate its ohlcvs cache
- Cancel backup tasks on teardown to clean up properly

This ensures backup IPs actually have cached data ready for cascade failover.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
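The distinction the commit above draws — a one-shot `watch_ohlcv()` call versus a continuous loop that keeps writing into the cache — can be shown with a small self-contained sketch. `FakeWS` is a stand-in for a ccxt.pro exchange (not real ccxt), and `continuously_watch_backup` is an illustrative analogue of the `_continuously_watch_backup` loop named in the commit.

```python
import asyncio

class FakeWS:
    """Stand-in for a ccxt.pro exchange; each await of watch_ohlcv
    resolves with the latest candle list, like a WS message."""
    def __init__(self):
        self._n = 0

    async def watch_ohlcv(self, pair, timeframe):
        self._n += 1
        await asyncio.sleep(0)
        return [[self._n, 1.0, 1.0, 1.0, 1.0, 10.0]]

async def continuously_watch_backup(exchange, pair, timeframe, ohlcvs, max_updates):
    # Loop so every new candle message lands in this backup IP's cache;
    # a single watch_ohlcv call would only open the subscription and
    # leave the cache empty for cascade failover.
    for _ in range(max_updates):
        candles = await exchange.watch_ohlcv(pair, timeframe)
        ohlcvs.setdefault(pair, {})[timeframe] = candles

cache = {}
asyncio.run(continuously_watch_backup(FakeWS(), "BTC/USDC:USDC", "1h", cache, 3))
```

In the real code the loop would run until the backup task is cancelled on teardown rather than for a fixed number of updates.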
Adds comprehensive logging to understand what happens at :00 candle boundaries:
- [BACKUP-CANDLE]: log candle data received on backup IPs near boundaries
- [OHLCV-CHECK]: log ALL IPs' data state when checking near :58-:02
- [WS-EVAL]: log exact conditions for WS success/failure in exchange.py
- [WS-FALLBACK]: log why REST fallback triggered, with condition details
- [WS-CONN-ERROR]: enhanced error logging with timestamp and state context
- [CANDLE-CLOSE]: log state at exactly :00:00-:00:10 for sample pairs
- [METRICS]: periodic summary of subscriptions/candles per IP (every 5 min)

This helps answer: Do backup IPs receive data? What fails at :00? Which condition triggers REST fallback? Are we hitting rate limits?

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Based on overnight log analysis showing candles arrive within 0-9 s (avg 5.1 s):
- Retry 1: check immediately (normal flow)
- Retry 2: wait 3 s, check again (at ~:04, catches ~60% of pairs)
- Retry 3: wait 3 s more, check again (at ~:07, catches ~90%)
- Retry 4: wait 4 s more, final check (at ~:11, catches 100%)
- Total max wait: 10 seconds before REST fallback

After all retries, returns the best available data (even if stale) instead of returning empty, which would trigger REST unnecessarily.

Key improvements:
- Cascades through ALL IPs (active → hot_backup → spinning_up) on each retry
- Only retries near candle boundaries (:58-:02)
- Logs retry attempts and successes for analysis
- Returns stale data as a last resort instead of empty

This solves the issue where backup IPs receive fresh data 2-9 s after the initial check, but we were giving up too early and falling back to REST.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
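The retry schedule above is just a cumulative-delay list. As a sketch (the `RETRY_DELAYS` constant and `retry_offsets` helper are illustrative names, not from the codebase):

```python
# Delays between checks, derived from the observed 0-9 s arrival window:
# immediate, then +3 s, +3 s, +4 s — 10 s total before REST fallback.
RETRY_DELAYS = [0, 3, 3, 4]

def retry_offsets(delays=RETRY_DELAYS):
    """Seconds after the first check at which each retry fires."""
    total, offsets = 0, []
    for d in delays:
        total += d
        offsets.append(total)
    return offsets
```

The offsets land at 0, 3, 6, and 10 seconds after the first check, matching the ~:01/:04/:07/:11 wall-clock checks described above.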
Critical bug fix for unexpected IP failover.

Problem:
- When the active IP fails, all 150 watch tasks exit
- Each task removed itself from _klines_watching via discard()
- Result: _klines_watching becomes empty
- Next danger zone spinup fails: "No pairs in _klines_watching"
- System vulnerable for 30+ minutes until the strategy re-adds pairs

Example from logs (03:29 failover):
- 03:29:22 - IP .36 dies, _klines_watching cleared
- 03:45:08 - Spinup FAILS (no pairs)
- 04:00:00 - Only 1 IP running, no backups available
- 04:00:01 - Strategy finally re-adds pairs (31 min later)

Solution:
1. Don't remove pairs from _klines_watching when tasks exit
   - _klines_watching = desired state (what we want to watch)
   - _klines_scheduled = actual running tasks
   - Keep the desired state intact during failover
2. Explicitly reschedule pairs after rotation
   - Call _schedule_while_true() immediately after promoting the new active IP
   - Ensures all pairs are resubscribed within seconds, not minutes
   - Backups can spin up for the next danger zone

Result: seamless failover with continuous backup coverage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
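The desired-state/actual-state split at the heart of this fix can be sketched as follows. `WatchState` is an illustrative class, not the real `ExchangeWS`; it only shows why keeping `_klines_watching` intact lets every pair be rescheduled after a mass task exit.

```python
class WatchState:
    """Separate desired subscriptions from running tasks so a mass task
    exit (the active IP dying) never erases what we intend to watch."""
    def __init__(self):
        self.klines_watching = set()   # desired state: pairs we want to watch
        self.klines_scheduled = set()  # actual state: tasks currently running

    def on_task_exit(self, pair):
        # Only the running-task record is dropped; the desire is kept,
        # unlike the old code which discarded from klines_watching too.
        self.klines_scheduled.discard(pair)

    def pairs_to_reschedule(self):
        return self.klines_watching - self.klines_scheduled

state = WatchState()
state.klines_watching.update({"BTC/USDC", "ETH/USDC"})
state.klines_scheduled.update({"BTC/USDC", "ETH/USDC"})
# Simulate every watch task exiting when the active IP dies:
for pair in list(state.klines_scheduled):
    state.on_task_exit(pair)
```

After the simulated failover, `pairs_to_reschedule()` still returns both pairs, so a rescheduler can resubscribe them immediately instead of waiting for the strategy loop.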
Critical bug: _determine_survivor() only sampled the first 3 pairs to check IP health, causing it to miss IPs with fresh data.

Problem from logs (11:02 exit):
- IP .42 had fresh data for all 75 pairs
- But the survivor check only tested the first 3 pairs
- Those 3 happened to be stale at 11:02:04
- Result: "IP .42 has no fresh data" → kept broken IP .36

Solution:
- Sample 20 pairs (or all if <20) instead of just 3
- Track which IP has the MOST fresh pairs
- Promote the IP with the best fresh-data percentage
- Log fresh_count/total and percentage for visibility

This ensures we actually promote IPs with fresh data instead of keeping broken IPs due to unlucky sampling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
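The fixed selection logic amounts to "sample wider, then argmax over fresh counts". Here is a hedged sketch — `determine_survivor` and its `is_fresh` callback are illustrative stand-ins for the real `_determine_survivor` and freshness check:

```python
import random

def determine_survivor(ips, is_fresh, pairs, sample_size=20):
    """Sample up to `sample_size` pairs (all if fewer) and promote the IP
    with the highest fresh-pair count, rather than rejecting an IP just
    because its first few sampled pairs happened to be stale."""
    sample = pairs if len(pairs) <= sample_size else random.sample(pairs, sample_size)
    best_ip, best_count = None, -1
    for ip in ips:
        fresh = sum(1 for p in sample if is_fresh(ip, p))
        if fresh > best_count:
            best_ip, best_count = ip, fresh
    return best_ip, best_count, len(sample)

# Demo: IP .42 is fresh for every pair, IP .36 for none
pairs = [f"PAIR{i}/USDC" for i in range(10)]
survivor, fresh, total = determine_survivor(
    ["10.0.0.36", "10.0.0.42"], lambda ip, p: ip == "10.0.0.42", pairs
)
```

Returning `fresh` and `total` alongside the survivor supports the fresh_count/total logging the commit mentions.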
Replace the Rolling Active + Hot Standby model with true pair distribution:
- Each IP handles ~15 pairs (75 pairs / 5 IPs)
- All IPs active simultaneously
- Consistent hash assignment (same pair -> same IP)
- Automatic reassignment on IP failure
- Extensive logging for debugging:
  - Per-IP health status every minute
  - Pair assignments logged
  - Candle boundary status near :00
  - Data freshness tracking per IP

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
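A consistent hash assignment of the kind described ("same pair -> same IP") can be as simple as hashing the pair name and taking it modulo the pool size. This sketch uses SHA-256 for a stable hash across processes (Python's built-in `hash()` is salted per process); the function name and pool values are illustrative:

```python
import hashlib

def assign_ip(pair, ip_pool):
    """Deterministic assignment: the same pair always maps to the same IP
    for a given pool, with no shared coordination state."""
    digest = hashlib.sha256(pair.encode()).digest()
    return ip_pool[int.from_bytes(digest[:8], "big") % len(ip_pool)]

pool = ["10.0.0.36", "10.0.0.38", "10.0.0.40", "10.0.0.42", "10.0.0.44"]
```

Note a later commit in this PR replaces hash-based assignment with least-loaded distribution, precisely because modulo hashing only evens out pair counts in expectation, not exactly.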
- Periodic connection refresh at :20 and :50 to ensure fresh connections before candle close at :00 (Hyperliquid kills connections after ~3 h)
- IP recovery mechanism: require 3 consecutive failures before marking an IP FAILED; allow recovery after a 5-minute cooldown
- Retry logic before REST fallback: up to 5 retries with increasing delays (1-5 s) at candle boundaries (:58-:02), giving WS data time to arrive instead of immediately falling back to REST

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
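The failure-threshold-plus-cooldown mechanism can be sketched as a small state holder. `IPHealth` is an illustrative class (the real code tracks this inside ExchangeWS); the thresholds match the commit: 3 consecutive failures, 5-minute (300 s) cooldown.

```python
class IPHealth:
    """Mark an IP FAILED only after 3 consecutive failures; permit a
    recovery attempt once a 5-minute cooldown has elapsed."""
    FAIL_THRESHOLD = 3
    COOLDOWN_S = 300.0

    def __init__(self):
        self.consecutive_failures = 0
        self.failed_at = None  # timestamp when the threshold was crossed

    def record_failure(self, now):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.FAIL_THRESHOLD:
            self.failed_at = now

    def record_success(self):
        # Any success resets the streak, so transient errors don't accumulate.
        self.consecutive_failures = 0
        self.failed_at = None

    @property
    def is_failed(self):
        return self.failed_at is not None

    def can_retry(self, now):
        return self.is_failed and now - self.failed_at >= self.COOLDOWN_S
```

Requiring consecutive failures avoids benching an IP on a single transient error, while the cooldown prevents a flapping IP from being retried in a tight loop.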
Replace the sync _try_build_from_websocket with the async _async_fetch_ohlcv_with_ws_retry. All pairs now retry WS simultaneously with non-blocking async delays instead of blocking sequentially. Uses 10 retries with delays [1,1,2,2,3,3,4,4,5,5] (30 s max) before falling back to REST.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
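The key point — per-pair retry loops that sleep without blocking each other — falls out naturally from `asyncio.gather`. This sketch is illustrative (the function name and `try_ws` callback are stand-ins, and the demo uses zero delays so it runs instantly):

```python
import asyncio

DELAYS = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]  # 30 s total across 10 retries

async def fetch_with_ws_retry(pair, try_ws, delays=DELAYS):
    """Retry the WS cache for one pair with non-blocking sleeps; because
    each pair is its own coroutine, gather() runs every pair's retry
    loop concurrently instead of serially blocking the thread."""
    data = try_ws(pair)
    for d in delays:
        if data is not None:
            break
        await asyncio.sleep(d)
        data = try_ws(pair)
    return pair, data, ("ws" if data is not None else "rest")

async def demo():
    calls = {"ETH/USDC": 0, "BTC/USDC": 0}
    def try_ws(pair):
        # Simulate WS data arriving on the third check for each pair
        calls[pair] += 1
        return [[1, 2, 3]] if calls[pair] >= 3 else None
    return await asyncio.gather(
        fetch_with_ws_retry("ETH/USDC", try_ws, delays=[0] * 10),
        fetch_with_ws_retry("BTC/USDC", try_ws, delays=[0] * 10),
    )

results = asyncio.run(demo())
```

With the old sequential version, 75 pairs each waiting up to 30 s could stack into minutes of blocking; with `gather`, the worst case stays at 30 s total.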
- Replace hash-based IP assignment with least-loaded distribution for an even pair spread across IPs
- Remove the cleanup_expired() call - pairs are managed by the periodic refresh at :20/:50

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
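Least-loaded assignment guarantees the even split that modulo hashing only approximates. A minimal sketch (function name and demo values are illustrative):

```python
def assign_least_loaded(pairs, ips):
    """Give each pair to the currently least-loaded IP, producing an
    exactly even split regardless of how pair names would hash."""
    load = {ip: 0 for ip in ips}
    assignment = {}
    for pair in pairs:
        ip = min(load, key=load.get)  # IP with the fewest pairs so far
        assignment[pair] = ip
        load[ip] += 1
    return assignment, load

# 75 pairs over 5 IPs -> exactly 15 per IP
assignment, load = assign_least_loaded(
    [f"PAIR{i}/USDC" for i in range(75)],
    [f"10.0.0.{n}" for n in (36, 38, 40, 42, 44)],
)
```

The trade-off versus consistent hashing is that assignments now depend on insertion order, which is acceptable here because pairs are pre-assigned at startup and reassigned explicitly on failure.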
After the :20/:50 periodic refresh clears connections, immediately trigger re-scheduling of all pairs that were being watched. This ensures WS subscriptions are re-established before the :00 candle boundary rather than waiting until data is requested.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bug fixes:
- Fix recovery not finding pairs to reschedule (look for unscheduled pairs in _desired_subscriptions instead of deleted assignments)
- Reset the backoff delay on periodic refresh and IP recovery
- Use await instead of run_coroutine_threadsafe in refresh (already async)
- Parallelize rate limit checks with asyncio.gather()

Improvements:
- Add per-IP rate limit monitoring every 5 min (Hyperliquid only)
- Remove the :50 refresh, keep only :20 (40 min buffer before the :00 boundary)
- Add an exchange guard for the rate limit API (skip for non-Hyperliquid)
- Double the backoff for RateLimitExceeded in the async retrier
- Add TTFM (time-to-first-message) logging for connection diagnostics

The :50 refresh was too close to the :00 candle boundary - connections take 25-60 s to receive their first message, leaving insufficient buffer.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log the wallet address (masked) at startup if found
- Warn if the wallet address is missing (rate limit monitoring disabled)
- Change the rate limit skip log from DEBUG to WARNING for visibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Shows whether the wallet was found from:
1. exchange_config (merged config)
2. ccxt_object.walletAddress (fallback)
Changes:
- Fix the stale_streams metric to account for the timeframe (4h candles have a higher staleness threshold)
- Fix avg_data_age to use only 1h candles (4h was skewing the average)
- Add sliding-window weight tracking per IP (1200 weight/min budget)
- Add REST API weight tracking for fetch_ohlcv calls
- Add periodic logging of weight consumption ([REST-WEIGHT], [IP-WEIGHT])
- Add warnings when approaching 70% of the rate limit budget

New log patterns:
- [REST-WEIGHT] REST_PROXY=X/1200(Y%) - REST API consumption
- [IP-WEIGHT] ip=X/1200(Y%) | ... - per-WebSocket-IP consumption
- [REST-WEIGHT-HIGH] / [IP-WEIGHT-HIGH] - warnings at >70%

Co-Authored-By: Claude <noreply@anthropic.com>
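Sliding-window weight tracking against a per-minute budget can be sketched with a deque of `(timestamp, weight)` events. `WeightWindow` is an illustrative class, not the actual tracker; the 1200 weight/min budget and 70% warning threshold are from the commit.

```python
import time
from collections import deque

class WeightWindow:
    """Sliding 60 s window of (timestamp, weight) events against the
    1200 weight/min budget; warn above 70% of the budget."""
    BUDGET = 1200
    WARN_FRACTION = 0.7

    def __init__(self):
        self._events = deque()

    def add(self, weight, now=None):
        self._events.append((time.monotonic() if now is None else now, weight))

    def used(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop events that have slid out of the 60 s window
        while self._events and now - self._events[0][0] > 60:
            self._events.popleft()
        return sum(w for _, w in self._events)

    def over_warn_threshold(self, now=None):
        return self.used(now) > self.WARN_FRACTION * self.BUDGET
```

A tracker like this would back the `[IP-WEIGHT] ip=X/1200(Y%)` log lines: `used()` gives X and `X / BUDGET` gives Y.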
Co-Authored-By: Claude <noreply@anthropic.com>
walletAddress is a public blockchain address, not a secret. It was being removed in dry_run mode by remove_exchange_credentials(), which broke rate limit monitoring.

Also adds debug logging for wallet address discovery.

Co-Authored-By: Claude <noreply@anthropic.com>
This reverts commit 9d8d485.
When REST fallback is triggered, route requests through the same IP assigned to that pair for WebSocket. This prevents overloading a single global proxy with all REST fallback requests.

- Add get_ip_for_pair(), get_exchange_for_pair(), assign_pair_to_ip() to ExchangeWS
- Pre-assign pairs to IPs before the startup candle fetch
- Track REST weight against the actual IP used instead of "REST_PROXY"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WebSocket exchange instances cannot be reused for REST calls - they run in different event loops, which causes "Future attached to different loop" errors. REST continues using self._api_async, but we track weight against the assigned IP for monitoring purposes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This reverts commit ac91639.
REST and WebSocket were sharing the same CCXT exchange instances, causing "Future attached to different loop" errors because WebSocket runs in its own thread with a separate event loop.

Solution: create a separate pool of CCXT exchanges for REST calls.
- The _rest_exchanges dict stores REST-specific instances per IP
- get_rest_exchange_for_pair() creates instances lazily in the main thread
- Both pools use the same local_addr for IP routing, with different instances

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…toring

Problem: The wallet address was not being detected for rate limit API queries. freqtradebot.py makes a deepcopy of config["exchange"] to preserve credentials, but Exchange.__init__ calls remove_exchange_credentials() in dry_run mode, which strips walletAddress BEFORE ExchangeWS is created.

Solution:
- Preserve the wallet address at the start of Exchange.__init__, BEFORE credential stripping
- Add a wallet_address parameter to ExchangeWS.__init__
- Pass the preserved wallet address explicitly to ExchangeWS
- Clean up debug logging in exchange_ws.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cleanup before PR:
- Remove dead code: _log_candle_close_state (never called)
- Remove dead code: _is_data_fresh (never called)
- Reduce logging: change candle boundary logs from INFO to DEBUG
- Reduce logging: simplify OHLCV read logging (was 150+ lines per boundary)
- Extract a _count_pairs_per_ip helper to reduce code duplication

Net reduction: 42 lines

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move periodic monitoring logs to DEBUG level (health, distribution, metrics)
- Simplify _stats_monitor: remove the 60 s health logs, keep hourly stats with the :20 refresh
- Move WebSocket event logs to DEBUG (SCHEDULE, CONNECT, TTFM, UPDATE, TASK-DONE)
- Move rate limit success logs to DEBUG; keep warnings at WARNING (>80%)
- Keep essential INFO logs: startup, hourly refresh, IP stats, failures, recoveries

Result: INFO logs reduced from 36 to 10 (-72%), DEBUG increased from 13 to 40. Total logger calls: 66 (down from 69). The hourly [IP-STATS] line provides visibility without noise; use -vv for detailed diagnostics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
WebSocket IP pool distribution with per-IP rate limit monitoring
Summary
Distributes WebSocket connections across multiple IPs to avoid Hyperliquid rate limits, with automatic failover and recovery mechanisms.
Key Features
- IP Pool Distribution: pairs are spread evenly across the IP pool via least-loaded assignment (~15 pairs per IP with 75 pairs and 5 IPs)
- Failover & Recovery: an IP is marked FAILED only after 3 consecutive failures; its pairs are reassigned automatically and a recovery attempt is allowed after a 5-minute cooldown
- Rate Limit Monitoring: per-IP sliding-window weight tracking against Hyperliquid's 1200 weight/min budget, with warnings when consumption gets high
- Periodic Refresh: connections are refreshed at :20 so fresh connections are in place well before the :00 candle close (Hyperliquid kills connections after ~3 h)