enarjord
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 43 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/changelog.md b/changelog.md
Lines changed: 40 additions & 0 deletions
diff --git a/docs/ai/debugging_case_studies.md b/docs/ai/debugging_case_studies.md
Lines changed: 108 additions & 0 deletions
diff --git a/docs/ai/exchange_api_quirks.md b/docs/ai/exchange_api_quirks.md
Lines changed: 65 additions & 0 deletions
@@ -2,7 +2,11 @@
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
-**IMPORTANT:** Also read and follow `docs/ai/passivbot_agent_principles.yaml` for detailed conventions on terminology, error handling, testing, and design principles.
+**IMPORTANT:** Also read and follow the documentation in `docs/ai/`:
+- `passivbot_agent_principles.yaml` - Conventions on terminology, error handling, testing, design principles
+- `exchange_api_quirks.md` - Known exchange API limitations and workarounds (check before implementing exchange code)
+- `debugging_case_studies.md` - Detailed debugging sessions as reference for complex investigations
+- `log_analysis_prompt.md` - Logging level definitions and examples
 
 ## Overview
 
@@ -227,9 +231,22 @@ python3 -m jupyter lab
 
 ### Logging
 
-- Use structured, leveled logging (`error`, `warn`, `info`, `debug`, `trace`)
-- Every remote API call must have a debug-level log entry (endpoint, params, timing)
-- Logging should be non-intrusive but detailed enough for full replay/audit
+Use structured, leveled logging with clear separation between levels:
+
+| Level | Audience | Content | Golden Rule |
+|-------|----------|---------|-------------|
+| **INFO** | Operators | Essential events: orders, fills, positions, balance, mode changes, health summaries | Must be sustainable to tail indefinitely in production |
+| **DEBUG** | Developers | Internal state: API timing, cache updates, decision points, fetch summaries | Tolerable for short debugging sessions |
+| **TRACE** | Deep debugging | Full firehose: API payloads, per-item iterations, raw data | Expect GB of logs; enable briefly for specific issues |
+
+**Guidelines:**
+- INFO should answer "what is the bot doing?" without overwhelming
+- DEBUG should answer "why did it make that decision?"
+- TRACE should answer "what exact data did it see?"
+- Use `[tag]` format consistently: `[order]`, `[pos]`, `[fill]`, `[health]`, `[boot]`
+- Prefer `logging.info("msg %s", var)` over f-strings for log aggregation compatibility
+- Every remote API call should have a DEBUG-level log entry (endpoint, timing)
+- See `docs/ai/log_analysis_prompt.md` for detailed level definitions and examples
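As a minimal sketch of the guidelines above (not the repository's actual logging setup), the snippet below uses `%s`-style arguments and bracketed tags. The `TRACE` level number, the logger name, and the `[api]` tag are assumptions made for illustration; the standard library has no built-in TRACE level.

```python
import io
import logging

TRACE = 5  # hypothetical numeric level below DEBUG (10); not defined by the stdlib
logging.addLevelName(TRACE, "TRACE")

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("passivbot_example")  # hypothetical logger name
log.setLevel(logging.DEBUG)  # DEBUG session: TRACE records are dropped
log.addHandler(handler)
log.propagate = False

symbol, qty, price, elapsed_ms = "XMRUSDT", 0.5, 431.2, 87

# Arguments are passed separately, so the message template stays constant.
log.info("[order] created %s qty=%s price=%s", symbol, qty, price)
log.debug("[api] fetch_positions took %dms", elapsed_ms)  # "[api]" tag is illustrative
log.log(TRACE, "[api] raw payload: %s", {"symbol": symbol})  # below DEBUG, not emitted

print(stream.getvalue(), end="")
```

Because the template string is identical across calls, a log aggregator can group records by template, which is the compatibility benefit the guideline refers to.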
 
 ## Testing
 
@@ -274,6 +291,28 @@ Configuration sections form an inheritance hierarchy. When adding new parameters
 - When adding dependencies, explain necessity and impact
 - Before committing, simulate/dry-run changes
 
+## Documentation Structure
+
+This file (`CLAUDE.md`) serves as the entry point for AI agents. Detailed topic-specific documentation lives in `docs/ai/`:
+
+| File | Purpose |
+|------|---------|
+| `passivbot_agent_principles.yaml` | Core conventions: terminology, error handling, testing |
+| `exchange_api_quirks.md` | Exchange-specific API limitations and workarounds |
+| `debugging_case_studies.md` | Detailed debugging sessions as learning references |
+| `log_analysis_prompt.md` | Logging level definitions and analysis guidance |
+
+**When to add new docs:**
+- **Exchange quirks** → Add to `exchange_api_quirks.md` (or create `{exchange}_quirks.md` if extensive)
+- **Complex debugging** → Add case study to `debugging_case_studies.md`
+- **New subsystem** → Consider a dedicated `{subsystem}.md` if >50 lines of guidance
+
+**Modularization guidelines:**
+- Keep CLAUDE.md as a high-level overview (<300 lines ideal)
+- Move detailed reference material to `docs/ai/` subdirectory
+- Use consistent naming: `{topic}.md` or `{topic}_{subtopic}.md`
+- Always reference new docs from CLAUDE.md's header list
+
 ## Changelog
 
 Maintain `CHANGELOG.md` as single source of truth for user-facing changes. Add entries under "Unreleased" as changes land; move to dated version heading when tagging releases.
@@ -4,7 +4,7 @@
 
 :warning: **Used at one's own risk** :warning:
 
-v7.6.2
+v7.7.0
 
 
 ## Overview
 
@@ -2,6 +2,46 @@
 
 All notable user-facing changes will be documented in this file.
 
+## v7.7.0 - Unreleased
+
+### Fixed
+- **Bybit: Missing PnL on some close fills** - Fixed pagination bug in `BybitFetcher._fetch_positions_history()` that caused closed-pnl records to be skipped when >100 records existed in a time window. Now uses hybrid pagination: cursor-based for recent records (no gaps), time-based sliding window for older records.
+
+### Added
+- **Fill events now include psize/pprice** - Each fill event is annotated with position size (`psize`) and VWAP entry price (`pprice`) after the fill. Values are computed using a two-phase algorithm and persisted to cache for all exchanges.
+- **Logging best practices documentation** - New `docs/ai/log_analysis_prompt.md` with comprehensive logging guidelines, level definitions, and improvement tracking.
+- **Exchange API quirks documentation** - New `docs/ai/exchange_api_quirks.md` documenting known exchange-specific limitations and workarounds.
+- **Debugging case studies** - New `docs/ai/debugging_case_studies.md` with detailed debugging sessions as reference.
+
+### Changed
+- **Logging improvements (7 rounds of refinement)**:
+  - Standardized log tags: `[memory]`, `[warmup]`, `[hourly]`, `[fills]`, `[mapping]`, `[candle]`, `[ranking]`, `[mode]`
+  - Moved routine API/cache messages from INFO to DEBUG level (CCXT fetch details, cache updates)
+  - Moved CCXT API payloads from DEBUG to TRACE level
+  - EMA ranking logs now throttled to every 5 minutes (was every cycle)
+  - Mode changes throttled to 2 minutes per symbol (reduces forager oscillation noise)
+  - KucoinFetcher PnL discrepancy warnings throttled to 1 hour with delta-based deduplication
+  - WebSocket reconnection now logs explicit `[ws] reconnecting...` messages
+  - Strict mode gaps changed from WARNING to DEBUG (expected for illiquid markets)
+  - Persistent gaps changed from WARNING to INFO with throttling
+  - Zero-candle synthesis warnings aggregated and rate-limited
+- **PnL tracking now uses FillEventsManager exclusively** - Legacy `update_pnls` path removed. FillEventsManager provides more accurate fill tracking with proper event deduplication, canonical schemas, and exchange-specific fetchers for all supported exchanges.
+- Fill events are now stored in `caches/fill_events/{exchange}/{user}/` instead of the old `caches/{exchange}/{user}_pnls.json` format. Existing legacy cache files are ignored; FillEventsManager will rebuild from exchange API on first run.
+- Unstuck allowances now computed from FillEventsManager data instead of legacy pnls list.
+- Trailing position change timestamps now derived from FillEventsManager events.
+
+### Removed
+- `--shadow-mode` CLI flag (no longer needed; FillEventsManager is production-ready)
+- `live.pnls_manager_shadow_mode` config option
+- Legacy `init_pnls`, `update_pnls`, `fetch_pnls` methods in passivbot.py
+- Legacy `init_fill_events`, `update_fill_events`, `fetch_fill_events` methods (dead code)
+- Shadow mode comparison logging (`_compare_pnls_shadow`, etc.)
+
+### Migration Notes
+- **No action required** - FillEventsManager automatically fetches and caches fill data
+- Old `{user}_pnls.json` cache files can be safely deleted after upgrading
+- If using custom exchange configurations, ensure the exchange's fill fetcher is supported (Binance, Bybit, Bitget, GateIO, Hyperliquid, KuCoin, OKX)
+
 ## v7.6.2 - 2026-01-20
 
 ### Fixed
 
@@ -0,0 +1,108 @@
+# Debugging Case Studies
+
+This document captures debugging sessions for complex issues, serving as a reference for future investigations.
+
+## Case Study: Missing Bybit PnL Data (2026-01-25)
+
+### Initial Symptom
+A user reported that some Bybit close fills had `pnl: 0.0` while others had correct PnL values.
+
+### Investigation Process
+
+#### Step 1: Identify the Pattern
+```python
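+# `events` is assumed to be the already-loaded list of cached fill event dicts.
+# Filter to long-side closes (a short close would be side 'buy', position_side 'short').
+closes = [x for x in events if x['side'] == 'sell' and x['position_side'] == 'long']
+# Treat |pnl| below 1e-4 as zero to absorb float noise
+zero_pnl = [c for c in closes if abs(c['pnl']) < 1e-4]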
+```
+
+Found: one XMR fill from Dec 30 had zero PnL, while fills from Dec 28 and from Jan 12 onward had nonzero PnL.
+
+#### Step 2: Check Raw Data
+Examined the `raw` field in the cached fill event:
+```python
+# Fill with zero PnL only had fetch_my_trades data, no positions_history
+event['raw']
+# [{'source': 'fetch_my_trades', 'data': {...}}]
+# Missing: {'source': 'positions_history', 'data': {...}}
+```
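The check in Step 2 can be automated across all cached events. The sketch below assumes only the `raw` layout shown above (a list of `{'source': ..., 'data': ...}` entries); the helper name and the `id` key in the sample data are hypothetical.

```python
def events_missing_source(events, source="positions_history"):
    """Return events whose `raw` list contains no entry from `source`."""
    missing = []
    for event in events:
        sources = {entry.get("source") for entry in event.get("raw", [])}
        if source not in sources:
            missing.append(event)
    return missing


# Illustrative sample data, not real fills.
sample_events = [
    {
        "id": "a",
        "pnl": 1.25,
        "raw": [
            {"source": "fetch_my_trades", "data": {}},
            {"source": "positions_history", "data": {}},
        ],
    },
    {"id": "b", "pnl": 0.0, "raw": [{"source": "fetch_my_trades", "data": {}}]},
]

print([e["id"] for e in events_missing_source(sample_events)])
```

Events returned by this helper are the ones whose PnL could not have been enriched from the positions-history endpoint.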
+
+#### Step 3: Verify Data Exists on Exchange
+Created a test script to query Bybit directly:
+```python
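+# scripts/check_missing_pnl.py
+# Assumes `api` is an authenticated async CCXT Bybit client and `end_ms` is the window end in ms.
+params = {'category': 'linear', 'symbol': 'XMRUSDT', 'limit': 100, 'endTime': end_ms}
+result = await api.private_get_v5_position_closed_pnl(params)  # must run inside a coroutine
+# Found the record! It exists on Bybit.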
+```
+
+**Key finding:** The closed-pnl record existed on Bybit but wasn't being fetched.
+
+#### Step 4: Trace the Fetch Logic
+Added debug output to understand pagination:
+```python
+# Fetch #10: Dec 29 10:45 → Jan 03 14:23 (100 records)
+# Missing Dec 30 05:28 record should be in this window!
+```
+
+**Key finding:** 128 records existed in the time window, but only 100 were fetched.
+
+#### Step 5: Identify Root Cause
+The time-based pagination was skipping records:
+1. Fetch returns 100 records, oldest at timestamp T
+2. Next fetch uses `endTime = T`
+3. Records between T and the previous batch's oldest are skipped
+
+#### Step 6: Test Cursor Pagination
+```python
+# Use cursor instead of time-based
+cursor = response.get('result', {}).get('nextPageCursor')
+params['cursor'] = cursor  # Continue with cursor
+```
+
+**Key finding:** Cursor pagination only covers ~7 days, then cursor becomes empty.
+
+#### Step 7: Implement Hybrid Solution
+Combined both approaches:
+1. Use cursor pagination for recent data (no gaps)
+2. Fall back to time-based sliding window for older data
+3. Deduplicate by orderId
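The three steps above can be sketched against an in-memory stand-in for the endpoint. This is a simplified illustration, not the code in `BybitFetcher`: `make_fake_endpoint`, `fetch_all_hybrid`, and the fake's paging behavior (newest-first pages, a cursor that only sees the last 7 days) are assumptions chosen to mirror the findings in Steps 5 and 6.

```python
from typing import Callable, Dict, List, Optional

DAY_MS = 24 * 60 * 60 * 1000


def make_fake_endpoint(records: List[dict], now_ms: int, cursor_days: int = 7) -> Callable:
    """Hypothetical in-memory stand-in for a closed-pnl style endpoint.

    Pages are newest-first. Cursor mode only sees the last `cursor_days`;
    time mode returns the newest `limit` records with updatedTime <= end_time.
    """
    ordered = sorted(records, key=lambda r: r["updatedTime"], reverse=True)
    recent = [r for r in ordered if r["updatedTime"] >= now_ms - cursor_days * DAY_MS]

    def fetch(limit: int = 100, cursor: Optional[str] = None,
              end_time: Optional[int] = None) -> dict:
        if end_time is None:  # cursor mode
            start = int(cursor) if cursor else 0
            more = start + limit < len(recent)
            return {"list": recent[start:start + limit],
                    "nextPageCursor": str(start + limit) if more else ""}
        older = [r for r in ordered if r["updatedTime"] <= end_time]
        return {"list": older[:limit], "nextPageCursor": ""}

    return fetch


def fetch_all_hybrid(fetch: Callable, now_ms: int, limit: int = 100) -> List[dict]:
    seen: Dict[str, dict] = {}  # orderId -> record; keying by orderId deduplicates overlap

    # Phase 1: follow the cursor until it comes back empty.
    cursor: Optional[str] = None
    while True:
        resp = fetch(limit=limit, cursor=cursor)
        for rec in resp["list"]:
            seen[rec["orderId"]] = rec
        cursor = resp["nextPageCursor"]
        if not cursor:
            break

    # Phase 2: slide end_time back, starting from the oldest record seen so far.
    end_time = min((r["updatedTime"] for r in seen.values()), default=now_ms)
    while True:
        page = fetch(limit=limit, end_time=end_time)["list"]
        fresh = [r for r in page if r["orderId"] not in seen]
        if not fresh:
            break  # nothing new on this page: history exhausted
        for rec in fresh:
            seen[rec["orderId"]] = rec
        end_time = min(r["updatedTime"] for r in page)  # inclusive bound, so overlap is expected

    return sorted(seen.values(), key=lambda r: r["updatedTime"])


now = 1_000 * DAY_MS
hour = 60 * 60 * 1000
data = [{"orderId": f"o{i}", "updatedTime": now - i * hour} for i in range(250)]
api = make_fake_endpoint(data, now)

# Cursor-only baseline for comparison.
cursor_only: List[dict] = []
cur: Optional[str] = None
while True:
    resp = api(cursor=cur)
    cursor_only.extend(resp["list"])
    cur = resp["nextPageCursor"]
    if not cur:
        break

hybrid = fetch_all_hybrid(api, now)
print(len(cursor_only), len(hybrid))
```

The inclusive `end_time` deliberately refetches the boundary record, and the `orderId` dictionary absorbs that overlap. One limitation of this sketch: if `limit` or more records share a single timestamp, the time-based phase stops early, so a real implementation would need to handle that case separately.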
+
+### Verification
+```
+Before fix: 387 records fetched (cursor-only), missing Dec 30 record
+After fix: 1434 records fetched (hybrid), all records including Dec 30
+```
+
+### Key Lessons
+
+1. **Don't trust CCXT wrappers blindly** - they may not expose all pagination mechanisms
+2. **Check raw API responses** - create test scripts to query directly
+3. **Understand pagination limits** - each exchange has different behaviors
+4. **Compare counts** - if you expect N records but get fewer, investigate
+5. **Check multiple endpoints** - the data might exist via different API calls
+
+### Debug Scripts Created
+- `scripts/check_missing_pnl.py` - Query specific orderId directly
+- `scripts/debug_positions_history.py` - Trace pagination behavior
+- `scripts/verify_pagination_fix.py` - Verify fix works
+
+These can be adapted for similar issues on other exchanges.
+
+---
+
+## Template for New Case Studies
+
+### Initial Symptom
+(What the user reported or what was observed)
+
+### Investigation Process
+1. Identify the pattern
+2. Check raw data
+3. Verify data exists at source
+4. Trace the code path
+5. Identify root cause
+6. Implement and verify fix
+
+### Key Lessons
+(What was learned that applies to future debugging)
@@ -0,0 +1,65 @@
+# Exchange API Quirks
+
+This document catalogs known exchange API quirks, limitations, and workarounds discovered during development. When implementing exchange-specific code, check here first.
+
+## Bybit
+
+### Closed-PnL Pagination (Critical)
+
+**Discovered:** 2026-01-25
+
+**Problem:** Bybit's `/v5/position/closed-pnl` endpoint has two pagination mechanisms that behave differently:
+
+1. **Cursor pagination** (`nextPageCursor`): Only covers ~7 days of recent data, then cursor becomes empty
+2. **Time-based pagination** (`endTime`): Can reach older data but may skip records when there are >100 records in a time window
+
+**Symptoms:**
+- Close fills missing PnL (showing `pnl: 0.0`)
+- Inconsistent historical data - some old records present, others missing
+- CCXT's `fetch_positions_history` wrapper doesn't expose the cursor, which makes it unreliable for fetching full history
+
+**Root Cause:**
+When using time-based pagination alone:
+- If a time window has >100 records, only 100 are returned
+- Setting `endTime = oldest_timestamp_in_batch` for next request skips records between batches
+- Example: 128 records exist in window, only 100 fetched, 28 missed
+
+**Solution:** Hybrid pagination in `BybitFetcher._fetch_positions_history`:
+1. Phase 1: Use cursor pagination for recent records (efficient, no gaps)
+2. Phase 2: When cursor exhausts (~7 days back), switch to time-based sliding window
+3. Deduplicate by orderId to handle overlap
+
+**Code Reference:** `src/fill_events_manager.py` - `BybitFetcher._fetch_positions_history()`
+
+**Testing:** Hybrid pagination fetched 1434 records, versus 387 with cursor-only pagination and 1200 with time-only pagination (with gaps)
+
+### Closed-PnL Record Timing
+
+Each close fill on Bybit immediately generates a closed-pnl record with `avgEntryPrice`. This means:
+- PnL can be computed per-fill, not just when position fully closes
+- The formula: `(exit_price - avgEntryPrice) * closedSize * direction - fees`
+- Old fills (>30 days) may have expired closed-pnl records on Bybit's servers
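A worked instance of the per-fill formula above, as a sketch. The source does not define `direction`; this example assumes `+1` for closing a long and `-1` for closing a short, and the function name and sample numbers are illustrative.

```python
def closed_pnl(exit_price: float, avg_entry_price: float, closed_size: float,
               direction: int, fees: float) -> float:
    """PnL for one close fill.

    `direction` is assumed to be +1 for a long position and -1 for a short.
    """
    return (exit_price - avg_entry_price) * closed_size * direction - fees


# Long: entered at 100, closed 2 units at 110, paid 0.5 in fees.
long_pnl = closed_pnl(110.0, 100.0, 2.0, +1, 0.5)

# Short: entered at 100, closed 2 units at 90, paid 0.5 in fees.
short_pnl = closed_pnl(90.0, 100.0, 2.0, -1, 0.5)

print(long_pnl, short_pnl)
```

Both fills earn 10 per unit before fees, so they produce the same result once the sign convention is applied.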
+
+## Binance
+
+(Add Binance-specific quirks here as discovered)
+
+## General Patterns
+
+### CCXT Wrapper Limitations
+
+CCXT normalizes exchange APIs but sometimes loses important data:
+- Pagination cursors may not be exposed
+- Exchange-specific fields may be buried in `info` dict
+- Always check raw response (`trade.get("info", {})`) when CCXT fields are insufficient
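A small helper illustrating the fallback to `info`, as a sketch that uses plain dicts rather than a live CCXT call. The helper name and the sample trade are hypothetical; values inside `info` are shown as strings because raw exchange responses often carry numbers that way.

```python
def raw_field(item: dict, key: str, default=None):
    """Prefer the normalized top-level field; fall back to the raw `info` dict."""
    value = item.get(key)
    if value is not None:
        return value
    info = item.get("info") or {}  # tolerate a missing or None `info`
    return info.get(key, default)


# Illustrative trade shaped like a CCXT result, not real data.
trade = {
    "id": "t1",
    "symbol": "XMR/USDT:USDT",
    "price": 431.2,
    "info": {"closedPnl": "12.5", "orderId": "abc"},
}

print(raw_field(trade, "price"), raw_field(trade, "closedPnl"), raw_field(trade, "missing", 0))
```

Callers still need to convert raw values (for example with `float(...)`) before doing arithmetic on them.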
+
+### Debugging Missing Data
+
+When data appears incomplete:
+1. **Query raw API directly** - bypass CCXT to see actual response
+2. **Check pagination** - are there more pages? Is cursor working?
+3. **Check time windows** - does the time range include the missing data?
+4. **Check data retention** - how long does the exchange keep this data?
+5. **Compare endpoints** - does another endpoint have the data?
+
+See `docs/ai/debugging_case_studies.md` for detailed examples.