Skip to content

Commit e3a3011

Browse files
committed
Add telemetry refcount issue
1 parent 3965026 commit e3a3011

File tree

1 file changed

+1
-0
lines changed

1 file changed

+1
-0
lines changed

.beads/issues.jsonl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,4 +48,5 @@
4848
{"id":"canopy-v5y","title":"Fix service proxy map operation races","description":"Location: /rpc/src/service_proxy.cpp\n\nIssue: proxies_ map operations in on_object_proxy_released() and get_or_create_object_proxy() may have races between creation, release, and cleanup.\n\nFix: Ensure consistent locking pattern (lock → check → operate → unlock). Avoid unlocking and re-locking within single logical operation.\n\nReference: CODE_QUALITY_REVIEW.md Section 2.1.2","status":"open","priority":1,"issue_type":"bug","owner":"edward@boggis-rolfe.com","estimated_minutes":360,"created_at":"2026-01-12T14:41:31.721557621Z","created_by":"Edward Boggis-Rolfe","updated_at":"2026-01-20T14:27:22.519275344Z"}
4949
{"id":"canopy-v64","title":"Add comprehensive public API documentation","description":"Add comprehensive Doxygen-style API documentation for all public headers.\n\nStatus: PLANNED\n\nCurrent State: Sparse comments in public headers\n\nTarget Files:\n- rpc/include/rpc/rpc.h - Main public API\n- rpc/include/rpc/internal/service.h - Service interface\n- rpc/include/rpc/internal/marshaller.h - Marshalling interface\n- All other public headers\n\nRequired Documentation:\n- Function purpose and behavior\n- Parameter descriptions\n- Return value semantics\n- Thread-safety guarantees\n- Exception specifications\n- Usage examples\n- Pre/post-conditions\n\nBenefits:\n- Easier for developers to use the library\n- Better understanding of API contracts\n- Foundation for auto-generated documentation\n\nReference: docs/to_do/CODE_QUALITY_REVIEW.md (Section 5.1)","status":"open","priority":2,"issue_type":"task","owner":"edward@boggis-rolfe.com","estimated_minutes":640,"created_at":"2026-01-12T14:28:36.550506377Z","created_by":"Edward Boggis-Rolfe","updated_at":"2026-01-20T14:27:22.516514407Z","labels":["api","documentation"]}
5050
{"id":"canopy-y7n","title":"transport::inbound_add_ref Reference Counting Bug","description":"## Issue: transport::inbound_add_ref Reference Counting Bug\n\n### Problem\nThe transport::inbound_add_ref method has a bug where it doesn't properly increment transport reference counts or notify passthrough, causing incorrect internal state tracking.\n\n### Root Cause Analysis\n1. Same-zone operations: When forward_transport-\u003einbound_add_ref() is called with forward_dest (same zone), the transport's add_ref_count should increment but doesn't\n2. Passthrough notification failure: Passthrough's add_ref method isn't being called when it should be for cross-zone operations\n3. Complex reference counting rules: Missing logic for different add_ref_options scenarios\n\n### Reference Counting Rules Found\n- If caller_or_destination == service_zone_id: Increment add_ref_count (local service-to-transport)\n- If add_ref_options::normal OR add_ref_options::optimistic: Increment add_ref_count AND notify passthrough\n- If build_caller_route XOR build_dest_channel: Increment add_ref_count AND notify passthrough \n- If build_caller_route \u0026\u0026 build_destination_route: Do NOT increment reference counts (back-channel only)\n\n### Release Counting Rules (Simpler)\n- All releases go through passthrough if not involving local service\n- No transformation of routing context\n\n### Race Condition Warning\nThere's a potential race condition when two threads add the same source/destination pair - needs synchronization protection.\n\n### Fix Plan\n1. Implement correct reference counting logic in transport::inbound_add_ref\n2. Add proper passthrough notification with synchronization\n3. Apply same logic to transport::inbound_release\n4. Handle edge cases and race conditions\n\n### Affected Tests\n- add_ref_optimistic - expecting add_ref_count to be 1, getting 0\n- release_happy_path - expecting release_count to be 1, getting 0 \n- reference_count_balance - expecting send_count to be 2, getting 0\n\nPlease track this issue as high priority for transport layer fixes.","status":"closed","priority":1,"issue_type":"task","owner":"edward@boggis-rolfe.com","created_at":"2026-01-12T23:11:44.832234875Z","created_by":"Edward Boggis-Rolfe","updated_at":"2026-01-20T14:27:22.523450409Z","closed_at":"2026-01-17T20:25:17.614839213Z","close_reason":"Closed"}
51+
{"id":"canopy-y7o","title":"Implement reference counting tracking in i_telemetry_service","description":"Implement comprehensive reference counting tracking in i_telemetry_service to capture all add_ref and release events with their parameters.\n\n## Problem\nCurrent telemetry provides limited visibility into reference counting operations. We cannot trace:\n- Which add_ref/release events are happening\n- What options parameters are passed (add_ref_options, release_options)\n- The state of shared_count_ and optimistic_count_ atomics\n- Cross-zone reference counting patterns\n\n## Requirements\n\n### 1. Telemetry Events\nAdd new telemetry event types:\n- `add_ref_send` - When add_ref is sent across transport\n- `add_ref_received` - When add_ref is received inbound\n- `add_ref_local` - When add_ref is local (same zone)\n- `release_send` - When release is sent across transport\n- `release_received` - When release is received inbound\n- `release_local` - When release is local (same zone)\n\n### 2. Event Payload\nEach event should capture:\n```cpp\nstruct ref_count_event {\n uint64_t object_id;\n uint32_t zone_id;\n uint32_t remote_zone_id; // 0 if local\n bool is_optimistic; // true for optimistic_ptr, false for shared_ptr\n uint32_t previous_count; // Count before operation\n uint32_t new_count; // Count after operation\n add_ref_options_t options; // If add_ref\n release_options_t options; // If release\n bool is_local; // true if same zone operation\n bool build_caller_route; // From routing context\n bool build_dest_channel; // From routing context\n transport_type_t transport_type; // TCP, SPSC, PassThrough, etc.\n};\n```\n\n### 3. Visualization Integration\nUpdate animation_telemetry_service.cpp:\n- Track ref_count_events[] alongside existing telemetry events\n- Show reference count bubbles on object proxies and stubs\n- Color-code: green for shared_ptr, orange for optimistic_ptr\n- Animate count changes with +/- indicators\n- Display count in tooltip on hover\n- Show cross-zone reference tracking with dashed lines\n\n### 4. Benefits\n- Debug reference counting issues (canopy-7i3)\n- Visualize reference lifecycle across zones\n- Detect reference leaks (objects never released)\n- Debug race conditions in add_ref/release sequencing\n- Verify transport failure cleanup (canopy-bq5)\n\n## Implementation Tasks\n\n### Phase 1: Core Infrastructure (Week 1)\n- Define ref_count_event struct in telemetry headers\n- Add telemetry callback in i_telemetry_service.idl\n- Implement event capture in transport::inbound_add_ref/inbound_release\n- Add event capture in object_proxy::add_ref/object_proxy::release\n\n### Phase 2: Event Processing (Week 2)\n- Update animation_telemetry_service.cpp to handle new events\n- Store ref_count_events[] array (separate from nodes/links)\n- Update ensureZoneNode() and related functions\n- Add visualization rendering for reference counts\n\n### Phase 3: Visualization (Week 2-3)\n- Render count bubbles on proxy/stub nodes\n- Implement hover tooltips showing count history\n- Add cross-zone reference tracking lines\n- Color coding and animations\n\n### Phase 4: Testing (Week 3)\n- Test with existing reference counting tests\n- Verify no performance impact on production code\n- Test visualization with multi-zone scenarios\n\n## Files to Modify\n- telemetry/include/rpc/telemetry/... (new event structures)\n- telemetry/src/animation_telemetry_service.cpp\n- rpc/src/object_proxy.cpp (add telemetry callbacks)\n- rpc/src/transport.cpp (add telemetry callbacks)\n- i_telemetry_service.idl (add new event type)\n\n## Related Issues\n- canopy-7i3: Verify object proxy reference counting\n- canopy-bq5: Implement transport failure detection\n- canopy-g4h: Fix pending transmits shutdown race\n\n## Success Criteria\n- All add_ref/release events captured in telemetry\n- Visualization shows current reference counts on all proxies/stubs\n- Cross-zone reference relationships visible\n- Tooltip shows detailed event history per object","status":"open","priority":2,"issue_type":"task","owner":"edward@boggis-rolfe.com","estimated_minutes":960,"created_at":"2026-01-21T13:01:00Z","created_by":"Edward Boggis-Rolfe","updated_at":"2026-01-21T13:01:00Z","labels":["debugging","reference-counting","telemetry","visualization"]}
5152
{"id":"canopy-zw9","title":"Document and mitigate deadlock risks","description":"Eliminate deadlock risks through architectural improvements and documentation.\n\nStatus: PLANNED - Partially addressed in thread safety analysis\n\nCurrent State:\n- service.cpp contains explicit warnings about deadlock risks\n- Uses scoped blocks to manage lock lifetimes (fragile approach)\n- Complex interactions between stub_control_ and zone_control_ mutexes\n- Transport destination management already implements deadlock prevention via lock ordering\n\nProposed Actions:\n1. Review and document all mutex acquisition patterns\n2. Analyze stub_control_ and zone_control_ lock interactions\n3. Consider lock hierarchy to prevent circular dependencies\n4. Evaluate alternative concurrency models (e.g., actor model for internal state management)\n5. Reduce scope of critical sections where possible\n6. Document lock ordering policies\n\nRisk Assessment: High priority due to potential runtime failures\n\nBenefits:\n- Prevent runtime deadlocks\n- More maintainable concurrent code\n- Better understanding of concurrency patterns\n\nReference: docs/to_do/CODE_QUALITY_REVIEW.md (Section 2.2)\nRelated: canopy-mno (thread safety and race conditions)","status":"open","priority":1,"issue_type":"task","owner":"edward@boggis-rolfe.com","estimated_minutes":960,"created_at":"2026-01-12T14:28:36.746667761Z","created_by":"Edward Boggis-Rolfe","updated_at":"2026-01-20T14:27:22.520325084Z","labels":["architecture","concurrency","safety"],"dependencies":[{"issue_id":"canopy-zw9","depends_on_id":"canopy-mno","type":"blocks","created_at":"2026-01-12T14:29:35.29221463Z","created_by":"Edward Boggis-Rolfe"}]}

0 commit comments

Comments
 (0)