Skip to content

Commit b525a20

Browse files
committed
Add comprehensive documentation for execution tracking features
- Document execution tracking and timeout monitoring - Include configuration examples and environment variables - Provide debugging guide for stuck executions - Explain log formats and locations - Add testing instructions and examples - List all modified/created files - Include future enhancement ideas
1 parent 3fa8852 commit b525a20

File tree

1 file changed

+237
-0
lines changed

1 file changed

+237
-0
lines changed

docs/EXECUTION_TRACKING.md

Lines changed: 237 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,237 @@
1+
# Execution Tracking and Timeout Monitoring
2+
3+
This document describes the execution tracking and timeout monitoring features added to improve observability and debugging of stuck Jupyter notebook executions.
4+
5+
## Overview
6+
7+
The execution tracking system provides:
8+
9+
1. **Execution Event Tracking**: Logs every cell execution with start/end times, duration, and success status
10+
2. **Execution Metadata Publishing**: Publishes structured metadata to the webapp for monitoring
11+
3. **ZMQ Message Flow Logging**: Enhanced debugging of kernel-server communication
12+
4. **Execution Timeout Monitoring**: Optional monitoring with warnings and auto-interrupt for stuck executions
13+
14+
## Features
15+
16+
### 1. Execution Event Tracking
17+
18+
**Location**: `deepnote_toolkit/execution_tracking.py`
19+
20+
Automatically tracks all notebook cell executions and logs:
21+
- Execution start with cell preview
22+
- Execution end with duration and success status
23+
- Error types if execution fails
24+
25+
**Log Format**:
26+
```
27+
EXEC_START | count=1 | cell_id=12345 | preview=import pandas as pd\ndf = pd.read_csv(...
28+
EXEC_END | count=1 | duration=2.45s | success=True
29+
```
30+
31+
This is **enabled by default** and requires no configuration.
32+
33+
### 2. Execution Metadata Publishing
34+
35+
**Location**: `deepnote_toolkit/ipython_utils.py`
36+
37+
Publishes execution metadata to the webapp via the `application/vnd.deepnote.execution-metadata+json` MIME type.
38+
39+
**Metadata Structure**:
40+
```python
41+
{
42+
"execution_count": 1,
43+
"duration_seconds": 2.45,
44+
"success": True,
45+
"error_type": "KeyError" # Only present if execution failed
46+
}
47+
```
48+
49+
This allows the webapp to:
50+
- Display execution duration in the UI
51+
- Track execution history
52+
- Alert on failed executions
53+
- Build execution analytics
54+
55+
### 3. ZMQ Session Debug Logging
56+
57+
**Location**: `deepnote_core/resources/jupyter/jupyter_server_config.py:817`
58+
59+
Enables detailed ZMQ message logging by setting:
60+
```python
61+
c.Session.debug = True
62+
```
63+
64+
This logs all ZMQ messages exchanged between the Jupyter server and kernel, which is critical for debugging:
65+
- Message delays
66+
- Message loss
67+
- Protocol issues
68+
- Kernel communication problems
69+
70+
**Enhanced Logging Configuration** (line 418-426):
71+
```python
72+
c.ServerApp.logging_config = {
73+
"loggers": {
74+
"tornado.access": {"level": "DEBUG"},
75+
"jupyter_server.serverapp": {"level": "DEBUG"},
76+
"jupyter_client.session": {"level": "DEBUG"},
77+
"execution_tracker": {"level": "INFO"},
78+
"execution_tracking": {"level": "INFO"},
79+
}
80+
}
81+
```
82+
83+
### 4. Execution Timeout Monitoring (Optional)
84+
85+
**Location**: `deepnote_toolkit/execution_timeout.py`
86+
87+
Optional feature that monitors execution duration and can:
88+
- Send warnings when executions exceed a threshold
89+
- Report to the webapp
90+
- Automatically interrupt stuck executions (if enabled)
91+
92+
**Configuration via Environment Variables**:
93+
94+
```bash
95+
# Enable timeout monitoring (default: false)
96+
export DEEPNOTE_ENABLE_EXECUTION_TIMEOUT=true
97+
98+
# Warning threshold in seconds (default: 240 = 4 minutes)
99+
export DEEPNOTE_EXECUTION_WARNING_THRESHOLD=240
100+
101+
# Timeout threshold in seconds (default: 300 = 5 minutes)
102+
export DEEPNOTE_EXECUTION_TIMEOUT_THRESHOLD=300
103+
104+
# Enable automatic interrupt of stuck executions (default: false)
105+
# WARNING: Use with caution! This will send SIGINT to interrupt executions
106+
export DEEPNOTE_EXECUTION_AUTO_INTERRUPT=false
107+
```
108+
109+
**Example Usage**:
110+
```bash
111+
# Enable with default thresholds
112+
export DEEPNOTE_ENABLE_EXECUTION_TIMEOUT=true
113+
114+
# Enable with custom thresholds and auto-interrupt
115+
export DEEPNOTE_ENABLE_EXECUTION_TIMEOUT=true
116+
export DEEPNOTE_EXECUTION_WARNING_THRESHOLD=120 # Warn after 2 minutes
117+
export DEEPNOTE_EXECUTION_TIMEOUT_THRESHOLD=180 # Timeout after 3 minutes
118+
export DEEPNOTE_EXECUTION_AUTO_INTERRUPT=true # Auto-interrupt
119+
```
120+
121+
## Log Locations
122+
123+
All execution tracking logs are written to the standard Deepnote log location:
124+
- **File**: `$DEEPNOTE_LOG_DIR/helpers.log` (or XDG default)
125+
- **Scraped by**: Loki for centralized logging
126+
127+
## Debugging Stuck Executions
128+
129+
When investigating stuck executions, follow this process:
130+
131+
1. **Check Execution Logs**: Look for `EXEC_START` without matching `EXEC_END`
132+
```bash
133+
grep "EXEC_START\|EXEC_END" $DEEPNOTE_LOG_DIR/helpers.log
134+
```
135+
136+
2. **Check ZMQ Message Flow**: Look for message delays or missing messages
137+
```bash
138+
grep "Session.send\|Session.recv" /var/log/jupyter-server.log
139+
```
140+
141+
3. **Check for Long Executions**: Look for `LONG_EXECUTION` warnings
142+
```bash
143+
grep "LONG_EXECUTION" $DEEPNOTE_LOG_DIR/helpers.log
144+
```
145+
146+
4. **Enable Timeout Monitoring**: If not already enabled, set environment variables and restart kernel
147+
148+
## Implementation Details
149+
150+
### Execution Tracking Flow
151+
152+
1. User executes a cell
153+
2. IPython fires `pre_execute` event → `ExecutionTracker.on_pre_execute()`
154+
- Logs `EXEC_START`
155+
- Records start time and cell metadata
156+
3. Cell executes...
157+
4. IPython fires `post_execute` event → `ExecutionTracker.on_post_execute()`
158+
- Calculates duration
159+
- Logs `EXEC_END` with duration and success status
160+
- Publishes metadata to webapp
161+
162+
### Timeout Monitoring Flow (if enabled)
163+
164+
1. `pre_execute` event → `ExecutionTimeoutMonitor.on_pre_execute()`
165+
- Starts warning timer (default: 240s)
166+
- Starts timeout timer (default: 300s) if auto-interrupt enabled
167+
2. If warning timer fires → `_send_warning()`
168+
- Logs `LONG_EXECUTION` warning
169+
- Reports to webapp
170+
3. If timeout timer fires → `_interrupt_execution()`
171+
- Logs `TIMEOUT_INTERRUPT` error
172+
- Reports to webapp
173+
- Sends SIGINT to kernel (if auto-interrupt enabled)
174+
4. `post_execute` event → cancels all timers
175+
176+
## Files Modified/Created
177+
178+
### New Files
179+
- `deepnote_toolkit/execution_tracking.py` - Execution tracking implementation
180+
- `deepnote_toolkit/execution_timeout.py` - Timeout monitoring implementation
181+
- `docs/EXECUTION_TRACKING.md` - This documentation
182+
183+
### Modified Files
184+
- `deepnote_toolkit/runtime_initialization.py` - Added tracking and timeout setup
185+
- `deepnote_toolkit/ipython_utils.py` - Added execution metadata publishing
186+
- `deepnote_core/resources/jupyter/jupyter_server_config.py` - Enhanced logging config
187+
188+
## Testing
189+
190+
To test the execution tracking:
191+
192+
1. Start a Jupyter kernel with the toolkit
193+
2. Execute a simple cell:
194+
```python
195+
print("Hello, world!")
196+
```
197+
3. Check logs for execution tracking:
198+
```bash
199+
tail -f $DEEPNOTE_LOG_DIR/helpers.log | grep "EXEC_"
200+
```
201+
4. You should see:
202+
```
203+
EXEC_START | count=1 | cell_id=... | preview=print("Hello, world!")
204+
EXEC_END | count=1 | duration=0.01s | success=True
205+
```
206+
207+
To test timeout monitoring:
208+
209+
1. Enable timeout monitoring:
210+
```bash
211+
export DEEPNOTE_ENABLE_EXECUTION_TIMEOUT=true
212+
export DEEPNOTE_EXECUTION_WARNING_THRESHOLD=5 # 5 seconds for testing
213+
```
214+
2. Restart kernel
215+
3. Execute a long-running cell:
216+
```python
217+
import time
218+
time.sleep(10)
219+
```
220+
4. After 5 seconds, you should see a `LONG_EXECUTION` warning in the logs
221+
222+
## Future Enhancements
223+
224+
Potential improvements to consider:
225+
226+
1. **Kernel Heartbeat Monitoring**: Monitor ZMQ heartbeat channel to detect dead kernels
227+
2. **Message Queue Metrics**: Track IOPub message queue depth to detect backpressure
228+
3. **Execution History**: Store execution history in database for analytics
229+
4. **WebApp Integration**: Display execution metrics in real-time in the UI
230+
5. **Adaptive Timeouts**: Adjust timeout thresholds based on historical execution patterns
231+
6. **Execution Profiling**: Integrate with Python profilers for detailed performance analysis
232+
233+
## References
234+
235+
- Jupyter Client Protocol: https://jupyter-client.readthedocs.io/en/stable/messaging.html
236+
- IPython Events: https://ipython.readthedocs.io/en/stable/config/callbacks.html
237+
- ZMQ Guide: https://zguide.zeromq.org/

0 commit comments

Comments
 (0)