feat(subscribers): add JSONL event log subscriber #6420
Conversation
Greptile Summary

This PR introduces an `EventLogSubscriber` that writes query lifecycle events to a per-run JSONL file.

Key changes:

Issues found:

Confidence Score: 3/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DaftContext
    participant EventLogSubscriber
    participant JSONL as events.jsonl
    User->>DaftContext: enable_event_log(dir?)
    DaftContext->>EventLogSubscriber: __init__(log_dir, run_id)
    EventLogSubscriber->>JSONL: session_started
    DaftContext->>DaftContext: attach_subscriber("_daft_event_log", subscriber)
    User->>DaftContext: df.collect()
    DaftContext->>EventLogSubscriber: on_query_start(query_id, metadata)
    EventLogSubscriber->>JSONL: query_started + plan_unoptimized
    DaftContext->>EventLogSubscriber: on_optimization_start(query_id)
    EventLogSubscriber->>JSONL: optimization_started
    DaftContext->>EventLogSubscriber: on_optimization_end(query_id, plan)
    EventLogSubscriber->>JSONL: optimization_ended + plan_optimized
    DaftContext->>EventLogSubscriber: on_exec_start(query_id, physical_plan)
    EventLogSubscriber->>JSONL: execution_started + plan_physical
    par Concurrent operator callbacks
        DaftContext->>EventLogSubscriber: on_exec_operator_start(query_id, node_id=N)
        EventLogSubscriber->>JSONL: operator_started
        DaftContext->>EventLogSubscriber: on_exec_emit_stats(query_id, stats)
        EventLogSubscriber->>JSONL: stats (per node)
        DaftContext->>EventLogSubscriber: on_exec_operator_end(query_id, node_id=N)
        EventLogSubscriber->>JSONL: operator_ended
    end
    DaftContext->>EventLogSubscriber: on_result_out(query_id, partition)
    EventLogSubscriber->>JSONL: result_out
    DaftContext->>EventLogSubscriber: on_exec_end(query_id)
    EventLogSubscriber->>JSONL: execution_ended
    DaftContext->>EventLogSubscriber: on_query_end(query_id, result)
    EventLogSubscriber->>JSONL: query_ended
    User->>DaftContext: disable_event_log()
    DaftContext->>DaftContext: detach_subscriber("_daft_event_log")
    DaftContext->>EventLogSubscriber: close()
    EventLogSubscriber->>JSONL: (file closed)
```
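Since each event is one JSON object per line, consuming the log afterwards is a one-liner per record. A minimal hedged sketch (the `read_events` helper and the sample field values are illustrative, not part of the PR; only the `"event"`/`"ts"` field names appear in the diff):

```python
import json
import tempfile


def read_events(path):
    # Assumption: each non-empty line of events.jsonl is one JSON object
    # with at least an "event" name and a "ts" timestamp.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative file with made-up timestamps:
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"event": "query_started", "ts": "2024-01-01T00:00:00Z"}\n')
    f.write('{"event": "query_ended", "ts": "2024-01-01T00:00:01Z"}\n')
    tmp = f.name

events = read_events(tmp)
```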
Last reviewed commit: "fixes"
```python
_EVENT_LOG_SUBSCRIBER: EventLogSubscriber | None = None


def enable_event_log(dir: str | Path | None = None) -> None:
```
**Parameter name `dir` shadows a built-in**

`dir` is a Python built-in function. Using it as a parameter name silently shadows the built-in for the entire function scope, which can cause confusing bugs if the built-in is ever needed inside the function. A more descriptive name would also improve readability:
```diff
-def enable_event_log(dir: str | Path | None = None) -> None:
+def enable_event_log(log_dir: str | Path | None = None) -> None:
```
And update the body accordingly (`dir or _DEFAULT_EVENT_LOG_DIR` → `log_dir or _DEFAULT_EVENT_LOG_DIR`).
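The hazard is easy to demonstrate. In this hypothetical toy function (not the PR's code), the parameter `dir` shadows the built-in `dir()` for the whole function body, so an attempt to call the built-in actually calls the parameter:

```python
def list_attrs(dir=None):
    # The parameter `dir` shadows the built-in dir() here.
    try:
        return dir()  # intends the built-in, but calls the parameter
    except TypeError:
        return "shadowed: parameter is not callable"


result = list_attrs()
```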
```python
    return time.monotonic() * 1000


def _generate_run_id() -> str:
```
We gave these a nice human-readable ID that shows up in the dashboard in #6310. Would it be possible to reuse that so users can quickly correlate runs in the dashboard with these events?
That's going to be tricky. Those names are associated with a query and a script can execute many queries. This means we would need to maintain and write multiple files. I thought for the first version it would be easier to write to a single file per script execution. Thoughts?
I am of the opinion that it makes the most sense to have the query be the atomic unit, as opposed to the script. @rchowell and I had a lengthy discussion about this the other day, and he may have some opinions on it as well. I think there are pros/cons on both sides, but ultimately I think having everything 1-1 with the query (not the script) will make things easier to reason about. Additionally, considering the dashboard uses queries as the atomic unit, it would be odd to me if we introduced an additional grouping here that does not exist there.
Happy to have further discussions as well.
My 2-cents is log to a file whose name is the query id.
Put it in some conventional location, then it's trivial to show logs in the dashboard for each query.
```python
def _write_event(self, event_name: str, payload: dict[str, Any]) -> None:
    if self._closed:
        return
    record: dict[str, Any] = {"event": event_name, "ts": _iso_now()}
    record.update(payload)
    try:
        self._file.write(json.dumps(record, default=_json_default) + "\n")
    except OSError:
        pass  # Don't let logging failures affect query execution


def close(self) -> None:
    if self._closed:
        return
    self._closed = True
    self._file.close()
```
**Shared mutable state and file writes are not thread-safe**

The sample log in the PR description shows concurrent operator_started events just 2ms apart (node IDs 1-4 at 23:33:54.901, .903, .905, .906), confirming that subscriber callbacks are invoked from multiple threads simultaneously.

The current implementation has several unprotected shared-state operations:

- `self._operator_starts[(query_id, node_id)] = _mono_ms()` and `self._operator_starts.pop(...)` in `on_exec_operator_start`/`on_exec_operator_end` are called concurrently from different threads
- `self._closed` is checked and `self._file.write(...)` is called non-atomically in `_write_event`: a thread could pass the `if self._closed: return` guard just before another thread runs `close()`, then attempt a write on a closed file (though `OSError` is caught)
- With Python 3.13+ free-threaded mode (`--disable-gil`) or non-CPython runtimes, individual dict operations are also no longer atomic

A `threading.Lock` should guard both the file write and any mutation of the timing-state dicts:
```python
import threading


def __init__(self, log_dir: str | Path, run_id: str | None = None) -> None:
    ...
    self._lock = threading.Lock()
    ...


def _write_event(self, event_name: str, payload: dict[str, Any]) -> None:
    if self._closed:
        return
    record: dict[str, Any] = {"event": event_name, "ts": _iso_now()}
    record.update(payload)
    try:
        with self._lock:
            self._file.write(json.dumps(record, default=_json_default) + "\n")
    except OSError:
        pass
```

And similarly wrap mutations to `_operator_starts` / `_exec_starts` / etc. in the same lock (or a separate state lock).
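A self-contained runnable sketch of the same locking pattern, using a hypothetical `SafeWriter` class (an in-memory stand-in, not the PR's subscriber), shows concurrent writers producing intact one-record-per-line output:

```python
import io
import json
import threading


class SafeWriter:
    """Toy illustration: guard shared writes and the closed flag with one lock."""

    def __init__(self):
        self._buf = io.StringIO()
        self._lock = threading.Lock()
        self._closed = False

    def write_event(self, name, payload):
        record = {"event": name, **payload}
        line = json.dumps(record) + "\n"
        with self._lock:
            if self._closed:  # checked under the lock, so close() cannot race
                return
            self._buf.write(line)

    def close(self):
        with self._lock:
            self._closed = True


w = SafeWriter()
threads = [
    threading.Thread(target=w.write_event, args=("operator_started", {"node_id": i}))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
lines = w._buf.getvalue().splitlines()
```

Checking `self._closed` inside the `with self._lock:` block is what eliminates the check-then-write race described above.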
```python
def enable_event_log(dir: str | Path | None = None) -> None:
    """Experimental helper that attaches an event-log subscriber.

    This API is currently intended for local event-log capture through
    `enable_event_log()` / `disable_event_log()`.
    """
    global _EVENT_LOG_ATEXIT_REGISTERED, _EVENT_LOG_SUBSCRIBER
    if _EVENT_LOG_SUBSCRIBER is not None:
        disable_event_log()
    if not _EVENT_LOG_ATEXIT_REGISTERED:
        atexit.register(disable_event_log)
        _EVENT_LOG_ATEXIT_REGISTERED = True

    subscriber = EventLogSubscriber(dir or _DEFAULT_EVENT_LOG_DIR)
    try:
        get_context().attach_subscriber(_EVENT_LOG_ALIAS, subscriber)
    except Exception:
        subscriber.close()
        raise
    _EVENT_LOG_SUBSCRIBER = subscriber
```
**`enable_event_log` returns `None`, making the log path undiscoverable**

After `enable_event_log()` is called, the user has no way to know where the log is being written without reaching into the private `_EVENT_LOG_SUBSCRIBER._events_path` attribute. Since this is the primary user-facing entry point for the feature, returning the log path would significantly improve discoverability, especially for new users who may not know the default `~/.daft/events/` location.
Consider returning the path to the generated log file:
```diff
-def enable_event_log(dir: str | Path | None = None) -> None:
+def enable_event_log(log_dir: str | Path | None = None) -> Path:
     """Experimental helper that attaches an event-log subscriber.
 
     This API is currently intended for local event-log capture through
     `enable_event_log()` / `disable_event_log()`.
+
+    Returns:
+        Path to the events.jsonl file for this run.
     """
     global _EVENT_LOG_ATEXIT_REGISTERED, _EVENT_LOG_SUBSCRIBER
     if _EVENT_LOG_SUBSCRIBER is not None:
         disable_event_log()
     if not _EVENT_LOG_ATEXIT_REGISTERED:
         atexit.register(disable_event_log)
         _EVENT_LOG_ATEXIT_REGISTERED = True
 
-    subscriber = EventLogSubscriber(dir or _DEFAULT_EVENT_LOG_DIR)
+    subscriber = EventLogSubscriber(log_dir or _DEFAULT_EVENT_LOG_DIR)
     try:
         get_context().attach_subscriber(_EVENT_LOG_ALIAS, subscriber)
     except Exception:
         subscriber.close()
         raise
     _EVENT_LOG_SUBSCRIBER = subscriber
+    return subscriber._events_path
```
```python
def _write_event(self, event_name: str, payload: dict[str, Any]) -> None:
    if self._closed:
        return
    record: dict[str, Any] = {"event": event_name, "ts": _iso_now()}
    record.update(payload)
    try:
        self._file.write(json.dumps(record, default=_json_default) + "\n")
    except OSError:
        pass  # Don't let logging failures affect query execution
```
**Payload keys `event` and `ts` can be silently overwritten**

`record.update(payload)` is called after the `event` and `ts` keys are set. If any internal caller ever accidentally includes `"event"` or `"ts"` in the payload dict, those values will be silently clobbered without any error. Since this is an internal method, this is an unlikely bug today, but it's a defensive programming issue that could cause subtle, hard-to-debug corruption as the event type roster grows.
```diff
 def _write_event(self, event_name: str, payload: dict[str, Any]) -> None:
     if self._closed:
         return
-    record: dict[str, Any] = {"event": event_name, "ts": _iso_now()}
+    ts = _iso_now()
+    record: dict[str, Any] = {"event": event_name, "ts": ts}
     record.update(payload)
+    # Ensure the meta-keys cannot be overridden by payload
+    record["event"] = event_name
+    record["ts"] = ts
     try:
         self._file.write(json.dumps(record, default=_json_default) + "\n")
     except OSError:
         pass  # Don't let logging failures affect query execution
```
Or, more simply, merge in the opposite order so the meta-keys always win:

```python
record: dict[str, Any] = {**payload, "event": event_name, "ts": _iso_now()}
```
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #6420      +/-   ##
==========================================
- Coverage   74.79%   74.76%   -0.03%
==========================================
  Files        1021     1022       +1
  Lines      136569   136711     +142
==========================================
+ Hits       102142   102212      +70
- Misses      34427    34499      +72
```
Cool! Looking at the log output, is the plan to add the operator multiline displays in the output too? E.g. `"Filter: col(x) > 5"`, `"Project: col(x) + 1, col(y) * 2"`? That would be super useful.
Changes Made

`EventLogSubscriber` writes query lifecycle events (start, optimization, execution, operator, stats) to a per-run JSONL file, under the `~/.daft/events/` directory by default. This is configurable. Includes `enable_event_log()`/`disable_event_log()` helpers for global attachment with `atexit` cleanup.

Each log is contained in a run directory named with the following format:
`run_YYYYMMDD_HHMMSS_<4 hex chars>`. This is experimental and the API will probably change in future PRs.
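A minimal sketch of a helper producing IDs in that format (the `make_run_id` name is hypothetical; the PR's actual `_generate_run_id` implementation is not shown in this excerpt):

```python
import secrets
from datetime import datetime


def make_run_id() -> str:
    # Matches the documented format: run_YYYYMMDD_HHMMSS_<4 hex chars>.
    # token_hex(2) yields 2 random bytes, i.e. 4 hex characters.
    return f"run_{datetime.now():%Y%m%d_%H%M%S}_{secrets.token_hex(2)}"


rid = make_run_id()
```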
Related Issues
Sample log