fix: send actual EndTime in UpdateActionStatus to fix inflated durations#7089
Merged
fix: send actual EndTime in UpdateActionStatus to fix inflated durations#7089
Conversation
The k8s actions client was not including EndTime when reporting task completion via UpdateActionStatusRequest. This caused the server to fall back to time.Now(), which could be significantly later than the actual task completion time, resulting in inflated durations displayed in the UI (e.g., "1h 10m" instead of "22s"). Extract the OccurredAt timestamp from the last PhaseHistory entry when it represents a terminal phase and include it as EndTime in the status update request. Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Add a started_at column to the actions table that records when the action first enters the RUNNING phase. Duration is now calculated as ended_at - started_at (execution time only) instead of ended_at - created_at (which included queue and setup time). This fixes the left panel showing inflated durations (e.g., "1m 3s" for a task that executed in 2s) because the previous calculation included the time spent waiting in the queue. Also update StartTime in API responses to use started_at when available, so the UI shows the actual execution start time. Signed-off-by: Kevin Su <pingsutw@apache.org>
Tasks may skip RUNNING and go QUEUED → INITIALIZING → SUCCEEDED directly. The previous code only set started_at on RUNNING, leaving it NULL for tasks that skip that phase. Now started_at is set on any phase beyond the waiting stages (INITIALIZING, RUNNING, or terminal). Signed-off-by: Kevin Su <pingsutw@apache.org>
taskActionStatusChanged() did not compare PhaseHistory, so when a new phase was appended to the history but no other status field changed, the change went undetected and no ActionEvent was emitted. This caused intermediate phases like RUNNING to be silently dropped for fast tasks where the plugin transitions quickly between reconciliation cycles. Add a PhaseHistory length comparison so that new phase transitions always trigger an event to be recorded. Signed-off-by: Kevin Su <pingsutw@apache.org>
terminalPhaseTimestamp was comparing against incorrect phase names
("Succeeded", "Failed", "Aborted", "TimedOut") instead of the actual
ConditionReason constants used in PhaseHistory ("Completed",
"PermanentFailure", "RetryableFailure", "Aborted"). This caused the
function to never match terminal phases, so endTime was always nil,
forcing the run service to fall back to time.Now() on every watch
event — making completed action durations grow indefinitely.
Signed-off-by: Kevin Su <pingsutw@apache.org>
When a child TaskAction is created in k8s, the parent must already be running (it created the child). However, due to the 5-second reconcile interval, the parent's k8s condition may still show Initializing while children are already executing, causing the UI to display "Initializing" for the parent. Fix this by sending an UpdateActionStatus(RUNNING) for the parent whenever a child action's Added event is observed in the watcher. Also add a phase monotonicity guard (phase <= ?) in UpdateActionPhase to prevent phase downgrades from concurrent updates. Signed-off-by: Kevin Su <pingsutw@apache.org>
Store actual pod container start/finish times (ExecutionStartedAt, CompletedAt) in TaskAction CRD status, separate from PhaseHistory which uses controller observation time. The watcher and run service now prefer these actual timestamps for duration computation, falling back to PhaseHistory for backward compatibility. This fixes the duration mismatch where the sidebar showed inflated durations (e.g. 15s) while the right panel showed correct times (e.g. 2s), caused by controller polling delay inflating PhaseHistory timestamps. Signed-off-by: Kevin Su <pingsutw@apache.org>
Set MaxConcurrentReconciles to 10 (was default 1) and reduce TaskActionDefaultRequeueDuration from 5s to 1s. This dramatically reduces the time for tasks to progress through their lifecycle, cutting per-task processing time from ~50s to ~3s. Signed-off-by: Kevin Su <pingsutw@apache.org>
Action events arrive via the fast path (controller → events proxy → RecordActionEvents) while actions.phase is updated via the slow path (K8s watcher → UpdateActionStatus). This causes the sidebar (reading actions.phase) to show "queue" while the right panel (reading action_events) already shows "completed". Fix: when recording action events, also advance the actions table phase to match the highest event phase. This ensures both data paths converge immediately. Signed-off-by: Kevin Su <pingsutw@apache.org>
…s duration The fast path (recordEvents) was passing controller observation timestamps as startTime/endTime, which could corrupt started_at/ended_at ordering when the slow path later set the real pod timestamps. Pass nil instead so only the slow path (K8s watcher) sets accurate timestamps. Signed-off-by: Kevin Su <pingsutw@apache.org>
The K8s watcher replays all existing TaskAction CRDs as ADDED events on every watch reconnect (every 5 minutes). With 187+ old CRDs, each triggering a synchronous UpdateActionStatus RPC, the event loop was blocked for ~40 seconds. This caused: 1. UI status appearing 40s late (sidebar showed no status) 2. Main task pods waiting 40s+ for subtask completion notification 3. Main task duration inflated from ~9s to ~49s Fix: skip terminal ADDED events (already completed) entirely in handleWatchEvent, and skip duplicate ADDED events (bloom filter hit) entirely in notifyRunService. This reduces cold-start processing from 187 RPCs to ~11 (only non-terminal actions). Also fixes duration computation: use actual pod timestamps only (no time.Now() fallback), handle SQL UPDATE reading old started_at value, and remove CreatedAt fallback for StartTime in proto conversion. Results: run duration dropped from 49s to 9s, status appears in <3s. Signed-off-by: Kevin Su <pingsutw@apache.org>
Fixes three issues from the previous commit: 1. Terminal ADDED events were skipped unconditionally, but on cold start they haven't been processed yet — their timestamps were never set. Now only skip when the bloom filter confirms they're duplicates. 2. Reverted passing event timestamps as startTime from the fast path. Event timestamps are controller observation times that diverge from real pod times, causing duration to jump (e.g., 6s → 2s) when the slow path arrives with accurate timestamps. 3. Added GREATEST(0, ...) clamp on duration_ms SQL to prevent negative values from edge cases where started_at > ended_at. Signed-off-by: Kevin Su <pingsutw@apache.org>
On watch reconnects, K8s replays all existing CRDs as ADDED events. Previously, terminal duplicates in the bloom filter were skipped entirely, including UpdateActionStatus. This left actions with NULL started_at/ended_at/duration_ms if MODIFIED events were missed before the reconnect. Now: - Non-terminal duplicates: still skip entirely (performance) - Terminal duplicates: skip RecordAction/parent promotion, but still call UpdateActionStatus to repair potentially missing timestamps - Parent promotion also gated by isDuplicate to avoid redundant RPCs Signed-off-by: Kevin Su <pingsutw@apache.org>
The controller now sets StartTime on ActionEvent from the CRD's ExecutionStartedAt field. The fast path (recordEvents) uses this real pod start time to set started_at immediately, enabling the frontend to show a live duration counter from the INITIALIZING phase. Previously the fast path passed nil for startTime, so duration was only visible after the slow path (K8s watcher) arrived 1-2s later. Signed-off-by: Kevin Su <pingsutw@apache.org>
Use the event's UpdatedTime (from the controller's PhaseHistory) to set started_at via the fast path, enabling the frontend to show a live duration counter from the first non-queued phase. This is simpler than modifying the controller to set StartTime, since UpdatedTime is already populated on every event. The SQL uses COALESCE(started_at, ?) so the first value wins and the slow path's real pod start time won't be overwritten if it arrives first. Signed-off-by: Kevin Su <pingsutw@apache.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tracking issue
N/A
Why are the changes needed?
The V2 UI left panel shows incorrect durations for tasks (e.g., "1h 10m" instead of "22s"). This happens because the k8s actions client never sends the
EndTimefield when reporting task completion viaUpdateActionStatusRequest. The server then falls back totime.Now()as the end time, which can be significantly later than the actual task completion time. The duration is then calculated astime.Now() - created_at, which includes all queue time and processing delays.What changes were proposed in this pull request?
terminalPhaseTimestamp()helper that extracts theOccurredAttimestamp from the last entry in the TaskAction'sPhaseHistorywhen it represents a terminal phase (Succeeded, Failed, Aborted, TimedOut)EndTimein theUpdateActionStatusRequestusing the actual completion timestamp from the TaskAction's phase historyFiles changed:
actions/k8s/client.go— AddedterminalPhaseTimestamp()helper and setEndTimein status requestactions/k8s/client_test.go— Added integration test for EndTime inclusion and table-driven unit tests forterminalPhaseTimestamp()How was this patch tested?
TestNotifyRunService_UpdateActionStatusIncludesEndTime— verifies thatnotifyRunServicesends the correctEndTimefrom PhaseHistory when a task succeedsTestTerminalPhaseTimestamp— table-driven test covering: empty history, non-terminal phase, and all 4 terminal phases (Succeeded, Failed, Aborted, TimedOut)Check all the applicable boxes
main