Skip to content

fix: send actual EndTime in UpdateActionStatus to fix inflated durations#7089

Merged
pingsutw merged 37 commits intov2from
fix-duration
Mar 27, 2026
Merged

fix: send actual EndTime in UpdateActionStatus to fix inflated durations#7089
pingsutw merged 37 commits intov2from
fix-duration

Conversation

@pingsutw
Copy link
Copy Markdown
Member

@pingsutw pingsutw commented Mar 25, 2026

Tracking issue

N/A

Why are the changes needed?

The V2 UI left panel shows incorrect durations for tasks (e.g., "1h 10m" instead of "22s"). This happens because the k8s actions client never sends the EndTime field when reporting task completion via UpdateActionStatusRequest. The server then falls back to time.Now() as the end time, which can be significantly later than the actual task completion time. The duration is then calculated as time.Now() - created_at, which includes all queue time and processing delays.

What changes were proposed in this pull request?

  • Added a terminalPhaseTimestamp() helper that extracts the OccurredAt timestamp from the last entry in the TaskAction's PhaseHistory when it represents a terminal phase (Succeeded, Failed, Aborted, TimedOut)
  • Set EndTime in the UpdateActionStatusRequest using the actual completion timestamp from the TaskAction's phase history
  • This ensures the run service computes an accurate duration based on when the task actually completed, not when the server processed the update

Files changed:

  • actions/k8s/client.go — Added terminalPhaseTimestamp() helper and set EndTime in status request
  • actions/k8s/client_test.go — Added integration test for EndTime inclusion and table-driven unit tests for terminalPhaseTimestamp()

How was this patch tested?

  • TestNotifyRunService_UpdateActionStatusIncludesEndTime — verifies that notifyRunService sends the correct EndTime from PhaseHistory when a task succeeds
  • TestTerminalPhaseTimestamp — table-driven test covering: empty history, non-terminal phase, and all 4 terminal phases (Succeeded, Failed, Aborted, TimedOut)
  • All existing tests continue to pass

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.
  • main
    • Flyte 2 WIP #6583
      • fix: send actual EndTime in UpdateActionStatus to fix inflated durations 👈

The k8s actions client was not including EndTime when reporting task
completion via UpdateActionStatusRequest. This caused the server to
fall back to time.Now(), which could be significantly later than the
actual task completion time, resulting in inflated durations displayed
in the UI (e.g., "1h 10m" instead of "22s").

Extract the OccurredAt timestamp from the last PhaseHistory entry when
it represents a terminal phase and include it as EndTime in the status
update request.

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw added the fixed For any bug fixes label Mar 25, 2026
@github-actions github-actions bot mentioned this pull request Mar 25, 2026
3 tasks
Signed-off-by: Kevin Su <pingsutw@apache.org>
Add a started_at column to the actions table that records when the
action first enters the RUNNING phase. Duration is now calculated as
ended_at - started_at (execution time only) instead of
ended_at - created_at (which included queue and setup time).

This fixes the left panel showing inflated durations (e.g., "1m 3s"
for a task that executed in 2s) because the previous calculation
included the time spent waiting in the queue.

Also update StartTime in API responses to use started_at when
available, so the UI shows the actual execution start time.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Tasks may skip RUNNING and go QUEUED → INITIALIZING → SUCCEEDED
directly. The previous code only set started_at on RUNNING, leaving
it NULL for tasks that skip that phase. Now started_at is set on any
phase beyond the waiting stages (INITIALIZING, RUNNING, or terminal).

Signed-off-by: Kevin Su <pingsutw@apache.org>
Copy link
Copy Markdown
Contributor

@popojk popojk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

taskActionStatusChanged() did not compare PhaseHistory, so when a new
phase was appended to the history but no other status field changed,
the change went undetected and no ActionEvent was emitted. This caused
intermediate phases like RUNNING to be silently dropped for fast tasks
where the plugin transitions quickly between reconciliation cycles.

Add a PhaseHistory length comparison so that new phase transitions
always trigger an event to be recorded.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
terminalPhaseTimestamp was comparing against incorrect phase names
("Succeeded", "Failed", "Aborted", "TimedOut") instead of the actual
ConditionReason constants used in PhaseHistory ("Completed",
"PermanentFailure", "RetryableFailure", "Aborted"). This caused the
function to never match terminal phases, so endTime was always nil,
forcing the run service to fall back to time.Now() on every watch
event — making completed action durations grow indefinitely.

Signed-off-by: Kevin Su <pingsutw@apache.org>
When a child TaskAction is created in k8s, the parent must already be
running (it created the child). However, due to the 5-second reconcile
interval, the parent's k8s condition may still show Initializing while
children are already executing, causing the UI to display "Initializing"
for the parent.

Fix this by sending an UpdateActionStatus(RUNNING) for the parent
whenever a child action's Added event is observed in the watcher. Also
add a phase monotonicity guard (phase <= ?) in UpdateActionPhase to
prevent phase downgrades from concurrent updates.

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw self-assigned this Mar 25, 2026
@pingsutw pingsutw added this to the V2 GA milestone Mar 25, 2026
@pingsutw pingsutw marked this pull request as draft March 25, 2026 21:46
pingsutw added 15 commits March 25, 2026 16:58
Store actual pod container start/finish times (ExecutionStartedAt,
CompletedAt) in TaskAction CRD status, separate from PhaseHistory
which uses controller observation time. The watcher and run service
now prefer these actual timestamps for duration computation, falling
back to PhaseHistory for backward compatibility.

This fixes the duration mismatch where the sidebar showed inflated
durations (e.g. 15s) while the right panel showed correct times
(e.g. 2s), caused by controller polling delay inflating PhaseHistory
timestamps.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Set MaxConcurrentReconciles to 10 (was default 1) and reduce
TaskActionDefaultRequeueDuration from 5s to 1s. This dramatically
reduces the time for tasks to progress through their lifecycle,
cutting per-task processing time from ~50s to ~3s.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Action events arrive via the fast path (controller → events proxy →
RecordActionEvents) while actions.phase is updated via the slow path
(K8s watcher → UpdateActionStatus). This causes the sidebar (reading
actions.phase) to show "queue" while the right panel (reading
action_events) already shows "completed".

Fix: when recording action events, also advance the actions table
phase to match the highest event phase. This ensures both data paths
converge immediately.

Signed-off-by: Kevin Su <pingsutw@apache.org>
…s duration

The fast path (recordEvents) was passing controller observation timestamps
as startTime/endTime, which could corrupt started_at/ended_at ordering
when the slow path later set the real pod timestamps. Pass nil instead
so only the slow path (K8s watcher) sets accurate timestamps.

Signed-off-by: Kevin Su <pingsutw@apache.org>
The K8s watcher replays all existing TaskAction CRDs as ADDED events on
every watch reconnect (every 5 minutes). With 187+ old CRDs, each
triggering a synchronous UpdateActionStatus RPC, the event loop was
blocked for ~40 seconds. This caused:

1. UI status appearing 40s late (sidebar showed no status)
2. Main task pods waiting 40s+ for subtask completion notification
3. Main task duration inflated from ~9s to ~49s

Fix: skip terminal ADDED events (already completed) entirely in
handleWatchEvent, and skip duplicate ADDED events (bloom filter hit)
entirely in notifyRunService. This reduces cold-start processing from
187 RPCs to ~11 (only non-terminal actions).

Also fixes duration computation: use actual pod timestamps only (no
time.Now() fallback), handle SQL UPDATE reading old started_at value,
and remove CreatedAt fallback for StartTime in proto conversion.

Results: run duration dropped from 49s to 9s, status appears in <3s.
Signed-off-by: Kevin Su <pingsutw@apache.org>
Fixes three issues from the previous commit:

1. Terminal ADDED events were skipped unconditionally, but on cold start
   they haven't been processed yet — their timestamps were never set.
   Now only skip when the bloom filter confirms they're duplicates.

2. Reverted passing event timestamps as startTime from the fast path.
   Event timestamps are controller observation times that diverge from
   real pod times, causing duration to jump (e.g., 6s → 2s) when the
   slow path arrives with accurate timestamps.

3. Added GREATEST(0, ...) clamp on duration_ms SQL to prevent negative
   values from edge cases where started_at > ended_at.

Signed-off-by: Kevin Su <pingsutw@apache.org>
On watch reconnects, K8s replays all existing CRDs as ADDED events.
Previously, terminal duplicates in the bloom filter were skipped
entirely, including UpdateActionStatus. This left actions with NULL
started_at/ended_at/duration_ms if MODIFIED events were missed before
the reconnect.

Now:
- Non-terminal duplicates: still skip entirely (performance)
- Terminal duplicates: skip RecordAction/parent promotion, but still
  call UpdateActionStatus to repair potentially missing timestamps
- Parent promotion also gated by isDuplicate to avoid redundant RPCs

Signed-off-by: Kevin Su <pingsutw@apache.org>
The controller now sets StartTime on ActionEvent from the CRD's
ExecutionStartedAt field. The fast path (recordEvents) uses this real
pod start time to set started_at immediately, enabling the frontend
to show a live duration counter from the INITIALIZING phase.

Previously the fast path passed nil for startTime, so duration was
only visible after the slow path (K8s watcher) arrived 1-2s later.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Use the event's UpdatedTime (from the controller's PhaseHistory) to
set started_at via the fast path, enabling the frontend to show a live
duration counter from the first non-queued phase. This is simpler than
modifying the controller to set StartTime, since UpdatedTime is already
populated on every event.

The SQL uses COALESCE(started_at, ?) so the first value wins and the
slow path's real pod start time won't be overwritten if it arrives first.

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw marked this pull request as ready for review March 27, 2026 04:57
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw merged commit 0e78c27 into v2 Mar 27, 2026
17 checks passed
@pingsutw pingsutw deleted the fix-duration branch March 27, 2026 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fixed For any bug fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants