fix: send actual EndTime in UpdateActionStatus to fix inflated durations by pingsutw · Pull Request #7089 · flyteorg/flyte

pingsutw · 2026-03-25T05:24:06Z

Tracking issue

N/A

Why are the changes needed?

The V2 UI left panel shows incorrect durations for tasks (e.g., "1h 10m" instead of "22s"). This happens because the k8s actions client never sends the EndTime field when reporting task completion via UpdateActionStatusRequest. The server then falls back to time.Now() as the end time, which can be significantly later than the actual task completion time. The duration is then calculated as time.Now() - created_at, which includes all queue time and processing delays.

What changes were proposed in this pull request?

Added a terminalPhaseTimestamp() helper that extracts the OccurredAt timestamp from the last entry in the TaskAction's PhaseHistory when it represents a terminal phase (Succeeded, Failed, Aborted, TimedOut)
Set EndTime in the UpdateActionStatusRequest using the actual completion timestamp from the TaskAction's phase history
This ensures the run service computes an accurate duration based on when the task actually completed, not when the server processed the update

Files changed:

actions/k8s/client.go — Added terminalPhaseTimestamp() helper and set EndTime in status request
actions/k8s/client_test.go — Added integration test for EndTime inclusion and table-driven unit tests for terminalPhaseTimestamp()

How was this patch tested?

TestNotifyRunService_UpdateActionStatusIncludesEndTime — verifies that notifyRunService sends the correct EndTime from PhaseHistory when a task succeeds
TestTerminalPhaseTimestamp — table-driven test covering: empty history, non-terminal phase, and all 4 terminal phases (Succeeded, Failed, Aborted, TimedOut)
All existing tests continue to pass

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

main
- Flyte 2 WIP #6583
  - fix: send actual EndTime in UpdateActionStatus to fix inflated durations 👈

The k8s actions client was not including EndTime when reporting task completion via UpdateActionStatusRequest. This caused the server to fall back to time.Now(), which could be significantly later than the actual task completion time, resulting in inflated durations displayed in the UI (e.g., "1h 10m" instead of "22s"). Extract the OccurredAt timestamp from the last PhaseHistory entry when it represents a terminal phase and include it as EndTime in the status update request. Signed-off-by: Kevin Su <pingsutw@apache.org>

Signed-off-by: Kevin Su <pingsutw@apache.org>

Add a started_at column to the actions table that records when the action first enters the RUNNING phase. Duration is now calculated as ended_at - started_at (execution time only) instead of ended_at - created_at (which included queue and setup time). This fixes the left panel showing inflated durations (e.g., "1m 3s" for a task that executed in 2s) because the previous calculation included the time spent waiting in the queue. Also update StartTime in API responses to use started_at when available, so the UI shows the actual execution start time. Signed-off-by: Kevin Su <pingsutw@apache.org>

Tasks may skip RUNNING and go QUEUED → INITIALIZING → SUCCEEDED directly. The previous code only set started_at on RUNNING, leaving it NULL for tasks that skip that phase. Now started_at is set on any phase beyond the waiting stages (INITIALIZING, RUNNING, or terminal). Signed-off-by: Kevin Su <pingsutw@apache.org>

popojk

LGTM, thanks!

taskActionStatusChanged() did not compare PhaseHistory, so when a new phase was appended to the history but no other status field changed, the change went undetected and no ActionEvent was emitted. This caused intermediate phases like RUNNING to be silently dropped for fast tasks where the plugin transitions quickly between reconciliation cycles. Add a PhaseHistory length comparison so that new phase transitions always trigger an event to be recorded. Signed-off-by: Kevin Su <pingsutw@apache.org>

Signed-off-by: Kevin Su <pingsutw@apache.org>

terminalPhaseTimestamp was comparing against incorrect phase names ("Succeeded", "Failed", "Aborted", "TimedOut") instead of the actual ConditionReason constants used in PhaseHistory ("Completed", "PermanentFailure", "RetryableFailure", "Aborted"). This caused the function to never match terminal phases, so endTime was always nil, forcing the run service to fall back to time.Now() on every watch event — making completed action durations grow indefinitely. Signed-off-by: Kevin Su <pingsutw@apache.org>

When a child TaskAction is created in k8s, the parent must already be running (it created the child). However, due to the 5-second reconcile interval, the parent's k8s condition may still show Initializing while children are already executing, causing the UI to display "Initializing" for the parent. Fix this by sending an UpdateActionStatus(RUNNING) for the parent whenever a child action's Added event is observed in the watcher. Also add a phase monotonicity guard (phase <= ?) in UpdateActionPhase to prevent phase downgrades from concurrent updates. Signed-off-by: Kevin Su <pingsutw@apache.org>

Store actual pod container start/finish times (ExecutionStartedAt, CompletedAt) in TaskAction CRD status, separate from PhaseHistory which uses controller observation time. The watcher and run service now prefer these actual timestamps for duration computation, falling back to PhaseHistory for backward compatibility. This fixes the duration mismatch where the sidebar showed inflated durations (e.g. 15s) while the right panel showed correct times (e.g. 2s), caused by controller polling delay inflating PhaseHistory timestamps. Signed-off-by: Kevin Su <pingsutw@apache.org>

Set MaxConcurrentReconciles to 10 (was default 1) and reduce TaskActionDefaultRequeueDuration from 5s to 1s. This dramatically reduces the time for tasks to progress through their lifecycle, cutting per-task processing time from ~50s to ~3s. Signed-off-by: Kevin Su <pingsutw@apache.org>

Action events arrive via the fast path (controller → events proxy → RecordActionEvents) while actions.phase is updated via the slow path (K8s watcher → UpdateActionStatus). This causes the sidebar (reading actions.phase) to show "queue" while the right panel (reading action_events) already shows "completed". Fix: when recording action events, also advance the actions table phase to match the highest event phase. This ensures both data paths converge immediately. Signed-off-by: Kevin Su <pingsutw@apache.org>

…s duration The fast path (recordEvents) was passing controller observation timestamps as startTime/endTime, which could corrupt started_at/ended_at ordering when the slow path later set the real pod timestamps. Pass nil instead so only the slow path (K8s watcher) sets accurate timestamps. Signed-off-by: Kevin Su <pingsutw@apache.org>

The K8s watcher replays all existing TaskAction CRDs as ADDED events on every watch reconnect (every 5 minutes). With 187+ old CRDs, each triggering a synchronous UpdateActionStatus RPC, the event loop was blocked for ~40 seconds. This caused: 1. UI status appearing 40s late (sidebar showed no status) 2. Main task pods waiting 40s+ for subtask completion notification 3. Main task duration inflated from ~9s to ~49s Fix: skip terminal ADDED events (already completed) entirely in handleWatchEvent, and skip duplicate ADDED events (bloom filter hit) entirely in notifyRunService. This reduces cold-start processing from 187 RPCs to ~11 (only non-terminal actions). Also fixes duration computation: use actual pod timestamps only (no time.Now() fallback), handle SQL UPDATE reading old started_at value, and remove CreatedAt fallback for StartTime in proto conversion. Results: run duration dropped from 49s to 9s, status appears in <3s. Signed-off-by: Kevin Su <pingsutw@apache.org>

Fixes three issues from the previous commit: 1. Terminal ADDED events were skipped unconditionally, but on cold start they haven't been processed yet — their timestamps were never set. Now only skip when the bloom filter confirms they're duplicates. 2. Reverted passing event timestamps as startTime from the fast path. Event timestamps are controller observation times that diverge from real pod times, causing duration to jump (e.g., 6s → 2s) when the slow path arrives with accurate timestamps. 3. Added GREATEST(0, ...) clamp on duration_ms SQL to prevent negative values from edge cases where started_at > ended_at. Signed-off-by: Kevin Su <pingsutw@apache.org>

On watch reconnects, K8s replays all existing CRDs as ADDED events. Previously, terminal duplicates in the bloom filter were skipped entirely, including UpdateActionStatus. This left actions with NULL started_at/ended_at/duration_ms if MODIFIED events were missed before the reconnect. Now: - Non-terminal duplicates: still skip entirely (performance) - Terminal duplicates: skip RecordAction/parent promotion, but still call UpdateActionStatus to repair potentially missing timestamps - Parent promotion also gated by isDuplicate to avoid redundant RPCs Signed-off-by: Kevin Su <pingsutw@apache.org>

The controller now sets StartTime on ActionEvent from the CRD's ExecutionStartedAt field. The fast path (recordEvents) uses this real pod start time to set started_at immediately, enabling the frontend to show a live duration counter from the INITIALIZING phase. Previously the fast path passed nil for startTime, so duration was only visible after the slow path (K8s watcher) arrived 1-2s later. Signed-off-by: Kevin Su <pingsutw@apache.org>

Use the event's UpdatedTime (from the controller's PhaseHistory) to set started_at via the fast path, enabling the frontend to show a live duration counter from the first non-queued phase. This is simpler than modifying the controller to set StartTime, since UpdatedTime is already populated on every event. The SQL uses COALESCE(started_at, ?) so the first value wins and the slow path's real pod start time won't be overwritten if it arrives first. Signed-off-by: Kevin Su <pingsutw@apache.org>

Signed-off-by: Kevin Su <pingsutw@apache.org>

pingsutw added the fixed For any bug fixes label Mar 25, 2026

github-actions bot mentioned this pull request Mar 25, 2026

Flyte 2 WIP #6583

Draft

3 tasks

pingsutw added 3 commits March 24, 2026 22:38

update config

e3fde16

Signed-off-by: Kevin Su <pingsutw@apache.org>

popojk approved these changes Mar 25, 2026

View reviewed changes

pingsutw added 5 commits March 24, 2026 23:38

Merge branch 'v2' of github.com:flyteorg/flyte into fix-duration

5fa2b00

fix

528bbf0

Signed-off-by: Kevin Su <pingsutw@apache.org>

pingsutw self-assigned this Mar 25, 2026

pingsutw added this to the V2 GA milestone Mar 25, 2026

pingsutw marked this pull request as draft March 25, 2026 21:46

pingsutw added 15 commits March 25, 2026 16:58

fix

e43e4b4

Signed-off-by: Kevin Su <pingsutw@apache.org>

test

9debc6f

Signed-off-by: Kevin Su <pingsutw@apache.org>

fix

ebd3f18

Signed-off-by: Kevin Su <pingsutw@apache.org>

fix event

755c699

Signed-off-by: Kevin Su <pingsutw@apache.org>

test

5501881

Signed-off-by: Kevin Su <pingsutw@apache.org>

test

d5c53dc

Signed-off-by: Kevin Su <pingsutw@apache.org>

pingsutw added 8 commits March 26, 2026 14:39

nit

afb20fd

Signed-off-by: Kevin Su <pingsutw@apache.org>

test

94db693

Signed-off-by: Kevin Su <pingsutw@apache.org>

test

180b1a5

Signed-off-by: Kevin Su <pingsutw@apache.org>

test

120bbe3

Signed-off-by: Kevin Su <pingsutw@apache.org>

fix

a719140

Signed-off-by: Kevin Su <pingsutw@apache.org>

Merge branch 'v2' of github.com:flyteorg/flyte into fix-duration

d421c5f

test

20c5905

Signed-off-by: Kevin Su <pingsutw@apache.org>

fix-tests

8dc472d

Signed-off-by: Kevin Su <pingsutw@apache.org>

pingsutw marked this pull request as ready for review March 27, 2026 04:57

pingsutw added 5 commits March 26, 2026 22:00

fix

f570b3e

Signed-off-by: Kevin Su <pingsutw@apache.org>

nit

f521fae

Signed-off-by: Kevin Su <pingsutw@apache.org>

nit

ce49644

Signed-off-by: Kevin Su <pingsutw@apache.org>

nit

457c7f6

Signed-off-by: Kevin Su <pingsutw@apache.org>

nit

16d21b2

Signed-off-by: Kevin Su <pingsutw@apache.org>

pingsutw merged commit 0e78c27 into v2 Mar 27, 2026
17 checks passed

pingsutw deleted the fix-duration branch March 27, 2026 16:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: send actual EndTime in UpdateActionStatus to fix inflated durations#7089

fix: send actual EndTime in UpdateActionStatus to fix inflated durations#7089
pingsutw merged 37 commits intov2from
fix-duration

pingsutw commented Mar 25, 2026 •

edited by github-actions bot

Loading

Uh oh!

popojk left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pingsutw commented Mar 25, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

Files changed:

How was this patch tested?

Check all the applicable boxes

Uh oh!

popojk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pingsutw commented Mar 25, 2026 •

edited by github-actions bot

Loading