Skip to content

Fixes #38709 - Fix hosts job status stuck in "Pending"#999

Closed
pavanshekar wants to merge 1 commit intotheforeman:masterfrom
pavanshekar:issue-38709
Closed

Fixes #38709 - Fix hosts job status stuck in "Pending"#999
pavanshekar wants to merge 1 commit intotheforeman:masterfrom
pavanshekar:issue-38709

Conversation

@pavanshekar
Copy link
Copy Markdown

Fix hosts job status remaining "Pending" after completion

Testing steps:

  1. Navigate to Monitor -> Jobs
  2. Click on "Run Job" button
  3. Job category -> Katello
  4. Job template -> Install errata by search query - Katello Script Default
  5. Filter the hosts and enter search query
  6. Submit
  7. Once the job completes successfully, the status should be updated accordingly (e.g., to "Success" or "Completed") for all associated hosts mentioned in the table.

Copy link
Copy Markdown
Member

@ekohl ekohl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to write a test for this?

Copy link
Copy Markdown
Contributor

@adamruzicka adamruzicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disclaimer:
I don't recall encountering the problem this is meant to fix and I have not tested this, all I wrote here is just me trying to reason about it.

How it is supposed to work:

  1. When a job is created it spawns a "parent" task backing the whole job, that's @job_invocation.task
  2. This parent task spawns child sub-tasks, one for every single host in the job
  3. The parent task suspends itself and periodically (every 15 seconds or so) checks it children tasks
  4. When all the child tasks are done, the parent marks itself as done

Assuming the code was like this for 5+ years and we're seeing it just now as we switched to the new job invocation page, intuitively I'd look for problems there.

If any of the assumptions I just mentioned don't hold (anymore), then this change feels like putting a band aid on a much deeper problem.

@job_organization = Taxonomy.find_by(id: @job_invocation.task.input[:current_organization_id])
@job_location = Taxonomy.find_by(id: @job_invocation.task.input[:current_location_id])
@auto_refresh = @job_invocation.task.try(:pending?)
@auto_refresh = @job_invocation.task.try(:pending?) || has_pending_hosts?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_pending_host? only comes into play if @job_invocation.task.try(:pending?) returns a false-y value. This can happen if either @job_invocation.task does not exist yet or if the task exists but is already done, which would imply all its children have already finished too.

If the @job_invocation.task does not exist yet, the child tasks cannot exist so has_pending_hosts? will be false as well.

If @job_invocation.task.pending? returns false, then all its children should already be done too, so has_pending_hosts? should always be false too.

@adamruzicka
Copy link
Copy Markdown
Contributor

Needs a rebase to make the conflict go away

@adamruzicka adamruzicka requested a review from Copilot September 3, 2025 14:58
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue where host job statuses remain stuck in "Pending" state after job completion by implementing a manual refresh mechanism that triggers when auto-refresh is disabled.

  • Adds a refresh trigger mechanism to force table data updates when jobs finish
  • Implements useEffect hook to detect job completion and trigger refresh
  • Passes refresh trigger to the host table component to update API options

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
webpack/JobInvocationDetail/index.js Adds refresh trigger state and effect to detect job completion
webpack/JobInvocationDetail/JobInvocationHostTable.js Accepts refresh trigger prop and updates API options to force data refresh

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +146 to +152
params: { ...prev.params }
}));
setAllAPIOptions(prev => ({
...prev,
params: { ...prev.params }
Copy link

Copilot AI Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The useEffect creates new objects with identical content ({ ...prev.params }), which may not trigger a re-fetch since the params remain the same. Consider adding a timestamp or cache-busting parameter to force the API refresh, such as params: { ...prev.params, _refresh: Date.now() }.

Suggested change
params: { ...prev.params }
}));
setAllAPIOptions(prev => ({
...prev,
params: { ...prev.params }
params: { ...prev.params, _refresh: Date.now() }
}));
setAllAPIOptions(prev => ({
...prev,
params: { ...prev.params, _refresh: Date.now() }

Copilot uses AI. Check for mistakes.
@pavanshekar pavanshekar force-pushed the issue-38709 branch 3 times, most recently from 5afa8c0 to c9f5add Compare September 4, 2025 12:17
@kmalyjur
Copy link
Copy Markdown
Contributor

kmalyjur commented Sep 9, 2025

@pavanshekar I couldn't reproduce the exact job from your testing steps, but when I test this on a job that starts in a pending state, I see the update happen in two steps after it finishes:

  1. The chart updates correctly
  2. The status in the table takes a few extra seconds to catch up

Is that the same behavior you were seeing? I'm looking into the code more.

@pavanshekar
Copy link
Copy Markdown
Author

@pavanshekar I couldn't reproduce the exact job from your testing steps, but when I test this on a job that starts in a pending state, I see the update happen in two steps after it finishes:

  1. The chart updates correctly
  2. The status in the table takes a few extra seconds to catch up

Is that the same behavior you were seeing? I'm looking into the code more.

Yes, that's the behavior I am seeing! The fix addresses the issue by implementing a refreshTrigger mechanism that forces a table refresh when the job finishes - previously the table would remain stuck in "Pending" status even after successful completion because it wasn't properly refreshing the data, so now it triggers a fresh API call with a timestamp parameter to ensure the table gets the updated status immediately.

@Lukshio
Copy link
Copy Markdown
Contributor

Lukshio commented Sep 9, 2025

Little bit late to the party, but I tried to reproduce the issue, but I got same behavior as @kmalyjur . It is little bit slower - it takes like +/-10s then the chart, but it is working. Yes it's not ideal, but does it make sense to fix it by adding another watcher to speed it to "just" +/-5s

Anyway I would look for another solution then adding new prop + watcher. The code below should be the solution.
I hope this comment helps. Thank you!

// add prop autoRefresh to component, it is already used in parent
  useEffect(() => {
    setAPIOptions(prevOptions => ({ ...prevOptions }));
  }, [autoRefresh, setAPIOptions]);

Comment on lines +143 to +151
if (refreshTrigger > 0) {
setAPIOptions(prev => ({
...prev,
params: { ...prev.params, _refresh: Date.now() },
}));
setAllAPIOptions(prev => ({
...prev,
params: { ...prev.params, _refresh: Date.now() },
}));
Copy link
Copy Markdown
Contributor

@Lukshio Lukshio Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (refreshTrigger > 0) {
setAPIOptions(prev => ({
...prev,
params: { ...prev.params, _refresh: Date.now() },
}));
setAllAPIOptions(prev => ({
...prev,
params: { ...prev.params, _refresh: Date.now() },
}));
// add prop autoRefresh to component, it is already used in parent
useEffect(() => {
setAPIOptions(prevOptions => ({ ...prevOptions }));
}, [autoRefresh, setAPIOptions]);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion! This approach is much better and cleaner. I've implemented the suggested changes by replacing the complex refreshTrigger mechanism with the simple autoRefresh useEffect that responds to the autoRefresh prop already being passed from the parent component.

@pavanshekar pavanshekar force-pushed the issue-38709 branch 2 times, most recently from 447c0f1 to 02b18e8 Compare September 9, 2025 17:48
@adamruzicka
Copy link
Copy Markdown
Contributor

Without this patch (as of v16.2.1), there is a bit of a delay, but then things switch to the proper state.

Screen.Recording.2025-09-16.at.16.58.12.mov

With this patch applied, I'm getting this

Screen.Recording.2025-09-16.at.16.56.13.mov

This doesn't really seem to improve things for me

@qcjames53
Copy link
Copy Markdown

@adamruzicka Thank you for the videos. I've been seeing the same thing. Thought I was going crazy! I'm going to set up a meeting with @pavanshekar to see if I'm triggering something incorrectly.

Copy link
Copy Markdown

@qcjames53 qcjames53 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavanshekar and I just spoke about this issue. Thanks for meeting up! Here are the minimal testing steps to verify the bug:

  1. Create several hosts on your system. 20+ is a good number. Pavan shared the following snippet to autogenerate many fake hosts in the rails console:
100.times do |i|00" if (i + 1) % 10 == 0
  host = Host.first.dup
  host.name = "fake-host-#{Time.now.to_i}-#{i.to_s.rjust(3, '0')}"
  host.mac = nil
  host.ip = nil
  host.save!
  puts "Created host #{i+1}/100" if (i + 1) % 10 == 0
end
  1. On the all hosts page, select all hosts and click 'Schedule a job'.
  2. Any job will work for this. An easy one is 'Commands' -> 'Run Command - Script Default' and set the command to anything.
  3. Click through all pages and run on the selected hosts. The UI will take you to the job's webpage.
  4. The table will show all hosts as "In progress". When refreshing the page, some of the hosts will switch to "Failed". Any fix needs to automatically update the table without page refresh.

@pavanshekar
Copy link
Copy Markdown
Author

Thanks @adamruzicka for sharing the video! I updated the code where autoRefresh was checking task?.state === 'pending' but the actual task state during execution is 'running', not 'pending'. I updated it so that host statuses now update both as individual hosts complete during execution and when the job finishes by monitoring the host status changes.

Copy link
Copy Markdown

@qcjames53 qcjames53 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the help reviewing this, Pavan. The autorefresh works well on my end and seems to be reasonably efficient with the network calls. Nice job! I can confirm the host list refreshes appropriately once the job is "warmed up" but unfortunately there are some issues when the server is slow:

With enough hosts, the calls to refresh table contents start getting to take longer than the job refresh clock (seems to be 1 second). I know this adds extra work but do you think it would be possible to prevent the job invocation element to stop refreshes while waiting for the server to send back the prior request? Implementing this might even give you the next issue for free.

There are also issues with page load. The table begins updating once a second before the initial contents even have the chance to load, causing the element to flip between page load and content load animations. I would disable this autorefresh behavior entirely until the table contents are populated for the first time.

@pavanshekar
Copy link
Copy Markdown
Author

Thank you for the help reviewing this, Pavan. The autorefresh works well on my end and seems to be reasonably efficient with the network calls. Nice job! I can confirm the host list refreshes appropriately once the job is "warmed up" but unfortunately there are some issues when the server is slow:

With enough hosts, the calls to refresh table contents start getting to take longer than the job refresh clock (seems to be 1 second). I know this adds extra work but do you think it would be possible to prevent the job invocation element to stop refreshes while waiting for the server to send back the prior request? Implementing this might even give you the next issue for free.

There are also issues with page load. The table begins updating once a second before the initial contents even have the chance to load, causing the element to flip between page load and content load animations. I would disable this autorefresh behavior entirely until the table contents are populated for the first time.

Thank you for the detailed feedback! I've already addressed both of these issues in the current implementation. The code now prevents request overlapping when the server is slow and also disables autorefresh until the initial table contents are loaded.

Copy link
Copy Markdown

@qcjames53 qcjames53 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel bad to keep requesting changes but there are still scenarios where this autoupdate slams the server with dozens of requests at once. I'm seeing this when I am waiting on one job to complete and start another, making all of the hosts stay in "pending" for a while. The requests named "7" are the ones driving the autoupdate and they pretty much DOS attack a single-threaded server.
image

@qcjames53
Copy link
Copy Markdown

Oh but other that that everything works really well! It's just I can't ACK with this page DOSing slow single threaded servers.

@pavanshekar
Copy link
Copy Markdown
Author

I feel bad to keep requesting changes but there are still scenarios where this autoupdate slams the server with dozens of requests at once. I'm seeing this when I am waiting on one job to complete and start another, making all of the hosts stay in "pending" for a while. The requests named "7" are the ones driving the autoupdate and they pretty much DOS attack a single-threaded server. image

I've addressed the DOS issue by implementing a throttled refresh mechanism that only calls the hosts API when host status actually changes (when individual hosts complete) and limits these calls to a maximum of once every 3 seconds.

@ekohl
Copy link
Copy Markdown
Member

ekohl commented Sep 20, 2025

AflFAIK today we don't use it but we may want to consider https://guides.rubyonrails.org/action_cable_overview.html to get updates. Rather than polling you can subscribe to changes. Perhaps an improvement we can make? For example, it could get all job statuses and then subscribe to all jobs for updates. No ddos in terms of requests.

I don't want to push it, but in the past we first didn't use it because it didn't exist. Then we used mod_passenger which didn't support it. Now we can use it but don't. I'm not sure if there's a good reason or not.

1 similar comment
@ekohl

This comment was marked as duplicate.

Copy link
Copy Markdown

@qcjames53 qcjames53 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes Pavan; the page itself runs great now no matter what I do! Huge improvement! It looks like there are some minor changes requested from the JS test cases but after you patch those up I'm okay to ACK.

Ewoud's suggestion is interesting. Let's talk about it in standup.

Copy link
Copy Markdown

@qcjames53 qcjames53 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for talking this over. Everything is looking good to me! ACKed

@adamruzicka
Copy link
Copy Markdown
Contributor

We still need those tests to pass though

@Lukshio
Copy link
Copy Markdown
Contributor

Lukshio commented Sep 23, 2025

Hi everyone, meanwhile my PR about API calls refactor got merged. @pavanshekar Can we retest it with latest version? In my case it looks like the issue disappeared. Also if the issue still occurs, it will definitely affect this code, because the API call logic has changed.

@pavanshekar
Copy link
Copy Markdown
Author

pavanshekar commented Sep 24, 2025

Hi everyone, meanwhile my PR about API calls refactor got merged. @pavanshekar Can we retest it with latest version? In my case it looks like the issue disappeared. Also if the issue still occurs, it will definitely affect this code, because the API call logic has changed.

Thanks for confirming! I retested with the latest changes and the issue seems to be resolved. Since the original issue was about hosts remaining in “Pending” status even after a job completed successfully, and the expected behavior was that hosts should update to “Success” or “Completed,” it looks like @Lukshio s’ refactor addresses this. I’ll go ahead and close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants