[WIP] feat: add support for native retry policies#4634

Open
dejanzele wants to merge 5 commits into armadaproject:master from dejanzele:preemption-retries
@dejanzele (Member) commented Jan 27, 2026

A detailed description can be found in GitHub issue #4683.

@dejanzele force-pushed the preemption-retries branch 3 times, most recently from 126aa9c to b4a392a (January 27, 2026 14:55)
@dejanzele changed the title from "feat: add support for native preemption retries" to "feat: add support for native retry policies" (Jan 28, 2026)
@Sovietaced (Contributor) commented:

One thing we ran into recently is that service names and ingress names can also collide. This only affects users who opt into those features, so it can probably be handled in a follow-up pull request.

@dejanzele changed the title from "feat: add support for native retry policies" to "[WIP] feat: add support for native retry policies" (Jan 30, 2026)
@Sovietaced (Contributor) commented:

Another issue that we see daily, and that seems to completely disrupt scheduling, is that there is no concept of a gang generation. It seems to happen specifically after the scheduler has preempted a gang.

Later, when the scheduler runs its logical schedule-and-preempt pass, it still sees preempted pods from the original gang schedule on the nodes, and we get error messages like:

scheduler.go:202 scheduling cycle failure error="gang runner-1328d9002fde4882bcb-n3-0-n3-0-dn0-0 was partially evicted: 2 out of 3 jobs evicted" cycleNumber=1030057

I believe the scheduler mistakes real pods left on the nodes from the previous gang generation for pods involved in the logical schedule/preempt step of the regular scheduling algorithm, and then fails.

@dejanzele force-pushed the preemption-retries branch 7 times, most recently from 0e6d36b to 517b0f5 (February 12, 2026 14:41)
jparraga-stackav and others added 4 commits February 23, 2026 15:34
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
3 participants