prov/lnx: lnx_open_core_domains - Correctly track number of open domains#11905

Open
jfillers wants to merge 1 commit into ofiwg:main from jfillers:lnx_track_open_domains

@jfillers

Fix lnx_open_core_domains to correctly track the number of open domains by decrementing the count by 1 when a domain fails to open. Remove an unnecessary NULL check in lnx_domain_close.

Signed-off-by: Thomas Fillers <fillersjt@ornl.gov>
@jfillers force-pushed the lnx_track_open_domains branch from 40d707f to 18fbc67 on February 20, 2026 17:57
  				      &cd->cd_domain, context);
- 		if (rc)
+ 		if (rc){
+ 			lnx_domain->ld_num_doms--;
Contributor

This only works if the failing domain is the last one.

Contributor

The loop stops on the first failure, so you'll never get working, not working, working. You'll always get working, not working, then fail and clean up.

Contributor

But the count would still be incorrect.

Contributor

I'm not sure I follow. The count is incremented earlier in this function. So when there is a failure, the decrement ensures that the count is set to the number of domains that need to be cleaned up.

Contributor

@j-xiong j-xiong Feb 20, 2026

At L130, ld_num_doms is increased in a loop. If, say, the loop count is 4, then it's increased by 4. Now suppose the domain creation fails at the first one. The count is decreased by 1, leaving 3, while no domain is actually open. That doesn't reflect how many domains are actually valid.
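
The arithmetic above can be checked with a small toy model. This is only a sketch under assumptions, not the lnx provider code: `toy_count`, `toy_decrement_on_failure`, and the two-loop shape (count everything up front, then open in order and stop at the first failure) are hypothetical stand-ins for the flow described in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of the accounting discussed above.
 * None of these names exist in the lnx provider. */
struct toy_count {
	size_t tracked; /* what the counter holds after the failure path */
	size_t opened;  /* how many domains really opened */
};

/* fail_at is the zero-based index of the domain whose open fails.
 * Pass fail_at >= total to model "no failure". */
static struct toy_count toy_decrement_on_failure(size_t total, size_t fail_at)
{
	struct toy_count c = { .tracked = 0, .opened = 0 };
	size_t i;

	/* First loop: the counter is bumped once per domain up front. */
	for (i = 0; i < total; i++)
		c.tracked++;

	/* Second loop: open in order, stop at the first failure. */
	for (i = 0; i < total; i++) {
		if (i == fail_at) {
			c.tracked--; /* the single decrement under review */
			break;
		}
		c.opened++;
	}
	return c;
}
```

With `total = 4` and `fail_at = 0`, the model ends with `tracked == 3` and `opened == 0`. The two values agree only when the failing domain is the last one (`fail_at = 3`), which matches the earlier remark in this thread.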

Contributor

Right. We can do:
lnx_domain->ld_num_doms = inter_dom_start - 1
That should give the exact number of domains that were successfully opened.
Then the close function can always assume that it's closing open ones.
This should work whether there is only an shm domain or a combination of shm and other domains, and one of them fails.
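
To illustrate the invariant being proposed (the count always equals the number of domains that really opened, so close never has to check), here is a minimal sketch. It is not the lnx code: every `toy_` name is hypothetical, and instead of assigning the count from `inter_dom_start` on the failure path (its exact meaning isn't shown in this thread), the sketch advances the counter only after each successful open, which is one assumed way to reach the same invariant.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DOMS 8

/* Hypothetical stand-ins; not the lnx provider structures. */
struct toy_domain {
	int is_open;
};

struct toy_parent {
	struct toy_domain doms[TOY_MAX_DOMS];
	size_t num_open; /* invariant: doms[0..num_open-1] are open */
};

/* Stand-in for a core domain open call that may fail. */
static int toy_open_one(struct toy_domain *d, int should_fail)
{
	if (should_fail)
		return -1;
	d->is_open = 1;
	return 0;
}

/* Open domains in order and stop at the first failure. The counter
 * moves only after a success, so it is exact on every exit path. */
static int toy_open_all(struct toy_parent *p, size_t total, size_t fail_at)
{
	size_t i;
	int rc;

	assert(total <= TOY_MAX_DOMS);
	p->num_open = 0;
	for (i = 0; i < total; i++) {
		rc = toy_open_one(&p->doms[i], i == fail_at);
		if (rc)
			return rc;
		p->num_open++;
	}
	return 0;
}

/* Close exactly the domains that opened; no per-entry checks needed.
 * Returns how many domains were closed. */
static size_t toy_close_all(struct toy_parent *p)
{
	size_t closed = 0;

	while (p->num_open > 0) {
		p->num_open--;
		p->doms[p->num_open].is_open = 0;
		closed++;
	}
	return closed;
}
```

In this sketch, a failure at index 2 of 4 leaves `num_open == 2`, and `toy_close_all` closes exactly those two. A failure at index 0 leaves nothing to close.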

3 participants