13 changes: 6 additions & 7 deletions prov/lnx/src/lnx_domain.c
@@ -76,11 +76,9 @@ static int lnx_domain_close(struct fid *fid)
 	for (i = 0; i < domain->ld_num_doms; i++) {
 		cd = &domain->ld_core_domains[i];

-		if (cd->cd_domain) {
-			rc = fi_close(&cd->cd_domain->fid);
-			if (rc)
-				frc = rc;
-		}
+		rc = fi_close(&cd->cd_domain->fid);
+		if (rc)
+			frc = rc;
 	}

 	ofi_bufpool_destroy(domain->ld_mem_reg_bp);
@@ -180,9 +178,10 @@ static int lnx_open_core_domains(struct lnx_fabric *lnx_fab,

 			rc = fi_domain(cf->cf_fabric, cd->cd_info,
 				       &cd->cd_domain, context);
-			if (rc)
+			if (rc){
+				lnx_domain->ld_num_doms--;
Contributor

This only works if the failing domain is the last one.

Contributor

It stops on the first failure, so you'll never get working, not working, working. You'll always get working, not working, then fail and clean up.

Contributor

But the count would be incorrect.

Contributor

I'm not sure I follow. The count is incremented earlier in this function. So when there is a failure, the decrement ensures that the count is set to the number of domains that need to be cleaned up.

Contributor

@j-xiong, Feb 20, 2026

At L130, ld_num_doms is incremented in a loop. If, say, the loop count is 4, it is increased by 4. If domain creation then fails at the first one, the count is decreased by only 1. That doesn't reflect how many domains are actually valid.

Contributor

Right. We can do:
lnx_domain->ld_num_doms = inter_dom_start - 1
That should give the exact number of domains that were successfully started.
Then the close function can always assume that it's closing open ones.
This should work whether there are only shm domains or a combination of shm and other domains and one of them fails.

 				return rc;
-
+			}
 			cd->cd_fabric = cf;
 		}
 	}
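
The bookkeeping debated in the review thread can be illustrated with a small, self-contained sketch. It is not the lnx code: `mock_domain`, `mock_parent`, `mock_open`, `open_all`, `close_all`, and `count_after_single_decrement` are hypothetical names, and `mock_open` merely stands in for a call such as `fi_domain`. The sketch also uses a different technique from the `inter_dom_start - 1` assignment suggested in the thread: it increments the counter only after each successful open, so the counter always equals the number of open domains and the close path needs no NULL check.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DOMS 4

/* Hypothetical stand-in for a core domain; not the lnx structure. */
struct mock_domain {
	int id;
};

/* Hypothetical parent object; num_doms plays the role of ld_num_doms. */
struct mock_parent {
	struct mock_domain *doms[MAX_DOMS];
	int num_doms;
};

/* Mock of a domain open call: fails when idx == fail_at. */
static int mock_open(struct mock_domain **out, int idx, int fail_at)
{
	struct mock_domain *d;

	if (idx == fail_at)
		return -1;
	d = calloc(1, sizeof(*d));
	if (!d)
		return -1;
	d->id = idx;
	*out = d;
	return 0;
}

/* Count left behind if the counter was raised to n up front and
 * then decremented exactly once on the first failure. */
static int count_after_single_decrement(int n)
{
	return n - 1;
}

/* Open up to n domains, counting each one only after it opened.
 * On failure, num_doms holds the number of domains actually open. */
static int open_all(struct mock_parent *p, int n, int fail_at)
{
	int i, rc;

	if (n > MAX_DOMS)
		return -1;
	p->num_doms = 0;
	for (i = 0; i < n; i++) {
		rc = mock_open(&p->doms[i], i, fail_at);
		if (rc)
			return rc;
		p->num_doms++;
	}
	return 0;
}

/* Close exactly num_doms entries and return how many were closed.
 * The assert documents the invariant that every counted entry is open. */
static int close_all(struct mock_parent *p)
{
	int i, closed = 0;

	for (i = 0; i < p->num_doms; i++) {
		assert(p->doms[i] != NULL);
		free(p->doms[i]);
		p->doms[i] = NULL;
		closed++;
	}
	p->num_doms = 0;
	return closed;
}
```

With four domains and a failure on the first open, a single decrement from a pre-incremented count would leave 3 even though nothing is open, whereas the counter in `open_all` stays at 0 and `close_all` touches nothing.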