Fixing pcidevice device plugin stop deadlock #66

Open
WebberHuang1118 wants to merge 1 commit into harvester:master from WebberHuang1118:fix-dp-stop
Member

WebberHuang1118 commented Feb 19, 2024

Problem:
The pci controller can deadlock when it stops the device plugin.

Solution:
Refactor the device plugin stop flow so that it terminates gracefully. The flow would look like this:

Related Issue:
harvester/harvester#5164

Test Plan:

  • Enable and disable one pcidevice passthrough
  • There should be no error message similar to device plugin failed to deregister: rpc error: code = Unavailable desc = transport is closing","pos":"device_manager.go:250","reason":"rpc error: code = Unavailable desc = transport is closing" in the pci controller daemonset

WebberHuang1118 force-pushed the fix-dp-stop branch 2 times, most recently from 794ac14 to 8a337ca February 19, 2024 04:10
Signed-off-by: Webber Huang <webber.huang@suse.com>

Fixing codeFactor "Complex Method" in pcidevice plugin healthcheck()
Contributor

Yu-Jack commented Mar 1, 2024

I think we just need to add the following code before L165.

if !dp.starter.started {
    return nil
}

close(dp.starter.stopChan)
dp.starter.started = false

func (dp *PCIDevicePlugin) Stop() error {
    return dp.stopDevicePlugin()
}

The reason is that stop and done are different concepts here.

  • stop is an external control: someone outside the object uses it to tell the object that it should stop.
  • done is the object's internal status: it tells the rest of the object about the current progress.
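
The distinction above can be sketched in Go as follows. This is only an illustration under stated assumptions: the plugin type, its fields, and the Start/run helpers are hypothetical and are not the PCIDevicePlugin code from this PR or KubeVirt's device plugin. The owner closes stop to request shutdown, the worker closes done to report that it has exited, and the started guard keeps a repeated Stop from closing the channel twice (closing an already closed channel panics).

```go
package main

import (
	"fmt"
	"sync"
)

// plugin is a hypothetical stand-in for a device plugin; it is not the
// PCIDevicePlugin type from the pull request.
type plugin struct {
	mu      sync.Mutex
	started bool
	stop    chan struct{} // closed by the owner to request shutdown
	done    chan struct{} // closed by the worker once it has exited
}

// Start launches the worker goroutine. Calling it on a running plugin is a no-op.
func (p *plugin) Start() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.started {
		return
	}
	p.stop = make(chan struct{})
	p.done = make(chan struct{})
	p.started = true
	go p.run(p.stop, p.done)
}

// run blocks until a stop is requested, then reports completion via done.
func (p *plugin) run(stop <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	<-stop
}

// Stop requests shutdown and waits for the worker to finish.
// The started guard prevents a second call from closing stop again.
func (p *plugin) Stop() error {
	p.mu.Lock()
	if !p.started {
		p.mu.Unlock()
		return nil
	}
	close(p.stop)
	p.started = false
	done := p.done
	p.mu.Unlock()

	<-done // wait outside the lock so the lock is not held while blocking
	return nil
}

func main() {
	p := &plugin{}
	p.Start()
	fmt.Println(p.Stop(), p.Stop()) // the second call is a no-op
}
```

Waiting on done after releasing the mutex is a design choice in this sketch: it avoids holding the lock while blocked, which is one common way such stop paths end up deadlocking.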

As I mentioned in PR #67 (comment), our structure lacks some objects, but we still use a flow similar to KubeVirt's device plugin. I think we could keep the original flow for now until we have time to redesign it based on our structure.

Then we won't need to delete all the stop channels in the other PR. What do you think?

Contributor

Yu-Jack commented Mar 4, 2024

BTW, this is the commit with my changes (Yu-Jack@bac0ffd)

WebberHuang1118 requested a review from Yu-Jack March 4, 2024 07:16

mergify bot commented Jun 25, 2025

This pull request is now in conflict. Could you fix it @WebberHuang1118? 🙏
