Collaborator

@chunjiez chunjiez commented Jan 9, 2026

In some corner cases, the physical-device-path xenstore watch event fires before the slave tapback process is ready to process xenstore watch events. The slave tapback process then misses the event, and the blktap I/O datapath fails to establish.

On the xenopsd side, the vbd-script waits for the tapback slave process to become ready by checking /var/run/tapback..statefile. If the file is present and contains the string "ping", the vbd-script writes "pong" to the file and continues to update xenstore; otherwise, it keeps waiting.

On the tapback slave process side, once it is ready to process xenstore watch events, it writes the string "ping" to /var/run/tapback..statefile and then waits for an acknowledgement by checking whether the file contains the string "pong". After seeing "pong", it removes /var/run/tapback..statefile and continues to work.
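The handshake described above can be sketched as three small shell helpers. This is only an illustration under stated assumptions, not the code in this PR or in the tapback change: the function names `tapback_announce`, `vbd_try_ack`, and `tapback_finish` are hypothetical, and the statefile path is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the ping/pong statefile handshake.
# None of these function names come from the PR; $1 is the statefile path.

# tapback side: announce that it is ready to process watch events.
tapback_announce() {
    printf 'ping' > "$1"
}

# vbd-script side: acknowledge only if tapback has announced readiness.
# Returns 0 after writing "pong", 1 if tapback is not ready yet.
vbd_try_ack() {
    [ -f "$1" ] || return 1
    grep -q 'ping' "$1" || return 1
    printf 'pong' > "$1"
}

# tapback side: once the acknowledgement is seen, remove the statefile.
# Returns 0 after cleanup, 1 if no acknowledgement has arrived yet.
tapback_finish() {
    [ -f "$1" ] || return 1
    grep -q 'pong' "$1" || return 1
    rm -f "$1"
}
```

Each side would call its check function from a polling loop. The sketch ignores concurrent-write races and does not bound the wait, both of which a real implementation would need to address.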

Collaborator Author

chunjiez commented Jan 9, 2026

The tapback-side code change is in xapi-project/blktap#435.

Collaborator Author

chunjiez commented Jan 9, 2026

wait_tapback_ready()
{
    local statefile="/var/run/tapback.${DOMID}.statefile"
    while true; do
Contributor

Is it possible that the while loop never exits if the statefile is never created? Consider adding a timeout or a maximum number of retries.

Contributor

@lindig lindig Jan 9, 2026

A simple way would be:

seq 120 | while read i; do
  ...
  sleep 1
done

This would iterate at most 120 times (about 2 minutes).
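As a variation on the bounded loop suggested above, here is a minimal sketch that counts attempts in the calling shell instead of piping `seq` into `while`. In bash and dash, the body of a piped `while` runs in a subshell by default, so variables set inside it do not survive the loop. The helper name `wait_for_ready` and its arguments are hypothetical and are not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: wait_for_ready MAX_TRIES INTERVAL CHECK_CMD [ARGS...]
# Runs CHECK_CMD up to MAX_TRIES times, sleeping INTERVAL seconds after each
# failed attempt. Returns 0 as soon as the check succeeds, 1 if every attempt
# fails.
wait_for_ready() {
    local max_tries="$1"
    local interval="$2"
    shift 2
    local i=0
    while [ "$i" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

For example, `wait_for_ready 120 1 test -e "$statefile"` would poll once per second for up to roughly two minutes and return non-zero on timeout, so the caller can fail explicitly instead of hanging.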

Contributor

I've already commented on the related tapdisk PR that I don't think this is the correct approach. It adds even more complexity and fragility to the system, which will increase the cost of maintenance going forward.

Contributor

@MarkSymsCtx could you link to that PR? What would be a generally better approach?

Contributor

It seems correct to me that this side needs to be sure that tapback is ready, since otherwise tapback won't be able to process events. Checking for readiness for a limited time and failing if tapback is not ready does not seem fragile to me, and I would be curious how it could be avoided, @MarkSymsCtx.

Contributor

If we generally expect tapback to be available and ready, we should only wait briefly before we fail. I agree with @BengangY that we should not wait indefinitely.
