Skip to content

[Windows] Reconcile host-network Pods after agent is restarted#6944

Merged
antoninbas merged 1 commit intoantrea-io:mainfrom
wenyingd:windows_reconcile_hostnetwork
Feb 3, 2025
Merged

[Windows] Reconcile host-network Pods after agent is restarted#6944
antoninbas merged 1 commit intoantrea-io:mainfrom
wenyingd:windows_reconcile_hostnetwork

Conversation

@wenyingd
Copy link
Copy Markdown
Contributor

This change is to support reconciling the host-network Pods on Windows because k8s expects to let CNI manage such Pods as long as they are not using host-process containers.

Antrea has received the CmdAdd request for such Pods when they are created, so they should be included in the Pod reconcile list after agent is restarted.

Fix: #6943

@wenyingd wenyingd requested a review from tnqn January 20, 2025 08:18
@luolanzone
Copy link
Copy Markdown
Contributor

@wenyingd there is a golang-lint issue, please fix it.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 547d964 to da952ef Compare January 20, 2025 08:36
@wenyingd
Copy link
Copy Markdown
Contributor Author

@wenyingd there is a golang-lint issue, please fix it.

updated.

@wenyingd wenyingd requested a review from luolanzone January 20, 2025 08:37
@luolanzone luolanzone added the area/OS/windows Issues or PRs related to the Windows operating system. label Jan 20, 2025
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-all
/test-windows-all

}

// getPodsListOptions returns the Pods running on the current Node. Note, the host-network Pods are not filtered
// out on Windows because they are also managed by antrea as long as "spec.SecurityContext.windowsOptions.hostProcess"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if "spec.SecurityContext.windowsOptions.hostProcess" is configured and the Pod is not filtered out?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing was performed since we didn't create OVS interfaces for such Pods.

Having offline sync with @tnqn , we would not perform other filters on Windows in reconcile logic except for the node name to get local Pods and resourceVersion=0 to avoid performance issue, this is to try our best to ensure the reconciled Pod set can cover those Antrea have handled.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from da952ef to 065af4a Compare January 20, 2025 08:55
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-all
/test-windows-all

Copy link
Copy Markdown
Contributor

@luolanzone luolanzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one nit.

@tnqn @antoninbas can you take a look? thanks.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 065af4a to 1193dc7 Compare January 21, 2025 02:25
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-all
/test-windows-all

@XinShuYang
Copy link
Copy Markdown
Contributor

/test-windows-e2e
/test-windows-networkpolicy

@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-networkpolicy
/test-windows-e2e

Copy link
Copy Markdown
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add an e2e test for this as well?


// Re-install Pod1 flows
podFlowsInstalled := make(chan string, 2)
expReInstalledPodCount := 3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: expectedReinstalledPodCount

reinstall is one word

Copy link
Copy Markdown
Contributor Author

@wenyingd wenyingd Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.
I will add an e2e case to verify the host-network Pod traffic is working even if antrea-agent is restarted on Windows.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 1193dc7 to f6c3466 Compare January 22, 2025 03:26
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-all

func (c *CNIConfig) getInfraContainer() string {
return getInfraContainer(c.ContainerId, c.Netns)
}

Copy link
Copy Markdown
Contributor

@XinShuYang XinShuYang Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| Pod Type | Spec.HostNetwork | windowsOptions.hostProcess | Kubelet CMDADD | OVS interface needed? | Reconcile needed? |
| 1. Normal Pod (non-HostNetwork) | false | false / not set | Yes | Yes | Yes |
| 2. Windows HostNetwork Pod | true | false / not set | Yes | Yes | Yes |
| 3. Windows HostProcess Pod | (any) | true | No | No | No (no real network config) |

@wenyingd Can we add a table to the code comments to explain the different behaviors of Windows Pods for developers?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from f6c3466 to c3d9724 Compare January 22, 2025 05:34
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-e2e

1 similar comment
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-e2e

@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-all

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from c3d9724 to 127aaf4 Compare January 22, 2025 09:47
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-e2e

assert.NoErrorf(t, err, "The Openflow entries should be consistent after Antrea agent restarts")

data.runPingMesh(t, podInfos, toolboxContainerName, true)
time.Sleep(10 * time.Minute)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you had this for debugging purposes, that's a very long sleep :)

Copy link
Copy Markdown
Contributor Author

@wenyingd wenyingd Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will remove it after finishing the debug on the new e2e case.

// This case only run on Windows. The Pods in the test includes both host-network Pod and none host-network Pod,
// because CNI on Windows is also responsible for the host-network Pod's networking as long as it is not using
// host-process containers.
func testWindowsPodConnectivityAfterAntreaRestart(t *testing.T, data *TestData) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one issue with the test as currently written is that if there is a failure, it's not clear which one of the 2 Pods may be responsible (assuming only 1 Pod is broken after restart).
Maybe we could have 1 Linux Pod as the client, and then use it to ping each Windows Pod independently? This way we could split this test into 2 subtests, one for the regular Pod and one for the hostNetwork Pod. Would that make sense?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, updated.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 127aaf4 to 1cf8dcf Compare January 23, 2025 02:19
@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 1cf8dcf to 6234815 Compare January 23, 2025 02:30
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-all

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 6234815 to 471d2a7 Compare January 23, 2025 04:16
@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-windows-all

@wenyingd
Copy link
Copy Markdown
Contributor Author

/test-all

@wenyingd
Copy link
Copy Markdown
Contributor Author

The new added e2e is succeeded, the failed case in windows-e2e is TestFQDNCacheMinTTL by issue #6891

See: http://10.164.243.223/view/Windows/job/antrea-windows-e2e-for-pull-request/89

XinShuYang
XinShuYang previously approved these changes Jan 24, 2025
Copy link
Copy Markdown
Contributor

@XinShuYang XinShuYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple nits, otherwise LGTM

if err := data.createToolboxPodOnNode(clientPod.Name, clientPod.Namespace, clientPod.NodeName, false); err != nil {
t.Fatalf("Error when creating Pod '%s': %v", clientPod.Name, err)
}
defer deletePodWrapper(t, data, clientPod.Namespace, clientPod.Name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to omit this, especially for a simple Linux Pod like this one, the Pod will be automatically deleted when the test Namespace is deleted at the end of the test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this code since personally I prefer to clean up the new resources created dedicated for the case after the test.

data.runPingMesh(t, testPodInfos, toolboxContainerName, true)

// Count the OVS flows.
oriOVSFlows := data.dumpOVSFlows(t, winWorkerNode)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spell out original
originalOVSFlows or initialOVSFlows

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated accordingly.

}
}

// verifyWindowsPodConnectivity checks the Pods' connectivity after antrea-agent is restarted on Windows.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checks Pod connectivity

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

}

// verifyWindowsPodConnectivity checks the Pods' connectivity after antrea-agent is restarted on Windows.
// The Pods in the test includes both host-network Pod and the generic Pod, because CNI on Windows is also
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We test both the generic Pod case and the host-network Pod case, because ...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea has received the CmdAdd request for such Pods when they are created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
@wenyingd
Copy link
Copy Markdown
Contributor Author

wenyingd commented Feb 3, 2025

/test-all
/test-windows-all

Copy link
Copy Markdown
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas antoninbas merged commit 71c4290 into antrea-io:main Feb 3, 2025
52 of 53 checks passed
@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Feb 3, 2025
@antoninbas
Copy link
Copy Markdown
Contributor

@wenyingd will you backport this?

@wenyingd
Copy link
Copy Markdown
Contributor Author

wenyingd commented Feb 5, 2025

@wenyingd will you backport this?

Sure, I will back port it to release-2.1 and release-2.2.

wenyingd added a commit to wenyingd/antrea that referenced this pull request Feb 5, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
wenyingd added a commit to wenyingd/antrea that referenced this pull request Feb 5, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
wenyingd added a commit to wenyingd/antrea that referenced this pull request Feb 11, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
wenyingd added a commit to wenyingd/antrea that referenced this pull request Feb 17, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
luolanzone pushed a commit that referenced this pull request Feb 18, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
luolanzone pushed a commit that referenced this pull request Feb 18, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <wenyingd@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/OS/windows Issues or PRs related to the Windows operating system. kind/bug Categorizes issue or PR as related to a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Windows doesn't reconcile "hostNetwork" Pod after agent is restarted

4 participants