romanchyla
Adding details pertinent to EKS Auto Mode

Last week I reached out to AWS Support because the nodes provisioned by the NodePool, and the containers running on them, did not see the Neuron devices. AWS Support wasn't aware of the QUICKSTART, and the AWS documentation describes only Provisioners. (The auto-generated response, before I got to speak to an engineer, was that this is a known issue; see the screenshot.)

I then reached out on an internal Slack channel, where I got help. The critical piece was specifying the resource request so that the neuron-plugin exposes the device. The other critical piece was learning that EKS Auto Mode should work.
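A minimal sketch of that request, assuming the `aws.amazon.com/neuroncore` resource name used in the pod definition in this description. The helper name `neuron_resources` is hypothetical and only illustrates the shape of the `resources` mapping:

```python
def neuron_resources(num_cores: int) -> dict:
    """Build matching requests/limits for Neuron cores (hypothetical helper)."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    quantity = str(num_cores)
    # Kubernetes requires requests and limits to be equal for extended
    # resources, so the same quantity is used for both.
    return {
        "requests": {"aws.amazon.com/neuroncore": quantity},
        "limits": {"aws.amazon.com/neuroncore": quantity},
    }
```

The returned mapping can be passed as the `resources` argument of the container, as in the pod definition.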

Here is a screenshot of AWS support Q:

image

Testing done:

After the changes, I deployed via Flyte and verified that the devices were performing inference. Here is my pod definition:

from flytekit import PodTemplate
from kubernetes.client import (
    V1Affinity,
    V1Capabilities,
    V1Container,
    V1NodeAffinity,
    V1NodeSelector,
    V1NodeSelectorRequirement,
    V1NodeSelectorTerm,
    V1PodSecurityContext,
    V1PodSpec,
    V1SecurityContext,
    V1Toleration,
)


# The function name and signature are illustrative; the original snippet
# showed only the return statement.
def inferentia_pod_template(num_cores: int, instance_type: str) -> PodTemplate:
    return PodTemplate(
        primary_container_name="inferentia-primary",
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="inferentia-primary",
                    # IPC_LOCK is required for the neuron devices to be visible to the container
                    # https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/docker-example/
                    security_context=V1SecurityContext(
                        capabilities=V1Capabilities(add=["IPC_LOCK"])
                    ),
                    # must be requested, neuron devices aren't exposed automatically (7/22/25)
                    resources={
                        "requests": {"aws.amazon.com/neuroncore": str(num_cores)},
                        "limits": {"aws.amazon.com/neuroncore": str(num_cores)},
                    },
                )
            ],
            # schedule only onto nodes whose instance-family label matches
            affinity=V1Affinity(
                node_affinity=V1NodeAffinity(
                    required_during_scheduling_ignored_during_execution=V1NodeSelector(
                        node_selector_terms=[
                            V1NodeSelectorTerm(
                                match_expressions=[
                                    V1NodeSelectorRequirement(
                                        key="eks.amazonaws.com/instance-family",
                                        operator="In",
                                        values=[instance_type],
                                    )
                                ]
                            )
                        ]
                    )
                )
            ),
            # tolerate the taint on the neuron nodes
            tolerations=[
                V1Toleration(
                    key="aws.amazon.com/neuron",
                    operator="Exists",
                    effect="NoSchedule",
                )
            ],
            host_ipc=True,
            host_network=True,
            # must be set explicitly for the neuron devices to be visible to the container
            # https://github.com/bottlerocket-os/bottlerocket/blob/develop/QUICKSTART-EKS.md#neuron-support
            security_context=V1PodSecurityContext(
                run_as_user=1001,
                run_as_group=2001,
                fs_group=3001,
            ),
        ),
    )
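A quick way to confirm from inside the container that the devices are exposed is to list the Neuron device nodes. This is only a sketch: it assumes the devices appear as `/dev/neuron0`, `/dev/neuron1`, and so on, and the helper name `visible_neuron_devices` is hypothetical. It does not replace running an actual inference.

```python
import glob


def visible_neuron_devices(dev_glob: str = "/dev/neuron*") -> list[str]:
    """Return the Neuron device paths visible to this process, sorted by name.

    The default pattern is an assumption about where the devices are mounted.
    """
    return sorted(glob.glob(dev_glob))


if __name__ == "__main__":
    devices = visible_neuron_devices()
    if devices:
        print("Neuron devices visible:", ", ".join(devices))
    else:
        print("No Neuron devices visible; check the resource request and security context.")
```

An empty result inside a pod scheduled on a Neuron node would point back at the missing resource request described above.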

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.
