@maekawataiki maekawataiki commented Dec 7, 2025

Issue #, if available:
#913 (related #127)

Description of changes:

  • Move the containerd root from the root volume to the EBS volume so that the Docker build cache no longer fills the root volume.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@KeitaW KeitaW self-requested a review December 8, 2025 02:23
Contributor

KeitaW commented Dec 17, 2025

Thanks @maekawataiki for reporting the issue and creating this PR. The suggested fix did not take effect as expected because the corresponding lines in the original config.toml are commented out (see the line marked "# Here" in the output below).

$ bash easy-ssh.sh  -r us-east-1 hyperpod-after-20241216

=================================================

==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====


=================================================
Cluster id: jhroxiiv5v3e
Instance id: i-05060251d0a782283
Node Group: controller-machine
SSH User: ubuntu
1. Detected hyperpod-after-20241216 in ~/.ssh/config. Skipping adding...
2. Detected SSH public key ~/.ssh/id_rsa.pub for user ubuntu on the cluster. Skipping adding...

Now you can run:

$ ssh hyperpod-after-20241216

Starting session with SessionId: i-0f5934b931601f25a-epjgelyqlb4aq6l44epdkeuo4q
$ srun cat /etc/containerd/config.toml
#   Copyright 2018-2022 Docker Inc.

#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at

#       http://www.apache.org/licenses/LICENSE-2.0

#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

disabled_plugins = ["cri"]

#root = "/opt/dlami/nvme/docker/containerd" # Here
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"
$ 

Contributor

KeitaW commented Dec 17, 2025

For reference, here is the original containerd config.toml before applying the suggested fix:

$ bash easy-ssh.sh  -r us-east-1 hyperpod-before-20241216

=================================================

==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====


=================================================
Cluster id: 69q9l3vgs5iv
Instance id: i-0105da2ccc9eae353
Node Group: controller-machine
SSH User: ubuntu
1. Detected hyperpod-before-20241216 in ~/.ssh/config. Skipping adding...
2. Detected SSH public key ~/.ssh/id_rsa.pub for user ubuntu on the cluster. Skipping adding...

Now you can run:

$ ssh hyperpod-before-20241216

Starting session with SessionId: i-0f5934b931601f25a-dab2bng46eqgfk9a3vyx8pesdq
$  srun cat /etc/containerd/config.toml
#   Copyright 2018-2022 Docker Inc.

#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at

#       http://www.apache.org/licenses/LICENSE-2.0

#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"
$ 

Contributor

KeitaW commented Dec 17, 2025

In the original version:

  • /etc/containerd/config.toml has root = "/opt/dlami/nvme/docker/containerd" commented out, so containerd falls back to its default root, /var/lib/containerd, which lives on the root volume (/).
  • /etc/docker/daemon.json sets "data-root": "/opt/dlami/nvme/docker/data-root", but BuildKit/containerd snapshots are written to containerd's root, not Docker's data-root.

A small follow-up change (e4583d6) is needed to uncomment the corresponding lines.
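
To make the point about the commented-out root line concrete, here is a minimal sketch of a check for which root a given config.toml would actually yield. The function name `effective_containerd_root` is hypothetical, and the sketch assumes that only an uncommented, unindented `root = "..."` line counts; anything else falls back to the default /var/lib/containerd mentioned above.

```shell
#!/usr/bin/env bash
# Print the containerd root that a given config.toml would yield.
# Assumption: only an uncommented, unindented `root = "..."` line is honored;
# otherwise the default /var/lib/containerd applies.
effective_containerd_root() {
  local cfg="$1" value
  # `^root` deliberately does not match `#root`, so commented lines are ignored.
  value="$(sed -n 's|^root *= *"\([^"]*\)".*|\1|p' "${cfg}" 2>/dev/null | head -n 1)"
  echo "${value:-/var/lib/containerd}"
}
```

On a cluster node this could be called as `effective_containerd_root /etc/containerd/config.toml`. It only inspects the file; `containerd config dump` remains the authoritative source for the running configuration.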

Contributor

KeitaW commented Dec 17, 2025

Workaround on existing clusters:

#!/usr/bin/env bash
set -euo pipefail

NVME_ROOT="/opt/dlami/nvme"
CTR_ROOT="${NVME_ROOT}/docker/containerd"
CTR_STATE="/run/containerd"
CTR_CFG="/etc/containerd/config.toml"
BACKUP="${CTR_CFG}.$(date +%Y%m%d%H%M%S).bak"

if [[ ! -d "${NVME_ROOT}" ]]; then
  echo "ERROR: ${NVME_ROOT} not found or not mounted. Aborting." >&2
  exit 1
fi

echo "Stopping docker and containerd..."
sudo systemctl stop docker || true
sudo systemctl stop containerd || true

echo "Backing up containerd config to ${BACKUP}..."
sudo cp "${CTR_CFG}" "${BACKUP}" 2>/dev/null || true

if [[ ! -f "${CTR_CFG}" ]]; then
  echo "Generating default containerd config..."
  sudo containerd config default | sudo tee "${CTR_CFG}" >/dev/null
fi

echo "Setting containerd root to ${CTR_ROOT} (state stays at ${CTR_STATE})..."
sudo sed -i \
  -e "s|^#\?root *=.*|root = \"${CTR_ROOT}\"|" \
  -e "s|^#\?state *=.*|state = \"${CTR_STATE}\"|" \
  "${CTR_CFG}"

echo "Ensuring target directories exist..."
sudo mkdir -p "${CTR_ROOT}"
sudo mkdir -p "$(dirname "${CTR_STATE}")"

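echo "Cleaning old containerd root at /var/lib/containerd ..."
# Expand the glob inside a root shell: /var/lib/containerd may not be listable
# by a regular user, in which case an unprivileged glob would match nothing.
sudo sh -c 'rm -rf /var/lib/containerd/*'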

echo "Reloading systemd units..."
sudo systemctl daemon-reload

echo "Starting containerd and docker..."
sudo systemctl start containerd
sudo systemctl start docker

echo "Done. Current containerd root setting:"
sudo containerd config dump | grep '^root ='
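
After running the workaround, it may be worth confirming that the new containerd root really sits on a different filesystem from the root volume. The following is a rough sketch only; the helper names `mount_point_of` and `on_root_volume` are hypothetical, and the path passed in must already exist for `df` to resolve it.

```shell
#!/usr/bin/env bash
# Print the mount point of the filesystem that holds the given path.
# `df -P` keeps each entry on one line; the last field is the mount point.
mount_point_of() {
  df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Return 0 if the path lives on the root filesystem (/), 1 otherwise.
on_root_volume() {
  [ "$(mount_point_of "$1")" = "/" ]
}
```

For example, `on_root_volume /opt/dlami/nvme/docker/containerd && echo "WARNING: still on the root volume"` would flag a node where the NVMe mount is missing. Mount points that contain spaces are not handled by this sketch.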

Removed state configuration from containerd setup for both paths.
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
fi
sudo sed -i \
-e 's|^#\\?root *=.*|root = "/opt/dlami/nvme/docker/containerd"|' \

I tested these changes and they did not work, but the command below did:
sudo sed -i -e 's|^#\?root *=.*|root = "/opt/sagemaker/docker/containerd"|'
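
A likely explanation for the difference, sketched under the assumption that GNU sed is in use: inside single quotes the shell passes `\\?` to sed unchanged, and sed reads it as an escaped backslash followed by a literal `?`, whereas `\?` is the "zero or one" operator. The helper `try_expr` and the sample input line are illustrative only.

```shell
#!/usr/bin/env bash
# Apply a sed expression to a sample commented-out root line and print the result.
try_expr() {
  printf '%s\n' '#root = "/var/lib/containerd"' | sed -e "$1"
}

# `\\?` inside single quotes: sed sees a literal backslash plus a literal `?`,
# so the pattern would only match the text `#\?root`. The line is left unchanged.
try_expr 's|^#\\?root *=.*|root = "/opt/sagemaker/docker/containerd"|'

# `\?` makes the preceding `#` optional (GNU sed basic regex), so the
# commented-out line is matched and rewritten.
try_expr 's|^#\?root *=.*|root = "/opt/sagemaker/docker/containerd"|'
```

The doubled backslash would only be appropriate where another layer of unescaping happens before sed sees the expression, which is not the case inside single quotes in a plain shell script.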

containerd config default | sudo tee /etc/containerd/config.toml >/dev/null
fi
sudo sed -i \
-e 's|^#\\?root *=.*|root = "/opt/sagemaker/docker/containerd"|' \
Collaborator

Could you use /opt/sagemaker/containerd/data-root instead of /opt/sagemaker/docker/containerd, for consistency with the HyperPod EKS side?

See: https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create_main.sh#L70

@KeitaW KeitaW self-requested a review January 22, 2026 06:21