# Fixing Kind Cluster Image Pull Failures (Mainland China)

## Problem

When deploying with Kind from mainland China, the container runtime inside the Kind cluster cannot pull images from GitHub Container Registry (ghcr.io), even when the host machine has a working VPN, because the cluster nodes do not inherit the host's proxy settings.

## Solutions

### Option 1: Build and load the image locally (recommended) ⭐

This is the most reliable approach: build the image over your local VPN connection, then load it into the Kind cluster.

#### Steps

1. **Make sure you are in the vllm environment**:

```bash
conda activate vllm  # if not already activated
cd /home/jared/vllm-project/semantic-router
```

2. **Build the image (over your local VPN)**:

```bash
docker build -t semantic-router-extproc:local -f Dockerfile.extproc .
```

3. **Load the image into the Kind cluster**:

```bash
kind load docker-image semantic-router-extproc:local --name semantic-router-cluster
```

4. **Update the Kubernetes config to use the local image**:

Edit `deploy/kubernetes/kustomization.yaml` and change the `images` section:

```yaml
images:
  - name: ghcr.io/vllm-project/semantic-router/extproc
    newName: semantic-router-extproc
    newTag: local
```
5. **Redeploy**:

```bash
# Delete the old deployment
kubectl delete deployment semantic-router -n vllm-semantic-router-system

# Re-apply the configuration
kubectl apply -k deploy/kubernetes/

# Watch the rollout
kubectl get pods -n vllm-semantic-router-system -w
```

---

### Option 2: Use the automation script

I have created an automation script that performs the steps above for you:

```bash
conda activate vllm
cd /home/jared/vllm-project/semantic-router
./tools/kind/build-and-load-image.sh
```

**Note**: the script's cluster-name detection currently needs fixing. If the script fails, run the steps from Option 1 manually.
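
If you want to patch the detection yourself, one plausible fix is to parse the output of `kind get clusters`. This is a hypothetical standalone sketch, not the script's actual code; the sample output below is hard-coded so the logic can be checked without a running cluster:

```shell
# Hypothetical fix sketch: pick the cluster name to pass to `kind load`.
# `kind get clusters` prints one cluster name per line; prefer a cluster
# whose name contains "semantic-router", otherwise fall back to the first.
pick_cluster() {
  printf '%s\n' "$1" | grep -m1 'semantic-router' || printf '%s\n' "$1" | head -n1
}

# Sample output standing in for: clusters="$(kind get clusters)"
clusters="other-cluster
semantic-router-cluster"
pick_cluster "$clusters"   # prints: semantic-router-cluster
```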

---

### Option 3: Configure the Kind nodes to use a proxy

This approach lets the Kind cluster nodes pull images through your proxy server.

#### Steps

1. **Get the host IP**:

```bash
HOST_IP=$(hostname -I | awk '{print $1}')
echo "Host IP: $HOST_IP"
```

2. **Configure the proxy on each Kind node**:

For every node (control-plane and worker), run the following. Note that Kind nodes run containerd, so the proxy drop-in goes under `containerd.service.d`:

```bash
# Control plane
docker exec semantic-router-cluster-control-plane bash -c "mkdir -p /etc/systemd/system/containerd.service.d"
docker exec semantic-router-cluster-control-plane bash -c "cat > /etc/systemd/system/containerd.service.d/http-proxy.conf << 'EOF'
[Service]
Environment=\"HTTP_PROXY=http://${HOST_IP}:7897\"
Environment=\"HTTPS_PROXY=http://${HOST_IP}:7897\"
Environment=\"NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.svc.cluster.local\"
EOF"

# Worker
docker exec semantic-router-cluster-worker bash -c "mkdir -p /etc/systemd/system/containerd.service.d"
docker exec semantic-router-cluster-worker bash -c "cat > /etc/systemd/system/containerd.service.d/http-proxy.conf << 'EOF'
[Service]
Environment=\"HTTP_PROXY=http://${HOST_IP}:7897\"
Environment=\"HTTPS_PROXY=http://${HOST_IP}:7897\"
Environment=\"NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.svc.cluster.local\"
EOF"

# Restart containerd
docker exec semantic-router-cluster-control-plane systemctl daemon-reload
docker exec semantic-router-cluster-control-plane systemctl restart containerd
docker exec semantic-router-cluster-worker systemctl daemon-reload
docker exec semantic-router-cluster-worker systemctl restart containerd
```

3. **Make sure the proxy is reachable from the containers**:

Your proxy (localhost:7897) must accept connections from the Docker network, not just from localhost. You may need to change the proxy's settings to listen on the Docker bridge (or all interfaces) and to allow LAN connections.
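
One way to verify reachability is to run curl through the proxy from inside a node container, assuming curl is available in the node image. The helper below only builds the `docker exec` command line (the container name and IP are examples), so it can be inspected or copied without a running cluster:

```shell
# Hypothetical helper: build the command that tests the proxy from a node.
# Usage: proxy_check_cmd <node-container> <host-ip>
proxy_check_cmd() {
  printf 'docker exec %s curl -fsS -m 5 -x http://%s:7897 https://ghcr.io/v2/\n' "$1" "$2"
}

proxy_check_cmd semantic-router-cluster-control-plane 192.0.2.10  # example IP
```

Run the printed command with your real `$HOST_IP`; a non-error response means the node can reach ghcr.io through the proxy.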

4. **Restart the deployment**:

```bash
kubectl rollout restart deployment/semantic-router -n vllm-semantic-router-system
kubectl get pods -n vllm-semantic-router-system -w
```

---

### Option 4: Use a registry inside China

If the image has already been pushed to a registry hosted in China (e.g. Alibaba Cloud or Tencent Cloud), you can point the configuration at that registry instead.

Edit `deploy/kubernetes/kustomization.yaml`:

```yaml
images:
  - name: ghcr.io/vllm-project/semantic-router/extproc
    newName: registry.cn-hangzhou.aliyuncs.com/your-namespace/semantic-router-extproc
    newTag: latest
```
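
If you control such a namespace, publishing the image is just a retag plus `docker push`. The sketch below defaults to a dry run (it prints the commands instead of executing them), since the registry path and namespace are placeholders and pushing requires credentials:

```shell
# Hypothetical sketch: publish the locally built image to an Aliyun registry.
REG=registry.cn-hangzhou.aliyuncs.com/your-namespace

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run docker tag semantic-router-extproc:local "$REG/semantic-router-extproc:latest"
run docker push "$REG/semantic-router-extproc:latest"
```

Set `DRY_RUN=0` (after `docker login` to the registry) to actually run the commands.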

---

## Recommended workflow

**The simplest and most reliable path is Option 1 (local build)**:

```bash
# 1. Switch environments
conda activate vllm

# 2. Enter the project directory
cd /home/jared/vllm-project/semantic-router

# 3. Build the image (uses your VPN)
docker build -t semantic-router-extproc:local -f Dockerfile.extproc .

# 4. Load it into Kind
kind load docker-image semantic-router-extproc:local --name semantic-router-cluster

# 5. Update the config file
# Edit deploy/kubernetes/kustomization.yaml and change the image to:
#   newName: semantic-router-extproc
#   newTag: local

# 6. Redeploy
kubectl delete deployment semantic-router -n vllm-semantic-router-system
kubectl apply -k deploy/kubernetes/
kubectl get pods -n vllm-semantic-router-system -w
```

---

## Verifying the deployment

```bash
# Check pod status
kubectl get pods -n vllm-semantic-router-system

# Inspect pod details
kubectl describe pod -n vllm-semantic-router-system -l app=semantic-router

# Follow the logs
kubectl logs -f deployment/semantic-router -n vllm-semantic-router-system

# Check which images the pods are running
kubectl get pods -n vllm-semantic-router-system -o jsonpath='{.items[*].spec.containers[*].image}'
```

---

## FAQ

### Q: The init container still fails and cannot pull python:3.11-slim

A: This base image needs to be pre-pulled and loaded the same way:

```bash
# Pull locally
docker pull python:3.11-slim

# Load into Kind
kind load docker-image python:3.11-slim --name semantic-router-cluster
```

### Q: Hugging Face model downloads fail

A: Use the Hugging Face mirror site:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

Then add the environment variable to the init container in deployment.yaml:

```yaml
env:
  - name: HF_ENDPOINT
    value: "https://hf-mirror.com"
```

---

## Other notes

1. **Model downloads in the init container**: the deployment's init container downloads models from Hugging Face, which can also fail due to network issues. Consider downloading the models locally over the VPN first and mounting them into the container.

2. **Resource limits**: the current configuration requires a fair amount of resources (6Gi of memory). If your machine is constrained, lower the resource limits in `deploy/kubernetes/deployment.yaml`.

3. **Persistent storage**: the models are stored on a PVC, so make sure the Kind cluster has a usable storage class:

```bash
kubectl get storageclass
```
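
The pre-download-and-mount idea from note 1 can be sketched with Kind's `extraMounts`, which must be set when the cluster is created. This is a minimal hypothetical config, assuming the models have been downloaded to `/home/jared/models` on the host; the paths are placeholders:

```yaml
# kind-config.yaml (hypothetical): expose a host directory to every node
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /home/jared/models   # pre-downloaded models on the host
        containerPath: /models
  - role: worker
    extraMounts:
      - hostPath: /home/jared/models
        containerPath: /models
```

Create the cluster with `kind create cluster --config kind-config.yaml --name semantic-router-cluster`, then mount `/models` into the pod (e.g. via a `hostPath` volume) instead of downloading in the init container.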