aleck31 commented Jul 3, 2025

Fix ECR token expiration issue by implementing automatic token refresh for Nomad cluster nodes that require ECR access.

Affected nodes and ECR token requirements:

  • API Node: E2B API service ✅ ECR token required
  • Client Node: sandbox execution environment ✅ ECR token required
  • Build Node: template build environment ✅ ECR token required
  • Server Node: Nomad/Consul management ❌ No ECR token needed

Modified startup scripts:

  1. infra-iac/terraform/scripts/start-api.sh
  2. infra-iac/terraform/scripts/start-client.sh
  3. infra-iac/terraform/scripts/start-build-cluster.sh

Implementation details:

  • Initial ECR token setup during node startup
  • Automatic token refresh script (/usr/local/bin/refresh-ecr-token.sh)
  • Cron job that refreshes the token every 10 hours, ahead of the 12-hour ECR token validity
  • Nomad service restart after each token refresh
  • Error handling and logging

This ensures continuous Docker image pulling capability from ECR without manual intervention or service disruption.
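
The refresh flow listed above could be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the account ID, region default, log path, the `run` argument, and the helper names `ecr_registry`, `cron_entry`, and `refresh_token` are all assumptions made for the example. Only `aws ecr get-login-password`, `docker login --password-stdin`, and `systemctl restart` are existing commands.

```shell
#!/bin/sh
# Hypothetical sketch of /usr/local/bin/refresh-ecr-token.sh (not the PR's actual script).
set -eu

# Assumed inputs: placeholder values, override through the environment.
AWS_REGION="${AWS_REGION:-us-east-1}"
AWS_ACCOUNT_ID="${AWS_ACCOUNT_ID:-123456789012}"
LOG_FILE="${LOG_FILE:-/var/log/ecr-token-refresh.log}"

# Build the private ECR registry hostname from an account ID and a region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Print a crontab line that runs the script at 00:00, 10:00 and 20:00,
# so the gap between refreshes never exceeds 10 hours.
cron_entry() {
  printf '0 */10 * * * %s run >> %s 2>&1\n' "$1" "$2"
}

# Fetch a fresh token, log Docker in to ECR, then restart Nomad.
refresh_token() {
  registry="$(ecr_registry "$AWS_ACCOUNT_ID" "$AWS_REGION")"
  echo "$(date -u +%FT%TZ) refreshing ECR token for ${registry}"

  # Capture the token first so an AWS CLI failure is detected explicitly.
  if ! password="$(aws ecr get-login-password --region "$AWS_REGION")"; then
    echo "$(date -u +%FT%TZ) ERROR: could not obtain ECR token" >&2
    return 1
  fi

  if ! printf '%s' "$password" \
      | docker login --username AWS --password-stdin "$registry"; then
    echo "$(date -u +%FT%TZ) ERROR: docker login to ${registry} failed" >&2
    return 1
  fi

  # Restart Nomad only after a successful login, as described in the PR.
  systemctl restart nomad
  echo "$(date -u +%FT%TZ) ECR token refreshed"
}

# Side effects happen only when called with the (assumed) "run" argument.
if [ "${1:-}" = "run" ]; then
  refresh_token
fi
```

In this sketch, `cron_entry /usr/local/bin/refresh-ecr-token.sh "$LOG_FILE"` prints a line suitable for root's crontab. Because `*/10` in the hour field fires at hours 0, 10, and 20, the longest gap between runs is 10 hours, which stays inside the 12-hour token validity.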

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

jayden-jia (Contributor) commented Jul 7, 2025

I'm very happy to receive your PR. This token issue has troubled us for a long time. Commit 6de8f9e resolved it by combining amazon-ecr-credential-helper with Nomad, but due to time constraints I only changed the API cluster. If possible, could you apply the same change to the other clusters as well?
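
For readers unfamiliar with the credential-helper approach mentioned here, the following is a rough sketch of one common way to wire it up. It is not taken from commit 6de8f9e: the function name `docker_config_json`, the `write` argument, and the `/root/.docker/config.json` path are assumptions for illustration. The `credHelpers` key and the `ecr-login` helper name (backed by the `docker-credential-ecr-login` binary) come from Docker and amazon-ecr-credential-helper.

```shell
#!/bin/sh
# Hypothetical sketch: generate a Docker client config that delegates
# ECR authentication to amazon-ecr-credential-helper.
set -eu

# Print a config.json that maps one registry hostname ($1) to the
# "ecr-login" helper, so credentials are fetched on demand.
docker_config_json() {
  cat <<EOF
{
  "credHelpers": {
    "$1": "ecr-login"
  }
}
EOF
}

# Write the file only when called as: script write <registry-hostname>
# (the "write" argument and the target path are assumptions).
if [ "${1:-}" = "write" ]; then
  registry="${2:?registry hostname required}"
  mkdir -p /root/.docker
  docker_config_json "$registry" > /root/.docker/config.json
fi
```

With a helper in place, credentials are requested at pull time, so no cron job or Nomad restart is needed to keep a token fresh. Nomad's Docker driver would still have to be pointed at this file or at the helper through its plugin `auth` configuration; how the referenced commit does that is not shown in this thread.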

aleck31 (Author) commented Jul 7, 2025

This patch fixes the ECR token expiration issue on API nodes, client nodes, and build nodes. Since the API node is already fixed, you can drop the API node-related code and keep the other modifications.
