feat: Add EKS capabilities integration #423
Draft
allamand wants to merge 53 commits into riv25 from feat/eks-capabilities-squash
Conversation
Force-pushed from b3b62d0 to a721a7a
- Add EKS capabilities for ArgoCD, Kro, and ACK controllers
- Add Identity Center integration for SSO
- Add multi-cluster ACK role management
- Add JupyterHub addon with SSO integration
- Add Helm chart dependencies for all charts
- Update deployment scripts and utilities
- Add comprehensive documentation and steering guides
- Remove complex pod and deployment checks, since ArgoCD runs as an EKS managed service
- Simplify to only check API availability via kubectl get applications
- Update status messages to reflect the EKS capabilities context
- Remove domain availability checks, as they are not needed for managed ArgoCD
- Create gitops/addons/bootstrap/default/addons/platform-manifests/values.yaml
- Set gpu.enabled to true as static configuration
- Separate static config from dynamic templated values
- Add gpu-nodepool.yaml template for EKS Auto Mode GPU support
- Remove chart default values.yaml (moved to GitOps addon structure)
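For illustration only, a minimal sketch of what a gpu-nodepool.yaml for EKS Auto Mode could look like; the NodeClass reference, instance types, taint, and limits below are assumptions, not the actual template added in this commit:

```yaml
# Hypothetical EKS Auto Mode GPU NodePool sketch (not the PR's exact template).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com            # assumed built-in Auto Mode NodeClass
        kind: NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]   # example GPU instance types
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule                 # keep non-GPU pods off these nodes
  limits:
    nvidia.com/gpu: 4                        # example cap on total provisioned GPUs
```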
- Add Helm chart for pre-pulling container images
- Improves pod startup time by caching images on nodes
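The chart contents are not shown in this excerpt; a common pattern for an image prepuller is a DaemonSet whose init containers pull the target images and whose main container just pauses. A hedged sketch with placeholder image names:

```yaml
# Illustrative image-prepuller pattern (placeholder image list, not the chart's actual values).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      initContainers:
        - name: pull-ray                       # pulling the image onto the node is the only goal
          image: rayproject/ray:2.24.0         # example image to cache on every node
          command: ["sh", "-c", "true"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9     # minimal container to keep the pod Running
```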
- Add Kro ResourceGraphDefinition for Ray Serve deployments
- Enables declarative Ray cluster and service management
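A Kro ResourceGraphDefinition exposes a simple schema plus a set of templated resources. The skeleton below is a hedged sketch: the kind (RayServeApp), field names, and resource ids are illustrative assumptions, not the definition added here.

```yaml
# Minimal Kro ResourceGraphDefinition sketch (illustrative schema and resource names).
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: rayserve.kro.run
spec:
  schema:
    apiVersion: v1alpha1
    kind: RayServeApp                          # the API that platform users would create
    spec:
      name: string | required=true
      image: string | default="rayproject/ray:2.24.0"
  resources:
    - id: rayService
      template:
        apiVersion: ray.io/v1
        kind: RayService
        metadata:
          name: ${schema.spec.name}            # values flow from the user-facing schema
        spec:
          rayClusterConfig:
            headGroupSpec:
              rayStartParams: {}
              template:
                spec:
                  containers:
                    - name: ray-head
                      image: ${schema.spec.image}
```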
- Add GitOps workload definitions for Ray deployments
- Supports Ray Serve ML model serving
- Replace static ray-serve.yaml with Kro-based deployment
- Add comprehensive README for Ray Serve setup
- Update catalog-info with detailed component metadata
- Enhance template with improved parameter handling
- Register ray-serve template in main catalog-info.yaml
- Add Ray Serve deployment to homepage quick actions
- Add image-prepuller addon definition to bootstrap/default
- Add platform-manifests addon definition with GPU support
- Enable addons in control-plane environment
- Move GPU config from valuesObject to values.yaml
- Minor updates to EKS RGD configuration
- Enable image_prepuller addon
- Enable platform_manifests addon
- Enhance error handling and logging in utils.sh
- Add Backstage Redis session storage implementation guide
- Add custom NodePool migration guide for EKS Auto Mode
- Add Ray Service S3 model cache implementation plan

These guides document key platform improvements for session persistence, node lifecycle management, and ML model caching.
- Add Keycloak split-brain detection ConfigMap for cluster health monitoring
- Add custom PEEKS-managed NodePools with optimized consolidation settings
  - peeks-general-purpose: 10m consolidation (vs 30s default)
  - peeks-system: 30m consolidation for critical workloads
  - 48h termination grace period for stability

These additions improve platform reliability and node lifecycle management.
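In Karpenter's v1 API the consolidation and grace-period knobs mentioned above live under spec.disruption and spec.template.spec. A hedged sketch of what the peeks-general-purpose pool might look like; the names and timings come from the commit message, everything else is assumed:

```yaml
# Sketch of a PEEKS-managed NodePool with relaxed consolidation (illustrative only).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: peeks-general-purpose
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m                # wait 10 minutes before consolidating (vs 30s default)
  template:
    spec:
      terminationGracePeriod: 48h        # give workloads up to 48h to drain before forced removal
      nodeClassRef:
        group: eks.amazonaws.com         # assumed EKS Auto Mode NodeClass
        kind: NodeClass
        name: default
```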
Reduce GPU node consolidateAfter from 1h to 5m for faster resource cleanup while maintaining stability for GPU workloads. This improves cost efficiency without impacting running jobs.
Refactor RayService ResourceGraphDefinition to support both CPU and GPU workloads with conditional resource allocation:
- Add rayserviceCpu for CPU-only models (includeWhen: gpu == 0)
- Add rayserviceGpu for GPU-accelerated models (includeWhen: gpu > 0)
- Improve resource specifications and autoscaling configuration
- Add proper labels and annotations for Backstage integration
- Update skeleton manifest with model configuration support

This enables efficient resource allocation based on model requirements.
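The CPU/GPU split relies on Kro's includeWhen expressions. A minimal sketch of the idea, using the resource ids from the commit message; the templates themselves are elided and assumed:

```yaml
# Illustrative use of Kro includeWhen to pick a CPU or GPU RayService variant.
resources:
  - id: rayserviceCpu
    includeWhen:
      - ${schema.spec.gpu == 0}          # only rendered for CPU-only deployments
    template:
      apiVersion: ray.io/v1
      kind: RayService
      # ... CPU resource requests/limits ...
  - id: rayserviceGpu
    includeWhen:
      - ${schema.spec.gpu > 0}           # only rendered when at least one GPU is requested
    template:
      apiVersion: ray.io/v1
      kind: RayService
      # ... GPU resource requests/limits, nvidia.com/gpu limits ...
```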
Add comprehensive model selection and resource configuration:
- Add 5 pre-configured AI models with resource recommendations:
  - DialoGPT-medium (CPU, 1.4GB)
  - Phi-2 (CPU, 5.5GB)
  - TinyLlama (CPU, 2.2GB)
  - Mistral-7B (GPU, 14GB)
  - Llama-2-7B (GPU, 13GB)
- Add model-specific resource defaults and validation
- Add max generation length configuration
- Update default serve config to gpu-demo-serve-config.zip
- Include resource sizing guidance in template description

This simplifies model deployment by providing tested configurations and clear resource requirements for each model type.
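In a Backstage scaffolder template this kind of model picker is typically an enum parameter. A sketch under the assumption that the parameter is named modelId; the enum values simply reuse the names listed above and may not match the actual template:

```yaml
# Hypothetical Backstage scaffolder parameter for model selection (parameter name assumed).
parameters:
  - title: Model Selection
    required:
      - modelId
    properties:
      modelId:
        title: Model
        type: string
        default: DialoGPT-medium
        enum:
          - DialoGPT-medium
          - Phi-2
          - TinyLlama
          - Mistral-7B
          - Llama-2-7B
        enumNames:
          - "DialoGPT-medium (CPU, 1.4GB)"
          - "Phi-2 (CPU, 5.5GB)"
          - "TinyLlama (CPU, 2.2GB)"
          - "Mistral-7B (GPU, 14GB)"
          - "Llama-2-7B (GPU, 13GB)"
```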
Add CPU and memory resource configuration for Flux2 controllers:
- helmController: 100m CPU, 128Mi-256Mi memory
- imageReflectionController: 100m CPU, 128Mi-256Mi memory

This ensures proper resource allocation and prevents OOM issues while maintaining efficient resource usage.
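On the flux2 Helm chart these limits are usually set per controller in values. A sketch using the controller keys named in the commit; the exact key names depend on the chart version and are not confirmed here:

```yaml
# Illustrative Flux2 Helm values; key names follow the commit message and may differ per chart.
helmController:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi
imageReflectionController:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi
```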
- Move GPU configuration from addons.yaml to values.yaml
- Add customNodepools.enabled flag for PEEKS-managed nodepools
- Clean up redundant GPU configuration in addons.yaml

This allows clusters to opt in to custom nodepools with optimized consolidation settings instead of EKS Auto Mode defaults.
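A minimal sketch of the resulting values.yaml, assuming the two flags are top-level keys (the layout is inferred from the commit messages, not taken from the file itself):

```yaml
# Sketch of platform-manifests values.yaml (key layout assumed).
gpu:
  enabled: true              # render the GPU NodePool template
customNodepools:
  enabled: true              # opt in to PEEKS-managed NodePools instead of Auto Mode defaults
```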
Add pre-configured Ray Serve packages:
- cpu-serve-config.zip: CPU-optimized model serving
- gpu-serve-config.zip: GPU-accelerated inference
- gpu-demo-serve-config.zip: Demo configuration with GPU support
- vllm-serve-config.zip: vLLM-based high-performance serving
- vllm_serve.py: vLLM deployment implementation with async engine

These packages provide ready-to-use configurations for different Ray Serve deployment scenarios with appropriate resource allocation.
Signed-off-by: Workshop User <[email protected]>
- Use validate command instead of state list for faster checks
- Add 10s timeout to prevent hanging on locked states
- Handle timeout exit codes properly
- Skip lock check if validation times out
- Switch GPU variant to public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v1
- Downgrade Ray version from 2.34.0 to 2.24.0 for vLLM compatibility
- Add num-cpus: 0 to head node to prevent scheduling workloads on head
- Maintain CPU variant with standard rayproject/ray:2.24.0 image

This enables GPU inference with pre-built vLLM support and proper resource isolation between head and worker nodes.
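In a RayService spec, keeping work off the head node is done through rayStartParams. A sketch of the relevant fragment, with the image and num-cpus setting taken from the commit and the surrounding fields assumed or omitted:

```yaml
# Illustrative RayService fragment: vLLM GPU image plus a head node that takes no tasks.
spec:
  rayClusterConfig:
    rayVersion: "2.24.0"
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"                    # advertise zero CPUs so Ray schedules no work on the head
      template:
        spec:
          containers:
            - name: ray-head
              image: public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v1
    workerGroupSpecs:
      - groupName: gpu-workers
        template:
          spec:
            containers:
              - name: ray-worker
                image: public.ecr.aws/data-on-eks/ray2.24.0-py310-vllm-gpu:v1
                resources:
                  limits:
                    nvidia.com/gpu: 1    # example GPU request per worker
```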
- Split into Basic Configuration and Resource Configuration pages
- Use const values for automatic resource allocation based on CPU/GPU
- Remove detailed model recommendations table (simplified description)
- CPU: 2 head CPU, 8Gi head memory, 4 worker CPU, 16Gi worker memory, 0 GPU
- GPU: 2 head CPU, 16Gi head memory, 8 worker CPU, 48Gi worker memory, 1 GPU

This matches the platform-on-eks-workshop template with proper dynamic resource configuration based on deployment type selection.
Remove verbose prerequisite instructions from template description. Keep it concise and focused on the template's purpose.
Document the production-ready approach for Ray GPU inference:
- Custom pre-built Ray+vLLM images via CodeBuild
- Automated image build pipeline (Terraform → Lambda → CodeBuild → ECR)
- Solutions for HuggingFace token issues and runtime pip failures
- Based on AWS GenAI on EKS workshop proven patterns

This guide explains the architecture and implementation for reliable GPU-accelerated model serving with Ray and vLLM.
Add Helm chart for KubeRay operator deployment:
- Chart.yaml with kuberay-operator v1.2.2 dependency
- Minimal values.yaml for configuration
- Templates for namespace creation

This enables GitOps-managed Ray operator deployment across clusters.
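A wrapper Chart.yaml for this pattern typically just pins the upstream dependency. A sketch, assuming the dependency is pulled from the public KubeRay Helm repository:

```yaml
# Sketch of a wrapper Chart.yaml pinning the upstream kuberay-operator chart.
apiVersion: v2
name: kuberay-operator
description: Wrapper chart for the KubeRay operator
version: 0.1.0
dependencies:
  - name: kuberay-operator
    version: 1.2.2
    repository: https://ray-project.github.io/kuberay-helm/
```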
Add automated Ray+vLLM custom image build infrastructure:
- Dockerfile.ray-vllm: Ray 2.49.0 with vLLM 0.6.4.post1 and CUDA support
- ray-image-build.tf: CodeBuild project for automated image builds
- model-storage.tf: S3 bucket for model caching (optional)
- trigger_codebuild.zip: Lambda function to trigger builds

This enables production-ready GPU inference with pre-built images, eliminating runtime pip install failures and HuggingFace token issues.
Add default values for platform-manifests addon:
- GPU nodepool enabled by default
- Custom nodepools enabled for optimized consolidation

This allows clusters to use PEEKS-managed nodepools with better lifecycle management instead of EKS Auto Mode's aggressive defaults.
Relocate values.yaml from bootstrap/default/addons to default/addons for consistency with addon configuration structure.
Switch from upstream kuberay-operator Helm chart to local wrapper chart. This allows customization and integration with platform-specific configurations like model prestaging and service accounts.
Add ray-worker-sa service account to both CPU and GPU Ray worker pods. This enables Pod Identity for AWS service access (S3 model caching, ECR image pulls) without using node IAM roles.
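Wiring the service account into the worker pod templates is a one-line change per worker group. A hedged fragment with surrounding fields omitted or assumed:

```yaml
# Illustrative fragment: attach ray-worker-sa to a Ray worker group for Pod Identity.
workerGroupSpecs:
  - groupName: cpu-workers
    template:
      spec:
        serviceAccountName: ray-worker-sa   # Pod Identity association grants AWS access (S3, ECR)
        containers:
          - name: ray-worker
            image: rayproject/ray:2.24.0     # example image
```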
- Modified cpu-serve-config.zip to load models from S3-mounted paths
- Added local_files_only=True to prevent HuggingFace downloads
- Added MODEL-MANAGEMENT.md with instructions for adding new models
- CPU and GPU now use consistent model loading approach
Add S3-backed persistent storage for model caching:
- Create PersistentVolume using S3 CSI driver
- Create PersistentVolumeClaim for model access
- Mount /mnt/models in both head and worker pods
- Support configurable S3 bucket via s3ModelBucket parameter
- Upgrade Ray version from 2.24.0 to 2.34.0 for CPU variant
- Add volume mounts to both CPU and GPU variants

This enables fast model loading from S3 without downloading from HuggingFace on every deployment.
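With the Mountpoint for Amazon S3 CSI driver, the bucket is exposed through a statically provisioned PersistentVolume that the Ray pods then mount at /mnt/models. A sketch with placeholder object names; the bucket and region are examples standing in for the s3ModelBucket and awsRegion parameters:

```yaml
# Illustrative S3-backed PV/PVC for model caching (names, bucket, and region are placeholders).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ray-models-pv
spec:
  capacity:
    storage: 100Gi                       # nominal; S3 does not enforce capacity
  accessModes:
    - ReadOnlyMany
  csi:
    driver: s3.csi.aws.com               # Mountpoint for Amazon S3 CSI driver
    volumeHandle: ray-models-volume      # any unique id for static provisioning
    volumeAttributes:
      bucketName: peeks-ray-models       # example bucket (s3ModelBucket parameter)
  mountOptions:
    - region us-west-2                   # example region
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ray-models-pvc
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""                   # bind directly to the static PV above
  volumeName: ray-models-pv
  resources:
    requests:
      storage: 100Gi
```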
Add Terraform module for S3 CSI driver deployment:
- Install Mountpoint for Amazon S3 CSI driver
- Configure IAM role for S3 access via Pod Identity
- Enable ReadOnlyMany access mode for model sharing
- Support for S3 bucket mounting in Ray workloads

This provides the infrastructure for S3-backed model storage.
Add Dockerfile and build script for custom Ray GPU images:
- Based on rayproject/ray:2.34.0-py310-gpu
- Pre-installs vLLM 0.6.4.post1 with CUDA support
- Includes transformers, accelerate, and bitsandbytes
- Build script with ECR push automation

This eliminates runtime pip installs and HuggingFace token issues.
Update template to support S3-backed model caching:
- Add s3ModelBucket parameter with default 'peeks-ray-models'
- Add awsRegion parameter for S3 CSI driver configuration
- Update skeleton manifests with S3 bucket parameters
- Simplify resource configuration with better defaults
- Update catalog-info with proper metadata

This enables users to specify custom S3 buckets for model storage.
Simplify prestaging job to focus on S3 upload:
- Remove Pod Identity validation (handled by S3 CSI driver)
- Streamline download and upload process
- Reduce resource requirements
- Improve error handling and retry logic

With the S3 CSI driver, models are mounted directly rather than downloaded at runtime, making this job optional for pre-warming.
- Update model-storage.tf S3 bucket configuration
- Add ray-serve template to Backstage catalog
- Improve Terraform utils.sh error handling
Switch from public Ray images to custom ECR images:
- Use ray-gpu-optimized image from ECR for all variants
- Add awsAccountId parameter to schema for ECR image path
- Update default modelId to use S3-mounted models path
- Apply custom image to both CPU and GPU head/worker pods

This enables use of pre-built images with vLLM and eliminates HuggingFace authentication issues.
Update README.md to match the latest version from the riv25 branch:
- Simplified single-region deployment (us-west-2)
- Updated CloudFormation template URL
- Use .yaml extension instead of .json
- Use templateBucket variable for cleaner syntax
Force-pushed from 0090bab to 63f98ad
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.