Commits (53)
- `6261feb` feat: Add EKS capabilities integration (Jan 11, 2026)
- `bf6cbc3` feat: Simplify ArgoCD readiness check for EKS capabilities (Jan 12, 2026)
- `47e00f5` feat(platform-manifests): add default values.yaml with GPU enabled (Jan 14, 2026)
- `910e39a` feat(platform-manifests): add GPU nodepool template (Jan 14, 2026)
- `a338d3d` feat(addons): add image-prepuller chart (Jan 14, 2026)
- `d5375bb` feat(kro): add RayService resource group definition (Jan 14, 2026)
- `3391428` feat(workloads): add Ray workload configurations (Jan 14, 2026)
- `83bea73` feat(backstage): enhance Ray Serve template with Kro integration (Jan 14, 2026)
- `b5bdbbe` feat(backstage): add Ray Serve template to catalog (Jan 14, 2026)
- `4dd752b` feat(backstage): update homepage with Ray Serve quick start (Jan 14, 2026)
- `ad96c40` feat(addons): add image-prepuller and platform-manifests addons (Jan 14, 2026)
- `6dd70f1` fix(kro): update EKS resource group definition (Jan 14, 2026)
- `d73fc69` feat(terraform): enable new addons in hub cluster (Jan 14, 2026)
- `07a0783` refactor(terraform): improve deployment script utilities (Jan 14, 2026)
- `a08c941` Fix serve_config.py format for Ray (Jan 14, 2026)
- `e963298` Add __init__.py for Python package (Jan 14, 2026)
- `59c832e` Update app.py to read model config from environment variables (Jan 14, 2026)
- `b93f3b2` docs: Add platform operations and optimization guides (Jan 15, 2026)
- `9c3bc5a` feat: Add Keycloak split-brain detector and custom NodePools (Jan 15, 2026)
- `5283467` perf: Optimize GPU NodePool consolidation timing (Jan 15, 2026)
- `ac628a8` feat: Split Ray Service RGD into CPU and GPU variants (Jan 15, 2026)
- `4ed7df0` feat: Enhance Ray Serve Backstage template with model selection (Jan 15, 2026)
- `d77f5da` perf: Add resource limits to Flux controllers (Jan 15, 2026)
- `cafa7ed` feat: Enable custom NodePools configuration (Jan 16, 2026)
- `a58e43a` feat: Add Ray Serve configuration packages for multiple deployment types (Jan 16, 2026)
- `67a7ab3` chore: Add __pycache__ to gitignore (Jan 16, 2026)
- `8ac3d98` update vllm name (Jan 16, 2026)
- `39c6436` fix: Improve Terraform state lock detection with timeout (Jan 16, 2026)
- `4d260bd` feat: Update Ray Service RGD to use vLLM-optimized images (Jan 16, 2026)
- `f62ad49` feat: Simplify Ray Serve Backstage template with const-based resources (Jan 16, 2026)
- `04f3a59` docs: Simplify create-dev-and-prod-env template description (Jan 16, 2026)
- `e620f01` docs: Add Ray GPU inference production strategy guide (Jan 16, 2026)
- `0bc0e93` feat: Add Ray Operator Helm chart wrapper (Jan 16, 2026)
- `01af3f9` feat: Add Terraform infrastructure for Ray vLLM image builds (Jan 16, 2026)
- `c2abd73` feat: Add platform-manifests addon default configuration (Jan 16, 2026)
- `e1f2a70` refactor: Move platform-manifests values to default/addons location (Jan 16, 2026)
- `8e24167` feat: Use local ray-operator chart instead of upstream (Jan 16, 2026)
- `ff0be67` feat: Add service account to Ray worker pods (Jan 16, 2026)
- `bdc5f35` Fix vllm_serve to handle local model paths (Jan 18, 2026)
- `d491a3a` Add Transformers-based serve for local models (Jan 18, 2026)
- `cc122bf` Fix deployment name (Jan 18, 2026)
- `229b9bb` Add app.py for simpler import (Jan 18, 2026)
- `3df01d0` Fix Ray Serve deployment parameters (Jan 18, 2026)
- `5a4ebd8` Add local_files_only for local model loading (Jan 18, 2026)
- `bc87f9a` Update CPU serve config to use S3 models and add model management docs (Jan 19, 2026)
- `aa537cd` feat: Add S3 CSI model storage to Ray Service RGD (Jan 19, 2026)
- `d060578` feat: Add S3 CSI driver Terraform configuration (Jan 19, 2026)
- `60dfd24` feat: Add optimized Ray GPU Docker image build (Jan 19, 2026)
- `8108d11` feat: Update Ray Serve Backstage template with S3 model storage (Jan 19, 2026)
- `8f04118` refactor: Simplify model prestage job for S3 CSI approach (Jan 19, 2026)
- `c5cafea` chore: Minor updates to Terraform and Backstage configs (Jan 19, 2026)
- `28d2f51` feat: Use custom ECR image for Ray Service deployments (Jan 20, 2026)
- `63f98ad` fix: Use correct README.md from origin/riv25 (Jan 20, 2026)
1 change: 1 addition & 0 deletions .gitignore
@@ -166,3 +166,4 @@ platform/infra/terraform/terraform-aws-observability-accelerator
prodplan
devplan
devdbplan
__pycache__/
16 changes: 14 additions & 2 deletions Taskfile.yml
@@ -8,13 +8,25 @@ tasks:
- helm repo add crossplane-stable https://charts.crossplane.io/stable || true
- helm repo add kubevela https://kubevela.github.io/charts || true
- helm repo add fluxcd-community https://fluxcd-community.github.io/helm-charts || true
- helm repo add apache-airflow https://airflow.apache.org || true
- helm repo add jupyterhub https://hub.jupyter.org/helm-chart/ || true
- helm repo add spark-operator https://kubeflow.github.io/spark-operator || true
- helm repo add mlflow https://community-charts.github.io/helm-charts || true
- helm repo update
- echo "Building dependencies for flux chart..."
- cd ./gitops/addons/charts/flux && helm dependency build
- echo "Building dependencies for crossplane chart..."
- cd ./gitops/addons/charts/crossplane && helm dependency build
- echo "Building dependencies for kubevela chart..."
- cd ./gitops/addons/charts/kubevela && helm dependency build
- echo "Building dependencies for airflow chart..."
- cd ./gitops/addons/charts/airflow && helm dependency build
- echo "Building dependencies for devlake chart..."
- cd ./gitops/addons/charts/devlake && helm dependency build
- echo "Building dependencies for jupyterhub chart..."
- cd ./gitops/addons/charts/jupyterhub && helm dependency build
- echo "Building dependencies for spark-operator chart..."
- cd ./gitops/addons/charts/spark-operator && helm dependency build
- echo "Building dependencies for mlflow chart..."
- cd ./gitops/addons/charts/mlflow && helm dependency build
- echo "All Helm chart dependencies built successfully!"
- echo "Don't forget to commit the generated Chart.lock files and charts/ directories"

10 changes: 10 additions & 0 deletions backstage/packages/app/src/customPlatform/CustomHomepage.tsx
@@ -144,6 +144,11 @@ export const CustomHomepage = () => {
            label: 'Argo Workflows',
            icon: <img src="/backstage/img/argo-workflows.png" alt="Argo Workflows" style={{ width: '24px', height: '24px' }} />,
          },
          {
            url: domainUrl + '/jupyterhub',
            label: 'JupyterHub',
            icon: <img src="/backstage/img/jupyter.png" alt="JupyterHub" style={{ width: '24px', height: '24px' }} />,
          },
          {
            url: domainUrl,
            label: 'Kargo',
@@ -154,6 +159,11 @@
            label: 'Keycloak',
            icon: <img src="/backstage/img/keycloak.png" alt="Keycloak" style={{ width: '24px', height: '24px' }} />,
          },
          {
            url: domainUrl + '/keycloak/realms/platform/protocol/openid-connect/logout',
            label: 'Logout',
            icon: <img src="/backstage/img/logout.png" alt="Logout" style={{ width: '24px', height: '24px' }} />,
          },
        ]}
      />
    </Grid>
131 changes: 131 additions & 0 deletions docs/EKS-Capabilities-ArgoCD-Setup.md
@@ -0,0 +1,131 @@
# EKS Capabilities ArgoCD Setup

This document describes the configuration changes needed when using AWS EKS Capabilities (managed ArgoCD capability) instead of self-hosted ArgoCD.

## Overview

When using EKS Capabilities, ArgoCD runs as a managed service outside the cluster. The cluster must be configured to grant the managed ArgoCD instance the access it needs to manage resources.

## Required Changes

### 1. GitOps Bridge Configuration

Update the `gitops_bridge_bootstrap` module in `platform/infra/terraform/common/argocd.tf` to create the cluster secret with the EKS cluster ARN:

```hcl
module "gitops_bridge_bootstrap" {
source = "gitops-bridge-dev/gitops-bridge/helm"
version = "0.1.0"

create = true
install = false # Skip ArgoCD installation since EKS Capabilities provides it

cluster = {
cluster_name = local.hub_cluster.name
environment = local.hub_cluster.environment
metadata = local.addons_metadata[local.hub_cluster_key]
addons = local.addons[local.hub_cluster_key]
server = data.aws_eks_cluster.clusters[local.hub_cluster_key].arn # Use cluster ARN
}

apps = local.argocd_apps
}
```

**Key points:**
- Set `install = false` to skip ArgoCD installation
- Set `server` to the EKS cluster ARN instead of `https://kubernetes.default.svc`
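
To confirm the value to use for `server`, the cluster ARN can be fetched directly. A minimal sketch using the AWS CLI (`<cluster-name>` is a placeholder):

```bash
# Fetch the EKS cluster ARN for the cluster secret's `server` field
aws eks describe-cluster \
  --name <cluster-name> \
  --query 'cluster.arn' \
  --output text
```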

### 2. EKS Access Policy

The EKS Capabilities ArgoCD role needs cluster admin permissions. Associate the cluster admin policy:

```bash
aws eks associate-access-policy \
  --cluster-name <cluster-name> \
  --principal-arn "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole" \
  --policy-arn "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" \
  --access-scope type=cluster \
  --region <region>
```

**Example:**
```bash
aws eks associate-access-policy \
  --cluster-name peeks-hub \
  --principal-arn "arn:aws:iam::382076407153:role/AmazonEKSCapabilityArgoCDRole" \
  --policy-arn "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" \
  --access-scope type=cluster \
  --region ap-northeast-2
```

### 3. Kubernetes RBAC (Optional)

If additional RBAC is needed beyond EKS access policies, create a ClusterRoleBinding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole"
```

Apply with:
```bash
kubectl apply -f argocd-ack-permissions.yaml
```
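
To confirm the binding grants the expected access, kubectl impersonation can be used; a quick sketch (the `--as` value must match the `subjects` entry exactly):

```bash
# Impersonate the mapped user to verify the ClusterRoleBinding took effect
kubectl auth can-i list deployments --all-namespaces \
  --as "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole"
```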

## Verification

1. Check that the cluster secret is created with the correct ARN:

   ```bash
   kubectl get secret <cluster-name> -n argocd -o jsonpath='{.data.server}' | base64 -d
   ```

2. Verify the access policy is associated:

   ```bash
   aws eks list-associated-access-policies \
     --cluster-name <cluster-name> \
     --principal-arn "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole" \
     --region <region>
   ```

3. Check ArgoCD can sync applications:

   ```bash
   # Using ArgoCD CLI or API
   argocd app list
   ```
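
If a specific application fails to sync, its status can be inspected in more detail; a sketch with a placeholder application name:

```bash
# Show sync status, health, and recent sync errors for one application
argocd app get <app-name>
```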

## Troubleshooting

### Error: "is forbidden: User cannot get/list resource"

**Cause:** The ArgoCD role lacks necessary permissions.

**Solution:** Ensure the `AmazonEKSClusterAdminPolicy` is associated with the ArgoCD role (see step 2 above).

### Error: "there are no clusters with this name"

**Cause:** The cluster secret identifies the cluster only by `name` and does not set `server` to the EKS cluster ARN.

**Solution:** Update the gitops_bridge_bootstrap module to include `server = data.aws_eks_cluster.clusters[...].arn` (see step 1 above).

### Error: "cluster is disabled"

**Cause:** ArgoCD cannot find the cluster by the server URL.

**Solution:** Verify the cluster secret has the correct EKS cluster ARN in the `server` field.
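
To inspect every cluster registered with ArgoCD at once, the standard cluster-secret label can be used; a minimal sketch:

```bash
# List all cluster secrets registered with ArgoCD
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster
```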

## References

- [AWS EKS Access Policies](https://docs.aws.amazon.com/eks/latest/userguide/access-policies.html)
- [GitOps Bridge Module](https://github.com/gitops-bridge-dev/gitops-bridge)
- [ArgoCD Cluster Management](https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#clusters)
131 changes: 131 additions & 0 deletions docs/Marina-ArgoCD-Setup.md
@@ -0,0 +1,131 @@
# Marina ArgoCD Capability Setup

This document describes the configuration changes needed when using AWS EKS Marina (managed ArgoCD capability) instead of self-hosted ArgoCD.

## Overview

When using Marina, ArgoCD runs as a managed service outside the cluster. The cluster must be configured to grant Marina's ArgoCD instance the access it needs to manage resources.

## Required Changes

### 1. GitOps Bridge Configuration

Update the `gitops_bridge_bootstrap` module in `platform/infra/terraform/common/argocd.tf` to create the cluster secret with the EKS cluster ARN:

```hcl
module "gitops_bridge_bootstrap" {
source = "gitops-bridge-dev/gitops-bridge/helm"
version = "0.1.0"

create = true
install = false # Skip ArgoCD installation since Marina provides it

cluster = {
cluster_name = local.hub_cluster.name
environment = local.hub_cluster.environment
metadata = local.addons_metadata[local.hub_cluster_key]
addons = local.addons[local.hub_cluster_key]
server = data.aws_eks_cluster.clusters[local.hub_cluster_key].arn # Use cluster ARN
}

apps = local.argocd_apps
}
```

**Key points:**
- Set `install = false` to skip ArgoCD installation
- Set `server` to the EKS cluster ARN instead of `https://kubernetes.default.svc`
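
Because `install = false`, no ArgoCD components should be deployed in-cluster; a quick sanity check, assuming the `argocd` namespace holds only the bootstrap secrets:

```bash
# With install = false, no ArgoCD pods should be running in-cluster
kubectl get pods -n argocd
```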

### 2. EKS Access Policy

The Marina ArgoCD role needs cluster admin permissions. Associate the cluster admin policy:

```bash
aws eks associate-access-policy \
  --cluster-name <cluster-name> \
  --principal-arn "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole" \
  --policy-arn "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" \
  --access-scope type=cluster \
  --region <region>
```

**Example:**
```bash
aws eks associate-access-policy \
  --cluster-name peeks-hub \
  --principal-arn "arn:aws:iam::382076407153:role/AmazonEKSCapabilityArgoCDRole" \
  --policy-arn "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" \
  --access-scope type=cluster \
  --region ap-northeast-2
```

### 3. Kubernetes RBAC (Optional)

If additional RBAC is needed beyond EKS access policies, create a ClusterRoleBinding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole"
```

Apply with:
```bash
kubectl apply -f argocd-ack-permissions.yaml
```

## Verification

1. Check that the cluster secret is created with the correct ARN:

   ```bash
   kubectl get secret <cluster-name> -n argocd -o jsonpath='{.data.server}' | base64 -d
   ```

2. Verify the access policy is associated:

   ```bash
   aws eks list-associated-access-policies \
     --cluster-name <cluster-name> \
     --principal-arn "arn:aws:iam::<account-id>:role/AmazonEKSCapabilityArgoCDRole" \
     --region <region>
   ```

3. Check ArgoCD can sync applications:

   ```bash
   # Using ArgoCD CLI or API
   argocd app list
   ```

## Troubleshooting

### Error: "is forbidden: User cannot get/list resource"

**Cause:** The ArgoCD role lacks necessary permissions.

**Solution:** Ensure the `AmazonEKSClusterAdminPolicy` is associated with the ArgoCD role (see step 2 above).

### Error: "there are no clusters with this name"

**Cause:** The cluster secret identifies the cluster only by `name` and does not set `server` to the EKS cluster ARN.

**Solution:** Update the gitops_bridge_bootstrap module to include `server = data.aws_eks_cluster.clusters[...].arn` (see step 1 above).

### Error: "cluster is disabled"

**Cause:** ArgoCD cannot find the cluster by the server URL.

**Solution:** Verify the cluster secret has the correct EKS cluster ARN in the `server` field.

## References

- [AWS EKS Access Policies](https://docs.aws.amazon.com/eks/latest/userguide/access-policies.html)
- [GitOps Bridge Module](https://github.com/gitops-bridge-dev/gitops-bridge)
- [ArgoCD Cluster Management](https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#clusters)
12 changes: 12 additions & 0 deletions docs/argocd-ack-permissions.yaml
@@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: "arn:aws:iam::382076407153:role/AmazonEKSCapabilityArgoCDRole"