
Iris: Add Coreweave Integration #2891

Open

rjpower wants to merge 6 commits into main from rjpower/20260218-coreweave

Conversation


@rjpower rjpower commented Feb 19, 2026

Summary

This PR completes the CoreWeave Kubernetes integration (Spiral 2) with a production-ready platform implementation, Kubernetes-native task execution, containerd support, and comprehensive observability.

Key Architectural Changes

New Kubernetes Runtime (iris/cluster/runtime/kubernetes.py)

  • Maps the ContainerRuntime protocol to Kubernetes Pods — each task attempt is a separate Pod with full GPU/RDMA resource claiming
  • In-Pod bundle materialization via emptyDir enables dynamic scheduling across shared NodePools
  • Profiling support (py-spy CPU, memray memory) via kubectl exec

New Containerd Runtime (iris/cluster/runtime/containerd.py)

  • Implements ContainerRuntime using crictl (CRI CLI) for CoreWeave bare-metal nodes where Docker is not available
  • CRI pod sandboxes with host-network namespace for flat endpoint connectivity

Refactored Log Reading

  • New RuntimeLogReader protocol replaces timestamp-based log fetching
  • Byte-offset deduplication fixes duplicate log lines from sub-second polling
  • Per-runtime readers: KubernetesLogReader, ContainerdLogReader, DockerLogReader, ProcessLogReader
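The byte-offset deduplication idea can be sketched in a few lines. This is an illustrative reader, not the PR's actual `RuntimeLogReader` API — class and method names here are assumptions:

```python
class OffsetLogReader:
    """Yields only log lines not seen in previous polls.

    Instead of asking the runtime for "logs since timestamp T" (which
    duplicates lines when polling faster than the timestamp precision),
    fetch the full raw log and skip the bytes already consumed.
    """

    def __init__(self, fetch_raw):
        self._fetch_raw = fetch_raw  # callable returning the full log as bytes
        self._offset = 0             # bytes already delivered to the caller

    def read_new_lines(self) -> list[str]:
        raw = self._fetch_raw()
        if len(raw) <= self._offset:
            return []
        new = raw[self._offset:]
        # Only advance past complete lines; a partial trailing line is
        # re-read on the next poll once it has a newline.
        complete, _, _ = new.rpartition(b"\n")
        if not complete:
            return []
        self._offset += len(complete) + 1
        return complete.decode().splitlines()
```

Because the cursor is a byte position rather than a timestamp, polling at any frequency never re-emits a line.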

Production CoreWeave Platform (iris/cluster/platform/coreweave.py)

  • Shared NodePool Model: one pre-allocated NodePool per scale group (instance type) with CoreWeave autoscaling enabled — Pod-based scheduling takes seconds vs. 20–30 min per-slice NodePool provisioning
  • CoreweaveWorkerHandle: kubectl exec for commands, pod status mapping
  • CoreweaveSliceHandle: lifecycle state machine (CREATING → BOOTSTRAPPING → READY | FAILED)
  • S3 credentials injected via Kubernetes Secrets, with the object-storage endpoint propagated automatically
  • Tunnel support via kubectl port-forward
  • Idempotent controller Deployment/Service/ConfigMap management

Kubectl Wrapper (iris/cluster/k8s/kubectl.py)

  • Encapsulates all kubectl subprocess interactions with consistent timeout/error handling, JSON serialization, and stream log reading

New Features

  • S3 Object Storage: Move bundle/log storage to S3-compatible endpoints (CoreWeave Object Storage)

    • Credentials renamed from CW_KEY_ID/CW_KEY_SECRET → R2_ACCESS_KEY_ID/R2_SECRET_ACCESS_KEY (standard R2/AWS naming)
    • Auto-detect virtual-hosted-style addressing for *.cwobject.com endpoints
    • FSSPEC_S3_* env vars for automatic endpoint propagation
  • Controller Dockerfiles: Dockerfile.controller, Dockerfile.task, Dockerfile.worker

  • CLI Enhancements:

    • iris cluster reload via new platform.reload() abstraction
    • iris ls / iris cat / iris cp / iris browse for S3 bucket inspection
    • Platform-aware debug cleanup/logs commands

Bug Fixes

  1. Spurious ASSIGNED→PENDING regression (d346765): Worker now reports BUILDING (not PENDING) for tasks with pending execution threads; controller ignores worker-reported PENDING in running tasks — fixes backpressure bypass that allowed unbounded task creation in K8s.

  2. Log deduplication (72e59e8): Byte-offset tracking eliminates duplicate log lines from sub-second polling; fixes fsspec/s3fs DirCache staleness and partial-upload races.

  3. S3 range request handling: Avoid passing start=0 to fsspec.cat_file — GCS rejects with 416 when file is empty.

  4. K8s task pod networking: Task pods use IRIS_ADVERTISE_HOST from downward API (status.hostIP) for proper node-local endpoint registration.

  5. Image pull policy: Set imagePullPolicy: Always on all Iris containers to prevent stale cached images.
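Fix 3 amounts to a small guard around the range request. A sketch assuming an fsspec-style filesystem object; `read_from_offset` is an illustrative name, not the PR's:

```python
def read_from_offset(fs, path: str, offset: int) -> bytes:
    """Read a file from a byte offset, avoiding a Range header at offset 0.

    Some S3-compatible stores (GCS among them) reject a range request on an
    empty object with HTTP 416, so only pass start= when actually resuming
    mid-file.
    """
    if offset <= 0:
        # Plain read: no Range header, safe on empty files.
        return fs.cat_file(path)
    return fs.cat_file(path, start=offset)
```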

Test Coverage Added

| Test file | Lines | Description |
| --- | --- | --- |
| test_coreweave_platform.py | 1706 | 17 behavioral tests with FakeKubectl; lifecycle, failure paths, S3 creds, NodePool management |
| test_kubernetes_runtime.py | 208 | Pod creation, resource requests, profiling, env injection |
| test_containerd_runtime.py | 265 | CRI sandbox/container lifecycle, log parsing, resource limits |
| test_coreweave_live_kubernetes_runtime.py | 390 | Integration tests against live K8s |
| test_autoscaler.py | 46 | Scaling behavior on CoreWeave |
| test_gpu_worker_metadata.py | 131 | GPU detection and propagation |

Existing tests updated: test_state.py (heartbeat race regression), test_autoscaler.py, test_scheduler.py, test_task_logging.py, test_worker.py, test_env_probe.py.

Infrastructure & Config

  • K8s manifests (infra/coreweave/k8s/): Namespace, ServiceAccount, RBAC for operator deployment
  • CoreWeave example config (iris/examples/coreweave.yaml): shared NodePools for h100-8x and cpu-erapids
  • Design doc (iris/docs/coreweave-v2.md): architectural rationale for shared NodePool model

Breaking Changes

  • S3 credential env vars renamed: CW_KEY_ID/CW_KEY_SECRET → R2_ACCESS_KEY_ID/R2_SECRET_ACCESS_KEY
  • Log reader API: ContainerHandle.logs(since=...) → log_reader() returning RuntimeLogReader
  • platform.reload() replaces reload_controller()


claude bot commented Feb 19, 2026

Claude encountered an error — View job


Review: CoreWeave Platform Integration for Iris (Spiral 2)

This PR implements the CoreWeave platform backend for Iris: CoreweavePlatform (kubectl-based NodePool + Pod lifecycle), ContainerdRuntime (crictl-based task containers), --runtime flag on the worker, operator K8s manifests, proto extensions, and tests. It follows the design doc at docs/design/coreweave-iris-integration.md (Spiral 2 items).


Specification check: A detailed spec exists. The implementation tracks it closely. Key design decisions (one NodePool per slice, async create_slice, CRI host-network mode, reconcile-driven recovery, containerd socket security context) are all implemented as described.


Findings:

  • ConfigMap template field name mismatch (infra/coreweave/k8s/configmap.yaml:47): The template uses ram: 2048GB under resources, but the proto ScaleGroupResources has memory_bytes (int64 in bytes). This field will be silently ignored during proto deserialization. Should be memory_bytes: 2199023255552 (or whatever the correct conversion is).

  • tunnel() hardcodes service name and port (coreweave.py:797-803): tunnel() ignores the address parameter and hardcodes service_name="iris-controller-svc" and remote_port=10000. If an operator customizes these in the ConfigMap/ControllerVmConfig, tunneling won't reflect the change. The other platform implementations pass through the address parameter. Consider parsing address or accepting config.

  • Worker Pod spec missing tolerations (coreweave.py:596-621): CoreWeave GPU node pools may apply taints (e.g., nvidia.com/gpu=present:NoSchedule or custom taints). Without matching tolerations on the worker Pod, the Pod may remain unschedulable. This depends on CoreWeave's specific node pool behavior -- worth validating during the smoke test. If taints are applied, this is a blocking issue.

  • _list_slices_by_labels makes N+1 kubectl calls (coreweave.py:679-753): One kubectl get nodepools + one kubectl get pod per nodepool. At scale (50+ slices per the design doc's open question), this could be slow. A single kubectl get pods -l iris-managed=true to batch-fetch all worker pods would reduce this to 2 calls. Not blocking, but worth noting for operational performance.

  • _monitor_slice swallows exception type (coreweave.py:482-493): When _wait_for_nodepool_ready raises QuotaExhaustedError inside the background thread, the catch-all except Exception logs it but the handle just becomes FAILED. The autoscaler cannot distinguish quota exhaustion from generic failure. The synchronous quota check on initial kubectl apply (line 437-438) handles the immediate case, but quota errors that manifest only in NodePool status conditions are silently downgraded.

  • Progress doc checked in (lib/iris/docs/coreweave-progress.md): 438-line progress log is implementation-internal and should probably not be in the final PR.
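The batched alternative suggested in the N+1 finding above could look like the following — fetch every iris-managed worker pod in one `kubectl get pods -l ...` call and group them by nodepool. The label names here are assumptions, not necessarily what the PR uses:

```python
from collections import defaultdict


def group_pods_by_nodepool(pods_json: dict) -> dict[str, list[str]]:
    """Group the result of one batched pod listing by nodepool label.

    `pods_json` is the parsed output of a single
    `kubectl get pods -l iris-managed=true -o json` call, replacing one
    `kubectl get pod` per NodePool with a constant two-call pattern.
    """
    by_pool: dict[str, list[str]] = defaultdict(list)
    for pod in pods_json.get("items", []):
        pool = pod["metadata"].get("labels", {}).get("iris-nodepool")
        if pool:
            by_pool[pool].append(pod["metadata"]["name"])
    return dict(by_pool)
```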

@rjpower rjpower force-pushed the rjpower/20260218-coreweave branch 2 times, most recently from ccf895d to 75f83e3 on February 20, 2026 at 20:32
CoreWeave Spiral 2, Stage 1: proto extensions + progress log

Extend config.proto for CoreWeave platform support:
- CoreweavePlatformConfig: add namespace, kubeconfig_path fields
- CoreweaveSliceConfig: add gpus_per_node, gpu_class, infiniband fields
- CoreweaveControllerConfig: new message with port and service_name
- ControllerVmConfig: wire CoreweaveControllerConfig as oneof field 4
- BootstrapConfig: add runtime and task_image fields
- Regenerate proto bindings

Also update design doc with implementation progress tracking and
add coreweave-progress.md execution log.

CoreWeave Spiral 2, Stage 2: ContainerdRuntime implementation

Implement ContainerdRuntime and ContainerdContainerHandle using crictl
for containerd-based hosts (CoreWeave bare metal). Follows the
ContainerRuntime/ContainerHandle protocols from runtime/types.py.

Key behaviors:
- CRI pod sandbox per task with host-network (namespace_options.network=NODE)
- Two-phase lifecycle: build() for setup commands, run() for main command
- crictl pull/runp/create/start/inspect/logs/stats/rm for full lifecycle
- Runtime endpoint threaded through all crictl calls
- Labels (iris.managed, iris.task_id, iris.job_id) for discoverability

Integration tests run against local containerd socket, skip when unavailable.

CoreWeave Spiral 2, Stage 5: operator K8s manifests

Add operator-managed Kubernetes manifests for deploying Iris controller
on CoreWeave CKS clusters under infra/coreweave/k8s/:

- namespace.yaml: iris namespace
- service-account.yaml, cluster-role.yaml, cluster-role-binding.yaml: RBAC
- configmap.yaml: template Iris cluster config with operator guidance
- controller-nodepool.yaml: cd-gp-i64-erapids (smallest CPU instance)
- controller-deployment.yaml: controller pod with config/secret mounts
- controller-service.yaml: ClusterIP service on port 10000

CoreWeave Spiral 2, Stage 3: --runtime flag + crictl in worker image

Add --runtime docker|containerd flag to worker serve command. Selects
ContainerdRuntime or DockerRuntime at startup and passes to Worker.
Install crictl v1.32.0 in Dockerfile.worker so the single image supports
both Docker (GCP) and containerd (CoreWeave) backends.

CoreWeave Spiral 2, Stage 4: CoreweavePlatform implementation

Replace the CoreweavePlatform stub with a full kubectl-based
implementation for CoreWeave CKS clusters.

CoreweaveWorkerHandle: kubectl exec for run_command (with on_line
streaming support), pod status mapping to CloudWorkerState.

CoreweaveSliceHandle: lifecycle state machine (CREATING -> BOOTSTRAPPING
-> READY | FAILED) with thread-safe state transitions. terminate()
deletes Pod then NodePool.

CoreweavePlatform: create_slice applies NodePool CRD and spawns
background monitoring thread. list_slices/list_all_slices query by
label. discover_controller forms K8s Service DNS name. tunnel via
kubectl port-forward. Quota error detection from NodePool conditions.

17 behavioral tests using FakeKubectl to intercept subprocess calls.

CoreWeave Spiral 2, Stage 6: CoreWeave controller Dockerfile + CI

Add Dockerfile.controller.coreweave: python:3.11-slim with kubectl
(no gcloud) for CoreWeave in-cluster deployment. Add to CI matrix
in docker-images.yaml for automatic GHCR publishing.

CoreWeave Spiral 2, Stage 7: validation results

All test suites pass (546 cluster tests, 65 E2E tests), pre-commit
clean. Update progress log with final validation results.

CoreWeave Spiral 2: fix blockers from final review

- Add security context to worker Pod (Decision 10: no privilege
  escalation, drop all capabilities, RuntimeDefault seccomp)
- Add hostNetwork: true + ClusterFirstWithHostNet DNS policy to worker
  Pod for flat-network endpoint reachability (Decision 8)
- Change restartPolicy from Never to Always for crash recovery
- Reject num_vms > 1 with clear ValueError (design doc non-goal)
- Fix controller-deployment.yaml image to iris-controller-coreweave

Fix codex review findings: runtime API, lifecycle, container generalization

- Add remove_all_iris_containers() to ContainerRuntime protocol; rename
  ContainerdRuntime.remove_all_iris_sandboxes to match (fixes crash on
  containerd worker startup)
- Fail fast in CoreweavePlatform.create_slice when bootstrap_config is
  missing (prevents slices stuck in CREATING forever)
- Remove Dockerfile.controller.coreweave; add kubectl to generic
  Dockerfile.controller; update k8s manifests and CI
- Add containerd pytest marker, exclude from default test runs
- Replace raw time.monotonic() deadline patterns with Deadline.from_seconds()
- Move local imports to top of test file
- Extract _run_kubectl() top-level helper to deduplicate kubectl subprocess calls

Cleanups.

Add platform.reload() abstraction, remove reload_controller

Replace the SSH/Docker-specific reload_controller() with a platform.reload()
method on the Platform protocol. Each platform now implements its own reload
strategy:

- GCP/Manual: terminate all worker slices, stop controller, start controller
- CoreWeave: update ConfigMap, reload worker Pods in parallel, then rolling
  update controller Deployment via kubectl set image + rollout status
- Local: stop + start in-process controller

The CLI `iris cluster reload` now calls platform.reload(config) directly.

Make debug cleanup and logs commands platform-aware for CoreWeave

The debug CLI's cleanup and logs commands were hardcoded to require a GCP
platform config, causing immediate failure when run with a CoreWeave config.
The smoke test uses both commands, so CoreWeave integration testing was
completely blocked at Phase 0.

- Add _get_platform_type() helper to detect platform without assuming GCP
- Add _cleanup_coreweave() that uses kubectl to find and delete iris-managed
  NodePools, Pods, and controller resources (Deployment, Service, ConfigMap)
- Add _make_kubectl() helper to create Kubectl from CoreWeave config
- Update cleanup command to dispatch based on platform type
- Update logs command to use kubectl logs for CoreWeave deployments

Smoke test: auto-detect GPU/TPU, add multi-GPU tests, add debug loop

- Add gpu_device() helper to types.py (mirrors tpu_device())
- Smoke test auto-detects accelerator type from config instead of
  hardcoded --tpu-type flag. GPU configs run nvidia-smi device check,
  concurrent GPU jobs, and region-constrained GPU job. TPU configs
  run existing test suite with region/replicas from config.
- Add debug-coreweave-loop.py: automated debug loop that runs smoke
  test, hands failures to Claude, and commits changes between rounds.
  Only the smoke test exit code determines pass/fail — Claude cannot
  self-declare PASSED.

now we're really smoking.

Fix smoke test in local mode.

CoreWeave shared NodePool model: fix RFC 1123 names, add minNodes/maxNodes, colocate controller

- Fix NodePool names: sanitize underscores to hyphens for RFC 1123 compliance
  (e.g. iris-cpu_erapids → iris-cpu-erapids)
- Add required minNodes/maxNodes to NodePool spec from scale group config;
  error if min_slices/max_slices not set
- Remove targetNodes field (conflicts with autoscaler)
- Add scale_group field to CoreweaveControllerConfig proto: controller runs
  on an existing scale group's NodePool instead of a dedicated one
- Delete stale managed NodePools not in current config
- Rename scale groups in coreweave.yaml: cpu_erapids → cpu-erapids,
  h100_8x → h100-8x
- kubectl: always log stdin content for apply commands (debugging)
- Update tests for shared NodePool model
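The RFC 1123 sanitization described above can be sketched as a small helper; the function name and prefix are illustrative, and this version also lowercases and trims, which a strict RFC 1123 label requires:

```python
import re


def rfc1123_name(raw: str, prefix: str = "iris-") -> str:
    """Turn a scale-group name into an RFC 1123-safe Kubernetes name.

    Lowercase, underscores to hyphens, any other invalid character to a
    hyphen, trimmed of leading/trailing hyphens, capped at the 63-char
    DNS label limit.
    """
    name = (prefix + raw).lower().replace("_", "-")
    name = re.sub(r"[^a-z0-9-]", "-", name).strip("-")
    return name[:63]
```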

S3 storage support for CoreWeave: move bundle_prefix to StorageConfig, add S3 credentials and object storage endpoint

Move bundle_prefix from ControllerVmConfig to a new top-level StorageConfig
proto message (with log_prefix). Add S3 credential management via K8s Secrets
(CW_KEY_ID/CW_KEY_SECRET env vars), object_storage_endpoint propagation to
pods, and kubeconfig stripping from the in-cluster ConfigMap.

Improve CoreWeave platform reliability: reconcile NodePool specs on every
start (instead of skip-if-exists), clamp targetNodes to prevent autoscaler
runaway, async NodePool deletion, longer kubectl timeouts, and controller
pod health checking with fatal error fast-fail (ImagePullBackOff,
FailedMount, etc.).

Tee all subprocess output to stdout and log files in debug-coreweave-loop

- Add FileHandler to logging so logger output goes to both stdout and debug-loop.log
- Switch run_smoke_test from subprocess.run to Popen with line-by-line streaming
- Add _run_and_tee helper used by cleanup_kubernetes and collect_kubectl_diagnostics
- Move configure_logging after log_dir resolution so it can receive the path

debug-coreweave step 1: FAILED

_normalize_scale_group_resources rejects proto field names (memory_bytes, disk_bytes) when loading ConfigMap JSON

debug-coreweave step 2: FAILED

ConfigMap serialization uses raw MessageToDict producing proto field names (memory_bytes) instead of config_to_dict which normalizes to YAML names (ram)

debug-coreweave step 3: FAILED

ConfigMap JSON includes namespace and object_storage_endpoint fields that the deployed controller image's protobuf does not recognize

debug-coreweave step 4: FAILED

ConfigMap JSON includes bootstrap.runtime and bootstrap.task_image fields that the deployed controller image's protobuf does not recognize, causing ParseDict to fail

Fix S3 bundle storage for CoreWeave and centralize fsspec config

The controller's BundleStore.write_bundle() was failing with "The
specified key does not exist" when writing bundles to CoreWeave S3
object storage. Two root causes:

1. fsspec.url_to_fs("s3://...") creates an S3FileSystem with no
   endpoint_url. Whether AWS_ENDPOINT_URL is picked up depends on the
   boto3 version — fragile in Docker images.

2. The write-tmp-then-rename pattern (fs.rename) is unreliable on
   S3-compatible stores. S3 has no native rename; s3fs implements it
   as copy+delete which breaks on CoreWeave Object Storage. S3 PUTs
   are already atomic, making the pattern unnecessary.

Fix: introduce iris/cluster/fs.py as the single point of fsspec
configuration. It wraps url_to_fs(), filesystem(), and open() to
inject endpoint_url from AWS_ENDPOINT_URL for all S3 paths. All four
fsspec call sites now go through this module:
- controller/bundle_store.py (bundle uploads)
- worker/bundle_cache.py (bundle downloads)
- task_logging.py FsspecLogSink (log writes)
- task_logging.py LogReader (log reads)
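The core of that single configuration point can be sketched as a pure helper that computes the extra storage options for s3:// URLs. Names are illustrative; the real iris/cluster/fs.py wraps url_to_fs()/filesystem()/open() directly rather than exposing a dict:

```python
import os


def s3_storage_options(url: str, env=os.environ) -> dict:
    """Extra fsspec/s3fs options for S3 URLs.

    Injects endpoint_url from AWS_ENDPOINT_URL explicitly so behavior does
    not depend on whether the installed boto3 version reads that env var
    itself.
    """
    if not url.startswith("s3://"):
        return {}
    opts: dict = {}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:
        opts["endpoint_url"] = endpoint
    return opts
```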

Also includes prior debug iteration fixes: GHCR image support in CLI,
controller address fallback, rollout restart on deploy, ConfigMap
field stripping cleanup, docker push noise filtering, and image tag
pinning in the debug loop.

More smoke logging.

Move gpu_count to SliceConfig, propagate error messages, detect pod failures early

- Move gpus_per_node from CoreweaveSliceConfig to gpu_count on SliceConfig,
  populated via prepare_slice_config from ScaleGroupResources
- Thread error messages through slice failure path (SliceState -> SliceStatus
  -> SliceInfo proto) and log them in smoke test
- Detect fatal pod conditions (ImagePullBackOff, CrashLoopBackOff, etc.)
  during _wait_for_pod_ready instead of waiting for timeout
- Add serviceAccountName to worker pods
- Fix S3 storage detection to use config instead of instance attribute
- Remove duplicate copyright headers, update image tags

debug-coreweave step 1: FAILED

AT_CAPACITY groups reject demand routing, causing current_demand=0 and immediate scale-down of ready slices

Fix CoreWeave GPU scaledown bug, add GPU variant matching and process log sink

- Fix freshly-ready slices being immediately scaled down by initializing
  last_active in mark_slice_ready() (was epoch(0), making all new slices
  appear idle)
- Add substring-based device variant matching so short config names
  (e.g. "H100") match full nvidia-smi names ("NVIDIA H100 80GB HBM3")
- Propagate GPU metadata (count, name, memory) through worker attributes
  and display in dashboard Device column
- Add ProcessLogSink for periodic worker log upload to S3/fsspec storage
- Fix S3 credential passing (explicit key/secret to s3fs) and skip
  makedirs on S3 (not supported by all endpoints)
- Add diagnostic logging throughout autoscaler, scheduler, and platform
- Rewrite test-s3.py as interactive TUI bucket browser
- Add smoke-test worker process log dumping on failure

Cleanups.

Getting it working.

debug-coreweave step 1: FAILED

Docker image tag mismatch: _build_image() tags with short git SHA but push uses config's longer hash

debug-coreweave step 2: FAILED

Task image pushed as :latest instead of versioned tag because _build_and_push_task_image hardcoded version='latest' and used worker_tag instead of task_tag from config

debug-coreweave step 3: FAILED

CRI pod sandbox metadata missing required uid field and name contains invalid forward slashes from task_id

debug-coreweave step 4: FAILED

K8s uses stale cached images because imagePullPolicy defaults to IfNotPresent for versioned tags — none of the 3 prior fixes ever ran on the remote worker

Fix CoreWeave S3 virtual-hosted addressing for object storage

CoreWeave Object Storage endpoints (*.cwobject.com) use virtual-hosted-style
addressing where the bucket name is a subdomain. The code was passing these
bucket-specific URLs as endpoint_url to s3fs, which then double-prefixed the
bucket name — resulting in silent empty listings and objects stored under
wrong key paths.

Fix: detect CoreWeave endpoints in _s3_storage_options() and force virtual
addressing style. Update the example config to use the base endpoint
(https://cwobject.com) instead of the bucket-specific URL.
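The endpoint detection can be sketched as a hostname check; the function name is illustrative (the commit calls its variant `_s3_storage_options()`), and the corresponding s3fs/botocore switch would be the real `addressing_style: "virtual"` config option:

```python
from urllib.parse import urlparse


def needs_virtual_addressing(endpoint_url: str) -> bool:
    """True for CoreWeave Object Storage endpoints (*.cwobject.com).

    These endpoints use virtual-hosted-style addressing (bucket as a
    subdomain); forcing virtual addressing prevents s3fs from prefixing
    the bucket onto the key path a second time.
    """
    host = urlparse(endpoint_url).hostname or ""
    return host == "cwobject.com" or host.endswith(".cwobject.com")
```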

Rename test-s3.py to s3-browser.py, add file viewer with JSON table rendering

Press Enter on files to view contents in a scrollable viewer.
JSON/JSONL files are rendered as tables via tabulate (list-of-dicts,
single dict, list-of-lists); non-tabular data falls back to pretty JSON
or plain text.

Blargh

Blargh,v 2

Build it.

Fix line-too-long lint in CoreWeave hostname check

gpus.

debug-coreweave step 1: FAILED

Task IDs with slashes used directly as K8s label values; sanitize them

better build yo

debug-coreweave step 1: FAILED

KubernetesContainerHandle.status() treats K8s terminated reason 'Completed' as an error, causing successful tasks to be marked FAILED

Fix logging etc.

Pin latest iris images on cluster start/reload

WIP

loggin
…straint API

Move log reading from RemoteClusterClient (which required S3 credentials) to the
controller via GetTaskLogs RPC. This simplifies wait_for_job() and get_task_logs()
significantly and avoids client-side storage access.

Also harden region_constraint() to reject bare strings with a TypeError, update
callers to pass [region], and fix tests for moved modules (kubectl -> k8s.kubectl,
parse_artifact_registry_tag -> bootstrap), changed APIs (bundle_cache -> runtime),
and newly required docker_image config defaults.
…o R2 env vars

Replace timestamp-based incremental log fetching (logs(since=...)) with a
RuntimeLogReader protocol backed by byte-offset deduplication, fixing duplicate
log lines when polling at sub-second granularity (kubectl --since-time has only
second precision).

Key changes:
- ContainerHandle.logs(since) → log_reader() returning RuntimeLogReader
- KubernetesLogReader: byte-offset into raw kubectl output
- ContainerdLogReader: byte-offset into CRI log file read directly
- DockerLogReader: timestamp cursor + 1ms (Docker --since has sub-second precision)
- ProcessLogReader: list-index cursor into in-memory log list
- task_attempt._stream_logs simplified to take RuntimeLogReader, no return value

Also:
- Rename CW_KEY_ID/CW_KEY_SECRET → R2_ACCESS_KEY_ID/R2_SECRET_ACCESS_KEY throughout
- Fix IRIS_ADVERTISE_HOST in K8s task pods to use downward API (status.hostIP)
  instead of inheriting the worker's static IP
- Add /root/.cargo/target bind mount and CARGO_TARGET_DIR env to cache Rust builds
- Extract build_kvstore_spec() in tensorstore_serialization to explicitly forward
  AWS_ENDPOINT_URL/region to tensorstore (it doesn't read env vars itself)
- Add CLI subcommands (ls, cat, cp, browse) to s3-browser.py; load config from YAML
- Add _wrap_in_venv_shell() for profiling commands to resolve venv symlinks via PATH
@rjpower rjpower force-pushed the rjpower/20260218-coreweave branch from 75f83e3 to 72e59e8 on February 20, 2026 at 22:40
TaskAttempt initializes with status=PENDING and only transitions to
BUILDING when the execution thread calls _download_bundle(). A heartbeat
between assignment and thread start would report PENDING to the controller,
which accepted it as a valid state change from ASSIGNED. This silently
removed the task from the building-count backpressure window, allowing the
scheduler to assign more tasks than the limit and overwhelming the K8s pod
API with simultaneous creations.

Fix:
- Worker: report BUILDING (not PENDING) for tasks whose thread hasn't
  started yet - this is the accurate state since execution is imminent.
- Controller: ignore worker-reported PENDING in running_tasks; workers
  should never legitimately report PENDING and accepting it causes
  silent backpressure bypass.
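The two-sided fix can be sketched as follows. The state names mirror the description above, but the function shapes are illustrative, not the worker/controller's actual API:

```python
PENDING, ASSIGNED, BUILDING, RUNNING = "PENDING", "ASSIGNED", "BUILDING", "RUNNING"


def worker_reported_state(attempt_state: str, thread_started: bool) -> str:
    # Worker side: a task whose execution thread hasn't started yet is
    # reported as BUILDING, never PENDING -- execution is imminent.
    if attempt_state == PENDING and not thread_started:
        return BUILDING
    return attempt_state


def controller_accepts(reported: str) -> bool:
    # Controller side: workers never legitimately report PENDING for a
    # running task; accepting it would silently drop the task from the
    # building-count backpressure window.
    return reported != PENDING
```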

@yonromai yonromai left a comment


Follow-up inline comments (non-blocking).

Generated with Codex

```python
last_log_time = self._stream_logs(last_log_time)
self._stream_logs(log_reader)

# Read result file only if container succeeded
```

Priority: High

This completion path only reads _result.pkl from worker-local self.workdir. For KubernetesRuntime, /app is pod-local emptyDir, so callable results stay inside the task pod and are never copied back to the worker filesystem. That means successful callable tasks can return no result.

Recommended fix: add a runtime-specific result export step (for example kubectl read_file for /app/_result.pkl) before this host-side read, or add a runtime hook so result collection is runtime-aware.

Generated with Codex

```python
records = self._log_buffer.query_since(self._last_seq, limit=self._max_entries)
if not records:
    return
self._last_seq = max(r.seq for r in records)
```

Priority: High

_last_seq is advanced before _append_lines. If object storage write fails once, this batch is skipped permanently on the next sync, causing silent process-log loss.

Recommended fix: only advance _last_seq after _append_lines succeeds (or restore cursor on exception) so failed batches are retried.

Generated with Codex
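The recommended cursor-after-success pattern can be sketched as below. The class shape is illustrative; only the advance-after-append ordering is the point:

```python
class LogSink:
    """Uploads new log records; retries failed batches on the next sync."""

    def __init__(self, buffer, append_lines):
        self._buffer = buffer            # exposes query_since(seq, limit=...)
        self._append_lines = append_lines
        self._last_seq = 0

    def sync(self, max_entries: int = 1000) -> None:
        records = self._buffer.query_since(self._last_seq, limit=max_entries)
        if not records:
            return
        # May raise (e.g. object-storage outage); the cursor has NOT been
        # moved yet, so the same batch is retried on the next sync.
        self._append_lines(records)
        self._last_seq = max(r.seq for r in records)
```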


```python
# -- NodePool Management ----------------------------------------------------

def _nodepool_name(self, scale_group: str) -> str:
```

Priority: High

_nodepool_name() only replaces _ with - and does not lowercase/sanitize. If a scale-group name contains uppercase characters, generated NodePool names violate RFC-1123 and kubectl apply fails.

Recommended fix: normalize generated names to lowercase RFC-1123-safe format before creating/reconciling NodePools.

Generated with Codex

```python
    self.stop_controller(config)
    return target_names

def reload(self, config: config_pb2.IrisClusterConfig) -> str:
```

Priority: High

reload() updates ConfigMap/pods/image but does not recompute S3 mode or reconcile secret/env wiring. A reload that changes storage from gs:// to s3:// can leave controller/workers without required AWS_*/FSSPEC_S3 env.

Recommended fix: recompute S3 mode in reload(), ensure/remove the secret accordingly, and re-apply controller deployment env (or equivalent rollout) before worker reload.

Generated with Codex

```python
"""Return new log lines since the last read. Advances the cursor by 1ms past the last line."""
if not self._container_id:
    return []
lines = _docker_logs(self._container_id, since=self._last_timestamp)
```

Priority: Medium

Cursor advancement uses lines[-1], but _docker_logs() appends stdout then stderr blocks. If stderr’s last timestamp is older than latest stdout, the cursor regresses and next poll duplicates previously streamed lines.

Recommended fix: advance cursor using the max timestamp across all returned lines.

Generated with Codex
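The max-timestamp fix can be sketched as a small cursor helper. It assumes, as `docker logs --timestamps` guarantees, that each line starts with an RFC 3339 timestamp (which compares correctly as a string); the function name is illustrative:

```python
def advance_cursor(lines: list[str], current: str) -> str:
    """Advance a Docker log cursor past every returned line.

    Uses the maximum timestamp across all lines rather than the last
    line's, since stdout and stderr blocks are concatenated and may be
    out of chronological order. Never moves the cursor backwards.
    """
    timestamps = [ln.split(" ", 1)[0] for ln in lines if ln]
    return max(timestamps + [current])
```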

@rjpower rjpower force-pushed the rjpower/20260218-coreweave branch from e2f9712 to 6fa91d8 on February 20, 2026 at 23:57