
Iris: Add Coreweave Integration #2891

Open

rjpower wants to merge 6 commits into main from rjpower/20260218-coreweave

Conversation


@rjpower rjpower commented Feb 19, 2026

Summary

This PR completes the CoreWeave Kubernetes integration (Spiral 2) with a production-ready platform implementation, Kubernetes-native task execution, containerd support, and comprehensive observability.

Key Architectural Changes

New Kubernetes Runtime (iris/cluster/runtime/kubernetes.py)

  • Maps the ContainerRuntime protocol to Kubernetes Pods — each task attempt is a separate Pod with full GPU/RDMA resource claiming
  • In-Pod bundle materialization via emptyDir enables dynamic scheduling across shared NodePools
  • Profiling support (py-spy CPU, memray memory) via kubectl exec

New Containerd Runtime (iris/cluster/runtime/containerd.py)

  • Implements ContainerRuntime using crictl (CRI CLI) for CoreWeave bare-metal nodes where Docker is not available
  • CRI pod sandboxes with host-network namespace for flat endpoint connectivity

Refactored Log Reading

  • New RuntimeLogReader protocol replaces timestamp-based log fetching
  • Byte-offset deduplication fixes duplicate log lines from sub-second polling
  • Per-runtime readers: KubernetesLogReader, ContainerdLogReader, DockerLogReader, ProcessLogReader
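The byte-offset deduplication idea can be sketched in a few lines. This is an illustrative reader, not the PR's actual `RuntimeLogReader` API — class and method names here are assumptions:

```python
class OffsetLogReader:
    """Yields only log lines not seen in previous polls.

    Instead of asking the runtime for "logs since timestamp T" (which
    duplicates lines when polling faster than the timestamp precision),
    fetch the full raw log and skip the bytes already consumed.
    """

    def __init__(self, fetch_raw):
        self._fetch_raw = fetch_raw  # callable returning the full log as bytes
        self._offset = 0             # bytes already delivered to the caller

    def read_new_lines(self) -> list[str]:
        raw = self._fetch_raw()
        if len(raw) <= self._offset:
            return []
        new = raw[self._offset:]
        # Only advance past complete lines; a partial trailing line is
        # re-read on the next poll once it has a newline.
        complete, _, _ = new.rpartition(b"\n")
        if not complete:
            return []
        self._offset += len(complete) + 1
        return complete.decode().splitlines()
```

Because the cursor is a byte position rather than a timestamp, polling at any frequency never re-emits a line.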

Production CoreWeave Platform (iris/cluster/platform/coreweave.py)

  • Shared NodePool Model: one pre-allocated NodePool per scale group (instance type) with CoreWeave autoscaling enabled — Pod-based scheduling takes seconds vs. 20–30 min per-slice NodePool provisioning
  • CoreweaveWorkerHandle: kubectl exec for commands, pod status mapping
  • CoreweaveSliceHandle: lifecycle state machine (CREATING → BOOTSTRAPPING → READY | FAILED)
  • S3 credentials injected via Kubernetes Secrets, with the object-storage endpoint propagated automatically
  • Tunnel support via kubectl port-forward
  • Idempotent controller Deployment/Service/ConfigMap management

Kubectl Wrapper (iris/cluster/k8s/kubectl.py)

  • Encapsulates all kubectl subprocess interactions with consistent timeout/error handling, JSON serialization, and stream log reading

New Features

  • S3 Object Storage: Move bundle/log storage to S3-compatible endpoints (CoreWeave Object Storage)

    • Credentials renamed from CW_KEY_ID/CW_KEY_SECRET → R2_ACCESS_KEY_ID/R2_SECRET_ACCESS_KEY (standard R2/AWS naming)
    • Auto-detect virtual-hosted-style addressing for *.cwobject.com endpoints
    • FSSPEC_S3_* env vars for automatic endpoint propagation
  • Controller Dockerfiles: Dockerfile.controller, Dockerfile.task, Dockerfile.worker

  • CLI Enhancements:

    • iris cluster reload via new platform.reload() abstraction
    • iris ls / iris cat / iris cp / iris browse for S3 bucket inspection
    • Platform-aware debug cleanup/logs commands

Bug Fixes

  1. Spurious ASSIGNED→PENDING regression (d346765): Worker now reports BUILDING (not PENDING) for tasks with pending execution threads; controller ignores worker-reported PENDING in running tasks — fixes backpressure bypass that allowed unbounded task creation in K8s.

  2. Log deduplication (72e59e8): Byte-offset tracking eliminates duplicate log lines from sub-second polling; fixes fsspec/s3fs DirCache staleness and partial-upload races.

  3. S3 range request handling: Avoid passing start=0 to fsspec.cat_file — GCS rejects with 416 when file is empty.

  4. K8s task pod networking: Task pods use IRIS_ADVERTISE_HOST from downward API (status.hostIP) for proper node-local endpoint registration.

  5. Image pull policy: Set imagePullPolicy: Always on all Iris containers to prevent stale cached images.
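Fix 3 amounts to a small guard around the range request. A sketch assuming an fsspec-style filesystem object; `read_from_offset` is an illustrative name, not the PR's:

```python
def read_from_offset(fs, path: str, offset: int) -> bytes:
    """Read a file from a byte offset, avoiding a Range header at offset 0.

    Some S3-compatible stores (GCS among them) reject a range request on an
    empty object with HTTP 416, so only pass start= when actually resuming
    mid-file.
    """
    if offset <= 0:
        # Plain read: no Range header, safe on empty files.
        return fs.cat_file(path)
    return fs.cat_file(path, start=offset)
```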

Test Coverage Added

| Test file | Lines | Description |
| --- | --- | --- |
| test_coreweave_platform.py | 1706 | 17 behavioral tests with FakeKubectl; lifecycle, failure paths, S3 creds, NodePool management |
| test_kubernetes_runtime.py | 208 | Pod creation, resource requests, profiling, env injection |
| test_containerd_runtime.py | 265 | CRI sandbox/container lifecycle, log parsing, resource limits |
| test_coreweave_live_kubernetes_runtime.py | 390 | Integration tests against live K8s |
| test_autoscaler.py | 46 | Scaling behavior on CoreWeave |
| test_gpu_worker_metadata.py | 131 | GPU detection and propagation |

Existing tests updated: test_state.py (heartbeat race regression), test_autoscaler.py, test_scheduler.py, test_task_logging.py, test_worker.py, test_env_probe.py.

Infrastructure & Config

  • K8s manifests (infra/coreweave/k8s/): Namespace, ServiceAccount, RBAC for operator deployment
  • CoreWeave example config (iris/examples/coreweave.yaml): shared NodePools for h100-8x and cpu-erapids
  • Design doc (iris/docs/coreweave-v2.md): architectural rationale for shared NodePool model

Breaking Changes

  • S3 credential env vars renamed: CW_KEY_ID/CW_KEY_SECRET → R2_ACCESS_KEY_ID/R2_SECRET_ACCESS_KEY
  • Log reader API: ContainerHandle.logs(since=...) → log_reader() returning RuntimeLogReader
  • platform.reload() replaces reload_controller()


claude bot commented Feb 19, 2026

Claude encountered an error — View job


Review: CoreWeave Platform Integration for Iris (Spiral 2)

This PR implements the CoreWeave platform backend for Iris: CoreweavePlatform (kubectl-based NodePool + Pod lifecycle), ContainerdRuntime (crictl-based task containers), --runtime flag on the worker, operator K8s manifests, proto extensions, and tests. It follows the design doc at docs/design/coreweave-iris-integration.md (Spiral 2 items).


Specification check: A detailed spec exists. The implementation tracks it closely. Key design decisions (one NodePool per slice, async create_slice, CRI host-network mode, reconcile-driven recovery, containerd socket security context) are all implemented as described.


Findings:

  • ConfigMap template field name mismatch (infra/coreweave/k8s/configmap.yaml:47): The template uses ram: 2048GB under resources, but the proto ScaleGroupResources has memory_bytes (int64 in bytes). This field will be silently ignored during proto deserialization. Should be memory_bytes: 2199023255552 (or whatever the correct conversion is).

  • tunnel() hardcodes service name and port (coreweave.py:797-803): tunnel() ignores the address parameter and hardcodes service_name="iris-controller-svc" and remote_port=10000. If an operator customizes these in the ConfigMap/ControllerVmConfig, tunneling won't reflect the change. The other platform implementations pass through the address parameter. Consider parsing address or accepting config.

  • Worker Pod spec missing tolerations (coreweave.py:596-621): CoreWeave GPU node pools may apply taints (e.g., nvidia.com/gpu=present:NoSchedule or custom taints). Without matching tolerations on the worker Pod, the Pod may remain unschedulable. This depends on CoreWeave's specific node pool behavior -- worth validating during the smoke test. If taints are applied, this is a blocking issue.

  • _list_slices_by_labels makes N+1 kubectl calls (coreweave.py:679-753): One kubectl get nodepools + one kubectl get pod per nodepool. At scale (50+ slices per the design doc's open question), this could be slow. A single kubectl get pods -l iris-managed=true to batch-fetch all worker pods would reduce this to 2 calls. Not blocking, but worth noting for operational performance.

  • _monitor_slice swallows exception type (coreweave.py:482-493): When _wait_for_nodepool_ready raises QuotaExhaustedError inside the background thread, the catch-all except Exception logs it but the handle just becomes FAILED. The autoscaler cannot distinguish quota exhaustion from generic failure. The synchronous quota check on initial kubectl apply (line 437-438) handles the immediate case, but quota errors that manifest only in NodePool status conditions are silently downgraded.

  • Progress doc checked in (lib/iris/docs/coreweave-progress.md): 438-line progress log is implementation-internal and should probably not be in the final PR.
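The batched alternative suggested in the N+1 finding above could look like the following — fetch every iris-managed worker pod in one `kubectl get pods -l ...` call and group them by nodepool. The label names here are assumptions, not necessarily what the PR uses:

```python
from collections import defaultdict


def group_pods_by_nodepool(pods_json: dict) -> dict[str, list[str]]:
    """Group the result of one batched pod listing by nodepool label.

    `pods_json` is the parsed output of a single
    `kubectl get pods -l iris-managed=true -o json` call, replacing one
    `kubectl get pod` per NodePool with a constant two-call pattern.
    """
    by_pool: dict[str, list[str]] = defaultdict(list)
    for pod in pods_json.get("items", []):
        pool = pod["metadata"].get("labels", {}).get("iris-nodepool")
        if pool:
            by_pool[pool].append(pod["metadata"]["name"])
    return dict(by_pool)
```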

@rjpower rjpower force-pushed the rjpower/20260218-coreweave branch 2 times, most recently from ccf895d to 75f83e3 on February 20, 2026 at 20:32
CoreWeave Spiral 2, Stage 1: proto extensions + progress log

Extend config.proto for CoreWeave platform support:
- CoreweavePlatformConfig: add namespace, kubeconfig_path fields
- CoreweaveSliceConfig: add gpus_per_node, gpu_class, infiniband fields
- CoreweaveControllerConfig: new message with port and service_name
- ControllerVmConfig: wire CoreweaveControllerConfig as oneof field 4
- BootstrapConfig: add runtime and task_image fields
- Regenerate proto bindings

Also update design doc with implementation progress tracking and
add coreweave-progress.md execution log.

CoreWeave Spiral 2, Stage 2: ContainerdRuntime implementation

Implement ContainerdRuntime and ContainerdContainerHandle using crictl
for containerd-based hosts (CoreWeave bare metal). Follows the
ContainerRuntime/ContainerHandle protocols from runtime/types.py.

Key behaviors:
- CRI pod sandbox per task with host-network (namespace_options.network=NODE)
- Two-phase lifecycle: build() for setup commands, run() for main command
- crictl pull/runp/create/start/inspect/logs/stats/rm for full lifecycle
- Runtime endpoint threaded through all crictl calls
- Labels (iris.managed, iris.task_id, iris.job_id) for discoverability

Integration tests run against local containerd socket, skip when unavailable.

CoreWeave Spiral 2, Stage 5: operator K8s manifests

Add operator-managed Kubernetes manifests for deploying Iris controller
on CoreWeave CKS clusters under infra/coreweave/k8s/:

- namespace.yaml: iris namespace
- service-account.yaml, cluster-role.yaml, cluster-role-binding.yaml: RBAC
- configmap.yaml: template Iris cluster config with operator guidance
- controller-nodepool.yaml: cd-gp-i64-erapids (smallest CPU instance)
- controller-deployment.yaml: controller pod with config/secret mounts
- controller-service.yaml: ClusterIP service on port 10000

CoreWeave Spiral 2, Stage 3: --runtime flag + crictl in worker image

Add --runtime docker|containerd flag to worker serve command. Selects
ContainerdRuntime or DockerRuntime at startup and passes to Worker.
Install crictl v1.32.0 in Dockerfile.worker so the single image supports
both Docker (GCP) and containerd (CoreWeave) backends.

CoreWeave Spiral 2, Stage 4: CoreweavePlatform implementation

Replace the CoreweavePlatform stub with a full kubectl-based
implementation for CoreWeave CKS clusters.

CoreweaveWorkerHandle: kubectl exec for run_command (with on_line
streaming support), pod status mapping to CloudWorkerState.

CoreweaveSliceHandle: lifecycle state machine (CREATING -> BOOTSTRAPPING
-> READY | FAILED) with thread-safe state transitions. terminate()
deletes Pod then NodePool.

CoreweavePlatform: create_slice applies NodePool CRD and spawns
background monitoring thread. list_slices/list_all_slices query by
label. discover_controller forms K8s Service DNS name. tunnel via
kubectl port-forward. Quota error detection from NodePool conditions.

17 behavioral tests using FakeKubectl to intercept subprocess calls.

CoreWeave Spiral 2, Stage 6: CoreWeave controller Dockerfile + CI

Add Dockerfile.controller.coreweave: python:3.11-slim with kubectl
(no gcloud) for CoreWeave in-cluster deployment. Add to CI matrix
in docker-images.yaml for automatic GHCR publishing.

CoreWeave Spiral 2, Stage 7: validation results

All test suites pass (546 cluster tests, 65 E2E tests), pre-commit
clean. Update progress log with final validation results.

CoreWeave Spiral 2: fix blockers from final review

- Add security context to worker Pod (Decision 10: no privilege
  escalation, drop all capabilities, RuntimeDefault seccomp)
- Add hostNetwork: true + ClusterFirstWithHostNet DNS policy to worker
  Pod for flat-network endpoint reachability (Decision 8)
- Change restartPolicy from Never to Always for crash recovery
- Reject num_vms > 1 with clear ValueError (design doc non-goal)
- Fix controller-deployment.yaml image to iris-controller-coreweave

Fix codex review findings: runtime API, lifecycle, container generalization

- Add remove_all_iris_containers() to ContainerRuntime protocol; rename
  ContainerdRuntime.remove_all_iris_sandboxes to match (fixes crash on
  containerd worker startup)
- Fail fast in CoreweavePlatform.create_slice when bootstrap_config is
  missing (prevents slices stuck in CREATING forever)
- Remove Dockerfile.controller.coreweave; add kubectl to generic
  Dockerfile.controller; update k8s manifests and CI
- Add containerd pytest marker, exclude from default test runs
- Replace raw time.monotonic() deadline patterns with Deadline.from_seconds()
- Move local imports to top of test file
- Extract _run_kubectl() top-level helper to deduplicate kubectl subprocess calls

Cleanups.

Add platform.reload() abstraction, remove reload_controller

Replace the SSH/Docker-specific reload_controller() with a platform.reload()
method on the Platform protocol. Each platform now implements its own reload
strategy:

- GCP/Manual: terminate all worker slices, stop controller, start controller
- CoreWeave: update ConfigMap, reload worker Pods in parallel, then rolling
  update controller Deployment via kubectl set image + rollout status
- Local: stop + start in-process controller

The CLI `iris cluster reload` now calls platform.reload(config) directly.

Make debug cleanup and logs commands platform-aware for CoreWeave

The debug CLI's cleanup and logs commands were hardcoded to require a GCP
platform config, causing immediate failure when run with a CoreWeave config.
The smoke test uses both commands, so CoreWeave integration testing was
completely blocked at Phase 0.

- Add _get_platform_type() helper to detect platform without assuming GCP
- Add _cleanup_coreweave() that uses kubectl to find and delete iris-managed
  NodePools, Pods, and controller resources (Deployment, Service, ConfigMap)
- Add _make_kubectl() helper to create Kubectl from CoreWeave config
- Update cleanup command to dispatch based on platform type
- Update logs command to use kubectl logs for CoreWeave deployments

Smoke test: auto-detect GPU/TPU, add multi-GPU tests, add debug loop

- Add gpu_device() helper to types.py (mirrors tpu_device())
- Smoke test auto-detects accelerator type from config instead of
  hardcoded --tpu-type flag. GPU configs run nvidia-smi device check,
  concurrent GPU jobs, and region-constrained GPU job. TPU configs
  run existing test suite with region/replicas from config.
- Add debug-coreweave-loop.py: automated debug loop that runs smoke
  test, hands failures to Claude, and commits changes between rounds.
  Only the smoke test exit code determines pass/fail — Claude cannot
  self-declare PASSED.

now we're really smoking.

Fix smoke test in local mode.

CoreWeave shared NodePool model: fix RFC 1123 names, add minNodes/maxNodes, colocate controller

- Fix NodePool names: sanitize underscores to hyphens for RFC 1123 compliance
  (e.g. iris-cpu_erapids → iris-cpu-erapids)
- Add required minNodes/maxNodes to NodePool spec from scale group config;
  error if min_slices/max_slices not set
- Remove targetNodes field (conflicts with autoscaler)
- Add scale_group field to CoreweaveControllerConfig proto: controller runs
  on an existing scale group's NodePool instead of a dedicated one
- Delete stale managed NodePools not in current config
- Rename scale groups in coreweave.yaml: cpu_erapids → cpu-erapids,
  h100_8x → h100-8x
- kubectl: always log stdin content for apply commands (debugging)
- Update tests for shared NodePool model
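The RFC 1123 sanitization described above can be sketched as a small helper; the function name and prefix are illustrative, and this version also lowercases and trims, which a strict RFC 1123 label requires:

```python
import re


def rfc1123_name(raw: str, prefix: str = "iris-") -> str:
    """Turn a scale-group name into an RFC 1123-safe Kubernetes name.

    Lowercase, underscores to hyphens, any other invalid character to a
    hyphen, trimmed of leading/trailing hyphens, capped at the 63-char
    DNS label limit.
    """
    name = (prefix + raw).lower().replace("_", "-")
    name = re.sub(r"[^a-z0-9-]", "-", name).strip("-")
    return name[:63]
```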

S3 storage support for CoreWeave: move bundle_prefix to StorageConfig, add S3 credentials and object storage endpoint

Move bundle_prefix from ControllerVmConfig to a new top-level StorageConfig
proto message (with log_prefix). Add S3 credential management via K8s Secrets
(CW_KEY_ID/CW_KEY_SECRET env vars), object_storage_endpoint propagation to
pods, and kubeconfig stripping from the in-cluster ConfigMap.

Improve CoreWeave platform reliability: reconcile NodePool specs on every
start (instead of skip-if-exists), clamp targetNodes to prevent autoscaler
runaway, async NodePool deletion, longer kubectl timeouts, and controller
pod health checking with fatal error fast-fail (ImagePullBackOff,
FailedMount, etc.).

Tee all subprocess output to stdout and log files in debug-coreweave-loop

- Add FileHandler to logging so logger output goes to both stdout and debug-loop.log
- Switch run_smoke_test from subprocess.run to Popen with line-by-line streaming
- Add _run_and_tee helper used by cleanup_kubernetes and collect_kubectl_diagnostics
- Move configure_logging after log_dir resolution so it can receive the path

debug-coreweave step 1: FAILED

_normalize_scale_group_resources rejects proto field names (memory_bytes, disk_bytes) when loading ConfigMap JSON

debug-coreweave step 2: FAILED

ConfigMap serialization uses raw MessageToDict producing proto field names (memory_bytes) instead of config_to_dict which normalizes to YAML names (ram)

debug-coreweave step 3: FAILED

ConfigMap JSON includes namespace and object_storage_endpoint fields that the deployed controller image's protobuf does not recognize

debug-coreweave step 4: FAILED

ConfigMap JSON includes bootstrap.runtime and bootstrap.task_image fields that the deployed controller image's protobuf does not recognize, causing ParseDict to fail

Fix S3 bundle storage for CoreWeave and centralize fsspec config

The controller's BundleStore.write_bundle() was failing with "The
specified key does not exist" when writing bundles to CoreWeave S3
object storage. Two root causes:

1. fsspec.url_to_fs("s3://...") creates an S3FileSystem with no
   endpoint_url. Whether AWS_ENDPOINT_URL is picked up depends on the
   boto3 version — fragile in Docker images.

2. The write-tmp-then-rename pattern (fs.rename) is unreliable on
   S3-compatible stores. S3 has no native rename; s3fs implements it
   as copy+delete which breaks on CoreWeave Object Storage. S3 PUTs
   are already atomic, making the pattern unnecessary.

Fix: introduce iris/cluster/fs.py as the single point of fsspec
configuration. It wraps url_to_fs(), filesystem(), and open() to
inject endpoint_url from AWS_ENDPOINT_URL for all S3 paths. All four
fsspec call sites now go through this module:
- controller/bundle_store.py (bundle uploads)
- worker/bundle_cache.py (bundle downloads)
- task_logging.py FsspecLogSink (log writes)
- task_logging.py LogReader (log reads)
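The core of that single configuration point can be sketched as a pure helper that computes the extra storage options for s3:// URLs. Names are illustrative; the real iris/cluster/fs.py wraps url_to_fs()/filesystem()/open() directly rather than exposing a dict:

```python
import os


def s3_storage_options(url: str, env=os.environ) -> dict:
    """Extra fsspec/s3fs options for S3 URLs.

    Injects endpoint_url from AWS_ENDPOINT_URL explicitly so behavior does
    not depend on whether the installed boto3 version reads that env var
    itself.
    """
    if not url.startswith("s3://"):
        return {}
    opts: dict = {}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:
        opts["endpoint_url"] = endpoint
    return opts
```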

Also includes prior debug iteration fixes: GHCR image support in CLI,
controller address fallback, rollout restart on deploy, ConfigMap
field stripping cleanup, docker push noise filtering, and image tag
pinning in the debug loop.

More smoke logging.

Move gpu_count to SliceConfig, propagate error messages, detect pod failures early

- Move gpus_per_node from CoreweaveSliceConfig to gpu_count on SliceConfig,
  populated via prepare_slice_config from ScaleGroupResources
- Thread error messages through slice failure path (SliceState -> SliceStatus
  -> SliceInfo proto) and log them in smoke test
- Detect fatal pod conditions (ImagePullBackOff, CrashLoopBackOff, etc.)
  during _wait_for_pod_ready instead of waiting for timeout
- Add serviceAccountName to worker pods
- Fix S3 storage detection to use config instead of instance attribute
- Remove duplicate copyright headers, update image tags

debug-coreweave step 1: FAILED

AT_CAPACITY groups reject demand routing, causing current_demand=0 and immediate scale-down of ready slices

Fix CoreWeave GPU scaledown bug, add GPU variant matching and process log sink

- Fix freshly-ready slices being immediately scaled down by initializing
  last_active in mark_slice_ready() (was epoch(0), making all new slices
  appear idle)
- Add substring-based device variant matching so short config names
  (e.g. "H100") match full nvidia-smi names ("NVIDIA H100 80GB HBM3")
- Propagate GPU metadata (count, name, memory) through worker attributes
  and display in dashboard Device column
- Add ProcessLogSink for periodic worker log upload to S3/fsspec storage
- Fix S3 credential passing (explicit key/secret to s3fs) and skip
  makedirs on S3 (not supported by all endpoints)
- Add diagnostic logging throughout autoscaler, scheduler, and platform
- Rewrite test-s3.py as interactive TUI bucket browser
- Add smoke-test worker process log dumping on failure

Cleanups.

Getting it working.

debug-coreweave step 1: FAILED

Docker image tag mismatch: _build_image() tags with short git SHA but push uses config's longer hash

debug-coreweave step 2: FAILED

Task image pushed as :latest instead of versioned tag because _build_and_push_task_image hardcoded version='latest' and used worker_tag instead of task_tag from config

debug-coreweave step 3: FAILED

CRI pod sandbox metadata missing required uid field and name contains invalid forward slashes from task_id

debug-coreweave step 4: FAILED

K8s uses stale cached images because imagePullPolicy defaults to IfNotPresent for versioned tags — none of the 3 prior fixes ever ran on the remote worker

Fix CoreWeave S3 virtual-hosted addressing for object storage

CoreWeave Object Storage endpoints (*.cwobject.com) use virtual-hosted-style
addressing where the bucket name is a subdomain. The code was passing these
bucket-specific URLs as endpoint_url to s3fs, which then double-prefixed the
bucket name — resulting in silent empty listings and objects stored under
wrong key paths.

Fix: detect CoreWeave endpoints in _s3_storage_options() and force virtual
addressing style. Update the example config to use the base endpoint
(https://cwobject.com) instead of the bucket-specific URL.
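The endpoint detection can be sketched as a hostname check; the function name is illustrative (the commit calls its variant `_s3_storage_options()`), and the corresponding s3fs/botocore switch would be the real `addressing_style: "virtual"` config option:

```python
from urllib.parse import urlparse


def needs_virtual_addressing(endpoint_url: str) -> bool:
    """True for CoreWeave Object Storage endpoints (*.cwobject.com).

    These endpoints use virtual-hosted-style addressing (bucket as a
    subdomain); forcing virtual addressing prevents s3fs from prefixing
    the bucket onto the key path a second time.
    """
    host = urlparse(endpoint_url).hostname or ""
    return host == "cwobject.com" or host.endswith(".cwobject.com")
```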

Rename test-s3.py to s3-browser.py, add file viewer with JSON table rendering

Press Enter on files to view contents in a scrollable viewer.
JSON/JSONL files are rendered as tables via tabulate (list-of-dicts,
single dict, list-of-lists); non-tabular data falls back to pretty JSON
or plain text.

Blargh

Blargh,v 2

Build it.

Fix line-too-long lint in CoreWeave hostname check

gpus.

debug-coreweave step 1: FAILED

Task IDs with slashes used directly as K8s label values; sanitize them

better build yo

debug-coreweave step 1: FAILED

KubernetesContainerHandle.status() treats K8s terminated reason 'Completed' as an error, causing successful tasks to be marked FAILED

Fix logging etc.

Pin latest iris images on cluster start/reload

WIP

loggin
…straint API

Move log reading from RemoteClusterClient (which required S3 credentials) to the
controller via GetTaskLogs RPC. This simplifies wait_for_job() and get_task_logs()
significantly and avoids client-side storage access.

Also harden region_constraint() to reject bare strings with a TypeError, update
callers to pass [region], and fix tests for moved modules (kubectl -> k8s.kubectl,
parse_artifact_registry_tag -> bootstrap), changed APIs (bundle_cache -> runtime),
and newly required docker_image config defaults.
…o R2 env vars

Replace timestamp-based incremental log fetching (logs(since=...)) with a
RuntimeLogReader protocol backed by byte-offset deduplication, fixing duplicate
log lines when polling at sub-second granularity (kubectl --since-time has only
second precision).

Key changes:
- ContainerHandle.logs(since) → log_reader() returning RuntimeLogReader
- KubernetesLogReader: byte-offset into raw kubectl output
- ContainerdLogReader: byte-offset into CRI log file read directly
- DockerLogReader: timestamp cursor + 1ms (Docker --since has sub-second precision)
- ProcessLogReader: list-index cursor into in-memory log list
- task_attempt._stream_logs simplified to take RuntimeLogReader, no return value

Also:
- Rename CW_KEY_ID/CW_KEY_SECRET → R2_ACCESS_KEY_ID/R2_SECRET_ACCESS_KEY throughout
- Fix IRIS_ADVERTISE_HOST in K8s task pods to use downward API (status.hostIP)
  instead of inheriting the worker's static IP
- Add /root/.cargo/target bind mount and CARGO_TARGET_DIR env to cache Rust builds
- Extract build_kvstore_spec() in tensorstore_serialization to explicitly forward
  AWS_ENDPOINT_URL/region to tensorstore (it doesn't read env vars itself)
- Add CLI subcommands (ls, cat, cp, browse) to s3-browser.py; load config from YAML
- Add _wrap_in_venv_shell() for profiling commands to resolve venv symlinks via PATH
@rjpower rjpower force-pushed the rjpower/20260218-coreweave branch from 75f83e3 to 72e59e8 on February 20, 2026 at 22:40
TaskAttempt initializes with status=PENDING and only transitions to
BUILDING when the execution thread calls _download_bundle(). A heartbeat
between assignment and thread start would report PENDING to the controller,
which accepted it as a valid state change from ASSIGNED. This silently
removed the task from the building-count backpressure window, allowing the
scheduler to assign more tasks than the limit and overwhelming the K8s pod
API with simultaneous creations.

Fix:
- Worker: report BUILDING (not PENDING) for tasks whose thread hasn't
  started yet - this is the accurate state since execution is imminent.
- Controller: ignore worker-reported PENDING in running_tasks; workers
  should never legitimately report PENDING and accepting it causes
  silent backpressure bypass.
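The two-sided fix can be sketched as follows. The state names mirror the description above, but the function shapes are illustrative, not the worker/controller's actual API:

```python
PENDING, ASSIGNED, BUILDING, RUNNING = "PENDING", "ASSIGNED", "BUILDING", "RUNNING"


def worker_reported_state(attempt_state: str, thread_started: bool) -> str:
    # Worker side: a task whose execution thread hasn't started yet is
    # reported as BUILDING, never PENDING -- execution is imminent.
    if attempt_state == PENDING and not thread_started:
        return BUILDING
    return attempt_state


def controller_accepts(reported: str) -> bool:
    # Controller side: workers never legitimately report PENDING for a
    # running task; accepting it would silently drop the task from the
    # building-count backpressure window.
    return reported != PENDING
```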

@yonromai yonromai left a comment


Follow-up inline comments (non-blocking).

Generated with Codex

```python
last_log_time = self._stream_logs(last_log_time)
self._stream_logs(log_reader)

# Read result file only if container succeeded
```

Priority: High

This completion path only reads _result.pkl from worker-local self.workdir. For KubernetesRuntime, /app is pod-local emptyDir, so callable results stay inside the task pod and are never copied back to the worker filesystem. That means successful callable tasks can return no result.

Recommended fix: add a runtime-specific result export step (for example kubectl read_file for /app/_result.pkl) before this host-side read, or add a runtime hook so result collection is runtime-aware.

Generated with Codex

```python
records = self._log_buffer.query_since(self._last_seq, limit=self._max_entries)
if not records:
    return
self._last_seq = max(r.seq for r in records)
```

Priority: High

_last_seq is advanced before _append_lines. If object storage write fails once, this batch is skipped permanently on the next sync, causing silent process-log loss.

Recommended fix: only advance _last_seq after _append_lines succeeds (or restore cursor on exception) so failed batches are retried.

Generated with Codex
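The recommended cursor-after-success pattern can be sketched as below. The class shape is illustrative; only the advance-after-append ordering is the point:

```python
class LogSink:
    """Uploads new log records; retries failed batches on the next sync."""

    def __init__(self, buffer, append_lines):
        self._buffer = buffer            # exposes query_since(seq, limit=...)
        self._append_lines = append_lines
        self._last_seq = 0

    def sync(self, max_entries: int = 1000) -> None:
        records = self._buffer.query_since(self._last_seq, limit=max_entries)
        if not records:
            return
        # May raise (e.g. object-storage outage); the cursor has NOT been
        # moved yet, so the same batch is retried on the next sync.
        self._append_lines(records)
        self._last_seq = max(r.seq for r in records)
```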


```python
# -- NodePool Management ----------------------------------------------------

def _nodepool_name(self, scale_group: str) -> str:
```

Priority: High

_nodepool_name() only replaces _ with - and does not lowercase/sanitize. If a scale-group name contains uppercase characters, generated NodePool names violate RFC-1123 and kubectl apply fails.

Recommended fix: normalize generated names to lowercase RFC-1123-safe format before creating/reconciling NodePools.

Generated with Codex

```python
    self.stop_controller(config)
    return target_names

def reload(self, config: config_pb2.IrisClusterConfig) -> str:
```

Priority: High

reload() updates ConfigMap/pods/image but does not recompute S3 mode or reconcile secret/env wiring. A reload that changes storage from gs:// to s3:// can leave controller/workers without required AWS_*/FSSPEC_S3 env.

Recommended fix: recompute S3 mode in reload(), ensure/remove the secret accordingly, and re-apply controller deployment env (or equivalent rollout) before worker reload.

Generated with Codex

```python
"""Return new log lines since the last read. Advances the cursor by 1ms past the last line."""
if not self._container_id:
    return []
lines = _docker_logs(self._container_id, since=self._last_timestamp)
```

Priority: Medium

Cursor advancement uses lines[-1], but _docker_logs() appends stdout then stderr blocks. If stderr’s last timestamp is older than latest stdout, the cursor regresses and next poll duplicates previously streamed lines.

Recommended fix: advance cursor using the max timestamp across all returned lines.

Generated with Codex
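The max-timestamp fix can be sketched as a small cursor helper. It assumes, as `docker logs --timestamps` guarantees, that each line starts with an RFC 3339 timestamp (which compares correctly as a string); the function name is illustrative:

```python
def advance_cursor(lines: list[str], current: str) -> str:
    """Advance a Docker log cursor past every returned line.

    Uses the maximum timestamp across all lines rather than the last
    line's, since stdout and stderr blocks are concatenated and may be
    out of chronological order. Never moves the cursor backwards.
    """
    timestamps = [ln.split(" ", 1)[0] for ln in lines if ln]
    return max(timestamps + [current])
```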

@rjpower rjpower force-pushed the rjpower/20260218-coreweave branch from e2f9712 to 6fa91d8 on February 20, 2026 at 23:57