201 changes: 131 additions & 70 deletions CLAUDE.md
@@ -6,7 +6,7 @@ This document contains development notes, architecture decisions, and lessons learned

## Project Structure

- `src/jupyter_scheduler_k8s/` - Main Python package with K8sExecutionManager and K8sDatabaseManager
- `image/` - Docker image with Pixi-based Python environment and notebook executor
- `local-dev/` - Local development configuration (Kind cluster)
- `Makefile` - Build and development automation with auto-detection
@@ -29,67 +29,39 @@ This document contains development notes, architecture decisions, and lessons learned

## Key Design Principles

1. **Minimal Extension**: Only override ExecutionManager and DatabaseManager, reuse everything else from jupyter-scheduler
2. **Container Simplicity**: Container just executes notebooks, unaware of K8s or scheduler
3. **No Circular Dependencies**: Container doesn't depend on jupyter-scheduler package
4. **Jobs as Records**: Execution Jobs serve as both the computational workload AND the database records
5. **Staging Compatibility**: Work with jupyter-scheduler's existing file staging mechanism

## Data Flow (Jobs-as-Records with S3 Storage)

1. User creates job → jupyter-scheduler copies files to staging directory
2. jupyter-scheduler calls K8sExecutionManager.execute()
3. K8sExecutionManager uploads files to S3
4. Execution Job is created with database metadata (labels/annotations)
5. Job downloads files from S3, executes notebook, uploads outputs to S3
6. K8sExecutionManager downloads outputs from S3 to staging directory
7. **Job persists as database record** (no cleanup)
8. User can download results and view job history via jupyter-scheduler UI
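
The sketch below illustrates steps 3–4 under stated assumptions: it uses the official `kubernetes` Python client, and the function name, image tag, env var, and serialized annotation payload are illustrative rather than the actual implementation.

```python
# Hypothetical sketch of steps 3-4 (not the actual implementation):
# after uploading inputs to S3, create an execution Job that carries
# its own database metadata in labels/annotations.
from kubernetes import client

def create_execution_job(batch_api, job_id, s3_prefix, namespace="default"):
    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            # nb-job-* naming and label/annotation keys follow the pattern above
            name=f"nb-job-{job_id}",
            labels={"jupyter-scheduler.io/job-id": job_id},
            annotations={"jupyter-scheduler.io/job-data": '{"status": "IN_PROGRESS"}'},
        ),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="notebook-executor",
                            image="jupyter-scheduler-k8s:latest",  # illustrative tag
                            env=[client.V1EnvVar(name="S3_PREFIX", value=s3_prefix)],
                        )
                    ],
                )
            )
        ),
    )
    batch_api.create_namespaced_job(namespace, job)
```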

## Implementation Status

### Current Architecture: Jobs-as-Records with S3 Storage ✅
- **Database**: Execution Jobs (`nb-job-*`) serve as permanent records with labels/annotations
- **File Storage**: S3 for durability across cluster failures
- **Monitoring**: Watch API for real-time job status updates
- **Resource Management**: Configurable CPU/memory limits
- **Platform Support**: Works with any K8s cluster (Kind, minikube, cloud providers)
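
A minimal sketch of the Watch-based monitoring mentioned above, assuming the `kubernetes` Python client; the function name and selector are illustrative, and the real manager also degrades gracefully to polling.

```python
# Minimal Watch API sketch (illustrative, not the actual manager code)
from kubernetes import client, watch

def wait_for_job(batch_api: client.BatchV1Api, job_name: str,
                 namespace: str = "default") -> bool:
    w = watch.Watch()
    # Stream Job events instead of polling; react as soon as status changes
    for event in w.stream(batch_api.list_namespaced_job, namespace,
                          field_selector=f"metadata.name={job_name}"):
        status = event["object"].status
        if status.succeeded:
            w.stop()
            return True
        if status.failed:
            w.stop()
            return False
```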

### Development Environment ✅
- **Local Setup**: Kind + Finch for development
- **Container**: Pixi-based Python environment with nbconvert
- **Auto-Detection**: Smart imagePullPolicy based on cluster context
- **Debugging**: Automatic pod log capture on failures

### S3 Configuration (Required)

**Purpose:** Persist files beyond jupyter-scheduler server and K8s cluster failures

@@ -127,13 +99,13 @@ jupyter lab --Scheduler.execution_manager_class="jupyter_scheduler_k8s.K8sExecutionManager"

**Critical:** AWS credentials must be set in the same terminal session where you launch Jupyter Lab. The system passes these credentials to Kubernetes containers for S3 access.
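
As a hedged illustration of that credential hand-off (assuming the `kubernetes` Python client; the helper name is hypothetical):

```python
# Illustrative helper: forward host AWS credentials into the container env
import os
from kubernetes import client

def aws_env_vars() -> list:
    names = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"]
    # Only forward variables that are actually set in the launching shell
    return [client.V1EnvVar(name=n, value=os.environ[n])
            for n in names if n in os.environ]
```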

### Future Development Roadmap
- **GPU Support**: Resource configuration from UI for ML workloads
- **Job Management**: Stop/delete running K8s jobs from UI (`stop_job`, `delete_job` methods)
- **CRD Migration**: Custom Resource Definitions for optimized metadata storage
- **Job Archival**: Automated cleanup of old execution Jobs
- **K8s-native Scheduling**: CronJobs integration from UI
- **PyPI Distribution**: Official package publishing


## Lessons Learned
@@ -185,29 +157,118 @@ jupyter lab --Scheduler.execution_manager_class="jupyter_scheduler_k8s.K8sExecutionManager"
- Add `awscli` to both pyproject.toml and container image
- Required S3_BUCKET env var, no fallback for consistency

## S3 Implementation Details

1. **Upload inputs** - AWS CLI sync to S3 bucket
2. **Container execution** - Job downloads from S3, executes notebook, uploads outputs
3. **Download outputs** - AWS CLI sync from S3 to staging directory
4. **Durability** - Files survive cluster failures, can be retrieved later
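
A sketch of the sync calls behind steps 1 and 3, assuming the AWS CLI is on PATH; the bucket and paths are illustrative placeholders:

```python
# Illustrative wrapper around the AWS CLI sync used for steps 1 and 3
import subprocess

def s3_sync(src: str, dst: str) -> None:
    # aws s3 sync handles directory recursion, multipart uploads, and retries
    subprocess.run(["aws", "s3", "sync", src, dst, "--quiet"], check=True)

s3_sync("/tmp/staging/job-123", "s3://<your-bucket>/jobs/job-123")          # upload inputs
s3_sync("s3://<your-bucket>/jobs/job-123/output", "/tmp/staging/job-123")   # download outputs
```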

**Key Implementation:**
- **AWS credentials passed at runtime**: K8sExecutionManager passes host AWS credentials to containers via environment variables
- **Auto pod debugging**: When jobs fail, automatically captures pod logs and container status for troubleshooting
- **AWS CLI for reliability**: Handles directory recursion, multipart uploads, retries automatically
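
The debugging hook might look roughly like this (assumptions: the `kubernetes` Python client and the `job-name` label the Job controller applies to its pods; the function name is hypothetical):

```python
# Rough sketch of automatic pod debugging on job failure (hypothetical helper)
from kubernetes import client

def capture_failure_debug(core_api: client.CoreV1Api, job_name: str,
                          namespace: str = "default") -> None:
    # The Job controller labels its pods with job-name=<job>
    pods = core_api.list_namespaced_pod(
        namespace, label_selector=f"job-name={job_name}"
    )
    for pod in pods.items:
        print(f"--- {pod.metadata.name}: phase={pod.status.phase}")
        print(core_api.read_namespaced_pod_log(pod.metadata.name, namespace))
```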

## Architecture: Jobs-as-Records Implementation

### Current Approach
Execution Jobs serve as both computational workload AND database records:
- **Job Metadata**: Stored in labels/annotations on execution Jobs (`nb-job-*` pattern)
- **Job Persistence**: Execution Jobs remain after completion as permanent database records
- **Query Interface**: K8sSession/K8sQuery mimic SQLAlchemy patterns using K8s label selectors
- **Storage Location**: Job data in annotations, fast queries via labels

### Implementation Details
- **Execution Jobs** contain complete job data in `jupyter-scheduler.io/job-data` annotation
- **Label Selectors** enable efficient server-side filtering (`jupyter-scheduler.io/job-id`, etc.)
- **No Cleanup**: `_cleanup_job()` calls removed, Jobs persist indefinitely
- **Database Interface**: K8sDatabaseManager.commit() is now a no-op
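
A hedged sketch of that read path — the real K8sSession/K8sQuery interface may differ; this only shows label-selector filtering plus annotation decoding:

```python
# Sketch of the jobs-as-records read path (helper name is illustrative)
import json
from kubernetes import client

def get_job_record(batch_api: client.BatchV1Api, job_id: str,
                   namespace: str = "default"):
    # Labels give fast server-side filtering; annotations hold the full record
    jobs = batch_api.list_namespaced_job(
        namespace, label_selector=f"jupyter-scheduler.io/job-id={job_id}"
    )
    if not jobs.items:
        return None
    raw = jobs.items[0].metadata.annotations["jupyter-scheduler.io/job-data"]
    return json.loads(raw)
```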

### Storage Considerations & Future Enhancements

#### Resource Usage
- **Current**: Each Job ~1-2KB metadata + full K8s Job spec
- **Scale Impact**: 10,000 jobs ≈ 10-20MB etcd storage
- **Recommendation**: Archive Jobs older than 30-90 days for large deployments

#### Future: CRD-Based Database (Next Architecture Evolution)
**When to Migrate**: When query performance or storage optimization becomes critical

**CRD Benefits**:
- **Semantic Correctness**: Purpose-built API objects instead of abusing Jobs
- **Storage Efficiency**: ~1KB per record vs current Job overhead
- **Query Performance**: Native indexing and custom controllers
- **API Integration**: First-class kubectl support (`kubectl get scheduledjobs`)

**Implementation Path**:
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: scheduledjobs.jupyter-scheduler.io
spec:
  # ... CRD definition for ScheduledJob resource
```

#### Archival Strategy (Implementation Ready)
```python
# Example: archive execution Jobs older than the retention period
from datetime import datetime, timedelta, timezone

def archive_old_jobs(k8s_batch, namespace="default", retention_days=30):
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    # Label selectors are equality-based, so select our records by label
    # presence and filter on age client-side
    jobs = k8s_batch.list_namespaced_job(
        namespace, label_selector="jupyter-scheduler.io/job-id"
    )
    for job in jobs.items:
        if job.metadata.creation_timestamp < cutoff:
            # Extract metadata to ConfigMap/S3 here, then delete the Job
            k8s_batch.delete_namespaced_job(job.metadata.name, namespace)
```

## Meta-Learnings for Future Claude Code Instances

### Architectural Decision-Making Process
When questioning existing architecture:
1. **Challenge assumptions**: Don't accept "that's how it was built" - question if the current approach makes semantic sense
2. **Follow the data flow**: Trace what actually contains the valuable information (execution Jobs had all the context)
3. **Apply first principles**: Ask "what is the natural representation of this concept in the target system?"
4. **Consider resource efficiency**: Balance semantic correctness with resource usage

### Jobs-as-Records Decision Process
**Original questioning**: "Why use separate job abstractions for execution and record keeping? Wouldn't it make sense to use the same?"

**Analysis approach**:
- **Semantic consistency**: Execution Jobs ARE the work that was done - they should be the record
- **Information completeness**: Execution Jobs contain logs, resource usage, exact specs - far more valuable than metadata shadows
- **Kubernetes principles**: Jobs are designed to represent "work that was completed"
- **Resource analysis**: Busybox containers were wasteful for storing JSON data

**Implementation philosophy**: Make the architecture match the domain model - the execution IS the record.

### Technical Implementation Patterns
- **Validate user counterpoints**: When users challenge technical decisions, investigate thoroughly - they often spot architectural inconsistencies
- **Security context evaluation**: In development environments, don't over-engineer security if the baseline (JupyterLab) already has broad access
- **Future-proofing balance**: Plan for scale (mention CRDs) but implement the simplest correct solution first

### Documentation Audience Separation
- **README.md**: User and developer-facing, focus on features and usage
- **CLAUDE.md**: Internal development guidance, include decision reasoning and future Claude context
- **Avoid redundancy**: Don't repeat information between docs, reference when needed

## Code Quality Standards

- **Comments**: Only add comments that explain parts of code that are not evident from the code itself
  - Explain WHY something is done when the reasoning isn't obvious
  - Comments above the line they describe, not inline
  - Explain WHAT is being done when the code logic is complex or non-obvious
  - If the code is self-evident, no comment is needed
- **Quality**: Insist on highest quality standards while avoiding over-engineering
- **Scope**: Stay strictly within defined scope - no feature creep or unnecessary complexity

## Logging Standards

- **Emoji Usage**: Use emojis sparingly and meaningfully - only for major action phases or critical states
  - ✅ Good: `🔧 Initializing`, `📤 Uploading files`, `❌ S3 upload failed`
  - ❌ Avoid: Emoji on every configuration line or routine status message
- **Configuration Logging**: Clean, scannable format without visual clutter
  - Use simple indented format: `  S3_BUCKET: bucket-name`
  - Reserve emojis for errors (❌) or important phase transitions (🔧, 📤, 📥)
- **Action Logging**: Lead with meaningful emoji, follow with clear description
  - `📤 Uploading files from /path to S3...`
  - `✅ Files successfully uploaded to s3://bucket/path`
- **Debug Information**: Include helpful context without emoji noise
  - `  Command: aws s3 sync /local s3://bucket/path --quiet`

## Documentation Standards

- **Placeholder URLs/Values**: Use angle bracket format `<placeholder-description>`. Examples: `<your-s3-endpoint-url>`, `<your-minio-server-url>`, `<your-namespace>`
19 changes: 19 additions & 0 deletions Makefile
@@ -130,6 +130,8 @@ load-image: build-image

.PHONY: dev-env
dev-env: load-image
@echo "Setting kubectl context for Kind cluster..."
@kind export kubeconfig --name $(CLUSTER_NAME)
@echo ""
@echo "🚀 Development environment ready!"
@echo ""
@@ -204,4 +206,21 @@ status:
		fi; \
	else \
		echo "❌ AWS CLI not installed"; \
	fi
	@echo ""
	@echo "AWS Credentials (for container access):"
	@if [ -n "$$AWS_ACCESS_KEY_ID" ]; then \
		echo "✅ AWS_ACCESS_KEY_ID set"; \
	else \
		echo "❌ AWS_ACCESS_KEY_ID not set (required for container S3 access)"; \
	fi
	@if [ -n "$$AWS_SECRET_ACCESS_KEY" ]; then \
		echo "✅ AWS_SECRET_ACCESS_KEY set"; \
	else \
		echo "❌ AWS_SECRET_ACCESS_KEY not set (required for container S3 access)"; \
	fi
	@if [ -n "$$AWS_SESSION_TOKEN" ]; then \
		echo "✅ AWS_SESSION_TOKEN set (temporary credentials)"; \
	else \
		echo "ℹ️ AWS_SESSION_TOKEN not set (not required for permanent credentials)"; \
	fi
39 changes: 23 additions & 16 deletions README.md
@@ -6,15 +6,15 @@ Kubernetes backend for [jupyter-scheduler](https://github.com/jupyter-server/jupyter-scheduler)

1. Schedule notebook jobs through JupyterLab UI
2. Files uploaded to S3 bucket for storage
3. Kubernetes execution job downloads files, executes notebook in isolated pod
4. Results uploaded back to S3, then downloaded to JupyterLab and accessible through the UI
5. **Execution job persists as database record** - job history and debugging info preserved

**Key features:**
- Parameter injection for notebook customization
- Multiple output formats (HTML, PDF, etc.)
- **Jobs-as-records** - execution Jobs serve as both workload AND database records (zero SQL dependencies)
- **Job history** - execution context, logs, and resource usage preserved
- **S3 storage** - files survive Kubernetes cluster or Jupyter Server failures
- Works with any Kubernetes cluster (Kind, minikube, EKS, GKE, AKS)
- Configurable resource limits (CPU/memory)

## Requirements

@@ -146,9 +146,12 @@ export S3_BUCKET="<your-test-bucket>"
export AWS_ACCESS_KEY_ID="<your-access-key>"
export AWS_SECRET_ACCESS_KEY="<your-secret-key>"

# Launch with K8s execution only
jupyter lab --Scheduler.execution_manager_class="jupyter_scheduler_k8s.K8sExecutionManager"

# Launch with K8s database + K8s execution
jupyter lab --SchedulerApp.db_url="k8s://default" --SchedulerApp.database_manager_class="jupyter_scheduler_k8s.K8sDatabaseManager" --Scheduler.execution_manager_class="jupyter_scheduler_k8s.K8sExecutionManager"

# Cleanup
make clean
```
@@ -197,15 +200,19 @@ make clean # Remove cluster and cleanup
## Implementation Status

### Working Features ✅
- **Jobs-as-Records Database**: `K8sDatabaseManager` stores job metadata in execution Jobs (zero SQL dependencies)
- **K8s Execution**: `K8sExecutionManager` runs notebook jobs in Kubernetes pods with context preservation
- **S3 Storage**: Files persist beyond Kubernetes cluster or Jupyter Server failures
- **Memory Management**: Configurable CPU/memory limits and requests
- **Event-driven Monitoring**: Watch API for real-time job status updates
- **Parameter Injection**: Dynamic notebook customization
- **Multiple Output Formats**: HTML, PDF, and other formats via nbconvert
- **File Handling**: Support for any notebook size with S3 operations

### Planned 🚧
- **Custom Resource Definitions (CRDs)**: Optimized metadata storage for large-scale deployments
- **Job Archival**: Automated cleanup and archival of old execution Jobs
- **GPU Resource Configuration**: GPU allocation for ML workloads from UI
- **Job Management**: Stop/deletion of running Kubernetes jobs from UI
- **K8s-native Scheduling**: CronJobs integration from UI
- **PyPI Package Publishing**: Official package distribution