replicate
diff --git a/‎ARCHITECTURE.md‎
Lines changed: 54 additions & 25 deletions b/‎ARCHITECTURE.md‎
Lines changed: 54 additions & 25 deletions
diff --git a/‎cmd/cog/main.go‎
Lines changed: 84 additions & 95 deletions b/‎cmd/cog/main.go‎
Lines changed: 84 additions & 95 deletions
@@ -25,17 +25,33 @@ This hybrid architecture combines Go's performance advantages with Python's mach
 
 ### Go HTTP server
 
-The HTTP server component handles all external communication and process management:
+The HTTP server component is organized into focused packages with clear separation of concerns:
 
+#### `internal/server` - HTTP API Layer
 **HTTP routing**: Implements the Cog prediction API with endpoints for predictions, health checks, cancellation, and shutdown. Routes are dynamically configured based on runtime mode (standard vs procedure).
 
-**Process management**: Manages Python runner processes including startup, shutdown, health monitoring, and crash recovery. In procedure mode, can manage multiple isolated runners with automatic eviction policies.
+**Request handling**: Processes incoming HTTP requests, validates payloads, and coordinates with the runner management layer. Handles both synchronous and asynchronous prediction workflows.
 
-**Request coordination**: Handles request queuing, concurrency limits, and response aggregation. Maps HTTP requests to appropriate Python runners and manages the full request lifecycle.
+**Response management**: Aggregates results from runner layer and formats responses according to API specifications. Manages streaming responses and webhook notifications.
 
-**File I/O handling**: Manages input file downloads and output file uploads, with support for both local storage and external upload endpoints. Handles path resolution and cleanup automatically.
+#### `internal/runner` - Process Management Layer
+**Centralized runner management**: The `Manager` component handles all Python process lifecycle including startup, shutdown, health monitoring, and crash recovery. Provides slot-based concurrency control and automatic resource cleanup.
 
-**IPC coordination**: Receives status updates from Python processes via HTTP and manages bidirectional communication through the filesystem.
+**Individual runner management**: The `Runner` component manages individual Python processes with proper context cancellation, log accumulation, and per-prediction response tracking.
+
+**Procedure support**: Dynamic runner creation for different source URLs with automatic eviction policies. Handles isolation requirements and resource allocation in multi-tenant scenarios.
+
+**Configuration management**: Handles cog.yaml parsing, environment setup, and runtime configuration for both standard and procedure modes.
+
+#### `internal/webhook` - Webhook Delivery
+**Webhook coordination**: Manages webhook delivery with deduplication, retry logic, and proper event filtering. Uses atomic operations to prevent duplicate terminal webhooks.
+
+**Event tracking**: Tracks webhook events per prediction with proper timing and log accumulation to ensure complete notification delivery.
+
+#### `internal/service` - Application Lifecycle
+**Service coordination**: Manages overall application lifecycle including graceful shutdown, signal handling, and component initialization.
+
+**Configuration integration**: Bridges CLI configuration with internal component configuration and handles service-level concerns like working directory management.
 
 ### Python model runner (coglet)
 
@@ -53,11 +69,24 @@ The `coglet` component focuses purely on model execution and introspection:
 
 ### Request flow architecture
 
-**Standard mode**: Single Python runner handling requests sequentially or with limited concurrency based on predictor capabilities.
+The architecture provides clean separation between HTTP handling, runner management, and process execution:
+
+#### Request Processing Flow
+
+1. **HTTP Request**: `internal/server` receives and validates incoming requests
+2. **Runner Assignment**: `internal/runner/Manager` assigns requests to available runners using slot-based concurrency control
+3. **Process Execution**: `internal/runner/Runner` manages individual Python process interaction via file-based IPC
+4. **Response Tracking**: Per-prediction watchers monitor Python process responses and handle log accumulation
+5. **Webhook Delivery**: `internal/webhook` manages asynchronous webhook notifications with deduplication
+6. **HTTP Response**: `internal/server` formats and returns final responses to clients
+
+#### Execution Modes
+
+**Standard mode**: Single Python runner managed by the system with configurable concurrency based on predictor capabilities. The Manager creates and maintains one long-lived runner process.
 
-**Procedure mode**: Dynamic runner management where the Go server creates Python processes on-demand for different source URLs, with automatic scaling and eviction based on usage patterns.
+**Procedure mode**: Dynamic runner management where the Manager creates Python processes on-demand for different source URLs. Implements LRU eviction, automatic scaling, and resource isolation between procedures.
 
-**Concurrency handling**: The Go server aggregates concurrency limits across all runners and provides global throttling while individual Python processes handle their own internal concurrency based on predictor type (sync vs async).
+**Concurrency handling**: The Manager provides global slot-based concurrency control while individual Runners handle per-process concurrency limits. Atomic operations ensure safe concurrent access to shared state.
 
 ## Communication patterns
 
@@ -87,38 +116,38 @@ The server exposes a RESTful API compatible with the original Cog specification:
 
 **Output processing**: Python runners write outputs to files when needed, and the Go server handles upload/base64 encoding based on client preferences.
 
-## Contrast with old Cog server
+## Architecture benefits
 
-The new architecture addresses several limitations of the original FastAPI-based implementation:
+The hybrid Go/Python architecture provides several key advantages:
 
-### Performance improvements
+### Performance characteristics
 
-**Go HTTP handling**: The Go server can handle much higher request throughput and lower latency compared to Python's uvicorn, especially for health checks and simple requests.
+**Go HTTP handling**: The Go server provides high request throughput and low latency, especially for health checks and management requests.
 
-**Process isolation**: Model crashes or hangs no longer affect the HTTP server, providing better availability and faster recovery.
+**Process isolation**: Model crashes or hangs do not affect the HTTP server, providing better availability and faster recovery.
 
-**Concurrent processing**: Better support for concurrent predictions with proper resource accounting and backpressure management.
+**Concurrent processing**: Supports concurrent predictions with proper resource accounting and backpressure management through slot-based concurrency control.
 
-### Reliability improvements
+### Reliability features
 
-**Fault tolerance**: Python process crashes are isolated and can be recovered without restarting the entire server.
+**Fault tolerance**: Python process crashes are isolated and recovered without affecting other operations or requiring server restart.
 
-**Resource management**: Better control over memory usage, file descriptor limits, and process lifecycle.
+**Resource management**: Provides precise control over memory usage, file descriptor limits, and process lifecycle with automatic cleanup.
 
 **Dependency isolation**: Zero Python dependencies in the runtime layer eliminates version conflicts with model requirements.
 
-### Operational improvements
+### Operational capabilities
 
-**Multi-tenancy**: Procedure mode allows serving multiple models/procedures from a single server instance with proper isolation.
+**Multi-tenancy**: Procedure mode serves multiple models/procedures from a single server instance with proper isolation and resource allocation.
 
-**Debuggability**: File-based IPC makes it easy to inspect request/response payloads and trace execution flow.
+**Debuggability**: File-based IPC enables easy inspection of request/response payloads and execution flow tracing.
 
-**Resource cleanup**: Automatic cleanup of temporary files, processes, and other resources with proper error handling.
+**Resource cleanup**: Automatic cleanup of temporary files, processes, and other resources with comprehensive error handling.
 
-### API compatibility
+### API design
 
-**Backward compatibility**: Maintains full API compatibility with existing Cog clients while providing performance and reliability improvements.
+**Client compatibility**: Maintains full API compatibility with existing Cog clients while providing enhanced performance and reliability.
 
-**Extended features**: Adds procedure mode capabilities while preserving the original single-model deployment pattern.
+**Flexible deployment**: Supports both single-model deployment patterns and multi-tenant procedure mode from the same codebase.
 
-The old server's single-process architecture required careful management of async/await patterns and could suffer from blocking operations affecting the entire service. The new architecture's process separation eliminates these concerns while providing better scalability and fault isolation.
+The process separation architecture eliminates concerns around blocking operations affecting the entire service while providing better scalability and fault isolation through clear component boundaries.
@@ -2,19 +2,20 @@ package main
 
 import (
 	"context"
-	"errors"
 	"fmt"
-	"net/http"
 	"os"
 	"os/exec"
-	"os/signal"
-	"strconv"
 	"syscall"
 	"time"
 
 	"github.com/alecthomas/kong"
+	"github.com/replicate/go/logging"
+	"go.uber.org/zap"
+	"go.uber.org/zap/zapcore"
 
-	"github.com/replicate/cog-runtime/internal/server"
+	"github.com/replicate/cog-runtime/internal/config"
+	"github.com/replicate/cog-runtime/internal/runner"
+	"github.com/replicate/cog-runtime/internal/service"
 	"github.com/replicate/cog-runtime/internal/util"
 )
 
@@ -28,6 +29,7 @@ type ServerCmd struct {
 	WorkingDirectory          string        `help:"Override the working directory for predictions" name:"working-directory" env:"COG_WORKING_DIRECTORY"`
 	RunnerShutdownGracePeriod time.Duration `help:"Grace period before force-killing prediction runners" name:"runner-shutdown-grace-period" default:"600s" env:"COG_RUNNER_SHUTDOWN_GRACE_PERIOD"`
 	CleanupTimeout            time.Duration `help:"Maximum time to wait for process cleanup before hard exit" name:"cleanup-timeout" default:"10s" env:"COG_CLEANUP_TIMEOUT"`
+	MaxRunners                int           `help:"Maximum number of runners to allow (0 for auto-detect)" name:"max-runners" env:"COG_MAX_RUNNERS" default:"0"`
 }
 
 type SchemaCmd struct{}
@@ -40,125 +42,114 @@ type CLI struct {
 	Test   TestCmd   `cmd:"" help:"Run model tests to verify functionality"`
 }
 
-var logger = util.CreateLogger("cog")
+// createBaseLogger creates a base logger with configurable level
+func createBaseLogger(name string) *zap.Logger {
+	logLevel := os.Getenv("COG_LOG_LEVEL")
+	if logLevel == "" {
+		logLevel = "info"
+	}
+	_, err := zapcore.ParseLevel(logLevel)
+	if err != nil {
+		fmt.Printf("Failed to parse log level \"%s\": %s\n", logLevel, err) //nolint:forbidigo // logger setup error reporting
+	}
 
-func (s *ServerCmd) Run() error {
-	log := logger.Sugar()
+	return logging.New(name).WithOptions(zap.IncreaseLevel(zapcore.DebugLevel))
+}
 
+// buildServiceConfig converts CLI ServerCmd to service configuration
+func buildServiceConfig(s *ServerCmd) (config.Config, error) {
+	log := createBaseLogger("cog-config").Sugar()
+
+	logLevel := log.Level()
+	log.Infow("log level", "level", logLevel)
+	log.Infow("env log level", "level", os.Getenv("COG_LOG_LEVEL"))
 	// One-shot mode requires procedure mode
 	if s.OneShot && !s.UseProcedureMode {
-		log.Errorw("one-shot mode requires procedure mode")
-		return fmt.Errorf("one-shot mode requires procedure mode, use --use-procedure-mode")
+		log.Error("one-shot mode requires procedure mode")
+		return config.Config{}, fmt.Errorf("one-shot mode requires procedure mode, use --use-procedure-mode")
 	}
 
 	// Procedure mode implies await explicit shutdown
-	// i.e. Python process exit should not trigger shutdown
+	awaitExplicitShutdown := s.AwaitExplicitShutdown
 	if s.UseProcedureMode {
-		s.AwaitExplicitShutdown = true
+		awaitExplicitShutdown = true
 	}
-	log.Infow("configuration",
-		"use-procedure-mode", s.UseProcedureMode,
-		"await-explicit-shutdown", s.AwaitExplicitShutdown,
-		"one-shot", s.OneShot,
-		"upload-url", s.UploadURL,
-	)
-
-	addr := fmt.Sprintf("%s:%d", s.Host, s.Port)
-	log.Infow("starting Cog HTTP server", "addr", addr, "version", util.Version(), "pid", os.Getpid())
 
-	var err error
-	currentWorkingDirectory := s.WorkingDirectory
-	if currentWorkingDirectory == "" {
-		currentWorkingDirectory, err = os.Getwd()
+	// Resolve working directory
+	workingDir := s.WorkingDirectory
+	if workingDir == "" {
+		var err error
+		workingDir, err = os.Getwd()
 		if err != nil {
 			log.Errorw("failed to get current working directory", "error", err)
-			return err
+			return config.Config{}, fmt.Errorf("failed to get current working directory: %w", err)
 		}
 	}
 
-	forceShutdown := make(chan struct{}, 1)
-
-	serverCfg := server.Config{
+	cfg := config.Config{
+		Host:                      s.Host,
+		Port:                      s.Port,
 		UseProcedureMode:          s.UseProcedureMode,
-		AwaitExplicitShutdown:     s.AwaitExplicitShutdown,
+		AwaitExplicitShutdown:     awaitExplicitShutdown,
 		OneShot:                   s.OneShot,
-		IPCUrl:                    fmt.Sprintf("http://localhost:%d/_ipc", s.Port),
+		WorkingDirectory:          workingDir,
 		UploadURL:                 s.UploadURL,
-		WorkingDirectory:          currentWorkingDirectory,
+		IPCUrl:                    fmt.Sprintf("http://localhost:%d/_ipc", s.Port),
+		MaxRunners:                s.MaxRunners,
 		RunnerShutdownGracePeriod: s.RunnerShutdownGracePeriod,
 		CleanupTimeout:            s.CleanupTimeout,
-		ForceShutdown:             forceShutdown,
-	}
-	// FIXME: in non-procedure mode we do not support concurrency in a meaningful way, we
-	// statically create the runner list sized at 1.
-	if maxRunners, ok := os.LookupEnv("COG_MAX_RUNNERS"); ok && s.UseProcedureMode {
-		if i, err := strconv.Atoi(maxRunners); err == nil {
-			serverCfg.MaxRunners = i
-		} else {
-			log.Errorw("failed to parse COG_MAX_RUNNERS", "value", maxRunners)
-		}
 	}
-	ctx, cancel := context.WithCancel(context.Background())
-	h, err := server.NewHandler(serverCfg, cancel) //nolint:contextcheck // context passing not viable in current architecture
+
+	log.Infow("service configuration",
+		"use_procedure_mode", cfg.UseProcedureMode,
+		"await_explicit_shutdown", cfg.AwaitExplicitShutdown,
+		"one_shot", cfg.OneShot,
+		"upload_url", cfg.UploadURL,
+		"working_directory", cfg.WorkingDirectory,
+		"max_runners", cfg.MaxRunners,
+	)
+
+	return cfg, nil
+}
+
+func (s *ServerCmd) Run() error {
+	// Create base logger
+	baseLogger := createBaseLogger("cog")
+	log := baseLogger.Sugar()
+
+	// Build service configuration
+	cfg, err := buildServiceConfig(s)
 	if err != nil {
-		log.Errorw("failed to create server handler", "error", err)
 		return err
 	}
-	mux := server.NewServeMux(h, s.UseProcedureMode)
-	httpServer := &http.Server{
-		Addr:              addr,
-		Handler:           mux,
-		ReadHeaderTimeout: 5 * time.Second, // TODO: is 5s too long? likely
-	}
-	go func() {
-		<-ctx.Done()
-		if err := httpServer.Shutdown(ctx); err != nil {
-			log.Errorw("failed to shutdown server", "error", err)
-			os.Exit(1)
-		}
-	}()
-	go func() {
-		ch := make(chan os.Signal, 1)
-		signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
-		for {
-			select {
-			case sig := <-ch:
-				if sig == syscall.SIGTERM && s.AwaitExplicitShutdown {
-					log.Warnw("ignoring signal to stop", "signal", sig)
-				} else {
-					log.Infow("stopping Cog HTTP server", "signal", sig)
-					if err := h.Stop(); err != nil {
-						log.Errorw("failed to stop server handler", "error", err)
-						os.Exit(1)
-					}
-				}
-			case <-forceShutdown:
-				log.Errorw("cleanup timeout reached, forcing ungraceful shutdown")
-				os.Exit(1)
-			}
-		}
-	}()
-	if err := httpServer.ListenAndServe(); errors.Is(err, http.ErrServerClosed) {
-		exitCode := h.ExitCode()
-		if exitCode == 0 {
-			log.Infow("shutdown completed normally")
-		} else {
-			log.Errorw("python runner exited with code", "code", exitCode)
-		}
-		return nil
+
+	addr := fmt.Sprintf("%s:%d", cfg.Host, cfg.Port)
+	log.Infow("starting Cog HTTP server", "addr", addr, "version", util.Version(), "pid", os.Getpid())
+
+	// Create service with base logger
+	svc := service.New(cfg, baseLogger)
+
+	// Create root context for the entire service
+	ctx, cancel := context.WithCancel(context.Background())
+	defer cancel()
+
+	// Initialize service components
+	if err := svc.Initialize(ctx); err != nil {
+		return err
 	}
-	return err
+
+	return svc.Run(ctx)
 }
 
 func (s *SchemaCmd) Run() error {
-	log := logger.Sugar()
+	log := createBaseLogger("cog-schema").Sugar()
 
 	wd, err := os.Getwd()
 	if err != nil {
 		log.Errorw("failed to get working directory", "error", err)
 		return err
 	}
-	y, err := util.ReadCogYaml(wd)
+	y, err := runner.ReadCogYaml(wd)
 	if err != nil {
 		log.Errorw("failed to read cog.yaml", "error", err)
 		return err
@@ -177,14 +168,14 @@ func (s *SchemaCmd) Run() error {
 }
 
 func (t *TestCmd) Run() error {
-	log := logger.Sugar()
+	log := createBaseLogger("cog-test").Sugar()
 
 	wd, err := os.Getwd()
 	if err != nil {
 		log.Errorw("failed to get working directory", "error", err)
 		return err
 	}
-	y, err := util.ReadCogYaml(wd)
+	y, err := runner.ReadCogYaml(wd)
 	if err != nil {
 		log.Errorw("failed to read cog.yaml", "error", err)
 		return err
@@ -203,8 +194,6 @@ func (t *TestCmd) Run() error {
 }
 
 func main() {
-	log := logger.Sugar()
-
 	var cli CLI
 	ctx := kong.Parse(&cli,
 		kong.Name("cog"),
@@ -214,7 +203,7 @@ func main() {
 
 	err := ctx.Run()
 	if err != nil {
-		log.Error(err)
+		fmt.Fprintf(os.Stderr, "Error: %v\n", err) //nolint:forbidigo // main function error handling
 		os.Exit(1)
 	}
 }