This document explains the design decisions, patterns, and architecture of the Launcher Go SDK.
- Overview
- Package Structure
- Protocol Layer
- Runtime Architecture
- Job Cache Design
- Testing Utilities
- Conformance Testing
- Design Patterns
- Design Decisions
- Trade-offs
- Future Considerations
The Launcher Go SDK provides a complete framework for building launcher plugins. The architecture is designed around several key principles:
- Simplicity: Plugin developers should focus on business logic, not protocol details
- Safety: Type-safe APIs prevent common mistakes
- Testability: Comprehensive testing utilities make plugin testing straightforward
- Performance: Efficient caching and streaming minimize overhead
- Extensibility: Clean interfaces allow for advanced features
The Launcher is a REST API service that provides a generic interface between Posit products (Posit Workbench, Posit Connect) and job scheduling systems. It is not specific to R processes -- it can launch arbitrary work through any scheduler for which a plugin exists.
On the front end, the Launcher exposes an HTTP API. Posit products send requests (e.g., GET /jobs, POST /jobs) to the Launcher, which handles authentication and authorization, distills the necessary information, and forwards it to the appropriate plugin.
On the back end, the Launcher manages plugins as child processes. It communicates with them via a JSON protocol over stdin/stdout. This means the Launcher and its plugins must run on the same machine, but the Posit product using the Launcher can run on a different machine (it only needs HTTP(S) access to the Launcher on port 5559 by default).
Multiple Launcher instances can be load balanced for improved throughput. In this configuration, every plugin instance must be able to return the same data, which typically means using the scheduler as the source of truth.
┌─────────────────────────────────────┐
│  Posit Workbench / Posit Connect    │  ← Product layer (HTTP API)
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│  Launcher (Job Launcher Service)    │  ← Launcher service (auth, routing)
└──────────────┬──────────────────────┘
               │ JSON over
               │ stdin/stdout
┌──────────────▼──────────────────────┐
│  Internal Protocol Layer            │  ← internal/protocol
│  - Request/response serialization   │
│  - Message framing                  │
│  - Stream management                │
│  - stdin/stdout communication       │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│  Public SDK Layer                   │  ← launcher, api, cache, logger
│  - Plugin interface                 │
│  - Job cache                        │
│  - Response writers                 │
│  - Type definitions                 │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│  Plugin Implementation              │  ← Your code
│  (implements Plugin interface)      │
└──────────────┬──────────────────────┘
               │
               ▼
Job Schedulers / Execution Environments
The launcher package is the main entry point:
// Core interfaces
type Plugin interface { /* 10 methods */ }
type ResponseWriter interface { /* 5 methods */ }
type StreamResponseWriter interface { /* 6 methods */ }
// Runtime for request handling
type Runtime struct { /* ... */ }
// Configuration
type DefaultOptions struct { /* ... */ }
Design rationale: Keep the Plugin interface minimal and focused. All methods receive a ResponseWriter and user context, making the API consistent and predictable.
The api package contains all types matching the Launcher Plugin API v3.6:
type Job struct { /* 30+ fields */ }
type JobFilter struct { /* ... */ }
type Error struct { /* ... */ }
// ... 40+ more types
Design rationale: One package for all API types prevents circular dependencies and makes the type system discoverable. Types are kept flat (no deep nesting) for JSON serialization efficiency.
The cache package provides in-memory job storage with pub/sub for status updates:
type JobCache struct { /* ... */ }
Design rationale: A separate package allows plugins to opt out if they have their own storage. Pub/sub is built in for status streaming rather than requiring external message queues.
The logger package provides Workbench-style logging utilities:
func NewLogger(name string, debug bool, dir string) (*slog.Logger, error)
func MustNewLogger(name string, debug bool, dir string) *slog.Logger
Design rationale: Wraps slog with Posit product conventions (file rotation, formatting) so plugins match the logging style of Workbench and Connect.
The plugintest package provides mocks, builders, and assertions:
type MockResponseWriter struct { /* ... */ }
type JobBuilder struct { /* ... */ }
func AssertNoError(t *testing.T, w *MockResponseWriter)
Design rationale: A separate testing package follows Go conventions. Fluent builders make test data construction readable. Assertions provide helpful error messages.
The conformance package provides automated tests that verify plugin behavior against product contracts:
func Run(t *testing.T, p launcher.Plugin, user string, profile Profile)
func RunWorkflows(t *testing.T, p launcher.Plugin, user string, profile Profile)
func RunStopJob(t *testing.T, p launcher.Plugin, user string, opts StopOpts)
Design rationale: Products (Workbench, Connect) expect specific behavioral contracts from plugins (e.g., Stop yields a terminal status, status streams deliver Running updates). Conformance tests codify these contracts so plugin authors get automated verification. The Profile struct parameterizes cross-scheduler behavioral deltas rather than hard-coding expectations.
The internal/protocol package handles JSON serialization over stdin/stdout:
type Communicator struct { /* ... */ }
type Request struct { /* ... */ }
type Response struct { /* ... */ }
Design rationale: An internal package prevents plugins from depending on protocol details, so protocol changes don't affect plugin code. Message framing prevents partial reads.
The SDK is structured into several logical components. Understanding their responsibilities helps with advanced implementation decisions.
The internal protocol layer receives and interprets requests from the Launcher, then translates and sends responses back. It listens for data on stdin in a background goroutine, parses and validates each request, and converts it into the appropriate typed request object for the Runtime to dispatch. When the plugin has a response, the protocol layer formats it and writes it to stdout.
The SDK fully implements this component. Plugin developers never interact with it directly.
The Runtime understands the meaning of each request and dispatches the correct action on the Plugin implementation. Given a request, it routes to the appropriate Plugin method, then converts the output to the appropriate response. Each request is processed in its own goroutine, so Plugin methods must be safe for concurrent use.
The cache maintains an in-memory store of jobs, provides pub/sub for status update notifications, enforces user permissions, and automatically expires old jobs. The cache acts as both the job repository and the status notification system -- when a job is updated via Update or AddOrUpdate, the cache notifies any active status stream subscribers automatically. The scheduler is always the source of truth for job state; the cache is a local working copy that plugins should populate during Bootstrap() and keep in sync via periodic polling.
Three types of requests produce streamed responses: job status, job output, and resource utilization. Each stream is independent -- the SDK constructs a new stream handler for each request. This means:
- Multiple output streams can be active for the same job simultaneously
- Individual stream instances don't need to coordinate with each other
- Each stream respects its own context for cancellation
- For job status streams, the cache's pub/sub handles fan-out automatically
When a plugin is launched by the Launcher, the following steps occur:
- The main function parses command-line options via MustLoadOptions
- The logger is created via MustNewLogger
- The job cache is created via NewJobCache
- The plugin implementation is constructed
- NewRuntime creates the runtime with the logger and plugin
- Runtime.Run is called, which:
  a. Initializes the protocol communicator (stdin/stdout)
  b. Receives and responds to the Bootstrap request (version negotiation)
  c. If the plugin implements BootstrappedPlugin, calls Bootstrap; this is where plugins should re-read active jobs from the scheduler into the cache
  d. Begins the heartbeat response loop
  e. Enters normal operation, dispatching requests to plugin methods
- The plugin runs until the context is cancelled or a fatal error occurs
When the Launcher terminates the plugin (or the context is cancelled):
- The context cancellation propagates to all active goroutines
- All active streams receive context cancellation and should return
- The protocol communicator stops reading from stdin
- The Runtime waits for in-flight requests to complete
- The process exits
All messages are JSON with a length prefix:
[4-byte length][JSON payload]
Why length-prefixed?
- Prevents partial message reads
- Allows streaming large payloads
- Simpler than delimiter-based protocols
- No escaping concerns
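The framing described above can be sketched as follows. This is a minimal illustration, not the SDK's internal implementation: the function names are hypothetical, and big-endian byte order for the 4-byte prefix is an assumption, since the byte order is not stated here.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
)

// writeFrame marshals v to JSON and writes it behind a 4-byte length prefix.
// Big-endian order is an assumption made for this sketch.
func writeFrame(w io.Writer, v any) error {
	payload, err := json.Marshal(v)
	if err != nil {
		return err
	}
	var header [4]byte
	binary.BigEndian.PutUint32(header[:], uint32(len(payload)))
	if _, err := w.Write(header[:]); err != nil {
		return err
	}
	_, err = w.Write(payload)
	return err
}

// readFrame reads exactly one frame and returns its JSON payload.
// io.ReadFull guarantees the caller never sees a partial message.
func readFrame(r io.Reader) ([]byte, error) {
	var header [4]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	payload := make([]byte, binary.BigEndian.Uint32(header[:]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// roundTrip writes v as one frame into a buffer and reads it back.
func roundTrip(v any) (string, error) {
	var buf bytes.Buffer
	if err := writeFrame(&buf, v); err != nil {
		return "", err
	}
	payload, err := readFrame(&buf)
	return string(payload), err
}

func main() {
	payload, err := roundTrip(map[string]string{"type": "heartbeat"})
	fmt.Println(payload, err)
}
```

A production reader would likely also cap the accepted frame length before allocating the payload buffer.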
Each operation has typed request/response:
type SubmitJobRequest struct {
	User string
	Job  *api.Job
}
type SubmitJobResponse struct {
	Jobs []*api.Job
}
Why separate types?
- Type safety prevents sending wrong data
- Clear documentation of what each operation needs
- Easy to add fields without breaking compatibility
Streaming methods use a different pattern:
- Initial request
- Multiple response messages
- Final close message
type StreamResponse struct {
	Type    string // "status", "output", "resource", "close"
	Payload json.RawMessage
}
Why separate stream type?
- Allows different payload types on same stream
- Client knows when stream is complete
- Error can be sent mid-stream
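To illustrate how one envelope type can carry different payloads and still signal completion, here is a small sketch. The JSON field tags, the outputChunk payload, and the helper names are assumptions made for the example; they are not taken from the SDK.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StreamResponse mirrors the envelope shown above; the JSON tags are assumed.
type StreamResponse struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// outputChunk is a hypothetical payload for "output" messages.
type outputChunk struct {
	Data string `json:"data"`
}

// wrap builds an envelope of the given type around any JSON-encodable payload.
func wrap(kind string, payload any) (StreamResponse, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return StreamResponse{}, err
	}
	return StreamResponse{Type: kind, Payload: raw}, nil
}

// collectOutput concatenates "output" payloads until a "close" message arrives.
// The boolean reports whether the stream was closed explicitly.
func collectOutput(msgs []StreamResponse) (string, bool) {
	var out string
	for _, m := range msgs {
		switch m.Type {
		case "output":
			var c outputChunk
			if err := json.Unmarshal(m.Payload, &c); err != nil {
				return out, false
			}
			out += c.Data
		case "close":
			return out, true
		}
	}
	return out, false // messages ran out without a close
}

func main() {
	first, _ := wrap("output", outputChunk{Data: "hello "})
	second, _ := wrap("output", outputChunk{Data: "world"})
	out, closed := collectOutput([]StreamResponse{first, second, {Type: "close"}})
	fmt.Println(out, closed)
}
```

Because the payload stays a json.RawMessage until the type is known, the receiver decodes only the message kinds it understands.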
The Launcher can request multiple streams simultaneously (e.g., multiple users watching job output, or status streams for different jobs). The SDK handles this by:
- Creating a new goroutine for each streaming request
- Each goroutine receives its own context.Context for cancellation
- When the Launcher sends a cancel request for a stream, the SDK cancels the corresponding context
- Job status streams use the cache's pub/sub, so multiple subscribers can watch the same job without additional scheduler queries
- Output and resource utilization streams are independent instances -- each one queries the scheduler separately
Important: Because multiple streams can be active simultaneously, your plugin's scheduler interaction code should be safe for concurrent use.
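The per-stream goroutine and cancellation handling described above might look roughly like this. The streamRegistry type and the use of a numeric request ID as the key are assumptions for illustration only; the SDK's actual bookkeeping may differ.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// streamRegistry tracks one cancel function per active stream.
// Keying by a numeric request ID is an assumption of this sketch.
type streamRegistry struct {
	mu      sync.Mutex
	cancels map[uint64]context.CancelFunc
	wg      sync.WaitGroup
}

func newStreamRegistry() *streamRegistry {
	return &streamRegistry{cancels: make(map[uint64]context.CancelFunc)}
}

// start runs fn in its own goroutine with its own cancellable context.
func (r *streamRegistry) start(parent context.Context, id uint64, fn func(ctx context.Context)) {
	ctx, cancel := context.WithCancel(parent)
	r.mu.Lock()
	r.cancels[id] = cancel
	r.mu.Unlock()
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		defer r.cancel(id) // release the context if the stream ends on its own
		fn(ctx)
	}()
}

// cancel stops a single stream; other streams are unaffected.
func (r *streamRegistry) cancel(id uint64) {
	r.mu.Lock()
	cancel, ok := r.cancels[id]
	delete(r.cancels, id)
	r.mu.Unlock()
	if ok {
		cancel()
	}
}

// active reports how many streams are currently registered.
func (r *streamRegistry) active() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.cancels)
}

// demo starts two streams, cancels them one at a time, and reports the
// number of registered streams after each step.
func demo() (afterOneCancel, afterAll int) {
	reg := newStreamRegistry()
	waitForCancel := func(ctx context.Context) { <-ctx.Done() }
	reg.start(context.Background(), 1, waitForCancel)
	reg.start(context.Background(), 2, waitForCancel)
	reg.cancel(1)
	afterOneCancel = reg.active()
	reg.cancel(2)
	reg.wg.Wait()
	return afterOneCancel, reg.active()
}

func main() {
	fmt.Println(demo())
}
```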
The Runtime dispatches requests to plugin methods:
Request arrives → Parse type → Route to method → Execute → Send response
func (r *Runtime) handleRequest(ctx context.Context, req *protocol.Request) {
	switch req.Type {
	case "submit_job":
		r.plugin.SubmitJob(...)
	case "get_job":
		r.plugin.GetJob(...)
	// ... 8 more cases
	}
}
Why switch-based routing?
- Simple and explicit
- Easy to debug
- Fast (no reflection)
- Type-safe
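A self-contained version of this routing idea is sketched below. The request struct, the two-method miniPlugin interface, and the echoPlugin type are hypothetical stand-ins, much smaller than the real Plugin interface.

```go
package main

import "fmt"

// request is a simplified stand-in for the internal protocol request.
type request struct {
	Type  string
	User  string
	JobID string
}

// miniPlugin is a hypothetical two-method subset of a plugin interface.
type miniPlugin interface {
	SubmitJob(user string) string
	GetJob(user, id string) string
}

// echoPlugin returns descriptive strings so the routing is easy to observe.
type echoPlugin struct{}

func (echoPlugin) SubmitJob(user string) string  { return "submitted for " + user }
func (echoPlugin) GetJob(user, id string) string { return "job " + id + " for " + user }

// dispatch routes on the request type with a plain switch.
// Unknown types become an error instead of being silently dropped.
func dispatch(p miniPlugin, req request) (string, error) {
	switch req.Type {
	case "submit_job":
		return p.SubmitJob(req.User), nil
	case "get_job":
		return p.GetJob(req.User, req.JobID), nil
	default:
		return "", fmt.Errorf("unsupported request type %q", req.Type)
	}
}

func main() {
	fmt.Println(dispatch(echoPlugin{}, request{Type: "get_job", User: "alice", JobID: "42"}))
	fmt.Println(dispatch(echoPlugin{}, request{Type: "resize_job"}))
}
```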
Each request gets a context:
ctx, cancel := context.WithCancel(parentCtx)
go func() {
	defer cancel() // release the context when the stream handler returns
	r.plugin.GetJobOutput(ctx, w, user, id, outputType)
}()
Why context?
- Cancellation propagates (client disconnect stops work)
- Timeouts can be added at runtime level
- Standard Go pattern for cancellation
Errors are typed with error codes:
type Error struct {
	Code    int // CodeJobNotFound, CodeInvalidJobState, etc.
	Message string
}
Why error codes?
- Workbench and Connect can handle errors differently based on code
- Clients can retry transient errors (CodeTimeout)
- Better than string parsing
- Follows Launcher API specification
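As a rough sketch of why codes are easier to act on than message strings, the example below decides whether an error is worth retrying. The numeric values, the Error() method, and the isRetryable helper are assumptions for illustration; the real values come from the Launcher API specification.

```go
package main

import (
	"errors"
	"fmt"
)

// The numeric values here are placeholders for this sketch,
// not the values defined by the Launcher API specification.
const (
	CodeJobNotFound = iota + 1
	CodeInvalidJobState
	CodeTimeout
)

// Error pairs a machine-readable code with a human-readable message.
type Error struct {
	Code    int
	Message string
}

// Error implements the error interface (assumed here for the example).
func (e *Error) Error() string {
	return fmt.Sprintf("code %d: %s", e.Code, e.Message)
}

// isRetryable reports whether err, or any error it wraps, carries a code
// that this sketch treats as transient.
func isRetryable(err error) bool {
	var e *Error
	return errors.As(err, &e) && e.Code == CodeTimeout
}

func main() {
	timeout := fmt.Errorf("submit failed: %w", &Error{Code: CodeTimeout, Message: "scheduler timed out"})
	missing := &Error{Code: CodeJobNotFound, Message: "no such job"}
	fmt.Println(isRetryable(timeout), isRetryable(missing))
}
```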
The cache uses in-memory storage. The scheduler is the source of truth for job state — the cache is a local working copy that plugins populate at startup and keep in sync during operation.
cache, _ := cache.NewJobCache(ctx, lgr)
Plugins should implement BootstrappedPlugin and use Bootstrap() to re-read active jobs from the scheduler into the cache before accepting requests. A periodic sync loop (e.g., every 5 seconds) should then reconcile cache state with the scheduler during normal operation. This is consistent with how all existing Launcher plugins (Local, Kubernetes, Slurm) operate.
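The bootstrap-then-poll pattern could be sketched as follows. The scheduler interface, localCache, and fakeScheduler are hypothetical stand-ins that track only a status string per job; a real plugin would use the SDK cache and its own scheduler client, and would also handle jobs that disappear from the scheduler.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// scheduler is a hypothetical view of the system that owns job state.
type scheduler interface {
	ListJobs(ctx context.Context) (map[string]string, error) // job ID -> status
}

// localCache is a simplified stand-in for the SDK job cache.
type localCache struct {
	mu   sync.Mutex
	jobs map[string]string
}

// reconcile overwrites cached statuses with the scheduler's view and
// returns how many entries changed.
func (c *localCache) reconcile(ctx context.Context, s scheduler) (int, error) {
	current, err := s.ListJobs(ctx)
	if err != nil {
		return 0, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	changed := 0
	for id, status := range current {
		if c.jobs[id] != status {
			c.jobs[id] = status
			changed++
		}
	}
	return changed, nil
}

// syncLoop reconciles once immediately (the bootstrap step) and then on
// every tick until ctx is cancelled.
func syncLoop(ctx context.Context, c *localCache, s scheduler, every time.Duration) error {
	if _, err := c.reconcile(ctx, s); err != nil {
		return err
	}
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if _, err := c.reconcile(ctx, s); err != nil {
				return err
			}
		}
	}
}

// fakeScheduler serves job statuses from a map.
type fakeScheduler map[string]string

func (f fakeScheduler) ListJobs(context.Context) (map[string]string, error) { return f, nil }

// demoReconcile reports the change count after each of three scheduler states.
func demoReconcile() []int {
	sched := fakeScheduler{}
	c := &localCache{jobs: make(map[string]string)}
	var changes []int
	for _, status := range []string{"Pending", "Running", "Running"} {
		sched["job-1"] = status
		n, _ := c.reconcile(context.Background(), sched)
		changes = append(changes, n)
	}
	return changes
}

func main() {
	fmt.Println(demoReconcile())
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Millisecond)
	defer cancel()
	c := &localCache{jobs: make(map[string]string)}
	fmt.Println(syncLoop(ctx, c, fakeScheduler{"job-1": "Running"}, 10*time.Millisecond))
}
```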
Cache includes built-in pub/sub:
// Subscriber
cache.StreamJobStatus(ctx, w, user, jobID)
// Publisher (different goroutine)
cache.Update(user, jobID, func(job *api.Job) *api.Job {
	job.Status = api.StatusRunning // Triggers notification
	return job
})
Why built-in pub/sub?
- No external message queue needed
- In-process is fast
- Automatic cleanup when subscribers disconnect
- Matches Launcher's streaming model
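A stripped-down version of in-process pub/sub is sketched below to show the fan-out and cleanup behavior listed above. The statusHub type is hypothetical and carries plain status strings instead of *api.Job values; dropping updates for slow subscribers is a choice made for this sketch, not a documented SDK behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// statusHub is a simplified stand-in for the cache's pub/sub.
type statusHub struct {
	mu   sync.RWMutex
	subs map[string][]chan string // jobID -> subscriber channels
}

func newStatusHub() *statusHub {
	return &statusHub{subs: make(map[string][]chan string)}
}

// subscribe registers a buffered channel for jobID and returns it together
// with a function that removes the subscription again.
func (h *statusHub) subscribe(jobID string) (<-chan string, func()) {
	ch := make(chan string, 8)
	h.mu.Lock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	h.mu.Unlock()
	return ch, func() {
		h.mu.Lock()
		defer h.mu.Unlock()
		list := h.subs[jobID]
		for i, c := range list {
			if c == ch {
				h.subs[jobID] = append(list[:i], list[i+1:]...)
				break
			}
		}
		if len(h.subs[jobID]) == 0 {
			delete(h.subs, jobID)
		}
	}
}

// publish fans a status out to every subscriber of jobID and returns the
// number of deliveries. A subscriber with a full buffer misses the update
// rather than blocking the publisher.
func (h *statusHub) publish(jobID, status string) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for _, ch := range h.subs[jobID] {
		select {
		case ch <- status:
			delivered++
		default:
		}
	}
	return delivered
}

func main() {
	hub := newStatusHub()
	first, unsubscribe := hub.subscribe("job-1")
	second, _ := hub.subscribe("job-1")
	fmt.Println(hub.publish("job-1", "Running"), <-first, <-second)
	unsubscribe()
	fmt.Println(hub.publish("job-1", "Finished"), <-second)
}
```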
Implementation:
type JobCache struct {
	subscribers map[string][]chan *api.Job // jobID -> subscribers
	mu          sync.RWMutex
}
Cache enforces user isolation:
cache.WriteJob(w, "alice", jobID) // Only succeeds if job.User == "alice"
Why in cache?
- Prevents permission bugs in plugin code
- Consistent across all operations
- Single source of truth
Old jobs are automatically removed:
go cache.cleanupExpiredJobs(ctx, expiry)
Why automatic?
- Prevents cache from growing unbounded
- Plugin doesn't need to track expiration
- Configurable per deployment
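The expiry pass itself can be quite small. The sketch below assumes jobs expire based on the time of their last update; the expiringCache type, its fields, and that expiry criterion are assumptions for illustration. In practice a ticker-driven goroutine would call removeExpired periodically.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedJob is a hypothetical cache entry that records its last update time.
type cachedJob struct {
	status      string
	lastUpdated time.Time
}

// expiringCache is a simplified stand-in for the SDK job cache.
type expiringCache struct {
	mu   sync.Mutex
	jobs map[string]cachedJob
}

// removeExpired deletes entries whose last update is older than maxAge and
// returns how many were removed. Taking now as a parameter keeps the
// function deterministic and easy to test.
func (c *expiringCache) removeExpired(now time.Time, maxAge time.Duration) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for id, j := range c.jobs {
		if now.Sub(j.lastUpdated) > maxAge {
			delete(c.jobs, id) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

// demoExpiry seeds one stale and one fresh job, runs a single expiry pass
// with a 24-hour limit, and reports how many were removed and how many remain.
func demoExpiry() (removed, remaining int) {
	now := time.Now()
	c := &expiringCache{jobs: map[string]cachedJob{
		"old": {status: "Finished", lastUpdated: now.Add(-48 * time.Hour)},
		"new": {status: "Running", lastUpdated: now.Add(-time.Minute)},
	}}
	removed = c.removeExpired(now, 24*time.Hour)
	return removed, len(c.jobs)
}

func main() {
	fmt.Println(demoExpiry())
}
```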
MockResponseWriter captures all responses for assertions:
w := plugintest.NewMockResponseWriter()
plugin.SubmitJob(context.Background(), w, "alice", job)
assert.False(t, w.HasError())
assert.Equal(t, 1, len(w.AllJobs()))
Why capture all responses?
- Single call may write multiple times
- Helps test error handling
- Makes assertions straightforward
Thread-safety: Mock uses sync.Mutex because plugin methods might spawn goroutines.
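The reason for the mutex becomes clearer with a concurrent writer. The recordingWriter below is a hypothetical, much smaller mock than plugintest.MockResponseWriter; it only shows how a mutex keeps recorded responses consistent when several goroutines write at once.

```go
package main

import (
	"fmt"
	"sync"
)

// recordingWriter is a hypothetical mock that records what was written.
type recordingWriter struct {
	mu     sync.Mutex
	jobs   []string
	errors []string
}

// WriteJob records a job ID.
func (w *recordingWriter) WriteJob(id string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.jobs = append(w.jobs, id)
}

// WriteError records an error message.
func (w *recordingWriter) WriteError(msg string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.errors = append(w.errors, msg)
}

// HasError reports whether any error was written.
func (w *recordingWriter) HasError() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return len(w.errors) > 0
}

// AllJobs returns a copy so callers cannot race with later writes.
func (w *recordingWriter) AllJobs() []string {
	w.mu.Lock()
	defer w.mu.Unlock()
	return append([]string(nil), w.jobs...)
}

// fanOutWrites writes n jobs from n goroutines and waits for all of them,
// imitating a plugin method that spawns workers.
func fanOutWrites(w *recordingWriter, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			w.WriteJob(fmt.Sprintf("job-%d", i))
		}(i)
	}
	wg.Wait()
}

func main() {
	w := &recordingWriter{}
	fanOutWrites(w, 50)
	fmt.Println(len(w.AllJobs()), w.HasError())
}
```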
JobBuilder keeps test data construction readable:
job := plugintest.NewJob().
	WithUser("alice").
	WithCommand("python train.py").
	WithMemory("8GB").
	Running().
	Build()
Why fluent API?
- Tests read like English
- Only specify what matters for the test
- Defaults handle required fields
- Discoverability via autocomplete
Descriptive error messages:
func AssertJobStatus(t *testing.T, job *api.Job, expected string) {
	if job.Status != expected {
		t.Errorf("Expected job status %s, got %s (job: %+v)",
			expected, job.Status, job)
	}
}
Why custom assertions?
- Better error messages than assert.Equal
- Job-specific context in failures
- Reduce test boilerplate
Plugin authors implement the launcher.Plugin interface but historically had no automated way to verify their implementation will work correctly with Posit products. The plugintest package provides unit-test building blocks (mocks, builders, assertions) but no orchestrated test scenarios. Conformance tests fill this gap: automated behavioral tests that verify a plugin handles the request sequences Posit products produce.
// Tier 1: Universal invariants
conformance.Run(t, plugin, user, profile)
// Tier 2: Product workflow tests
conformance.RunWorkflows(t, plugin, user, profile)
// Tier 3: Individual scenarios
conformance.RunStopJob(t, plugin, user, opts)
Why three tiers?
- Tier 1 catches fundamental contract violations (missing IDs, wrong error codes)
- Tier 2 verifies the end-to-end request sequences products rely on
- Tier 3 allows testing specific behaviors in isolation, useful for debugging
Each tier builds on the one below — RunWorkflows calls the Tier 3 scenario functions internally. Plugin authors can use any combination.
Different schedulers produce different outcomes for identical operations. For example, ControlJob(Stop) yields StatusFinished on Local/Kubernetes but StatusKilled on Slurm (because scancel reports a KILLED state). Rather than hard-coding expectations, the Profile struct parameterizes these deltas:
type Profile struct {
	JobFactory     func(user string) *api.Job // How to create a submittable job
	LongRunningJob func(user string) *api.Job // How to create a job for control tests
	StopStatus     string                     // Terminal status after Stop
	KillExitCodes  []int                      // Acceptable exit codes after Kill
	// ...
}
Why a single struct? Both Run and RunWorkflows need the same behavioral parameters. A single struct avoids duplicating configuration across tiers and makes it clear that the deltas are a property of the plugin, not of the test tier.
Exported helpers (SubmitJob, WaitForStatus, WaitForTerminalStatus, etc.) follow two patterns:
- Fatal on prerequisite failure: SubmitJob calls t.Fatal because subsequent test logic can't proceed without a job ID
- Return errors for assertions: GetJob and WaitForStatus return errors so callers can assert on error conditions or decide how to handle timeouts
Why poll-based waiting? WaitForStatus polls GetJob rather than using the streaming API. This is deliberate: it tests the same code path products use for quick status checks, and it avoids goroutine management complexity in test code. The streaming API is tested separately in RunStatusStream.
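A poll-based wait helper along these lines might look as follows. This is not the conformance package's WaitForStatus; the getStatus function type, statusSequence, and the signature are hypothetical, chosen to show the polling loop and the error-returning style described above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getStatus is a hypothetical stand-in for a GetJob call that returns
// only the job's status.
type getStatus func(ctx context.Context) (string, error)

// waitForStatus polls get until it reports want or ctx ends. It returns an
// error instead of failing the test so callers decide how to handle timeouts.
func waitForStatus(ctx context.Context, get getStatus, want string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		got, err := get(ctx)
		if err != nil {
			return err
		}
		if got == want {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for status %q (last seen %q): %w", want, got, ctx.Err())
		case <-ticker.C:
		}
	}
}

// statusSequence returns a getStatus that walks through the given statuses
// and then keeps repeating the last one.
func statusSequence(statuses ...string) getStatus {
	i := 0
	return func(context.Context) (string, error) {
		s := statuses[i]
		if i < len(statuses)-1 {
			i++
		}
		return s, nil
	}
}

// runDemo waits once for a status that is reached and once for a status
// that never appears, reporting success and timeout respectively.
func runDemo() (reached, timedOut bool) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	get := statusSequence("Pending", "Running", "Finished")
	reached = waitForStatus(ctx, get, "Finished", time.Millisecond) == nil

	short, cancelShort := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancelShort()
	err := waitForStatus(short, statusSequence("Pending", "Running"), "Killed", time.Millisecond)
	return reached, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(runDemo())
}
```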
The conformance package depends on plugintest (mocks, assertions) but not the reverse. This maintains a clear dependency direction:
conformance → plugintest → launcher, api
Plugin authors use plugintest for unit tests and conformance for behavioral verification. The two packages complement each other.
Core types are interfaces:
type Plugin interface { /* ... */ }
type ResponseWriter interface { /* ... */ }
Benefits:
- Easy to mock for testing
- Allows alternative implementations
- Clear contracts
- Follows Go conventions
Used in builders:
type JobBuilder struct { /* ... */ }
func (b *JobBuilder) WithUser(user string) *JobBuilder {
	b.job.User = user
	return b
}
Benefits:
- Chainable
- Self-documenting
- Optional parameters
- Type-safe
Used in cache updates:
cache.Update(user, jobID, func(job *api.Job) *api.Job {
	job.Status = api.StatusRunning
	return job
})
Benefits:
- Atomic updates
- Job only locked during callback
- Clear what's being modified
- Can abort by returning unchanged job
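One way such a callback-based update could be implemented is sketched below. The store and job types are hypothetical, and details such as handing the callback a copy of the job or treating a nil return as "no change" are assumptions of this sketch rather than documented cache behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a simplified stand-in for api.Job.
type job struct {
	ID, User, Status string
}

// store is a hypothetical job store protected by a single mutex.
type store struct {
	mu   sync.Mutex
	jobs map[string]*job
}

func newStore(jobs ...job) *store {
	s := &store{jobs: make(map[string]*job)}
	for i := range jobs {
		j := jobs[i]
		s.jobs[j.ID] = &j
	}
	return s
}

// update applies fn to a copy of the stored job while holding the lock and
// stores whatever fn returns. A job owned by another user is reported as
// not found, so the callback never sees it.
func (s *store) update(user, id string, fn func(j *job) *job) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current, ok := s.jobs[id]
	if !ok || current.User != user {
		return fmt.Errorf("job %s not found for user %s", id, user)
	}
	copied := *current
	if updated := fn(&copied); updated != nil {
		s.jobs[id] = updated
	}
	return nil
}

// status returns the stored status, or "" if the job does not exist.
func (s *store) status(id string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if j, ok := s.jobs[id]; ok {
		return j.Status
	}
	return ""
}

func main() {
	s := newStore(job{ID: "42", User: "alice", Status: "Pending"})
	err := s.update("alice", "42", func(j *job) *job {
		j.Status = "Running"
		return j
	})
	fmt.Println(s.status("42"), err)
}
```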
All streaming methods accept context:
func (p *Plugin) GetJobOutput(ctx context.Context, w StreamResponseWriter, ...) {
	for {
		select {
		case <-ctx.Done():
			return // Client disconnected
		case data := <-outputChan:
			w.WriteJobOutput(data, outputType)
		}
	}
}
Benefits:
- Automatic cleanup on client disconnect
- Standard Go pattern
- Works with timeouts
- Composable
Go was chosen over C++, Python, and Java:
Pros:
- Fast compilation and startup
- Small binary size
- Excellent concurrency primitives
- Strong standard library
- Easy deployment (single binary)
- Good HTTP/gRPC libraries
Cons:
- Generics require Go 1.18 or later
- Less mature than C++ SDK
- Smaller ecosystem than Python
Verdict: Go's simplicity and deployment model outweigh the cons.
Decision: JSON over stdin/stdout
Alternatives: gRPC, HTTP
Pros of stdin/stdout:
- Simple process model
- No port management
- Automatic cleanup (process death)
- Works in containers
- Matches existing C++ SDK
Cons:
- Can't debug with multiple plugins
- No connection multiplexing
Verdict: Simplicity wins for this use case.
Decision: Separate api package
Alternative: Types in launcher package
Pros of separate:
- No circular dependencies
- Clear API boundary
- Can import types without launcher
- Matches Posit product conventions
Cons:
- More packages to import
- Longer type names (api.Job)
Verdict: Separation provides better structure.
Decision: Keep the protocol package internal
Alternative: Export protocol types
Pros of hiding:
- Plugin can't depend on protocol details
- Free to change protocol implementation
- Cleaner API surface
- Prevents misuse
Cons:
- Can't customize protocol
- Can't reuse types
Verdict: Hiding gives flexibility for future changes.
Decision: Prefer type safety
// Type-safe (chosen)
type JobStatus string
const StatusPending JobStatus = "Pending"
// vs flexible (rejected): a plain string field that accepts any value
// Status string
Rationale: Catch errors at compile time, better IDE support.
Decision: Cache with expiration
Plugins cache jobs rather than querying scheduler every time. Stale data is acceptable for short periods.
Rationale: Reduces scheduler load, faster responses, acceptable for UI.
Decision: In-process pub/sub
Alternatives: Redis, NATS
Rationale: One plugin = one process. No cross-plugin communication needed. External pub/sub adds complexity and dependencies.
Decision: Polling for job status
Most schedulers (Slurm, PBS, LSF) don't provide event notifications.
Rationale: Matches scheduler capabilities, simpler implementation, consistent behavior.
Decision: Provide utilities, not framework
Alternatives: Full testing framework, test runner
Rationale: Plugins should use standard go test. Utilities (mocks, builders, assertions) are sufficient.
Current version: v0.x (pre-1.0)
Breaking changes are allowed in minor versions and are accompanied by migration guides. After v1.0, the SDK will follow semantic versioning with backwards-compatibility guarantees.
Potential additions:
- Job Dependencies: Directed Acyclic Graph (DAG) execution
- Autoscaling: Scale cluster based on load
- Cost Tracking: Track compute costs per job
- Spot Instance Support: Preemptible jobs
All can be added via new interfaces without breaking existing plugins.
Current multi-cluster support is basic. Future improvements:
- Cluster Health: Monitor cluster availability
- Failover: Automatic cluster switching
- Load Balancing: Distribute jobs across clusters
- Cluster Groups: Logical grouping of clusters
Potential additions:
- Metrics: Prometheus endpoint for job metrics
- Tracing: OpenTelemetry support
- Health Checks: HTTP endpoint for monitoring
- Profiling: pprof endpoint for debugging
Potential improvements:
- Connection Pooling: Reuse scheduler connections
- Result Caching: Cache scheduler queries
- Batch Status Updates: Update multiple jobs atomically
- Lazy Loading: Load job details on demand
Potential additions:
- Audit Logging: Track all operations
- Secret Management: Integration with secret stores
- Role-Based Access Control (RBAC)
- Encryption: Encrypt job data at rest
The Launcher Go SDK architecture prioritizes:
- Developer experience - Simple, intuitive API
- Type safety - Catch errors at compile time
- Testability - Easy to write good tests
- Performance - Fast enough for production
- Extensibility - Can evolve without breaking changes
These principles guide all design decisions and will continue to guide future development.