Merged
Changes from 15 commits
69 changes: 69 additions & 0 deletions .agents/skills/adding-dependencies/SKILL.md
@@ -0,0 +1,69 @@
---
name: adding-dependencies
description: "Adding or updating crate dependencies in the Golem workspace. Use when adding a new Rust dependency, changing dependency versions, or configuring dependency features."
---

# Adding Dependencies

All crate dependencies in the Golem workspace are centrally managed. Versions and default features are specified **once** in the root `Cargo.toml` under `[workspace.dependencies]`, and workspace members reference them with `{ workspace = true }`.

## Adding a New Dependency

### Step 1: Add to root workspace Cargo.toml

Add the dependency under `[workspace.dependencies]` in the root `Cargo.toml`, specifying the version and any default features:

```toml
# Simple version
my-crate = "1.2.3"

# With features
my-crate = { version = "1.2.3", features = ["feature1", "feature2"] }

# With default-features disabled
my-crate = { version = "1.2.3", default-features = false }
```

Keep entries **alphabetically sorted** within the section. Internal workspace crates are listed first (with `path`), followed by external dependencies.

### Step 2: Reference from workspace member

In the member crate's `Cargo.toml`, add the dependency using `workspace = true`:

```toml
[dependencies]
my-crate = { workspace = true }

# To add extra features beyond what the workspace specifies
my-crate = { workspace = true, features = ["extra-feature"] }

# To make it optional
my-crate = { workspace = true, optional = true }
```

**Never** specify a version directly in a member crate's `Cargo.toml`. Always use `{ workspace = true }`.

The same pattern applies to `[dev-dependencies]` and `[build-dependencies]`.
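
The "no versions in member crates" rule can be checked mechanically. The following is a minimal sketch, not part of the Golem tooling: `deps_with_local_version` is a hypothetical helper that does a naive line-based scan of a member manifest rather than real TOML parsing, so it only handles single-line dependency entries.

```rust
/// Returns the names of dependencies that carry their own version
/// instead of inheriting it with `workspace = true`.
/// Hypothetical helper: a naive line scan, not a real TOML parser.
fn deps_with_local_version(manifest: &str) -> Vec<String> {
    let mut in_deps = false;
    let mut offenders = Vec::new();
    for raw in manifest.lines() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if line.starts_with('[') {
            // Track whether the current table is a dependency table.
            in_deps = matches!(
                line,
                "[dependencies]" | "[dev-dependencies]" | "[build-dependencies]"
            );
            continue;
        }
        if !in_deps {
            continue;
        }
        if let Some((name, value)) = line.split_once('=') {
            let value = value.trim();
            // Matches `name = "1.2.3"` and `name = { version = "1.2.3", ... }`.
            if value.starts_with('"') || value.contains("version") {
                offenders.push(name.trim().to_string());
            }
        }
    }
    offenders
}
```

An empty result means every scanned dependency entry inherits from the workspace; any returned name should be moved to the root `Cargo.toml`.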

### Step 3: Verify

```shell
cargo build -p <crate> # Build the specific crate
cargo make build # Full workspace build
```

## Updating a Dependency Version

Change the version **only** in the root `Cargo.toml` under `[workspace.dependencies]`. All workspace members automatically pick up the new version.

## Pinned and Patched Dependencies

Some dependencies use exact versions (`=x.y.z`) to ensure compatibility. Check the `[patch.crates-io]` section in the root `Cargo.toml` for git-overridden crates (e.g., `wasmtime`). When updating patched dependencies, both the version under `[workspace.dependencies]` and the corresponding `[patch.crates-io]` entry must be updated together.

## Checklist

1. Version specified in root `Cargo.toml` under `[workspace.dependencies]`
2. Member crate references it with `{ workspace = true }`
3. No version numbers in member crate `Cargo.toml` files
4. Entry is alphabetically sorted in the workspace dependencies list
5. `cargo make build` succeeds
89 changes: 89 additions & 0 deletions .agents/skills/debugging-hanging-tests/SKILL.md
@@ -0,0 +1,89 @@
---
name: debugging-hanging-tests
description: "Diagnosing and fixing hanging worker executor or integration tests. Use when a test hangs indefinitely, times out, or appears stuck during execution."
---

# Debugging Hanging Tests

Worker executor and integration tests can hang indefinitely due to `unimplemented!()` panics in async tasks, deadlocks, missing shard assignments, or other async runtime issues. This skill provides a systematic workflow for diagnosing and resolving these hangs.

## Common Causes

| Cause | Symptom |
|-------|---------|
| `unimplemented!()` panic in async task | Test hangs after a log line mentioning the unimplemented feature |
| Deadlock | Test hangs with no further log output |
| Missing shard assignment | Worker never starts executing |
| Channel sender dropped | Receiver awaits forever with no error |
| Infinite retry loop | Repeated log lines with the same error |

## Step 1: Add a Timeout

Add a `#[timeout]` attribute so the test fails with a clear error instead of hanging forever:

```rust
use test_r::test;
use test_r::timeout;

#[test]
#[timeout("30s")]
async fn my_hanging_test() {
    // ...
}
```

Choose a timeout generous enough for normal execution but short enough to fail quickly when hung (30s–60s for most tests, up to 120s for complex integration tests).
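
To make the idea of a timeout guard concrete, here is a minimal sketch using only the standard library. It is not how `test_r` implements `#[timeout]`; `run_with_timeout` is a hypothetical helper, and a plain thread stands in for the test body.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a separate thread and waits at most `limit` for its result.
/// Returns `None` if the work did not finish in time (or panicked).
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the receiver already gave up.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}
```

The caller gets a definite answer after `limit` at the latest, which is the property that turns a silent hang into a visible failure.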

## Step 2: Capture Full Output

Run the test with `--nocapture` and save **all output** to a file. The root cause often appears far before the point where the test hangs:

```shell
mkdir -p tmp # Make sure the output directory exists
cargo test -p <crate> <test_name> -- --nocapture > tmp/test_output.txt 2>&1
```

**Important:** Always redirect to a file. The output can be thousands of lines, and the relevant error may be near the beginning while the hang occurs at the end.

## Step 3: Search for Root Cause

Search the saved output file for these patterns, in order of likelihood:

```shell
grep -n -E "unimplemented|not (yet )?implemented" tmp/test_output.txt
grep -n "panic" tmp/test_output.txt
grep -n "ERROR" tmp/test_output.txt
grep -n "WARN" tmp/test_output.txt
```

### What to look for

- **`not implemented`**, **`not yet implemented`**, or **`unimplemented`**: An async task hit an `unimplemented!()` (message `not implemented`) or `todo!()` (message `not yet implemented`) code path and panicked. The panic is silently swallowed by the async runtime, causing the caller to await forever.
- **`panic`**: Similar to the above. A panic in a spawned task does not propagate to the test.
- **`ERROR` with retry**: A service call failing repeatedly, causing an infinite retry loop.
- **Repeated identical log lines**: Indicates a retry loop or polling cycle that never succeeds.
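
The first two patterns share one mechanism: a panic stays inside the task that raised it. The sketch below illustrates this with a standard-library thread as a stand-in for an async task; `observe_worker_panic` is a hypothetical name, and the async runtime used by the tests may surface the failure differently.

```rust
use std::thread;

/// Spawns a worker that hits an unimplemented code path and reports
/// what the spawner can observe about it.
fn observe_worker_panic() -> Result<(), String> {
    let handle = thread::spawn(|| {
        unimplemented!("hypothetical feature");
    });
    // The panic stays inside the worker. It becomes visible here only
    // because the handle is joined and the result is inspected.
    match handle.join() {
        Ok(()) => Ok(()),
        Err(payload) => {
            let msg = payload
                .downcast_ref::<String>()
                .cloned()
                .or_else(|| payload.downcast_ref::<&str>().map(|s| s.to_string()))
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(msg)
        }
    }
}
```

If nothing joins the handle or inspects its result, the only trace of the failure is the panic message in the captured output, which is why the search in this step matters.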

## Step 4: Fix the Root Cause

### If caused by `unimplemented!()`
Implement the missing functionality, or if it's a test-only issue, provide a stub/mock.

### If caused by a deadlock
Look for:
- Multiple `lock()` calls on the same mutex in nested scopes
- `await` while holding a lock guard
- Circular lock dependencies between tasks
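
A common fix for the first two items is to narrow the guard's scope so it is dropped before any call that may lock again or block. The sketch below uses `std::sync::Mutex` and hypothetical function names; the same scoping idea applies to guards held across an `await`.

```rust
use std::sync::Mutex;

/// Increments the counter. Takes the lock itself.
fn bump(counter: &Mutex<u64>) {
    *counter.lock().unwrap() += 1;
}

/// Reads the counter, then bumps it, without ever holding two guards at once.
fn read_then_bump(counter: &Mutex<u64>) -> u64 {
    // The guard lives only inside this block and is dropped at its end.
    let before = {
        let guard = counter.lock().unwrap();
        *guard
    };
    // Calling `bump` while `guard` was still alive would lock the same
    // mutex twice on one thread, which deadlocks or panics.
    bump(counter);
    before
}
```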

### If caused by missing shard assignment
Check that the test setup properly initializes the shard manager and assigns shards before starting workers.

### If caused by a dropped sender
Ensure all channel senders are kept alive for the duration the receiver needs them. Check for early returns or error paths that drop the sender.

## Checklist

1. `#[timeout("30s")]` added to the hanging test
2. Test run with `--nocapture`, output saved to file
3. Output searched for `unimplemented`, `panic`, `ERROR`
4. Root cause identified and fixed
5. Test passes within the timeout
6. `#[timeout]` removed if it was only added for debugging (or kept as a safety net)
101 changes: 101 additions & 0 deletions .agents/skills/modifying-http-endpoints/SKILL.md
@@ -0,0 +1,101 @@
---
name: modifying-http-endpoints
description: "Adding or modifying HTTP REST API endpoints in Golem services. Use when creating new endpoints, changing existing API routes, or updating request/response types for the Golem REST API."
---

# Modifying HTTP Endpoints

## Framework

Golem uses **Poem** with **poem-openapi** for REST API endpoints. Endpoints are defined as methods on API structs annotated with `#[OpenApi]` and `#[oai]`.

## Where Endpoints Live

- **Worker service**: `golem-worker-service/src/api/` — worker lifecycle, invocation, oplog
- **Registry service**: `golem-registry-service/src/api/` — components, environments, deployments, plugins, accounts

Each service has an `api/mod.rs` that defines an `Apis` type tuple and a `make_open_api_service` function combining all API structs.

## Adding a New Endpoint

### 1. Define the endpoint method

Add a method to the appropriate API struct (e.g., `WorkerApi`, `ComponentsApi`):

```rust
#[oai(
    path = "/:component_id/workers/:worker_name/my-action",
    method = "post",
    operation_id = "my_action"
)]
async fn my_action(
    &self,
    component_id: Path<ComponentId>,
    worker_name: Path<String>,
    request: Json<MyRequest>,
    token: GolemSecurityScheme,
) -> Result<Json<MyResponse>> {
    // ...
}
```

### 2. If adding a new API struct

1. Create a new file in the service's `api/` directory
2. Define a struct and impl block with `#[OpenApi(prefix_path = "/v1/...", tag = ApiTags::...)]`
3. Add it to the `Apis` type tuple in `api/mod.rs`
4. Instantiate it in `make_open_api_service`

### 3. Request/response types

- Define types in `golem-common/src/model/` with `poem_openapi::Object` derive
- If the type is used in the generated client, add it to the type mapping in `golem-client/build.rs`

## After Modifying Endpoints

After any endpoint change, you **must** regenerate and rebuild:

### Step 1: Regenerate OpenAPI specs

```shell
cargo make generate-openapi
```

This builds the services, dumps their OpenAPI YAML, merges them, and stores the result in `openapi/`.

### Step 2: Clean and rebuild golem-client

The `golem-client` crate auto-generates its code from the OpenAPI spec at build time via `build.rs`. After regenerating the specs:

```shell
cargo clean -p golem-client
cargo build -p golem-client
```

The clean step is necessary because, although the build script declares `rerun-if-changed` on the YAML file, cargo may still reuse stale generated code.

### Step 3: If new types are used in the client

Add type mappings in `golem-client/build.rs` to the `gen()` call's type replacement list. This maps OpenAPI schema names to existing Rust types from `golem-common` or `golem-wasm`.

### Step 4: Build and verify

```shell
cargo make build
```

Then run the appropriate tests:

- HTTP API tests: `cargo make api-tests-http`
- gRPC API tests: `cargo make api-tests-grpc`

## Checklist

1. Endpoint method added with `#[oai]` annotation
2. New API struct registered in `api/mod.rs` `Apis` tuple and `make_open_api_service` (if applicable)
3. Request/response types defined in `golem-common` with `poem_openapi::Object`
4. Type mappings added in `golem-client/build.rs` (if applicable)
5. `cargo make generate-openapi` run
6. `cargo clean -p golem-client && cargo build -p golem-client` run
7. `cargo make build` succeeds
8. `cargo make fix` run before PR
80 changes: 80 additions & 0 deletions .agents/skills/modifying-service-configs/SKILL.md
@@ -0,0 +1,80 @@
---
name: modifying-service-configs
description: "Modifying service configuration types or defaults. Use when changing config structs, adding config fields, or updating default values for any Golem service."
---

# Modifying Service Configs

Golem services use a configuration system built on [Figment](https://github.com/SergioBenitez/Figment) via a custom `ConfigLoader`. Configuration defaults are serialized to TOML and env-var reference files that are checked into the repository and validated in CI.

## How Configuration Works

Each service has a configuration struct that implements:
- `Default` — provides default values
- `Serialize` / `Deserialize` — for TOML and env-var serialization
- `SafeDisplay` — for logging without exposing secrets

Services load config by merging (in order): defaults → TOML file → environment variables.
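
The merge order means that a later source overrides an earlier one key by key. The following is a minimal sketch of that precedence with plain hash maps; it does not use Figment or `ConfigLoader`, and the `Layer` alias, `merge_layers`, `layer`, and the key names are all hypothetical.

```rust
use std::collections::HashMap;

type Layer = HashMap<String, String>;

/// Merges layers in order; later layers override earlier ones.
fn merge_layers(layers: &[Layer]) -> Layer {
    let mut merged = Layer::new();
    for layer in layers {
        for (key, value) in layer {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}

/// Convenience constructor for a layer from key/value pairs.
fn layer(pairs: &[(&str, &str)]) -> Layer {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Passing the layers as `[defaults, toml, env]` reproduces the order described above: a value set in the environment wins over the TOML file, which wins over the default.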

## Service Config Locations

| Service | Config struct | File |
|---------|--------------|------|
| Worker Executor | `GolemConfig` | `golem-worker-executor/src/services/golem_config.rs` |
| Worker Service | `WorkerServiceConfig` | `golem-worker-service/src/config.rs` |
| Registry Service | `RegistryServiceConfig` | `golem-registry-service/src/config.rs` |
| Shard Manager | `ShardManagerConfig` | `golem-shard-manager/src/shard_manager_config.rs` |
| Compilation Service | `ServerConfig` | `golem-component-compilation-service/src/config.rs` |

The all-in-one `golem` binary has its own merged config that combines multiple service configs.

## Modifying a Config

### Step 1: Edit the config struct

Add, remove, or modify fields in the appropriate config struct. Update the `Default` implementation if default values change.
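
As an illustration of this step, the sketch below adds a field to a made-up config struct and updates its `Default` and safe-display implementations together. Everything here is an assumption for illustration: `ExampleConfig` and its fields are not real Golem types, the `SafeDisplay` trait is a local stand-in whose method name may differ from the real one, and the `serde` derives are omitted to keep the block dependency-free.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for the real `SafeDisplay` trait.
trait SafeDisplay {
    fn to_safe_string(&self) -> String;
}

#[derive(Debug, Clone, PartialEq)]
struct ExampleConfig {
    http_port: u16,
    api_token: String,
    // Newly added field: it needs a default value as well.
    request_timeout_secs: u64,
}

impl Default for ExampleConfig {
    fn default() -> Self {
        Self {
            http_port: 8080,
            api_token: String::new(),
            request_timeout_secs: 30,
        }
    }
}

impl SafeDisplay for ExampleConfig {
    fn to_safe_string(&self) -> String {
        let mut out = String::new();
        let _ = writeln!(out, "http port: {}", self.http_port);
        // Secrets are masked rather than printed.
        let _ = writeln!(out, "api token: ****");
        let _ = writeln!(out, "request timeout: {}s", self.request_timeout_secs);
        out
    }
}
```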

### Step 2: Regenerate config files

```shell
cargo make generate-configs
```

This builds the service binaries and runs them with `--dump-config-default-toml` and `--dump-config-default-env-var` flags, producing reference files that reflect the current `Default` implementation.

### Step 3: Verify

```shell
cargo make build
```

### Step 4: Check configs match

CI runs `cargo make check-configs`, which regenerates the configs and diffs them against the committed files. If it fails, the most likely cause is that `cargo make generate-configs` was not run (or its output was not committed) after the config change.

## Adding a New Config Field

1. Add the field to the config struct with a `serde` attribute if needed
2. Set its default value in the `Default` impl
3. Run `cargo make generate-configs` to update reference files
4. If the field requires a new environment variable, the env-var mapping is derived automatically from the field path
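
Deriving an environment variable name from a field path can be pictured as joining the upper-cased path segments. This is only a sketch under stated assumptions: the prefix and separator shown are placeholders, `env_var_name` is a hypothetical function, and the actual naming rules are defined by `ConfigLoader` and may differ. The generated env-var reference file is the authoritative source for real names.

```rust
/// Builds an env-var name from a config field path.
/// ASSUMPTION: prefix and separator are illustrative placeholders,
/// not the values the real config loader uses.
fn env_var_name(prefix: &str, separator: &str, field_path: &[&str]) -> String {
    let mut parts = vec![prefix.to_uppercase()];
    parts.extend(field_path.iter().map(|segment| segment.to_uppercase()));
    parts.join(separator)
}
```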

## Removing a Config Field

1. Remove the field from the struct and `Default` impl
2. Run `cargo make generate-configs`
3. Check for any code that references the removed field

## Nested Config Types

Many config structs compose sub-configs (e.g., `GolemConfig` contains `WorkersServiceConfig`, `BlobStoreServiceConfig`, etc.). When modifying a sub-config type that's shared across services, regenerate configs for all affected services — `cargo make generate-configs` handles this automatically.

## Checklist

1. Config struct modified with appropriate `serde` attributes
2. `Default` implementation updated
3. `cargo make generate-configs` run
4. Generated TOML and env-var files committed
5. `cargo make build` succeeds
6. `cargo make check-configs` passes (CI validation)
7. `cargo make fix` run before PR