Skip to content

Commit e0b211d

Browse files
authored
(kiloclaw-admin): Add Destroy Machine functionality to the KiloClaw admin dash (#1550)
## Summary Adds a "Destroy Machine" button to the KiloClaw admin instance detail page that directly calls the Fly Machines API to force-destroy a machine — equivalent to `fly machine destroy <id> -a <app> --force`. **Why this exists:** The normal instance destroy flow goes through the DO's teardown pipeline (machine → volume → cleanup). This button is an admin escape hatch for when you need to delete the Fly machine directly without triggering the full teardown. **What it does NOT do:** This does not tear down the KiloClaw instance, delete the Fly volume, or mark the instance as destroyed in the database. Only the Fly machine is destroyed. ### Implementation - **Platform route** (`POST /api/platform/destroy-fly-machine`): Calls `DELETE /v1/apps/{app}/machines/{id}?force=true` on the Fly Machines API using the worker's `FLY_API_TOKEN`. After a successful destroy, triggers `forceRetryRecovery` on the DO to schedule an immediate reconcile alarm — the DO will discover the machine is destroyed and clear its `flyMachineId` state. - **tRPC mutation** (`destroyFlyMachine`): Admin-only mutation that proxies to the platform route and writes an audit log (`kiloclaw.machine.destroy_fly`). - **UI**: Orange "Destroy Machine" button in Machine Controls with a confirmation dialog showing the machine ID and app name. After success, the button shows a green "Machine Destroyed" state and is disabled. An error toast is shown if `flyAppName` is missing from the DO state. - **Observability**: `writeEvent` calls for success/failure (visible in the DO & Reconcile Events table on the admin page), plus `createKiloClawAdminAuditLog` for the audit trail. ### Reconciler fix: handle `"destroyed"` Fly state `syncStatusWithFly` previously had no branch for the `"destroyed"` Fly machine state. When a machine is force-destroyed via the Fly API, Fly returns the machine with `state: "destroyed"` before it eventually starts returning 404. Without this fix, the reconciler would no-op on the `"destroyed"` state, leaving the stale `flyMachineId` in DO storage until a later alarm cycle when Fly finally returns 404 and `handleMachineGone` fires. The new `"destroyed"` branch mirrors `handleMachineGone` — clears `flyMachineId`, sets status to `"stopped"`, and persists immediately. ### New audit action `kiloclaw.machine.destroy_fly` added to `KiloClawAdminAuditAction` enum. ## Verification - [x] TypeScript compiles cleanly (`npx tsc --noEmit`) - [x] Tested locally: destroy succeeds, Fly machine removed, DO reconciles after alarm fires - [x] 4 rounds of adversarial code review (code reviewer, silent failure hunter, security auditor) with verification agents — converged to zero new actionable findings - [x] E2E manual testing ## Loom — Kilo team only https://www.loom.com/share/7e898ec288da4e9a83d2554780622dd9 ## Reviewer Notes - The `forceRetryRecovery` call after destroy is fire-and-forget (caught and warned). If it fails, the DO will still reconcile on its next regular alarm. The response returns `ok: true` regardless — this was a deliberate scoping decision to avoid threading a `reconcileTriggered` flag through three layers. - Until the reconcile task following machine destruction finishes (should be fairly quick in the deployed env), the DO will log errors as it has lost connection with the controller. The noise should be relatively minimal and, as the reconcile alarm is set to _now_, it should be resolved quickly. - `isFlyMachineDestroyed` comes from `useMutation`'s `isSuccess` which resets on component re-mount. A second click after re-mount would get a Fly 404 error toast — not harmful, just redundant.
2 parents 4f4a670 + 561d233 commit e0b211d

File tree

9 files changed

+768
-1
lines changed

9 files changed

+768
-1
lines changed

kiloclaw/src/durable-objects/kiloclaw-instance.test.ts

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5122,6 +5122,40 @@ describe("syncStatusWithFly: backfill lastStartedAt on 'starting' → 'running'"
51225122
});
51235123
});
51245124

5125+
describe("syncStatusWithFly: 'destroyed' Fly state clears flyMachineId", () => {
5126+
it("clears flyMachineId and sets status to 'stopped' when Fly reports destroyed", async () => {
5127+
const { instance, storage } = createInstance();
5128+
await seedRunning(storage);
5129+
5130+
(flyClient.getMachine as Mock).mockResolvedValue({
5131+
state: 'destroyed',
5132+
config: { mounts: [] },
5133+
});
5134+
5135+
await instance.alarm();
5136+
5137+
expect(storage._store.get('flyMachineId')).toBeNull();
5138+
expect(storage._store.get('status')).toBe('stopped');
5139+
expect(storage._store.get('lastStoppedAt')).toBeGreaterThan(0);
5140+
expect(storage._store.get('healthCheckFailCount')).toBe(0);
5141+
});
5142+
5143+
it("clears flyMachineId from 'stopped' status when Fly reports destroyed", async () => {
5144+
const { instance, storage } = createInstance();
5145+
await seedRunning(storage, { status: 'stopped', lastStoppedAt: Date.now() - 60_000 });
5146+
5147+
(flyClient.getMachine as Mock).mockResolvedValue({
5148+
state: 'destroyed',
5149+
config: { mounts: [] },
5150+
});
5151+
5152+
await instance.alarm();
5153+
5154+
expect(storage._store.get('flyMachineId')).toBeNull();
5155+
expect(storage._store.get('status')).toBe('stopped');
5156+
});
5157+
});
5158+
51255159
describe('reconcileStarting: transient Fly API errors respect starting timeout', () => {
51265160
it("stays 'starting' on transient error when NOT timed out", async () => {
51275161
const { instance, storage } = createInstance();

kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -777,6 +777,30 @@ export async function syncStatusWithFly(
777777
return;
778778
}
779779

780+
// destroyed means the Fly machine is gone — clear the stale ID immediately
781+
// so the DO doesn't keep referencing a dead machine.
782+
if (flyState === 'destroyed') {
783+
rctx.log('sync_status_destroyed', {
784+
old_state: state.status,
785+
new_state: 'stopped',
786+
fly_state: flyState,
787+
machine_id: state.flyMachineId,
788+
});
789+
state.flyMachineId = null;
790+
state.status = 'stopped';
791+
state.lastStoppedAt = Date.now();
792+
state.healthCheckFailCount = 0;
793+
await ctx.storage.put(
794+
storageUpdate({
795+
flyMachineId: null,
796+
status: 'stopped',
797+
lastStoppedAt: state.lastStoppedAt,
798+
healthCheckFailCount: 0,
799+
})
800+
);
801+
return;
802+
}
803+
780804
// failed is definitively terminal — transition immediately without waiting for
781805
// SELF_HEAL_THRESHOLD consecutive checks like we do for stopped/created.
782806
if (flyState === 'failed' && state.status !== 'stopped') {
Lines changed: 221 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,221 @@
1+
import { describe, expect, it, vi, beforeEach, afterEach } from 'vitest';
2+
import { platform } from './platform';
3+
4+
vi.mock('cloudflare:workers', () => ({
5+
DurableObject: class {},
6+
waitUntil: (p: Promise<unknown>) => p,
7+
}));
8+
9+
const testUserId = 'user-1';
10+
const testAppName = 'acct-abc123';
11+
const testMachineId = 'd890abc123';
12+
13+
function makeEnv(overrides: Record<string, unknown> = {}) {
14+
const forceRetryRecovery = vi.fn().mockResolvedValue(undefined);
15+
return {
16+
env: {
17+
FLY_API_TOKEN: 'fly-test-token',
18+
KILOCLAW_INSTANCE: {
19+
idFromName: (id: string) => id,
20+
get: () => ({ forceRetryRecovery }),
21+
},
22+
KILOCLAW_AE: { writeDataPoint: vi.fn() },
23+
KV_CLAW_CACHE: {
24+
get: vi.fn().mockResolvedValue(null),
25+
put: vi.fn().mockResolvedValue(undefined),
26+
delete: vi.fn().mockResolvedValue(undefined),
27+
list: vi.fn().mockResolvedValue({ keys: [], list_complete: true }),
28+
getWithMetadata: vi.fn().mockResolvedValue({ value: null, metadata: null }),
29+
},
30+
...overrides,
31+
} as never,
32+
forceRetryRecovery,
33+
};
34+
}
35+
36+
async function jsonBody(res: Response): Promise<Record<string, unknown>> {
37+
return res.json();
38+
}
39+
40+
function postJson(path: string, body: Record<string, unknown>) {
41+
return {
42+
path,
43+
init: {
44+
method: 'POST',
45+
headers: { 'content-type': 'application/json' },
46+
body: JSON.stringify(body),
47+
},
48+
};
49+
}
50+
51+
describe('POST /destroy-fly-machine', () => {
52+
let fetchSpy: ReturnType<typeof vi.fn<() => Promise<Response>>>;
53+
54+
beforeEach(() => {
55+
fetchSpy = vi
56+
.spyOn(globalThis, 'fetch')
57+
.mockResolvedValue(new Response(null, { status: 200 })) as ReturnType<
58+
typeof vi.fn<() => Promise<Response>>
59+
>;
60+
});
61+
62+
afterEach(() => {
63+
vi.restoreAllMocks();
64+
});
65+
66+
it('returns 200 and calls Fly API DELETE with force=true', async () => {
67+
const { env } = makeEnv();
68+
const { path, init } = postJson('/destroy-fly-machine', {
69+
userId: testUserId,
70+
appName: testAppName,
71+
machineId: testMachineId,
72+
});
73+
const resp = await platform.request(path, init, env);
74+
75+
expect(resp.status).toBe(200);
76+
const json = await jsonBody(resp);
77+
expect(json).toEqual({ ok: true });
78+
79+
expect(fetchSpy).toHaveBeenCalledWith(
80+
`https://api.machines.dev/v1/apps/${testAppName}/machines/${testMachineId}?force=true`,
81+
{
82+
method: 'DELETE',
83+
headers: { Authorization: 'Bearer fly-test-token' },
84+
}
85+
);
86+
});
87+
88+
it('returns 400 for appName containing URI-special characters, proving the schema is the guard against URL injection', async () => {
89+
// encodeURIComponent is not needed on the URL construction because the Zod schema
90+
// never admits any character that would be percent-encoded. This test proves that
91+
// boundary: a value containing a URI-special character (space, %, +) is rejected
92+
// at the schema layer and never reaches the Fly API call.
93+
const { env } = makeEnv();
94+
for (const badAppName of ['acct abc', 'acct%20abc', 'acct+abc', 'ACCT-ABC']) {
95+
const { path, init } = postJson('/destroy-fly-machine', {
96+
userId: testUserId,
97+
appName: badAppName,
98+
machineId: testMachineId,
99+
});
100+
const resp = await platform.request(path, init, env);
101+
expect(resp.status).toBe(400);
102+
}
103+
// None of the bad inputs reached the Fly API
104+
expect(fetchSpy).not.toHaveBeenCalled();
105+
});
106+
107+
it('returns 400 for machineId containing URI-special characters, proving the schema is the guard against URL injection', async () => {
108+
const { env } = makeEnv();
109+
for (const badMachineId of ['d890 abc', 'd890%abc', 'd890+abc', 'D890ABC']) {
110+
const { path, init } = postJson('/destroy-fly-machine', {
111+
userId: testUserId,
112+
appName: testAppName,
113+
machineId: badMachineId,
114+
});
115+
const resp = await platform.request(path, init, env);
116+
expect(resp.status).toBe(400);
117+
}
118+
expect(fetchSpy).not.toHaveBeenCalled();
119+
});
120+
121+
it('triggers forceRetryRecovery after successful destroy', async () => {
122+
const { env, forceRetryRecovery } = makeEnv();
123+
const { path, init } = postJson('/destroy-fly-machine', {
124+
userId: testUserId,
125+
appName: testAppName,
126+
machineId: testMachineId,
127+
});
128+
await platform.request(path, init, env);
129+
130+
expect(forceRetryRecovery).toHaveBeenCalled();
131+
});
132+
133+
it('returns 503 when FLY_API_TOKEN is not configured', async () => {
134+
const { env } = makeEnv({ FLY_API_TOKEN: undefined });
135+
const { path, init } = postJson('/destroy-fly-machine', {
136+
userId: testUserId,
137+
appName: testAppName,
138+
machineId: testMachineId,
139+
});
140+
const resp = await platform.request(path, init, env);
141+
142+
expect(resp.status).toBe(503);
143+
const json = await jsonBody(resp);
144+
expect(json.error).toContain('FLY_API_TOKEN');
145+
});
146+
147+
it('wraps Fly API error status and body in error message', async () => {
148+
fetchSpy.mockResolvedValueOnce(new Response('machine not found', { status: 404 }));
149+
const { env } = makeEnv();
150+
const { path, init } = postJson('/destroy-fly-machine', {
151+
userId: testUserId,
152+
appName: testAppName,
153+
machineId: testMachineId,
154+
});
155+
const resp = await platform.request(path, init, env);
156+
157+
expect(resp.status).toBe(404);
158+
const json = await jsonBody(resp);
159+
// Implementation wraps the Fly response body: "Fly API error (${status}): ${body}"
160+
expect(json.error).toBe('Fly API error (404): machine not found');
161+
});
162+
163+
it('returns 400 for invalid appName format', async () => {
164+
const { env } = makeEnv();
165+
const { path, init } = postJson('/destroy-fly-machine', {
166+
userId: testUserId,
167+
appName: 'INVALID',
168+
machineId: testMachineId,
169+
});
170+
const resp = await platform.request(path, init, env);
171+
172+
expect(resp.status).toBe(400);
173+
expect(fetchSpy).not.toHaveBeenCalled();
174+
});
175+
176+
it('returns 400 for invalid machineId format', async () => {
177+
const { env } = makeEnv();
178+
const { path, init } = postJson('/destroy-fly-machine', {
179+
userId: testUserId,
180+
appName: testAppName,
181+
machineId: 'BAD-ID!',
182+
});
183+
const resp = await platform.request(path, init, env);
184+
185+
expect(resp.status).toBe(400);
186+
expect(fetchSpy).not.toHaveBeenCalled();
187+
});
188+
189+
it('returns 400 for missing userId', async () => {
190+
const { env } = makeEnv();
191+
const { path, init } = postJson('/destroy-fly-machine', {
192+
appName: testAppName,
193+
machineId: testMachineId,
194+
});
195+
const resp = await platform.request(path, init, env);
196+
197+
expect(resp.status).toBe(400);
198+
expect(fetchSpy).not.toHaveBeenCalled();
199+
});
200+
201+
it('still returns ok when forceRetryRecovery fails', async () => {
202+
const forceRetryRecovery = vi.fn().mockRejectedValue(new Error('DO unavailable'));
203+
const { env } = makeEnv({
204+
KILOCLAW_INSTANCE: {
205+
idFromName: (id: string) => id,
206+
get: () => ({ forceRetryRecovery }),
207+
},
208+
});
209+
const { path, init } = postJson('/destroy-fly-machine', {
210+
userId: testUserId,
211+
appName: testAppName,
212+
machineId: testMachineId,
213+
});
214+
const resp = await platform.request(path, init, env);
215+
216+
expect(resp.status).toBe(200);
217+
const json = await jsonBody(resp);
218+
expect(json).toEqual({ ok: true });
219+
expect(forceRetryRecovery).toHaveBeenCalled();
220+
});
221+
});

kiloclaw/src/routes/platform.ts

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1378,4 +1378,70 @@ platform.put('/regions', async c => {
13781378
return c.json({ ok: true, regions: result.data.regions, raw });
13791379
});
13801380

1381+
// POST /api/platform/destroy-fly-machine
1382+
// This is for admin cleanup only.
1383+
// It directly destroys a Fly machine via the Machines API (force=true).
1384+
// It does not destroy the Fly app or volume.
1385+
const DestroyFlyMachineSchema = z.object({
1386+
userId: z.string().min(1),
1387+
appName: z
1388+
.string()
1389+
.min(1)
1390+
.max(63)
1391+
.regex(/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/, 'Invalid Fly app name'),
1392+
machineId: z
1393+
.string()
1394+
.min(1)
1395+
.regex(/^[a-z0-9]+$/, 'Invalid Fly machine ID'),
1396+
});
1397+
1398+
platform.post('/destroy-fly-machine', async c => {
1399+
const result = await parseBody(c, DestroyFlyMachineSchema);
1400+
if ('error' in result) return result.error;
1401+
1402+
const { userId, appName, machineId } = result.data;
1403+
const apiToken = c.env.FLY_API_TOKEN;
1404+
if (!apiToken) {
1405+
return c.json({ error: 'FLY_API_TOKEN is not configured' }, 503);
1406+
}
1407+
1408+
const url = `https://api.machines.dev/v1/apps/${appName}/machines/${machineId}?force=true`;
1409+
try {
1410+
const resp = await fetch(url, {
1411+
method: 'DELETE',
1412+
headers: { Authorization: `Bearer ${apiToken}` },
1413+
});
1414+
1415+
if (!resp.ok) {
1416+
const body = await resp.text();
1417+
console.error(
1418+
`[platform] destroy-fly-machine failed (${resp.status}) app=${appName} machine=${machineId}:`,
1419+
body
1420+
);
1421+
return jsonError(`Fly API error (${resp.status}): ${body}`, resp.status);
1422+
}
1423+
1424+
console.log(`[platform] destroy-fly-machine ok: app=${appName} machine=${machineId}`);
1425+
1426+
// Trigger immediate reconcile so the DO discovers the machine is gone.
1427+
try {
1428+
await withDORetry(
1429+
instanceStubFactory(c.env, userId),
1430+
stub => stub.forceRetryRecovery(),
1431+
'forceRetryRecovery'
1432+
);
1433+
} catch (err) {
1434+
console.warn(
1435+
`[platform] destroy-fly-machine: forceRetryRecovery failed for user=${userId}:`,
1436+
err
1437+
);
1438+
}
1439+
1440+
return c.json({ ok: true });
1441+
} catch (err) {
1442+
const { message, status } = sanitizeError(err, 'destroy-fly-machine');
1443+
return jsonError(message, status);
1444+
}
1445+
});
1446+
13811447
export { platform };

packages/db/src/schema-types.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -162,6 +162,7 @@ export const KiloClawAdminAuditAction = z.enum([
162162
'kiloclaw.gateway.restart',
163163
'kiloclaw.config.restore',
164164
'kiloclaw.doctor.run',
165+
'kiloclaw.machine.destroy_fly',
165166
]);
166167

167168
export type KiloClawAdminAuditAction = z.infer<typeof KiloClawAdminAuditAction>;

0 commit comments

Comments
 (0)