Commit fde3cd6

Feature: Sandboxes Monitoring - 1 - API/State Management Layer (#137)
Introduces the backend infrastructure and data layer for real-time monitoring capabilities of team sandbox usage at scale. This establishes the foundation for metrics collection, processing, and serving.

## Key Features

**Team Metrics API**
- New `/api/teams/[teamId]/metrics` endpoint for historical metrics
- Request validation with 31-day maximum range and future date protection
- Structured response with metrics array and calculated step intervals
- Support for custom time ranges with proper aggregation

**Server Actions & Data Processing**
- `getTeamMetrics()` - Fetches and processes team sandbox metrics
- `getTeamTierLimits()` - Retrieves tier-specific concurrent sandbox limits
- Request-level caching with `getNowMemo()` for deduplication
- Memoized functions for performance optimization

**Data Transformation Utilities**
- Time range calculation and step size determination
- Metrics aggregation and gap-filling algorithms
- Support for various time intervals (5s to 15min steps based on range)
- Timezone-aware formatting utilities

**Type System & Validation**
- `ClientTeamMetrics` type for standardized metrics format
- Request/response schemas with Zod validation
- OpenAPI specification updates for new endpoints
- Enhanced `sandboxes.types.ts` with monitoring-specific types

## Architecture Overview

### Metrics Data Pipeline

```mermaid
graph TB
    subgraph "Data Sources"
        DB[(PostgreSQL<br/>team_metrics table)]
        TierDB[(tier_limits table)]
    end

    subgraph "Server Actions"
        GetMetrics["getTeamMetrics()<br/>Input:<br/>- teamId<br/>- startDate (timestamp)<br/>- endDate (timestamp)<br/><br/>Output:<br/>TeamMetricsResponse {<br/> metrics: ClientTeamMetric[]<br/> step: number<br/>}"]
        GetTierLimits["getTeamTierLimits()<br/>Input:<br/>- teamId<br/><br/>Output:<br/>{ concurrentInstances: number }"]
    end

    subgraph "Data Transformation"
        ServerTransform["Server Transform<br/>- Calculate step size<br/>- Aggregate by interval<br/>- Format timestamps"]
        ClientTransform["Client Transform<br/>fillMetricsWithZeros()<br/>- Fill gaps in data<br/>- Ensure continuous timeline<br/><br/>transformMetricsToLineData()<br/>- Convert to chart format<br/>- x: timestamp<br/>- y: value"]
    end

    subgraph "Data Structures"
        RawMetric["Raw DB Record<br/>{<br/> timestamp: Date<br/> concurrent_sandboxes: number<br/> started_sandboxes: number<br/> team_id: string<br/>}"]
        ClientMetric["ClientTeamMetric<br/>{<br/> timestamp: number<br/> concurrentSandboxes: number<br/> startedSandboxes: number<br/>}"]
        ChartData["LineSeries<br/>{<br/> name: string<br/> data: {<br/> x: number | Date<br/> y: number<br/> }[]<br/>}"]
    end

    DB --> RawMetric
    RawMetric --> GetMetrics
    GetMetrics --> ServerTransform
    ServerTransform --> ClientMetric
    ClientMetric --> ClientTransform
    ClientTransform --> ChartData
    TierDB --> GetTierLimits

    style DB fill:#e8f5e9
    style ClientMetric fill:#fff9c4
    style ChartData fill:#e1f5fe
```

## Technical Implementation

**Architecture Decisions**
- Server-only utilities marked with `'server-only'` directive
- Proper separation of client/server type definitions
- Request validation at API boundary with detailed error messages
- Configurable mock data support for development/testing

**Key Files Added**
- `src/app/api/teams/[teamId]/metrics/` - API endpoint and types
- `src/server/sandboxes/get-team-metrics.ts` - Core metrics fetching logic
- `src/server/team/get-team-tier-limits.ts` - Tier limits functionality
- `src/lib/utils/timeframe.ts` - Time calculation utilities
- `src/lib/utils/formatting.ts` - Formatting utilities (400+ lines)
- `src/configs/intervals.ts`, `src/configs/keys.ts` - Configuration constants

**Dependencies Added**
- `echarts` - Chart library foundation
- `date-fns-tz` - Timezone handling
- `deepmerge` - Configuration merging
- `sonner` - Toast notifications

---

*This PR provides the complete backend foundation for sandbox monitoring. The API endpoints are functional and can be tested independently. No UI components are included - this is purely the data layer that subsequent PRs will build upon.*
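
The gap-filling step attributed to `fillMetricsWithZeros()` in the diagram is not shown in this part of the commit. As a rough illustration only, it might work along the lines of the sketch below. The helper name `fillGapsWithZeros`, its signature, and the bucketing rule are assumptions made for this example; only the `ClientTeamMetric` shape comes from the description above.

```typescript
// Shape taken from the ClientTeamMetric node in the pipeline diagram.
interface ClientTeamMetric {
  timestamp: number // Unix epoch in milliseconds
  concurrentSandboxes: number
  startedSandboxes: number
}

// Hypothetical helper, NOT the repository's fillMetricsWithZeros().
// Emits one point per step in [start, end]: a real data point when one falls
// into that step's bucket, otherwise a zero-valued placeholder.
function fillGapsWithZeros(
  metrics: ClientTeamMetric[],
  start: number,
  end: number,
  step: number
): ClientTeamMetric[] {
  if (step <= 0) throw new Error('step must be positive')

  // Index existing points by bucket; the first point in a bucket wins.
  const byBucket = new Map<number, ClientTeamMetric>()
  for (const m of metrics) {
    const bucket = Math.floor((m.timestamp - start) / step)
    if (!byBucket.has(bucket)) byBucket.set(bucket, m)
  }

  const filled: ClientTeamMetric[] = []
  for (let t = start, i = 0; t <= end; t += step, i++) {
    filled.push(
      byBucket.get(i) ?? {
        timestamp: t,
        concurrentSandboxes: 0,
        startedSandboxes: 0,
      }
    )
  }
  return filled
}

const filled = fillGapsWithZeros(
  [
    { timestamp: 0, concurrentSandboxes: 3, startedSandboxes: 1 },
    { timestamp: 10000, concurrentSandboxes: 5, startedSandboxes: 2 },
  ],
  0,
  15000,
  5000
)
console.log(filled.map((m) => m.concurrentSandboxes)) // [ 3, 0, 5, 0 ]
```

Filling with zeros keeps the timeline continuous, so a chart draws a drop to zero instead of interpolating across missing intervals.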
1 parent d43ce38 commit fde3cd6

26 files changed: +1939 -433 lines changed

bun.lock

Lines changed: 355 additions & 404 deletions
Large diffs are not rendered by default.

package.json

Lines changed: 14 additions & 5 deletions
@@ -30,11 +30,13 @@
     "<<<<<< Testing": "",
     "test:run": "bun scripts:check-all-env && vitest run",
     "test:integration": "bun scripts:check-app-env && vitest run src/__test__/integration/",
+    "test:unit": "bun scripts:check-app-env && vitest run src/__test__/unit/",
     "test:e2e": "bun scripts:check-all-env && vitest run src/__test__/e2e/",
     "test:watch": "bun scripts:check-all-env && vitest",
     "test:ui": "bun scripts:check-all-env && vitest --ui",
     "test:ui:integration": "bun scripts:check-app-env && vitest --ui src/__test__/integration/",
-    "test:development:metrics": "vitest run src/__test__/development/metrics.test.ts"
+    "test:development:burst": "vitest run src/__test__/development/burst.test.ts",
+    "test:development:traffic": "vitest run src/__test__/development/traffic.test.ts"
   },
   "dependencies": {
     "@fumadocs/mdx-remote": "^1.2.0",
@@ -53,11 +55,12 @@
     "@opentelemetry/sdk-trace-node": "^2.0.1",
     "@opentelemetry/semantic-conventions": "^1.36.0",
     "@radix-ui/react-avatar": "^1.1.4",
-    "@radix-ui/react-checkbox": "^1.3.2",
+    "@radix-ui/react-checkbox": "^1.3.3",
     "@radix-ui/react-dialog": "^1.1.7",
     "@radix-ui/react-dropdown-menu": "^2.1.7",
     "@radix-ui/react-label": "^2.1.3",
     "@radix-ui/react-popover": "^1.1.7",
+    "@radix-ui/react-radio-group": "^1.3.8",
     "@radix-ui/react-scroll-area": "^1.2.4",
     "@radix-ui/react-select": "^2.1.7",
     "@radix-ui/react-separator": "^1.1.3",
@@ -84,11 +87,16 @@
     "@vercel/otel": "^1.13.0",
     "ansis": "^3.17.0",
     "cheerio": "^1.0.0",
+    "chrono-node": "^2.8.4",
     "class-variance-authority": "^0.7.1",
     "clsx": "^2.1.1",
     "cmdk": "^1.0.4",
     "date-fns": "^4.1.0",
-    "e2b": "^1.10.0",
+    "date-fns-tz": "^3.2.0",
+    "deepmerge": "^4.3.1",
+    "e2b": "1.10.0",
+    "echarts": "^6.0.0",
+    "echarts-for-react": "^3.0.2",
     "fast-xml-parser": "^4.5.1",
     "fumadocs-core": "^15.0.6",
     "fumadocs-mdx": "^11.5.3",
@@ -101,14 +109,14 @@
     "nanoid": "^5.0.9",
     "next": "15.3.0-canary.23",
     "next-safe-action": "^7.10.4",
-    "next-themes": "^0.4.4",
+    "next-themes": "^0.4.6",
     "openapi-fetch": "^0.14.0",
     "pathe": "^2.0.3",
     "pino": "^9.7.0",
     "postgres": "^3.4.5",
     "posthog-js": "^1.214.0",
     "react": "^19.1.0",
-    "react-day-picker": "9.5.1",
+    "react-day-picker": "^9.9.0",
     "react-dom": "^19.1.0",
     "react-error-boundary": "^5.0.0",
     "react-hook-form": "^7.54.2",
@@ -122,6 +130,7 @@
     "semver": "^7.7.2",
     "serialize-error": "^12.0.0",
     "shiki": "3.2.1",
+    "sonner": "^2.0.7",
     "swr": "^2.3.4",
     "tailwind-merge": "^3.3.1",
     "tw-animate-css": "^1.3.6",

spec/openapi.yaml

Lines changed: 108 additions & 6 deletions
@@ -461,7 +461,7 @@ components:
           description: Time to live for the sandbox in seconds.
         autoPause:
           type: boolean
-          default: false
+          deprecated: true
           description: Automatically pauses the sandbox after the timeout
 
     TeamMetric:
@@ -608,6 +608,72 @@ components:
         memoryMB:
           $ref: '#/components/schemas/MemoryMB'
 
+    FromImageRegistry:
+      oneOf:
+        - $ref: '#/components/schemas/AWSRegistry'
+        - $ref: '#/components/schemas/GCPRegistry'
+        - $ref: '#/components/schemas/GeneralRegistry'
+      discriminator:
+        propertyName: type
+        mapping:
+          aws: '#/components/schemas/AWSRegistry'
+          gcp: '#/components/schemas/GCPRegistry'
+          registry: '#/components/schemas/GeneralRegistry'
+
+    AWSRegistry:
+      type: object
+      required:
+        - type
+        - awsAccessKeyId
+        - awsSecretAccessKey
+        - awsRegion
+      properties:
+        type:
+          type: string
+          enum: [aws]
+          description: Type of registry authentication
+        awsAccessKeyId:
+          type: string
+          description: AWS Access Key ID for ECR authentication
+        awsSecretAccessKey:
+          type: string
+          description: AWS Secret Access Key for ECR authentication
+        awsRegion:
+          type: string
+          description: AWS Region where the ECR registry is located
+
+    GCPRegistry:
+      type: object
+      required:
+        - type
+        - serviceAccountJson
+      properties:
+        type:
+          type: string
+          enum: [gcp]
+          description: Type of registry authentication
+        serviceAccountJson:
+          type: string
+          description: Service Account JSON for GCP authentication
+
+    GeneralRegistry:
+      type: object
+      required:
+        - type
+        - username
+        - password
+      properties:
+        type:
+          type: string
+          enum: [registry]
+          description: Type of registry authentication
+        username:
+          type: string
+          description: Username to use for the registry
+        password:
+          type: string
+          description: Password to use for the registry
+
     TemplateBuildStartV2:
       type: object
       properties:
@@ -617,6 +683,8 @@ components:
         fromTemplate:
           type: string
           description: Template to use as a base for the template build
+        fromImageRegistry:
+          $ref: '#/components/schemas/FromImageRegistry'
         force:
           default: false
           type: boolean
@@ -670,6 +738,17 @@ components:
         level:
           $ref: '#/components/schemas/LogLevel'
 
+    BuildStatusReason:
+      required:
+        - message
+      properties:
+        message:
+          type: string
+          description: Message with the status reason, currently reporting only for error status
+        step:
+          type: string
+          description: Step that failed
+
     TemplateBuild:
       required:
         - templateID
@@ -705,8 +784,7 @@ components:
             - ready
             - error
         reason:
-          type: string
-          description: Message with the status reason, currently reporting only for error status
+          $ref: '#/components/schemas/BuildStatusReason'
 
     NodeStatus:
       type: string
@@ -799,6 +877,8 @@ components:
     Node:
       required:
         - nodeID
+        - id
+        - serviceInstanceID
         - clusterID
         - status
         - sandboxCount
@@ -816,8 +896,15 @@ components:
           type: string
           description: Commit of the orchestrator
         nodeID:
+          type: string
+          deprecated: true
+          description: Identifier of the nomad node
+        id:
           type: string
           description: Identifier of the node
+        serviceInstanceID:
+          type: string
+          description: Service instance identifier of the node
         clusterID:
           type: string
           description: Identifier of the cluster
@@ -845,6 +932,8 @@ components:
     NodeDetail:
       required:
         - nodeID
+        - id
+        - serviceInstanceID
         - clusterID
         - status
         - sandboxes
@@ -864,9 +953,16 @@ components:
         commit:
           type: string
           description: Commit of the orchestrator
-        nodeID:
+        id:
           type: string
           description: Identifier of the node
+        serviceInstanceID:
+          type: string
+          description: Service instance identifier of the node
+        nodeID:
+          type: string
+          deprecated: true
+          description: Identifier of the nomad node
         status:
           $ref: '#/components/schemas/NodeStatus'
         sandboxes:
@@ -1214,8 +1310,7 @@ paths:
             type: integer
             format: int32
             minimum: 1
-            default: 1000
-            maximum: 1000
+            default: 100
       responses:
         '200':
           description: Successfully returned all running sandboxes
@@ -1656,6 +1751,7 @@ paths:
       description: Delete a template
       tags: [templates]
       security:
+        - ApiKeyAuth: []
         - AccessTokenAuth: []
         - Supabase1TokenAuth: []
       parameters:
@@ -1799,6 +1895,12 @@ paths:
         - AdminTokenAuth: []
       parameters:
         - $ref: '#/components/parameters/nodeID'
+        - in: query
+          name: clusterID
+          description: Identifier of the cluster
+          required: false
+          schema:
+            type: string
       responses:
         '200':
           description: Successfully returned the node
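
For readers consuming the new `FromImageRegistry` schema from TypeScript, the `oneOf` plus `discriminator` on `type` maps naturally onto a discriminated union. The sketch below is an illustration written by hand from the schema fields above, not generated client code from this repository; `describeRegistry` is a hypothetical helper and the credential values are placeholders.

```typescript
// Hand-written types mirroring the AWSRegistry, GCPRegistry and
// GeneralRegistry schemas; the `type` literal is the discriminator.
interface AWSRegistry {
  type: 'aws'
  awsAccessKeyId: string
  awsSecretAccessKey: string
  awsRegion: string
}

interface GCPRegistry {
  type: 'gcp'
  serviceAccountJson: string
}

interface GeneralRegistry {
  type: 'registry'
  username: string
  password: string
}

type FromImageRegistry = AWSRegistry | GCPRegistry | GeneralRegistry

// Hypothetical helper: summarises a registry without printing secrets.
// Switching on `type` narrows the union in each branch.
function describeRegistry(registry: FromImageRegistry): string {
  switch (registry.type) {
    case 'aws':
      return `AWS ECR in ${registry.awsRegion}`
    case 'gcp':
      return 'GCP registry (service account)'
    case 'registry':
      return `Generic registry as ${registry.username}`
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // variant is added to the union but not handled above.
      const unreachable: never = registry
      throw new Error(`Unknown registry: ${JSON.stringify(unreachable)}`)
    }
  }
}

const summary = describeRegistry({
  type: 'aws',
  awsAccessKeyId: 'EXAMPLE_KEY_ID', // placeholder value
  awsSecretAccessKey: 'EXAMPLE_SECRET', // placeholder value
  awsRegion: 'eu-west-1',
})
console.log(summary) // AWS ECR in eu-west-1
```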
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+import 'server-cli-only'
+
+import { SUPABASE_AUTH_HEADERS } from '@/configs/api'
+import { USE_MOCK_DATA } from '@/configs/flags'
+import {
+  calculateTeamMetricsStep,
+  MOCK_TEAM_METRICS_DATA,
+} from '@/configs/mock-data'
+import { infra } from '@/lib/clients/api'
+import { l } from '@/lib/clients/logger/logger'
+import { createClient } from '@/lib/clients/supabase/server'
+import { handleDefaultInfraError } from '@/lib/utils/action'
+import { TeamMetricsRequestSchema, TeamMetricsResponse } from './types'
+
+export async function POST(
+  request: Request,
+  { params }: { params: Promise<{ teamId: string }> }
+) {
+  try {
+    const { teamId } = await params
+
+    const { start, end } = TeamMetricsRequestSchema.parse(await request.json())
+
+    if (USE_MOCK_DATA) {
+      const mockData = MOCK_TEAM_METRICS_DATA(start, end)
+      return Response.json(mockData satisfies TeamMetricsResponse)
+    }
+
+    const supabase = await createClient()
+    const {
+      data: { session },
+    } = await supabase.auth.getSession()
+
+    if (!session) {
+      return Response.json({ error: 'Unauthenticated' }, { status: 401 })
+    }
+
+    const startSeconds = Math.floor(start / 1000)
+    const endSeconds = Math.floor(end / 1000)
+
+    // calculate step to determine overfetch amount
+    const step = calculateTeamMetricsStep(start, end)
+
+    const overfetchSeconds = Math.ceil(step / 1000) // overfetch by one step
+
+    try {
+      const res = await infra.GET('/teams/{teamID}/metrics', {
+        params: {
+          path: {
+            teamID: teamId,
+          },
+          query: {
+            start: startSeconds,
+            end: endSeconds + overfetchSeconds, // overfetch to capture boundary points
+          },
+        },
+        headers: {
+          ...SUPABASE_AUTH_HEADERS(session.access_token, teamId),
+        },
+        cache: 'no-store',
+      })
+
+      if (res.error) {
+        throw res.error
+      }
+
+      // transform timestamps and filter to requested range (with tolerance)
+      // allow data points up to half a step beyond the end for boundary cases
+      const tolerance = step * 0.5
+      const metrics = res.data
+        .map((d) => ({
+          ...d,
+          timestamp: new Date(d.timestamp).getTime(),
+        }))
+        .filter((d) => d.timestamp >= start && d.timestamp <= end + tolerance)
+
+      l.info(
+        {
+          key: 'api_team_metrics:result',
+          team_id: teamId,
+          user_id: session.user.id,
+          context: {
+            path: '/api/teams/[teamId]/metrics',
+            requested_range: { start, end },
+            data_points: metrics.length,
+            step,
+            overfetch_seconds: overfetchSeconds,
+          },
+        },
+        'Team metrics API response'
+      )
+
+      return Response.json({
+        metrics,
+        step,
+      } satisfies TeamMetricsResponse)
+    } catch (error: unknown) {
+      const status = error instanceof Response ? error.status : 500
+
+      l.error(
+        {
+          key: 'api_team_metrics:error',
+          error: error,
+          team_id: teamId,
+          user_id: session.user.id,
+          context: {
+            path: '/api/teams/[teamId]/metrics',
+            status,
+            start,
+            end,
+            message: error instanceof Error ? error.message : 'Unknown error',
+          },
+        },
+        `Failed to get team metrics: ${error instanceof Error ? error.message : 'Unknown error'}`
+      )
+
+      return Response.json(
+        { error: handleDefaultInfraError(status) },
+        { status }
+      )
+    }
+  } catch (error) {
+    return Response.json({ error: 'Invalid request' }, { status: 400 })
+  }
+}
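
The overfetch and tolerance arithmetic in the handler above is easier to follow when pulled out of the request flow. The sketch below restates it as two pure functions with a worked example. `buildQueryWindow` and `filterToRange` are hypothetical names introduced for illustration and do not exist in the commit, and the 5-second step is an assumed input rather than the output of `calculateTeamMetricsStep`.

```typescript
interface MetricPoint {
  timestamp: number // Unix epoch in milliseconds
}

// Hypothetical helper: converts a millisecond range to the second-based
// query window sent upstream, extending the end by one full step.
function buildQueryWindow(startMs: number, endMs: number, stepMs: number) {
  const overfetchSeconds = Math.ceil(stepMs / 1000)
  return {
    start: Math.floor(startMs / 1000),
    end: Math.floor(endMs / 1000) + overfetchSeconds,
  }
}

// Hypothetical helper: keeps points inside the requested range, allowing up
// to half a step past the end so a boundary point is not dropped.
function filterToRange<T extends MetricPoint>(
  points: T[],
  startMs: number,
  endMs: number,
  stepMs: number
): T[] {
  const tolerance = stepMs * 0.5
  return points.filter(
    (p) => p.timestamp >= startMs && p.timestamp <= endMs + tolerance
  )
}

const stepMs = 5000 // assumed step for this example
const queryWindow = buildQueryWindow(1000, 61500, stepMs)
console.log(queryWindow) // { start: 1, end: 66 }

const kept = filterToRange(
  [500, 1000, 61500, 64000, 64001].map((timestamp) => ({ timestamp })),
  1000,
  61500,
  stepMs
)
console.log(kept.map((p) => p.timestamp)) // [ 1000, 61500, 64000 ]
```

Note the asymmetry: the upstream query is widened by a full step, while the filter accepts only half a step beyond `end`, so part of the overfetched data is intentionally discarded.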
