
Make edge api optional for OSS deployment#1638

Merged
sitole merged 64 commits into main from chore/make-edge-api-optional
Jan 12, 2026

Conversation

@sitole sitole commented Dec 22, 2025

This pull request makes our API able to run and be deployed without an edge API.
The main change is migrating queries for sandbox metrics/logs and build logs under the cluster resource provider, implemented as an interface backed by local (Loki, ClickHouse) and remote (edge API) implementations.

The second change is the use of local service discovery, which currently works only for template builders (orchestrators are still discovered the old way to minimize changes in this PR).

The last change is that each instance now maintains its own gRPC connection. Before, cluster instances shared one connection handled by the cluster and reused by all instances. This was fine, but it forced us to provide metadata for each gRPC call, including the auth token and the service instance ID metadata required by the gRPC proxy running on the edge API. Now each connection already holds this metadata, so we don't need to pass it in the context.

A change to connections was needed anyway: now that we are not using the gRPC proxy for local cluster instances (template builders), we cannot use a shared connection, since it would always point to different instances.

After this is deployed, the API can work without the edge API. It's no longer needed even in integration tests.
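The per-instance connection pattern described above can be sketched with a minimal, stdlib-only model (all type and header names here are illustrative, not the actual code, which uses real gRPC connections and metadata):

```go
package main

import "fmt"

// callMetadata models the headers the edge API's gRPC proxy expects.
type callMetadata map[string]string

// instanceConn models a per-instance connection that permanently holds
// its auth token and service instance ID, so callers no longer have to
// attach them to the context of every call.
type instanceConn struct {
	instanceID string
	meta       callMetadata
}

func newInstanceConn(instanceID, authToken string) *instanceConn {
	return &instanceConn{
		instanceID: instanceID,
		meta: callMetadata{
			"authorization":       authToken,
			"service-instance-id": instanceID,
		},
	}
}

// call attaches the stored metadata to every request automatically.
func (c *instanceConn) call(method string) string {
	return fmt.Sprintf("%s -> %s (instance %s)", method, c.meta["authorization"], c.instanceID)
}

func main() {
	conn := newInstanceConn("orch-1", "token-abc")
	fmt.Println(conn.call("SandboxCreate")) // SandboxCreate -> token-abc (instance orch-1)
}
```

With a shared connection, the same metadata would have to be merged into the outgoing context on every call site; binding it to the connection removes that obligation entirely.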

How To Review

I tried to keep the changes between commits as small as possible, changing only what was needed and describing it in the commit description. Each commit should be self-contained and should compile without issues. You can go through them one by one to make the overall review faster and easier for both sides. Some commits add temporary wiring to pipe things together before the final solution is in place.

I am happy to receive feedback on this, as we want to do more of it.


Note

Enables deployments/tests without edge by abstracting cluster I/O and discovery.

  • Adds clusters package: per-instance gRPC connections, local/remote service discovery, instance/cluster sync, and a ClusterResource interface with local (ClickHouse/Loki) and remote (Edge API) implementations for sandbox metrics/logs and template build logs
  • Updates config to require LOKI_URL (optional LOKI_USER/LOKI_PASSWORD); removes use of LOCAL_CLUSTER_ENDPOINT/LOCAL_CLUSTER_TOKEN
  • Infra: propagate LOKI_URL to Nomad jobs and Terraform locals; CI stops starting Edge service and removes related env vars
  • API wiring moved from shared connection to instance-scoped clients; new log query windowing and unified log source selection with fallbacks
  • go.mod: add Loki and related deps

Written by Cursor Bugbot for commit 914d08b.

@sitole sitole force-pushed the chore/make-edge-api-optional branch 2 times, most recently from 9187ea4 to 5899c96 Compare December 22, 2025 15:21
@sitole sitole changed the title WIP: Make edge api optional Make edge api optional for OSS deployment Dec 22, 2025
@sitole sitole force-pushed the chore/make-edge-api-optional branch from fdcdab7 to 17ef2be Compare December 22, 2025 15:47

sitole commented Dec 22, 2025

@copilot please

This comment was marked as outdated.

@sitole sitole force-pushed the chore/make-edge-api-optional branch from 093597f to 4dc359e Compare December 22, 2025 16:56
@sitole sitole marked this pull request as ready for review December 22, 2025 17:07

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


The client proxy previously used allocation job search, but only for jobs with one job-name prefix. This was not suitable for an environment where it should discover both orchestrator and template manager instances.

The new filter logic searches for template managers on build nodes and orchestrators on default client nodes. The env variable for the job prefix is no longer needed, as it is baked into the discovery logic itself.

The logic was also moved from the client proxy package to shared, as it will later be used for the same job in the API, which currently uses the edge API for local cluster template builder discovery. This will be changed in follow-up commits.
`ServiceDiscoveryItem` is renamed to `DiscoveredInstance`, as it better reflects the value it holds and makes it obvious that it is meant only for the internal discovery cycle.
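The node-pool filter described above can be sketched as follows (a stdlib-only sketch; the struct, job names, and pool names are assumptions for illustration, not the real types from this PR):

```go
package main

import "fmt"

// discoveredAlloc models one allocation returned by Nomad allocation search.
type discoveredAlloc struct {
	Job      string
	NodePool string
}

// filterInstances keeps template managers on build nodes and orchestrators
// on default client nodes, so no job-prefix env variable is needed.
func filterInstances(allocs []discoveredAlloc) []discoveredAlloc {
	var out []discoveredAlloc
	for _, a := range allocs {
		switch {
		case a.Job == "template-manager" && a.NodePool == "build":
			out = append(out, a)
		case a.Job == "orchestrator" && a.NodePool == "default":
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []discoveredAlloc{
		{Job: "template-manager", NodePool: "build"},
		{Job: "orchestrator", NodePool: "default"},
		{Job: "template-manager", NodePool: "default"}, // wrong pool, dropped
	}
	fmt.Println(len(filterInstances(allocs))) // 2
}
```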
Many structs and variables are prefixed with orchestrator even though that is already the package name, so the prefix is not needed.

Struct names and fields are renamed to start with a lowercase letter so they are not exported to other packages unless really needed.

The instance gRPC client is also made internal, and we now have two functions: one exporting the raw gRPC connection used in the gRPC proxy, and one exporting the gRPC info service client used in a few places.
The instance client holding the gRPC connection and the gRPC info client were moved into a separate file to make it cleaner and easier to work with.

The orchestrator filename is renamed to instance, and `NewOrchestratorInstance` is renamed to `newInstance`, as orchestrator is duplicated here (it is the package name) and does not need to be used outside the package.
Renamed because an orchestrators pool can be a little confusing when the pool contains both orchestrators and template builders; we now call them instances.
Removed so the field is properly prefixed like the other fields here.
As we are migrating away from using the edge API for the local cluster, and toward removing the need to deploy the edge API for OSS deployments, a rename was needed here.
This interface is meant to unify all calls to resources we want to query from a cluster (sandbox logs, metrics, build logs, etc.).

This commit introduces a basic implementation and a getter on the cluster that will later be used everywhere we need to query sandbox/build logs and/or metrics.

Later, the same interface will be used for the local cluster to query local Loki and ClickHouse and provide the same data. This way we have one unified interface.
This change removes the local/remote cluster-specific backends for querying sandbox logs, metrics, and build logs, and replaces them with a unified cluster resource backend.
The cluster, cluster sync, instance, and instance sync logic was split into separate files to make it easier to read and navigate.
Previously we used a shared gRPC connection that was reused across cluster instances. This introduced an issue where for each call we needed to expose not just the gRPC connection but also a context with additional gRPC metadata (or the metadata itself), so that the proper service instance ID metadata key would be added to each request and the gRPC proxy running in the remote cluster would route it properly.

Now that each instance holds its own gRPC connection, we no longer need to do this. To minimize refactoring changes, some methods that previously returned a context with metadata now return the received context without additional changes.

There is a small change for the sandbox create/delete metadata event. Previously we merged the metadata and added it, but now the metadata (auth and service instance ID) is stored in the gRPC auth middleware, so merging would duplicate it. This is why we now only append the new metadata (the sandbox delete/create event).
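The duplication issue can be illustrated with a toy metadata model (the `md` type and keys below are made up for illustration; real gRPC metadata is a multimap with the same append semantics):

```go
package main

import "fmt"

// md models gRPC metadata: a multimap of header values.
type md map[string][]string

// merge copies every key from extra into base (the old behavior). If the
// auth middleware already set a key on the connection, merging an event
// payload that also carries it produces duplicate values.
func merge(base, extra md) md {
	out := md{}
	for k, v := range base {
		out[k] = append(out[k], v...)
	}
	for k, v := range extra {
		out[k] = append(out[k], v...)
	}
	return out
}

func main() {
	// Metadata already attached by the gRPC auth middleware.
	base := md{"service-instance-id": {"orch-1"}}

	// Old behavior: merging event metadata that also carried the ID.
	old := merge(base, md{"service-instance-id": {"orch-1"}, "event": {"sandbox-create"}})
	fmt.Println(len(old["service-instance-id"])) // 2: duplicated

	// New behavior: append only the genuinely new key.
	base["event"] = append(base["event"], "sandbox-create")
	fmt.Println(len(base["service-instance-id"])) // 1: no duplication
}
```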
When a new instance was created, the creation callback did an initial sync immediately, and then we synced again before inserting the instance into the instances pool.

The second sync was removed; we now do only one sync, which makes sure the instance values are filled as expected.
sitole added 2 commits January 5, 2026 14:22
# Conflicts:
#	packages/api/internal/handlers/sandbox_metrics.go
sitole added 2 commits January 6, 2026 12:10
This fixes an issue where service discovery still returns an instance, so it remains in the pool, even though for some reason it is not possible to connect to it.

When an instance fails 3 times in a row, it is marked as unhealthy until the next successful sync, at which point the counter is reset.
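The failure-counter behavior can be sketched like this (names and the exact threshold handling are illustrative; the PR stores this state on the instance):

```go
package main

import "fmt"

// maxFails is the number of consecutive sync failures after which an
// instance is treated as unhealthy.
const maxFails = 3

// instanceHealth tracks consecutive sync failures for one instance.
type instanceHealth struct {
	fails int
}

func (h *instanceHealth) recordFailure() { h.fails++ }

// recordSuccess resets the counter, so one good sync restores health.
func (h *instanceHealth) recordSuccess() { h.fails = 0 }

func (h *instanceHealth) healthy() bool { return h.fails < maxFails }

func main() {
	var h instanceHealth
	h.recordFailure()
	h.recordFailure()
	fmt.Println(h.healthy()) // true: only 2 failures so far
	h.recordFailure()
	fmt.Println(h.healthy()) // false: 3 in a row
	h.recordSuccess()
	fmt.Println(h.healthy()) // true: counter reset
}
```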
sitole added 3 commits January 6, 2026 12:22
Before, there was a shared sync timeout of 5 seconds for the whole call, managed by the sync manager. Now each gRPC call is limited to 1 second.
# Conflicts:
#	packages/api/internal/handlers/sandbox_metrics.go
#	packages/api/internal/handlers/sandboxes_list_metrics.go
#	packages/shared/pkg/feature-flags/flags.go
sitole added 2 commits January 7, 2026 15:55
# Conflicts:
#	packages/api/go.mod
#	packages/api/go.sum
#	packages/shared/go.mod
Contributor

@dobrac dobrac left a comment


is there any specific order for the production deployment?

- items := make([]ServiceDiscoveryItem, 0)
+ items := make([]DiscoveredInstance, 0)

for _, result := range results {
Contributor


we could do utils.Map( instead

@@ -167,22 +166,21 @@ func NewClusterNode(ctx context.Context, client *grpclient.GRPCClient, clusterID
func (n *Node) Close(ctx context.Context) error {
if n.IsNomadManaged() {
Contributor


this might be nicer to solve with interface and two implementations, but feel free to do in another PR

Contributor

@dobrac dobrac left a comment


lgtm in general, I would like to simplify the duplicate implementations for the clusters, but we can do that in another PR


sitole commented Jan 12, 2026

is there any specific order for the production deployment?

There is no need to deploy services in a specific order for our cluster; the API is the only service that needs to be deployed here.

Previously we used an almost identical implementation for fetching build logs in both the local and remote resource providers. Now we have one unified implementation that only differs in the persistence provider, which is different for local vs remote and is provided as a function callback.
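The callback pattern described in this commit can be sketched as (function and provider names are hypothetical; the real shared logic does more than append a tail):

```go
package main

import "fmt"

// fetchBuildLogs holds the shared build-log logic once; only the
// persistence lookup differs between local and remote clusters, and it
// is injected as a callback.
func fetchBuildLogs(buildID string, persisted func(buildID string) []string) []string {
	logs := persisted(buildID)              // provider-specific part
	return append(logs, "live build tail")  // shared part, same for both
}

func main() {
	local := func(id string) []string { return []string{"loki: " + id} }
	remote := func(id string) []string { return []string{"edge: " + id} }
	fmt.Println(fetchBuildLogs("b1", local))
	fmt.Println(fetchBuildLogs("b1", remote))
}
```

This keeps one code path to test and makes the local/remote split a single injection point rather than two parallel implementations.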
@sitole sitole enabled auto-merge (squash) January 12, 2026 17:53
@sitole sitole merged commit 9cc1504 into main Jan 12, 2026
29 checks passed
@sitole sitole deleted the chore/make-edge-api-optional branch January 12, 2026 18:04