
Make edge api optional for OSS deployment#1638

Merged
sitole merged 64 commits into main from chore/make-edge-api-optional
Jan 12, 2026

Conversation

@sitole sitole commented Dec 22, 2025

This pull request makes our API able to run and be deployed without an edge API.
The main change is migrating queries for sandbox metrics/logs and build logs under the cluster resource provider, implemented as an interface backed by local (Loki, ClickHouse) and remote (edge API) implementations.

The second change is the use of local service discovery, which currently works only for template builders (orchestrators are still discovered the old way to minimize changes in this PR).

The last change is that each instance now maintains its own gRPC connection. Before, cluster instances shared one connection handled by the cluster and reused by all instances. This was fine, but it forced us to provide metadata for each gRPC call, including the auth token and the service instance ID metadata required by the gRPC proxy running on the edge API. Now each connection already holds this metadata, so we don't need to pass it in the context.

A change to connections was needed anyway: now that we are not using the gRPC proxy for local cluster instances (template builders), we cannot use a shared connection, since it would always point to different instances.

After this is deployed, the API can work without the edge API. It's no longer needed even in integration tests.
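The per-instance connection pattern described above can be sketched with a minimal, stdlib-only model (all type and header names here are illustrative, not the actual code, which uses real gRPC connections and metadata):

```go
package main

import "fmt"

// callMetadata models the headers the edge API's gRPC proxy expects.
type callMetadata map[string]string

// instanceConn models a per-instance connection that permanently holds
// its auth token and service instance ID, so callers no longer have to
// attach them to the context of every call.
type instanceConn struct {
	instanceID string
	meta       callMetadata
}

func newInstanceConn(instanceID, authToken string) *instanceConn {
	return &instanceConn{
		instanceID: instanceID,
		meta: callMetadata{
			"authorization":       authToken,
			"service-instance-id": instanceID,
		},
	}
}

// call attaches the stored metadata to every request automatically.
func (c *instanceConn) call(method string) string {
	return fmt.Sprintf("%s -> %s (instance %s)", method, c.meta["authorization"], c.instanceID)
}

func main() {
	conn := newInstanceConn("orch-1", "token-abc")
	fmt.Println(conn.call("SandboxCreate")) // SandboxCreate -> token-abc (instance orch-1)
}
```

With a shared connection, the same metadata would have to be merged into the outgoing context on every call site; binding it to the connection removes that obligation entirely.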

How To Review

I tried to keep the changes between commits as small as possible, changing only what was needed and describing it in the commit description. Each commit should be self-contained and should compile without issues. You can go through them one by one to make the overall review faster and easier for both sides. Some commits add temporary wiring to pipe things together before the final solution is in place.

I am happy to receive feedback on this, as we want to do more of it.


Note

Enables deployments/tests without edge by abstracting cluster I/O and discovery.

  • Adds clusters package: per-instance gRPC connections, local/remote service discovery, instance/cluster sync, and a ClusterResource interface with local (ClickHouse/Loki) and remote (Edge API) implementations for sandbox metrics/logs and template build logs
  • Updates config to require LOKI_URL (optional LOKI_USER/LOKI_PASSWORD); removes use of LOCAL_CLUSTER_ENDPOINT/LOCAL_CLUSTER_TOKEN
  • Infra: propagate LOKI_URL to Nomad jobs and Terraform locals; CI stops starting Edge service and removes related env vars
  • API wiring moved from shared connection to instance-scoped clients; new log query windowing and unified log source selection with fallbacks
  • go.mod: add Loki and related deps

Written by Cursor Bugbot for commit 914d08b.

@sitole sitole force-pushed the chore/make-edge-api-optional branch 2 times, most recently from 9187ea4 to 5899c96 Compare December 22, 2025 15:21
@sitole sitole changed the title WIP: Make edge api optional Make edge api optional for OSS deployment Dec 22, 2025
@sitole sitole force-pushed the chore/make-edge-api-optional branch from fdcdab7 to 17ef2be Compare December 22, 2025 15:47

sitole commented Dec 22, 2025

@copilot please

This comment was marked as outdated.

@sitole sitole force-pushed the chore/make-edge-api-optional branch from 093597f to 4dc359e Compare December 22, 2025 16:56
@sitole sitole marked this pull request as ready for review December 22, 2025 17:07

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


The client proxy previously used allocation job search, but only for jobs with one job-name prefix. This was not suitable for an environment where it should discover both orchestrator and template manager instances.

The new filter logic searches for template managers on build nodes and orchestrators on default client nodes. The env variable for the job prefix is no longer needed, as it is baked into the discovery logic itself.

The logic was also moved from the client proxy package to shared, as it will later be used for the same job in the API, which currently uses the edge API for local cluster template builder discovery. This will be changed in follow-up commits.
`ServiceDiscoveryItem` is renamed to `DiscoveredInstance`, as it better reflects the value it holds and makes it obvious that it is meant only for the internal discovery cycle.
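The node-pool filter described above can be sketched as follows (a stdlib-only sketch; the struct, job names, and pool names are assumptions for illustration, not the real types from this PR):

```go
package main

import "fmt"

// discoveredAlloc models one allocation returned by Nomad allocation search.
type discoveredAlloc struct {
	Job      string
	NodePool string
}

// filterInstances keeps template managers on build nodes and orchestrators
// on default client nodes, so no job-prefix env variable is needed.
func filterInstances(allocs []discoveredAlloc) []discoveredAlloc {
	var out []discoveredAlloc
	for _, a := range allocs {
		switch {
		case a.Job == "template-manager" && a.NodePool == "build":
			out = append(out, a)
		case a.Job == "orchestrator" && a.NodePool == "default":
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []discoveredAlloc{
		{Job: "template-manager", NodePool: "build"},
		{Job: "orchestrator", NodePool: "default"},
		{Job: "template-manager", NodePool: "default"}, // wrong pool, dropped
	}
	fmt.Println(len(filterInstances(allocs))) // 2
}
```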
Many structs and variables are prefixed with orchestrator even though that is already the package name, so the prefix is not needed.

Struct names and fields are renamed to start with a lowercase letter so they are not exported to other packages unless really needed.

The instance gRPC client is also made internal, and we now have two functions: one exporting the raw gRPC connection used in the gRPC proxy, and one exporting the gRPC info service client used in a few places.
The instance client holding the gRPC connection and the gRPC info client were moved into a separate file to make it cleaner and easier to work with.

The orchestrator filename is renamed to instance, and `NewOrchestratorInstance` is renamed to `newInstance`, as orchestrator is duplicated here (it is the package name) and does not need to be used outside the package.
Renamed because an orchestrators pool can be a little confusing when the pool contains both orchestrators and template builders; we now call them instances.
Removed so the field is properly prefixed like the other fields here.
As we are migrating away from using the edge API for the local cluster, and toward removing the need to deploy the edge API for OSS deployments, a rename was needed here.
This interface is meant to unify all calls to resources we want to query from a cluster (sandbox logs, metrics, build logs, etc.).

This commit introduces a basic implementation and a getter on the cluster that will later be used everywhere we need to query sandbox/build logs and/or metrics.

Later, the same interface will be used for the local cluster to query local Loki and ClickHouse and provide the same data. This way we have one unified interface.
This change removes the local/remote cluster-specific backends for querying sandbox logs, metrics, and build logs, and replaces them with a unified cluster resource backend.
The cluster, cluster sync, instance, and instance sync logic was split into separate files to make it easier to read and navigate.
Previously we used a shared gRPC connection that was reused across cluster instances. This introduced an issue where for each call we needed to expose not just the gRPC connection but also a context with additional gRPC metadata (or the metadata itself), so that the proper service instance ID metadata key would be added to each request and the gRPC proxy running in the remote cluster would route it properly.

Now that each instance holds its own gRPC connection, we no longer need to do this. To minimize refactoring changes, some methods that previously returned a context with metadata now return the received context without additional changes.

There is a small change for the sandbox create/delete metadata event. Previously we merged the metadata and added it, but now the metadata (auth and service instance ID) is stored in the gRPC auth middleware, so merging would duplicate it. This is why we now only append the new metadata (the sandbox delete/create event).
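The duplication issue can be illustrated with a toy metadata model (the `md` type and keys below are made up for illustration; real gRPC metadata is a multimap with the same append semantics):

```go
package main

import "fmt"

// md models gRPC metadata: a multimap of header values.
type md map[string][]string

// merge copies every key from extra into base (the old behavior). If the
// auth middleware already set a key on the connection, merging an event
// payload that also carries it produces duplicate values.
func merge(base, extra md) md {
	out := md{}
	for k, v := range base {
		out[k] = append(out[k], v...)
	}
	for k, v := range extra {
		out[k] = append(out[k], v...)
	}
	return out
}

func main() {
	// Metadata already attached by the gRPC auth middleware.
	base := md{"service-instance-id": {"orch-1"}}

	// Old behavior: merging event metadata that also carried the ID.
	old := merge(base, md{"service-instance-id": {"orch-1"}, "event": {"sandbox-create"}})
	fmt.Println(len(old["service-instance-id"])) // 2: duplicated

	// New behavior: append only the genuinely new key.
	base["event"] = append(base["event"], "sandbox-create")
	fmt.Println(len(base["service-instance-id"])) // 1: no duplication
}
```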
When a new instance was created, the creation callback did an initial sync immediately, and then we synced again before inserting the instance into the instances pool.

The second sync was removed; we now do only one sync, which makes sure the instance values are filled as expected.
sitole added 2 commits January 5, 2026 14:22
# Conflicts:
#	packages/api/internal/handlers/sandbox_metrics.go
sitole added 2 commits January 6, 2026 12:10
This fixes an issue where service discovery still returns an instance, so it remains in the pool, even though for some reason it is not possible to connect to it.

When an instance fails 3 times in a row, it is marked as unhealthy until the next successful sync, at which point the counter is reset.
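The failure-counter behavior can be sketched like this (names and the exact threshold handling are illustrative; the PR stores this state on the instance):

```go
package main

import "fmt"

// maxFails is the number of consecutive sync failures after which an
// instance is treated as unhealthy.
const maxFails = 3

// instanceHealth tracks consecutive sync failures for one instance.
type instanceHealth struct {
	fails int
}

func (h *instanceHealth) recordFailure() { h.fails++ }

// recordSuccess resets the counter, so one good sync restores health.
func (h *instanceHealth) recordSuccess() { h.fails = 0 }

func (h *instanceHealth) healthy() bool { return h.fails < maxFails }

func main() {
	var h instanceHealth
	h.recordFailure()
	h.recordFailure()
	fmt.Println(h.healthy()) // true: only 2 failures so far
	h.recordFailure()
	fmt.Println(h.healthy()) // false: 3 in a row
	h.recordSuccess()
	fmt.Println(h.healthy()) // true: counter reset
}
```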
sitole added 3 commits January 6, 2026 12:22
Before, there was a shared sync timeout of 5 seconds for the whole call, managed by the sync manager. Now each gRPC call is limited to 1 second.
# Conflicts:
#	packages/api/internal/handlers/sandbox_metrics.go
#	packages/api/internal/handlers/sandboxes_list_metrics.go
#	packages/shared/pkg/feature-flags/flags.go
sitole added 2 commits January 7, 2026 15:55
# Conflicts:
#	packages/api/go.mod
#	packages/api/go.sum
#	packages/shared/go.mod
Contributor

@dobrac dobrac left a comment


is there any specific order for the production deployment?

- items := make([]ServiceDiscoveryItem, 0)
+ items := make([]DiscoveredInstance, 0)

for _, result := range results {
Contributor


we could do utils.Map( instead

@@ -167,22 +166,21 @@ func NewClusterNode(ctx context.Context, client *grpclient.GRPCClient, clusterID
func (n *Node) Close(ctx context.Context) error {
if n.IsNomadManaged() {
Contributor


this might be nicer to solve with interface and two implementations, but feel free to do in another PR

Contributor

@dobrac dobrac left a comment


lgtm in general, I would like to simplify the duplicate implementations for the clusters, but we can do that in another PR


sitole commented Jan 12, 2026

is there any specific order for the production deployment?

There is no need to deploy services in a specific order for our cluster; the API is the only service that needs to be deployed here.

Previously we used an almost identical implementation for fetching build logs in both the local and remote resource providers. Now we have one unified implementation that only differs in the persistence provider, which is different for local vs remote and is provided as a function callback.
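The callback pattern described in this commit can be sketched as (function and provider names are hypothetical; the real shared logic does more than append a tail):

```go
package main

import "fmt"

// fetchBuildLogs holds the shared build-log logic once; only the
// persistence lookup differs between local and remote clusters, and it
// is injected as a callback.
func fetchBuildLogs(buildID string, persisted func(buildID string) []string) []string {
	logs := persisted(buildID)              // provider-specific part
	return append(logs, "live build tail")  // shared part, same for both
}

func main() {
	local := func(id string) []string { return []string{"loki: " + id} }
	remote := func(id string) []string { return []string{"edge: " + id} }
	fmt.Println(fetchBuildLogs("b1", local))
	fmt.Println(fetchBuildLogs("b1", remote))
}
```

This keeps one code path to test and makes the local/remote split a single injection point rather than two parallel implementations.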
@sitole sitole enabled auto-merge (squash) January 12, 2026 17:53
@sitole sitole merged commit 9cc1504 into main Jan 12, 2026
29 checks passed
@sitole sitole deleted the chore/make-edge-api-optional branch January 12, 2026 18:04