Make edge api optional for OSS deployment #1638
Conversation
The client proxy previously searched Nomad job allocations, but only for a single job matching a job-prefix name. This was not suitable for an environment where it should discover both orchestrator and template manager instances. The new filter logic searches for template managers on build nodes and for orchestrators on default client nodes. The env variable for the job prefix is no longer needed, as the prefix is baked into the discovery logic itself. The logic was also moved from the client proxy package to shared, as it will later be used for the same job in the API, which currently uses the edge API for local cluster template builder discovery. This will be changed in follow-up changes.
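The new filter logic can be sketched as below. `DiscoveredInstance`, its `NodePool`/`Addr` fields, and the pool names are simplified assumptions for illustration, not the actual types from the shared package.

```go
package main

import "fmt"

// DiscoveredInstance is a hypothetical, simplified version of the struct
// described above; field names here are assumptions for illustration.
type DiscoveredInstance struct {
	NodePool string
	Addr     string
}

// filterByNodePool sketches the new discovery logic: template managers live
// on build nodes and orchestrators on default client nodes, so no job-prefix
// env variable is needed to tell them apart.
func filterByNodePool(instances []DiscoveredInstance, pool string) []DiscoveredInstance {
	out := make([]DiscoveredInstance, 0)
	for _, inst := range instances {
		if inst.NodePool == pool {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	all := []DiscoveredInstance{
		{NodePool: "build", Addr: "10.0.0.1:5009"},   // template manager
		{NodePool: "default", Addr: "10.0.0.2:5008"}, // orchestrator
	}
	fmt.Println(len(filterByNodePool(all, "build")))   // 1
	fmt.Println(len(filterByNodePool(all, "default"))) // 1
}
```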
`ServiceDiscoveryItem` is renamed to `DiscoveredInstance`, as it better reflects the value it holds and makes it obvious that it is meant only for the internal discovery cycle.
Many structs and variables were prefixed with orchestrator even though it is already the package name, so the prefix is not needed. Variables and structs were renamed to start with a lowercase letter so they are not exported to other packages unless really needed. The instance gRPC client is also made internal, and we now have two functions: one exporting the raw gRPC connection used in the gRPC proxy, and one for the gRPC info service client that is used in a few places.
Moved the instance client holding the gRPC connection and the gRPC info client into a separate file to make it cleaner and easier to work with. The orchestrator file is renamed to instance, and `NewOrchestratorInstance` is renamed to `newInstance`, as orchestrator duplicates the package name here and the function does not need to be used outside of the package.
Renamed because an orchestrators pool can be a little confusing when the pool contains both orchestrators and template builders. We now call them instances.
Removed so the field is properly prefixed like the other fields here.
As we are migrating away from using the edge API for the local cluster and moving towards removing the need to deploy the edge API for OSS deployments, a rename was needed here.
This interface is meant to unify all calls to resources we want to query from a cluster (sandbox logs, metrics, build logs, etc.). This commit introduces a basic implementation and a getter on the cluster that will later be used everywhere we need to query sandbox/build logs and/or metrics. Later, the same interface will be used for the local cluster to query local Loki and ClickHouse to provide the same data. This way we get a unified interface.
This change removes the local/remote cluster-specific backends for querying sandbox logs, metrics, and build logs, and replaces them with a unified cluster resource backend.
Separated the cluster, cluster sync, instance, and instance sync logic into separate files to make them easier to read and navigate.
Previously we used a shared gRPC connection that was reused across cluster instances. This introduced an issue where, for each call, we needed to expose not just the gRPC connection but also a context with additional gRPC metadata (or the metadata itself), so that the proper service instance ID metadata key would be added to each request and the gRPC proxy running in the remote cluster would route the request properly. Now that each instance holds its own gRPC connection, we no longer need to do this. To minimize refactoring, some methods that previously returned a context with metadata now return the received context without additional changes. There is a small change for the sandbox create/delete metadata events: previously we merged the metadata and added it, but now the metadata (auth and service instance ID) is stored in the gRPC auth middleware and would end up duplicated. This is why we now only append the new keys (for the sandbox delete/create event).
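The duplication the commit avoids can be illustrated with a toy, stdlib-only metadata map; this mirrors the multi-value semantics of gRPC metadata for illustration and is not the grpc-go API.

```go
package main

import "fmt"

// md is a toy stand-in for gRPC metadata, which maps a key to a list of
// values; appending the same key twice yields a duplicated entry.
type md map[string][]string

func (m md) append(key, value string) { m[key] = append(m[key], value) }

func main() {
	// Old approach: event code merged in the service-instance-id key,
	// and the middleware added it again, duplicating the entry.
	old := md{}
	old.append("x-service-instance-id", "inst-1") // merged by event code
	old.append("x-service-instance-id", "inst-1") // added again by middleware
	fmt.Println(len(old["x-service-instance-id"])) // 2

	// New approach: the middleware owns the auth/instance keys, and the
	// event code appends only its own keys.
	now := md{}
	now.append("x-service-instance-id", "inst-1") // middleware only
	now.append("event-type", "sandbox-create")    // event code
	fmt.Println(len(now["x-service-instance-id"])) // 1
}
```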
When a new instance was created, the function callback ran an initial sync immediately, and then we synced again before inserting the instance into the instances pool. The second sync was removed; we now do only one sync, which makes sure the instance values are filled as expected.
# Conflicts: # packages/api/internal/handlers/sandbox_metrics.go
This fixes an issue where service discovery still returned an instance (so it remained in the pool) even though, for some reason, it was not possible to connect to it. When an instance fails 3 times in a row, it is marked as unhealthy until the next successful sync is executed, at which point the counter is reset.
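A minimal sketch of this failure tracking is below; the struct, method names, and the threshold constant are assumptions mirroring the described behavior, not the actual implementation.

```go
package main

import "fmt"

// maxConsecutiveFailures mirrors the "3 times in a row" rule described above.
const maxConsecutiveFailures = 3

// instanceHealth tracks consecutive sync failures for one instance.
type instanceHealth struct {
	failures  int
	unhealthy bool
}

// recordFailure increments the counter and marks the instance unhealthy
// once it hits the threshold.
func (h *instanceHealth) recordFailure() {
	h.failures++
	if h.failures >= maxConsecutiveFailures {
		h.unhealthy = true
	}
}

// recordSuccess resets the counter, clearing the unhealthy mark.
func (h *instanceHealth) recordSuccess() {
	h.failures = 0
	h.unhealthy = false
}

func main() {
	h := &instanceHealth{}
	h.recordFailure()
	h.recordFailure()
	fmt.Println(h.unhealthy) // false: only 2 consecutive failures
	h.recordFailure()
	fmt.Println(h.unhealthy) // true: threshold reached
	h.recordSuccess()
	fmt.Println(h.unhealthy) // false: successful sync resets the counter
}
```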
Before, there was a shared sync timeout of 5 seconds for the whole call, managed by the sync manager. Now the gRPC call itself is limited to 1 second.
# Conflicts: # packages/api/internal/handlers/sandbox_metrics.go # packages/api/internal/handlers/sandboxes_list_metrics.go # packages/shared/pkg/feature-flags/flags.go
# Conflicts: # packages/api/go.mod # packages/api/go.sum # packages/shared/go.mod
dobrac
left a comment
is there any specific order for the production deployment?
- items := make([]ServiceDiscoveryItem, 0)
+ items := make([]DiscoveredInstance, 0)
  for _, result := range results {
we could do utils.Map( instead
@@ -167,22 +166,21 @@ func NewClusterNode(ctx context.Context, client *grpclient.GRPCClient, clusterID
func (n *Node) Close(ctx context.Context) error {
	if n.IsNomadManaged() {
this might be nicer to solve with an interface and two implementations, but feel free to do it in another PR
dobrac
left a comment
lgtm in general, I would like to simplify the duplicate implementations for the clusters, but we can do that in another PR
There is no need to deploy services in a specific order for our cluster; the API is the only service that needs to be deployed here.
Previously we used almost identical implementations for fetching build logs in both the local and remote resource providers. Now we have one unified implementation that differs only in the persistence provider, which is different for local vs. remote and is provided as a function callback.
This pull request introduces a change that makes our API able to work and be deployed without the need for an edge API.
The main change is migrating queries for sandbox metrics/logs and build logs under the cluster resource provider, implemented as an interface backed by local (Loki, ClickHouse) and remote (edge API) implementations.
The second change is the use of local service discovery, which currently works only for template builders (orchestrators are still discovered the old way to minimize changes in this PR).
The last change is that each instance now maintains its own gRPC connection. Before, cluster instances shared one connection handled by the cluster and reused by all instances. This was fine, but it forced us to provide metadata for each gRPC call, including the auth token and the service instance ID metadata required by the gRPC proxy running on the edge API. Now this is not needed: each connection already holds this metadata, so we don't need to pass it in the context.
A change to connections was needed here: now that we are not using the gRPC proxy for local cluster instances (template builders), we cannot use a shared connection, since it would always point to different instances.
After this is deployed, the API can work without the edge API. It's no longer needed even in integration tests.
How To Review
I tried to make changes between commits as small as possible, only changing the needed stuff, and then describing it in the commit description. Each commit should be self-contained and should compile without issues. You can go one by one to make the overall review faster and easier for both sides. Some commits do a temporary job to pipe stuff together before the final solution is in place.
I am happy to receive feedback on this, as we want to do more of it.
Note
Enables deployments/tests without `edge` by abstracting cluster I/O and discovery.
- `clusters` package: per-instance gRPC connections, local/remote service discovery, instance/cluster sync, and a `ClusterResource` interface with local (ClickHouse/Loki) and remote (Edge API) implementations for sandbox metrics/logs and template build logs
- `LOKI_URL` (optional `LOKI_USER`/`LOKI_PASSWORD`); removes use of `LOCAL_CLUSTER_ENDPOINT`/`LOCAL_CLUSTER_TOKEN`
- Adds `LOKI_URL` to Nomad jobs and Terraform locals; CI stops starting the Edge service and removes related env vars

Written by Cursor Bugbot for commit 914d08b. This will update automatically on new commits.