@ndr-ds ndr-ds commented Oct 30, 2025

## Motivation

Backport a few PRs:

## Proposal

Backport

## Test Plan

CI

ndr-ds added 12 commits October 30, 2025 10:10
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.

Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.

By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:

```rust
// This span will NOT be sent to Tempo.
// (The `opentelemetry.skip` field name comes from this PR; the exact
// attribute placement shown here is illustrative.)
#[tracing::instrument(fields(opentelemetry.skip = true))]
fn internal_helper() {}

// This span WILL be sent to Tempo (default behavior).
#[tracing::instrument]
fn public_api() {}
```

**Key Changes:**

**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter

**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty
string as the OTLP endpoint would cause the exporter to fail with
"Failed to parse endpoint". Now properly validates that the endpoint is
non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.
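The empty-endpoint check can be sketched as follows (the function name and shape are assumptions for illustration, not the actual code):

```rust
/// Treat an empty or whitespace-only OTLP endpoint as "not configured"
/// instead of passing it to the exporter, which would fail with
/// "Failed to parse endpoint".
fn validated_otlp_endpoint(raw: Option<&str>) -> Option<String> {
    raw.map(str::trim)
        .filter(|endpoint| !endpoint.is_empty())
        .map(str::to_owned)
}

fn main() {
    // An empty string is treated the same as no endpoint at all.
    assert_eq!(validated_otlp_endpoint(Some("")), None);
    println!("{:?}", validated_otlp_endpoint(Some("http://tempo:4317")));
}
```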

**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces

- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.

It might be good to backport this so that, if we want to add more
instrumentation on the testnet, things are less confusing. :)

## Motivation

Currently, only the client supports the `--otlp-exporter-endpoint` flag
and `LINERA_OTLP_EXPORTER_ENDPOINT` environment variable for
OpenTelemetry configuration. The proxy and server binaries always
default to checking the environment variable but don't expose the
command-line option, which is inconsistent with the client.

## Proposal

This PR adds the `otlp_exporter_endpoint` option to both proxy and
server for consistency:

- **Proxy** (`linera-service/src/proxy/main.rs`):
- Added `otlp_exporter_endpoint: Option<String>` field to `ProxyOptions`
- Updated `init_with_opentelemetry` call to pass the endpoint instead of
`None`

- **Server** (`linera-service/src/server.rs`):
- Added `otlp_exporter_endpoint: Option<String>` field to
`ServerCommand::Run` variant
- Created helper function `otlp_exporter_endpoint_for()` to extract
endpoint from command
  - Updated `init_with_opentelemetry` call to use the helper function
- Boxed `CommonStorageOptions` in `ServerCommand::Run` to fix clippy
`large_enum_variant` warning (the `Run` variant became too large after
adding the new field)

Both binaries now consistently accept `--otlp-exporter-endpoint` flag
and `LINERA_OTLP_EXPORTER_ENDPOINT` environment variable, matching the
client's behavior.
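The helper could look roughly like this (a sketch; the real `ServerCommand` in `linera-service/src/server.rs` has more variants and fields):

```rust
/// Illustrative stand-in for the real `ServerCommand` enum.
enum ServerCommand {
    Run { otlp_exporter_endpoint: Option<String> },
    Generate,
}

/// Extract the OTLP endpoint from the command, if the variant carries one.
fn otlp_exporter_endpoint_for(command: &ServerCommand) -> Option<String> {
    match command {
        ServerCommand::Run { otlp_exporter_endpoint } => otlp_exporter_endpoint.clone(),
        ServerCommand::Generate => None,
    }
}

fn main() {
    let run = ServerCommand::Run {
        otlp_exporter_endpoint: Some("http://tempo:4317".to_owned()),
    };
    println!("{:?}", otlp_exporter_endpoint_for(&run));
}
```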

## Test Plan

- Verified clippy passes with all features
- Verified clippy passes with no default features
- Confirmed both binaries compile successfully

## Release Plan

- Nothing to do / These changes follow the usual release cycle.
## Motivation

The init and entrypoint shell scripts (server-entrypoint.sh,
proxy-entrypoint.sh, etc.) were thin wrappers around binary calls that
passed arguments to the Linera binaries. This meant that any time we
needed to change initialization or entrypoint arguments (e.g., adding
new flags, changing default values), we had to rebuild the Docker
images, which is inefficient and couples configuration to the image
build process.

## Proposal

This PR removes all init and entrypoint shell scripts and embeds their
logic directly into the configuration files:

- **Deleted 7 shell scripts** from the codebase:
  - `docker/server-entrypoint.sh`, `docker/server-init.sh`
  - `docker/proxy-entrypoint.sh`, `docker/proxy-init.sh`
- `docker/compose-server-entrypoint.sh`,
`docker/compose-proxy-entrypoint.sh`, `docker/compose-server-init.sh`

- **Updated Kubernetes templates** (shards.yaml and proxy.yaml):
- Replaced script calls with inline shell commands in both
initContainers and main containers
  - All binary arguments are now specified directly in the YAML

- **Updated Docker Compose** (docker-compose.yml):
- Replaced script calls with direct binary invocations using command
arrays
  - Init logic is now inline shell script in the shard-init service

- **Updated Dockerfile**:
  - Removed COPY commands for all deleted scripts

**Benefits:**
- Binary configuration changes (flags, environment variables) no longer
require Docker image rebuilds
- Arguments are visible directly in the deployment configs
- Simpler deployment pipeline - fewer moving parts
- Easier to understand what arguments are being passed to binaries
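As an illustration, an inline invocation in `docker-compose.yml` might look like this (the service name, image, and flags here are assumptions, not the actual config):

```yaml
services:
  proxy:
    image: linera
    command:
      - ./linera-proxy
      - --otlp-exporter-endpoint
      - http://tempo:4317
```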

## Test Plan

- Verified all shell scripts are deleted from the repository
- Verified Dockerfile no longer references the deleted scripts
- Checked that Kubernetes YAML templates have valid syntax
- Checked that Docker Compose YAML has valid syntax
- All binary arguments are correctly passed via command-line flags in
the configs

## Release Plan

- These changes should be backported to the latest `testnet` branch,
then be released in a validator hotfix.
## Motivation

We currently compile trace logs out at build time, even though the
overhead of a disabled trace statement is minimal (about 2 CPU
instructions:
https://www.reddit.com/r/rust/comments/x9nypb/comment/inutkiv/).

## Proposal

Stop compiling them out, so that we can enable `trace` logs for
debugging with just a restart, without having to recompile the binary.
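For context, the `tracing` crate offers `release_max_level_*` Cargo features that strip levels at compile time; removing such a feature (sketched below; the exact dependency line in the repo may differ) keeps `trace` available at runtime:

```toml
# Before (sketch): trace statements compiled out of release builds
# tracing = { version = "0.1", features = ["release_max_level_debug"] }

# After: all levels compiled in; filtering happens at runtime (e.g. RUST_LOG)
tracing = "0.1"
```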

## Test Plan

I'm benchmarking networks with this change and saw no visible
performance regression.

## Release Plan

- Nothing to do / These changes follow the usual release cycle.
## Motivation

In a previous change, we lowered the log level for heap profile
operations from INFO to TRACE to reduce log noise from frequent
monitoring polls. However, we missed one log statement in the heap
profile HTTP endpoint that still uses INFO level.

This inconsistency means that while most heap profile operations log at
TRACE level, the HTTP endpoint still creates excessive log noise with
`"Serving heap profile via /debug/pprof"` messages appearing in default
logs every time the endpoint is polled.

## Proposal

Change the remaining log statement in
`linera-metrics/src/memory_profiler.rs` from `info!` to `trace!` for the
heap profile endpoint message, completing the consistency improvement
started in the earlier change.

## Test Plan

1. Deploy network with memory profiling enabled
2. Verify heap profile endpoint still works (returns pprof data)
3. Observe that default logs no longer contain repetitive "Serving heap
profile" messages
4. Enable TRACE logging and verify messages appear as expected
5. Confirm consistency with other heap profile logging

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

When block exporters fail to connect to the indexer, they retry with
exponential backoff but provide no visibility into the retry process.
Operators cannot determine whether connection issues are transient
(retrying successfully) or persistent (about to fail). This lack of
observability makes it difficult to diagnose network connectivity issues
and indexer availability problems during deployments and operations.

## Proposal

Add comprehensive logging to the indexer connection retry logic:

- **WARNING level** on each retry attempt, including:
  - Current retry count and max retries remaining
  - Calculated backoff delay in milliseconds
  - Error that triggered the retry
- **ERROR level** when all retries are exhausted, clearly indicating:
  - Final retry count
  - Max retries configured
  - Error that caused final failure
  - Explicit statement that exporter task will exit

This provides operators with actionable telemetry to distinguish
transient network issues from persistent failures.
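A self-contained sketch of the retry loop with the described logging (the backoff formula, limits, and messages are illustrative, not the exporter's actual code):

```rust
use std::time::Duration;

/// Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at ~100 s.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(100u64 << attempt.min(10))
}

fn main() {
    let max_retries = 3;
    for attempt in 1..=max_retries {
        let delay = backoff_delay(attempt - 1);
        // WARNING level on each retry: count, backoff delay, and the error.
        eprintln!(
            "WARN connect to indexer failed (attempt {attempt}/{max_retries}), \
             retrying in {} ms: connection refused",
            delay.as_millis()
        );
    }
    // ERROR level once retries are exhausted: the exporter task will exit.
    eprintln!("ERROR indexer unreachable after {max_retries} retries; exporter task exiting");
}
```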

## Test Plan

1. Deploy network with intentionally unreachable indexer endpoint
2. Observe exporter logs show WARNING messages with increasing retry
counts and backoff delays
3. Wait for max retries to be exhausted
4. Verify ERROR message appears with clear failure indication
5. Confirm logs provide enough information to diagnose the connection
issue

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
#4848)

## Motivation

The notification forwarding system has two bugs:

1. **Duplicate messages to exporters**: When validators have multiple
proxies, each proxy forwards notifications to all exporters. This means
exporters receive duplicate notifications (N copies for N proxies),
causing redundant processing and potential data inconsistencies.

2. **Receiver lag crashes forwarding**: The broadcast receiver for
notifications can lag if the consumer is slower than the producer. When
lag occurs, the code treats `RecvError::Lagged` as a fatal error and
exits the forwarding loop, stopping all notification forwarding
permanently even though the channel is still functional.

Both issues reduce system reliability and create operational problems
that are difficult to diagnose.

## Proposal

**Fix duplicate forwarding:**
- Add `exporter_forwarded` flag to track whether exporters have been
included
- Only the first proxy forwards to exporters, subsequent proxies forward
to an empty exporter list
- Each proxy still forwards to its own validator endpoint
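The flag-based deduplication can be sketched like this (function and variable names are illustrative):

```rust
/// Give the full exporter list to the first proxy only; every later proxy
/// gets an empty list, so each exporter receives exactly one copy of each
/// notification.
fn assign_exporters(proxies: &[&str], exporters: Vec<String>) -> Vec<Vec<String>> {
    let mut exporter_forwarded = false;
    proxies
        .iter()
        .map(|_proxy| {
            if exporter_forwarded {
                Vec::new()
            } else {
                exporter_forwarded = true;
                exporters.clone()
            }
        })
        .collect()
}

fn main() {
    let lists = assign_exporters(&["proxy-0", "proxy-1"], vec!["exporter-0".to_owned()]);
    println!("{lists:?}");
}
```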

**Fix lag handling:**
- Replace `while let Ok()` pattern with explicit `match` on receiver
result
- Handle `RecvError::Lagged(skipped_count)` by logging a warning and
continuing
- Handle `RecvError::Closed` by logging and breaking (legitimate
shutdown)
- Only actually lagged/skipped messages trigger the warning, not every
message

This ensures exporters receive exactly one copy of each notification and
forwarding continues even when temporary lag occurs.
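The lag-tolerant receive loop can be sketched with a local stand-in for tokio's `broadcast::error::RecvError` (the real code uses the tokio type and `tracing` macros):

```rust
/// Local stand-in for `tokio::sync::broadcast::error::RecvError`.
#[derive(Debug)]
enum RecvError {
    Lagged(u64),
    Closed,
}

/// Decide whether the forwarding loop should continue after one receive.
fn should_continue(result: Result<&str, RecvError>) -> bool {
    match result {
        Ok(notification) => {
            println!("forwarding notification: {notification}");
            true
        }
        Err(RecvError::Lagged(skipped)) => {
            // Lag is not fatal: warn and keep consuming from the channel.
            eprintln!("WARN receiver lagged, skipped {skipped} notifications");
            true
        }
        Err(RecvError::Closed) => {
            // The sender is gone: this is a legitimate shutdown.
            eprintln!("notification channel closed; stopping forwarding");
            false
        }
    }
}

fn main() {
    assert!(should_continue(Ok("block notification")));
    assert!(should_continue(Err(RecvError::Lagged(7))));
    assert!(!should_continue(Err(RecvError::Closed)));
}
```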

## Test Plan

1. Deploy network with multiple proxies (2+) and exporters
2. Verify exporters receive exactly one copy of each notification (not N
copies)
3. Generate high notification load to induce receiver lag
4. Verify warning logs appear for lagged messages
5. Confirm notification forwarding continues after lag events
6. Verify no duplicate processing in exporter/indexer

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

The explorer frontend hardcodes API URLs to `http://localhost:3002`,
which only works in local development when both frontend and backend run
on localhost. This breaks in production deployments where:

- Frontend is served through Kubernetes ingress
- Backend API is accessed via the same ingress with `/api` path prefix
- Browsers block mixed content (HTTPS frontend → HTTP localhost backend)
- CORS issues arise from cross-origin requests

The hardcoded URLs prevent the explorer from working in deployed
environments.

## Proposal

Replace hardcoded absolute URLs with relative paths that work in all
environments. Add Vite dev server proxy configuration to maintain local
development workflow while using relative paths.

This approach works in all environments:
- **Local dev**: Vite proxy forwards `/api` → `http://localhost:3002`
- **Production**: Kubernetes ingress routes `/api` → backend service
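The dev-server proxy might be configured roughly like this in `vite.config.ts` (a sketch; only the port 3002 comes from the description above):

```typescript
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    proxy: {
      // Forward /api requests to the local backend during development,
      // so the frontend can use relative paths in all environments.
      '/api': {
        target: 'http://localhost:3002',
        changeOrigin: true,
      },
    },
  },
});
```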

## Test Plan

1. Test production deployment:
   - Deploy explorer with Kubernetes ingress configured
   - Verify frontend loads and connects to backend
   - Confirm no CORS or mixed content errors
   - Verify all API calls work through ingress

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

Deploying the exporter/indexer/explorer stack requires careful
dependency ordering and runtime checks:

1. **Exporters need database to exist** before starting (ScyllaDB must
be initialized)
2. **Exporters need indexer to be ready** before attempting connections
3. **Indexer image needs linera CLI** to perform database checks
4. **Configuration needs flexibility** to support both internal cluster
DNS and external endpoints with TLS

Without these capabilities, deployments fail with race conditions where
exporters start before dependencies are ready, causing crash loops and
manual intervention.

## Proposal

Add comprehensive deployment support through init containers that wait
for dependencies, include linera CLI in the indexer image for database
checks, and make indexer endpoints configurable to support both internal
and external deployments with automatic TLS handling.
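An init container that waits for the indexer might look like this (the image, names, and port are assumptions, not the actual template):

```yaml
initContainers:
  - name: wait-for-indexer
    image: busybox
    command:
      - sh
      - -c
      - until nc -z indexer 8080; do echo "waiting for indexer"; sleep 5; done
```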

## Test Plan

1. Deploy fresh network with exporter/indexer/explorer stack:
   - Verify ScyllaDB init container waits for database
   - Verify indexer readiness init container waits for indexer
   - Confirm exporters start only after both checks pass

2. Test with external indexer endpoint:
   - Configure `blockExporter.indexerEndpoint` with external URL
   - Verify TLS configuration is applied
   - Confirm connectivity works

3. Test failure scenarios:
- Deploy without indexer, verify exporter init container retries
indefinitely
   - Kill indexer, verify exporters restart and wait for it to return

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

Validator deployments can fail with HTTP 429 errors when helmfile tries
to download the Prometheus ServiceMonitor CRD from
raw.githubusercontent.com during the prepare hook. When this happens, it
blocks the entire deployment.

## Proposal

Bundle the ServiceMonitor CRD (v0.86.1) locally in the repository at
`kubernetes/linera-validator/crds/servicemonitor.yaml` instead of
downloading it from GitHub at deploy time.

ServiceMonitor CRDs change infrequently and are designed to be backward
compatible, so using a static version is appropriate. The CRD file
includes documentation with version info and instructions for updating
when needed (typically when upgrading prometheus-operator or quarterly
checks).

Changes:
- Added ServiceMonitor CRD file with documentation header explaining
version, source, and update instructions
- Updated helmfile.yaml to apply the CRD from the local file instead of
the GitHub URL
- Maintains existing check to skip installation if CRD already exists on
the cluster

## Test Plan

- Manually tested: Deploy to fresh cluster - CRD installs successfully
- Verified: helmfile hook completes without GitHub rate limit errors
- Confirmed: ServiceMonitor resources are created correctly after
deployment

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

`sh` doesn't support `set -euo pipefail`; I missed this in a previous PR.

## Proposal

Use `bash` when available, otherwise use `set -eu` instead.
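A portable strict-mode preamble along these lines (a sketch of the idea, not the exact script):

```shell
#!/bin/sh
# `pipefail` exists in bash but not in POSIX sh, so only enable it
# when we are actually running under bash.
if [ -n "${BASH_VERSION:-}" ]; then
  set -euo pipefail
else
  set -eu
fi
echo "strict mode enabled"
```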

## Test Plan

Deployed a network, exporter starts as expected now.

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
…eated (#4871)

## Motivation

We configure Scylla using this ConfigMap, but there's actually no
guarantee that it exists by the time Scylla starts. I've seen deployed
networks simply ignore the ConfigMap and use default settings because
the ConfigMap hadn't been created yet when Scylla started.

## Proposal

Enforce that the ConfigMap will be created before Scylla starts.
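One way to guarantee this ordering is a Helm pre-install hook on the ConfigMap, for example (illustrative only; the actual mechanism used in the chart may differ):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scylla-config
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-10"
```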

## Test Plan

Deployed a network; it now uses the ConfigMap options more reliably.

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
@ndr-ds ndr-ds force-pushed the backport_of_4771_4829_4830_4554_4845_4846_4848_4849_4850_4855_4870_4871 branch from 136f1ab to 5cfcd59 Compare October 30, 2025 13:10
#4852)

## Motivation

We saw UNIQUE constraint violations on the `operations` table when
testing with the exporter.

## Proposal

Remove all constraints; those invariants should be maintained by the
node anyway, and they add extra work for SQLite (which has to check
them).

## Test Plan

Manual.

## Release Plan

- Nothing to do / These changes follow the usual release cycle.

## Links

- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

@MathieuDutSik MathieuDutSik left a comment


LGTM

@ndr-ds ndr-ds merged commit a58ad6f into testnet_conway Oct 30, 2025
31 of 32 checks passed
@ndr-ds ndr-ds deleted the backport_of_4771_4829_4830_4554_4845_4846_4848_4849_4850_4855_4870_4871 branch October 30, 2025 20:03
