Backport of explorer stack + some fixes and code cleanups #4878
Merged: ndr-ds merged 13 commits into `testnet_conway` from `backport_of_4771_4829_4830_4554_4845_4846_4848_4849_4850_4855_4870_4871` on Oct 30, 2025.
Conversation
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.
Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.
By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:
```rust
// This span will NOT be sent to Tempo: the `opentelemetry.skip` field marks it
// for exclusion from OpenTelemetry export (the field value here is illustrative).
#[tracing::instrument(fields(opentelemetry.skip = true))]
fn internal_helper() {}

// This span WILL be sent to Tempo (default behavior).
#[tracing::instrument]
fn public_api() {}
```
**Key Changes:**
**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter (see the
sketch below)
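For concreteness, here is a minimal sketch of how such a split can be expressed
with `tracing-subscriber` per-layer filters. `SkipOtelFilter` is a hypothetical
name; this is not the actual linera code, only an illustration of the idea:

```rust
use tracing::Metadata;
use tracing_subscriber::layer::{Context, Filter};

/// Drops any span whose callsite declares an `opentelemetry.skip` field, so it
/// is excluded from the OpenTelemetry layer while other layers still see it.
struct SkipOtelFilter;

impl<S> Filter<S> for SkipOtelFilter {
    fn enabled(&self, meta: &Metadata<'_>, _cx: &Context<'_, S>) -> bool {
        // Span fields are declared statically, so inspecting the callsite
        // metadata is enough to tell whether `opentelemetry.skip` is present.
        !meta
            .fields()
            .iter()
            .any(|field| field.name() == "opentelemetry.skip")
    }
}
```

A filter like this would be attached only to the OpenTelemetry export layer
(for example via `Layer::with_filter`), leaving the Chrome trace and log layers
unfiltered.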
**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty string as
the OTLP endpoint would cause the exporter to fail with "Failed to parse
endpoint". The endpoint is now validated to be non-empty before the exporter
is created (see the sketch below).
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.
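As a hedged sketch of the endpoint guard (the function name is hypothetical,
not the actual linera code), the idea is simply to treat an empty or
whitespace-only endpoint the same as an absent one:

```rust
/// Only returns an endpoint when a non-empty value was actually provided, so
/// the OTLP exporter is never constructed from an empty string.
fn effective_otlp_endpoint(raw: Option<String>) -> Option<String> {
    raw.filter(|endpoint| !endpoint.trim().is_empty())
}

fn main() {
    assert_eq!(effective_otlp_endpoint(Some(String::new())), None);
    assert_eq!(
        effective_otlp_endpoint(Some("http://tempo:4317".into())).as_deref(),
        Some("http://tempo:4317")
    );
}
```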
**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces
- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes
**Release Plan:**
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.

Might be good to backport just so that, if we want to add more instrumentation
on the testnet, things are not as confusing :)
## Motivation
Currently, only the client supports the `--otlp-exporter-endpoint` flag and
`LINERA_OTLP_EXPORTER_ENDPOINT` environment variable for OpenTelemetry
configuration. The proxy and server binaries always default to checking the
environment variable but don't expose the command-line option, which is
inconsistent with the client.

## Proposal
This PR adds the `otlp_exporter_endpoint` option to both proxy and server for
consistency:
- **Proxy** (`linera-service/src/proxy/main.rs`):
  - Added `otlp_exporter_endpoint: Option<String>` field to `ProxyOptions`
  - Updated `init_with_opentelemetry` call to pass the endpoint instead of `None`
- **Server** (`linera-service/src/server.rs`):
  - Added `otlp_exporter_endpoint: Option<String>` field to the `ServerCommand::Run` variant
  - Created a helper function `otlp_exporter_endpoint_for()` to extract the endpoint from the command
  - Updated the `init_with_opentelemetry` call to use the helper function
  - Boxed `CommonStorageOptions` in `ServerCommand::Run` to fix the clippy `large_enum_variant` warning (the `Run` variant became too large after adding the new field)

Both binaries now consistently accept the `--otlp-exporter-endpoint` flag and
the `LINERA_OTLP_EXPORTER_ENDPOINT` environment variable, matching the client's
behavior (see the sketch below).

## Test Plan
- Verified clippy passes with all features
- Verified clippy passes with no default features
- Confirmed both binaries compile successfully

## Release Plan
- Nothing to do / These changes follow the usual release cycle.
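As a hedged illustration of the flag/environment-variable wiring described
above, assuming a clap-based options struct with clap's `env` feature enabled;
the field name mirrors the PR description, everything else is illustrative
rather than the actual linera-service code:

```rust
use clap::Parser;

/// Illustrative subset of the proxy options; only the new field is shown.
#[derive(Parser)]
struct ProxyOptions {
    /// Endpoint to export OpenTelemetry traces to (an OTLP receiver such as Tempo).
    /// Also read from the `LINERA_OTLP_EXPORTER_ENDPOINT` environment variable.
    #[arg(long, env = "LINERA_OTLP_EXPORTER_ENDPOINT")]
    otlp_exporter_endpoint: Option<String>,
}

fn main() {
    let options = ProxyOptions::parse();
    // Passed on to tracing initialization instead of hard-coding `None`.
    println!("OTLP endpoint: {:?}", options.otlp_exporter_endpoint);
}
```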
## Motivation
The init and entrypoint shell scripts (server-entrypoint.sh,
proxy-entrypoint.sh, etc.) were thin wrappers around binary calls that passed
arguments to the Linera binaries. This meant that any time we needed to change
initialization or entrypoint arguments (e.g., adding new flags, changing
default values), we had to rebuild the Docker images, which is inefficient and
couples configuration to the image build process.

## Proposal
This PR removes all init and entrypoint shell scripts and embeds their logic
directly into the configuration files:
- **Deleted 7 shell scripts** from the codebase:
  - `docker/server-entrypoint.sh`, `docker/server-init.sh`
  - `docker/proxy-entrypoint.sh`, `docker/proxy-init.sh`
  - `docker/compose-server-entrypoint.sh`, `docker/compose-proxy-entrypoint.sh`, `docker/compose-server-init.sh`
- **Updated Kubernetes templates** (shards.yaml and proxy.yaml):
  - Replaced script calls with inline shell commands in both initContainers and main containers
  - All binary arguments are now specified directly in the YAML
- **Updated Docker Compose** (docker-compose.yml):
  - Replaced script calls with direct binary invocations using command arrays
  - Init logic is now inline shell script in the shard-init service
- **Updated Dockerfile**:
  - Removed COPY commands for all deleted scripts

**Benefits:**
- Binary configuration changes (flags, environment variables) no longer require Docker image rebuilds
- Arguments are visible directly in the deployment configs
- Simpler deployment pipeline - fewer moving parts
- Easier to understand what arguments are being passed to binaries

## Test Plan
- Verified all shell scripts are deleted from the repository
- Verified Dockerfile no longer references the deleted scripts
- Checked that Kubernetes YAML templates have valid syntax
- Checked that Docker Compose YAML has valid syntax
- All binary arguments are correctly passed via command-line flags in the configs

## Release Plan
- These changes should be backported to the latest `testnet` branch, then be released in a validator hotfix.
## Motivation
We currently compile out trace logs, even though the overhead is minimal (it's
2 CPU instructions:
https://www.reddit.com/r/rust/comments/x9nypb/comment/inutkiv/).

## Proposal
Stop compiling it out, so we can have `trace` logs if we want for debugging
with just a restart, without having to recompile the binary.

## Test Plan
I'm benchmarking networks with this, saw no visible performance regression.

## Release Plan
- Nothing to do / These changes follow the usual release cycle.
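A minimal sketch of what the proposal above buys, assuming an `EnvFilter`-based
subscriber (with tracing-subscriber's `env-filter` feature); this is
illustrative, not the actual linera tracing setup. Once trace-level call sites
are compiled in, verbosity becomes a pure runtime choice, e.g.
`RUST_LOG=linera=trace`:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Reads the filter from RUST_LOG; defaults to `info` when unset.
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();

    // Compiled in, but only emitted when the runtime filter enables `trace`.
    tracing::trace!("detailed debugging output, enabled without recompiling");
    tracing::info!("normal output");
}
```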
## Motivation
In a previous change, we lowered the log level for heap profile
operations from INFO to TRACE to reduce log noise from frequent
monitoring polls. However, we missed one log statement in the heap
profile HTTP endpoint that still uses INFO level.
This inconsistency means that while most heap profile operations log at
TRACE level, the HTTP endpoint still creates excessive log noise with
`"Serving heap profile via /debug/pprof"` messages appearing in default
logs every time the endpoint is polled.
## Proposal
Change the remaining log statement in
`linera-metrics/src/memory_profiler.rs` from `info!` to `trace!` for the
heap profile endpoint message, completing the consistency improvement
started in the earlier change.
## Test Plan
1. Deploy network with memory profiling enabled
2. Verify heap profile endpoint still works (returns pprof data)
3. Observe that default logs no longer contain repetitive "Serving heap
profile" messages
4. Enable TRACE logging and verify messages appear as expected
5. Confirm consistency with other heap profile logging
## Release Plan
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.
## Motivation
When block exporters fail to connect to the indexer, they retry with
exponential backoff but provide no visibility into the retry process.
Operators cannot determine whether connection issues are transient
(retrying successfully) or persistent (about to fail). This lack of
observability makes it difficult to diagnose network connectivity issues
and indexer availability problems during deployments and operations.
## Proposal
Add comprehensive logging to the indexer connection retry logic:
- **WARNING level** on each retry attempt, including:
- Current retry count and max retries remaining
- Calculated backoff delay in milliseconds
- Error that triggered the retry
- **ERROR level** when all retries are exhausted, clearly indicating:
- Final retry count
- Max retries configured
- Error that caused final failure
- Explicit statement that exporter task will exit
This provides operators with actionable telemetry to distinguish transient
network issues from persistent failures (see the sketch below).
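A hedged sketch of the retry loop this describes; the backoff formula, limits,
and the `connect` closure are illustrative and not the actual exporter code:

```rust
use std::time::Duration;
use tracing::{error, warn};

async fn connect_with_retries<T, E, F, Fut>(mut connect: F, max_retries: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut retry_count = 0;
    loop {
        match connect().await {
            Ok(connection) => return Ok(connection),
            Err(error) if retry_count < max_retries => {
                // Exponential backoff; bounded only by `max_retries` in this sketch.
                let delay_ms = 100u64 * 2u64.pow(retry_count);
                warn!(
                    retry_count, max_retries, delay_ms, %error,
                    "failed to connect to the indexer; retrying after backoff",
                );
                tokio::time::sleep(Duration::from_millis(delay_ms)).await;
                retry_count += 1;
            }
            Err(error) => {
                error!(
                    retry_count, max_retries, %error,
                    "failed to connect to the indexer; retries exhausted, exporter task will exit",
                );
                return Err(error);
            }
        }
    }
}
```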
## Test Plan
1. Deploy network with intentionally unreachable indexer endpoint
2. Observe exporter logs show WARNING messages with increasing retry
counts and backoff delays
3. Wait for max retries to be exhausted
4. Verify ERROR message appears with clear failure indication
5. Confirm logs provide enough information to diagnose the connection
issue
## Release Plan
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.
#4848)

## Motivation
The notification forwarding system has two bugs:
1. **Duplicate messages to exporters**: When validators have multiple proxies,
each proxy forwards notifications to all exporters. This means exporters
receive duplicate notifications (N copies for N proxies), causing redundant
processing and potential data inconsistencies.
2. **Receiver lag crashes forwarding**: The broadcast receiver for
notifications can lag if the consumer is slower than the producer. When lag
occurs, the code treats `RecvError::Lagged` as a fatal error and exits the
forwarding loop, stopping all notification forwarding permanently even though
the channel is still functional.

Both issues reduce system reliability and create operational problems that are
difficult to diagnose.

## Proposal
**Fix duplicate forwarding:**
- Add an `exporter_forwarded` flag to track whether exporters have been included
- Only the first proxy forwards to exporters; subsequent proxies forward to an empty exporter list
- Each proxy still forwards to its own validator endpoint

**Fix lag handling:**
- Replace the `while let Ok()` pattern with an explicit `match` on the receiver result
- Handle `RecvError::Lagged(skipped_count)` by logging a warning and continuing
- Handle `RecvError::Closed` by logging and breaking (legitimate shutdown)
- Only actually lagged/skipped messages trigger the warning, not every message

This ensures exporters receive exactly one copy of each notification and
forwarding continues even when temporary lag occurs (see the sketch below).

## Test Plan
1. Deploy network with multiple proxies (2+) and exporters
2. Verify exporters receive exactly one copy of each notification (not N copies)
3. Generate high notification load to induce receiver lag
4. Verify warning logs appear for lagged messages
5. Confirm notification forwarding continues after lag events
6. Verify no duplicate processing in exporter/indexer

## Release Plan
- These changes should be backported to the latest `testnet` branch, then be released in a validator hotfix.
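A hedged sketch of the lag-tolerant receive loop described under "Fix lag
handling"; the notification type and the forwarding step are placeholders, not
the actual proxy code:

```rust
use tokio::sync::broadcast::{error::RecvError, Receiver};
use tracing::{info, warn};

async fn forward_notifications<N: Clone>(mut receiver: Receiver<N>) {
    loop {
        match receiver.recv().await {
            Ok(notification) => {
                // Forward to the validator endpoint (and, on the first proxy
                // only, to the exporters) here.
                let _ = notification;
            }
            Err(RecvError::Lagged(skipped_count)) => {
                // The channel is still usable; log and keep forwarding.
                warn!(skipped_count, "notification receiver lagged; continuing");
            }
            Err(RecvError::Closed) => {
                // Legitimate shutdown of the sending side.
                info!("notification channel closed; stopping forwarding");
                break;
            }
        }
    }
}
```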
## Motivation
The explorer frontend hardcodes API URLs to `http://localhost:3002`,
which only works in local development when both frontend and backend run
on localhost. This breaks in production deployments where:
- Frontend is served through Kubernetes ingress
- Backend API is accessed via the same ingress with `/api` path prefix
- Browsers block mixed content (HTTPS frontend → HTTP localhost backend)
- CORS issues arise from cross-origin requests
The hardcoded URLs prevent the explorer from working in deployed
environments.
## Proposal
Replace hardcoded absolute URLs with relative paths that work in all
environments. Add Vite dev server proxy configuration to maintain local
development workflow while using relative paths.
This approach works in all environments:
- **Local dev**: Vite proxy forwards `/api` → `http://localhost:3002`
- **Production**: Kubernetes ingress routes `/api` → backend service
## Test Plan
1. Test production deployment:
- Deploy explorer with Kubernetes ingress configured
- Verify frontend loads and connects to backend
- Confirm no CORS or mixed content errors
- Verify all API calls work through ingress
## Release Plan
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.
## Motivation
Deploying the exporter/indexer/explorer stack requires careful dependency
ordering and runtime checks:
1. **Exporters need database to exist** before starting (ScyllaDB must
be initialized)
2. **Exporters need indexer to be ready** before attempting connections
3. **Indexer image needs linera CLI** to perform database checks
4. **Configuration needs flexibility** to support both internal cluster
DNS and external endpoints with TLS
Without these capabilities, deployments fail with race conditions where
exporters start before dependencies are ready, causing crash loops and
manual intervention.
## Proposal
Add comprehensive deployment support through init containers that wait for
dependencies, include the linera CLI in the indexer image for database checks,
and make indexer endpoints configurable to support both internal and external
deployments with automatic TLS handling.
## Test Plan
1. Deploy fresh network with exporter/indexer/explorer stack:
- Verify ScyllaDB init container waits for database
- Verify indexer readiness init container waits for indexer
- Confirm exporters start only after both checks pass
2. Test with external indexer endpoint:
- Configure `blockExporter.indexerEndpoint` with external URL
- Verify TLS configuration is applied
- Confirm connectivity works
3. Test failure scenarios:
- Deploy without indexer, verify exporter init container retries
indefinitely
- Kill indexer, verify exporters restart and wait for it to return
## Release Plan
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.
## Motivation
Validator deployments can fail with HTTP 429 errors when helmfile tries
to download the Prometheus ServiceMonitor CRD from
raw.githubusercontent.com during the prepare hook. When this happens, it
blocks the entire deployment.
## Proposal
Bundle the ServiceMonitor CRD (v0.86.1) locally in the repository at
`kubernetes/linera-validator/crds/servicemonitor.yaml` instead of
downloading it from GitHub at deploy time.
ServiceMonitor CRDs change infrequently and are designed to be backward
compatible, so using a static version is appropriate. The CRD file
includes documentation with version info and instructions for updating
when needed (typically when upgrading prometheus-operator or quarterly
checks).
Changes:
- Added ServiceMonitor CRD file with documentation header explaining
version, source, and update instructions
- Updated helmfile.yaml to apply the CRD from the local file instead of
the GitHub URL
- Maintains existing check to skip installation if CRD already exists on
the cluster
## Test Plan
- Manually tested: Deploy to fresh cluster - CRD installs successfully
- Verified: helmfile hook completes without GitHub rate limit errors
- Confirmed: ServiceMonitor resources are created correctly after
deployment
## Release Plan
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.
## Motivation
`sh` doesn't support `set -euo pipefail`; I missed this in a previous PR.
## Proposal
Use `bash` when available, otherwise use `set -eu` instead.
## Test Plan
Deployed a network, exporter starts as expected now.
## Release Plan
- These changes should be backported to the latest `testnet` branch, then be
released in a validator hotfix.
…eated (#4871)

## Motivation
We're setting Scylla settings using this ConfigMap, but there's actually no
guarantee that it will exist by the time Scylla starts. I've seen networks
deployed that just ignore the ConfigMap and use default settings, for example,
because the ConfigMap wasn't created yet when Scylla started.

## Proposal
Enforce that the ConfigMap will be created before Scylla starts.

## Test Plan
Deployed a network; it uses the ConfigMap options more reliably now.

## Release Plan
- These changes should be backported to the latest `testnet` branch, then be released in a validator hotfix.
#4852)

## Motivation
We saw UNIQUE constraint violations for the `operations` table when testing
with the exporter.

## Proposal
Remove all constraints – those should be maintained by the node anyway, and
they add additional work for SQLite (which has to check them).

## Test Plan
Manual.

## Release Plan
- Nothing to do / These changes follow the usual release cycle.

## Links
- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)
deuszx approved these changes on Oct 30, 2025.
MathieuDutSik approved these changes on Oct 30, 2025.
Contributor MathieuDutSik left a comment:
LGTM
## Motivation
Backport a few PRs:

## Proposal
Backport

## Test Plan
CI