@ndr-ds ndr-ds commented Oct 30, 2025

## Motivation

Backport a few PRs:

## Proposal

Backport

## Test Plan

CI

ndr-ds added 12 commits October 30, 2025 10:10
Right now we have a confusing mix of features and naming around
telemetry export. The code references both `otel` and `telemetry_only`,
and it's unclear how to opt in or out of sending trace information to
Tempo/OpenTelemetry backends. Additionally, there are a few bugs
preventing OpenTelemetry from working correctly in production
deployments.

Instead of `telemetry_only`, use `opentelemetry.skip` as a span field to
opt out of sending specific spans to Tempo/OpenTelemetry.

By default, all spans are sent to OpenTelemetry/Tempo backends (when the
`opentelemetry` feature is enabled), regardless of log level. To exclude
a span from OpenTelemetry export while keeping it in local traces
(Chrome trace, logs), add `opentelemetry.skip` to the span fields:

```rust
// This span will NOT be sent to Tempo.
// (The `opentelemetry.skip` field name comes from this PR; the exact
// attribute placement shown here is illustrative.)
#[tracing::instrument(fields(opentelemetry.skip = true))]
fn internal_helper() {}

// This span WILL be sent to Tempo (default behavior).
#[tracing::instrument]
fn public_api() {}
```

**Key Changes:**

**Naming & Clarity:**
- Renamed `otel` feature to `opentelemetry` for clarity
- Renamed `--otel-exporter-otlp-endpoint` CLI flag to
`--otlp-exporter-endpoint`
- Renamed `--otel-trace-file` CLI flag to `--chrome-trace-file` (more
accurate)
- Standardized all environment variables to use
`LINERA_OTLP_EXPORTER_ENDPOINT`
- Chrome trace export includes ALL spans (useful for local debugging)
- OpenTelemetry/Tempo export respects the `opentelemetry.skip` filter

**Bug Fixes:**
- **Fixed empty endpoint validation**: Previously, passing an empty
string as the OTLP endpoint would cause the exporter to fail with
"Failed to parse endpoint". Now properly validates that the endpoint is
non-empty before attempting to create the exporter.
- **Fixed faucet crash with OpenTelemetry**: The OTLP exporter was being
initialized before the Tokio runtime was created, causing a panic:
"there is no reactor running, must be called from the context of a Tokio
1.x runtime". Moved tracing initialization to after runtime creation.
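The empty-endpoint check can be sketched as follows (the function name and shape are assumptions for illustration, not the actual code):

```rust
/// Treat an empty or whitespace-only OTLP endpoint as "not configured"
/// instead of passing it to the exporter, which would fail with
/// "Failed to parse endpoint".
fn validated_otlp_endpoint(raw: Option<&str>) -> Option<String> {
    raw.map(str::trim)
        .filter(|endpoint| !endpoint.is_empty())
        .map(str::to_owned)
}

fn main() {
    // An empty string is treated the same as no endpoint at all.
    assert_eq!(validated_otlp_endpoint(Some("")), None);
    println!("{:?}", validated_otlp_endpoint(Some("http://tempo:4317")));
}
```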

**Test Improvements:**
- Added proper feature guards (`with_testing`) for test-only functions
- CI tests verify that spans with `opentelemetry.skip` are filtered from
OpenTelemetry export but still appear in Chrome traces

- All existing tests pass with the renamed feature and CLI flags
- Manual testing confirms OTLP traces now successfully export to Tempo
without crashes

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.

It might be good to backport this so that, if we want to add more
instrumentation on the testnet, things are less confusing. :)

## Motivation

Currently, only the client supports the `--otlp-exporter-endpoint` flag
and `LINERA_OTLP_EXPORTER_ENDPOINT` environment variable for
OpenTelemetry configuration. The proxy and server binaries always
default to checking the environment variable but don't expose the
command-line option, which is inconsistent with the client.

## Proposal

This PR adds the `otlp_exporter_endpoint` option to both proxy and
server for consistency:

- **Proxy** (`linera-service/src/proxy/main.rs`):
- Added `otlp_exporter_endpoint: Option<String>` field to `ProxyOptions`
- Updated `init_with_opentelemetry` call to pass the endpoint instead of
`None`

- **Server** (`linera-service/src/server.rs`):
- Added `otlp_exporter_endpoint: Option<String>` field to
`ServerCommand::Run` variant
- Created helper function `otlp_exporter_endpoint_for()` to extract
endpoint from command
  - Updated `init_with_opentelemetry` call to use the helper function
- Boxed `CommonStorageOptions` in `ServerCommand::Run` to fix clippy
`large_enum_variant` warning (the `Run` variant became too large after
adding the new field)

Both binaries now consistently accept `--otlp-exporter-endpoint` flag
and `LINERA_OTLP_EXPORTER_ENDPOINT` environment variable, matching the
client's behavior.
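The helper could look roughly like this (a sketch; the real `ServerCommand` in `linera-service/src/server.rs` has more variants and fields):

```rust
/// Illustrative stand-in for the real `ServerCommand` enum.
enum ServerCommand {
    Run { otlp_exporter_endpoint: Option<String> },
    Generate,
}

/// Extract the OTLP endpoint from the command, if the variant carries one.
fn otlp_exporter_endpoint_for(command: &ServerCommand) -> Option<String> {
    match command {
        ServerCommand::Run { otlp_exporter_endpoint } => otlp_exporter_endpoint.clone(),
        ServerCommand::Generate => None,
    }
}

fn main() {
    let run = ServerCommand::Run {
        otlp_exporter_endpoint: Some("http://tempo:4317".to_owned()),
    };
    println!("{:?}", otlp_exporter_endpoint_for(&run));
}
```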

## Test Plan

- Verified clippy passes with all features
- Verified clippy passes with no default features
- Confirmed both binaries compile successfully

## Release Plan

- Nothing to do / These changes follow the usual release cycle.
## Motivation

The init and entrypoint shell scripts (server-entrypoint.sh,
proxy-entrypoint.sh, etc.) were thin wrappers around binary calls that
passed arguments to the Linera binaries. This meant that any time we
needed to change initialization or entrypoint arguments (e.g., adding
new flags, changing default values), we had to rebuild the Docker
images, which is inefficient and couples configuration to the image
build process.

## Proposal

This PR removes all init and entrypoint shell scripts and embeds their
logic directly into the configuration files:

- **Deleted 7 shell scripts** from the codebase:
  - `docker/server-entrypoint.sh`, `docker/server-init.sh`
  - `docker/proxy-entrypoint.sh`, `docker/proxy-init.sh`
- `docker/compose-server-entrypoint.sh`,
`docker/compose-proxy-entrypoint.sh`, `docker/compose-server-init.sh`

- **Updated Kubernetes templates** (shards.yaml and proxy.yaml):
- Replaced script calls with inline shell commands in both
initContainers and main containers
  - All binary arguments are now specified directly in the YAML

- **Updated Docker Compose** (docker-compose.yml):
- Replaced script calls with direct binary invocations using command
arrays
  - Init logic is now inline shell script in the shard-init service

- **Updated Dockerfile**:
  - Removed COPY commands for all deleted scripts

**Benefits:**
- Binary configuration changes (flags, environment variables) no longer
require Docker image rebuilds
- Arguments are visible directly in the deployment configs
- Simpler deployment pipeline - fewer moving parts
- Easier to understand what arguments are being passed to binaries
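As an illustration, an inline invocation in `docker-compose.yml` might look like this (the service name, image, and flags here are assumptions, not the actual config):

```yaml
services:
  proxy:
    image: linera
    command:
      - ./linera-proxy
      - --otlp-exporter-endpoint
      - http://tempo:4317
```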

## Test Plan

- Verified all shell scripts are deleted from the repository
- Verified Dockerfile no longer references the deleted scripts
- Checked that Kubernetes YAML templates have valid syntax
- Checked that Docker Compose YAML has valid syntax
- All binary arguments are correctly passed via command-line flags in
the configs

## Release Plan

- These changes should be backported to the latest `testnet` branch,
then be released in a validator hotfix.
## Motivation

We currently compile trace logs out at build time, even though the
overhead of a disabled trace statement is minimal (about 2 CPU
instructions:
https://www.reddit.com/r/rust/comments/x9nypb/comment/inutkiv/).

## Proposal

Stop compiling them out, so that we can enable `trace` logs for
debugging with just a restart, without having to recompile the binary.
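For context, the `tracing` crate offers `release_max_level_*` Cargo features that strip levels at compile time; removing such a feature (sketched below; the exact dependency line in the repo may differ) keeps `trace` available at runtime:

```toml
# Before (sketch): trace statements compiled out of release builds
# tracing = { version = "0.1", features = ["release_max_level_debug"] }

# After: all levels compiled in; filtering happens at runtime (e.g. RUST_LOG)
tracing = "0.1"
```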

## Test Plan

I'm benchmarking networks with this change and saw no visible
performance regression.

## Release Plan

- Nothing to do / These changes follow the usual release cycle.
## Motivation

In a previous change, we lowered the log level for heap profile
operations from INFO to TRACE to reduce log noise from frequent
monitoring polls. However, we missed one log statement in the heap
profile HTTP endpoint that still uses INFO level.

This inconsistency means that while most heap profile operations log at
TRACE level, the HTTP endpoint still creates excessive log noise with
`"Serving heap profile via /debug/pprof"` messages appearing in default
logs every time the endpoint is polled.

## Proposal

Change the remaining log statement in
`linera-metrics/src/memory_profiler.rs` from `info!` to `trace!` for the
heap profile endpoint message, completing the consistency improvement
started in the earlier change.

## Test Plan

1. Deploy network with memory profiling enabled
2. Verify heap profile endpoint still works (returns pprof data)
3. Observe that default logs no longer contain repetitive "Serving heap
profile" messages
4. Enable TRACE logging and verify messages appear as expected
5. Confirm consistency with other heap profile logging

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

When block exporters fail to connect to the indexer, they retry with
exponential backoff but provide no visibility into the retry process.
Operators cannot determine whether connection issues are transient
(retrying successfully) or persistent (about to fail). This lack of
observability makes it difficult to diagnose network connectivity issues
and indexer availability problems during deployments and operations.

## Proposal

Add comprehensive logging to the indexer connection retry logic:

- **WARNING level** on each retry attempt, including:
  - Current retry count and max retries remaining
  - Calculated backoff delay in milliseconds
  - Error that triggered the retry
- **ERROR level** when all retries are exhausted, clearly indicating:
  - Final retry count
  - Max retries configured
  - Error that caused final failure
  - Explicit statement that exporter task will exit

This provides operators with actionable telemetry to distinguish
transient network issues from persistent failures.
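A self-contained sketch of the retry loop with the described logging (the backoff formula, limits, and messages are illustrative, not the exporter's actual code):

```rust
use std::time::Duration;

/// Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at ~100 s.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(100u64 << attempt.min(10))
}

fn main() {
    let max_retries = 3;
    for attempt in 1..=max_retries {
        let delay = backoff_delay(attempt - 1);
        // WARNING level on each retry: count, backoff delay, and the error.
        eprintln!(
            "WARN connect to indexer failed (attempt {attempt}/{max_retries}), \
             retrying in {} ms: connection refused",
            delay.as_millis()
        );
    }
    // ERROR level once retries are exhausted: the exporter task will exit.
    eprintln!("ERROR indexer unreachable after {max_retries} retries; exporter task exiting");
}
```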

## Test Plan

1. Deploy network with intentionally unreachable indexer endpoint
2. Observe exporter logs show WARNING messages with increasing retry
counts and backoff delays
3. Wait for max retries to be exhausted
4. Verify ERROR message appears with clear failure indication
5. Confirm logs provide enough information to diagnose the connection
issue

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
#4848)

## Motivation

The notification forwarding system has two bugs:

1. **Duplicate messages to exporters**: When validators have multiple
proxies, each proxy forwards notifications to all exporters. This means
exporters receive duplicate notifications (N copies for N proxies),
causing redundant processing and potential data inconsistencies.

2. **Receiver lag crashes forwarding**: The broadcast receiver for
notifications can lag if the consumer is slower than the producer. When
lag occurs, the code treats `RecvError::Lagged` as a fatal error and
exits the forwarding loop, stopping all notification forwarding
permanently even though the channel is still functional.

Both issues reduce system reliability and create operational problems
that are difficult to diagnose.

## Proposal

**Fix duplicate forwarding:**
- Add `exporter_forwarded` flag to track whether exporters have been
included
- Only the first proxy forwards to exporters, subsequent proxies forward
to an empty exporter list
- Each proxy still forwards to its own validator endpoint
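The flag-based deduplication can be sketched like this (function and variable names are illustrative):

```rust
/// Give the full exporter list to the first proxy only; every later proxy
/// gets an empty list, so each exporter receives exactly one copy of each
/// notification.
fn assign_exporters(proxies: &[&str], exporters: Vec<String>) -> Vec<Vec<String>> {
    let mut exporter_forwarded = false;
    proxies
        .iter()
        .map(|_proxy| {
            if exporter_forwarded {
                Vec::new()
            } else {
                exporter_forwarded = true;
                exporters.clone()
            }
        })
        .collect()
}

fn main() {
    let lists = assign_exporters(&["proxy-0", "proxy-1"], vec!["exporter-0".to_owned()]);
    println!("{lists:?}");
}
```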

**Fix lag handling:**
- Replace `while let Ok()` pattern with explicit `match` on receiver
result
- Handle `RecvError::Lagged(skipped_count)` by logging a warning and
continuing
- Handle `RecvError::Closed` by logging and breaking (legitimate
shutdown)
- Only actually lagged/skipped messages trigger the warning, not every
message

This ensures exporters receive exactly one copy of each notification and
forwarding continues even when temporary lag occurs.
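The lag-tolerant receive loop can be sketched with a local stand-in for tokio's `broadcast::error::RecvError` (the real code uses the tokio type and `tracing` macros):

```rust
/// Local stand-in for `tokio::sync::broadcast::error::RecvError`.
#[derive(Debug)]
enum RecvError {
    Lagged(u64),
    Closed,
}

/// Decide whether the forwarding loop should continue after one receive.
fn should_continue(result: Result<&str, RecvError>) -> bool {
    match result {
        Ok(notification) => {
            println!("forwarding notification: {notification}");
            true
        }
        Err(RecvError::Lagged(skipped)) => {
            // Lag is not fatal: warn and keep consuming from the channel.
            eprintln!("WARN receiver lagged, skipped {skipped} notifications");
            true
        }
        Err(RecvError::Closed) => {
            // The sender is gone: this is a legitimate shutdown.
            eprintln!("notification channel closed; stopping forwarding");
            false
        }
    }
}

fn main() {
    assert!(should_continue(Ok("block notification")));
    assert!(should_continue(Err(RecvError::Lagged(7))));
    assert!(!should_continue(Err(RecvError::Closed)));
}
```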

## Test Plan

1. Deploy network with multiple proxies (2+) and exporters
2. Verify exporters receive exactly one copy of each notification (not N
copies)
3. Generate high notification load to induce receiver lag
4. Verify warning logs appear for lagged messages
5. Confirm notification forwarding continues after lag events
6. Verify no duplicate processing in exporter/indexer

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

The explorer frontend hardcodes API URLs to `http://localhost:3002`,
which only works in local development when both frontend and backend run
on localhost. This breaks in production deployments where:

- Frontend is served through Kubernetes ingress
- Backend API is accessed via the same ingress with `/api` path prefix
- Browsers block mixed content (HTTPS frontend → HTTP localhost backend)
- CORS issues arise from cross-origin requests

The hardcoded URLs prevent the explorer from working in deployed
environments.

## Proposal

Replace hardcoded absolute URLs with relative paths that work in all
environments. Add Vite dev server proxy configuration to maintain local
development workflow while using relative paths.

This approach works in all environments:
- **Local dev**: Vite proxy forwards `/api` → `http://localhost:3002`
- **Production**: Kubernetes ingress routes `/api` → backend service
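The dev-server proxy might be configured roughly like this in `vite.config.ts` (a sketch; only the port 3002 comes from the description above):

```typescript
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    proxy: {
      // Forward /api requests to the local backend during development,
      // so the frontend can use relative paths in all environments.
      '/api': {
        target: 'http://localhost:3002',
        changeOrigin: true,
      },
    },
  },
});
```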

## Test Plan

1. Test production deployment:
   - Deploy explorer with Kubernetes ingress configured
   - Verify frontend loads and connects to backend
   - Confirm no CORS or mixed content errors
   - Verify all API calls work through ingress

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

Deploying the exporter/indexer/explorer stack requires careful
dependency ordering and runtime checks:

1. **Exporters need database to exist** before starting (ScyllaDB must
be initialized)
2. **Exporters need indexer to be ready** before attempting connections
3. **Indexer image needs linera CLI** to perform database checks
4. **Configuration needs flexibility** to support both internal cluster
DNS and external endpoints with TLS

Without these capabilities, deployments fail with race conditions where
exporters start before dependencies are ready, causing crash loops and
manual intervention.

## Proposal

Add comprehensive deployment support through init containers that wait
for dependencies, include linera CLI in the indexer image for database
checks, and make indexer endpoints configurable to support both internal
and external deployments with automatic TLS handling.
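An init container that waits for the indexer might look like this (the image, names, and port are assumptions, not the actual template):

```yaml
initContainers:
  - name: wait-for-indexer
    image: busybox
    command:
      - sh
      - -c
      - until nc -z indexer 8080; do echo "waiting for indexer"; sleep 5; done
```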

## Test Plan

1. Deploy fresh network with exporter/indexer/explorer stack:
   - Verify ScyllaDB init container waits for database
   - Verify indexer readiness init container waits for indexer
   - Confirm exporters start only after both checks pass

2. Test with external indexer endpoint:
   - Configure `blockExporter.indexerEndpoint` with external URL
   - Verify TLS configuration is applied
   - Confirm connectivity works

3. Test failure scenarios:
- Deploy without indexer, verify exporter init container retries
indefinitely
   - Kill indexer, verify exporters restart and wait for it to return

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

Validator deployments can fail with HTTP 429 errors when helmfile tries
to download the Prometheus ServiceMonitor CRD from
raw.githubusercontent.com during the prepare hook. When this happens, it
blocks the entire deployment.

## Proposal

Bundle the ServiceMonitor CRD (v0.86.1) locally in the repository at
`kubernetes/linera-validator/crds/servicemonitor.yaml` instead of
downloading it from GitHub at deploy time.

ServiceMonitor CRDs change infrequently and are designed to be backward
compatible, so using a static version is appropriate. The CRD file
includes documentation with version info and instructions for updating
when needed (typically when upgrading prometheus-operator or quarterly
checks).

Changes:
- Added ServiceMonitor CRD file with documentation header explaining
version, source, and update instructions
- Updated helmfile.yaml to apply the CRD from the local file instead of
the GitHub URL
- Maintains existing check to skip installation if CRD already exists on
the cluster

## Test Plan

- Manually tested: Deploy to fresh cluster - CRD installs successfully
- Verified: helmfile hook completes without GitHub rate limit errors
- Confirmed: ServiceMonitor resources are created correctly after
deployment

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
## Motivation

`sh` doesn't support `set -euo pipefail`; I missed this in a previous PR.

## Proposal

Use `bash` when available, otherwise use `set -eu` instead.
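A portable strict-mode preamble along these lines (a sketch of the idea, not the exact script):

```shell
#!/bin/sh
# `pipefail` exists in bash but not in POSIX sh, so only enable it
# when we are actually running under bash.
if [ -n "${BASH_VERSION:-}" ]; then
  set -euo pipefail
else
  set -eu
fi
echo "strict mode enabled"
```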

## Test Plan

Deployed a network, exporter starts as expected now.

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
…eated (#4871)

## Motivation

We configure Scylla using this ConfigMap, but there's actually no
guarantee that it exists by the time Scylla starts. I've seen deployed
networks simply ignore the ConfigMap and use default settings because
the ConfigMap hadn't been created yet when Scylla started.

## Proposal

Enforce that the ConfigMap will be created before Scylla starts.
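One way to guarantee this ordering is a Helm pre-install hook on the ConfigMap, for example (illustrative only; the actual mechanism used in the chart may differ):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scylla-config
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-10"
```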

## Test Plan

Deployed a network; it now uses the ConfigMap options more reliably.

## Release Plan

- These changes should be backported to the latest `testnet` branch, then
be released in a validator hotfix.
@ndr-ds ndr-ds force-pushed the backport_of_4771_4829_4830_4554_4845_4846_4848_4849_4850_4855_4870_4871 branch from 136f1ab to 5cfcd59 Compare October 30, 2025 13:10
#4852)

## Motivation

We saw UNIQUE constraint violations on the `operations` table when
testing with the exporter.

## Proposal

Remove all constraints; those invariants should be maintained by the
node anyway, and they add extra work for SQLite (which has to check
them).

## Test Plan

Manual.

## Release Plan

- Nothing to do / These changes follow the usual release cycle.

## Links

- [reviewer checklist](https://github.com/linera-io/linera-protocol/blob/main/CONTRIBUTING.md#reviewer-checklist)

@MathieuDutSik MathieuDutSik left a comment


LGTM

@ndr-ds ndr-ds merged commit a58ad6f into testnet_conway Oct 30, 2025
31 of 32 checks passed
@ndr-ds ndr-ds deleted the backport_of_4771_4829_4830_4554_4845_4846_4848_4849_4850_4855_4870_4871 branch October 30, 2025 20:03
