Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Nov 7, 2025

Overview:

This PR consolidates Dynamo's observability infrastructure to provide a more consistent and easier-to-follow experience. All observability documentation now follows a uniform structure (Overview → Environment Variables → Getting Started → Details), making it simple to find what you need whether you're setting up metrics, tracing, logging, or health checks.

The refactored structure separates concerns clearly: docker-compose.yml handles only core infrastructure (NATS & etcd), while docker-observability.yml provides the complete observability stack. All observability configurations are now centralized under deploy/observability/, eliminating the previous scattered structure across deploy/metrics/, deploy/logging/, and deploy/tracing/.

Documentation improvements include:

  • Consistent structure: Every observability doc follows the same pattern with Environment Variables tables and Getting Started sections
  • Single entry point: docs/observability/README.md serves as the unified gateway to all observability topics
  • Practical examples: Each guide now includes single-GPU Getting Started examples for quick testing
  • Clear separation: Prometheus/Grafana guide focuses on demo setup, while detailed metrics reference lives in metrics.md
  • Developer resources: New Metrics Developer Guide for creating custom metrics in Rust/Python

Details:

  • Refactored services into separate docker-compose.yml (NATS & etcd only) and docker-observability.yml (Prometheus, Grafana, Tempo, exporters)
  • Consolidated observability configs under deploy/observability/ (previously scattered across metrics/, logging/, tracing/)
  • Reorganized Kubernetes-specific observability configs under deploy/observability/k8s/
  • Standardized all observability docs with consistent sections: Overview, Environment Variables, Getting Started, and detailed reference
  • Created new docs/observability/README.md as unified entry point with navigation table
  • Refactored Prometheus/Grafana guide to focus on single-machine demos (removed detailed metrics explanations, now in metrics.md)
  • Added Metrics Developer Guide for creating custom metrics in Rust/Python
  • Enhanced all docs with Environment Variables tables for easy reference
  • Added practical Getting Started sections with single-GPU examples for quick testing
  • Enhanced tracing docs with x-request-id correlation guidance for easier debugging
  • Updated env_is_truthy utility usage for OTLP configuration consistency
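
The `env_is_truthy` utility itself lives in `lib/runtime/src/config.rs` and is not shown in this PR body. Assuming it accepts the values listed in the commits (true/1/on/yes) case-insensitively, its behavior can be sketched in Python as:

```python
import os

# Accepted set per the commit notes; case-insensitive matching is an assumption.
TRUTHY_VALUES = {"1", "true", "on", "yes"}

def env_is_truthy(name: str) -> bool:
    """Return True if the named environment variable is set to a truthy value."""
    value = os.environ.get(name)
    if value is None:
        return False
    return value.strip().lower() in TRUTHY_VALUES

# Previously only OTEL_EXPORT_ENABLED=1 worked; now any truthy spelling does.
os.environ["OTEL_EXPORT_ENABLED"] = "true"
print(env_is_truthy("OTEL_EXPORT_ENABLED"))
```

Setting `OTEL_EXPORT_ENABLED=true`, `=on`, or `=yes` now behaves the same as the old `=1`.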

Where should the reviewer start?

Review docs/observability/README.md, which serves as the entry point to all observability documentation. Notice how metrics.md, tracing.md, health-checks.md, and logging.md all follow the same consistent structure: Overview → Environment Variables → Getting Started → Details. This uniform pattern makes it easy to quickly find configuration options (always in a table) and get started with practical examples (always in a Getting Started section).
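
For the x-request-id correlation guidance mentioned in the tracing docs, a client attaches its own request ID as a header so the same ID can later be searched for in logs and traces. A minimal sketch (the endpoint, port, and payload are placeholders; how the frontend propagates the header is described in tracing.md, not here):

```python
import urllib.request
import uuid

request_id = str(uuid.uuid4())

# Build (but do not send) a request carrying a custom x-request-id header.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    data=b'{"model": "example", "messages": []}',
    headers={"Content-Type": "application/json", "x-request-id": request_id},
    method="POST",
)

# urllib normalizes header names to capitalized form.
print(req.get_header("X-request-id") == request_id)
```

Grepping logs for that UUID then links a single client call to its spans.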

Then check deploy/observability/ to see how all observability configs are now centralized in one location instead of being scattered across multiple directories.
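
The split described above leaves docker-compose.yml with only NATS and etcd, while docker-observability.yml carries the rest. A condensed, illustrative skeleton of the observability file (service names follow the PR description; images and tags are placeholders, not the actual file contents):

```yaml
# Illustrative skeleton only — see deploy/docker-observability.yml for
# real images, ports, volumes, and network assignments.
services:
  prometheus:
    image: prom/prometheus:latest
  grafana:
    image: grafana/grafana:latest
  tempo:
    image: grafana/tempo:latest
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
  nats-exporter:
    image: natsio/prometheus-nats-exporter:latest
```

Core infrastructure starts with `docker compose up -d` from `deploy/`, and the observability stack layers on with `docker compose -f docker-observability.yml up -d`.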

BEFORE:
=======

deploy/
├── docker-compose.yml (NATS + etcd + Prometheus + Grafana + exporters + monitoring network)
├── metrics/
│   ├── grafana-datasources.yml
│   ├── prometheus.yml
│   ├── grafana_dashboards/
│   │   ├── grafana-dashboard-providers.yml
│   │   ├── grafana-dcgm-metrics.json
│   │   ├── grafana-dynamo-dashboard.json
│   │   └── grafana-kvbm-dashboard.json
│   └── k8s/
│       ├── README.md
│       ├── frontend-podmonitor.yaml
│       ├── planner-podmonitor.yaml
│       ├── worker-podmonitor.yaml
│       └── grafana-dynamo-dashboard-configmap.yaml
├── logging/
│   ├── README.md
│   ├── grafana/
│   │   ├── dashboard.json
│   │   ├── logging-dashboard.yaml
│   │   └── loki-datasource.yaml
│   └── values/
│       ├── alloy-values.yaml
│       └── loki-values.yaml
└── tracing/
    ├── docker-compose.yml (Tempo + Grafana) [DELETED]
    ├── README.md
    ├── trace.png
    ├── tempo.yaml
    └── grafana/provisioning/datasources/
        └── tempo.yaml

docs/observability/
├── (no README.md)
├── metrics.md (basic)
├── health-checks.md (basic)
├── logging.md (basic)
├── prometheus-grafana.md (detailed)
└── (no tracing docs here)


AFTER (with +/- line counts):
==============================

deploy/
├── docker-compose.yml -120
├── docker-observability.yml +137
└── observability/
    ├── tempo-datasource.yml -1/+1
    ├── grafana_dashboards/
    │   └── grafana-dynamo-dashboard.json -1/+1
    └── k8s/
        ├── grafana-dynamo-dashboard-configmap.yaml -1/+1
        └── logging/
            └── README.md -1/+1

docs/observability/
├── README.md +33
├── metrics.md +112
├── metrics-developer-guide.md +450
├── health-checks.md +24
├── logging.md +1
├── tracing.md +60
└── prometheus-grafana.md -325

docs/kubernetes/observability/
├── logging.md -6/+6
└── metrics.md -3/+1

lib/runtime/src/logging.rs -2/+2

README.md +15

lib/runtime/examples/metrics_python/README.md -150/+4

DELETED:
deploy/tracing/docker-compose.yml -35

Related Issues:

Relates to DIS-980

/coderabbit profile chill

…bility.yml

- Moved metrics (Prometheus, Grafana, DCGM, NATS exporter) and tracing (Tempo) into single docker-observability.yml
- Simplified docker-compose.yml to only include core infrastructure (NATS, etcd)
- Reorganized observability files: deploy/metrics/* and deploy/tracing/* -> deploy/observability/*
- Updated documentation: deploy/tracing/README.md -> docs/observability/tracing.md
- Unified Grafana configuration to support both Prometheus and Tempo datasources
- Single observability stack now runs on unified 'server' network for better integration

Signed-off-by: Keiven Chang <[email protected]>
- Move deploy/logging to deploy/observability/k8s/logging for better organization
- Move trace.png to docs/observability/ to be alongside tracing.md
- Fix vllm lazy import of kvbm to avoid Tokio runtime initialization issues
- Add log level documentation explaining DEBUG vs INFO for trace visibility
- Update all references to reflect new paths
- Clarify OTEL environment variable defaults and behavior

Signed-off-by: Keiven Chang <[email protected]>
- Create docs/observability/README.md as central hub
- Split metrics-developer-guide.md from prometheus-grafana.md
- Standardize all docs: Overview, Environment Variables, Getting Started
- Update env variable parsing to accept truthy values (true/1/on/yes)
- Consolidate prometheus-grafana.md as quick start guide
- Improve metrics.md as reference document
- Clarify tracing requirements and overlap with logging
- Fix double space and grammatical issues

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang self-assigned this Nov 7, 2025
@keivenchang keivenchang requested review from a team as code owners November 7, 2025 01:54
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

The changes reorganize observability infrastructure by separating observability services into a standalone compose file, consolidating and restructuring Kubernetes observability configurations, updating environment variable handling for OTEL exports to support truthy values, expanding observability documentation with guides for metrics, tracing, logging, and health checks, and streamlining Docker Compose setup instructions.

Changes

Cohort / File(s) Summary
Docker Compose Observability Reorganization
README.md, deploy/docker-compose.yml, deploy/docker-observability.yml, deploy/tracing/docker-compose.yml
Extracted observability services (Prometheus, Grafana, Tempo, DCGM exporter, NATS exporter) from main docker-compose.yml into dedicated deploy/docker-observability.yml; removed tracing services from deploy/tracing/docker-compose.yml; updated README with observability stack setup instructions using bash-specific code block.
Grafana Dashboard Configuration Updates
deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json, deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml, docs/kubernetes/observability/metrics.md
Renamed Grafana dashboard title to "Dynamo Dashboard (generic)"; simplified Kubernetes metrics documentation to use single kubectl apply command with updated path.
Kubernetes Observability Path Restructuring
deploy/observability/k8s/logging/README.md, docs/kubernetes/observability/logging.md, docs/kubernetes/observability/metrics.md
Updated relative path references from deploy/logging/* to deploy/observability/k8s/logging/*; adjusted Helm and configuration artifact paths for Loki, Alloy, and Grafana provisioning.
Observability Documentation Suite
docs/observability/README.md, docs/observability/health-checks.md, docs/observability/logging.md, docs/observability/metrics.md, docs/observability/metrics-developer-guide.md, docs/observability/prometheus-grafana.md, docs/observability/tracing.md
Added comprehensive observability guides with environment variables, getting started sections, metric categories, runtime hierarchy, and practical examples; restructured tracing and logging documentation; condensed Prometheus/Grafana guide to single-machine demo setup; added new metrics developer guide for custom metrics creation across Rust and Python.
Tempo Datasource & Python Examples
deploy/observability/tempo-datasource.yml, lib/bindings/python/examples/metrics/README.md
Changed Tempo datasource isDefault from true to false; pared down Python metrics example documentation to concise "Running Examples" section with link to developer guide.
Environment Variable Handling Refactor
lib/bindings/python/rust/lib.rs, lib/runtime/src/logging.rs
Updated OTEL_EXPORT_ENABLED checks from string equality "1" to truthy value support using env_is_truthy() utility; updated corresponding warning messages and documentation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Docker observability file: New comprehensive service stack with multiple dependencies—verify image versions, port mappings, volume configurations, and network assignments are correct
  • Documentation scope: Extensive rewrites across 7+ observability docs with environment variable tables and new sections—confirm consistency of variable names, defaults, and examples across all guides
  • Path migrations: Multiple files updated with new kubernetes observability paths—verify all paths are correctly updated and no dangling references remain
  • Environment variable refactoring: Changes from numeric "1" to truthy values across Rust code—ensure all call sites properly use the new env_is_truthy() utility and backward compatibility is maintained

Poem

🐰 Hop hop, the observability stack does gleam,
Separated services fulfilling every dream,
Metrics, traces, and logs now neatly arranged,
Truthiness and defaults—everything changed!
From chaos to clarity, what a fine sight,
Documentation and dashboards now burning so bright!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The pull request description is comprehensive and well-structured, following the template with clear Overview, Details, Where to start, and Related Issues sections.
Title check ✅ Passed The title accurately reflects the main objective: consolidating observability files (docker-compose, markdown, configs) from scattered directories into organized deploy/observability and docs/observability structures.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
docs/kubernetes/observability/logging.md (1)

49-49: Path updates are correct; fix minor formatting at line 144.

All referenced configuration files exist in the new centralized paths under deploy/observability/k8s/logging/. The path structure is consistent across Loki values, Alloy values, and Grafana configuration files (lines 49, 63, 113, 116).

Minor correction needed at line 144: change "component type (e.g frontend, worker, etc)" to "component type (e.g., frontend, worker, etc.)" to follow American English style guidelines.

deploy/docker-observability.yml (1)

90-91: Minor: Tempo config file uses YAML extension, verify naming consistency.

The GPU reservation is correctly configured with count: all to reserve all GPUs on the host. Line 92 references ./observability/tempo.yaml (YAML extension). Verify this filename is consistent with your configuration management. Most tools use .yaml or .yml interchangeably, but consistency across the repo is preferred.

docs/observability/metrics.md (2)

62-74: Add language identifier to Prometheus exposition format code block.

Fenced code blocks with a language identifier provide the best readability and syntax highlighting. The code block showing Prometheus exposition format (lines 62-74) lacks a language identifier.

````diff
-```
+```text
 # HELP dynamo_component_requests_total Total requests processed
 # TYPE dynamo_component_requests_total counter
````

Alternatively, if Prometheus format highlighting is available, use prometheus as the language identifier.


172-194: Add language identifier to code blocks showing timeline and concurrency examples.

Lines 172 and 182 contain fenced code blocks without language identifiers. The timeline ASCII diagram (lines 182-194) should use text to preserve formatting without attempting syntax highlighting.

````diff
-```
+```text
 Timeline:    0, 1, ...
 Client ────> Frontend:8000 ...
````
docs/observability/tracing.md (2)

27-76: Clarify docker compose file path for consistency.

The guide uses docker compose -f docker-observability.yml (lines 33, 46, 162). While this works from the deploy/ directory, ensure the current working directory is clear in all instructions. The commands at lines 32-33 show cd deploy before running the compose command, which is good. However, line 162 in the "Stop Services" section does not show the cd deploy step.

Add consistent context at section 6 (lines 156-163) to clarify working directory:

 ### 6. Stop Services
 
 When done, stop the observability stack:
 
````diff
 ```bash
+cd deploy
 docker compose -f docker-observability.yml down
````

---

`90-116`: **Disaggregated deployment script reference needs clarification.**

Lines 78-116 provide a manual script for disaggregated deployment but reference modifying `disagg.sh`. The comment at line 90 states "You may need to modify `disagg.sh`" but the script provided (lines 92-116) is shown as a complete replacement example.

Clarify whether:
1. `disagg.sh` should be modified before running, or
2. The provided script should be used as a reference/replacement


Add a note clarifying the intent:

```diff
 **Note:** You may need to modify `disagg.sh` to export the tracing environment
 variables before starting each component:

+**Option A:** Modify your existing `disagg.sh` to add the following lines at the top, or
+**Option B:** Use the template below as a complete `disagg.sh` replacement:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 0b71cf8.

⛔ Files ignored due to path filters (1)
  • docs/observability/trace.png is excluded by !**/*.png
📒 Files selected for processing (20)
  • README.md (1 hunks)
  • deploy/docker-compose.yml (1 hunks)
  • deploy/docker-observability.yml (1 hunks)
  • deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1 hunks)
  • deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1 hunks)
  • deploy/observability/k8s/logging/README.md (1 hunks)
  • deploy/observability/tempo-datasource.yml (1 hunks)
  • deploy/tracing/docker-compose.yml (0 hunks)
  • docs/kubernetes/observability/logging.md (4 hunks)
  • docs/kubernetes/observability/metrics.md (1 hunks)
  • docs/observability/README.md (1 hunks)
  • docs/observability/health-checks.md (1 hunks)
  • docs/observability/logging.md (5 hunks)
  • docs/observability/metrics-developer-guide.md (1 hunks)
  • docs/observability/metrics.md (2 hunks)
  • docs/observability/prometheus-grafana.md (1 hunks)
  • docs/observability/tracing.md (4 hunks)
  • lib/bindings/python/examples/metrics/README.md (1 hunks)
  • lib/bindings/python/rust/lib.rs (2 hunks)
  • lib/runtime/src/logging.rs (1 hunks)
💤 Files with no reviewable changes (1)
  • deploy/tracing/docker-compose.yml
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • docs/observability/README.md
  • docs/observability/prometheus-grafana.md
  • docs/observability/metrics-developer-guide.md
  • deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
  • docs/observability/metrics.md
  • lib/bindings/python/examples/metrics/README.md
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/bindings/python/rust/lib.rs
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • docs/kubernetes/observability/logging.md
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.

Applied to files:

  • docs/observability/prometheus-grafana.md
  • docs/observability/metrics.md
📚 Learning: 2025-09-16T00:21:44.912Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: deploy/metrics/README.md:43-43
Timestamp: 2025-09-16T00:21:44.912Z
Learning: Graham (grahamking) has provided guidance in PR 2914 that metrics should end with _count or _total to indicate units, but this needs to be clarified whether it applies to all metric types or just counters, as Prometheus conventions differ between counters (should have _total) and gauges (should not have _total).

Applied to files:

  • docs/observability/metrics-developer-guide.md
📚 Learning: 2025-09-24T19:06:57.156Z
Learnt from: ryan-lempka
Repo: ai-dynamo/dynamo PR: 3062
File: lib/llm/src/audit/sink.rs:15-27
Timestamp: 2025-09-24T19:06:57.156Z
Learning: In the audit logging system, full request/response logging requires both DYN_AUDIT_ENABLED=1 environment variable and explicit store=true in the request. Without store=true, only usage statistics are logged (UsageOnly mode). The stderr sink is the initial implementation with plans for additional sinks in the future.

Applied to files:

  • docs/observability/logging.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-06-05T01:46:15.509Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Applied to files:

  • docs/observability/metrics.md
🧬 Code graph analysis (2)
lib/runtime/src/logging.rs (1)
lib/runtime/src/config.rs (1)
  • env_is_truthy (422-427)
lib/bindings/python/rust/lib.rs (1)
lib/runtime/src/config.rs (1)
  • env_is_truthy (422-427)
🪛 GitHub Check: Check for broken markdown links
lib/bindings/python/examples/metrics/README.md

[failure] 10-10:
Broken link: Metrics Developer Guide - Python Section - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/lib/bindings/python/examples/metrics/README.md?plain=1#L10

🪛 LanguageTool
docs/kubernetes/observability/logging.md

[uncategorized] ~144-~144: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...loyment, namespace, and component type (e.g frontend, worker, etc).

(E_G)


[style] ~144-~144: In American English, abbreviations like “etc.” require a period.
Context: ...d component type (e.g frontend, worker, etc).

(ETC_PERIOD)

docs/observability/prometheus-grafana.md

[duplication] ~28-~28: Possible typo: you repeated a word.
Context: ...tes Install these on your machine: - Docker - [Docker Compose](https://docs.docker.com/compos...

(ENGLISH_WORD_REPEAT_RULE)

docs/observability/metrics-developer-guide.md

[uncategorized] ~36-~36: Loose punctuation mark.
Context: ...Methods - .metrics().create_counter(): Create a counter metric - `.metrics().c...

(UNLIKELY_OPENING_PUNCTUATION)

docs/observability/metrics.md

[uncategorized] ~102-~102: Loose punctuation mark.
Context: ... - dynamo_component_inflight_requests: Requests currently being processed (gau...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~113-~113: Loose punctuation mark.
Context: ...dynamo_component_kvstats_active_blocks: Number of active KV cache blocks curren...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~126-~126: Loose punctuation mark.
Context: ...unctionality: - dynamo_preprocessor_*: Metrics specific to preprocessor compon...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~132-~132: Loose punctuation mark.
Context: ...: - dynamo_frontend_inflight_requests: Inflight requests (gauge) - `dynamo_fro...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~150-~150: Loose punctuation mark.
Context: ... dynamo_frontend_model_total_kv_blocks: Total KV blocks available for a worker ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~157-~157: Loose punctuation mark.
Context: ...- dynamo_frontend_model_context_length: Maximum context length for a worker ser...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 markdownlint-cli2 (0.18.1)
docs/observability/metrics.md

62-62: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


172-172: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


182-182: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (.)
🔇 Additional comments (27)
lib/runtime/src/logging.rs (1)

147-150: LGTM! Truthy value support improves usability.

The change from exact "1" matching to env_is_truthy() improves the developer experience by accepting common truthy values (1, true, on, yes). The documentation comment accurately reflects the new behavior.

lib/bindings/python/rust/lib.rs (2)

128-134: LGTM! Consistent truthy value support.

The change to use env_is_truthy() and the updated warning message are consistent with the changes in lib/runtime/src/logging.rs. The warning text correctly reflects that any truthy value (not just "=1") will trigger the deferred initialization.


449-453: LGTM! Proper deferred initialization.

The truthy check here correctly defers logging initialization until the Tokio runtime is available, which is required for the OTEL exporter. This mirrors the pattern in the _core module initialization.

docs/observability/metrics-developer-guide.md (1)

1-270: Excellent comprehensive developer guide!

This is a well-structured and thorough metrics developer guide that covers:

  • Clear getting started instructions with environment variables
  • Both Rust and Python API usage with practical examples
  • Vector metrics with labels
  • Advanced features (custom buckets, constant labels)
  • Update patterns and examples

The document effectively consolidates metrics guidance and provides a strong foundation for developers working with Dynamo metrics.

docs/observability/health-checks.md (2)

14-24: Well-documented environment variables.

The new environment variables table provides clear, comprehensive information with descriptions, defaults, and examples. This follows a consistent format with other observability documentation.


25-48: Helpful Getting Started section.

The new Getting Started section provides clear, practical examples for enabling and testing health checks on a single GPU. This aligns well with the PR's objective to improve observability documentation and provide better getting-started guidance.

docs/observability/prometheus-grafana.md (3)

1-21: Improved overview and environment variables documentation.

The updated overview clearly focuses on single-machine demo setup, and the environment variables table follows the consistent format used across other observability documentation.


22-84: Excellent Getting Started guide.

The new Getting Started section provides a comprehensive, step-by-step guide that covers:

  • Prerequisites and installation
  • Starting infrastructure and observability services
  • Launching Dynamo components with proper configuration
  • Testing with example requests
  • Accessing web interfaces with helpful notes about remote access

This is very user-friendly and practical.


88-167: Well-structured topology and configuration guidance.

The topology diagram provides a clear visualization of service relationships, and the configuration/troubleshooting sections offer practical guidance. The reference to the Metrics Developer Guide is a good addition for users who want to create custom metrics.

docs/observability/logging.md (3)

27-52: Improved environment variables documentation and Getting Started section.

The restructured environment variables table follows the consistent format used across observability documentation, and the new Getting Started section provides clear, practical examples for enabling JSONL logging.


102-141: Clear explanation of OpenTelemetry logging integration.

The renamed section "Logging of OpenTelemetry Tracing" better reflects its purpose, and the clarification about the overlap with distributed tracing documentation is helpful. The note distinguishing between seeing trace context in logs versus exporting to a trace backend is particularly useful.


142-258: Comprehensive trace and span information examples.

The updated section title accurately reflects the content, and the detailed examples effectively demonstrate how trace context appears in logs. The examples showing custom request IDs with the x-request-id header are particularly practical for correlating traces with application-level request tracking.

lib/bindings/python/examples/metrics/README.md (1)

10-10: Link is valid; no changes needed.

Verification confirms the documentation link is functional. The file docs/observability/metrics-developer-guide.md exists and contains the Python Metrics API section referenced by the anchor. The static analysis flag appears to be a false positive.

deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1)

1005-1005: Dashboard title update is consistent with observability reorganization.

The title change aligns with the PR's generic naming convention and matches the corresponding update in deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json. This is a safe cosmetic change with no impact on dashboard functionality or metrics queries.

deploy/observability/tempo-datasource.yml (1)

12-12: Tempo datasource correctly set as non-default.

Setting isDefault: false appropriately makes Prometheus the default datasource while allowing Tempo queries when explicitly selected. This is the correct configuration for a multi-datasource observability setup where metrics are primary and tracing is supplementary.

deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1)

1023-1023: Dashboard title matches configmap and supports generic naming convention.

The title update is consistent with the parallel change in the ConfigMap and maintains the semantic meaning of the dashboard while supporting the new "generic" naming introduced in this PR. All dashboard panels and queries remain intact.

deploy/observability/k8s/logging/README.md (1)

3-3: Relative path correctly updated for directory relocation.

The path has been properly adjusted to reflect the directory restructuring from deploy/logging/ to deploy/observability/k8s/logging/. The new path ../../../../docs/kubernetes/observability/logging.md correctly navigates from the new file location to the documentation.

docs/kubernetes/observability/metrics.md (1)

130-132: Documentation simplified and clarified.

Replacing the multi-step pushd/popd pattern with a single direct kubectl apply command improves clarity and reduces cognitive load. The simplified instruction is easier to follow and less error-prone for users.

README.md (2)

104-107: Code block syntax highlighting improved.

Specifying bash as the code fence language enables proper syntax highlighting for users reading the README, improving clarity and professional appearance.


109-117: New observability section is clear and well-positioned.

The optional observability stack section appropriately documents the separate deployment path and clearly communicates:

  • What services are included (Prometheus, Grafana, Tempo, metrics exporters)
  • How to deploy (docker compose -f deploy/docker-observability.yml)
  • How to access it (Grafana credentials and port)

Marking this as optional correctly reflects that it's not required for basic Dynamo operation, improving documentation clarity for new users. Credentials (dynamo/dynamo) and port (3000) match the PR objectives.

docs/observability/README.md (1)

1-32: Observability documentation hub verified and complete.

The README effectively centralizes observability configuration with well-organized tables, shared variable annotations (†), and appropriate separation between user and developer guides. All referenced guide files have been confirmed to exist.

deploy/docker-observability.yml (2)

106-127: Clarify Grafana security configuration.

The Grafana service has several security-related settings that should be documented or reviewed:

  • user: root (Line 90 for Tempo is reasonable for file permissions, but Line 106-127 uses default user) — consider whether root privileges are necessary for Grafana
  • GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false (Line 125) — contradicts the intent of the other security flags; this will NOT disable admin creation despite the name
  • Default credentials dynamo/dynamo are acceptable for local development but should be rotated in production

Consider adding a comment to clarify these are development defaults and should not be used in production, or add safeguards for production deployments.


64-66: All referenced configuration files are present.

Verification confirms that all required configuration files and directories referenced in the compose file exist in the deploy/observability/ directory. The observability stack is properly configured and ready for use.

docs/observability/metrics.md (2)

29-54: Environment variable names and defaults are correct.

Verification confirms DYN_SYSTEM_ENABLED and DYN_SYSTEM_PORT match the codebase exactly. The system metrics server is disabled by default and enabled when DYN_SYSTEM_ENABLED=true. The example port 8081 is consistent with actual usage throughout the codebase.
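A small standalone sketch of the enable/port pattern described above (the helper and its signature are hypothetical, and 8081 is only the example port from the docs, used here as an illustrative fallback):

```rust
/// Hypothetical helper mirroring the documented behavior: the system
/// metrics server stays off unless DYN_SYSTEM_ENABLED is truthy, and the
/// port falls back to the docs' example value 8081 when unset/unparsable.
fn system_server_config(enabled: Option<&str>, port: Option<&str>) -> (bool, u16) {
    let on = enabled
        .map(|v| matches!(v.to_lowercase().as_str(), "1" | "true" | "on" | "yes"))
        .unwrap_or(false);
    let port = port.and_then(|v| v.parse().ok()).unwrap_or(8081);
    (on, port)
}

fn main() {
    // Disabled by default; DYN_SYSTEM_ENABLED=true turns it on.
    println!("{:?}", system_server_config(None, None));
    println!("{:?}", system_server_config(Some("true"), Some("9090")));
}
```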


96-120: Fix documented metric names to match implementation.

Two documentation errors found:

  1. Line 106: dynamo_component_system_uptime_seconds should be dynamo_component_uptime_seconds (remove "system_" prefix)

  2. Lines 110-113: KVStats metrics are missing the dynamo_component_ prefix in documentation. The actual metric names are:

    • kvstats_active_blocks (not dynamo_component_kvstats_active_blocks)
    • kvstats_total_blocks (not dynamo_component_kvstats_total_blocks)
    • kvstats_gpu_cache_usage_percent (not dynamo_component_kvstats_gpu_cache_usage_percent)
    • kvstats_gpu_prefix_cache_hit_rate (not dynamo_component_kvstats_gpu_prefix_cache_hit_rate)

The work handler metrics (lines 102-105) are correctly named in the documentation.

⛔ Skipped due to learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
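The naming corrections in the comment above can be pinned down as constants; the strings come straight from the review's findings, while the surrounding code is an illustrative sketch rather than Dynamo source:

```rust
/// Metric names as reported by the review: the uptime metric carries the
/// `dynamo_component_` prefix (with no `system_` segment), while the
/// KVStats metrics are exported without the component prefix.
const UPTIME: &str = "dynamo_component_uptime_seconds";
const KVSTATS_METRICS: [&str; 4] = [
    "kvstats_active_blocks",
    "kvstats_total_blocks",
    "kvstats_gpu_cache_usage_percent",
    "kvstats_gpu_prefix_cache_hit_rate",
];

fn main() {
    println!("{UPTIME}");
    for name in KVSTATS_METRICS {
        // Sanity check matching the review: no component prefix here.
        assert!(!name.starts_with("dynamo_component_"));
        println!("{name}");
    }
}
```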
docs/observability/tracing.md (2)

18-25: All environment variable defaults verified and confirmed correct:

| Variable | Documentation Default | Implementation Default | Status |
|---|---|---|---|
| DYN_LOGGING_JSONL | false | env_is_truthy() → false | ✅ |
| OTEL_EXPORT_ENABLED | false | env_is_truthy() → false | ✅ |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | http://localhost:4317 | http://localhost:4317 | ✅ |
| OTEL_SERVICE_NAME | dynamo | dynamo | ✅ |

Tempo OTLP gRPC port 4317 confirmed in deploy/docker-observability.yml (line 96).


56-58: No issues found—environment variable syntax is correct.

The OTEL_EXPORT_ENABLED=true syntax in docs/observability/tracing.md is valid. The application's is_truthy function (lib/config/src/lib.rs:20-22, lib/runtime/src/config.rs:396-398) explicitly accepts "true" as a truthy value via pattern matching: matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes"). The example aligns with the documented acceptable values listed in the code comment.
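The acceptance set quoted above is easy to check in isolation; here is a minimal standalone copy of the pattern (the real function lives in lib/config/src/lib.rs and lib/runtime/src/config.rs):

```rust
/// Standalone copy of the truthiness check described in the comment above:
/// "1", "true", "on", "yes" (any case) are truthy; everything else is not.
fn is_truthy(val: &str) -> bool {
    matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes")
}

fn main() {
    for v in ["true", "TRUE", "on", "Yes", "1", "false", "off", ""] {
        println!("{v:?} -> {}", is_truthy(v));
    }
}
```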

Comment on lines 4 to 5
# Bare minimum infrastructure services for Dynamo.
# For observability (metrics, tracing, dashboards), use docker-compose-observability.yml
Contributor

@coderabbitai coderabbitai bot Nov 7, 2025


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the observability compose file name.

The comment references docker-compose-observability.yml, but the AI summary mentions docker-observability.yml. Please verify the correct filename.


🏁 Script executed:

#!/bin/bash
# Description: Verify the observability compose file name

if [ -f "deploy/docker-observability.yml" ]; then
    echo "✓ Found: deploy/docker-observability.yml"
elif [ -f "deploy/docker-compose-observability.yml" ]; then
    echo "✓ Found: deploy/docker-compose-observability.yml"
else
    echo "✗ Neither deploy/docker-observability.yml nor deploy/docker-compose-observability.yml found"
fi

Length of output: 40


Update comment to reference correct observability compose file name.

The comment at lines 4-5 references docker-compose-observability.yml, but the actual file is docker-observability.yml. Update the comment to reference the correct filename:

# For observability (metrics, tracing, dashboards), use docker-observability.yml
🤖 Prompt for AI Agents
In deploy/docker-compose.yml around lines 4 to 5, the inline comment references
the wrong observability compose filename; update the comment to reference
docker-observability.yml instead of docker-compose-observability.yml so the
comment reads: "# For observability (metrics, tracing, dashboards), use
docker-observability.yml".

Contributor Author

Done.

Contributor

@keivenchang Thank you for the quick fix! 🎉


🧠 Learnings used
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.


Comment on lines +125 to 134
curl -H 'Content-Type: application/json' \
-H 'x-request-id: test-trace-001' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"max_completion_tokens": 100,
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' \
http://localhost:8000/v1/chat/completions
Contributor

@coderabbitai coderabbitai bot Nov 7, 2025

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify x-request-id header is actually used for trace correlation.

The documentation at lines 125-126 highlights the x-request-id header for trace correlation in Grafana searches (line 147). Confirm this header is:

  1. Actually captured by the tracing system
  2. Properly mapped to the x_request_id tag mentioned at line 147
  3. Consistent with OpenTelemetry standards or custom implementation

🏁 Script executed:

#!/bin/bash
# Search for x-request-id header handling

echo "Searching for x-request-id header handling:"
rg -n 'x-request-id|x_request_id' --type=py --type=rust -C 2

echo ""
echo "Searching for OpenTelemetry attribute mapping:"
rg -n 'attribute|tag|trace' --type=py --type=rust | grep -i 'request.*id'

Length of output: 10886



🏁 Script executed:

#!/bin/bash
# Inspect the actual documentation to see what's claimed
echo "=== Documentation content around lines 125-150 ==="
sed -n '125,150p' docs/observability/tracing.md

echo ""
echo "=== Check for OpenTelemetry setup/configuration ==="
rg -n 'opentelemetry|otel|tracer|propagat' --type=rust --type=py -i -C 1 | head -50

Length of output: 4490



🏁 Script executed:

#!/bin/bash
# Check how tracing spans are connected to OpenTelemetry and if x_request_id becomes an OTEL attribute
echo "=== Looking for span attribute setting in logging.rs ==="
grep -n 'set_attribute\|record\|set_tag' lib/runtime/src/logging.rs | head -20

echo ""
echo "=== Check DistributedTracingContext struct definition ==="
grep -B5 -A15 'struct DistributedTracingContext' lib/runtime/src/logging.rs

echo ""
echo "=== Check how get_distributed_tracing_context extracts headers ==="
grep -B5 -A30 'fn get_distributed_tracing_context' lib/runtime/src/logging.rs

Length of output: 2109



🏁 Script executed:

#!/bin/bash
# Find DistributedTraceContext struct definition
echo "=== DistributedTraceContext struct full definition ==="
rg -A20 'pub struct DistributedTraceContext' lib/runtime/src/logging.rs

echo ""
echo "=== Check setup_logging function for OpenTelemetry layer ==="
grep -A50 'fn setup_logging()' lib/runtime/src/logging.rs | head -70

echo ""
echo "=== Check if OpenTelemetrySpanExt methods are used for x_request_id ==="
rg -n 'with_values|set_attribute|set_error|set_span_error' lib/runtime/src/logging.rs

echo ""
echo "=== Look for where x_request_id might be set as OTEL attribute ==="
rg -B3 -A3 'x_request_id' lib/llm/src/http/service/openai.rs | head -30

Length of output: 3948



🏁 Script executed:

#!/bin/bash
# Get complete setup_logging function
echo "=== Complete setup_logging function ==="
sed -n '/^fn setup_logging/,/^fn [a-z]/p' lib/runtime/src/logging.rs | head -150

echo ""
echo "=== Search for tracing_opentelemetry layer initialization ==="
rg -B2 -A5 'tracing_opentelemetry|TelemetryLayer' lib/runtime/src/logging.rs

Length of output: 2049


The x-request-id header is captured but not exported as a searchable OpenTelemetry attribute.

The implementation correctly extracts the x-request-id header (logging.rs:260-261) and includes it in tracing spans via info_span!. However, there is a gap between the documentation and implementation:

  • What works: The header is captured and used internally for request tracking and logging context
  • What's missing: The x_request_id field is never explicitly set as an OpenTelemetry attribute using set_attribute() or similar OTEL methods
  • Result: Grafana Tempo cannot search traces by x_request_id tag as documented at line 147, since custom fields in tracing spans are not automatically converted to searchable OTEL attributes by the tracing_opentelemetry layer

To make it searchable in Tempo as documented, the code needs to explicitly set x_request_id as an OTEL attribute on the active span, likely in the span extension setup (around logging.rs:250-280) or in the on_enter hook where the DistributedTraceContext is stored.

🤖 Prompt for AI Agents
In docs/observability/tracing.md around lines 125 to 134 (and implementation
area logging.rs ~250-280), the request ID header is added to tracing spans for
logging but never set as an explicit OpenTelemetry attribute, so Tempo cannot
search by it; update the span setup (e.g., in the span extension or on_enter
hook where DistributedTraceContext is stored) to retrieve the active
OpenTelemetry span and call the OTEL attribute setter to add a key like
"x_request_id" (or "http.request_id") with the extracted header value using the
appropriate OpenTelemetry KeyValue API; ensure the necessary OpenTelemetry
trait/import is available and do this immediately after creating or entering the
span so the attribute is exported with the trace.

Contributor Author

I am able to see this custom metric in Tempo even without set_attribute() or similar OTEL methods.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how tracing_opentelemetry converts span fields to OTEL attributes
echo "=== Check Cargo.toml for tracing_opentelemetry version ==="
rg -n 'tracing.?opentelemetry' --type=toml

echo ""
echo "=== Look for any documentation or comments about span field conversion ==="
rg -n -i 'span.*field.*attribute|field.*otel|automatic.*attribute' --type=rust -C 2

Length of output: 615


🌐 Web query:

tracing_opentelemetry span fields OpenTelemetry attributes automatic conversion

💡 Result:

Short summary — how tracing_opentelemetry turns tracing span fields into OpenTelemetry attributes

  • Reserved otel.* fields: fields prefixed with "otel." are handled specially (otel.name, otel.kind, otel.status_code, otel.status_description / otel.status_message). These override span name/kind/status when present. [1][3].
  • Semantic-convention pass-through: any field name that matches OpenTelemetry semantic convention keys (e.g. "http.method", "db.system", "server.port") is passed through as an OTEL attribute; you can also set those keys directly as tracing fields. [1][3].
  • Type conversion: tracing field values are converted to OpenTelemetry attribute values via the opentelemetry::Value/KeyValue APIs (set_attribute takes impl Into&lt;Key&gt; / impl Into&lt;Value&gt;). Supported value kinds follow the opentelemetry Value model (strings, booleans, integers, floats and arrays of those) — you use set_attribute or let tracing fields be converted automatically. [4][5].
  • Explicit API: OpenTelemetrySpanExt provides set_attribute, add_event, set_status, context, etc., to add OTEL attributes/events directly (bypassing tracing) when you need dynamic keys or >32 attributes. Tracing fields still supersede attributes set via set_attribute if they conflict. [4].
  • Optional propagation of extra metadata: the OpenTelemetryLayer has toggles to include exception fields, source location, thread info, verbosity level, target, tracked inactivity, and whether to activate OTEL Context on span entry — these control what extra tracing span fields become OTEL attributes. [2][3].

Useful references:

  • tracing_opentelemetry crate docs (special otel.* fields, semantic conventions, layer options). [1][2][3]
  • opentelemetry::Value / Span attribute docs (Value kinds and set_attribute behavior). [4][5]

References
[1] tracing_opentelemetry — crate overview / special fields.
[2] OpenTelemetryLayer — layer options (exceptions, location, threads, level, etc.).
[3] tracing_opentelemetry lib.rs source (special fields, semantic conventions).
[4] tracing_opentelemetry::OpenTelemetrySpanExt (set_attribute, add_event).
[5] opentelemetry::Value / Span attribute docs.


🏁 Script executed:

#!/bin/bash
# Check the complete setup_logging implementation to see OpenTelemetryLayer configuration
echo "=== Full setup_logging function with OpenTelemetry configuration ==="
sed -n '/^fn setup_logging/,/^}/p' lib/runtime/src/logging.rs | head -200

echo ""
echo "=== Check for any OpenTelemetryLayer builder/configuration ==="
rg -n 'OpenTelemetryLayer|with_|::new\(\)' lib/runtime/src/logging.rs -A 2 -B 2

Length of output: 10915


@keivenchang Thank you for testing and confirming! You're absolutely right—looking at the implementation, tracing_opentelemetry::layer() with the default configuration does automatically convert tracing span fields (like x_request_id set in info_span!) to OpenTelemetry attributes that Tempo can search by.

I see the standard setup at lines 794-796 in lib/runtime/src/logging.rs:

tracing_opentelemetry::layer()
    .with_tracer(tracer)
    .with_filter(otel_filter_layer),

This default behavior handles the field-to-attribute conversion without needing explicit set_attribute() calls. The documentation at lines 147 showing the search by x_request_id tag is accurate. Sorry for the false alarm—the implementation is working as documented! 🎉



@@ -1,35 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor Author

This is now subsumed in the deploy/docker-observability.yml file (plus other stuff).

networks:
- server
- monitoring

Contributor Author

Moving Prometheus/Grafana/dcgm out.

@@ -0,0 +1,137 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor Author

This new file is a merge of the old deploy/tracing/docker-compose.yml and the old deploy/docker-compose.yml, containing only observability related services.

- Fix docker-compose.yml comment to reference correct filename (docker-observability.yml not docker-compose-observability.yml)
- Fix abbreviation formatting in kubernetes logging docs (e.g., etc.)

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang changed the title Keivenchang/dis 980 consolidate otel docker compose files refactor: consolidate OTEL docker compose files Nov 7, 2025
@keivenchang keivenchang changed the title refactor: consolidate OTEL docker compose files refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) Nov 7, 2025
Correct relative path to metrics-developer-guide.md (needs 6 levels up, not 5)

Signed-off-by: Keiven Chang <[email protected]>