Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Nov 7, 2025

Overview:

This PR consolidates Dynamo's observability infrastructure to provide a more consistent and easier-to-follow experience. All observability documentation now follows a uniform structure (Overview → Environment Variables → Getting Started → Details), making it simple to find what you need whether you're setting up metrics, tracing, logging, or health checks.

The refactored structure separates concerns clearly: docker-compose.yml handles only core infrastructure (NATS & etcd), while docker-observability.yml provides the complete observability stack. All observability configurations are now centralized under deploy/observability/, eliminating the previous scattered structure across deploy/metrics/, deploy/logging/, and deploy/tracing/.

Documentation improvements include:

  • Consistent structure: Every observability doc follows the same pattern with Environment Variables tables and Getting Started sections
  • Single entry point: docs/observability/README.md serves as the unified gateway to all observability topics
  • Practical examples: Each guide now includes single-GPU Getting Started examples for quick testing
  • Clear separation: Prometheus/Grafana guide focuses on demo setup, while detailed metrics reference lives in metrics.md
  • Developer resources: New Metrics Developer Guide for creating custom metrics in Rust/Python

Details:

  • Refactored services into separate docker-compose.yml (NATS & etcd only) and docker-observability.yml (Prometheus, Grafana, Tempo, exporters)
  • Consolidated observability configs under deploy/observability/ (previously scattered across metrics/, logging/, tracing/)
  • Reorganized Kubernetes-specific observability configs under deploy/observability/k8s/
  • Standardized all observability docs with consistent sections: Overview, Environment Variables, Getting Started, and detailed reference
  • Created new docs/observability/README.md as unified entry point with navigation table
  • Refactored Prometheus/Grafana guide to focus on single-machine demos (removed detailed metrics explanations, now in metrics.md)
  • Added Metrics Developer Guide for creating custom metrics in Rust/Python
  • Enhanced all docs with Environment Variables tables for easy reference
  • Added practical Getting Started sections with single-GPU examples for quick testing
  • Enhanced tracing docs with x-request-id correlation guidance for easier debugging
  • Updated env_is_truthy utility usage for OTLP configuration consistency
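
The `env_is_truthy` utility itself lives in `lib/runtime/src/config.rs` and is not shown in this PR body. Assuming it accepts the values listed in the commits (true/1/on/yes) case-insensitively, its behavior can be sketched in Python as:

```python
import os

# Accepted set per the commit notes; case-insensitive matching is an assumption.
TRUTHY_VALUES = {"1", "true", "on", "yes"}

def env_is_truthy(name: str) -> bool:
    """Return True if the named environment variable is set to a truthy value."""
    value = os.environ.get(name)
    if value is None:
        return False
    return value.strip().lower() in TRUTHY_VALUES

# Previously only OTEL_EXPORT_ENABLED=1 worked; now any truthy spelling does.
os.environ["OTEL_EXPORT_ENABLED"] = "true"
print(env_is_truthy("OTEL_EXPORT_ENABLED"))
```

Setting `OTEL_EXPORT_ENABLED=true`, `=on`, or `=yes` now behaves the same as the old `=1`.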

Where should the reviewer start?

Review docs/observability/README.md, which serves as the entry point to all observability documentation. Notice how metrics.md, tracing.md, health-checks.md, and logging.md all follow the same consistent structure: Overview → Environment Variables → Getting Started → Details. This uniform pattern makes it easy to quickly find configuration options (always in a table) and get started with practical examples (always in a Getting Started section).
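
For the x-request-id correlation guidance mentioned in the tracing docs, a client attaches its own request ID as a header so the same ID can later be searched for in logs and traces. A minimal sketch (the endpoint, port, and payload are placeholders; how the frontend propagates the header is described in tracing.md, not here):

```python
import urllib.request
import uuid

request_id = str(uuid.uuid4())

# Build (but do not send) a request carrying a custom x-request-id header.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    data=b'{"model": "example", "messages": []}',
    headers={"Content-Type": "application/json", "x-request-id": request_id},
    method="POST",
)

# urllib normalizes header names to capitalized form.
print(req.get_header("X-request-id") == request_id)
```

Grepping logs for that UUID then links a single client call to its spans.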

Then check deploy/observability/ to see how all observability configs are now centralized in one location instead of being scattered across multiple directories.
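
The split described above leaves docker-compose.yml with only NATS and etcd, while docker-observability.yml carries the rest. A condensed, illustrative skeleton of the observability file (service names follow the PR description; images and tags are placeholders, not the actual file contents):

```yaml
# Illustrative skeleton only — see deploy/docker-observability.yml for
# real images, ports, volumes, and network assignments.
services:
  prometheus:
    image: prom/prometheus:latest
  grafana:
    image: grafana/grafana:latest
  tempo:
    image: grafana/tempo:latest
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
  nats-exporter:
    image: natsio/prometheus-nats-exporter:latest
```

Core infrastructure starts with `docker compose up -d` from `deploy/`, and the observability stack layers on with `docker compose -f docker-observability.yml up -d`.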

BEFORE:
=======

deploy/
├── docker-compose.yml (NATS + etcd + Prometheus + Grafana + exporters + monitoring network)
├── metrics/
│   ├── grafana-datasources.yml
│   ├── prometheus.yml
│   ├── grafana_dashboards/
│   │   ├── grafana-dashboard-providers.yml
│   │   ├── grafana-dcgm-metrics.json
│   │   ├── grafana-dynamo-dashboard.json
│   │   └── grafana-kvbm-dashboard.json
│   └── k8s/
│       ├── README.md
│       ├── frontend-podmonitor.yaml
│       ├── planner-podmonitor.yaml
│       ├── worker-podmonitor.yaml
│       └── grafana-dynamo-dashboard-configmap.yaml
├── logging/
│   ├── README.md
│   ├── grafana/
│   │   ├── dashboard.json
│   │   ├── logging-dashboard.yaml
│   │   └── loki-datasource.yaml
│   └── values/
│       ├── alloy-values.yaml
│       └── loki-values.yaml
└── tracing/
    ├── docker-compose.yml (Tempo + Grafana) [DELETED]
    ├── README.md
    ├── trace.png
    ├── tempo.yaml
    └── grafana/provisioning/datasources/
        └── tempo.yaml

docs/observability/
├── (no README.md)
├── metrics.md (basic)
├── health-checks.md (basic)
├── logging.md (basic)
├── prometheus-grafana.md (detailed)
└── (no tracing docs here)


AFTER (with +/- line counts):
==============================

deploy/
├── docker-compose.yml -120
├── docker-observability.yml +137
└── observability/
    ├── tempo-datasource.yml -1/+1
    ├── grafana_dashboards/
    │   └── grafana-dynamo-dashboard.json -1/+1
    └── k8s/
        ├── grafana-dynamo-dashboard-configmap.yaml -1/+1
        └── logging/
            └── README.md -1/+1

docs/observability/
├── README.md +33
├── metrics.md +112
├── metrics-developer-guide.md +450
├── health-checks.md +24
├── logging.md +1
├── tracing.md +60
└── prometheus-grafana.md -325

docs/kubernetes/observability/
├── logging.md -6/+6
└── metrics.md -3/+1

lib/runtime/src/logging.rs -2/+2

README.md +15

lib/runtime/examples/metrics_python/README.md -150/+4

DELETED:
deploy/tracing/docker-compose.yml -35

Related Issues:

Relates to DIS-980

/coderabbit profile chill

…bility.yml

- Moved metrics (Prometheus, Grafana, DCGM, NATS exporter) and tracing (Tempo) into single docker-observability.yml
- Simplified docker-compose.yml to only include core infrastructure (NATS, etcd)
- Reorganized observability files: deploy/metrics/* and deploy/tracing/* -> deploy/observability/*
- Updated documentation: deploy/tracing/README.md -> docs/observability/tracing.md
- Unified Grafana configuration to support both Prometheus and Tempo datasources
- Single observability stack now runs on unified 'server' network for better integration

Signed-off-by: Keiven Chang <[email protected]>
- Move deploy/logging to deploy/observability/k8s/logging for better organization
- Move trace.png to docs/observability/ to be alongside tracing.md
- Fix vllm lazy import of kvbm to avoid Tokio runtime initialization issues
- Add log level documentation explaining DEBUG vs INFO for trace visibility
- Update all references to reflect new paths
- Clarify OTEL environment variable defaults and behavior

Signed-off-by: Keiven Chang <[email protected]>
- Create docs/observability/README.md as central hub
- Split metrics-developer-guide.md from prometheus-grafana.md
- Standardize all docs: Overview, Environment Variables, Getting Started
- Update env variable parsing to accept truthy values (true/1/on/yes)
- Consolidate prometheus-grafana.md as quick start guide
- Improve metrics.md as reference document
- Clarify tracing requirements and overlap with logging
- Fix double space and grammatical issues

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang self-assigned this Nov 7, 2025
@keivenchang keivenchang requested review from a team as code owners November 7, 2025 01:54
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

The changes reorganize observability infrastructure by separating observability services into a standalone compose file, consolidating and restructuring Kubernetes observability configurations, updating environment variable handling for OTEL exports to support truthy values, expanding observability documentation with guides for metrics, tracing, logging, and health checks, and streamlining Docker Compose setup instructions.

Changes

Cohort / File(s) Summary
Docker Compose Observability Reorganization
README.md, deploy/docker-compose.yml, deploy/docker-observability.yml, deploy/tracing/docker-compose.yml
Extracted observability services (Prometheus, Grafana, Tempo, DCGM exporter, NATS exporter) from main docker-compose.yml into dedicated deploy/docker-observability.yml; removed tracing services from deploy/tracing/docker-compose.yml; updated README with observability stack setup instructions using bash-specific code block.
Grafana Dashboard Configuration Updates
deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json, deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml, docs/kubernetes/observability/metrics.md
Renamed Grafana dashboard title to "Dynamo Dashboard (generic)"; simplified Kubernetes metrics documentation to use single kubectl apply command with updated path.
Kubernetes Observability Path Restructuring
deploy/observability/k8s/logging/README.md, docs/kubernetes/observability/logging.md, docs/kubernetes/observability/metrics.md
Updated relative path references from deploy/logging/* to deploy/observability/k8s/logging/*; adjusted Helm and configuration artifact paths for Loki, Alloy, and Grafana provisioning.
Observability Documentation Suite
docs/observability/README.md, docs/observability/health-checks.md, docs/observability/logging.md, docs/observability/metrics.md, docs/observability/metrics-developer-guide.md, docs/observability/prometheus-grafana.md, docs/observability/tracing.md
Added comprehensive observability guides with environment variables, getting started sections, metric categories, runtime hierarchy, and practical examples; restructured tracing and logging documentation; condensed Prometheus/Grafana guide to single-machine demo setup; added new metrics developer guide for custom metrics creation across Rust and Python.
Tempo Datasource & Python Examples
deploy/observability/tempo-datasource.yml, lib/bindings/python/examples/metrics/README.md
Changed Tempo datasource isDefault from true to false; pared down Python metrics example documentation to concise "Running Examples" section with link to developer guide.
Environment Variable Handling Refactor
lib/bindings/python/rust/lib.rs, lib/runtime/src/logging.rs
Updated OTEL_EXPORT_ENABLED checks from string equality "1" to truthy value support using env_is_truthy() utility; updated corresponding warning messages and documentation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Docker observability file: New comprehensive service stack with multiple dependencies—verify image versions, port mappings, volume configurations, and network assignments are correct
  • Documentation scope: Extensive rewrites across 7+ observability docs with environment variable tables and new sections—confirm consistency of variable names, defaults, and examples across all guides
  • Path migrations: Multiple files updated with new kubernetes observability paths—verify all paths are correctly updated and no dangling references remain
  • Environment variable refactoring: Changes from numeric "1" to truthy values across Rust code—ensure all call sites properly use the new env_is_truthy() utility and backward compatibility is maintained

Poem

🐰 Hop hop, the observability stack does gleam,
Separated services fulfilling every dream,
Metrics, traces, and logs now neatly arranged,
Truthiness and defaults—everything changed!
From chaos to clarity, what a fine sight,
Documentation and dashboards now burning so bright!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The pull request description is comprehensive and well-structured, following the template with clear Overview, Details, Where to start, and Related Issues sections.
Title check ✅ Passed The title accurately reflects the main objective: consolidating observability files (docker-compose, markdown, configs) from scattered directories into organized deploy/observability and docs/observability structures.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (6)
docs/kubernetes/observability/logging.md (1)

49-49: Path updates are correct; fix minor formatting at line 144.

All referenced configuration files exist in the new centralized paths under deploy/observability/k8s/logging/. The path structure is consistent across Loki values, Alloy values, and Grafana configuration files (lines 49, 63, 113, 116).

Minor correction needed at line 144: change "component type (e.g frontend, worker, etc)" to "component type (e.g., frontend, worker, etc.)" to follow American English style guidelines.

deploy/docker-observability.yml (1)

90-91: Minor: Tempo config file uses YAML extension, verify naming consistency.

The GPU reservation is correctly configured with count: all to reserve all GPUs on the host. Line 92 references ./observability/tempo.yaml (YAML extension). Verify this filename is consistent with your configuration management. Most tools use .yaml or .yml interchangeably, but consistency across the repo is preferred.

docs/observability/metrics.md (2)

62-74: Add language identifier to Prometheus exposition format code block.

Fenced code blocks with a language identifier provide the best readability and syntax highlighting. The code block showing Prometheus exposition format (lines 62-74) lacks a language identifier.

````diff
-```
+```text
 # HELP dynamo_component_requests_total Total requests processed
 # TYPE dynamo_component_requests_total counter
````

Alternatively, if Prometheus format highlighting is available, use prometheus as the language identifier.


172-194: Add language identifier to code blocks showing timeline and concurrency examples.

Lines 172 and 182 contain fenced code blocks without language identifiers. The timeline ASCII diagram (lines 182-194) should use text to preserve formatting without attempting syntax highlighting.

````diff
-```
+```text
 Timeline:    0, 1, ...
 Client ────> Frontend:8000 ...
````
docs/observability/tracing.md (2)

27-76: Clarify docker compose file path for consistency.

The guide uses docker compose -f docker-observability.yml (lines 33, 46, 162). While this works from the deploy/ directory, ensure the current working directory is clear in all instructions. The commands at lines 32-33 show cd deploy before running the compose command, which is good. However, line 162 in the "Stop Services" section does not show the cd deploy step.

Add consistent context at section 6 (lines 156-163) to clarify working directory:

 ### 6. Stop Services
 
 When done, stop the observability stack:
 
````diff
 ```bash
+cd deploy
 docker compose -f docker-observability.yml down
````

---

`90-116`: **Disaggregated deployment script reference needs clarification.**

Lines 78-116 provide a manual script for disaggregated deployment but reference modifying `disagg.sh`. The comment at line 90 states "You may need to modify `disagg.sh`" but the script provided (lines 92-116) is shown as a complete replacement example.

Clarify whether:
1. `disagg.sh` should be modified before running, or
2. The provided script should be used as a reference/replacement


Add a note clarifying the intent:

```diff
 **Note:** You may need to modify `disagg.sh` to export the tracing environment
 variables before starting each component:

+**Option A:** Modify your existing `disagg.sh` to add the following lines at the top, or
+**Option B:** Use the template below as a complete `disagg.sh` replacement:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 0b71cf8.

⛔ Files ignored due to path filters (1)
  • docs/observability/trace.png is excluded by !**/*.png
📒 Files selected for processing (20)
  • README.md (1 hunks)
  • deploy/docker-compose.yml (1 hunks)
  • deploy/docker-observability.yml (1 hunks)
  • deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1 hunks)
  • deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1 hunks)
  • deploy/observability/k8s/logging/README.md (1 hunks)
  • deploy/observability/tempo-datasource.yml (1 hunks)
  • deploy/tracing/docker-compose.yml (0 hunks)
  • docs/kubernetes/observability/logging.md (4 hunks)
  • docs/kubernetes/observability/metrics.md (1 hunks)
  • docs/observability/README.md (1 hunks)
  • docs/observability/health-checks.md (1 hunks)
  • docs/observability/logging.md (5 hunks)
  • docs/observability/metrics-developer-guide.md (1 hunks)
  • docs/observability/metrics.md (2 hunks)
  • docs/observability/prometheus-grafana.md (1 hunks)
  • docs/observability/tracing.md (4 hunks)
  • lib/bindings/python/examples/metrics/README.md (1 hunks)
  • lib/bindings/python/rust/lib.rs (2 hunks)
  • lib/runtime/src/logging.rs (1 hunks)
💤 Files with no reviewable changes (1)
  • deploy/tracing/docker-compose.yml
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • docs/observability/README.md
  • docs/observability/prometheus-grafana.md
  • docs/observability/metrics-developer-guide.md
  • deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml
  • docs/observability/metrics.md
  • lib/bindings/python/examples/metrics/README.md
📚 Learning: 2025-07-14T21:25:56.930Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.930Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.

Applied to files:

  • lib/bindings/python/rust/lib.rs
📚 Learning: 2025-07-18T16:04:31.771Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

Applied to files:

  • docs/kubernetes/observability/logging.md
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The ai-dynamo/dynamo team uses _total as a semantic unit suffix across all metric types (including gauges like INFLIGHT_REQUESTS_TOTAL) for internal consistency, as evidenced by patterns in prometheus_names.rs. This is a deliberate architectural choice to prioritize uniform naming conventions over strict Prometheus conventions that reserve _total only for counters.

Applied to files:

  • docs/observability/prometheus-grafana.md
  • docs/observability/metrics.md
📚 Learning: 2025-09-16T00:21:44.912Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: deploy/metrics/README.md:43-43
Timestamp: 2025-09-16T00:21:44.912Z
Learning: Graham (grahamking) has provided guidance in PR 2914 that metrics should end with _count or _total to indicate units, but this needs to be clarified whether it applies to all metric types or just counters, as Prometheus conventions differ between counters (should have _total) and gauges (should not have _total).

Applied to files:

  • docs/observability/metrics-developer-guide.md
📚 Learning: 2025-09-24T19:06:57.156Z
Learnt from: ryan-lempka
Repo: ai-dynamo/dynamo PR: 3062
File: lib/llm/src/audit/sink.rs:15-27
Timestamp: 2025-09-24T19:06:57.156Z
Learning: In the audit logging system, full request/response logging requires both DYN_AUDIT_ENABLED=1 environment variable and explicit store=true in the request. Without store=true, only usage statistics are logged (UsageOnly mode). The stderr sink is the initial implementation with plans for additional sinks in the future.

Applied to files:

  • docs/observability/logging.md
📚 Learning: 2025-06-04T13:09:53.416Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1365
File: deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go:171-178
Timestamp: 2025-06-04T13:09:53.416Z
Learning: The `DYN_DEPLOYMENT_CONFIG` environment variable (commonconsts.DynamoDeploymentConfigEnvVar) in the Dynamo operator will never be set via ValueFrom (secrets/config maps), only via direct Value assignment. The GetDynamoDeploymentConfig method correctly only checks env.Value for this specific environment variable.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-06-05T01:46:15.509Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-07-25T22:34:11.384Z
Learnt from: nnshah1
Repo: ai-dynamo/dynamo PR: 2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

  • docs/observability/health-checks.md
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Applied to files:

  • docs/observability/metrics.md
🧬 Code graph analysis (2)
lib/runtime/src/logging.rs (1)
lib/runtime/src/config.rs (1)
  • env_is_truthy (422-427)
lib/bindings/python/rust/lib.rs (1)
lib/runtime/src/config.rs (1)
  • env_is_truthy (422-427)
🪛 GitHub Check: Check for broken markdown links
lib/bindings/python/examples/metrics/README.md

[failure] 10-10:
Broken link: Metrics Developer Guide - Python Section - View: https://github.com/ai-dynamo/dynamo/blob/HEAD/lib/bindings/python/examples/metrics/README.md?plain=1#L10

🪛 LanguageTool
docs/kubernetes/observability/logging.md

[uncategorized] ~144-~144: The abbreviation “e.g.” (= for example) requires two periods.
Context: ...loyment, namespace, and component type (e.g frontend, worker, etc).

(E_G)


[style] ~144-~144: In American English, abbreviations like “etc.” require a period.
Context: ...d component type (e.g frontend, worker, etc).

(ETC_PERIOD)

docs/observability/prometheus-grafana.md

[duplication] ~28-~28: Possible typo: you repeated a word.
Context: ...tes Install these on your machine: - Docker - [Docker Compose](https://docs.docker.com/compos...

(ENGLISH_WORD_REPEAT_RULE)

docs/observability/metrics-developer-guide.md

[uncategorized] ~36-~36: Loose punctuation mark.
Context: ...Methods - .metrics().create_counter(): Create a counter metric - `.metrics().c...

(UNLIKELY_OPENING_PUNCTUATION)

docs/observability/metrics.md

[uncategorized] ~102-~102: Loose punctuation mark.
Context: ... - dynamo_component_inflight_requests: Requests currently being processed (gau...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~113-~113: Loose punctuation mark.
Context: ...dynamo_component_kvstats_active_blocks: Number of active KV cache blocks curren...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~126-~126: Loose punctuation mark.
Context: ...unctionality: - dynamo_preprocessor_*: Metrics specific to preprocessor compon...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~132-~132: Loose punctuation mark.
Context: ...: - dynamo_frontend_inflight_requests: Inflight requests (gauge) - `dynamo_fro...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~150-~150: Loose punctuation mark.
Context: ... dynamo_frontend_model_total_kv_blocks: Total KV blocks available for a worker ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~157-~157: Loose punctuation mark.
Context: ...- dynamo_frontend_model_context_length: Maximum context length for a worker ser...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 markdownlint-cli2 (0.18.1)
docs/observability/metrics.md

62-62: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


172-172: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


182-182: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (.)
🔇 Additional comments (27)
lib/runtime/src/logging.rs (1)

147-150: LGTM! Truthy value support improves usability.

The change from exact "1" matching to env_is_truthy() improves the developer experience by accepting common truthy values (1, true, on, yes). The documentation comment accurately reflects the new behavior.

lib/bindings/python/rust/lib.rs (2)

128-134: LGTM! Consistent truthy value support.

The change to use env_is_truthy() and the updated warning message are consistent with the changes in lib/runtime/src/logging.rs. The warning text correctly reflects that any truthy value (not just "=1") will trigger the deferred initialization.


449-453: LGTM! Proper deferred initialization.

The truthy check here correctly defers logging initialization until the Tokio runtime is available, which is required for the OTEL exporter. This mirrors the pattern in the _core module initialization.

docs/observability/metrics-developer-guide.md (1)

1-270: Excellent comprehensive developer guide!

This is a well-structured and thorough metrics developer guide that covers:

  • Clear getting started instructions with environment variables
  • Both Rust and Python API usage with practical examples
  • Vector metrics with labels
  • Advanced features (custom buckets, constant labels)
  • Update patterns and examples

The document effectively consolidates metrics guidance and provides a strong foundation for developers working with Dynamo metrics.

docs/observability/health-checks.md (2)

14-24: Well-documented environment variables.

The new environment variables table provides clear, comprehensive information with descriptions, defaults, and examples. This follows a consistent format with other observability documentation.


25-48: Helpful Getting Started section.

The new Getting Started section provides clear, practical examples for enabling and testing health checks on a single GPU. This aligns well with the PR's objective to improve observability documentation and provide better getting-started guidance.

docs/observability/prometheus-grafana.md (3)

1-21: Improved overview and environment variables documentation.

The updated overview clearly focuses on single-machine demo setup, and the environment variables table follows the consistent format used across other observability documentation.


22-84: Excellent Getting Started guide.

The new Getting Started section provides a comprehensive, step-by-step guide that covers:

  • Prerequisites and installation
  • Starting infrastructure and observability services
  • Launching Dynamo components with proper configuration
  • Testing with example requests
  • Accessing web interfaces with helpful notes about remote access

This is very user-friendly and practical.


88-167: Well-structured topology and configuration guidance.

The topology diagram provides a clear visualization of service relationships, and the configuration/troubleshooting sections offer practical guidance. The reference to the Metrics Developer Guide is a good addition for users who want to create custom metrics.

docs/observability/logging.md (3)

27-52: Improved environment variables documentation and Getting Started section.

The restructured environment variables table follows the consistent format used across observability documentation, and the new Getting Started section provides clear, practical examples for enabling JSONL logging.


102-141: Clear explanation of OpenTelemetry logging integration.

The renamed section "Logging of OpenTelemetry Tracing" better reflects its purpose, and the clarification about the overlap with distributed tracing documentation is helpful. The note distinguishing between seeing trace context in logs versus exporting to a trace backend is particularly useful.


142-258: Comprehensive trace and span information examples.

The updated section title accurately reflects the content, and the detailed examples effectively demonstrate how trace context appears in logs. The examples showing custom request IDs with the x-request-id header are particularly practical for correlating traces with application-level request tracking.

lib/bindings/python/examples/metrics/README.md (1)

10-10: Link is valid; no changes needed.

Verification confirms the documentation link is functional. The file docs/observability/metrics-developer-guide.md exists and contains the Python Metrics API section referenced by the anchor. The static analysis flag appears to be a false positive.

deploy/observability/k8s/grafana-dynamo-dashboard-configmap.yaml (1)

1005-1005: Dashboard title update is consistent with observability reorganization.

The title change aligns with the PR's generic naming convention and matches the corresponding update in deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json. This is a safe cosmetic change with no impact on dashboard functionality or metrics queries.

deploy/observability/tempo-datasource.yml (1)

12-12: Tempo datasource correctly set as non-default.

Setting isDefault: false appropriately makes Prometheus the default datasource while allowing Tempo queries when explicitly selected. This is the correct configuration for a multi-datasource observability setup where metrics are primary and tracing is supplementary.

deploy/observability/grafana_dashboards/grafana-dynamo-dashboard.json (1)

1023-1023: Dashboard title matches configmap and supports generic naming convention.

The title update is consistent with the parallel change in the ConfigMap and maintains the semantic meaning of the dashboard while supporting the new "generic" naming introduced in this PR. All dashboard panels and queries remain intact.

deploy/observability/k8s/logging/README.md (1)

3-3: Relative path correctly updated for directory relocation.

The path has been properly adjusted to reflect the directory restructuring from deploy/logging/ to deploy/observability/k8s/logging/. The new path ../../../../docs/kubernetes/observability/logging.md correctly navigates from the new file location to the documentation.

docs/kubernetes/observability/metrics.md (1)

130-132: Documentation simplified and clarified.

Replacing the multi-step pushd/popd pattern with a single direct kubectl apply command improves clarity and reduces cognitive load. The simplified instruction is easier to follow and less error-prone for users.

README.md (2)

104-107: Code block syntax highlighting improved.

Specifying bash as the code fence language enables proper syntax highlighting for users reading the README, improving clarity and professional appearance.


109-117: New observability section is clear and well-positioned.

The optional observability stack section appropriately documents the separate deployment path and clearly communicates:

  • What services are included (Prometheus, Grafana, Tempo, metrics exporters)
  • How to deploy (docker compose -f deploy/docker-observability.yml)
  • How to access it (Grafana credentials and port)

Marking this as optional correctly reflects that it's not required for basic Dynamo operation, improving documentation clarity for new users. Credentials (dynamo/dynamo) and port (3000) match the PR objectives.

docs/observability/README.md (1)

1-32: Observability documentation hub verified and complete.

The README effectively centralizes observability configuration with well-organized tables, shared variable annotations (†), and appropriate separation between user and developer guides. All referenced guide files have been confirmed to exist.

deploy/docker-observability.yml (2)

106-127: Clarify Grafana security configuration.

The Grafana service has several security-related settings that should be documented or reviewed:

  • user: root (Line 90 for Tempo is reasonable for file permissions, but Line 106-127 uses default user) — consider whether root privileges are necessary for Grafana
  • GF_SECURITY_DISABLE_INITIAL_ADMIN_CREATION=false (Line 125) — contradicts the intent of the other security flags; this will NOT disable admin creation despite the name
  • Default credentials dynamo/dynamo are acceptable for local development but should be rotated in production

Consider adding a comment to clarify these are development defaults and should not be used in production, or add safeguards for production deployments.


64-66: All referenced configuration files are present.

Verification confirms that all required configuration files and directories referenced in the compose file exist in the deploy/observability/ directory. The observability stack is properly configured and ready for use.

docs/observability/metrics.md (2)

29-54: Environment variable names and defaults are correct.

Verification confirms DYN_SYSTEM_ENABLED and DYN_SYSTEM_PORT match the codebase exactly. The system metrics server is disabled by default and enabled when DYN_SYSTEM_ENABLED=true. The example port 8081 is consistent with actual usage throughout the codebase.
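A small standalone sketch of the enable/port pattern described above (the helper and its signature are hypothetical, and 8081 is only the example port from the docs, used here as an illustrative fallback):

```rust
/// Hypothetical helper mirroring the documented behavior: the system
/// metrics server stays off unless DYN_SYSTEM_ENABLED is truthy, and the
/// port falls back to the docs' example value 8081 when unset/unparsable.
fn system_server_config(enabled: Option<&str>, port: Option<&str>) -> (bool, u16) {
    let on = enabled
        .map(|v| matches!(v.to_lowercase().as_str(), "1" | "true" | "on" | "yes"))
        .unwrap_or(false);
    let port = port.and_then(|v| v.parse().ok()).unwrap_or(8081);
    (on, port)
}

fn main() {
    // Disabled by default; DYN_SYSTEM_ENABLED=true turns it on.
    println!("{:?}", system_server_config(None, None));
    println!("{:?}", system_server_config(Some("true"), Some("9090")));
}
```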


96-120: Fix documented metric names to match implementation.

Two documentation errors found:

  1. Line 106: dynamo_component_system_uptime_seconds should be dynamo_component_uptime_seconds (remove "system_" prefix)

  2. Lines 110-113: KVStats metrics are missing the dynamo_component_ prefix in documentation. The actual metric names are:

    • kvstats_active_blocks (not dynamo_component_kvstats_active_blocks)
    • kvstats_total_blocks (not dynamo_component_kvstats_total_blocks)
    • kvstats_gpu_cache_usage_percent (not dynamo_component_kvstats_gpu_cache_usage_percent)
    • kvstats_gpu_prefix_cache_hit_rate (not dynamo_component_kvstats_gpu_prefix_cache_hit_rate)

The work handler metrics (lines 102-105) are correctly named in the documentation.

⛔ Skipped due to learnings
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.
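The naming corrections in the comment above can be pinned down as constants; the strings come straight from the review's findings, while the surrounding code is an illustrative sketch rather than Dynamo source:

```rust
/// Metric names as reported by the review: the uptime metric carries the
/// `dynamo_component_` prefix (with no `system_` segment), while the
/// KVStats metrics are exported without the component prefix.
const UPTIME: &str = "dynamo_component_uptime_seconds";
const KVSTATS_METRICS: [&str; 4] = [
    "kvstats_active_blocks",
    "kvstats_total_blocks",
    "kvstats_gpu_cache_usage_percent",
    "kvstats_gpu_prefix_cache_hit_rate",
];

fn main() {
    println!("{UPTIME}");
    for name in KVSTATS_METRICS {
        // Sanity check matching the review: no component prefix here.
        assert!(!name.starts_with("dynamo_component_"));
        println!("{name}");
    }
}
```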
docs/observability/tracing.md (2)

18-25: All environment variable defaults verified and confirmed correct:

| Variable | Documentation Default | Implementation Default | Status |
|---|---|---|---|
| DYN_LOGGING_JSONL | false | env_is_truthy() → false | ✅ |
| OTEL_EXPORT_ENABLED | false | env_is_truthy() → false | ✅ |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | http://localhost:4317 | http://localhost:4317 | ✅ |
| OTEL_SERVICE_NAME | dynamo | dynamo | ✅ |

Tempo OTLP gRPC port 4317 confirmed in deploy/docker-observability.yml (line 96).


56-58: No issues found—environment variable syntax is correct.

The OTEL_EXPORT_ENABLED=true syntax in docs/observability/tracing.md is valid. The application's is_truthy function (lib/config/src/lib.rs:20-22, lib/runtime/src/config.rs:396-398) explicitly accepts "true" as a truthy value via pattern matching: matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes"). The example aligns with the documented acceptable values listed in the code comment.
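The acceptance set quoted above is easy to check in isolation; here is a minimal standalone copy of the pattern (the real function lives in lib/config/src/lib.rs and lib/runtime/src/config.rs):

```rust
/// Standalone copy of the truthiness check described in the comment above:
/// "1", "true", "on", "yes" (any case) are truthy; everything else is not.
fn is_truthy(val: &str) -> bool {
    matches!(val.to_lowercase().as_str(), "1" | "true" | "on" | "yes")
}

fn main() {
    for v in ["true", "TRUE", "on", "Yes", "1", "false", "off", ""] {
        println!("{v:?} -> {}", is_truthy(v));
    }
}
```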

Comment on lines 4 to 5
# Bare minimum infrastructure services for Dynamo.
# For observability (metrics, tracing, dashboards), use docker-compose-observability.yml
Contributor

@coderabbitai coderabbitai bot Nov 7, 2025


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the observability compose file name.

The comment references docker-compose-observability.yml, but the AI summary mentions docker-observability.yml. Please verify the correct filename.


🏁 Script executed:

#!/bin/bash
# Description: Verify the observability compose file name

if [ -f "deploy/docker-observability.yml" ]; then
    echo "✓ Found: deploy/docker-observability.yml"
elif [ -f "deploy/docker-compose-observability.yml" ]; then
    echo "✓ Found: deploy/docker-compose-observability.yml"
else
    echo "✗ Neither deploy/docker-observability.yml nor deploy/docker-compose-observability.yml found"
fi

Length of output: 40


Update comment to reference correct observability compose file name.

The comment at lines 4-5 references docker-compose-observability.yml, but the actual file is docker-observability.yml. Update the comment to reference the correct filename:

# For observability (metrics, tracing, dashboards), use docker-observability.yml
🤖 Prompt for AI Agents
In deploy/docker-compose.yml around lines 4 to 5, the inline comment references
the wrong observability compose filename; update the comment to reference
docker-observability.yml instead of docker-compose-observability.yml so the
comment reads: "# For observability (metrics, tracing, dashboards), use
docker-observability.yml".

Contributor Author

Done.

Contributor

@keivenchang Thank you for the quick fix! 🎉


🧠 Learnings used
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.


Comment on lines +125 to 134
curl -H 'Content-Type: application/json' \
-H 'x-request-id: test-trace-001' \
-d '{
"model": "Qwen/Qwen3-0.6B",
"max_completion_tokens": 100,
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' \
http://localhost:8000/v1/chat/completions
Contributor

@coderabbitai coderabbitai bot Nov 7, 2025

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify x-request-id header is actually used for trace correlation.

The documentation at lines 125-126 highlights the x-request-id header for trace correlation in Grafana searches (line 147). Confirm this header is:

  1. Actually captured by the tracing system
  2. Properly mapped to the x_request_id tag mentioned at line 147
  3. Consistent with OpenTelemetry standards or custom implementation

🏁 Script executed:

#!/bin/bash
# Search for x-request-id header handling

echo "Searching for x-request-id header handling:"
rg -n 'x-request-id|x_request_id' --type=py --type=rust -C 2

echo ""
echo "Searching for OpenTelemetry attribute mapping:"
rg -n 'attribute|tag|trace' --type=py --type=rust | grep -i 'request.*id'

Length of output: 10886



🏁 Script executed:

#!/bin/bash
# Inspect the actual documentation to see what's claimed
echo "=== Documentation content around lines 125-150 ==="
sed -n '125,150p' docs/observability/tracing.md

echo ""
echo "=== Check for OpenTelemetry setup/configuration ==="
rg -n 'opentelemetry|otel|tracer|propagat' --type=rust --type=py -i -C 1 | head -50

Length of output: 4490



🏁 Script executed:

#!/bin/bash
# Check how tracing spans are connected to OpenTelemetry and if x_request_id becomes an OTEL attribute
echo "=== Looking for span attribute setting in logging.rs ==="
grep -n 'set_attribute\|record\|set_tag' lib/runtime/src/logging.rs | head -20

echo ""
echo "=== Check DistributedTracingContext struct definition ==="
grep -B5 -A15 'struct DistributedTracingContext' lib/runtime/src/logging.rs

echo ""
echo "=== Check how get_distributed_tracing_context extracts headers ==="
grep -B5 -A30 'fn get_distributed_tracing_context' lib/runtime/src/logging.rs

Length of output: 2109



🏁 Script executed:

#!/bin/bash
# Find DistributedTraceContext struct definition
echo "=== DistributedTraceContext struct full definition ==="
rg -A20 'pub struct DistributedTraceContext' lib/runtime/src/logging.rs

echo ""
echo "=== Check setup_logging function for OpenTelemetry layer ==="
grep -A50 'fn setup_logging()' lib/runtime/src/logging.rs | head -70

echo ""
echo "=== Check if OpenTelemetrySpanExt methods are used for x_request_id ==="
rg -n 'with_values|set_attribute|set_error|set_span_error' lib/runtime/src/logging.rs

echo ""
echo "=== Look for where x_request_id might be set as OTEL attribute ==="
rg -B3 -A3 'x_request_id' lib/llm/src/http/service/openai.rs | head -30

Length of output: 3948



🏁 Script executed:

#!/bin/bash
# Get complete setup_logging function
echo "=== Complete setup_logging function ==="
sed -n '/^fn setup_logging/,/^fn [a-z]/p' lib/runtime/src/logging.rs | head -150

echo ""
echo "=== Search for tracing_opentelemetry layer initialization ==="
rg -B2 -A5 'tracing_opentelemetry|TelemetryLayer' lib/runtime/src/logging.rs

Length of output: 2049


The x-request-id header is captured but not exported as a searchable OpenTelemetry attribute.

The implementation correctly extracts the x-request-id header (logging.rs:260-261) and includes it in tracing spans via info_span!. However, there is a gap between the documentation and implementation:

  • What works: The header is captured and used internally for request tracking and logging context
  • What's missing: The x_request_id field is never explicitly set as an OpenTelemetry attribute using set_attribute() or similar OTEL methods
  • Result: Grafana Tempo cannot search traces by x_request_id tag as documented at line 147, since custom fields in tracing spans are not automatically converted to searchable OTEL attributes by the tracing_opentelemetry layer

To make it searchable in Tempo as documented, the code needs to explicitly set x_request_id as an OTEL attribute on the active span, likely in the span extension setup (around logging.rs:250-280) or in the on_enter hook where the DistributedTraceContext is stored.

🤖 Prompt for AI Agents
In docs/observability/tracing.md around lines 125 to 134 (and implementation
area logging.rs ~250-280), the request ID header is added to tracing spans for
logging but never set as an explicit OpenTelemetry attribute, so Tempo cannot
search by it; update the span setup (e.g., in the span extension or on_enter
hook where DistributedTraceContext is stored) to retrieve the active
OpenTelemetry span and call the OTEL attribute setter to add a key like
"x_request_id" (or "http.request_id") with the extracted header value using the
appropriate OpenTelemetry KeyValue API; ensure the necessary OpenTelemetry
trait/import is available and do this immediately after creating or entering the
span so the attribute is exported with the trace.

Contributor Author

I am able to see this custom metric in Tempo even without set_attribute() or similar OTEL methods.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how tracing_opentelemetry converts span fields to OTEL attributes
echo "=== Check Cargo.toml for tracing_opentelemetry version ==="
rg -n 'tracing.?opentelemetry' --type=toml

echo ""
echo "=== Look for any documentation or comments about span field conversion ==="
rg -n -i 'span.*field.*attribute|field.*otel|automatic.*attribute' --type=rust -C 2

Length of output: 615


🌐 Web query:

tracing_opentelemetry span fields OpenTelemetry attributes automatic conversion

💡 Result:

Short summary — how tracing_opentelemetry turns tracing span fields into OpenTelemetry attributes

  • Reserved otel.* fields: fields prefixed with "otel." are handled specially (otel.name, otel.kind, otel.status_code, otel.status_description / otel.status_message). These override span name/kind/status when present. [1][3].
  • Semantic-convention pass-through: any field name that matches OpenTelemetry semantic convention keys (e.g. "http.method", "db.system", "server.port") is passed through as an OTEL attribute; you can also set those keys directly as tracing fields. [1][3].
  • Type conversion: tracing field values are converted to OpenTelemetry attribute values via the opentelemetry::Value/KeyValue APIs (set_attribute takes impl Into&lt;Key&gt; / impl Into&lt;Value&gt;). Supported value kinds follow the opentelemetry Value model (strings, booleans, integers, floats and arrays of those) — you use set_attribute or let tracing fields be converted automatically. [4][5].
  • Explicit API: OpenTelemetrySpanExt provides set_attribute, add_event, set_status, context, etc., to add OTEL attributes/events directly (bypassing tracing) when you need dynamic keys or >32 attributes. Tracing fields still supersede attributes set via set_attribute if they conflict. [4].
  • Optional propagation of extra metadata: the OpenTelemetryLayer has toggles to include exception fields, source location, thread info, verbosity level, target, tracked inactivity, and whether to activate OTEL Context on span entry — these control what extra tracing span fields become OTEL attributes. [2][3].

Useful references:

  • tracing_opentelemetry crate docs (special otel.* fields, semantic conventions, layer options). [1][2][3]
  • opentelemetry::Value / Span attribute docs (Value kinds and set_attribute behavior). [4][5]

References
[1] tracing_opentelemetry — crate overview / special fields.
[2] OpenTelemetryLayer — layer options (exceptions, location, threads, level, etc.).
[3] tracing_opentelemetry lib.rs source (special fields, semantic conventions).
[4] tracing_opentelemetry::OpenTelemetrySpanExt (set_attribute, add_event).
[5] opentelemetry::Value / Span attribute docs.


🏁 Script executed:

#!/bin/bash
# Check the complete setup_logging implementation to see OpenTelemetryLayer configuration
echo "=== Full setup_logging function with OpenTelemetry configuration ==="
sed -n '/^fn setup_logging/,/^}/p' lib/runtime/src/logging.rs | head -200

echo ""
echo "=== Check for any OpenTelemetryLayer builder/configuration ==="
rg -n 'OpenTelemetryLayer|with_|::new\(\)' lib/runtime/src/logging.rs -A 2 -B 2

Length of output: 10915


@keivenchang Thank you for testing and confirming! You're absolutely right—looking at the implementation, tracing_opentelemetry::layer() with the default configuration does automatically convert tracing span fields (like x_request_id set in info_span!) to OpenTelemetry attributes that Tempo can search by.

I see the standard setup at lines 794-796 in lib/runtime/src/logging.rs:

tracing_opentelemetry::layer()
    .with_tracer(tracer)
    .with_filter(otel_filter_layer),

This default behavior handles the field-to-attribute conversion without needing explicit set_attribute() calls. The documentation at lines 147 showing the search by x_request_id tag is accurate. Sorry for the false alarm—the implementation is working as documented! 🎉



@@ -1,35 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor Author

This is now subsumed in the deploy/docker-observability.yml file (plus other stuff).

networks:
- server
- monitoring

Contributor Author

Moving Prometheus/Grafana/dcgm out.

@@ -0,0 +1,137 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor Author

This new file is a merge of the old deploy/tracing/docker-compose.yml and the old deploy/docker-compose.yml, containing only observability related services.

- Fix docker-compose.yml comment to reference correct filename (docker-observability.yml not docker-compose-observability.yml)
- Fix abbreviation formatting in kubernetes logging docs (e.g., etc.)

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang changed the title Keivenchang/dis 980 consolidate otel docker compose files refactor: consolidate OTEL docker compose files Nov 7, 2025
@keivenchang keivenchang changed the title refactor: consolidate OTEL docker compose files refactor: consolidate Observability files (e.g. OTEL docker-compose, md files) Nov 7, 2025
Correct relative path to metrics-developer-guide.md (needs 6 levels up, not 5)

Signed-off-by: Keiven Chang <[email protected]>