[dyninst] Expose more dyninst debug data via flares #47953
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9cc76edaae
pkg/dyninst/actuator/debug_info.go
Outdated
discoveredTypes := make(map[string][]string, len(s.discoveredTypes))
for svc, types := range s.discoveredTypes {
	discoveredTypes[svc] = types
Deep-copy discovered type slices before returning debug state
state.debugInfo() copies s.discoveredTypes into a new map but reuses each slice (discoveredTypes[svc] = types), so /dynamic_instrumentation/debug/state can end up JSON-encoding slice memory that the actuator event loop mutates later (for example while handling eventMissingTypesReported). That makes the debug endpoint and flare collection racy and can produce corrupted or inconsistent discovered_types output under concurrent probe activity.
dustmop
left a comment
LGTM for agent-configuration owned files
Files inventory check summary
File checks results against ancestor d7186e78: Results for datadog-agent_7.79.0~devel.git.17.46a5098.pipeline.103739422-1_amd64.deb: No change detected
Static quality checks
✅ Please find below the results from the static quality gates.
Successful checks
Info
10 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector Results
Metrics dashboard
Baseline: 0af0a9f
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.90 | [-1.16, +4.96] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.90 | [-1.16, +4.96] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.52 | [+1.39, +1.65] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.91 | [+0.77, +1.06] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.76 | [+0.59, +0.92] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.20 | [-0.04, +0.44] | 1 | Logs bounds checks dashboard |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.15 | [+0.09, +0.21] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.12 | [+0.07, +0.17] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.11 | [-0.11, +0.34] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.07 | [-0.09, +0.24] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.04 | [-0.35, +0.42] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.03 | [-0.47, +0.53] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.21, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.01 | [-0.08, +0.06] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.01 | [-0.12, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.20, +0.18] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.02 | [-0.20, +0.17] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.04 | [-0.47, +0.38] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.24 | [-0.30, -0.19] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.32 | [-0.42, -0.21] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.41 | [-0.49, -0.34] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.43 | [-0.47, -0.39] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.56 | [-0.62, -0.50] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -2.28 | [-3.83, -0.73] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 681 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 280.50MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 721 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.24GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 173.73MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 490.04MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 207.03MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 351.76 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 411.86MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
ajwerner
left a comment
LGTM thanks for doing it! Might be nice to have some basic testing if you felt so inclined just to make sure we're not doing something really dumb in future refactors.
pkg/dyninst/module/debug_info.go
Outdated
// DiagnosticsDebugInfo contains the state of all diagnostic trackers.
type DiagnosticsDebugInfo struct {
	Received  []DiagnosticsEntry `json:"received"`
	Installed []DiagnosticsEntry `json:"installed"`
	Emitted   []DiagnosticsEntry `json:"emitted"`
	Errors    []DiagnosticsEntry `json:"errors"`
}
I think this will end up being annoying to consume. How do you feel about transposing this so it is keyed on probe id and version and then has the diagnostics that have been sent?
pkg/flare/archive_linux.go
Outdated
// Copy the dynamic instrumentation tombstone file if it exists.
// This file persists probe definitions across restarts and is
// valuable for debugging even when system-probe is not running.
const dyninstTombstonePath = "/tmp/datadog-agent/system-probe/dynamic-instrumentation/debugger-probes-tombstone.json"
did we hard-code this to /tmp and not os.TempDir? Seems like bad behavior by us!
Force-pushed 156dcdc to 9fca295.

/merge
View all feedbacks in Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
devflow unqueued this merge request: It did not become mergeable within the expected time.
Force-pushed 9fca295 to 11e07eb.

/merge
devflow unqueued this merge request: It did not become mergeable within the expected time.
Force-pushed 11e07eb to 5cae8bc.

/merge
PR already in the queue with status "waiting".
/merge
This pull request is not mergeable according to GitHub. Build pipeline has failing jobs for a06590b.

/merge
Build pipeline has failing jobs for 37aae0d.

/merge
Build pipeline has failing jobs for 3d971ff. Since those jobs are not marked as being allowed to fail, the pipeline will most likely fail.

/merge
Build pipeline has failing jobs for 42b528f.
Force-pushed 5cae8bc to ba49d34.

/merge
devflow unqueued this merge request: It did not become mergeable within the expected time.

/merge
This pull request is not mergeable according to GitHub.

/merge
PR already in the queue with status "waiting".
Force-pushed ba49d34 to 46a5098.

No description provided.