Commit 9905977

new feature: post run insights, query and index analysis
1 parent e15b073 commit 9905977

10 files changed (+2571, -58 lines)

INSIGHTS.md

Lines changed: 214 additions & 0 deletions
# Post-Run Insights: Slow Query and Index Analysis

This document explains PLGM's post-run insights layer in detail.

The feature provides a structured analysis after benchmark completion, including:

- slow operation groups
- affected collections
- normalized query-shape groupings
- cautious, evidence-based index guidance
- export-ready JSON data for downstream dashboards

## What It Is

The insights layer is a **foundational analytics pass** designed to run after all iterations are complete.

It is intentionally separated from the real-time charts to keep runtime overhead bounded and predictable.

## Where It Appears

After completion, insights are available in:

- the Web UI dashboard panel: `POST-RUN SLOW QUERY & INDEX ANALYSIS`
- the API endpoint: `GET /api/insights`
- the `Download Summary` JSON export, under the `insights` section

## When It Runs

Insights are finalized only after all workloads finish.

- While a run is active, `GET /api/insights` returns `metadata.status = pending`.
- Once complete, the endpoint returns the final analysis (`ready` / `empty` / `disabled`).

This behavior avoids presenting partial or misleading findings during execution.

## Data Collection Model

PLGM captures sampled operation events during workload execution, with bounded retention.

Each sampled event may include:

- operation type
- database and collection
- normalized shape key and shape summary
- extracted filter fields (when applicable)
- duration
- success/failure
- iteration index
- timestamp

Retention characteristics:

- sampled (configurable sampling rate)
- bounded ring buffer (`insights_max_events`)
- bounded aggregation cardinality (`insights_max_groups`)

This prevents unbounded memory growth while preserving useful signal.
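The sampling and bounded-retention model can be sketched as follows. This is an illustrative sketch, not PLGM's actual implementation; the `opEvent` and `eventBuffer` names are hypothetical. It shows the key property: memory use is fixed by the buffer capacity regardless of how many operations run.

```go
package main

import (
	"fmt"
	"math/rand"
)

// opEvent is a hypothetical sampled operation event.
type opEvent struct {
	Op         string
	Collection string
	DurationMs int
}

// eventBuffer is a bounded ring buffer: once capacity is reached,
// the oldest event is overwritten, so memory stays constant.
type eventBuffer struct {
	events []opEvent
	next   int
	full   bool
}

func newEventBuffer(capacity int) *eventBuffer {
	return &eventBuffer{events: make([]opEvent, capacity)}
}

// Offer records the event only if the sampling coin-flip passes,
// so retention is bounded starting at capture time.
func (b *eventBuffer) Offer(e opEvent, samplingRate float64) {
	if rand.Float64() >= samplingRate {
		return // not sampled
	}
	b.events[b.next] = e
	b.next = (b.next + 1) % len(b.events)
	if b.next == 0 {
		b.full = true
	}
}

// Len reports how many events are currently retained.
func (b *eventBuffer) Len() int {
	if b.full {
		return len(b.events)
	}
	return b.next
}

func main() {
	buf := newEventBuffer(5000) // mirrors insights_max_events
	for i := 0; i < 100000; i++ {
		buf.Offer(opEvent{Op: "find", Collection: "users", DurationMs: i % 400}, 0.10)
	}
	fmt.Println("retained:", buf.Len()) // never exceeds 5000
}
```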
## What Insights Contains

Top-level sections in the final report:

- `summary`
- `slow_queries`
- `affected_collections`
- `query_shapes`
- `potential_index_issues`
- `recommendations`
- `per_iteration`
- `time_slices`
- `metadata`

## Stable Shape IDs and Cross-Run Trends

Each shape group has a stable `shape_id` derived from:

- operation
- collection
- normalized shape key

This enables consistent identity across runs.
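One way such a stable ID could be derived is by hashing the identity triple after normalizing the filter to its field structure. This is a sketch under assumptions: `normalizeShapeKey` and the 12-character truncated SHA-256 are illustrative choices, not PLGM's documented scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// normalizeShapeKey reduces a filter to its field structure, discarding
// values, so {"user_id": 42} and {"user_id": 7} group together.
func normalizeShapeKey(filter map[string]interface{}) string {
	fields := make([]string, 0, len(filter))
	for k := range filter {
		fields = append(fields, k)
	}
	sort.Strings(fields) // field order must not change the identity
	return "{" + strings.Join(fields, ",") + "}"
}

// shapeID derives a stable identifier from the operation, collection,
// and normalized shape key, so the same shape matches across runs.
func shapeID(op, collection, shapeKey string) string {
	sum := sha256.Sum256([]byte(op + "|" + collection + "|" + shapeKey))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	key := normalizeShapeKey(map[string]interface{}{"status": "active", "user_id": 42})
	fmt.Println(shapeID("find", "users", key)) // same value on every run for this shape
}
```

Because the ID depends only on the triple, two runs that hit the same shape produce the same `shape_id`, which is what makes baseline matching possible.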
PLGM also keeps a lightweight in-memory baseline to show trend hints for matching shapes, e.g.:

- improved
- worse
- flat
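A minimal sketch of how such a trend hint might be classified against the baseline; the 10% band is an assumed threshold for illustration, not PLGM's documented behavior:

```go
package main

import "fmt"

// trendHint compares a shape's current average latency against the
// baseline recorded for the same shape_id in a previous run.
func trendHint(baselineMs, currentMs float64) string {
	if baselineMs <= 0 {
		return "flat" // no usable baseline for this shape
	}
	delta := (currentMs - baselineMs) / baselineMs
	switch {
	case delta <= -0.10:
		return "improved"
	case delta >= 0.10:
		return "worse"
	default:
		return "flat"
	}
}

func main() {
	fmt.Println(trendHint(250, 180)) // improved
	fmt.Println(trendHint(250, 320)) // worse
	fmt.Println(trendHint(250, 255)) // flat
}
```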
## Optional Explain Sampling (Off by Default)

An optional post-run explain mode can enrich the evidence for the top slow shapes.

Important design choices:

- disabled by default
- runs only post-run
- limited to the top-N shapes
- bounded by a maximum explain execution time
- falls back to heuristic messaging if explain is unavailable

If explain sampling is enabled, index-issue messages may be upgraded when evidence is observed (for example, an explain plan indicating `COLLSCAN`).

## Index Advice Philosophy

PLGM uses confidence-aware wording and does not overstate certainty.

Possible evidence levels:

- heuristic
- heuristic with index-overlap/no-overlap signals
- explain-based evidence (when enabled and successful)

Typical language intentionally uses cautious terms like:

- "possible missing index"
- "collection scan is possible"
- "validate with explain"

## Web UI Configuration

Path: `Advanced -> Insights Analysis`

Available controls:

- Enable Post-Run Insights Analysis
- Enable Post-Run Explain Sampling (Optional)
- Insights Sampling Rate
- Slow Threshold (ms)
- Max Retained Events
- Max Group Entries
- Explain Top N Shapes
- Explain Max Time (ms)

All settings are applied per run and included in the exported summary config.

## API Contract

`GET /api/insights`

Typical states:

- `inactive`: no collector/run context
- `pending`: run still active
- `ready`: completed report available
- `empty`: no sampled events in buffer
- `disabled`: insights disabled via configuration

The payload is read-only and designed for the UI or future dashboard consumers.
## Export Contract

`Download Summary` includes:

- the final benchmark summary fields
- an `insights` object identical to the post-run API/UI model
- redacted password handling preserved

## Configuration Reference

Config file keys:

- `insights_enabled`
- `insights_sampling_rate`
- `insights_slow_threshold_ms`
- `insights_max_events`
- `insights_max_groups`
- `insights_explain_enabled`
- `insights_explain_top_n`
- `insights_explain_max_time_ms`

Environment overrides:

- `PLGM_INSIGHTS_ENABLED`
- `PLGM_INSIGHTS_SAMPLING_RATE`
- `PLGM_INSIGHTS_SLOW_THRESHOLD_MS`
- `PLGM_INSIGHTS_MAX_EVENTS`
- `PLGM_INSIGHTS_MAX_GROUPS`
- `PLGM_INSIGHTS_EXPLAIN_ENABLED`
- `PLGM_INSIGHTS_EXPLAIN_TOP_N`
- `PLGM_INSIGHTS_EXPLAIN_MAX_TIME_MS`
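Putting the file keys together, a `config.yaml` fragment using the documented defaults looks like this (each key can be overridden by its corresponding `PLGM_*` environment variable):

```yaml
insights_enabled: true
insights_sampling_rate: 0.10
insights_slow_threshold_ms: 200
insights_max_events: 5000
insights_max_groups: 300
insights_explain_enabled: false
insights_explain_top_n: 5
insights_explain_max_time_ms: 1000
```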
## Recommended Starting Values

For general usage:

- sampling rate: `0.10`
- slow threshold: `200ms`
- max events: `5000`
- max groups: `300`
- explain sampling: disabled

For deeper troubleshooting (short test windows):

- sampling rate: `0.25` to `1.0`
- explain sampling: enabled
- top N shapes: `3` to `5`
- explain max time: `1000` to `3000`

## Use Cases

1. Fast post-run triage
   - Identify the top slow groups immediately after completion.

2. Collection hotspot detection
   - Detect which collections account for most slow patterns.

3. Safe index investigation shortlist
   - Generate candidate fields/patterns to validate in DBA workflows.

4. Iteration and timeline context
   - Compare behavior across iterations and time slices.

5. CI / automated benchmarking exports
   - Consume the structured `insights` JSON in pipelines and reports.

## Known Limitations

- Sampling means results are representative, not exhaustive.
- Heuristic index advice does not guarantee that a missing index is the root cause.
- Explain enrichment depends on representative sample availability and database access.
- Trend persistence is in-memory; it does not survive process restarts.
- Explain sampling is intentionally post-run only, to protect active benchmark performance.

## Future Enhancements

Potential next steps for a full insights dashboard:

- persistent historical run storage for long-term trend analysis
- richer explain-plan capture and comparison views
- cross-run diff reports and regression alerts
- deeper per-shape drill-down and filter playback tools

README.md

Lines changed: 21 additions & 1 deletion
```diff
@@ -87,6 +87,15 @@ Download the [`config.yaml`](./config.yaml) and make the necessary adjustments.
 
 For unit and integration test instructions (including Docker-based MongoDB setup), see [`TESTING.md`](./TESTING.md).
 
+### 5. Post-Run Insights (Shortcut)
+
+PLGM includes a post-run **Slow Query and Index Analysis** layer that is configurable from the Web UI (Advanced tab).
+
+Quick access:
+* Full guide: [`INSIGHTS.md`](./INSIGHTS.md)
+* Web UI path: `Advanced -> Insights Analysis`
+* Output: Dashboard insights panel + `Download Summary` JSON (`insights` section)
+
 ## The Interactive UI
 
 `plgm` features a completely embedded Web UI. It allows you to configure your database connection, upload custom workload schemas, adjust operation ratios, and monitor real-time throughput and latency without ever touching a YAML file. It has the same functionality as the CLI version, but with an awesome UI.
@@ -133,6 +142,7 @@ When running `plgm` with the `--webui` flag, you get access to a rich, browser-b
 * **Live Telemetry & Dashboard:** Watch operations per second (Find, Insert, Update, Delete) and latencies update in real-time with sub-second precision.
 * **The "Time Machine" Scrubber:** Pause the live feed and scrub backward through the benchmark timeline to investigate specific latency spikes or throughput drops.
 * **Real-Time CSV Export:** Configure and stream metrics to a local CSV file directly from the Advanced tab. Use the "Append" feature to stitch multiple benchmark runs into a single dataset.
+* **Post-Run Insights Analysis:** Review slow-query groups, affected collections, potential index issues, and recommendations after all iterations complete.
 * **Graceful Shutdown:** Click the **EXIT** button in the header to safely terminate the application directly from the browser, ensuring all background workers are cleaned up properly.
 
 ### 3. Configuration
@@ -603,7 +613,7 @@ Once the csv is exported, you can script your own method to plot its data. We ha
 ## Post-Run JSON Summary Report
 If you forget to enable the real-time CSV export, or if you just want a clean summary of your final results, PLGM provides a Download Summary Report button in the Web UI that appears the moment a workload finishes.
 
-This generates a downloadable JSON summary report that captures both the final performance metrics (total ops, average latencies, and throughput per operation type) alongside the exact configuration parameters used to achieve those results. Passwords are automatically redacted from this file for safe sharing.
+This generates a downloadable JSON summary report that captures the final performance metrics (total ops, average latencies, and throughput per operation type), the post-run insights analysis, and the exact configuration parameters used to achieve those results. Passwords are automatically redacted from this file for safe sharing.
 
 Example Summary Snippet:
@@ -621,6 +631,7 @@ Example Summary Snippet:
   },
   "operations": { ... },
   "average_latencies_ms": { ... },
+  "insights": { ... },
   "configuration": {
     "concurrency": "4",
     "find_batch_size": "10",
@@ -690,6 +701,15 @@ You can override any setting in `config.yaml` using environment variables. This
 | `csv_export_enabled` ||Continuously stream workload throughput metrics to a CSV file| `false` |
 | `csv_export_append` ||If true, appends to the file. If false, overwrites it.| `false` |
 | `csv_export_path` ||Path and metrics file name| `/tmp/plgm_metrics_export.csv` |
+| **Post-Run Insights** | | | |
+| `insights_enabled` | `PLGM_INSIGHTS_ENABLED` | Enable post-run slow-query/index analysis | `true` |
+| `insights_sampling_rate` | `PLGM_INSIGHTS_SAMPLING_RATE` | Sample rate for captured operation events (`0.01`-`1.0`) | `0.10` |
+| `insights_slow_threshold_ms` | `PLGM_INSIGHTS_SLOW_THRESHOLD_MS` | Latency threshold used to classify operations as slow | `200` |
+| `insights_max_events` | `PLGM_INSIGHTS_MAX_EVENTS` | Max retained sampled events in memory | `5000` |
+| `insights_max_groups` | `PLGM_INSIGHTS_MAX_GROUPS` | Max aggregated slow-shape groups | `300` |
+| `insights_explain_enabled` | `PLGM_INSIGHTS_EXPLAIN_ENABLED` | Enable optional post-run explain sampling | `false` |
+| `insights_explain_top_n` | `PLGM_INSIGHTS_EXPLAIN_TOP_N` | Number of top slow shapes to attempt explain on | `5` |
+| `insights_explain_max_time_ms` | `PLGM_INSIGHTS_EXPLAIN_MAX_TIME_MS` | Max server time per explain command | `1000` |
 | **Workload Control** | | | |
 | `concurrency` | `PLGM_CONCURRENCY` | Number of active worker goroutines | `50` |
 | `duration` | `PLGM_DURATION` | Test duration (Go duration string) | `5m`, `60s` |
```

internal/benchmark/runner.go

Lines changed: 1 addition & 0 deletions
```diff
@@ -99,6 +99,7 @@ func RunRawInjector(ctx context.Context, db *mongo.Database, cfg *config.AppConf
 	} else {
 		collector = stats.NewCollector()
 	}
+	collector.ConfigureInsights(cfg)
 
 	monitorDone := make(chan struct{})
```
internal/config/config.go

Lines changed: 80 additions & 0 deletions
```diff
@@ -53,6 +53,15 @@ type AppConfig struct {
 	CSVExportEnabled bool   `yaml:"csv_export_enabled"`
 	CSVExportAppend  bool   `yaml:"csv_export_append"`
 	CSVExportPath    string `yaml:"csv_export_path"`
+
+	InsightsEnabled          bool    `yaml:"insights_enabled"`
+	InsightsSamplingRate     float64 `yaml:"insights_sampling_rate"`
+	InsightsSlowThresholdMs  int     `yaml:"insights_slow_threshold_ms"`
+	InsightsMaxEvents        int     `yaml:"insights_max_events"`
+	InsightsMaxGroups        int     `yaml:"insights_max_groups"`
+	InsightsExplainEnabled   bool    `yaml:"insights_explain_enabled"`
+	InsightsExplainTopN      int     `yaml:"insights_explain_top_n"`
+	InsightsExplainMaxTimeMS int     `yaml:"insights_explain_max_time_ms"`
 }
 
 type WebUIConfig struct {
@@ -150,6 +159,16 @@ func applyUIDefaults(cfg *AppConfig) {
 	cfg.CSVExportEnabled = false
 	cfg.CSVExportAppend = false
 	cfg.CSVExportPath = "plgm_metrics_export.csv"
+
+	// --- INSIGHTS DEFAULTS ---
+	cfg.InsightsEnabled = true
+	cfg.InsightsSamplingRate = 0.10
+	cfg.InsightsSlowThresholdMs = 200
+	cfg.InsightsMaxEvents = 5000
+	cfg.InsightsMaxGroups = 300
+	cfg.InsightsExplainEnabled = false
+	cfg.InsightsExplainTopN = 5
+	cfg.InsightsExplainMaxTimeMS = 1000
 }
 
 // applyBaseDefaults sets low-level engine safety limits & remaining UI limits
@@ -167,6 +186,25 @@ func applyBaseDefaults(cfg *AppConfig) {
 		cfg.CSVExportPath = "plgm_metrics_export.csv"
 	}
 
+	if cfg.InsightsSamplingRate <= 0 || cfg.InsightsSamplingRate > 1 {
+		cfg.InsightsSamplingRate = 0.10
+	}
+	if cfg.InsightsSlowThresholdMs <= 0 {
+		cfg.InsightsSlowThresholdMs = 200
+	}
+	if cfg.InsightsMaxEvents <= 0 {
+		cfg.InsightsMaxEvents = 5000
+	}
+	if cfg.InsightsMaxGroups <= 0 {
+		cfg.InsightsMaxGroups = 300
+	}
+	if cfg.InsightsExplainTopN <= 0 {
+		cfg.InsightsExplainTopN = 5
+	}
+	if cfg.InsightsExplainMaxTimeMS <= 0 {
+		cfg.InsightsExplainMaxTimeMS = 1000
+	}
+
 	// Web UI Port
 	if cfg.WebUI.Port <= 0 {
 		cfg.WebUI.Port = 9999 // default if not specified via flag
@@ -488,6 +526,48 @@ func applyEnvOverrides(cfg *AppConfig) map[string]bool {
 		}
 	}
 
+	// --- Insights Overrides ---
+	if v := os.Getenv("PLGM_INSIGHTS_ENABLED"); v != "" {
+		if b, err := strconv.ParseBool(v); err == nil {
+			cfg.InsightsEnabled = b
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_SAMPLING_RATE"); v != "" {
+		if f, err := strconv.ParseFloat(v, 64); err == nil && f > 0 && f <= 1 {
+			cfg.InsightsSamplingRate = f
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_SLOW_THRESHOLD_MS"); v != "" {
+		if n, err := strconv.Atoi(v); err == nil && n > 0 {
+			cfg.InsightsSlowThresholdMs = n
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_MAX_EVENTS"); v != "" {
+		if n, err := strconv.Atoi(v); err == nil && n > 0 {
+			cfg.InsightsMaxEvents = n
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_MAX_GROUPS"); v != "" {
+		if n, err := strconv.Atoi(v); err == nil && n > 0 {
+			cfg.InsightsMaxGroups = n
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_EXPLAIN_ENABLED"); v != "" {
+		if b, err := strconv.ParseBool(v); err == nil {
+			cfg.InsightsExplainEnabled = b
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_EXPLAIN_TOP_N"); v != "" {
+		if n, err := strconv.Atoi(v); err == nil && n > 0 {
+			cfg.InsightsExplainTopN = n
+		}
+	}
+	if v := os.Getenv("PLGM_INSIGHTS_EXPLAIN_MAX_TIME_MS"); v != "" {
+		if n, err := strconv.Atoi(v); err == nil && n > 0 {
+			cfg.InsightsExplainMaxTimeMS = n
+		}
+	}
+
 	return overrides
 }
```