# Post-Run Insights: Slow Query and Index Analysis

This document explains PLGM's post-run insights layer in detail.

The feature provides a structured analysis after benchmark completion, including:
- slow operation groups
- affected collections
- normalized query-shape groupings
- cautious, evidence-based index guidance
- export-ready JSON data for downstream dashboards

## What It Is

The insights layer is a **foundational analytics pass** designed to run after all iterations are complete.

It is intentionally separated from real-time charts to keep runtime overhead bounded and predictable.

## Where It Appears

After completion, insights are available in:
- the Web UI dashboard panel: `POST-RUN SLOW QUERY & INDEX ANALYSIS`
- the API endpoint: `GET /api/insights`
- the `Download Summary` JSON export, under the `insights` section

## When It Runs

Insights are finalized only after workloads finish.

- While a run is active, `GET /api/insights` returns `metadata.status = pending`.
- Once complete, the endpoint returns the final analysis (`ready` / `empty` / `disabled`).

This behavior avoids presenting partial or misleading findings during execution.

## Data Collection Model

PLGM captures sampled operation events during workload execution, with bounded retention.

Each sampled event may include:
- operation type
- database and collection
- normalized shape key and shape summary
- extracted filter fields (when applicable)
- duration
- success/failure
- iteration index
- timestamp

Retention characteristics:
- sampled (configurable sampling rate)
- bounded ring buffer (`insights_max_events`)
- bounded aggregation cardinality (`insights_max_groups`)

This prevents unbounded memory growth while preserving useful signal.
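
The sampling-and-retention model above can be sketched as follows. This is a minimal illustration, not PLGM's actual implementation; the class and field names are hypothetical:

```python
import random
from collections import deque

class InsightsCollector:
    """Minimal sketch of a sampled, bounded event collector (hypothetical names)."""

    def __init__(self, sampling_rate=0.10, max_events=5000):
        self.sampling_rate = sampling_rate
        # A deque with maxlen acts as a ring buffer: oldest events are evicted first
        self.events = deque(maxlen=max_events)

    def record(self, event):
        # Sampling keeps per-operation overhead bounded and predictable
        if random.random() < self.sampling_rate:
            self.events.append(event)

# With rate 1.0 every event is sampled; the buffer still stays bounded
collector = InsightsCollector(sampling_rate=1.0, max_events=3)
for i in range(5):
    collector.record({"op": "find", "duration_ms": i})
# Only the 3 most recent events are retained
```

The combination of a sampling probability and a fixed-size buffer is what gives the two separate guarantees described above: bounded CPU cost per operation and bounded memory overall.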

## What Insights Contains

Top-level sections in the final report:
- `summary`
- `slow_queries`
- `affected_collections`
- `query_shapes`
- `potential_index_issues`
- `recommendations`
- `per_iteration`
- `time_slices`
- `metadata`
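
An abbreviated report might look like the following. The structure mirrors the section names above, but the field names inside each section and all values are invented for illustration:

```json
{
  "summary": { "total_sampled": 4821, "slow_count": 137 },
  "slow_queries": [
    {
      "shape_id": "a1b2c3d4e5f6",
      "operation": "find",
      "collection": "orders",
      "count": 42,
      "avg_ms": 512.4
    }
  ],
  "affected_collections": [ { "collection": "orders", "slow_count": 42 } ],
  "metadata": { "status": "ready" }
}
```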

## Stable Shape IDs and Cross-Run Trends

Each shape group has a stable `shape_id` derived from:
- operation
- collection
- normalized shape key

This enables consistent identity across runs.
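
A stable identifier like this can be derived by hashing the identity tuple. The exact hash function and truncation PLGM uses are not specified in this document, so the sketch below is illustrative:

```python
import hashlib

def shape_id(operation, collection, shape_key):
    """Derive a stable, short identifier from a shape's identity tuple (illustrative)."""
    raw = f"{operation}|{collection}|{shape_key}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]

# Same inputs always yield the same id, across runs and processes
a = shape_id("find", "orders", '{"status": 1, "created_at": 1}')
b = shape_id("find", "orders", '{"status": 1, "created_at": 1}')
assert a == b
```

Deriving the id from content rather than assigning it at runtime is what makes cross-run matching (and therefore trend hints) possible without persistent storage.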

PLGM also keeps a lightweight in-memory baseline to show trend hints for matching shapes, e.g.:
- improved
- worse
- flat

## Optional Explain Sampling (Off by Default)

An optional post-run explain mode can enrich evidence for top slow shapes.

Important design choices:
- disabled by default
- runs only post-run
- limited to the top-N shapes
- bounded by a maximum explain execution time
- falls back to heuristic messaging if explain is unavailable

If explain sampling is enabled, index issue messages may be upgraded when evidence is observed (for example, explain indicating `COLLSCAN`).
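
Detecting a collection scan from an explain document generally means walking the winning plan's stage tree. A minimal sketch over MongoDB's `queryPlanner` output shape (illustrative; not PLGM's actual code):

```python
def has_collscan(explain_doc):
    """Walk the winning plan's stage tree looking for a COLLSCAN stage."""
    plan = explain_doc.get("queryPlanner", {}).get("winningPlan", {})

    def walk(stage):
        if not stage:
            return False
        if stage.get("stage") == "COLLSCAN":
            return True
        # Child stages may appear under "inputStage" (single) or "inputStages" (list)
        children = list(stage.get("inputStages", []))
        if "inputStage" in stage:
            children.append(stage["inputStage"])
        return any(walk(child) for child in children)

    return walk(plan)
```

A function like this would be fed the output of `cursor.explain()` or the `explain` database command; walking recursively matters because `COLLSCAN` often sits beneath a `FETCH` or `SORT` stage rather than at the root.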

## Index Advice Philosophy

PLGM uses confidence-aware wording and does not overstate certainty.

Possible evidence levels:
- heuristic
- heuristic with index-overlap/no-overlap signals
- explain-based evidence (when enabled and successful)

The wording intentionally uses cautious terms such as:
- "possible missing index"
- "collection scan is possible"
- "validate with explain"

## Web UI Configuration

Path: `Advanced -> Insights Analysis`

Available controls:
- Enable Post-Run Insights Analysis
- Enable Post-Run Explain Sampling (Optional)
- Insights Sampling Rate
- Slow Threshold (ms)
- Max Retained Events
- Max Group Entries
- Explain Top N Shapes
- Explain Max Time (ms)

All settings are applied per run and included in the exported summary config.

## API Contract

`GET /api/insights`

Typical states:
- `inactive`: no collector/run context
- `pending`: run still active
- `ready`: completed report available
- `empty`: no sampled events in buffer
- `disabled`: insights disabled via configuration

The payload is read-only and designed for UI or future dashboard consumers.
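
A consumer polling this endpoint should branch on `metadata.status`. A minimal sketch of that branching (the function and its return values are invented for illustration; only the status names come from the contract above):

```python
def interpret_insights(payload):
    """Map a GET /api/insights payload to a consumer action (illustrative)."""
    status = payload.get("metadata", {}).get("status", "inactive")
    if status == "pending":
        return "retry-later"      # run still active; poll again after completion
    if status == "ready":
        return "render-report"    # full analysis is available
    if status == "empty":
        return "show-no-data"     # run finished but nothing was sampled
    if status == "disabled":
        return "hide-panel"       # insights turned off via configuration
    return "hide-panel"           # inactive: no collector/run context
```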

## Export Contract

`Download Summary` includes:
- the final benchmark summary fields
- an `insights` object identical to the post-run API/UI model
- redacted password handling, preserved in the export
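
Because the exported `insights` object mirrors the API model, downstream tooling can consume the file directly. A small sketch (the file path and `slow_queries` field access are assumptions based on the section structure described earlier):

```python
import json

def top_slow_queries(summary_path, limit=5):
    """Load an exported summary and return its top slow-query groups (illustrative)."""
    with open(summary_path, "r", encoding="utf-8") as fh:
        summary = json.load(fh)
    insights = summary.get("insights", {})
    return insights.get("slow_queries", [])[:limit]
```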

## Configuration Reference

Config file keys:
- `insights_enabled`
- `insights_sampling_rate`
- `insights_slow_threshold_ms`
- `insights_max_events`
- `insights_max_groups`
- `insights_explain_enabled`
- `insights_explain_top_n`
- `insights_explain_max_time_ms`

Environment overrides:
- `PLGM_INSIGHTS_ENABLED`
- `PLGM_INSIGHTS_SAMPLING_RATE`
- `PLGM_INSIGHTS_SLOW_THRESHOLD_MS`
- `PLGM_INSIGHTS_MAX_EVENTS`
- `PLGM_INSIGHTS_MAX_GROUPS`
- `PLGM_INSIGHTS_EXPLAIN_ENABLED`
- `PLGM_INSIGHTS_EXPLAIN_TOP_N`
- `PLGM_INSIGHTS_EXPLAIN_MAX_TIME_MS`
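
Assuming environment variables take precedence over config-file keys (the usual convention; PLGM's exact precedence rules are not specified in this document), resolution can be sketched as:

```python
import os

def resolve_setting(config, key, env_prefix="PLGM_", cast=str):
    """Resolve a setting, letting an environment variable override the config file (illustrative)."""
    env_name = env_prefix + key.upper()   # insights_max_events -> PLGM_INSIGHTS_MAX_EVENTS
    if env_name in os.environ:
        return cast(os.environ[env_name])
    return config.get(key)

config = {"insights_sampling_rate": 0.10}
os.environ["PLGM_INSIGHTS_SAMPLING_RATE"] = "0.25"
rate = resolve_setting(config, "insights_sampling_rate", cast=float)
# The environment override wins: rate is 0.25, not the config file's 0.10
```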

## Recommended Starting Values

For general usage:
- sampling rate: `0.10`
- slow threshold: `200` ms
- max events: `5000`
- max groups: `300`
- explain sampling: disabled

For deeper troubleshooting (short test windows):
- sampling rate: `0.25` to `1.0`
- explain sampling: enabled
- top N shapes: `3` to `5`
- explain max time: `1000` to `3000` ms
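
Mapped onto the configuration keys listed in the reference above, the general-usage starting point might look like this (JSON is used here for illustration only; this document does not specify PLGM's actual config file format):

```json
{
  "insights_enabled": true,
  "insights_sampling_rate": 0.10,
  "insights_slow_threshold_ms": 200,
  "insights_max_events": 5000,
  "insights_max_groups": 300,
  "insights_explain_enabled": false
}
```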

## Use Cases

1. Fast post-run triage
   - Identify top slow groups immediately after completion.

2. Collection hotspot detection
   - Detect which collections account for most slow patterns.

3. Safe index investigation shortlist
   - Generate candidate fields/patterns to validate with DBA workflows.

4. Iteration and timeline context
   - Compare behavior across iterations and time slices.

5. CI / automated benchmarking exports
   - Consume structured `insights` JSON for pipelines/reports.

## Known Limitations

- Sampling means results are representative, not exhaustive.
- Heuristic index advice does not guarantee that a missing index is the root cause.
- Explain enrichment depends on representative sample availability and access.
- Trend persistence is in-memory; it does not survive process restarts.
- Explain sampling is intentionally post-run only, to protect active benchmark performance.

## Future Enhancements

Potential next steps for a full insights dashboard:
- persistent historical run storage for long-term trend analysis
- richer explain-plan capture and comparison views
- cross-run diff reports and regression alerts
- deeper per-shape drill-down and filter playback tools