ci(perf): Add Core Web Vitals measurement to benchmarks (INP, LCP, CLS) #39716
Conversation
CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.
Force-pushed da2ee35 to d1198dd
All alerts resolved. This PR previously contained dependency changes with security issues that have been resolved, removed, or ignored.
Force-pushed d1198dd to 62ccd2a
@metamaskbot update-policies
Policy update failed. You can review the logs or retry the policy update.
Force-pushed 62ccd2a to 58ba0e0
Force-pushed 58ba0e0 to ef3b66c
Force-pushed ef3b66c to 379a3a0
Add Core Web Vitals instrumentation with attribution and Sentry reporting
Implements INP, LCP, and CLS observers using the `web-vitals-attribution`
build. Each observer:
- Stores latest metric value + rating in module-level state
- Reports to Sentry via `globalThis.sentry` (`setMeasurement`, `setTag`,
`setContext` for attribution, breadcrumb for poor/needs-improvement)
- Computes ratings using Google's thresholds:
INP: good <200ms, poor >500ms
LCP: good <2500ms, poor >4000ms
CLS: good <0.1, poor >0.25
Exposes `getWebVitalsMetrics()` and `resetWebVitalsMetrics()` for
E2E benchmark retrieval via `stateHooks`.
Naming follows `benchmark.{metric}` / `{metric}.rating` convention
for Sentry dashboard compatibility.
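The thresholds above can be expressed as a small pure helper. This is a minimal sketch, not the repo's actual implementation; the names `THRESHOLDS` and `getRating` are assumptions, and it uses the `<=` boundary behavior of the web-vitals library (a value exactly at the threshold gets the better rating):

```typescript
type Rating = 'good' | 'needs-improvement' | 'poor';

// Google's Core Web Vitals thresholds (values from the commit message above;
// constant and function names are illustrative assumptions).
const THRESHOLDS = {
  inp: { good: 200, poor: 500 }, // milliseconds
  lcp: { good: 2500, poor: 4000 }, // milliseconds
  cls: { good: 0.1, poor: 0.25 }, // unitless layout-shift score
} as const;

function getRating(metric: keyof typeof THRESHOLDS, value: number): Rating {
  const { good, poor } = THRESHOLDS[metric];
  if (value <= good) {
    return 'good'; // value at the threshold gets the better rating
  }
  if (value <= poor) {
    return 'needs-improvement';
  }
  return 'poor';
}
```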
Covers all three observers (INP, LCP, CLS) and shared infrastructure:
- Observer registration via `onINP`/`onLCP`/`onCLS` callbacks
- Sentry reporting: `setMeasurement`, `setTag`, `setContext`, breadcrumbs
- Rating thresholds: good, needs-improvement, poor for each metric
- Metric storage: `getWebVitalsMetrics()` returns a copy (no mutation leak)
- `resetWebVitalsMetrics()` clears all values to null
- Graceful degradation when `globalThis.sentry` is undefined
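The storage behavior tested above (copy on read, reset to null) can be sketched like this; the field names follow the JSON artifact shown later in this PR, but the module shape itself is an assumption:

```typescript
interface WebVitalsMetrics {
  inp: number | null;
  inpRating: string | null;
  lcp: number | null;
  lcpRating: string | null;
  cls: number | null;
  clsRating: string | null;
}

const EMPTY: WebVitalsMetrics = {
  inp: null,
  inpRating: null,
  lcp: null,
  lcpRating: null,
  cls: null,
  clsRating: null,
};

// Module-level state, updated by the observer callbacks.
let metrics: WebVitalsMetrics = { ...EMPTY };

// Returns a shallow copy so callers cannot mutate module state
// ("no mutation leak" in the test list above).
function getWebVitalsMetrics(): WebVitalsMetrics {
  return { ...metrics };
}

// Clears all values back to null, e.g. between benchmark iterations.
function resetWebVitalsMetrics(): void {
  metrics = { ...EMPTY };
}
```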
Calls `initWebVitals()` at the top of `startApp()`, before trace setup. This starts INP, LCP, and CLS observers as early as possible so LCP captures the initial render and CLS tracks shifts from first paint. INP only reports meaningful data after user interactions, so early init has no downside — it just ensures the observer is ready.
…peline

Core measurement type mirroring `ui/helpers/utils/web-vitals.ts`. All fields are nullable — INP is null before any user interaction, LCP may be null on non-initial loads, and CLS may be null if no layout shifts occurred.

TODO: consolidate with the duplicate `WebVitalsMetrics` type in `ui/helpers/utils/web-vitals.ts` into a single shared definition.
- `RatingDistribution`: categorical count of good/needs-improvement/poor/null across benchmark runs. Enables quality-gate checks like "80% of runs rated 'good' for INP".
- `WebVitalsAggregated`: per-metric `TimerStatistics` (mean, percentiles, outlier counts) plus rating distributions. Reuses the existing `TimerStatistics` shape — CLS values work here because the upstream `calculateWebVitalsStatistics` uses metric-specific bounds.
- `WebVitalsSummary`: holds both per-run snapshots (for Sentry spans) and aggregated stats (for dashboards). The dual representation preserves granularity while providing a quick summary.
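A `RatingDistribution` as described could be tallied like this (the helper name `tallyRatings` is hypothetical; the `null` bucket counts runs where the observer never fired):

```typescript
type Rating = 'good' | 'needs-improvement' | 'poor';

interface RatingDistribution {
  good: number;
  'needs-improvement': number;
  poor: number;
  null: number; // runs where the metric was never observed
}

function tallyRatings(ratings: Array<Rating | null>): RatingDistribution {
  const dist: RatingDistribution = {
    good: 0,
    'needs-improvement': 0,
    poor: 0,
    null: 0,
  };
  for (const rating of ratings) {
    if (rating === null) {
      dist.null += 1;
    } else {
      dist[rating] += 1;
    }
  }
  return dist;
}
```

A quality gate like "80% of runs rated 'good'" is then just `dist.good / ratings.length >= 0.8`.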
Builds ready [80dec5a]
UI Startup Metrics (1376 ± 96 ms)
📊 Page Load Benchmark Results · Current Commit · 📄 Localhost MetaMask Test Dapp · Samples: 100 · Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
Builds ready [8be1199]
UI Startup Metrics (1422 ± 107 ms)
📊 Page Load Benchmark Results · Current Commit · 📄 Localhost MetaMask Test Dapp · Samples: 100 · Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
Builds ready [bfd0057]
⚡ Performance Benchmarks (1386 ± 118 ms)
🌐 Dapp Page Load Benchmarks · Current Commit · 📄 Localhost MetaMask Test Dapp · Samples: 100 · Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
… persona constant

PR comment now renders a Core Web Vitals table under Performance Benchmarks (between Interaction Benchmarks and Startup Benchmarks):
- INP: p75 + p95 (both meaningful for interaction latency)
- LCP: p75 only (p95 is noise with small CI samples), marked CI-only
- CLS: p75 only (unitless; the rating distribution is more informative than percentiles)
- Rating distribution column (good/NI/poor) for a quick regression signal

`fetchAndBuildWebVitalsSection` reads from interaction preset artifacts, which are the only benchmarks collecting web vitals via `collectWebVitals`. Also replaces the hardcoded `'standard'` with `BENCHMARK_PERSONA.STANDARD` in `send-to-sentry.ts` for consistency with adjacent usage.
Builds ready [6b62a8c]
⚡ Performance Benchmarks (1365 ± 107 ms)
🌐 Dapp Page Load Benchmarks · Current Commit · 📄 Localhost MetaMask Test Dapp · Samples: 100 · Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
…ting

- `getRating` in `web-vitals.ts`: change `<` to `<=` to match web-vitals library behavior — a value at the threshold gets the better rating (e.g. 200ms INP is "good", not "needs-improvement")
- Remove unused `createEmptyWebVitalsMetrics` from `web-vitals-collector.ts` and its barrel export
- Remove `Sentry.setMeasurement` from `send-to-sentry.ts` — in Sentry v10 `setMeasurement` attaches to the root transaction, not the child span created by `startSpan`; values are already captured via `span.setAttribute`
- Simplify `WEB_VITALS_BOUNDS` in `statistics.ts`: replace the `allowZero` flag with a uniform `value < bounds.min` check — min=1 for INP/LCP now correctly includes 1ms (previously excluded by `<=`), and min=0 for CLS still allows 0 (perfect stability)
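The simplified bounds check described in that commit can be sketched as follows. The max ceilings for INP/LCP here are illustrative assumptions; the CLS cap of [0, 10] and the min values are stated in this PR:

```typescript
// Metric-specific sanity bounds. min = 1 for INP/LCP (a 0ms timing is a
// measurement glitch, but 1ms is a valid sample); min = 0 for CLS (perfect
// stability). Max ceilings are assumptions for illustration.
const WEB_VITALS_BOUNDS = {
  inp: { min: 1, max: 60_000 },
  lcp: { min: 1, max: 120_000 },
  cls: { min: 0, max: 10 },
} as const;

// Uniform check replacing the old `allowZero` flag: a value is discarded
// only if it falls outside [min, max].
function isSane(
  metric: keyof typeof WEB_VITALS_BOUNDS,
  value: number,
): boolean {
  const bounds = WEB_VITALS_BOUNDS[metric];
  return !(value < bounds.min) && value <= bounds.max;
}
```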
Builds ready [5097afc]
⚡ Performance Benchmarks (1361 ± 99 ms)
🌐 Dapp Page Load Benchmarks · Current Commit · 📄 Localhost MetaMask Test Dapp · Samples: 100 · Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
- `collectWebVitals` now calls `resetWebVitalsMetrics()` after reading, preventing stale metrics from carrying over when an observer doesn't fire in a later iteration (e.g. LCP fires only on the initial load)
- Replace `Array<T>` with `T[]` in `buildWebVitalsSection` (eslint)
- Fix prettier formatting in `buildUiStartupSection`
- Remove trailing newline in `web-vitals-collector.ts`
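The read-then-reset pattern can be sketched against a minimal driver interface. The `stateHooks` call sites mirror this PR's description, but the driver shape and script strings here are assumptions:

```typescript
interface Driver {
  executeScript(script: string): Promise<unknown>;
}

// Read the current snapshot, then reset so a metric that doesn't fire in the
// next iteration (e.g. LCP, which only fires on the initial load) reads as
// null instead of a stale carried-over value.
async function collectWebVitals(driver: Driver): Promise<unknown> {
  const snapshot = await driver.executeScript(
    'return window.stateHooks.getWebVitalsMetrics()',
  );
  await driver.executeScript('window.stateHooks.resetWebVitalsMetrics()');
  return snapshot;
}
```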
Builds ready [55ba7e2]
⚡ Performance Benchmarks (1382 ± 107 ms)
🌐 Dapp Page Load Benchmarks · Current Commit · 📄 Localhost MetaMask Test Dapp · Samples: 100 · Summary
📈 Detailed Results
Bundle size diffs [🚨 Warning! Bundle size has increased!]
Description
Problem
Sentry collects ~19.5M pageload transactions from the MetaMask extension, but every one has zero performance measurements.
`browserTracingIntegration` is active at a 0.75% sample rate, yet produces empty transaction shells because Chrome does not emit paint/navigation timing entries on `chrome-extension://` pages — no LCP, FCP, CLS, or TTFB. Timer-based benchmarks (page load, confirm tx) measure specific operations but can't answer: Is the extension janky? Did this release regress interaction responsiveness? Which element causes the worst layout shift?
Solution
Add Core Web Vitals instrumentation via Google's `web-vitals` library, targeting two distinct environments:

1. Production (error correlation) — INP observers fire on `chrome-extension://` pages (unlike LCP/CLS, INP uses `PerformanceEventTiming`, which Chrome does support for extensions). When callbacks fire, they enrich the Sentry scope with rating tags (`inp.rating:poor`), attribution context (which element/interaction type), and breadcrumbs — attaching to existing transactions at zero additional cost. This lets us filter errors by performance context: "show me errors where the user was experiencing poor responsiveness."

2. CI benchmarks (regression detection) — All three metrics (INP, LCP, CLS) fire reliably in E2E benchmark runs. Per-run snapshots are collected via `stateHooks`, aggregated with the same IQR + z-score outlier detection used for timer statistics (with metric-specific sanity bounds — CLS is unitless, INP/LCP have different ceilings), and sent to Sentry as spans with measurements plus structured logs with aggregated percentiles.

Key intervention points

- `ui/index.js`: `initWebVitals()` call at startup
- `web-vitals.ts`: observers and `setMeasurement` bridge
- `confirm-tx`, `load-new-account`: `collectWebVitals(driver)` after user actions
- `statistics.ts`, `runner.ts`: aggregation
- `send-to-sentry.ts`: Sentry reporting
- `metamaskbot-build-announce`: `buildWebVitalsSection` renders the CWV table inside Performance Benchmarks
- `scripts.js`: `web-vitals` added to the Babelify `only` list — the package declares `"type": "module"`, so Browserify resolves to the ESM stub and needs Babelify to transpile it to CJS (same pattern as `lightweight-charts`, `firebase`, etc.)

What this unlocks
- Filtering errors by `inp.rating:poor` on existing transactions
- Attribution via `inp_attribution.interactionTarget`

Platform constraints (documented in source)
Chrome's `chrome-extension://` protocol does not emit `PerformancePaintTiming` or `largest-contentful-paint` entries. LCP and CLS observers will not fire in production — they exist for CI benchmark coverage only. INP (`PerformanceEventTiming`) is the only metric expected to report in production. This is documented in the module-level JSDoc and per-observer notes.

Output examples
1. PR comment — Core Web Vitals table
Rendered inside the Performance Benchmarks collapsible, between Interaction Benchmarks and Startup Benchmarks. Uses `buildWebVitalsSection` from `development/metamaskbot-build-announce/utils.ts`.

📊 Core Web Vitals
Column semantics: a dash (`-`) for CLS (unitless; the distribution is more informative). Metrics with no data across all flows (all nulls) are omitted entirely — the section won't render if no web vitals were captured.
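For reference, a nearest-rank percentile over a sorted copy is enough to produce p75/p95 columns from per-run samples. This is a sketch only; the repo's actual `TimerStatistics` pipeline may interpolate between ranks and remove outliers first:

```typescript
// Nearest-rank percentile on a sorted copy; returns null for an empty
// sample set (matching the "omit metrics with no data" behavior above).
function percentile(values: number[], p: number): number | null {
  if (values.length === 0) {
    return null;
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```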
2. Benchmark JSON artifact
Each interaction benchmark preset produces a JSON artifact (e.g. `benchmark-chrome-browserify-interactionUserActions.json`) containing `webVitals` alongside timer statistics. The `webVitals` field preserves per-run snapshots for Sentry spans and includes aggregated statistics:

```json
{
  "confirmTx": {
    "testTitle": "benchmark-user-actions-confirm-tx",
    "persona": "standard",
    "benchmarkType": "userAction",
    "mean": { "confirm_tx": 6046, "total": 6046 },
    "p75": { "confirm_tx": 6055, "total": 6055 },
    "p95": { "confirm_tx": 6055, "total": 6055 },
    "webVitals": {
      "runs": [
        { "inp": 120, "inpRating": "good", "lcp": 1800, "lcpRating": "good", "cls": 0.01, "clsRating": "good", "iteration": 0 },
        { "inp": 135, "inpRating": "good", "lcp": 1950, "lcpRating": "good", "cls": 0.008, "clsRating": "good", "iteration": 1 },
        { "inp": 190, "inpRating": "good", "lcp": 2100, "lcpRating": "good", "cls": null, "clsRating": null, "iteration": 2 },
        { "inp": 128, "inpRating": "good", "lcp": 1750, "lcpRating": "good", "cls": 0.012, "clsRating": "good", "iteration": 3 },
        { "inp": 245, "inpRating": "needs-improvement", "lcp": 1820, "lcpRating": "good", "cls": 0.009, "clsRating": "good", "iteration": 4 }
      ],
      "aggregated": {
        "inp": { "id": "inp", "mean": 163.6, "min": 120, "max": 245, "stdDev": 50.2, "cv": 0.307, "p50": 135, "p75": 192, "p95": 245, "p99": 245, "samples": 5, "outliers": 0, "dataQuality": "good" },
        "lcp": { "id": "lcp", "mean": 1884, "min": 1750, "max": 2100, "stdDev": 135.3, "cv": 0.072, "p50": 1820, "p75": 1950, "p95": 2100, "p99": 2100, "samples": 5, "outliers": 0, "dataQuality": "good" },
        "cls": { "id": "cls", "mean": 0.0098, "min": 0.008, "max": 0.012, "stdDev": 0.0016, "cv": 0.167, "p50": 0.009, "p75": 0.012, "p95": 0.012, "p99": 0.012, "samples": 4, "outliers": 0, "dataQuality": "good" },
        "ratings": {
          "inp": { "good": 4, "needs-improvement": 1, "poor": 0, "null": 0 },
          "lcp": { "good": 5, "needs-improvement": 0, "poor": 0, "null": 0 },
          "cls": { "good": 4, "needs-improvement": 0, "poor": 0, "null": 1 }
        }
      }
    }
  }
}
```

Note: `aggregated` statistics use the same `TimerStatistics` type as timer data (mean, stdDev, p50–p99, outlier count, data quality) but with metric-specific sanity bounds — CLS capped at [0, 10], INP/LCP in milliseconds with higher ceilings.

3. Sentry CI reporting (`send-to-sentry.ts`)

Web vitals are reported to Sentry via two separate paths, distinct from timer data:
Per-run spans — each benchmark iteration becomes a span with measurements:
Aggregated structured log — one per benchmark with summary statistics:
4. Production Sentry enrichment (`web-vitals.ts`)

In production, only INP fires on `chrome-extension://` pages. When it does, the Sentry scope is enriched with no additional transactions or events — the data piggybacks on the next error:

Tag (filterable in Sentry Discover):
Context (attached to error events):
Breadcrumb (visible in error timeline, only for poor/needs-improvement):
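Putting the three enrichment pieces together, here is a hedged sketch of the INP callback path. The `Sentryish` interface and field names approximate what this PR describes; the exact shapes in the codebase are assumptions:

```typescript
interface Sentryish {
  setTag(key: string, value: string): void;
  setContext(key: string, context: Record<string, unknown>): void;
  addBreadcrumb(crumb: { category: string; message: string; level: string }): void;
}

// Enrich the existing Sentry scope when INP fires: no new transactions or
// events are created, and the data rides along on the next error event.
// Accepts undefined for graceful degradation when globalThis.sentry is absent.
function reportInpToSentry(
  sentry: Sentryish | undefined,
  value: number,
  rating: string,
  interactionTarget: string,
): void {
  if (!sentry) {
    return;
  }
  sentry.setTag('inp.rating', rating);
  sentry.setContext('inp_attribution', { interactionTarget, value });
  // Breadcrumbs only for degraded ratings, per the description above.
  if (rating === 'poor' || rating === 'needs-improvement') {
    sentry.addBreadcrumb({
      category: 'web-vitals',
      message: `INP ${value}ms rated ${rating}`,
      level: 'warning',
    });
  }
}
```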
Relation to other work
- `PerformanceObserver('longtask')` for Total Blocking Time. Together, INP (symptom) + TBT (cause) provide a complete picture of main-thread responsiveness.

The `@ts-expect-error suppress CommonJS vs ECMAScript error` on the `web-vitals/attribution` import follows the codebase-wide convention used for `chart.js`, `react-chartjs-2`, and `lightweight-charts` (TS1479 — the repo is CJS but the package declares `"type": "module"`; Webpack resolves via the `exports` `default` condition, Browserify falls back to the ESM stub, which Babelify transpiles to CJS).

Changelog
CHANGELOG entry: null
Related issues
Fixes: MetaMask/MetaMask-planning#6735
Fixes: MetaMask/MetaMask-planning#6736
Fixes: MetaMask/MetaMask-planning#6739
Related epic: MetaMask/MetaMask-planning#6741
Manual testing steps
1. `yarn test:unit ui/helpers/utils/web-vitals.test.ts`
2. `yarn start`
3. `yarn build:test`
4. `yarn test:e2e:benchmark` (or a specific user-action benchmark)
5. Verify `webVitals` appears in the JSON benchmark output alongside `timers`

Screenshots/Recordings
Before
N/A — no UI changes. This is instrumentation-only.
After
N/A — no UI changes. This is instrumentation-only.
Pre-merge author checklist
Pre-merge reviewer checklist
Note
Medium Risk
Touches UI startup initialization and CI telemetry/reporting paths; issues would mainly affect performance instrumentation/CI outputs rather than core wallet behavior, but Sentry noise/overhead or observer errors could impact startup if misbehaving.
Overview
Adds Core Web Vitals (INP/LCP/CLS) instrumentation via the new `web-vitals` dependency, initializing observers on UI startup and exposing getters/resetters through `stateHooks` for test runs.

Extends the E2E benchmark pipeline to collect per-iteration web vitals from interaction flows, aggregate them with metric-specific sanity bounds/outlier handling, and include the summaries in benchmark JSON output.
Updates CI reporting and PR feedback: `send-to-sentry.ts` now emits per-run web vitals as Sentry spans (plus an aggregated structured log), and the build announce bot renders a new "Core Web Vitals" table section; build/LavaMoat policies are adjusted to allow/transpile the ESM-only `web-vitals` package.

Written by Cursor Bugbot for commit 55ba7e2. This will update automatically on new commits.