Date: 2025-10-31
Issue: Page loading is VERY SLOW
Root Cause Analysis: Multiple compounding factors
Location: parquet_cesium.qmd lines 131-157
Current Behavior:

```sql
WITH geo_classification AS (
  SELECT
    geo.pid, geo.latitude, geo.longitude,
    MAX(CASE WHEN e.p = 'sample_location' THEN 1 ELSE 0 END) as is_sample_location,
    MAX(CASE WHEN e.p = 'site_location' THEN 1 ELSE 0 END) as is_site_location
  FROM nodes geo
  JOIN nodes e ON (geo.row_id = e.o[1])
  WHERE geo.otype = 'GeospatialCoordLocation'
  GROUP BY geo.pid, geo.latitude, geo.longitude
)
SELECT * FROM geo_classification
```

Why It's Slow:
- Self-join of the `nodes` table with itself on an array element match (`e.o[1]`)
- Scans ALL `GeospatialCoordLocation` nodes (likely thousands)
- GROUP BY with aggregation (MAX + CASE) for classification
- Runs BEFORE user can interact with page
- DuckDB-WASM must load relevant parquet chunks via HTTP
Estimated Impact: 🔴 80% of perceived slowness
Three Queries (Eric's, Path 1, Path 2) all follow this pattern:

```sql
FROM nodes AS geo
JOIN nodes AS rel_se ON (rel_se.p = 'sample_location' AND list_contains(rel_se.o, geo.row_id))
JOIN nodes AS se ON (rel_se.s = se.row_id AND se.otype = 'SamplingEvent')
JOIN nodes AS rel_site ON (se.row_id = rel_site.s AND rel_site.p = 'sampling_site')
JOIN nodes AS site ON (rel_site.o[1] = site.row_id AND site.otype = 'SamplingSite')
JOIN nodes AS rel_samp ON (rel_samp.p = 'produced_by' AND list_contains(rel_samp.o, se.row_id))
JOIN nodes AS samp ON (rel_samp.s = samp.row_id AND samp.otype = 'MaterialSampleRecord')
WHERE geo.pid = ?
```

Why It's Slow:
- 6 self-joins on the same `nodes` table
- 2 uses of `list_contains()` for backward edge traversal (array scans)
- Multi-hop graph traversal (5 hops: geo → event → site → event → sample)
- Repeated for EACH clicked point
Estimated Impact: 🟡 15% of perceived slowness (only after click)
Data Source: https://storage.googleapis.com/isamplesorg/data/oc_isamples_pqg.parquet
Why It's Inherently Slower:
- HTTP range requests for parquet chunks
- Network latency (Google Cloud Storage → browser)
- DuckDB-WASM must parse and cache chunks
- No local indexes or materialized views
Estimated Impact: 🟢 5% of perceived slowness (well-optimized by DuckDB already)
Strategy A: Materialized Geocode Index
Approach: Pre-compute the locations query result into a separate lightweight parquet file
Implementation:
1. Server-side preprocessing:

```python
# Run ONCE when oc_isamples_pqg.parquet updates
import duckdb

con = duckdb.connect()
con.execute("""
COPY (
  WITH geo_classification AS (
    SELECT
      geo.pid, geo.latitude, geo.longitude,
      MAX(CASE WHEN e.p = 'sample_location' THEN 1 ELSE 0 END) as is_sample_location,
      MAX(CASE WHEN e.p = 'site_location' THEN 1 ELSE 0 END) as is_site_location
    FROM read_parquet('oc_isamples_pqg.parquet') geo
    JOIN read_parquet('oc_isamples_pqg.parquet') e ON (geo.row_id = e.o[1])
    WHERE geo.otype = 'GeospatialCoordLocation'
    GROUP BY geo.pid, geo.latitude, geo.longitude
  )
  SELECT * FROM geo_classification
) TO 'oc_geocodes_classified.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```
2. Client-side usage:

```js
locations = {
  const query = `SELECT * FROM read_parquet('${geocodes_parquet_path}')`;
  const data = await loadData(query, [], "loading_1", "locations");
  // ... render points
}
```
Expected Speedup: ⚡ 10-50x faster initial load (from 5-10 seconds → <1 second)
Tradeoffs:
- ✅ Massive performance win
- ✅ Simple to implement
- ✅ No query rewrite needed
- ⚠️ Adds one more file to maintain (~50KB vs 700MB main file)
- ⚠️ Must regenerate when the main parquet updates
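Since `oc_geocodes_classified.parquet` must be regenerated whenever the main parquet changes, a small staleness check can gate the rebuild step. A minimal sketch (the helper name and the rebuild hook are hypothetical, not part of the existing pipeline):

```python
import os

def needs_regeneration(source: str, derived: str) -> bool:
    """True when the derived parquet is missing or older than its source."""
    if not os.path.exists(derived):
        return True
    return os.path.getmtime(source) > os.path.getmtime(derived)

# Usage sketch with the file names from this document:
# if needs_regeneration('oc_isamples_pqg.parquet', 'oc_geocodes_classified.parquet'):
#     rebuild_geocode_index()  # hypothetical: re-runs the COPY above
```

A GitHub Actions job could run this check on a schedule instead of rebuilding unconditionally.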
Strategy B: Lazy Loading
Approach: Let the user interact with the page BEFORE geocodes finish loading
Implementation:
- Show Cesium globe immediately (already works)
- Display loading indicator: "Loading 1,234 geocodes..."
- Render points in batches as they arrive (chunked processing)
- Enable search box immediately (independent of point rendering)
Code Pattern:

```js
locations = {
  const query = `...`; // existing query
  const data = await loadData(query, [], "loading_1", "locations");
  // Render in chunks of 500 to keep UI responsive
  const CHUNK_SIZE = 500;
  for (let i = 0; i < data.length; i += CHUNK_SIZE) {
    const chunk = data.slice(i, i + CHUNK_SIZE);
    for (const row of chunk) {
      // ... add points
    }
    // Yield to browser between chunks
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  return data;
}
```

Expected Improvement: ⚡ Perceived performance 3-5x better (page feels interactive sooner)
Tradeoffs:
- ✅ Better UX without query changes
- ✅ Works with existing slow query
- ⚠️ More complex rendering logic
- ⚠️ Doesn't solve the fundamental slowness
Strategy C: Denormalized Indexes
Approach: Pre-build reverse lookup tables for common traversals
Implementation:
1. Create separate index tables:

```sql
-- geo_to_events.parquet
SELECT e.o[1] as geo_row_id, e.s as event_row_id, e.p as edge_type
FROM nodes e
WHERE e.p IN ('sample_location', 'site_location');

-- event_to_samples.parquet
SELECT rel.o[1] as event_row_id, rel.s as sample_row_id
FROM nodes rel
WHERE rel.p = 'produced_by';
```
2. Rewrite queries to use the indexes:

```sql
SELECT samp.*, geo.latitude, geo.longitude
FROM read_parquet('samples.parquet') samp
JOIN read_parquet('event_to_samples.parquet') idx1 ON (samp.row_id = idx1.sample_row_id)
JOIN read_parquet('geo_to_events.parquet') idx2 ON (idx1.event_row_id = idx2.event_row_id)
JOIN read_parquet('geocodes.parquet') geo ON (idx2.geo_row_id = geo.row_id)
WHERE geo.pid = ?
```
Expected Speedup: ⚡ 5-10x faster queries (from 1-2 seconds → 200-400ms)
Tradeoffs:
- ✅ Eliminates `list_contains()` array scans
- ✅ Reduces self-joins (separate tables = better indexes)
- ⚠️ Major refactor: changes the data model
- ⚠️ Breaks compatibility with existing notebooks
- ⚠️ More complex build pipeline
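The core idea behind these index tables is turning `list_contains()` scans into direct key lookups. A toy in-memory Python sketch of the reverse index (the edge rows here are made-up illustrations of the `(s, p, o)` shape, not real data):

```python
from collections import defaultdict

# Hypothetical edge rows mirroring the nodes table: (s, p, o) with list-valued o
edges = [
    ("ev1",   "sample_location", ["geo1"]),
    ("ev2",   "site_location",   ["geo1"]),
    ("samp1", "produced_by",     ["ev1"]),
]

# Build the reverse index once (what geo_to_events.parquet would hold)
geo_to_events = defaultdict(list)
for s, p, o in edges:
    if p in ("sample_location", "site_location"):
        geo_to_events[o[0]].append((s, p))

# Lookup is now a keyed hit instead of scanning every row's o list
print(geo_to_events["geo1"])  # → [('ev1', 'sample_location'), ('ev2', 'site_location')]
```

The parquet version trades storage for exactly this access pattern: DuckDB can join on a scalar key column instead of probing arrays.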
Strategy D: SQL Micro-Optimizations
Approach: Rewrite queries to help the DuckDB optimizer
Techniques:
1. Push down filters earlier:

```sql
-- BEFORE: filter at the end
FROM nodes AS geo
JOIN nodes AS rel_se ON (...)
WHERE geo.pid = ?

-- AFTER: filter geo first
FROM (SELECT * FROM nodes
      WHERE otype = 'GeospatialCoordLocation' AND pid = ?) AS geo
JOIN nodes AS rel_se ON (...)
```
2. Replace `list_contains()` with `= ANY(...)` or an EXISTS subquery (if DuckDB optimizes it better):

```sql
-- BEFORE
JOIN nodes AS rel_se ON (list_contains(rel_se.o, geo.row_id))

-- AFTER (test if faster)
JOIN nodes AS rel_se ON (geo.row_id = ANY(rel_se.o))
```
3. Eliminate redundant JOINs:
   - All 3 queries join to `site` just for `site.label` and `site.pid`
   - If those aren't needed for filtering, they could come from a separate follow-up query
Expected Speedup: ⚡ 1.2-2x faster (marginal gains)
Tradeoffs:
- ✅ No data model changes
- ✅ Easy to A/B test
- ⚠️ May not work due to DuckDB-WASM query planner limitations
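Since these rewrites are cheap to A/B test, a tiny timing harness helps compare before/after variants. A sketch (the `run_before`/`run_after` wrappers are hypothetical stand-ins for the two query versions):

```python
import time

def ab_time(fn, runs: int = 5) -> float:
    """Median wall-clock seconds over several runs (median resists outliers)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

# Usage sketch:
# baseline  = ab_time(run_before)   # hypothetical: executes the current query
# candidate = ab_time(run_after)    # hypothetical: executes the rewritten query
# print(f"speedup: {baseline / candidate:.2f}x")
```

In the browser the same idea maps onto the `performance.mark`/`performance.measure` calls shown later in this document.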
Phase 1: Quick Wins
Goal: Make the page feel 3-5x faster without major refactoring
1. ✅ Implement Strategy B (Lazy Loading)
   - Show "Loading X geocodes..." progress indicator
   - Render points in batches (500 at a time)
   - Enable search box before points finish loading
2. ✅ Add telemetry to understand actual timings:

```js
console.time('locations_query');
const data = await loadData(query, ...);
console.timeEnd('locations_query');
```
Expected User Experience:
- Page interactive in 1-2 seconds (vs 5-10 seconds)
- Visual feedback (progress bar)
- Can search for specific geocode immediately
Phase 2
Goal: Achieve 10-50x speedup on initial load
1. ✅ Implement Strategy A (Materialized Geocode Index)
   - Create `oc_geocodes_classified.parquet` (~50KB)
   - Update the GitHub Actions workflow to regenerate it on data updates
   - Test with DuckDB-WASM in the browser
2. ✅ A/B Test Strategy D (SQL Micro-Optimizations)
   - Try filter push-down
   - Measure actual impact (may be negligible)
Expected User Experience:
- Initial load: <1 second for geocode points
- First click query: Still 1-2 seconds (acceptable)
Phase 3
Goal: Achieve 5-10x speedup on click-triggered queries
1. ⚠️ Evaluate Strategy C (Denormalized Indexes)
   - Prototype with a subset of the data
   - Measure actual gains in DuckDB-WASM
   - Assess maintenance burden
2. ⚠️ Consider alternative architectures:
   - Pre-compute ALL common queries → static JSON files
   - Client-side caching (IndexedDB for query results)
   - WebAssembly-based custom graph traversal (if DuckDB is still too slow)
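The client-side caching idea boils down to a memo table keyed by query text plus parameters, so each expensive query runs once per session. A real implementation would live in JavaScript backed by IndexedDB; the logic is sketched here in Python with a stand-in executor (all names hypothetical):

```python
# Memoize query results by (query text, parameter tuple); the browser version
# would persist this map in IndexedDB instead of an in-process dict.
_cache: dict = {}

def cached_query(run_query, sql: str, params: tuple):
    key = (sql, params)
    if key not in _cache:
        _cache[key] = run_query(sql, params)  # expensive call happens only once
    return _cache[key]

# Usage sketch with a fake executor that counts invocations:
calls = []
def fake_run(sql, params):
    calls.append(sql)
    return [{"pid": params[0]}]

cached_query(fake_run, "SELECT ...", ("geo1",))
cached_query(fake_run, "SELECT ...", ("geo1",))  # served from cache
print(len(calls))  # → 1
```

Cache invalidation ties back to the update frequency question below: entries keyed on the source file's version stay valid until the parquet is regenerated.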
Only pursue if: Phase 2 gains aren't sufficient for user needs
Before optimization:
```js
// Add to parquet_cesium.qmd
performance.mark('page-start');

locations = {
  performance.mark('locations-start');
  const data = await loadData(query, [], "loading_1", "locations");
  performance.mark('locations-end');
  performance.measure('locations-query', 'locations-start', 'locations-end');
  console.log(performance.getEntriesByName('locations-query')[0].duration + 'ms');
  return data;
}

// After first click
async function get_samples_1(pid) {
  performance.mark('samples1-start');
  const result = await loadData(q, [pid], "loading_s1", "samples_1");
  performance.mark('samples1-end');
  performance.measure('samples1-query', 'samples1-start', 'samples1-end');
  console.log(performance.getEntriesByName('samples1-query')[0].duration + 'ms');
  return result ?? [];
}
```

Metrics to Track:
- Initial page load time (to interactive)
- `locations` query execution time
- First click response time (each of the 3 queries)
- Data transfer size (Network tab)
- Memory usage (Performance tab)
Q: How much of the slowness comes from the self-joins?
A: ~15% of the problem for initial load, ~80% for click queries
- Initial load: The `locations` query is inherently expensive (self-join + GROUP BY on all geocodes), BUT could be 10-50x faster with pre-aggregation (Strategy A)
- Click queries: The 6 self-joins are unavoidable given the property graph model, BUT could be 5-10x faster with denormalized indexes (Strategy C)
Q: Are we stuck with the self-join pattern?
A: We're only stuck if we insist on querying the raw property graph
The property graph model REQUIRES self-joins because:
- All nodes and edges live in ONE table (`nodes`)
- Graph traversal = multiple joins to the same table
- No escape without changing the data model
However, we have options:
- ✅ Pre-aggregate common queries (Strategy A) - avoids re-computing on every page load
- ✅ Denormalize hot paths (Strategy C) - trades storage for query speed
- ✅ Cache results client-side - only run expensive queries once per browser session
- ⚠️ Abandon the property graph for the query layer - keep it for data ingestion, but publish separate optimized query tables
The self-joins are NOT the problem. The problem is:
- Running expensive aggregations on EVERY page load (fixable with Strategy A)
- No indexes on array-valued columns (`list_contains()` scans) (fixable with Strategy C)
- No query result caching (fixable with client-side storage)
Before proceeding, clarify:
1. What's the user's pain threshold?
- Is 2 seconds initial load acceptable? (Then do Phase 1 only)
- Need <1 second? (Then do Phase 2)
- Need instant? (Then need Phase 3 or architectural rethink)
2. What's the maintenance budget?
- Phase 1: Zero maintenance (just code changes)
- Phase 2: Low maintenance (regenerate one small parquet file)
- Phase 3: High maintenance (multiple derived tables, complex build pipeline)
3. How often does the source data update?
- Daily: Phase 2 is fine (automated regeneration)
- Hourly: Phase 3 may be problematic (cache invalidation complexity)
- Weekly: Even manual regeneration works
4. What's the priority: initial load or click response?
- If initial load is the main complaint: Focus on Strategy A
- If click queries are the main complaint: Focus on Strategy C
Immediate Action (recommend): Start with Phase 1 to get quick wins
- Add performance telemetry to quantify actual bottlenecks
- Implement lazy loading + progress indicators
- Measure improvement
- Re-assess if Phase 2 is needed
Optional: Prototype Strategy A (materialized geocode index) in parallel to see if it's worth pursuing
Let me know which direction you want to explore!