Commit 6c56470

Merge pull request #23 from rdhyee/issue-13-parquet-duckdb
Enhanced parquet analysis with object types and property distribution
2 parents 379af20 + 5475b5b commit 6c56470

4 files changed: +717 -80 lines changed

.claude/settings.local.json

Lines changed: 8 additions & 1 deletion
@@ -1,7 +1,14 @@
 {
   "permissions": {
     "allow": [
-      "Bash(git branch:*)"
+      "Bash(git branch:*)",
+      "WebFetch(domain:localhost)",
+      "Bash(git add:*)",
+      "Read(//Users/raymondyee/dev-journal/daily/**)",
+      "Bash(git commit:*)",
+      "Bash(git push:*)",
+      "Bash(git pull:*)",
+      "Bash(git fetch:*)"
     ],
     "deny": [],
     "ask": []

tutorials/index.qmd

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ With iSamples Central currently unavailable, all tutorials now use **geoparquet
 ## Why Geoparquet?

 Our tutorials showcase how **geoparquet + DuckDB-WASM** enables:
+
 - **Universal access**: No software installation required
 - **Fast analysis**: 5-10x faster than traditional approaches (e.g., downloading full CSV datasets and analyzing them locally). [See benchmark](https://duckdb.org/2023/05/10/duckdb-wasm.html)
 - **Memory efficient**: Analyze 300MB datasets using <100MB browser memory

tutorials/parquet_cesium.qmd

Lines changed: 196 additions & 2 deletions
@@ -8,8 +8,8 @@ One key development of the iSamples project centers on the demonstration of low-
 This page demonstrates how geospatial data can be dynamically accessed from a remote parquet file in cloud storage. The page uses Cesium for browser visualization of these spatial data on a 3D global map. The data in this demonstration comes from [Open Context's](https://opencontext.org/) export of specimen (archaeological artifact and ecofact) records for iSamples. However, this demonstration can also work with any other iSamples compliant parquet data source made publicly accessible on the Web.


-<script src="https://cesium.com/downloads/cesiumjs/releases/1.127/Build/Cesium/Cesium.js"></script>
-<link href="https://cesium.com/downloads/cesiumjs/releases/1.127/Build/Cesium/Widgets/widgets.css" rel="stylesheet"></link>
+<script src="https://cesium.com/downloads/cesiumjs/releases/1.133/Build/Cesium/Cesium.js"></script>
+<link href="https://cesium.com/downloads/cesiumjs/releases/1.133/Build/Cesium/Widgets/widgets.css" rel="stylesheet"></link>
 <style>
     div.cesium-topleft {
         display: block;
@@ -238,6 +238,8 @@ viewof pointdata = {

 :::

+The number of locations in the file is: ${pointdata.length}.
+
 The click point ID is "${clickedPointId}".

 ```{ojs}
@@ -248,4 +250,196 @@ ${JSON.stringify(selectedGeoRecord, null, 2)}
 `
 ```

+## Table Structure Analysis
+
+Understanding the structure and schema of the parquet file:
+
+### Column Schema
+
+```{ojs}
+//| code-fold: true
+tableSchema = {
+    const query = `DESCRIBE nodes`;
+    const data = await loadData(query, [], "loading_schema");
+    return data;
+}
+```
+
+<div id="loading_schema">Loading table schema...</div>
+
+```{ojs}
+//| code-fold: true
+viewof schemaTable = {
+    const data_table = Inputs.table(tableSchema, {
+        header: {
+            column_name: "Column Name",
+            column_type: "Data Type",
+            null: "Nullable",
+            key: "Key",
+            default: "Default",
+            extra: "Extra"
+        }
+    });
+    return data_table;
+}
+```
+
+### Sample Data
+
+First 10 rows of the dataset to understand the data structure:
+
+```{ojs}
+//| code-fold: true
+sampleData = {
+    const query = `SELECT * FROM nodes LIMIT 10`;
+    const data = await loadData(query, [], "loading_sample");
+    return data;
+}
+```
+
+<div id="loading_sample">Loading sample data...</div>
+
+```{ojs}
+//| code-fold: true
+viewof sampleTable = {
+    const data_table = Inputs.table(sampleData, {
+        layout: "auto",
+        width: {
+            pid: 200,
+            otype: 150
+        }
+    });
+    return data_table;
+}
+```
+
+### Sample Data by Object Type
+
+Examples of records for each object type to understand the data semantics:
+
+```{ojs}
+//| code-fold: true
+sampleDataByOtype = {
+    // First get the list of unique object types
+    const otypeQuery = `SELECT DISTINCT otype FROM nodes ORDER BY otype`;
+    const otypes = await loadData(otypeQuery, [], "loading_otype_samples");
+
+    const results = [];
+    for (const otypeRow of otypes) {
+        const otype = otypeRow.otype;
+        // Get 3 sample records for each otype
+        const sampleQuery = `SELECT * FROM nodes WHERE otype = ? LIMIT 3`;
+        const samples = await db.query(sampleQuery, [otype]);
+
+        results.push({
+            otype: otype,
+            count: samples.length,
+            samples: samples
+        });
+    }
+    return results;
+}
+```
+
+<div id="loading_otype_samples">Loading sample data by object type...</div>
+
+```{ojs}
+//| code-fold: true
+viewof otypeSamplesDisplay = {
+    const container = html`<div></div>`;
+
+    for (const otypeData of sampleDataByOtype) {
+        const section = html`<div style="margin-bottom: 2rem;">
+            <h4 style="color: #2563eb; margin-bottom: 0.5rem;">Object Type: ${otypeData.otype}</h4>
+            <p style="margin: 0.5rem 0; font-style: italic;">Sample records (showing up to 3):</p>
+        </div>`;
+
+        // Create a table for this otype's samples
+        const table = Inputs.table(otypeData.samples, {
+            layout: "auto",
+            width: {
+                pid: 150,
+                otype: 120,
+                latitude: 100,
+                longitude: 100
+            }
+        });
+
+        section.appendChild(table);
+        container.appendChild(section);
+    }
+
+    return container;
+}
+```
+
+## Object Type Counts
+
+The distribution of object types (`otype`) in the dataset:
+
+```{ojs}
+//| code-fold: true
+otypeCounts = {
+    const query = `SELECT otype, COUNT(*) as count FROM nodes GROUP BY otype ORDER BY count DESC`;
+    const data = await loadData(query, [], "loading_otype");
+    return data;
+}
+```
+
+<div id="loading_otype">Loading object type counts...</div>
+
+```{ojs}
+//| code-fold: true
+viewof otypeTable = {
+    const data_table = Inputs.table(otypeCounts, {
+        header: {
+            otype: "Object Type",
+            count: "Count"
+        },
+        format: {
+            count: d => d.toLocaleString()
+        }
+    });
+    return data_table;
+}
+```
+
+Total records by object type: ${otypeCounts.reduce((sum, row) => sum + row.count, 0).toLocaleString()}
+
+## Property Distribution Analysis
+
+Understanding the range of properties (predicates) in this graph database structure:
+
+```{ojs}
+//| code-fold: true
+propertyDistribution = {
+    const query = `SELECT p as property, COUNT(*) as count FROM nodes WHERE p IS NOT NULL GROUP BY p ORDER BY count DESC`;
+    const data = await loadData(query, [], "loading_properties");
+    return data;
+}
+```
+
+<div id="loading_properties">Loading property distribution...</div>
+
+```{ojs}
+//| code-fold: true
+viewof propertyTable = {
+    const data_table = Inputs.table(propertyDistribution, {
+        header: {
+            property: "Property (Predicate)",
+            count: "Count"
+        },
+        format: {
+            count: d => d.toLocaleString()
+        },
+        layout: "auto"
+    });
+    return data_table;
+}
+```
+
+Total records with properties: ${propertyDistribution.reduce((sum, row) => sum + row.count, 0).toLocaleString()}
+
+Unique properties in the dataset: ${propertyDistribution.length.toLocaleString()}
+
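
Note on the two "Total records" lines in tutorials/parquet_cesium.qmd: both sum `row.count` with `reduce` seeded by the number `0`. Whether that is safe depends on what `loadData` returns, which this diff does not show. DuckDB-WASM can surface 64-bit integer results such as `COUNT(*)` as JavaScript `BigInt` values, and adding a `BigInt` to a `Number` throws a `TypeError`. A minimal defensive sketch, assuming counts may arrive as either type (`totalCount` is a hypothetical helper, not part of this commit):

```javascript
// Hypothetical helper (not in the commit): sums a `count` column whose
// values may be Number or BigInt, and returns a plain Number for display.
function totalCount(rows) {
  let total = 0n;
  for (const row of rows) {
    // BigInt() accepts integer Numbers and BigInts alike.
    total += BigInt(row.count);
  }
  // Assumption: the total stays below Number.MAX_SAFE_INTEGER.
  return Number(total);
}

// Example with made-up rows that mix both input types.
const exampleCounts = [{ otype: "A", count: 3n }, { otype: "B", count: 4 }];
console.log(totalCount(exampleCounts).toLocaleString());
```

If `loadData` already converts counts to plain numbers, the existing `reduce` calls work as written and this helper is unnecessary.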

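
Note on `sampleDataByOtype`: the cell runs one query per object type. An alternative, sketched here under the assumption that the same `nodes` table is queried, fetches up to three rows per `otype` in a single statement using DuckDB's `QUALIFY` clause with `ROW_NUMBER()`, then groups the flat result client-side. `SAMPLES_PER_OTYPE_SQL` and `groupByOtype` are hypothetical names, and the rows in the usage example are made up.

```javascript
// Hypothetical single-query alternative (not part of the commit).
// QUALIFY filters on the window function, keeping at most 3 rows per otype.
const SAMPLES_PER_OTYPE_SQL = `
  SELECT *
  FROM nodes
  QUALIFY ROW_NUMBER() OVER (PARTITION BY otype) <= 3
  ORDER BY otype
`;

// Groups flat rows into the { otype, count, samples } shape that
// sampleDataByOtype builds in the diff above.
function groupByOtype(rows) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row.otype)) groups.set(row.otype, []);
    groups.get(row.otype).push(row);
  }
  return Array.from(groups, ([otype, samples]) => ({
    otype,
    count: samples.length,
    samples
  }));
}

// Usage with made-up rows standing in for the query result.
const exampleRows = [
  { pid: "a1", otype: "TypeA" },
  { pid: "a2", otype: "TypeA" },
  { pid: "b1", otype: "TypeB" }
];
console.log(groupByOtype(exampleRows).map(g => `${g.otype}: ${g.count}`));
```

The trade-off is one round trip instead of one per object type, at the cost of a slightly less obvious query.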