Querying the Planet: Leveraging GeoParquet to work with global scale open geospatial data locally and in the cloud
Consisting of open data from OpenStreetMap, Meta, Esri, Microsoft, Google, and more, Overture Maps data is conflated and converted to a consistent schema before being distributed as geoparquet files in the cloud. This workshop will explore the advantages of GeoParquet and cloud-native geospatial technologies for researchers working with the data both locally and in the cloud.
| Name | Description |
|---|---|
| Overture Explore Page | Easiest place to get an overview of Overture data in an X-Ray map view |
| Overture Documentation | Schema definitiona nd examples of how to access and work with Overture data |
| Fused.io | A new cloud-based analytics platform with User Defined Functions |
| DuckDB | An fast in-process database system for analytics and data manipulation |
- Querying the Planet: Leveraging GeoParquet to work with global scale open geospatial data locally and in the cloud
- Workshop Agenda
- 1. What is Overture Maps?
- 2. Fused.io
- 3. DuckDB
- 4. Bring the Analysis to the Data in the cloud with Fused
The Overture Maps Foundation is an open data project within the Linux Foundation that aims to "Power current and next-generation map products by creating reliable, easy-to-use, and interoperable open map data."
Primarily, "Overture is for developers who build map services or use geospatial data." Additionally, Overture is a fantastic resource for researchers looking to work with one of the most complete and computationally efficient open geospatial datasets.
-
Visit explore.overturemaps.org and poke around. This site offers an "x-ray" view of Overture data.
-
Overture has 6 data themes:
- Divisions
- Base
- Transportation
- Buildings
- Places
- Addresses
The explore page lets you inspect the properties of each feature and links out to the overture schema: docs.overturemaps.org/schema where you can learn more about the attributes available for each theme.
The explore page helps us get an overview of what's in Overture by rendering pre-processed PMTiles archives on a web map. Next, we'll look at the different ways we can interact with Overture data in the raw, Geoparquet format.
Fused is a new analytical platform with powerful capabilities to read and visualize geoparquet right in your browser. The Fused workbench allows you to run any number of public User Defined Functions, or UDFs.
1. Getting started with Fused: The Overture Maps Example UDF
-
In a new browser window, navigate to: Overture Maps Example.
-
Click "Add to UDF Builder".
-
On the far left, adjust the parameters to view different types of data from Overture.
-
Hover over features on the map to see the complete, raw, Overture data.
-
Zoom all the way out to see the spatial partitioning:
Now that we've seen what's in Overture data, can we combine (or fuse) our Overture data with another dataset?
- Add the Overture Nsi UDF to your fused workbench.
- Notice the
join with NSIparameter in this UDF. Toggle this parameter and have a look around the map at a few different places. For example, here are buildings in Fargo, North Dakota:
After getting buildings from Overture, this UDF queries the National Structures Inventory for information about the various buildings. The NSI returns point geometries, which are joined to our Overture buildings with a spatial join in GeoPandas:
join = gdf_overture.sjoin(gdf, how='left')Next, if Overture does not have height information for a given building, we calculate a height based on the number of stories from the NSI.
join["metric"] = join.apply(lambda row: row.height if pd.notnull(row.height) else row.num_story*3, axis=1)Next, we'll turn to our local machines and look at ways to interact with Overture data from our local environment.
Since the data is hosted in the cloud as GeoParquet files, we can access it via DuckDB, which can take advantage of this cloud-native format.
First, Install DuckDB version >= 1.1.1
Next, you'll need a A GIS environment of your choice. Both QGIS and Esri ArcMap or similar should work. To load geoparquet directly into QGIS, you will need a version of QGIS with the latest GDAL. Alternatively, most of this workshop can be visualized with kepler.gl
If you do not want to install DuckDB locally, you can sign up for MotherDuck, a cloud-based DuckDB, however, you will not be able to use the COPY TO commands nor the h3 extension.
Tip: When launching DuckDB, specify a persistent DB, such as duckdb my_db.duckdb. This way if you create tables, you can access them later.
-
Obtain a bounding box of interest (https://boundingbox.klokantech.com) is a great tool for creating a bounding box. Specifically, it lets you copy the coordinates in the following format (DublinCore) which is very human-readable. Here is a bounding box for MontrΓ©al:
westlimit=-73.974157 southlimit=45.410076 eastlimit=-73.474295 northlimit=45.70479
(I recommend a smaller bounding box, like just a small city or neighborhood for now so you're not working with a lot of data in the example).
-
A basic places query looks like this:
SELECT id, names.primary as name, confidence, geometry FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/*', filename=true, hive_partitioning=1) WHERE bbox.xmin BETWEEN X_WEST AND X_EAST AND bbox.ymin BETWEEN Y_SOUTH AND Y_NORTH LIMIT 10;
-
Update the query with the proper values in the
WHEREclause forXandYfrom your bounding box. Remember, east/west = longitude = X and north/south = latitude = Y. -
Paste your query into DuckDB and run it.
You should see something similar to this:
ββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ β id β name β confidence β geometry β β varchar β varchar β double β geometry β ββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββ€ β 08f2b81b7170b90803a1b4376438169b β HΓ΄tel de ville de Senneville β 0.9544565521095278 β POINT (-73.9600016 45.4136658) β β 08f2b81b7171c0db037e3818c87b6c02 β charles rivers β 0.30856423173803527 β POINT (-73.96118 45.41467) β β 08f2b81b714f2a2603f85a7ea76652ac β Club De Voile Senneville β 0.5592783505154639 β POINT (-73.9685221 45.4187574) β β 08f2b81b714ad4e5037c634d2494f50e β Vignoble Souffle de Vie β 0.77 β POINT (-73.9676663 45.4201718) β β 08f2b81b71583049039f215b4d3b9a7b β Souffle de Vie Vineyard β 0.9567577686259828 β POINT (-73.9678019 45.4228501) β β 08f2b81b7158c219034200ed79cfbc13 β Tenaquip Limited β 0.9826508620689655 β POINT (-73.9657541 45.4229399) β β 08f2b81b7152e4140398b266e3eab10b β CimetiΓ¨re et Complexe FunΓ©raire BelvΓ©dΓ¨re β 0.9797443181818182 β POINT (-73.9618394 45.4233901) β β 08f2b81b703aa099031e13858e363b78 β Les Γcuries de Senneville | Senneville QC β 0.9537408699085346 β POINT (-73.9693036 45.4291791) β β 08f2b81b700551a2039ae5eda64366c7 β Ferme GUSH Farm β 0.9213349225268176 β POINT (-73.96907 45.43491) β β 08f2b81b70b90148038bf5f719a95574 β Braeside Golf Club β 0.9588815789473685 β POINT (-73.9626141 45.4377414) β ββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ€ β 10 rows 4 columns β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββNotice the type of the geometry column is
geometry. DuckDB recognizes the geo metadata in the source parquet files and automatically converts the column to a geometry type. -
Consult the places schema to learn more about which columns can be accessed and their data types.
-
Ensure the
spatialextension is installed:install spatial;. -
Load the spatial extension with
load spatial;. -
Now we can use our same query again, but this time we add the
COPY TOcommand to write GeoJSON. We can also remove theLIMITargument. The complete query looks like this:INSTALL spatial; LOAD spatial; COPY( SELECT id, names.primary as name, confidence, geometry FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/*', filename=true, hive_partitioning=1) WHERE bbox.xmin BETWEEN -73.974157 AND -73.474295 AND bbox.ymin BETWEEN 45.410076 AND 45.70479 ) TO 'montreal.geojson' WITH (FORMAT GDAL, DRIVER GeoJSON);
-
Now open that GeoJSON file in your preferred GIS environment to inspect the attributes (I recommend dragging-and-dropping the result directly into kepler.gl for fast visualization.)
-
Are there other columns that would be useful? Try adding
categories.primary as category,to the query to get the category for each place.
-
Overture contains more than 2B building footprints. Attempting to download them all to our local machine will be difficult. However, we can extract only a small subset of the buildings with a query:
Overutre data is available both on Amazon S3 and Microsoft Azure Blob Storage. In this example, we'll use the data from Azure:
INSTALL azure; LOAD azure; SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;AccountName=overturemapswestus2;AccountKey=;EndpointSuffix=core.windows.net'; LOAD spatial; COPY( SELECT id, names.primary as primary_name, height, sources[1].dataset AS primary_source, sources[1].record_id AS source_id, geometry FROM read_parquet('azure://release/2024-10-23.0/theme=buildings/type=building/*', filename=true, hive_partitioning=1) WHERE bbox.xmin BETWEEN -122.352055 AND -122.316697 AND bbox.ymin BETWEEN 47.593064 AND 47.619655 ) TO 'seattle_buildings.geojson' WITH (FORMAT GDAL, DRIVER GeoJSON);
Update that bounding box for anywhere else in the world, and you instantly have a global building database at your finger tips.
-
How about some spatial statistics? If we don't want to first download all 566,806 buildings in our MontrΓ©al bounding box, we can bring our statistics directly into our query. First, we'll install
h3extension for spatial aggregation by h3 hexagon.INSTALL h3 FROM community; LOAD h3; COPY( SELECT h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 10) as h3, count(1) as _count FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=buildings/type=building/*', filename=true, hive_partitioning=1) WHERE bbox.xmin BETWEEN -73.974157 AND -73.474295 AND bbox.ymin BETWEEN 45.410076 AND 45.70479 GROUP BY h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 10) ) TO 'montreal_buildings_h3.csv';
Load that image into Kepler.gl, and we have building densities by h3 resolution 10 cell across Montreal.

The transportation theme has 2 types of data, connectors and segments.
-
Let's start by looking at the segments in Paris:
COPY( SELECT id, names.primary as name, class, geometry FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=transportation/type=segment/*', filename=true, hive_partitioning=1) WHERE bbox.xmin > 2.276 AND bbox.ymin > 48.865 AND bbox.xmax < 2.314 AND bbox.ymax < 48.882 ) TO 'paris_roads.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON');Looks very similar to what we were seeing on the Explore map, but this time we're working with the raw data:
-
Connectors are a decent proxy of road network complexity and density. First we'll download a bunch of connectors to a local parquet file:
LOAD h3; COPY( SELECT h3_latlng_to_cell_string(ST_Y(geometry), ST_X(geometry), 8) as h3, id, geometry FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=transportation/type=connector/*', filename=true, hive_partitioning=1) WHERE bbox.xmin > 8.82 AND bbox.ymin > 48.5 AND bbox.xmax < 13.36 AND bbox.ymax < 50.39 ) TO connectors.parquet; -
Now aggregate by h3 cell:
COPY( SELECT h3, count(id) FROM connectors.parquet GROUP BY h3 ) TO 'connectors_h3.csv';Bring that CSV into Kepler.gl again, and we have a nice visualization of road network density in Germany:

What is the base theme? Mostly landuse, infrastructure, and water features from OpenStreetMap that have been converted into a rigic schema depending on their initial tag values. For example:
-
Query Overture for all of the mountain peaks in North America:
SET s3_region='us-west-2'; COPY( SELECT id, names.primary as name, elevation, geometry FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=base/type=land/*', filename=true, hive_partitioning=1) WHERE subtype = 'physical' AND class IN ('peak','volcano') AND elevation IS NOT NULL AND bbox.xmin BETWEEN -175 AND -48 AND bbox.ymin BETWEEN 10 AND 85 ) TO 'north_american_peaks.parquet';
This query downloads about 1.1M peaks and the distribution looks like this:

-
We can build an h3-gridded DEM for North American high points from this file:
COPY( SELECT h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 6) as h3, max(elevation) as _max, min(elevation) as _min, avg(elevation) as _avg FROM north_american_peaks.parquet GROUP BY h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 6) ) TO 'na_peaks_h3.csv';
What other types of features from OSM are you interested in exploring? The logic for how features convert from OSM to Overture is here: https://docs.overturemaps.org/schema/concepts/by-theme/base/
Now that we've worked with the data locally, let's go back to the cloud. Since our data lives there, let's bring our analysis to the data, not the other way around.
- Load the Overture OakRidge Comparison into your Fused workbench.
- Compare the building footprints between (Oak Ridge National Lab) ORNL and Overture. Which one has more accurate building footprints?
- Now which dataset has more accurate class information?
- Fused lets us combine these datasets, taking the best footprints from Overture and rich class information from ORNL.
- We can also perform the same H3 aggregations in Fused: The Overture H3 Skyline UDF aggregates Overture buildings at any h3 resolution β allowing you to "approximate" a city skyline without having to render all of the buildings.











