Skip to content

Querying the Planet: Leveraging GeoParquet to work with global scale open geospatial data locally and in the cloud.

Notifications You must be signed in to change notification settings

pgzmnk/overturemaps-workshop

Β 
Β 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

33 Commits
Β 
Β 
Β 
Β 

Repository files navigation

Querying the Planet: Leveraging GeoParquet to work with global scale open geospatial data locally and in the cloud

Consisting of open data from OpenStreetMap, Meta, Esri, Microsoft, Google, and more, Overture Maps data is conflated and converted to a consistent schema before being distributed as geoparquet files in the cloud. This workshop will explore the advantages of GeoParquet and cloud-native geospatial technologies for researchers working with the data both locally and in the cloud.

Resources

Name Description
Overture Explore Page Easiest place to get an overview of Overture data in an X-Ray map view
Overture Documentation Schema definitiona nd examples of how to access and work with Overture data
Fused.io A new cloud-based analytics platform with User Defined Functions
DuckDB An fast in-process database system for analytics and data manipulation

Workshop Agenda







1. What is Overture Maps?

Back to Agenda

image

The Overture Maps Foundation is an open data project within the Linux Foundation that aims to "Power current and next-generation map products by creating reliable, easy-to-use, and interoperable open map data."

Primarily, "Overture is for developers who build map services or use geospatial data." Additionally, Overture is a fantastic resource for researchers looking to work with one of the most complete and computationally efficient open geospatial datasets.

image

Explore Overture Data

  1. Visit explore.overturemaps.org and poke around. This site offers an "x-ray" view of Overture data.

  2. Overture has 6 data themes:

    • Divisions
    • Base
    • Transportation
    • Buildings
    • Places
    • Addresses

    The explore page lets you inspect the properties of each feature and links out to the overture schema: docs.overturemaps.org/schema where you can learn more about the attributes available for each theme.

The explore page helps us get an overview of what's in Overture by rendering pre-processed PMTiles archives on a web map. Next, we'll look at the different ways we can interact with Overture data in the raw, Geoparquet format.







2. Fused.io

Back to Agenda

image

Fused is a new analytical platform with powerful capabilities to read and visualize geoparquet right in your browser. The Fused workbench allows you to run any number of public User Defined Functions, or UDFs.

1. Getting started with Fused: The Overture Maps Example UDF

image

  1. In a new browser window, navigate to: Overture Maps Example.

  2. Click "Add to UDF Builder".

  3. On the far left, adjust the parameters to view different types of data from Overture.

  4. Hover over features on the map to see the complete, raw, Overture data.

  5. Zoom all the way out to see the spatial partitioning:

    image

2. Fusing Datasets with Overture in the browser

Now that we've seen what's in Overture data, can we combine (or fuse) our Overture data with another dataset?

  1. Add the Overture Nsi UDF to your fused workbench.
  2. Notice the join with NSI parameter in this UDF. Toggle this parameter and have a look around the map at a few different places. For example, here are buildings in Fargo, North Dakota:

After getting buildings from Overture, this UDF queries the National Structures Inventory for information about the various buildings. The NSI returns point geometries, which are joined to our Overture buildings with a spatial join in GeoPandas:

join = gdf_overture.sjoin(gdf, how='left')

Next, if Overture does not have height information for a given building, we calculate a height based on the number of stories from the NSI.

join["metric"] = join.apply(lambda row: row.height if pd.notnull(row.height) else row.num_story*3, axis=1)

Next, we'll turn to our local machines and look at ways to interact with Overture data from our local environment.







3. DuckDB

Back to Agenda

Since the data is hosted in the cloud as GeoParquet files, we can access it via DuckDB, which can take advantage of this cloud-native format.

First, Install DuckDB version >= 1.1.1

Next, you'll need a A GIS environment of your choice. Both QGIS and Esri ArcMap or similar should work. To load geoparquet directly into QGIS, you will need a version of QGIS with the latest GDAL. Alternatively, most of this workshop can be visualized with kepler.gl

If you do not want to install DuckDB locally, you can sign up for MotherDuck, a cloud-based DuckDB, however, you will not be able to use the COPY TO commands nor the h3 extension.

Part I. Places Theme

Tip: When launching DuckDB, specify a persistent DB, such as duckdb my_db.duckdb. This way if you create tables, you can access them later.

Step 1: Query for places in a particular location

  1. Obtain a bounding box of interest (https://boundingbox.klokantech.com) is a great tool for creating a bounding box. Specifically, it lets you copy the coordinates in the following format (DublinCore) which is very human-readable. Here is a bounding box for MontrΓ©al:

    westlimit=-73.974157
    southlimit=45.410076
    eastlimit=-73.474295
    northlimit=45.70479

    (I recommend a smaller bounding box, like just a small city or neighborhood for now so you're not working with a lot of data in the example).

  2. A basic places query looks like this:

    SELECT
        id,
        names.primary as name,
        confidence,
        geometry
    FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/*', filename=true, hive_partitioning=1)
    WHERE
        bbox.xmin BETWEEN X_WEST AND X_EAST
        AND bbox.ymin BETWEEN Y_SOUTH AND Y_NORTH
    LIMIT 10;
  3. Update the query with the proper values in the WHERE clause for X and Y from your bounding box. Remember, east/west = longitude = X and north/south = latitude = Y.

  4. Paste your query into DuckDB and run it.

    You should see something similar to this:

     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚                id                β”‚                   name                    β”‚     confidence      β”‚            geometry            β”‚
     β”‚             varchar              β”‚                  varchar                  β”‚       double        β”‚            geometry            β”‚
     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
     β”‚ 08f2b81b7170b90803a1b4376438169b β”‚ HΓ΄tel de ville de Senneville              β”‚  0.9544565521095278 β”‚ POINT (-73.9600016 45.4136658) β”‚
     β”‚ 08f2b81b7171c0db037e3818c87b6c02 β”‚ charles rivers                            β”‚ 0.30856423173803527 β”‚ POINT (-73.96118 45.41467)     β”‚
     β”‚ 08f2b81b714f2a2603f85a7ea76652ac β”‚ Club De Voile Senneville                  β”‚  0.5592783505154639 β”‚ POINT (-73.9685221 45.4187574) β”‚
     β”‚ 08f2b81b714ad4e5037c634d2494f50e β”‚ Vignoble Souffle de Vie                   β”‚                0.77 β”‚ POINT (-73.9676663 45.4201718) β”‚
     β”‚ 08f2b81b71583049039f215b4d3b9a7b β”‚ Souffle de Vie Vineyard                   β”‚  0.9567577686259828 β”‚ POINT (-73.9678019 45.4228501) β”‚
     β”‚ 08f2b81b7158c219034200ed79cfbc13 β”‚ Tenaquip Limited                          β”‚  0.9826508620689655 β”‚ POINT (-73.9657541 45.4229399) β”‚
     │ 08f2b81b7152e4140398b266e3eab10b │ Cimetière et Complexe Funéraire Belvédère │  0.9797443181818182 │ POINT (-73.9618394 45.4233901) │
     β”‚ 08f2b81b703aa099031e13858e363b78 β”‚ Les Γ‰curies de Senneville | Senneville QC β”‚  0.9537408699085346 β”‚ POINT (-73.9693036 45.4291791) β”‚
     β”‚ 08f2b81b700551a2039ae5eda64366c7 β”‚ Ferme GUSH Farm                           β”‚  0.9213349225268176 β”‚ POINT (-73.96907 45.43491)     β”‚
     β”‚ 08f2b81b70b90148038bf5f719a95574 β”‚ Braeside Golf Club                        β”‚  0.9588815789473685 β”‚ POINT (-73.9626141 45.4377414) β”‚
     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
     β”‚ 10 rows                                                                                                                   4 columns β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    

    Notice the type of the geometry column is geometry. DuckDB recognizes the geo metadata in the source parquet files and automatically converts the column to a geometry type.

  5. Consult the places schema to learn more about which columns can be accessed and their data types.

Step 2: Use DuckDB spatial extension to convert to common spatial data formats

  1. Ensure the spatial extension is installed: install spatial;.

  2. Load the spatial extension with load spatial;.

  3. Now we can use our same query again, but this time we add the COPY TO command to write GeoJSON. We can also remove the LIMIT argument. The complete query looks like this:

    INSTALL spatial;
    LOAD spatial;
    
    COPY(
    SELECT
        id,
        names.primary as name,
        confidence,
        geometry
    FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=places/type=place/*', filename=true, hive_partitioning=1)
    WHERE
        bbox.xmin BETWEEN -73.974157 AND -73.474295
        AND bbox.ymin BETWEEN 45.410076 AND 45.70479
    ) TO 'montreal.geojson' WITH (FORMAT GDAL, DRIVER GeoJSON);
  4. Now open that GeoJSON file in your preferred GIS environment to inspect the attributes (I recommend dragging-and-dropping the result directly into kepler.gl for fast visualization.)

    image

  5. Are there other columns that would be useful? Try adding categories.primary as category, to the query to get the category for each place.

Part II: Buildings Theme

  1. Overture contains more than 2B building footprints. Attempting to download them all to our local machine will be difficult. However, we can extract only a small subset of the buildings with a query:

    Overutre data is available both on Amazon S3 and Microsoft Azure Blob Storage. In this example, we'll use the data from Azure:

    INSTALL azure;
    LOAD azure;
    SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;AccountName=overturemapswestus2;AccountKey=;EndpointSuffix=core.windows.net';
    
    LOAD spatial;
    
    COPY(
        SELECT
            id,
            names.primary as primary_name,
            height,
            sources[1].dataset AS primary_source,
            sources[1].record_id AS source_id,
            geometry
        FROM read_parquet('azure://release/2024-10-23.0/theme=buildings/type=building/*', filename=true, hive_partitioning=1)
        WHERE bbox.xmin BETWEEN -122.352055 AND -122.316697
        AND bbox.ymin BETWEEN 47.593064 AND 47.619655
    ) TO 'seattle_buildings.geojson' WITH (FORMAT GDAL, DRIVER GeoJSON);

    Update that bounding box for anywhere else in the world, and you instantly have a global building database at your finger tips.

  2. How about some spatial statistics? If we don't want to first download all 566,806 buildings in our MontrΓ©al bounding box, we can bring our statistics directly into our query. First, we'll install h3 extension for spatial aggregation by h3 hexagon.

    INSTALL h3 FROM community;
    LOAD h3;
    COPY(
        SELECT
            h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 10) as h3,
            count(1) as _count
        FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=buildings/type=building/*', filename=true, hive_partitioning=1)
            WHERE bbox.xmin BETWEEN -73.974157 AND -73.474295
            AND bbox.ymin BETWEEN 45.410076 AND 45.70479
    GROUP BY h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 10)
    ) TO 'montreal_buildings_h3.csv';

    Load that image into Kepler.gl, and we have building densities by h3 resolution 10 cell across Montreal. image

Part III: Transportation Theme

The transportation theme has 2 types of data, connectors and segments.

  1. Let's start by looking at the segments in Paris:

    COPY(
        SELECT
            id,
            names.primary as name,
            class,
            geometry
        FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=transportation/type=segment/*', filename=true, hive_partitioning=1)
        WHERE bbox.xmin > 2.276
            AND bbox.ymin > 48.865
            AND bbox.xmax < 2.314
            AND bbox.ymax < 48.882
    ) TO 'paris_roads.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON');

    Looks very similar to what we were seeing on the Explore map, but this time we're working with the raw data:

    image

  2. Connectors are a decent proxy of road network complexity and density. First we'll download a bunch of connectors to a local parquet file:

    LOAD h3;
    COPY(
        SELECT
            h3_latlng_to_cell_string(ST_Y(geometry), ST_X(geometry), 8) as h3,
        id,
        geometry
        FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=transportation/type=connector/*', filename=true, hive_partitioning=1)
        WHERE bbox.xmin > 8.82
        AND bbox.ymin > 48.5
        AND bbox.xmax < 13.36
        AND bbox.ymax < 50.39
    ) TO connectors.parquet;
  3. Now aggregate by h3 cell:

    COPY(
        SELECT
            h3,
            count(id)
        FROM connectors.parquet
        GROUP BY
            h3
    ) TO 'connectors_h3.csv';

    Bring that CSV into Kepler.gl again, and we have a nice visualization of road network density in Germany: image

Part IV: Base Theme

What is the base theme? Mostly landuse, infrastructure, and water features from OpenStreetMap that have been converted into a rigic schema depending on their initial tag values. For example:

  1. Query Overture for all of the mountain peaks in North America:

    SET s3_region='us-west-2';
    
    COPY(
        SELECT
        id,
        names.primary as name,
        elevation,
        geometry
        FROM read_parquet('s3://overturemaps-us-west-2/release/2024-10-23.0/theme=base/type=land/*', filename=true, hive_partitioning=1)
        WHERE subtype = 'physical' AND class IN ('peak','volcano') AND elevation IS NOT NULL
        AND bbox.xmin BETWEEN -175 AND -48
        AND bbox.ymin BETWEEN 10 AND 85
    ) TO 'north_american_peaks.parquet';

    This query downloads about 1.1M peaks and the distribution looks like this: image

  2. We can build an h3-gridded DEM for North American high points from this file:

    COPY(
        SELECT
            h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 6) as h3,
            max(elevation) as _max,
            min(elevation) as _min,
            avg(elevation) as _avg
    FROM north_american_peaks.parquet
    GROUP BY h3_latlng_to_cell_string(ST_Y(ST_CENTROID(geometry)), ST_X(ST_CENTROID(geometry)), 6)
    ) TO 'na_peaks_h3.csv';

    image

What other types of features from OSM are you interested in exploring? The logic for how features convert from OSM to Overture is here: https://docs.overturemaps.org/schema/concepts/by-theme/base/







4. Bring the Analysis to the Data in the cloud with Fused

Back to Agenda

Now that we've worked with the data locally, let's go back to the cloud. Since our data lives there, let's bring our analysis to the data, not the other way around.

Overture & Oakridge Comparision

image

  1. Load the Overture OakRidge Comparison into your Fused workbench.
  2. Compare the building footprints between (Oak Ridge National Lab) ORNL and Overture. Which one has more accurate building footprints?
  3. Now which dataset has more accurate class information?
  4. Fused lets us combine these datasets, taking the best footprints from Overture and rich class information from ORNL.

H3 Aggregated Skyline

image

  1. We can also perform the same H3 aggregations in Fused: The Overture H3 Skyline UDF aggregates Overture buildings at any h3 resolution β€” allowing you to "approximate" a city skyline without having to render all of the buildings.

About

Querying the Planet: Leveraging GeoParquet to work with global scale open geospatial data locally and in the cloud.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published