GeoParquet files with GEOGRAPHY type produce incompatible Parquet metadata for BigQuery #23

@mjumbewu

Description

When using hyparquet-writer to write GeoParquet files with type: "GEOGRAPHY", BigQuery rejects the resulting files with the error:

Cannot annotate Geography from BYTE_ARRAY for field ...

Background

hyparquet-writer correctly applies the Parquet-native GEOGRAPHY logical type (with crs and algorithm parameters) as defined in the Apache Parquet format specification. However, BigQuery's Parquet importer does not currently handle this logical type annotation. Instead, BigQuery relies on the GeoParquet file-level JSON metadata (the geo key in the Parquet file's key-value metadata) to identify and load geography columns.

For reference, DuckDB handles the GEOGRAPHY logical type without issue, so this is specifically a BigQuery compatibility problem (though I have not tested other consumers).

What BigQuery expects

For a geography column named geog, BigQuery expects:

  • Column type: BYTE_ARRAY with WKB-encoded data and no Parquet-level logical type annotation
  • File-level geo key-value metadata describing the column's encoding, CRS, and geometry types
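For illustration, the file-level geo metadata is a JSON document stored under the `geo` key in the Parquet key-value metadata. A minimal example for this column, following the GeoParquet 1.1.0 spec (the values shown are illustrative):

```json
{
  "version": "1.1.0",
  "primary_column": "geog",
  "columns": {
    "geog": {
      "encoding": "WKB",
      "geometry_types": []
    }
  }
}
```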

What hyparquet-writer currently produces

  • Column type: optional binary geog (Geography(crs=, algorithm=spherical)) — BigQuery errors on this annotation
  • File-level geo metadata: not present by default, but can be added via kvMetadata

What a compatible writer (e.g. geopandas) produces

  • Column type: optional binary geog (plain BYTE_ARRAY, no logical type)
  • File-level geo metadata (automatically generated)
  • Arrow extension metadata (ARROW:extension:name: geoarrow.wkb) on the column (doesn't seem to affect BigQuery either way)

What I've Tried

  1. Adding file-level geo metadata via kvMetadata: This correctly adds the metadata, but BigQuery still fails because the column-level Geography logical type annotation is present.

  2. Using schemaOverrides: I tried overriding the geog column schema to strip the logical type, but the GEOGRAPHY type in columnData triggers the WKB conversion in unconvert.js based on the logical type check (ltype?.type === 'GEOGRAPHY'). So overriding the schema alone doesn't cleanly decouple the two concerns.
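To make the coupling concrete, here is a minimal sketch (not the actual unconvert.js source; the function and field names are hypothetical) of how a single logical-type check ends up driving both behaviors:

```javascript
// Hypothetical sketch: one check gates both the GeoJSON→WKB conversion
// and the column-level logical type annotation, so removing the
// annotation via schemaOverrides also loses the conversion.
function planGeographyColumn(ltype) {
  const isGeography = ltype?.type === 'GEOGRAPHY'
  return {
    convertToWkb: isGeography,        // data conversion keyed on the logical type
    annotateLogicalType: isGeography, // metadata annotation keyed on the same check
  }
}
```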

Workaround

I ended up pre-converting GeoJSON to WKB myself using geojsonToWkb from hyparquet-writer/src/wkb.js, and passing the resulting byte arrays directly with no logical type annotation:

import { parquetWriteFile } from 'hyparquet-writer';
import { geojsonToWkb } from 'hyparquet-writer/src/wkb.js';

// Convert GeoJSON to WKB manually
const geogColumn = features.map(f => f.geometry ? geojsonToWkb(f.geometry) : null);

parquetWriteFile({
  filename: 'output.parquet',
  columnData: [
    ...propertyColumns, // non-geometry property columns, prepared elsewhere
    { name: 'geog', data: geogColumn, type: 'BYTE_ARRAY' },
  ],
  kvMetadata: [
    { key: 'geo', value: JSON.stringify({
      version: '1.1.0',
      primary_column: 'geog',
      columns: { geog: { encoding: 'WKB', geometry_types: [] } },
    }) },
  ],
});

This produces files that BigQuery accepts, but it requires users to manually handle WKB conversion and construct the file-level metadata, which the GEOGRAPHY type handles automatically.

Feature Request

1. A geoMetadata writer option

It would be great to have a dedicated geoMetadata option on parquetWriteFile that manages the GeoParquet file-level geo metadata. Some suggested behaviors:

  • Default (auto-populate): When any GEOMETRY/GEOGRAPHY columns are present in columnData, geoMetadata would be automatically generated. These columns would be listed in the metadata, with the first geo column serving as the primary_column.
  • Manual override: In the absence of GEOMETRY/GEOGRAPHY columns (e.g., when passing pre-converted WKB as plain BYTE_ARRAY), users could manually specify which columns should be included in geoMetadata and their encoding/CRS/geometry_types.
  • Suppress: When GEOMETRY/GEOGRAPHY columns are present but file-level metadata is not wanted for some reason, setting geoMetadata: null would suppress it.

This would eliminate the need for users to manually construct the geo JSON and pass it through kvMetadata.
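A rough sketch of the auto-populate behavior (names and defaults here are a suggestion, not an existing API; it assumes columnData entries carry a type field as in hyparquet-writer):

```javascript
// Hypothetical sketch of auto-generating GeoParquet file-level metadata
// from columnData. Returns undefined when no geo columns are present.
function buildGeoMetadata(columnData) {
  const geoColumns = columnData.filter(
    c => c.type === 'GEOMETRY' || c.type === 'GEOGRAPHY'
  )
  if (geoColumns.length === 0) return undefined
  const columns = {}
  for (const c of geoColumns) {
    // encoding/CRS/geometry_types could be overridable by the user
    columns[c.name] = { encoding: 'WKB', geometry_types: [] }
  }
  return {
    version: '1.1.0',
    primary_column: geoColumns[0].name, // first geo column is primary
    columns,
  }
}
```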

2. Easier access to WKB conversion (nice-to-have)

For users who need to work around the logical type issue (e.g., for BigQuery compatibility), it would help to have geojsonToWkb exported as a public API rather than requiring an import from hyparquet-writer/src/wkb.js.

3. Consider a BigQuery compatibility option (probably a bit much)

In an ideal world, BigQuery would handle the Parquet-native GEOGRAPHY logical type. I've filed a separate issue with Google about this. In the meantime, a compatibility option (e.g., geoMetadata: { logicalType: false } or similar) that writes the geo column as plain BYTE_ARRAY while still handling the GeoJSON→WKB conversion could help users targeting BigQuery.
