Description
When using hyparquet-writer to write GeoParquet files with type: "GEOGRAPHY", BigQuery rejects the resulting files with the error:
Cannot annotate Geography from BYTE_ARRAY for field ...
Background
hyparquet-writer correctly applies the Parquet-native GEOGRAPHY logical type (with crs and algorithm parameters) as defined in the Apache Parquet format specification. However, BigQuery's Parquet importer does not currently handle this logical type annotation. Instead, BigQuery relies on the GeoParquet file-level JSON metadata (the geo key in the Parquet file's key-value metadata) to identify and load geography columns.
For reference, DuckDB handles the GEOGRAPHY logical type without issue, so this is specifically a BigQuery compatibility problem (though I have not tested other consumers).
What BigQuery expects
For a geography column named geog, BigQuery expects:
- Column type: BYTE_ARRAY with WKB-encoded data and no Parquet-level logical type annotation
- File-level geo key-value metadata describing the column's encoding, CRS, and geometry types
What hyparquet-writer currently produces
- Column type: optional binary geog (Geography(crs=, algorithm=spherical)); BigQuery errors on this annotation
- File-level geo metadata: not present by default, but can be added via kvMetadata
What a compatible writer (e.g. geopandas) produces
- Column type: optional binary geog (plain BYTE_ARRAY, no logical type)
- File-level geo metadata (automatically generated)
- Arrow extension metadata (ARROW:extension:name: geoarrow.wkb) on the column (doesn't seem to affect BigQuery either way)
What I've Tried
- Adding file-level geo metadata via kvMetadata: This correctly adds the metadata, but BigQuery still fails because the column-level Geography logical type annotation is present.
- Using schemaOverrides: I tried overriding the geog column schema to strip the logical type, but the GEOGRAPHY type in columnData triggers the WKB conversion in unconvert.js based on the logical type check (ltype?.type === 'GEOGRAPHY'). So overriding the schema alone doesn't cleanly decouple the two concerns.
Workaround
I ended up pre-converting GeoJSON to WKB myself using geojsonToWkb from hyparquet-writer/src/wkb.js, and passing the resulting byte arrays directly without any type annotation:
    import { parquetWriteFile } from 'hyparquet-writer';
    import { geojsonToWkb } from 'hyparquet-writer/src/wkb.js';

    // Convert GeoJSON to WKB manually; features without a geometry become null
    const geogColumn = features.map(f => f.geometry ? geojsonToWkb(f.geometry) : null);

    parquetWriteFile({
      filename: 'output.parquet',
      columnData: [
        ...propertyColumns,
        // Plain BYTE_ARRAY: no Parquet-level logical type annotation
        { name: 'geog', data: geogColumn, type: 'BYTE_ARRAY' },
      ],
      // File-level GeoParquet metadata, which BigQuery uses to detect geography columns
      kvMetadata: [
        { key: 'geo', value: JSON.stringify({
          version: '1.1.0',
          primary_column: 'geog',
          columns: { geog: { encoding: 'WKB', geometry_types: [] } },
        }) },
      ],
    });

This produces files that BigQuery accepts, but it requires users to manually handle the WKB conversion (which the GEOGRAPHY type otherwise handles automatically) and to construct the file-level metadata themselves.
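The workaround above leaves geometry_types empty, which GeoParquet treats as "unknown". As a rough sketch, the list could be filled in by reading the header of each WKB value. This assumes geojsonToWkb returns a Uint8Array of ISO WKB with 2D geometries only; wkbGeometryType and collectGeometryTypes are hypothetical helper names, not part of hyparquet-writer.

```javascript
// Base (2D) WKB geometry type codes mapped to GeoParquet geometry type names.
const WKB_TYPE_NAMES = {
  1: 'Point',
  2: 'LineString',
  3: 'Polygon',
  4: 'MultiPoint',
  5: 'MultiLineString',
  6: 'MultiPolygon',
  7: 'GeometryCollection',
};

// Hypothetical helper: read the geometry type name from a WKB header.
// Byte 0 is the byte order (1 = little-endian), bytes 1-4 are the type code.
// Returns undefined for codes this sketch does not recognize (e.g. Z/M variants).
function wkbGeometryType(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const littleEndian = view.getUint8(0) === 1;
  const code = view.getUint32(1, littleEndian);
  return WKB_TYPE_NAMES[code];
}

// Hypothetical helper: distinct geometry types in a column of WKB values.
// Null entries are skipped. If any value is unrecognized, fall back to []
// so the metadata reports "unknown" rather than an incomplete list.
function collectGeometryTypes(column) {
  const types = new Set();
  for (const bytes of column) {
    if (bytes == null) continue;
    const name = wkbGeometryType(bytes);
    if (name === undefined) return [];
    types.add(name);
  }
  return [...types].sort();
}
```

With this in place, geometry_types: collectGeometryTypes(geogColumn) could replace the empty array in the geo metadata, though leaving it empty is also valid.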
Feature Request
1. A geoMetadata writer option
It would be great to have a dedicated geoMetadata option on parquetWriteFile that manages the GeoParquet file-level geo metadata. Some suggested behaviors:
- Default (auto-populate): When any GEOMETRY/GEOGRAPHY columns are present in columnData, geoMetadata would be automatically generated. These columns would be listed in the metadata, with the first geo column serving as the primary_column.
- Manual override: In the absence of GEOMETRY/GEOGRAPHY columns (e.g., when passing pre-converted WKB as plain BYTE_ARRAY), users could manually specify which columns should be included in geoMetadata and their encoding/CRS/geometry_types.
- Suppress: When GEOMETRY/GEOGRAPHY columns are present but file-level metadata is not wanted for some reason, setting geoMetadata: null would suppress it.
This would eliminate the need for users to manually construct the geo JSON and pass it through kvMetadata.
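To make the three behaviors concrete, here is a minimal sketch of how such an option might be resolved. Everything in it is an assumption for illustration: resolveGeoMetadata and geoKvEntry are hypothetical names, the option semantics (undefined = auto, object = manual, null = suppress) are just one reading of the list above, and the auto-generated entry marks GEOGRAPHY columns with edges: 'spherical' as GeoParquet allows.

```javascript
// Hypothetical sketch of resolving a geoMetadata writer option.
// Assumed semantics:
//   undefined -> auto-populate from GEOMETRY/GEOGRAPHY columns
//   object    -> manual override, used as given
//   null      -> suppress file-level geo metadata
function resolveGeoMetadata(columnData, geoMetadata) {
  if (geoMetadata === null) return undefined; // suppress
  if (geoMetadata !== undefined) return geoMetadata; // manual override

  // Auto-populate: find columns declared with a geospatial type.
  const geoColumns = columnData.filter(
    c => c.type === 'GEOMETRY' || c.type === 'GEOGRAPHY'
  );
  if (geoColumns.length === 0) return undefined;

  const columns = {};
  for (const { name, type } of geoColumns) {
    // An empty geometry_types array means "unknown" in GeoParquet.
    columns[name] = { encoding: 'WKB', geometry_types: [] };
    if (type === 'GEOGRAPHY') columns[name].edges = 'spherical';
  }
  // The first geo column serves as the primary column.
  return { version: '1.1.0', primary_column: geoColumns[0].name, columns };
}

// Hypothetical helper: turn the resolved metadata into kvMetadata entries.
function geoKvEntry(columnData, geoMetadata) {
  const geo = resolveGeoMetadata(columnData, geoMetadata);
  return geo ? [{ key: 'geo', value: JSON.stringify(geo) }] : [];
}
```

A writer could then append geoKvEntry(columnData, options.geoMetadata) to whatever the user passed in kvMetadata. How a user-supplied geo key in kvMetadata should interact with this is left open here.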
2. Easier access to WKB conversion (nice-to-have)
For users who need to work around the logical type issue (e.g., for BigQuery compatibility), it would help to have geojsonToWkb exported as a public API rather than requiring an import from hyparquet-writer/src/wkb.js.
3. Consider a BigQuery compatibility option (probably a bit much)
In an ideal world, BigQuery would handle the Parquet-native GEOGRAPHY logical type. I've filed a separate issue with Google about this. In the meantime, a compatibility option (e.g., geoMetadata: { logicalType: false } or similar) that writes the geo column as plain BYTE_ARRAY while still handling the GeoJSON→WKB conversion could help users targeting BigQuery.
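As a rough illustration of what such a compatibility mode might do internally, the sketch below rewrites GEOMETRY/GEOGRAPHY columns into plain BYTE_ARRAY columns before writing. stripGeoLogicalTypes is a hypothetical name, and the converter is passed in as a parameter (in practice it could be geojsonToWkb) so the sketch does not assume anything about the library's internals.

```javascript
// Hypothetical sketch: convert geo-typed columns to plain BYTE_ARRAY columns
// so that no Parquet-level logical type annotation is written.
// `toWkb` is an injected GeoJSON-to-WKB converter (e.g. geojsonToWkb).
function stripGeoLogicalTypes(columnData, toWkb) {
  return columnData.map(column => {
    const isGeo = column.type === 'GEOMETRY' || column.type === 'GEOGRAPHY';
    if (!isGeo) return column; // leave non-geo columns untouched
    return {
      ...column,
      type: 'BYTE_ARRAY',
      // Convert each geometry up front; null/undefined values become null.
      data: Array.from(column.data, g => (g == null ? null : toWkb(g))),
    };
  });
}
```

Combined with file-level geo metadata, the result would match the shape of the manual workaround above, without the user having to call the converter directly.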