Commit 62e7a0d

Merge pull request #12 from maxmind/wstorey/bucket-column
Add network_bucket column type for efficient BigQuery lookups
2 parents 6b176e7 + e498ee4 commit 62e7a0d

14 files changed: +2361 −30 lines

.precious.toml

Lines changed: 5 additions & 0 deletions

```diff
@@ -5,6 +5,7 @@ lint-flags = "--diff"
 ok-exit-codes = 0
 invoke = "once"
 include = ["**/*.go"]
+exclude = ["testdata/MaxMind-DB/**"]

 [commands.golangci-lint]
 type = "both"
@@ -15,12 +16,16 @@ expect-stderr = true
 invoke = "once"
 path-args = "dir"
 include = ["**/*.go"]
+exclude = ["testdata/MaxMind-DB/**"]

 [commands.prettier-markdown]
 type = "both"
 include = [
     "**/*.md"
 ]
+exclude = [
+    "testdata/MaxMind-DB/**"
+]
 cmd = [
     "npx",
     "prettier",
```

CHANGELOG.md

Lines changed: 16 additions & 5 deletions

```diff
@@ -10,11 +10,22 @@ and this project adheres to

 ### Added

-- Parquet sorting column metadata for query optimization. When start_int
-  columns are configured, mmdbconvert now writes sorting metadata to the
-  Parquet file declaring that rows are sorted by start_int in ascending order.
-  This enables query engines like DuckDB, Spark, and Trino to use the sort
-  order for potential optimizations like binary search.
+- Parquet sorting column metadata for query optimization. When start_int columns
+  are configured, mmdbconvert now writes sorting metadata to the Parquet file
+  declaring that rows are sorted by start_int in ascending order. This enables
+  query engines like DuckDB, Spark, and Trino to use the sort order for
+  potential optimizations like binary search.
+- New `network_bucket` network column type for CSV and Parquet output, enabling
+  efficient IP lookups in BigQuery and other analytics platforms. When a network
+  spans multiple buckets, rows are duplicated with different bucket values while
+  preserving original network info. For IPv4, the bucket is an integer. For
+  IPv6, the bucket is either a hex string (e.g.,
+  "200f0000000000000000000000000000") or an integer depending on
+  `ipv6_bucket_type`. Requires split output files (`ipv4_file` and `ipv6_file`).
+- New CSV and Parquet options `ipv4_bucket_size` and `ipv6_bucket_size` to
+  configure bucket prefix lengths (default: 16).
+- New CSV and Parquet option `ipv6_bucket_type` to configure the IPv6 network
+  bucket column format (default: string).

 ## [0.1.0] - 2025-11-07
```

README.md

Lines changed: 51 additions & 0 deletions

````diff
@@ -406,6 +406,10 @@ type = "start_int" # e.g., 3405803776 (IPv4 only)
 [[network.columns]]
 name = "end_int"
 type = "end_int" # e.g., 3405804031 (IPv4 only)
+
+[[network.columns]]
+name = "network_bucket"
+type = "network_bucket" # Bucket for efficient lookups. Requires split files.
 ```

 **Default network columns:** If you don't define any `[[network.columns]]`,
@@ -419,6 +423,52 @@ split your output into separate IPv4/IPv6 files via `output.ipv4_file` and
 `output.ipv6_file`. For single-file outputs that include IPv6 data, use string
 columns (`start_ip`, `end_ip`, `cidr`).

+### Network Bucketing for Analytics (BigQuery, etc.)
+
+When loading network data into analytics platforms like BigQuery, range queries
+can be slow due to full table scans. The `network_bucket` column provides a join
+key that enables efficient queries by first filtering to a specific bucket.
+
+**Configuration:**
+
+```toml
+[output]
+format = "parquet"
+ipv4_file = "geoip-v4.parquet"
+ipv6_file = "geoip-v6.parquet"
+
+[output.parquet]
+ipv4_bucket_size = 16    # Optional, defaults to 16
+ipv6_bucket_size = 16    # Optional, defaults to 16
+ipv6_bucket_type = "int" # Optional: "string" (default) or "int"
+
+[[network.columns]]
+name = "start_int"
+type = "start_int"
+
+[[network.columns]]
+name = "end_int"
+type = "end_int"
+
+[[network.columns]]
+name = "network_bucket"
+type = "network_bucket"
+```
+
+For IPv4, the bucket is a 32-bit integer. For IPv6, the bucket is either a hex
+string (default) or a 60-bit integer when `ipv6_bucket_type = "int"` is
+configured.
+
+Using `network_bucket` requires split output files.
+
+See [docs/bigquery.md](docs/bigquery.md) for BigQuery query examples.
+
+**Note:** When a network is larger than the bucket size (e.g., a /15 with /16
+buckets), the row is duplicated for each bucket it spans. This ensures queries
+find the correct network regardless of which bucket the IP falls into.
+
+**Note:** `network_bucket` is supported for CSV and Parquet output.
+
 ### Data Type Hints

 Parquet supports native types for efficient storage and queries:
@@ -527,6 +577,7 @@ This ensures accurate IP lookups with no ambiguity.

 - [Configuration Reference](docs/config.md) - Complete config file documentation
 - [Parquet Query Guide](docs/parquet-queries.md) - Optimizing IP lookup queries
+- [BigQuery Guide](docs/bigquery.md) - Network bucketing for BigQuery

 ## Requirements
````
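
The duplication note above is easiest to see with concrete numbers. Below is a
minimal, self-contained Go sketch (our illustration, not code from this commit;
`ipv4Buckets` is a hypothetical helper) that enumerates the bucket values a
network overlaps:

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipv4Buckets returns the integer bucket values an IPv4 network overlaps
// for a given bucket prefix length. A network wider than one bucket maps
// to several values, which is why its output row is duplicated.
func ipv4Buckets(network netip.Prefix, bucketBits int) []uint32 {
	a := network.Masked().Addr().As4()
	start := uint32(a[0])<<24 | uint32(a[1])<<16 | uint32(a[2])<<8 | uint32(a[3])
	step := uint32(1) << (32 - bucketBits)            // addresses per bucket; 65536 for /16
	end := start + uint32(1)<<(32-network.Bits()) - 1 // last address in the network

	base := start &^ (step - 1) // bucket containing the first address
	n := (end-base)/step + 1    // number of buckets the network overlaps
	buckets := make([]uint32, 0, n)
	for i := uint32(0); i < n; i++ {
		buckets = append(buckets, base+i*step)
	}
	return buckets
}

func main() {
	// A /15 spans two /16 buckets, so its row is emitted twice:
	// 3405774848 (203.0.0.0) and 3405840384 (203.1.0.0).
	fmt.Println(ipv4Buckets(netip.MustParsePrefix("203.0.0.0/15"), 16))
}
```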

docs/bigquery.md

Lines changed: 69 additions & 0 deletions (new file)

# BigQuery with Network Bucketing

BigQuery performs full table scans for range queries like
`WHERE start_int <= ip AND end_int >= ip`. Use a `network_bucket` column to
enable efficient lookups.

**Note:** The BigQuery table must be clustered on the `network_bucket` column
for efficient querying.

**Important:** The bucket size in your queries must match the configured bucket
size. The examples below use the default `/16` bucket size. If you configured a
different `ipv4_bucket_size` or `ipv6_bucket_size`, adjust the second argument
to `NET.IP_TRUNC()` accordingly.

## IPv4 Lookup

For IPv4, the bucket is an int64. Use `NET.IP_TRUNC()` to get the bucket and
`NET.IPV4_TO_INT64()` to convert it to an integer:

```sql
-- Using default ipv4_bucket_size = 16
SELECT *
FROM `project.dataset.geoip_v4`
WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
  AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
```

## IPv6 Lookup

The query depends on your `ipv6_bucket_type` configuration.

**Note:** For IPv6 files, `start_int` and `end_int` columns are stored as
16-byte binary values, not integers. The comparison with `NET.IP_FROM_STRING()`
works because it also returns BYTES.

**Using default `ipv6_bucket_type = "string"` (hex string):**

```sql
-- Using default ipv6_bucket_size = 16
SELECT *
FROM `project.dataset.geoip_v6`
WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16))
  AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
```

**Using `ipv6_bucket_type = "int"` (60-bit int64):**

```sql
-- Using default ipv6_bucket_size = 16
SELECT *
FROM `project.dataset.geoip_v6`
WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
    TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
  )) AS INT64)
  AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
```

The int-type expression extracts the first 60 bits (15 hex characters) of the
truncated IPv6 address as an integer.

## Why Bucketing Helps

Without bucketing, BigQuery must scan every row to check the range condition.
With bucketing:

1. BigQuery first filters by exact match on `network_bucket`
2. Only rows in the matching bucket are checked against the range condition
3. Result: the query scans only rows in the matching bucket instead of the
   entire table
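
If you want to sanity-check the bucket expressions above without running
BigQuery, the following standalone Go sketch (our illustration, not part of
mmdbconvert; the `truncate` helper is hypothetical) computes the same three
bucket forms for the example addresses:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net/netip"
	"strconv"
)

// truncate zeroes every bit of addr beyond the given prefix length,
// mirroring BigQuery's NET.IP_TRUNC.
func truncate(addr netip.Addr, bits int) netip.Addr {
	p, _ := addr.Prefix(bits)
	return p.Addr()
}

func main() {
	// IPv4: the bucket is the truncated address as an integer, like
	// NET.IPV4_TO_INT64(NET.IP_TRUNC(ip, 16)).
	v4 := truncate(netip.MustParseAddr("203.0.113.100"), 16).As4()
	v4Bucket := uint32(v4[0])<<24 | uint32(v4[1])<<16 | uint32(v4[2])<<8 | uint32(v4[3])
	fmt.Println(v4Bucket) // 3405774848, i.e. 203.0.0.0

	// IPv6 string bucket: the 16-byte truncated address as hex, like
	// TO_HEX(NET.IP_TRUNC(ip, 16)).
	v6 := truncate(netip.MustParseAddr("2001:db8::1"), 16).As16()
	hexBucket := hex.EncodeToString(v6[:])
	fmt.Println(hexBucket) // 20010000000000000000000000000000

	// IPv6 int bucket: the first 60 bits (15 hex chars) as an int64, like
	// CAST(CONCAT('0x', SUBSTR(TO_HEX(...), 1, 15)) AS INT64).
	intBucket, _ := strconv.ParseInt(hexBucket[:15], 16, 64)
	fmt.Println(intBucket) // always non-negative: 60 bits fit in a positive int64
}
```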

docs/config.md

Lines changed: 59 additions & 6 deletions

````diff
@@ -72,17 +72,40 @@ When `format = "csv"`, you can specify CSV-specific options:
 [output.csv]
 delimiter = "," # Field delimiter (default: ",")
 include_header = true # Include column headers (default: true)
+ipv4_bucket_size = 16 # Bucket prefix length for IPv4 (default: 16)
+ipv6_bucket_size = 16 # Bucket prefix length for IPv6 (default: 16)
+ipv6_bucket_type = "string" # IPv6 bucket value type: "string" or "int" (default: "string")
 ```

+| Option             | Description                                                                 | Default  |
+| ------------------ | --------------------------------------------------------------------------- | -------- |
+| `delimiter`        | Field delimiter character                                                    | ","      |
+| `include_header`   | Include column headers in output                                             | true     |
+| `ipv4_bucket_size` | Prefix length for IPv4 buckets (1-32, when `network_bucket` column used)     | 16       |
+| `ipv6_bucket_size` | Prefix length for IPv6 buckets (1-60, when `network_bucket` column used)     | 16       |
+| `ipv6_bucket_type` | IPv6 bucket value type: "string" (hex) or "int" (first 60 bits as integer)   | "string" |
+
 #### Parquet Options

 When `format = "parquet"`, you can specify Parquet-specific options:

 ```toml
 [output.parquet]
-compression = "snappy" # Compression: "none", "snappy", "gzip", "lz4", "zstd" (default: "snappy")
+compression = "snappy"      # Compression: "none", "snappy", "gzip", "lz4", "zstd" (default: "snappy")
+row_group_size = 500000     # Rows per row group (default: 500000)
+ipv4_bucket_size = 16       # Bucket prefix length for IPv4 (default: 16)
+ipv6_bucket_size = 16       # Bucket prefix length for IPv6 (default: 16)
+ipv6_bucket_type = "string" # IPv6 bucket value type: "string" or "int" (default: "string")
 ```

+| Option             | Description                                                                 | Default  |
+| ------------------ | --------------------------------------------------------------------------- | -------- |
+| `compression`      | Compression codec: "none", "snappy", "gzip", "lz4", "zstd"                   | "snappy" |
+| `row_group_size`   | Number of rows per row group                                                 | 500000   |
+| `ipv4_bucket_size` | Prefix length for IPv4 buckets (1-32, when `network_bucket` column used)     | 16       |
+| `ipv6_bucket_size` | Prefix length for IPv6 buckets (1-60, when `network_bucket` column used)     | 16       |
+| `ipv6_bucket_type` | IPv6 bucket value type: "string" (hex) or "int" (first 60 bits as integer)   | "string" |
+
 #### MMDB Options

 When `format = "mmdb"`, you can specify MMDB-specific options:
@@ -121,6 +144,33 @@ ipv6_file = "merged_ipv6.parquet"

 When splitting output, both `ipv4_file` and `ipv6_file` must be configured.

+#### IPv6 Bucket Type Options
+
+IPv6 buckets can be stored as either hex strings (default) or int64 values:
+
+**String type (default):**
+
+- Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
+- Storage: 32 bytes per value
+
+**Int type (`ipv6_bucket_type = "int"`):**
+
+- Format: first 60 bits of the bucket address as an int64
+- Storage: 8 bytes per value (4x smaller than the string form)
+
+We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
+which simplifies queries by avoiding two's complement handling.
+
+**When to use each type:**
+
+- Use **string** (default) for databases where hex string representations are
+  simpler to work with.
+- Use **int** for reduced storage cost at the price of more complicated queries.
+
+We do not provide a `bytes` type for the IPv6 bucket, primarily because so far
+there has not been a need. For example, BigQuery cannot cluster on `bytes`, so
+it would not be helpful there.
+
 ### Network Columns

 Network columns define how IP network information is output. These columns
@@ -134,11 +184,14 @@ type = "cidr" # Output type

 **Available types:**

-- `cidr` - CIDR notation (e.g., "203.0.113.0/24")
-- `start_ip` - Starting IP address (e.g., "203.0.113.0")
-- `end_ip` - Ending IP address (e.g., "203.0.113.255")
-- `start_int` - Starting IP as integer
-- `end_int` - Ending IP as integer
+| Type             | Description                                                                                                                                                        |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `cidr`           | CIDR notation (e.g., "203.0.113.0/24")                                                                                                                               |
+| `start_ip`       | Starting IP address (e.g., "203.0.113.0")                                                                                                                            |
+| `end_ip`         | Ending IP address (e.g., "203.0.113.255")                                                                                                                            |
+| `start_int`      | Starting IP as integer                                                                                                                                               |
+| `end_int`        | Ending IP as integer                                                                                                                                                 |
+| `network_bucket` | Bucket for efficient lookups. IPv4: integer. IPv6: hex string (default) or integer (with `ipv6_bucket_type = "int"`). Requires split files (CSV and Parquet only).   |

 **Default behavior:** If no `[[network.columns]]` sections are defined:
````
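
As a quick illustration of the two's-complement point in the IPv6 bucket type
section above (our example, using a made-up bucket value, not output from
mmdbconvert):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// A full 64-bit bucket with the high bit set turns negative when
	// reinterpreted as int64 (two's complement):
	full := uint64(0xffff0000_00000000) // hypothetical bucket for ffff::/16
	fmt.Println(int64(full))            // -281474976710656

	// Capping buckets at 60 bits keeps every value positive:
	fmt.Println(uint64(1)<<60 - 1)    // 1152921504606846975, the largest bucket
	fmt.Println(int64(math.MaxInt64)) // 9223372036854775807
}
```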
