
Commit 965783e

horgh and claude committed

Add network_bucket column type for efficient BigQuery lookups

Add a new `network_bucket` network column type for Parquet output that
enables efficient IP lookups in BigQuery and other analytics platforms.
When a network spans multiple buckets, rows are duplicated with different
bucket values while preserving the original network info in
start_int/end_int.

Key changes:

- Add SplitPrefix() function to split prefixes into bucket-sized pieces
- Add IPv4BucketSize and IPv6BucketSize config options (default: 16)
- Implement row duplication in Parquet writer for networks spanning buckets
- Bucket type matches start_int/end_int: int64 for IPv4, 16-byte for IPv6
- Require split files (ipv4_file + ipv6_file) for network_bucket column

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

1 parent cd2f895 commit 965783e

File tree

11 files changed: +994 −11 lines


CHANGELOG.md — 8 additions, 0 deletions

@@ -15,6 +15,14 @@ and this project adheres to
   Parquet file declaring that rows are sorted by start_int in ascending order.
   This enables query engines like DuckDB, Spark, and Trino to use the sort
   order for potential optimizations like binary search.
+- New `network_bucket` network column type for Parquet output, enabling
+  efficient IP lookups in BigQuery and other analytics platforms. When a
+  network spans multiple buckets, rows are duplicated with different bucket
+  values while preserving original network info. The bucket type matches
+  `start_int`/`end_int`. Requires split output files (`ipv4_file` and
+  `ipv6_file`).
+- New Parquet options `ipv4_bucket_size` and `ipv6_bucket_size` to
+  configure bucket prefix lengths (default: 16).

 ## [0.1.0] - 2025-11-07


README.md — 45 additions, 0 deletions

@@ -406,6 +406,10 @@ type = "start_int" # e.g., 3405803776 (IPv4 only)
 [[network.columns]]
 name = "end_int"
 type = "end_int" # e.g., 3405804031 (IPv4 only)
+
+[[network.columns]]
+name = "network_bucket"
+type = "network_bucket" # Bucket (int for IPv4, bytes for IPv6). Requires split files.
 ```

 **Default network columns:** If you don't define any `[[network.columns]]`,

@@ -419,6 +423,47 @@ split your output into separate IPv4/IPv6 files via `output.ipv4_file` and
 `output.ipv6_file`. For single-file outputs that include IPv6 data, use string
 columns (`start_ip`, `end_ip`, `cidr`).

+**Note:** `network_bucket` is currently only supported for Parquet output.
+
+### Network Bucketing for Analytics (BigQuery, etc.)
+
+When loading network data into analytics platforms like BigQuery, range queries
+can be slow due to full table scans. The `network_bucket` column provides a
+join key that enables efficient queries by first filtering to a specific bucket.
+
+**Configuration:**
+
+```toml
+[output]
+format = "parquet"
+ipv4_file = "geoip-v4.parquet"
+ipv6_file = "geoip-v6.parquet"
+
+[output.parquet]
+ipv4_bucket_size = 16 # Optional, defaults to 16
+ipv6_bucket_size = 16 # Optional, defaults to 16
+
+[[network.columns]]
+name = "start_int"
+type = "start_int"
+
+[[network.columns]]
+name = "end_int"
+type = "end_int"
+
+[[network.columns]]
+name = "network_bucket"
+type = "network_bucket"
+```
+
+The bucket type matches `start_int`/`end_int`: int64 for IPv4, 16-byte array
+for IPv6. This requires split output files. See
+[docs/parquet-queries.md](docs/parquet-queries.md) for BigQuery query examples.
+
+**Note:** When a network is larger than the bucket size (e.g., a /15 with /16
+buckets), the row is duplicated for each bucket it spans. This ensures queries
+find the correct network regardless of which bucket the IP falls into.
+
 ### Data Type Hints

 Parquet supports native types for efficient storage and queries:

docs/config.md — 19 additions, 6 deletions

@@ -80,9 +80,19 @@ When `format = "parquet"`, you can specify Parquet-specific options:

 ```toml
 [output.parquet]
-compression = "snappy" # Compression: "none", "snappy", "gzip", "lz4", "zstd" (default: "snappy")
+compression = "snappy"   # Compression: "none", "snappy", "gzip", "lz4", "zstd" (default: "snappy")
+row_group_size = 500000  # Rows per row group (default: 500000)
+ipv4_bucket_size = 16    # Bucket prefix length for IPv4 (default: 16)
+ipv6_bucket_size = 16    # Bucket prefix length for IPv6 (default: 16)
 ```

+| Option | Description | Default |
+|--------|-------------|---------|
+| `compression` | Compression codec: "none", "snappy", "gzip", "lz4", "zstd" | "snappy" |
+| `row_group_size` | Number of rows per row group | 500000 |
+| `ipv4_bucket_size` | Prefix length for IPv4 buckets (when `network_bucket` column used) | 16 |
+| `ipv6_bucket_size` | Prefix length for IPv6 buckets (when `network_bucket` column used) | 16 |
+
 #### MMDB Options

 When `format = "mmdb"`, you can specify MMDB-specific options:

@@ -134,11 +144,14 @@ type = "cidr" # Output type

 **Available types:**

-- `cidr` - CIDR notation (e.g., "203.0.113.0/24")
-- `start_ip` - Starting IP address (e.g., "203.0.113.0")
-- `end_ip` - Ending IP address (e.g., "203.0.113.255")
-- `start_int` - Starting IP as integer
-- `end_int` - Ending IP as integer
+| Type | Description |
+|------|-------------|
+| `cidr` | CIDR notation (e.g., "203.0.113.0/24") |
+| `start_ip` | Starting IP address (e.g., "203.0.113.0") |
+| `end_ip` | Ending IP address (e.g., "203.0.113.255") |
+| `start_int` | Starting IP as integer (int64 for IPv4, 16-byte for IPv6) |
+| `end_int` | Ending IP as integer (int64 for IPv4, 16-byte for IPv6) |
+| `network_bucket` | Bucket for efficient lookups (int64 for IPv4, 16-byte for IPv6). Same types as `start_int`/`end_int`. Requires split files (Parquet only). |

 **Default behavior:** If no `[[network.columns]]` sections are defined:
docs/parquet-queries.md — 80 additions, 0 deletions

@@ -330,6 +330,86 @@ ipv4_file = "geo_ipv4.parquet"
 ipv6_file = "geo_ipv6.parquet"
 ```

+## BigQuery with Network Bucketing
+
+BigQuery performs full table scans for range queries like `WHERE start_int <= ip
+AND end_int >= ip`. Use the `network_bucket` column to enable efficient lookups.
+
+### Configuration
+
+```toml
+[output]
+format = "parquet"
+ipv4_file = "geoip-v4.parquet"
+ipv6_file = "geoip-v6.parquet"
+
+[output.parquet]
+ipv4_bucket_size = 16 # Default: /16 prefix
+ipv6_bucket_size = 16 # Default: /16 prefix
+
+[[network.columns]]
+name = "start_int"
+type = "start_int"
+
+[[network.columns]]
+name = "end_int"
+type = "end_int"
+
+[[network.columns]]
+name = "network_bucket"
+type = "network_bucket"
+```
+
+### BigQuery Query Patterns
+
+#### IPv4 Lookup
+
+For IPv4, the bucket is int64. Use `NET.IP_TRUNC()` to get the bucket and
+`NET.IPV4_TO_INT64()` to convert to the integer type:
+
+```sql
+SELECT *
+FROM `project.dataset.geoip_v4`
+WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
+  AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
+```
+
+#### IPv6 Lookup
+
+For IPv6, the bucket is bytes. Use `NET.IP_TRUNC()` to get the bucket:
+
+```sql
+SELECT *
+FROM `project.dataset.geoip_v6`
+WHERE network_bucket = NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)
+  AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
+```
+
+### Why Bucketing Helps
+
+Without bucketing, BigQuery must scan every row to check the range condition.
+With bucketing:
+
+1. BigQuery first filters by exact match on `network_bucket`
+2. Only rows in the matching bucket are checked for the range condition
+3. Result: the query reads only the rows in one bucket, a small fraction of
+   the table
+
+### Row Duplication
+
+Networks larger than the bucket size are duplicated. For example, a /15 network
+spans two /16 buckets:
+
+**IPv4 example (int64 bucket values):**
+
+| network | start_int | end_int | network_bucket |
+|---------|-----------|---------|----------------|
+| 2.0.0.0/15 | 33554432 | 33685503 | 33554432 |
+| 2.0.0.0/15 | 33554432 | 33685503 | 33619968 |
+
+Both rows have the same `start_int`/`end_int` (the full /15 range), but different
+`network_bucket` values (2.0.0.0 = 33554432, 2.1.0.0 = 33619968). Queries for IPs
+in either bucket will find the network.
+
## Common Query Patterns

### Single IP Lookup

internal/config/config.go — 30 additions, 3 deletions

@@ -47,8 +47,10 @@ type CSVConfig struct {

 // ParquetConfig defines Parquet output options.
 type ParquetConfig struct {
-	Compression  string `toml:"compression"`    // "none", "snappy", "gzip", "lz4", "zstd" (default: "snappy")
-	RowGroupSize int    `toml:"row_group_size"` // Rows per row group (default: 500000)
+	Compression    string `toml:"compression"`      // "none", "snappy", "gzip", "lz4", "zstd" (default: "snappy")
+	RowGroupSize   int    `toml:"row_group_size"`   // Rows per row group (default: 500000)
+	IPv4BucketSize int    `toml:"ipv4_bucket_size"` // Bucket prefix length for IPv4 (default: 16)
+	IPv6BucketSize int    `toml:"ipv6_bucket_size"` // Bucket prefix length for IPv6 (default: 16)
 }

 // MMDBConfig defines MMDB output options.

@@ -174,6 +176,12 @@ func applyDefaults(config *Config) {
 	if config.Output.Parquet.RowGroupSize == 0 {
 		config.Output.Parquet.RowGroupSize = 500000
 	}
+	if config.Output.Parquet.IPv4BucketSize == 0 {
+		config.Output.Parquet.IPv4BucketSize = 16
+	}
+	if config.Output.Parquet.IPv6BucketSize == 0 {
+		config.Output.Parquet.IPv6BucketSize = 16
+	}

 	// MMDB defaults
 	if config.Output.Format == formatMMDB {

@@ -315,8 +323,10 @@ func validate(config *Config) error {
 	// Validate network columns
 	validNetworkTypes := map[string]bool{
 		"cidr": true, "start_ip": true, "end_ip": true, "start_int": true, "end_int": true,
+		"network_bucket": true,
 	}
 	networkColNames := map[mmdbtype.String]bool{}
+	hasBucketColumn := false
 	for _, col := range config.Network.Columns {
 		if col.Name == "" {
 			return errors.New("network column name is required")

@@ -326,17 +336,34 @@ func validate(config *Config) error {
 		}
 		if !validNetworkTypes[col.Type] {
 			return fmt.Errorf(
-				"invalid network column type '%s' for column '%s', must be one of: cidr, start_ip, end_ip, start_int, end_int",
+				"invalid network column type '%s' for column '%s', must be one of: cidr, start_ip, end_ip, start_int, end_int, network_bucket",
 				col.Type,
 				col.Name,
 			)
 		}
+		if col.Type == "network_bucket" {
+			hasBucketColumn = true
+		}
 		if networkColNames[col.Name] {
 			return fmt.Errorf("duplicate network column name '%s'", col.Name)
 		}
 		networkColNames[col.Name] = true
 	}

+	// network_bucket column requires split files (different types for IPv4 vs IPv6)
+	if hasBucketColumn {
+		if config.Output.Format != formatParquet {
+			return errors.New(
+				"network_bucket column type is only supported for Parquet output",
+			)
+		}
+		if config.Output.IPv4File == "" || config.Output.IPv6File == "" {
+			return errors.New(
+				"network_bucket column requires split files (ipv4_file and ipv6_file)",
+			)
+		}
+	}
+
 	// Validate data columns
 	validDataTypes := map[string]bool{
 		"": true, "string": true, "int64": true, "float64": true, "bool": true, "binary": true,
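Given the checks added to validate() above, a config fragment like this hypothetical one would be rejected:

```toml
[output]
format = "csv"  # rejected: network_bucket is only supported for Parquet output

[[network.columns]]
name = "network_bucket"
type = "network_bucket"
```

Even with `format = "parquet"`, the same column would still be rejected unless both `ipv4_file` and `ipv6_file` are set, since the bucket column has different types for IPv4 and IPv6.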
