Commit 32bc23a

Support IPv6 buckets being integer
1 parent a34a964 commit 32bc23a

File tree

10 files changed

+528
-42
lines changed

CHANGELOG.md

Lines changed: 6 additions & 4 deletions
@@ -18,12 +18,14 @@ and this project adheres to
1818
- New `network_bucket` network column type for Parquet output, enabling
1919
efficient IP lookups in BigQuery and other analytics platforms. When a network
2020
spans multiple buckets, rows are duplicated with different bucket values while
21-
preserving original network info. For IPv4, the bucket is integer (matching
22-
`start_int`/`end_int`). For IPv6, the bucket is a hex string (e.g.,
23-
"200f0000000000000000000000000000"). Requires split output files (`ipv4_file`
24-
and `ipv6_file`).
21+
preserving original network info. For IPv4, the bucket is an integer. For
22+
IPv6, the bucket is either a hex string (e.g.,
23+
"200f0000000000000000000000000000") or an integer depending on
24+
`ipv6_bucket_type`. Requires split output files (`ipv4_file` and `ipv6_file`).
2525
- New Parquet options `ipv4_bucket_size` and `ipv6_bucket_size` to configure
2626
bucket prefix lengths (default: 16).
27+
- New Parquet option `ipv6_bucket_type` to configure the IPv6 network bucket
28+
column format (default: string).
2729

2830
## [0.1.0] - 2025-11-07
2931

README.md

Lines changed: 6 additions & 5 deletions
@@ -440,8 +440,9 @@ ipv4_file = "geoip-v4.parquet"
440440
ipv6_file = "geoip-v6.parquet"
441441

442442
[output.parquet]
443-
ipv4_bucket_size = 16 # Optional, defaults to 16
444-
ipv6_bucket_size = 16 # Optional, defaults to 16
443+
ipv4_bucket_size = 16 # Optional, defaults to 16
444+
ipv6_bucket_size = 16 # Optional, defaults to 16
445+
ipv6_bucket_type = "int" # Optional: "string" (default) or "int"
445446

446447
[[network.columns]]
447448
name = "start_int"
@@ -456,9 +457,9 @@ name = "network_bucket"
456457
type = "network_bucket"
457458
```
458459

459-
For IPv4, the bucket is an integer (matching `start_int`/`end_int`). For IPv6,
460-
the bucket is a hex string (e.g., "200f0000000000000000000000000000"). This
461-
requires split output files. See
460+
For IPv4, the bucket is an integer. For IPv6, the bucket is either a hex string
461+
(default) or an integer when `ipv6_bucket_type = "int"` is configured. Using
462+
`network_bucket` requires split output files. See
462463
[docs/parquet-queries.md](docs/parquet-queries.md) for BigQuery query examples.
463464

464465
**Note:** When a network is larger than the bucket size (e.g., a /15 with /16

docs/config.md

Lines changed: 16 additions & 14 deletions
@@ -84,14 +84,16 @@ compression = "snappy" # Compression: "none", "snappy", "gzip", "lz4", "zstd"
8484
row_group_size = 500000 # Rows per row group (default: 500000)
8585
ipv4_bucket_size = 16 # Bucket prefix length for IPv4 (default: 16)
8686
ipv6_bucket_size = 16 # Bucket prefix length for IPv6 (default: 16)
87+
ipv6_bucket_type = "string" # IPv6 bucket value type: "string" or "int" (default: "string")
8788
```
8889

89-
| Option | Description | Default |
90-
| ------------------ | ------------------------------------------------------------------ | -------- |
91-
| `compression` | Compression codec: "none", "snappy", "gzip", "lz4", "zstd" | "snappy" |
92-
| `row_group_size` | Number of rows per row group | 500000 |
93-
| `ipv4_bucket_size` | Prefix length for IPv4 buckets (when `network_bucket` column used) | 16 |
94-
| `ipv6_bucket_size` | Prefix length for IPv6 buckets (when `network_bucket` column used) | 16 |
90+
| Option | Description | Default |
91+
| ------------------ | -------------------------------------------------------------------------- | -------- |
92+
| `compression` | Compression codec: "none", "snappy", "gzip", "lz4", "zstd" | "snappy" |
93+
| `row_group_size` | Number of rows per row group | 500000 |
94+
| `ipv4_bucket_size` | Prefix length for IPv4 buckets (1-32, when `network_bucket` column used) | 16 |
95+
| `ipv6_bucket_size` | Prefix length for IPv6 buckets (1-60, when `network_bucket` column used) | 16 |
96+
| `ipv6_bucket_type` | IPv6 bucket value type: "string" (hex) or "int" (first 60 bits as integer) | "string" |
9597

9698
#### MMDB Options
9799

@@ -144,14 +146,14 @@ type = "cidr" # Output type
144146

145147
**Available types:**
146148

147-
| Type | Description |
148-
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
149-
| `cidr` | CIDR notation (e.g., "203.0.113.0/24") |
150-
| `start_ip` | Starting IP address (e.g., "203.0.113.0") |
151-
| `end_ip` | Ending IP address (e.g., "203.0.113.255") |
152-
| `start_int` | Starting IP as integer |
153-
| `end_int` | Ending IP as integer |
154-
| `network_bucket` | Bucket for efficient lookups. For IPv4: integer (same as `start_int`/`end_int`). For IPv6: hex string. Requires split files (Parquet only). |
149+
| Type | Description |
150+
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
151+
| `cidr` | CIDR notation (e.g., "203.0.113.0/24") |
152+
| `start_ip` | Starting IP address (e.g., "203.0.113.0") |
153+
| `end_ip` | Ending IP address (e.g., "203.0.113.255") |
154+
| `start_int` | Starting IP as integer |
155+
| `end_int` | Ending IP as integer |
156+
| `network_bucket` | Bucket for efficient lookups. IPv4: integer. IPv6: hex string (default) or integer (with `ipv6_bucket_type = "int"`). Requires split files (Parquet only). |
155157

156158
**Default behavior:** If no `[[network.columns]]` sections are defined:
157159

docs/parquet-queries.md

Lines changed: 46 additions & 9 deletions
@@ -345,8 +345,9 @@ ipv4_file = "geoip-v4.parquet"
345345
ipv6_file = "geoip-v6.parquet"
346346

347347
[output.parquet]
348-
ipv4_bucket_size = 16 # Default: /16 prefix
349-
ipv6_bucket_size = 16 # Default: /16 prefix
348+
ipv4_bucket_size = 16 # Default: /16 prefix
349+
ipv6_bucket_size = 16 # Default: /16 prefix
350+
ipv6_bucket_type = "string" # Default: "string" (hex), or "int" (60-bit integer)
350351

351352
[[network.columns]]
352353
name = "start_int"
@@ -386,7 +387,9 @@ AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND
386387

387388
#### IPv6 Lookup
388389

389-
For IPv6, the bucket is a hex string. Use `TO_HEX()` with `NET.IP_TRUNC()`:
390+
The query depends on your `ipv6_bucket_type` configuration.
391+
392+
**Using default `ipv6_bucket_type = "string"` (hex string):**
390393

391394
```sql
392395
-- Using default ipv6_bucket_size = 16
@@ -396,6 +399,21 @@ WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16
396399
AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
397400
```
398401

402+
**Using `ipv6_bucket_type = "int"` (60-bit int64):**
403+
404+
```sql
405+
-- Using default ipv6_bucket_size = 16
406+
SELECT *
407+
FROM `project.dataset.geoip_v6`
408+
WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
409+
TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
410+
)) AS INT64)
411+
AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
412+
```
413+
414+
The int type expression extracts the first 60 bits (15 hex chars) of the
415+
truncated IPv6 address as an integer.
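
As a rough cross-check of the expression above, the following Go sketch computes the bucket two ways: by shifting the first 64 bits right by 4 (the approach this commit's `IPv6BucketToInt64` uses) and by parsing the first 15 hex characters (the approach the SQL uses). The helper names `truncate`, `bucketFromBits`, and `bucketFromHex` are hypothetical and exist only for this illustration; `truncate` merely stands in for `NET.IP_TRUNC`.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"net/netip"
	"strconv"
)

// truncate (hypothetical helper) zeroes all but the first `bits` bits of ip,
// playing the role of NET.IP_TRUNC in the SQL query.
func truncate(ip string, bits int) (netip.Addr, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return netip.Addr{}, err
	}
	prefix, err := addr.Prefix(bits)
	if err != nil {
		return netip.Addr{}, err
	}
	return prefix.Addr(), nil
}

// bucketFromBits reads the first 64 bits big-endian and drops the low 4 bits,
// leaving the top 60 bits as a non-negative int64.
func bucketFromBits(addr netip.Addr) int64 {
	b := addr.As16()
	return int64(binary.BigEndian.Uint64(b[:8]) >> 4)
}

// bucketFromHex hex-encodes the 16 address bytes and parses the first
// 15 hex characters (60 bits), mirroring SUBSTR(TO_HEX(...), 1, 15).
func bucketFromHex(addr netip.Addr) (int64, error) {
	b := addr.As16()
	return strconv.ParseInt(hex.EncodeToString(b[:])[:15], 16, 64)
}

func main() {
	addr, err := truncate("2001:db8::1", 16)
	if err != nil {
		panic(err)
	}
	fromHex, err := bucketFromHex(addr)
	if err != nil {
		panic(err)
	}
	// Both derivations should agree for any truncated IPv6 address.
	fmt.Println(bucketFromBits(addr) == fromHex) // prints: true
}
```

Because 15 hex characters cover exactly 60 bits, the parsed value can never exceed the positive `int64` range, which is why both paths agree without any sign handling.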
416+
399417
### Why Bucketing Helps
400418

401419
Without bucketing, BigQuery must scan every row to check the range condition.
@@ -422,13 +440,32 @@ Both rows have the same `start_int`/`end_int` (the full /15 range), but
422440
different `network_bucket` values (2.0.0.0 = 33554432, 2.1.0.0 = 33619968).
423441
Queries for IPs in either bucket will find the network.
424442

425-
### Why Bucket is a Hex String
443+
### IPv6 Bucket Type Options
444+
445+
IPv6 buckets can be stored as either hex strings (default) or int64 values:
446+
447+
**String type (default):**
448+
449+
- Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
450+
- Storage: 32 bytes per value
451+
452+
**Int type (`ipv6_bucket_type = "int"`):**
453+
454+
- Format: First 60 bits of the bucket address as int64
455+
- Storage: 8 bytes per value (4x smaller than string)
456+
457+
We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
458+
which simplifies BigQuery queries by avoiding two's complement handling.
459+
460+
**When to use each type:**
461+
462+
- Use **string** (default) for databases where hex string representations are
463+
simpler to work with.
464+
- Use **int** for reduced storage cost at the price of more complicated queries.
426465

427-
BigQuery cannot cluster on the `bytes` type, so we can't use the same type as we
428-
do for `start_int` and `end_int`. Using `int` to include the prefix or using
429-
`bignumeric` would be an option, but both are more complicated to query with.
430-
Another reason to use a hex string is Snowflake's `PARSE_IP()` function provides
431-
the address in this format.
466+
We do not provide a `bytes` type for the IPv6 bucket, primarily because there
467+
has not been a need for one so far. For example, BigQuery cannot cluster on
468+
`bytes`, so it would not help there.
432469

433470
## Common Query Patterns
434471

internal/config/config.go

Lines changed: 22 additions & 2 deletions
@@ -16,6 +16,11 @@ const (
1616
formatCSV = "csv"
1717
formatParquet = "parquet"
1818
formatMMDB = "mmdb"
19+
20+
// IPv6BucketTypeString stores IPv6 bucket values as hex strings.
21+
IPv6BucketTypeString = "string"
22+
// IPv6BucketTypeInt stores IPv6 bucket values as int64 (first 60 bits).
23+
IPv6BucketTypeInt = "int"
1924
)
2025

2126
// Config represents the complete configuration file structure.
@@ -51,6 +56,7 @@ type ParquetConfig struct {
5156
RowGroupSize int `toml:"row_group_size"` // Rows per row group (default: 500000)
5257
IPv4BucketSize int `toml:"ipv4_bucket_size"` // Bucket prefix length for IPv4 (default: 16)
5358
IPv6BucketSize int `toml:"ipv6_bucket_size"` // Bucket prefix length for IPv6 (default: 16)
59+
IPv6BucketType string `toml:"ipv6_bucket_type"` // "string" or "int" (default: "string")
5460
}
5561

5662
// MMDBConfig defines MMDB output options.
@@ -182,6 +188,9 @@ func applyDefaults(config *Config) {
182188
if config.Output.Parquet.IPv6BucketSize == 0 {
183189
config.Output.Parquet.IPv6BucketSize = 16
184190
}
191+
if config.Output.Parquet.IPv6BucketType == "" {
192+
config.Output.Parquet.IPv6BucketType = IPv6BucketTypeString
193+
}
185194

186195
// MMDB defaults
187196
if config.Output.Format == formatMMDB {
@@ -373,13 +382,24 @@ func validate(config *Config) error {
373382
config.Output.Parquet.IPv4BucketSize,
374383
)
375384
}
385+
// IPv6 bucket size capped at 60 to support int type (60-bit values fit in
386+
// positive int64, simplifying BigQuery queries)
376387
if config.Output.Parquet.IPv6BucketSize < 1 ||
377-
config.Output.Parquet.IPv6BucketSize > 128 {
388+
config.Output.Parquet.IPv6BucketSize > 60 {
378389
return fmt.Errorf(
379-
"ipv6_bucket_size must be between 1 and 128, got %d",
390+
"ipv6_bucket_size must be between 1 and 60, got %d",
380391
config.Output.Parquet.IPv6BucketSize,
381392
)
382393
}
394+
395+
// Validate IPv6 bucket type
396+
if config.Output.Parquet.IPv6BucketType != IPv6BucketTypeString &&
397+
config.Output.Parquet.IPv6BucketType != IPv6BucketTypeInt {
398+
return fmt.Errorf(
399+
"ipv6_bucket_type must be 'string' or 'int', got '%s'",
400+
config.Output.Parquet.IPv6BucketType,
401+
)
402+
}
383403
}
384404

385405
// Validate data columns

internal/config/config_test.go

Lines changed: 72 additions & 2 deletions
@@ -322,6 +322,50 @@ path = ["country", "iso_code"]
322322
}
323323
},
324324
},
325+
{
326+
name: "parquet with ipv6_bucket_type int",
327+
toml: `
328+
[output]
329+
format = "parquet"
330+
ipv4_file = "output-v4.parquet"
331+
ipv6_file = "output-v6.parquet"
332+
333+
[output.parquet]
334+
ipv6_bucket_type = "int"
335+
ipv6_bucket_size = 48
336+
337+
[[network.columns]]
338+
name = "start_int"
339+
type = "start_int"
340+
341+
[[network.columns]]
342+
name = "network_bucket"
343+
type = "network_bucket"
344+
345+
[[databases]]
346+
name = "geo"
347+
path = "/path/to/geo.mmdb"
348+
349+
[[columns]]
350+
name = "country"
351+
database = "geo"
352+
path = ["country", "iso_code"]
353+
`,
354+
validate: func(t *testing.T, cfg *Config) {
355+
if cfg.Output.Parquet.IPv6BucketType != IPv6BucketTypeInt {
356+
t.Errorf(
357+
"expected IPv6BucketType=int, got %s",
358+
cfg.Output.Parquet.IPv6BucketType,
359+
)
360+
}
361+
if cfg.Output.Parquet.IPv6BucketSize != 48 {
362+
t.Errorf(
363+
"expected IPv6BucketSize=48, got %d",
364+
cfg.Output.Parquet.IPv6BucketSize,
365+
)
366+
}
367+
},
368+
},
325369
}
326370

327371
for _, tt := range tests {
@@ -818,7 +862,33 @@ ipv4_file = "output-v4.parquet"
818862
ipv6_file = "output-v6.parquet"
819863
820864
[output.parquet]
821-
ipv6_bucket_size = 129
865+
ipv6_bucket_size = 61
866+
867+
[[network.columns]]
868+
name = "network_bucket"
869+
type = "network_bucket"
870+
871+
[[databases]]
872+
name = "geo"
873+
path = "/path/to/geo.mmdb"
874+
875+
[[columns]]
876+
name = "country"
877+
database = "geo"
878+
path = ["country", "iso_code"]
879+
`,
880+
expectError: "ipv6_bucket_size must be between 1 and 60",
881+
},
882+
{
883+
name: "invalid ipv6_bucket_type",
884+
toml: `
885+
[output]
886+
format = "parquet"
887+
ipv4_file = "output-v4.parquet"
888+
ipv6_file = "output-v6.parquet"
889+
890+
[output.parquet]
891+
ipv6_bucket_type = "invalid"
822892
823893
[[network.columns]]
824894
name = "network_bucket"
@@ -833,7 +903,7 @@ name = "country"
833903
database = "geo"
834904
path = ["country", "iso_code"]
835905
`,
836-
expectError: "ipv6_bucket_size must be between 1 and 128",
906+
expectError: "ipv6_bucket_type must be 'string' or 'int'",
837907
},
838908
}
839909

internal/network/utils.go

Lines changed: 31 additions & 0 deletions
@@ -3,6 +3,7 @@ package network
33

44
import (
55
"encoding/binary"
6+
"errors"
67
"fmt"
78
"net/netip"
89

@@ -18,6 +19,36 @@ func IPv4ToUint32(addr netip.Addr) uint32 {
1819
return binary.BigEndian.Uint32(bytes[:])
1920
}
2021

22+
// IPv6BucketToInt64 converts the first 60 bits of an IPv6 address to int64.
23+
//
24+
// This is used for IPv6 bucket values where the address has been masked to
25+
// the bucket boundary (trailing bits are zero).
26+
//
27+
// NOTE: The address must already be masked to the appropriate bucket (i.e.,
28+
// if you have a bucket size of /16, you must provide 2001:: as opposed to
29+
// something like 2001:abcd::).
30+
//
31+
// We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
32+
// which simplifies BigQuery queries (no two's complement handling needed).
33+
//
34+
// In BigQuery, you can compute the same value using:
35+
//
36+
// CAST(CONCAT('0x', SUBSTR(
37+
// TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING(ip), bucket_size)), 1, 15
38+
// )) AS INT64)
39+
//
40+
// where bucket_size is the prefix length used for bucketing.
41+
func IPv6BucketToInt64(addr netip.Addr) (int64, error) {
42+
if !addr.Is6() {
43+
return 0, errors.New("IPv6BucketToInt64 called with non-IPv6 address")
44+
}
45+
bytes := addr.As16()
46+
// Read first 64 bits, then right-shift by 4 to get top 60 bits
47+
val := binary.BigEndian.Uint64(bytes[:8])
48+
//nolint:gosec // 60-bit value always fits in positive int64
49+
return int64(val >> 4), nil
50+
}
51+
2152
// IsAdjacent checks if two IP addresses are consecutive (no gap between them).
2253
func IsAdjacent(endIP, startIP netip.Addr) bool {
2354
if endIP.Is4() != startIP.Is4() {
