Commit e498ee4

Improve docs
1 parent 9ae7146 commit e498ee4

5 files changed: +111 -150 lines changed

README.md

Lines changed: 8 additions & 6 deletions
@@ -423,8 +423,6 @@ split your output into separate IPv4/IPv6 files via `output.ipv4_file` and
423423
`output.ipv6_file`. For single-file outputs that include IPv6 data, use string
424424
columns (`start_ip`, `end_ip`, `cidr`).
425425

426-
**Note:** `network_bucket` is supported for CSV and Parquet output.
427-
428426
### Network Bucketing for Analytics (BigQuery, etc.)
429427

430428
When loading network data into analytics platforms like BigQuery, range queries
@@ -457,10 +455,13 @@ name = "network_bucket"
457455
type = "network_bucket"
458456
```
459457

460-
For IPv4, the bucket is an integer. For IPv6, the bucket is either a hex string
461-
(default) or an integer when `ipv6_bucket_type = "int"` is configured. Using
462-
`network_bucket` requires split output files. See
463-
[docs/parquet-queries.md](docs/parquet-queries.md) for BigQuery query examples.
458+
For IPv4, the bucket is a 32-bit integer. For IPv6, the bucket is either a hex
459+
string (default) or a 60-bit integer when `ipv6_bucket_type = "int"` is
460+
configured.
461+
462+
Using `network_bucket` requires split output files.
463+
464+
See [docs/bigquery.md](docs/bigquery.md) for BigQuery query examples.
464465

465466
**Note:** When a network is larger than the bucket size (e.g., a /15 with /16
466467
buckets), the row is duplicated for each bucket it spans. This ensures queries
@@ -576,6 +577,7 @@ This ensures accurate IP lookups with no ambiguity.
576577

577578
- [Configuration Reference](docs/config.md) - Complete config file documentation
578579
- [Parquet Query Guide](docs/parquet-queries.md) - Optimizing IP lookup queries
580+
- [BigQuery Guide](docs/bigquery.md) - Network bucketing for BigQuery
579581

580582
## Requirements
581583

docs/bigquery.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
# BigQuery with Network Bucketing
2+
3+
BigQuery performs full table scans for range queries like
4+
`WHERE start_int <= ip AND end_int >= ip`. Use a `network_bucket` column to
5+
enable efficient lookups.
6+
7+
**Note:** The BigQuery table must be clustered on the `network_bucket` column
8+
for efficient querying.
9+
10+
**Important:** The bucket size in your queries must match the configured bucket
11+
size. The examples below use the default `/16` bucket size. If you configured a
12+
different `ipv4_bucket_size` or `ipv6_bucket_size`, adjust the second argument
13+
to `NET.IP_TRUNC()` accordingly.
14+
15+
## IPv4 Lookup
16+
17+
For IPv4, the bucket is int64. Use `NET.IP_TRUNC()` to get the bucket and
18+
`NET.IPV4_TO_INT64()` to convert to the integer type:
19+
20+
```sql
21+
-- Using default ipv4_bucket_size = 16
22+
SELECT *
23+
FROM `project.dataset.geoip_v4`
24+
WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
25+
AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
26+
```
27+
28+
## IPv6 Lookup
29+
30+
The query depends on your `ipv6_bucket_type` configuration.
31+
32+
**Note:** For IPv6 files, `start_int` and `end_int` columns are stored as
33+
16-byte binary values, not integers. The comparison with `NET.IP_FROM_STRING()`
34+
works because it also returns BYTES.
35+
36+
**Using default `ipv6_bucket_type = "string"` (hex string):**
37+
38+
```sql
39+
-- Using default ipv6_bucket_size = 16
40+
SELECT *
41+
FROM `project.dataset.geoip_v6`
42+
WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16))
43+
AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
44+
```
45+
46+
**Using `ipv6_bucket_type = "int"` (60-bit int64):**
47+
48+
```sql
49+
-- Using default ipv6_bucket_size = 16
50+
SELECT *
51+
FROM `project.dataset.geoip_v6`
52+
WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
53+
TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
54+
)) AS INT64)
55+
AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
56+
```
57+
58+
The int type expression extracts the first 60 bits (15 hex chars) of the
59+
truncated IPv6 address as an integer.
60+
61+
## Why Bucketing Helps
62+
63+
Without bucketing, BigQuery must scan every row to check the range condition.
64+
With bucketing:
65+
66+
1. BigQuery first filters by exact match on `network_bucket`
67+
2. Only matching bucket rows are checked for the range condition
68+
3. Result: Query scans only rows in the matching bucket instead of the entire
69+
table
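
The SQL expressions above can be mirrored client-side, for example to check bucket values before loading data. The following is a minimal Go sketch, not the tool's own implementation: the function names (`truncate`, `ipv4Bucket`, `ipv6BucketHex`, `ipv6BucketInt`) are hypothetical, and it assumes the bucket is simply the address truncated to the configured prefix length, as the queries imply.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"net/netip"
)

// truncate parses ip and zeroes every bit after the first prefixLen bits,
// playing the role of NET.IP_TRUNC(NET.IP_FROM_STRING(ip), prefixLen).
func truncate(ip string, prefixLen int) (netip.Addr, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return netip.Addr{}, err
	}
	p, err := addr.Prefix(prefixLen)
	if err != nil {
		return netip.Addr{}, err
	}
	return p.Addr(), nil
}

// ipv4Bucket returns the bucket as an int64 (NET.IPV4_TO_INT64 in the query).
func ipv4Bucket(ip string, prefixLen int) (int64, error) {
	addr, err := truncate(ip, prefixLen)
	if err != nil {
		return 0, err
	}
	if !addr.Is4() {
		return 0, fmt.Errorf("%s is not an IPv4 address", ip)
	}
	b := addr.As4()
	return int64(binary.BigEndian.Uint32(b[:])), nil
}

// ipv6BucketHex returns the 32-character hex bucket (TO_HEX in the query).
func ipv6BucketHex(ip string, prefixLen int) (string, error) {
	addr, err := truncate(ip, prefixLen)
	if err != nil {
		return "", err
	}
	if !addr.Is6() {
		return "", fmt.Errorf("%s is not an IPv6 address", ip)
	}
	b := addr.As16()
	return hex.EncodeToString(b[:]), nil
}

// ipv6BucketInt returns the first 60 bits of the bucket as an int64, i.e. the
// first 15 hex characters kept by the SUBSTR/CAST expression.
func ipv6BucketInt(ip string, prefixLen int) (int64, error) {
	addr, err := truncate(ip, prefixLen)
	if err != nil {
		return 0, err
	}
	if !addr.Is6() {
		return 0, fmt.Errorf("%s is not an IPv6 address", ip)
	}
	b := addr.As16()
	// The first 8 bytes hold 64 bits; dropping the low 4 leaves the top 60.
	return int64(binary.BigEndian.Uint64(b[:8]) >> 4), nil
}

func main() {
	v4, _ := ipv4Bucket("203.0.113.100", 16)
	hexBucket, _ := ipv6BucketHex("2001:db8::1", 16)
	intBucket, _ := ipv6BucketInt("2001:db8::1", 16)
	fmt.Println(v4, hexBucket, intBucket)
}
```

With the default `/16` bucket size, `203.0.113.100` falls in the bucket for `203.0.0.0` (3405774848) and `2001:db8::1` in the bucket for `2001::` (`20010000000000000000000000000000` as a hex string).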

docs/config.md

Lines changed: 27 additions & 0 deletions
@@ -144,6 +144,33 @@ ipv6_file = "merged_ipv6.parquet"
144144

145145
When splitting output, both `ipv4_file` and `ipv6_file` must be configured.
146146

147+
#### IPv6 Bucket Type Options
148+
149+
IPv6 buckets can be stored as either hex strings (default) or int64 values:
150+
151+
**String type (default):**
152+
153+
- Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
154+
- Storage: 32 bytes per value
155+
156+
**Int type (`ipv6_bucket_type = "int"`):**
157+
158+
- Format: First 60 bits of the bucket address as int64
159+
- Storage: 8 bytes per value (4x smaller than string)
160+
161+
We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
162+
which simplifies queries by avoiding two's complement handling.
163+
164+
**When to use each type:**
165+
166+
- Use **string** (default) for databases where hex string representations are
167+
simpler to work with.
168+
- Use **int** for reduced storage cost at the price of more complicated queries.
169+
170+
We do not provide a `bytes` type for the IPv6 bucket, primarily because there
171+
has been no need for one so far. For example, BigQuery cannot cluster on
172+
`bytes` columns, so such a type would not help there.
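
The reason for stopping at 60 bits can be checked with a short Go sketch. This is an illustration only, with hypothetical helper names (`leading64`, `first64`, `first60`). It reinterprets the leading bits of an IPv6 address as an int64 and shows that the 64-bit form can go negative while the 60-bit form cannot.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// leading64 returns the first 64 bits of an IPv6 address as an unsigned value.
func leading64(ip string) (uint64, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return 0, err
	}
	if !addr.Is6() {
		return 0, fmt.Errorf("%s is not an IPv6 address", ip)
	}
	b := addr.As16()
	return binary.BigEndian.Uint64(b[:8]), nil
}

// first64 reinterprets the leading 64 bits as an int64. Any address whose top
// bit is set (for example ff00::) wraps to a negative number.
func first64(ip string) (int64, error) {
	u, err := leading64(ip)
	return int64(u), err
}

// first60 keeps only the leading 60 bits (15 hex characters). The top 4 bits
// of the result are always zero, so the value is never negative.
func first60(ip string) (int64, error) {
	u, err := leading64(ip)
	return int64(u >> 4), err
}

func main() {
	wide, _ := first64("ff00::")
	narrow, _ := first60("ff00::")
	fmt.Println(wide < 0, narrow >= 0)
}
```

Because the 60-bit value is never negative, it can be compared directly with the result of casting a 15-character hex string to INT64, with no sign handling on either side.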
173+
147174
### Network Columns
148175

149176
Network columns define how IP network information is output. These columns

docs/parquet-queries.md

Lines changed: 0 additions & 137 deletions
@@ -330,143 +330,6 @@ ipv4_file = "geo_ipv4.parquet"
330330
ipv6_file = "geo_ipv6.parquet"
331331
```
332332

333-
## BigQuery with Network Bucketing
334-
335-
BigQuery performs full table scans for range queries like
336-
`WHERE start_int <= ip AND end_int >= ip`. Use the `network_bucket` column to
337-
enable efficient lookups.
338-
339-
### Configuration
340-
341-
```toml
342-
[output]
343-
format = "parquet"
344-
ipv4_file = "geoip-v4.parquet"
345-
ipv6_file = "geoip-v6.parquet"
346-
347-
[output.parquet]
348-
ipv4_bucket_size = 16 # Default: /16 prefix
349-
ipv6_bucket_size = 16 # Default: /16 prefix
350-
ipv6_bucket_type = "string" # Default: "string" (hex), or "int" (60-bit integer)
351-
352-
[[network.columns]]
353-
name = "start_int"
354-
type = "start_int"
355-
356-
[[network.columns]]
357-
name = "end_int"
358-
type = "end_int"
359-
360-
[[network.columns]]
361-
name = "network_bucket"
362-
type = "network_bucket"
363-
```
364-
365-
### BigQuery Query Patterns
366-
367-
**Note:** The BigQuery table must be clustered on the `network_bucket` column
368-
for efficient querying.
369-
370-
**Important:** The bucket size in your queries must match the configured bucket
371-
size. The examples below use the default `/16` bucket size. If you configured a
372-
different `ipv4_bucket_size` or `ipv6_bucket_size`, adjust the second argument
373-
to `NET.IP_TRUNC()` accordingly.
374-
375-
#### IPv4 Lookup
376-
377-
For IPv4, the bucket is int64. Use `NET.IP_TRUNC()` to get the bucket and
378-
`NET.IPV4_TO_INT64()` to convert to the integer type:
379-
380-
```sql
381-
-- Using default ipv4_bucket_size = 16
382-
SELECT *
383-
FROM `project.dataset.geoip_v4`
384-
WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
385-
AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
386-
```
387-
388-
#### IPv6 Lookup
389-
390-
The query depends on your `ipv6_bucket_type` configuration.
391-
392-
**Using default `ipv6_bucket_type = "string"` (hex string):**
393-
394-
```sql
395-
-- Using default ipv6_bucket_size = 16
396-
SELECT *
397-
FROM `project.dataset.geoip_v6`
398-
WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16))
399-
AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
400-
```
401-
402-
**Using `ipv6_bucket_type = "int"` (60-bit int64):**
403-
404-
```sql
405-
-- Using default ipv6_bucket_size = 16
406-
SELECT *
407-
FROM `project.dataset.geoip_v6`
408-
WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
409-
TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
410-
)) AS INT64)
411-
AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
412-
```
413-
414-
The int type expression extracts the first 60 bits (15 hex chars) of the
415-
truncated IPv6 address as an integer.
416-
417-
### Why Bucketing Helps
418-
419-
Without bucketing, BigQuery must scan every row to check the range condition.
420-
With bucketing:
421-
422-
1. BigQuery first filters by exact match on `network_bucket`
423-
2. Only matching bucket rows are checked for the range condition
424-
3. Result: Query scans only rows in the matching bucket instead of the entire
425-
table
426-
427-
### Row Duplication
428-
429-
Networks larger than the bucket size are duplicated. For example, a /15 network
430-
spans two /16 buckets:
431-
432-
**IPv4 example (int64 bucket values):**
433-
434-
| network | start_int | end_int | network_bucket |
435-
| ---------- | --------- | -------- | -------------- |
436-
| 2.0.0.0/15 | 33554432 | 33685503 | 33554432 |
437-
| 2.0.0.0/15 | 33554432 | 33685503 | 33619968 |
438-
439-
Both rows have the same `start_int`/`end_int` (the full /15 range), but
440-
different `network_bucket` values (2.0.0.0 = 33554432, 2.1.0.0 = 33619968).
441-
Queries for IPs in either bucket will find the network.
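
The duplication rule in this example can be sketched in Go as follows. This is a hedged illustration for IPv4 only, not the project's code: `ipv4Buckets` is a hypothetical name, and it assumes one output row per bucket that the network overlaps.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// ipv4Buckets lists the bucket values for an IPv4 network, one per row that
// would be written. Networks at least as specific as the bucket size get one
// bucket; larger networks get one bucket per bucket-sized block they span.
func ipv4Buckets(cidr string, bucketBits int) ([]int64, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	if !prefix.Addr().Is4() {
		return nil, fmt.Errorf("%s is not an IPv4 network", cidr)
	}
	if bucketBits < 0 || bucketBits > 32 {
		return nil, fmt.Errorf("invalid bucket size /%d", bucketBits)
	}

	// Masked() zeroes the host bits so start is the first address in the network.
	b := prefix.Masked().Addr().As4()
	start := int64(binary.BigEndian.Uint32(b[:]))
	step := int64(1) << (32 - bucketBits) // addresses per bucket

	if prefix.Bits() >= bucketBits {
		// The network fits inside one bucket: round down to its boundary.
		return []int64{start - start%step}, nil
	}

	// The network spans 2^(bucketBits - prefixBits) consecutive buckets.
	count := int64(1) << (bucketBits - prefix.Bits())
	buckets := make([]int64, 0, count)
	for i := int64(0); i < count; i++ {
		buckets = append(buckets, start+i*step)
	}
	return buckets, nil
}

func main() {
	buckets, err := ipv4Buckets("2.0.0.0/15", 16)
	if err != nil {
		panic(err)
	}
	fmt.Println(buckets) // [33554432 33619968]
}
```

The two values match the `network_bucket` column in the table above. A very large network with a small bucket size produces many rows, so the sketch is only practical for realistic prefix and bucket combinations.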
442-
443-
### IPv6 Bucket Type Options
444-
445-
IPv6 buckets can be stored as either hex strings (default) or int64 values:
446-
447-
**String type (default):**
448-
449-
- Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
450-
- Storage: 32 bytes per value
451-
452-
**Int type (`ipv6_bucket_type = "int"`):**
453-
454-
- Format: First 60 bits of the bucket address as int64
455-
- Storage: 8 bytes per value (4x smaller than string)
456-
457-
We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
458-
which simplifies BigQuery queries by avoiding two's complement handling.
459-
460-
**When to use each type:**
461-
462-
- Use **string** (default) for databases where hex string representations are
463-
simpler to work with.
464-
- Use **int** for reduced storage cost at the price of more complicated queries.
465-
466-
We do not provide a `bytes` type for the IPv6 bucket. Primarily this is because
467-
there so far has not been a need. For example, BigQuery cannot cluster on
468-
`bytes`, so it is not helpful there.
469-
470333
## Common Query Patterns
471334

472335
### Single IP Lookup

internal/network/utils.go

Lines changed: 7 additions & 7 deletions
@@ -21,15 +21,15 @@ func IPv4ToUint32(addr netip.Addr) uint32 {
2121

2222
// IPv6BucketToInt64 converts the first 60 bits of an IPv6 address to int64.
2323
//
24-
// This is used for IPv6 bucket values where the address has been masked to
25-
// the bucket boundary (trailing bits are zero).
26-
//
27-
// NOTE: The address must already be masked to the appropriate bucket (i.e.,
28-
// if you have a bucket size of /16, you must provide 2001:: as opposed to
24+
// NOTE: The address must already be masked to the appropriate bucket (i.e., if
25+
// you have a bucket size of /16, you must provide 2001:: as opposed to
2926
// something like 2001:abcd::).
3027
//
31-
// We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
32-
// which simplifies BigQuery queries (no two's complement handling needed).
28+
// We use 60 bits (not 64) because 60-bit values always fit in a positive
29+
// int64, which simplifies queries (no two's complement handling needed).
30+
//
31+
// We use 60 bits in particular as that is what 15 hex characters provide.
32+
// This is already more bits than we'd typically need.
3333
//
3434
// In BigQuery, you can compute the same value using:
3535
//
