Improve docs

horgh · horgh · commit e498ee4ceb41 · 2025-12-24T20:04:52.000Z
diff --git a/README.md b/README.md
@@ -423,8 +423,6 @@ split your output into separate IPv4/IPv6 files via `output.ipv4_file` and
 `output.ipv6_file`. For single-file outputs that include IPv6 data, use string
 columns (`start_ip`, `end_ip`, `cidr`).
 
-**Note:** `network_bucket` is supported for CSV and Parquet output.
-
 ### Network Bucketing for Analytics (BigQuery, etc.)
 
 When loading network data into analytics platforms like BigQuery, range queries
@@ -457,10 +455,13 @@ name = "network_bucket"
 type = "network_bucket"
 ```
 
-For IPv4, the bucket is an integer. For IPv6, the bucket is either a hex string
-(default) or an integer when `ipv6_bucket_type = "int"` is configured. Using
-`network_bucket` requires split output files. See
-[docs/parquet-queries.md](docs/parquet-queries.md) for BigQuery query examples.
+For IPv4, the bucket is a 32-bit integer. For IPv6, the bucket is either a hex
+string (default) or a 60-bit integer when `ipv6_bucket_type = "int"` is
+configured.
+
+Using `network_bucket` requires split output files.
+
+See [docs/bigquery.md](docs/bigquery.md) for BigQuery query examples.
 
 **Note:** When a network is larger than the bucket size (e.g., a /15 with /16
 buckets), the row is duplicated for each bucket it spans. This ensures queries
@@ -576,6 +577,7 @@ This ensures accurate IP lookups with no ambiguity.
 
 - [Configuration Reference](docs/config.md) - Complete config file documentation
 - [Parquet Query Guide](docs/parquet-queries.md) - Optimizing IP lookup queries
+- [BigQuery Guide](docs/bigquery.md) - Network bucketing for BigQuery
 
 ## Requirements
 
diff --git a/docs/bigquery.md b/docs/bigquery.md
@@ -0,0 +1,69 @@
+# BigQuery with Network Bucketing
+
+BigQuery performs full table scans for range queries like
+`WHERE start_int <= ip AND end_int >= ip`. Use a `network_bucket` column to
+enable efficient lookups.
+
+**Note:** The BigQuery table must be clustered on the `network_bucket` column
+for efficient querying.
+
+**Important:** The bucket size in your queries must match the configured bucket
+size. The examples below use the default `/16` bucket size. If you configured a
+different `ipv4_bucket_size` or `ipv6_bucket_size`, adjust the second argument
+to `NET.IP_TRUNC()` accordingly.
+
+## IPv4 Lookup
+
+For IPv4, the bucket is an int64. Use `NET.IP_TRUNC()` to mask the address to
+the bucket boundary and `NET.IPV4_TO_INT64()` to convert it to an integer:
+
+```sql
+-- Using default ipv4_bucket_size = 16
+SELECT *
+FROM `project.dataset.geoip_v4`
+WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
+AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
+```
+
+## IPv6 Lookup
+
+The query depends on your `ipv6_bucket_type` configuration.
+
+**Note:** For IPv6 files, `start_int` and `end_int` columns are stored as
+16-byte binary values, not integers. The comparison with `NET.IP_FROM_STRING()`
+works because it also returns BYTES.
+
+**Using default `ipv6_bucket_type = "string"` (hex string):**
+
+```sql
+-- Using default ipv6_bucket_size = 16
+SELECT *
+FROM `project.dataset.geoip_v6`
+WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16))
+AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
+```
+
+**Using `ipv6_bucket_type = "int"` (60-bit int64):**
+
+```sql
+-- Using default ipv6_bucket_size = 16
+SELECT *
+FROM `project.dataset.geoip_v6`
+WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
+    TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
+  )) AS INT64)
+AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
+```
+
+The int type expression extracts the first 60 bits (15 hex chars) of the
+truncated IPv6 address as an integer.
+
+## Why Bucketing Helps
+
+Without bucketing, BigQuery must scan every row to check the range condition.
+With bucketing:
+
+1. BigQuery first filters by exact match on `network_bucket`
+2. Only matching bucket rows are checked for the range condition
+3. Result: Query scans only rows in the matching bucket instead of the entire
+   table
diff --git a/docs/config.md b/docs/config.md
@@ -144,6 +144,33 @@ ipv6_file = "merged_ipv6.parquet"
 
 When splitting output, both `ipv4_file` and `ipv6_file` must be configured.
 
+#### IPv6 Bucket Type Options
+
+IPv6 buckets can be stored as either hex strings (default) or int64 values:
+
+**String type (default):**
+
+- Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
+- Storage: 32 bytes per value
+
+**Int type (`ipv6_bucket_type = "int"`):**
+
+- Format: First 60 bits of the bucket address as int64
+- Storage: 8 bytes per value (4x smaller than string)
+
+We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
+which simplifies queries by avoiding two's complement handling.
+
+**When to use each type:**
+
+- Use **string** (default) for databases where hex string representations are
+  simpler to work with.
+- Use **int** for reduced storage cost at the price of more complicated queries.
+
+We do not provide a `bytes` type for the IPv6 bucket, primarily because there
+has not been a need so far. For example, BigQuery cannot cluster on `BYTES`
+columns, so a `bytes` bucket would not help there.
+
 ### Network Columns
 
 Network columns define how IP network information is output. These columns
diff --git a/docs/parquet-queries.md b/docs/parquet-queries.md
@@ -330,143 +330,6 @@ ipv4_file = "geo_ipv4.parquet"
 ipv6_file = "geo_ipv6.parquet"
 ```
 
-## BigQuery with Network Bucketing
-
-BigQuery performs full table scans for range queries like
-`WHERE start_int <= ip AND end_int >= ip`. Use the `network_bucket` column to
-enable efficient lookups.
-
-### Configuration
-
-```toml
-[output]
-format = "parquet"
-ipv4_file = "geoip-v4.parquet"
-ipv6_file = "geoip-v6.parquet"
-
-[output.parquet]
-ipv4_bucket_size = 16    # Default: /16 prefix
-ipv6_bucket_size = 16    # Default: /16 prefix
-ipv6_bucket_type = "string"  # Default: "string" (hex), or "int" (60-bit integer)
-
-[[network.columns]]
-name = "start_int"
-type = "start_int"
-
-[[network.columns]]
-name = "end_int"
-type = "end_int"
-
-[[network.columns]]
-name = "network_bucket"
-type = "network_bucket"
-```
-
-### BigQuery Query Patterns
-
-**Note:** The BigQuery table must be clustered on the `network_bucket` column
-for efficient querying.
-
-**Important:** The bucket size in your queries must match the configured bucket
-size. The examples below use the default `/16` bucket size. If you configured a
-different `ipv4_bucket_size` or `ipv6_bucket_size`, adjust the second argument
-to `NET.IP_TRUNC()` accordingly.
-
-#### IPv4 Lookup
-
-For IPv4, the bucket is int64. Use `NET.IP_TRUNC()` to get the bucket and
-`NET.IPV4_TO_INT64()` to convert to the integer type:
-
-```sql
--- Using default ipv4_bucket_size = 16
-SELECT *
-FROM `project.dataset.geoip_v4`
-WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
-AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
-```
-
-#### IPv6 Lookup
-
-The query depends on your `ipv6_bucket_type` configuration.
-
-**Using default `ipv6_bucket_type = "string"` (hex string):**
-
-```sql
--- Using default ipv6_bucket_size = 16
-SELECT *
-FROM `project.dataset.geoip_v6`
-WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16))
-AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
-```
-
-**Using `ipv6_bucket_type = "int"` (60-bit int64):**
-
-```sql
--- Using default ipv6_bucket_size = 16
-SELECT *
-FROM `project.dataset.geoip_v6`
-WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
-    TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
-  )) AS INT64)
-AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
-```
-
-The int type expression extracts the first 60 bits (15 hex chars) of the
-truncated IPv6 address as an integer.
-
-### Why Bucketing Helps
-
-Without bucketing, BigQuery must scan every row to check the range condition.
-With bucketing:
-
-1. BigQuery first filters by exact match on `network_bucket`
-2. Only matching bucket rows are checked for the range condition
-3. Result: Query scans only rows in the matching bucket instead of the entire
-   table
-
-### Row Duplication
-
-Networks larger than the bucket size are duplicated. For example, a /15 network
-spans two /16 buckets:
-
-**IPv4 example (int64 bucket values):**
-
-| network    | start_int | end_int  | network_bucket |
-| ---------- | --------- | -------- | -------------- |
-| 2.0.0.0/15 | 33554432  | 33685503 | 33554432       |
-| 2.0.0.0/15 | 33554432  | 33685503 | 33619968       |
-
-Both rows have the same `start_int`/`end_int` (the full /15 range), but
-different `network_bucket` values (2.0.0.0 = 33554432, 2.1.0.0 = 33619968).
-Queries for IPs in either bucket will find the network.
-
-### IPv6 Bucket Type Options
-
-IPv6 buckets can be stored as either hex strings (default) or int64 values:
-
-**String type (default):**
-
-- Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
-- Storage: 32 bytes per value
-
-**Int type (`ipv6_bucket_type = "int"`):**
-
-- Format: First 60 bits of the bucket address as int64
-- Storage: 8 bytes per value (4x smaller than string)
-
-We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
-which simplifies BigQuery queries by avoiding two's complement handling.
-
-**When to use each type:**
-
-- Use **string** (default) for databases where hex string representations are
-  simpler to work with.
-- Use **int** for reduced storage cost at the price of more complicated queries.
-
-We do not provide a `bytes` type for the IPv6 bucket. Primarily this is because
-there so far has not been a need. For example, BigQuery cannot cluster on
-`bytes`, so it is not helpful there.
-
 ## Common Query Patterns
 
 ### Single IP Lookup
diff --git a/internal/network/utils.go b/internal/network/utils.go
@@ -21,15 +21,15 @@ func IPv4ToUint32(addr netip.Addr) uint32 {
 
 // IPv6BucketToInt64 converts the first 60 bits of an IPv6 address to int64.
 //
-// This is used for IPv6 bucket values where the address has been masked to
-// the bucket boundary (trailing bits are zero).
-//
-// NOTE: The address must already be masked to the appropriate bucket (i.e.,
-// if you have a bucket size of /16, you must provide 2001:: as opposed to
+// NOTE: The address must already be masked to the appropriate bucket (e.g., if
+// you have a bucket size of /16, you must provide 2001:: as opposed to
 // something like 2001:abcd::).
 //
-// We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
-// which simplifies BigQuery queries (no two's complement handling needed).
+// We use 60 bits (not 64) because 60-bit values always fit in a positive
+// int64, which simplifies queries (no two's complement handling needed).
+//
+// We use 60 bits in particular as that is what 15 hex characters provide.
+// This is already more bits than we'd typically need.
 //
 // In BigQuery, you can compute the same value using:
 //