@@ -330,143 +330,6 @@ ipv4_file = "geo_ipv4.parquet"
ipv6_file = "geo_ipv6.parquet"
```

- ## BigQuery with Network Bucketing
-
- BigQuery performs full table scans for range queries like
- `WHERE start_int <= ip AND end_int >= ip`. Use the `network_bucket` column to
- enable efficient lookups.
-
- ### Configuration
-
- ```toml
- [output]
- format = "parquet"
- ipv4_file = "geoip-v4.parquet"
- ipv6_file = "geoip-v6.parquet"
-
- [output.parquet]
- ipv4_bucket_size = 16        # Default: /16 prefix
- ipv6_bucket_size = 16        # Default: /16 prefix
- ipv6_bucket_type = "string"  # Default: "string" (hex), or "int" (60-bit integer)
-
- [[network.columns]]
- name = "start_int"
- type = "start_int"
-
- [[network.columns]]
- name = "end_int"
- type = "end_int"
-
- [[network.columns]]
- name = "network_bucket"
- type = "network_bucket"
- ```
-
- ### BigQuery Query Patterns
-
- **Note:** The BigQuery table must be clustered on the `network_bucket` column
- for efficient querying.
-
- **Important:** The bucket size in your queries must match the configured bucket
- size. The examples below use the default `/16` bucket size. If you configured a
- different `ipv4_bucket_size` or `ipv6_bucket_size`, adjust the second argument
- to `NET.IP_TRUNC()` accordingly.
-
- #### IPv4 Lookup
-
- For IPv4, the bucket is int64. Use `NET.IP_TRUNC()` to get the bucket and
- `NET.IPV4_TO_INT64()` to convert to the integer type:
-
- ```sql
- -- Using default ipv4_bucket_size = 16
- SELECT *
- FROM `project.dataset.geoip_v4`
- WHERE network_bucket = NET.IPV4_TO_INT64(NET.IP_TRUNC(NET.IP_FROM_STRING('203.0.113.100'), 16))
-   AND NET.IPV4_TO_INT64(NET.IP_FROM_STRING('203.0.113.100')) BETWEEN start_int AND end_int;
- ```
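To sanity-check the bucket value that the IPv4 query computes, the same arithmetic can be reproduced outside BigQuery. The following is a minimal Python sketch using only the standard `ipaddress` module; the helper name `ipv4_bucket` is hypothetical and not part of the tool.

```python
import ipaddress


def ipv4_bucket(ip: str, bucket_size: int = 16) -> int:
    """Return the integer bucket for an IPv4 address (hypothetical helper)."""
    # strict=False clears the host bits, mirroring NET.IP_TRUNC(ip, bucket_size).
    network = ipaddress.ip_network(f"{ip}/{bucket_size}", strict=False)
    # The truncated address as an integer, mirroring NET.IPV4_TO_INT64().
    return int(network.network_address)


ip = "203.0.113.100"
print(ipv4_bucket(ip))                 # integer form of 203.0.0.0
print(int(ipaddress.IPv4Address(ip)))  # value compared against start_int/end_int
```

If a different `ipv4_bucket_size` is configured, pass it as the second argument, just as with `NET.IP_TRUNC()`.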
-
- #### IPv6 Lookup
-
- The query depends on your `ipv6_bucket_type` configuration.
-
- **Using default `ipv6_bucket_type = "string"` (hex string):**
-
- ```sql
- -- Using default ipv6_bucket_size = 16
- SELECT *
- FROM `project.dataset.geoip_v6`
- WHERE network_bucket = TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16))
-   AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
- ```
-
- **Using `ipv6_bucket_type = "int"` (60-bit int64):**
-
- ```sql
- -- Using default ipv6_bucket_size = 16
- SELECT *
- FROM `project.dataset.geoip_v6`
- WHERE network_bucket = CAST(CONCAT('0x', SUBSTR(
-     TO_HEX(NET.IP_TRUNC(NET.IP_FROM_STRING('2001:db8::1'), 16)), 1, 15
-   )) AS INT64)
-   AND NET.IP_FROM_STRING('2001:db8::1') BETWEEN start_int AND end_int;
- ```
-
- The int type expression extracts the first 60 bits (15 hex chars) of the
- truncated IPv6 address as an integer.
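The two IPv6 bucket encodings can be reproduced in Python to check what a query should produce. This is a sketch under the assumption that the string bucket is the lowercase hex of the truncated 16-byte address, as `TO_HEX()` returns; both function names are hypothetical.

```python
import ipaddress


def ipv6_bucket_string(ip: str, bucket_size: int = 16) -> str:
    """32-character hex string of the truncated address (hypothetical helper)."""
    # strict=False clears the host bits, mirroring NET.IP_TRUNC().
    network = ipaddress.ip_network(f"{ip}/{bucket_size}", strict=False)
    # packed is the 16-byte big-endian form; hex() gives 32 lowercase hex chars.
    return network.network_address.packed.hex()


def ipv6_bucket_int(ip: str, bucket_size: int = 16) -> int:
    """First 60 bits (15 hex chars) of the truncated address as an integer."""
    return int(ipv6_bucket_string(ip, bucket_size)[:15], 16)


print(ipv6_bucket_string("2001:db8::1"))
print(ipv6_bucket_int("2001:db8::1"))
```

Taking the first 15 hex characters is equivalent to shifting the 128-bit address right by 68 bits.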
-
- ### Why Bucketing Helps
-
- Without bucketing, BigQuery must scan every row to check the range condition.
- With bucketing:
-
- 1. BigQuery first filters by exact match on `network_bucket`
- 2. Only matching bucket rows are checked for the range condition
- 3. Result: Query scans only rows in the matching bucket instead of the entire
-    table
427- ### Row Duplication
428-
429- Networks larger than the bucket size are duplicated. For example, a /15 network
430- spans two /16 buckets:
431-
432- ** IPv4 example (int64 bucket values):**
433-
434- | network | start_int | end_int | network_bucket |
435- | ---------- | --------- | -------- | -------------- |
436- | 2.0.0.0/15 | 33554432 | 33685503 | 33554432 |
437- | 2.0.0.0/15 | 33554432 | 33685503 | 33619968 |
438-
439- Both rows have the same ` start_int ` /` end_int ` (the full /15 range), but
440- different ` network_bucket ` values (2.0.0.0 = 33554432, 2.1.0.0 = 33619968).
441- Queries for IPs in either bucket will find the network.
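The duplication rule can be checked with a short Python sketch that enumerates the buckets a network spans. The helper `bucket_rows` is hypothetical and only approximates what the tool emits; it reproduces the table above for `2.0.0.0/15`.

```python
import ipaddress


def bucket_rows(cidr: str, bucket_size: int = 16) -> list:
    """Return one (network, start_int, end_int, network_bucket) row per bucket."""
    network = ipaddress.ip_network(cidr)
    start_int = int(network.network_address)
    end_int = int(network.broadcast_address)
    if network.prefixlen >= bucket_size:
        # The network fits inside a single bucket: truncate to the bucket prefix.
        buckets = [ipaddress.ip_network(
            f"{network.network_address}/{bucket_size}", strict=False)]
    else:
        # The network is larger than a bucket: one row per bucket it covers.
        buckets = list(network.subnets(new_prefix=bucket_size))
    return [(cidr, start_int, end_int, int(b.network_address)) for b in buckets]


for row in bucket_rows("2.0.0.0/15"):
    print(row)
```

Note that the row count doubles with each bit of difference between the network prefix and the bucket size, so a /8 network with /16 buckets yields 256 rows.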
-
- ### IPv6 Bucket Type Options
-
- IPv6 buckets can be stored as either hex strings (default) or int64 values:
-
- **String type (default):**
-
- - Format: 32-character hex string (e.g., "20010db8000000000000000000000000")
- - Storage: 32 bytes per value
-
- **Int type (`ipv6_bucket_type = "int"`):**
-
- - Format: First 60 bits of the bucket address as int64
- - Storage: 8 bytes per value (4x smaller than string)
-
- We use 60 bits (not 64) because 60-bit values always fit in a positive int64,
- which simplifies BigQuery queries by avoiding two's complement handling.
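A small Python sketch makes the 60-bit choice concrete: the top 64 bits of some IPv6 addresses exceed the largest signed 64-bit integer, while the top 60 bits never do. The helper name `top_bits` and the sample address are assumptions for illustration.

```python
import ipaddress

INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def top_bits(ip: str, bits: int) -> int:
    """Return the leading `bits` bits of an IPv6 address as an unsigned integer."""
    return int(ipaddress.IPv6Address(ip)) >> (128 - bits)


# An address whose first bit is set, e.g. in the multicast range ff00::/8.
print(top_bits("ff00::", 64) > INT64_MAX)   # would need two's complement handling
print(top_bits("ff00::", 60) <= INT64_MAX)  # always fits as a positive int64
```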
-
- **When to use each type:**
-
- - Use **string** (default) for databases where hex string representations are
-   simpler to work with.
- - Use **int** for reduced storage cost at the price of more complicated queries.
-
- We do not provide a `bytes` type for the IPv6 bucket, mainly because there has
- been no need for one so far. For example, BigQuery cannot cluster on `bytes`
- columns, so it would not help there.
-
## Common Query Patterns

### Single IP Lookup