Add network_bucket column type for efficient BigQuery lookups#12
Conversation
I was thinking about whether the bucket column should be 4 bytes for IPv4 instead of an integer like I have it currently. I decided to go with an integer as it seemed more consistent with what we do for start_ip and end_ip. One downside is that we have to use the bucket column differently depending on whether we have IPv4 or IPv6, but that is already the case with start_ip and end_ip. Another consideration is that this is pretty BigQuery specific. It is conceivable that in other cases, having the bucket in a different format could be convenient, but we could always add further types/options if that turns out to be the case.
Another consideration is whether to duplicate the other network columns when a row needs to be added to multiple buckets. Currently I duplicate everything except the bucket column. I think either way could make sense. However I have opted for duplicating them as it retains the original network size, which could be interesting knowledge. However I could also see a case for making the other network columns match the bucket network. E.g., right now we go from:

```
network,country
2.0.0.0/15,CA
```

to:

```
network,country,network_bucket
2.0.0.0/15,CA,2.0.0.0/16
2.0.0.0/15,CA,2.1.0.0/16
```

But we could instead go to:

```
network,country,network_bucket
2.0.0.0/16,CA,2.0.0.0/16
2.1.0.0/16,CA,2.1.0.0/16
```

I think it is a simpler implementation to duplicate the data. But it could be surprising too.
Thinking about the duplication issue I mention above, I am leaning towards changing it. I'll think about it some more.
After thinking about it a bit, I am now thinking it is likely okay to duplicate the network columns. It might even be desirable in some use cases, as knowing the original network could be more interesting than one created for the bucket, which is not meaningful other than for enabling faster queries.
Pull request overview
This PR adds a network_bucket column type for Parquet output to enable efficient IP lookups in BigQuery and other analytics platforms. The bucketing strategy partitions networks into fixed-size buckets, allowing queries to filter by bucket before checking range conditions, dramatically reducing scan size.
Key changes:
- New `SplitPrefix()` function that splits IP prefixes into bucket-aligned sub-prefixes
- Configuration options `ipv4_bucket_size` and `ipv6_bucket_size` (default: 16) for Parquet output
- Row duplication logic in Parquet writer for networks spanning multiple buckets
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/network/utils.go | Adds `SplitPrefix()` function to split prefixes into bucket-sized pieces with protection against infinite loops |
| internal/network/utils_test.go | Comprehensive test coverage for `SplitPrefix()` including edge cases at IP space boundaries |
| internal/writer/parquet.go | Implements bucketing logic with row duplication and bucket column value generation |
| internal/writer/parquet_test.go | Tests for network bucket functionality covering split/no-split cases and type validation |
| internal/writer/csv.go | Adds `NetworkColumnBucket` constant for consistency |
| internal/config/config.go | Adds bucket size configuration fields with defaults and validation for bucket column requirements |
| internal/config/config_test.go | Test coverage for bucket configuration parsing and validation |
| docs/parquet-queries.md | Documents BigQuery query patterns with bucketing examples |
| docs/config.md | Documents bucket size configuration options in tabular format |
| README.md | Explains network bucketing feature with configuration examples |
| CHANGELOG.md | Documents new feature in unreleased section |
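The bucket-size configuration summarized above could look roughly like the sketch below. Only `ipv4_bucket_size` and `ipv6_bucket_size` with a default of 16 are named in the PR; the struct shape, tag names, and validation bounds are assumptions for illustration.

```go
package main

import "fmt"

// BucketConfig sketches the bucket-size configuration with defaults and
// validation. The field names and bounds are assumptions, not the PR's
// actual config.go.
type BucketConfig struct {
	IPv4BucketSize int `yaml:"ipv4_bucket_size"`
	IPv6BucketSize int `yaml:"ipv6_bucket_size"`
}

// ApplyDefaults fills in the documented default of 16 for unset sizes.
func (c *BucketConfig) ApplyDefaults() {
	if c.IPv4BucketSize == 0 {
		c.IPv4BucketSize = 16
	}
	if c.IPv6BucketSize == 0 {
		c.IPv6BucketSize = 16
	}
}

// Validate rejects sizes outside the address-family prefix range
// (assumed bounds: 1-32 for IPv4, 1-128 for IPv6).
func (c *BucketConfig) Validate() error {
	if c.IPv4BucketSize < 1 || c.IPv4BucketSize > 32 {
		return fmt.Errorf("ipv4_bucket_size must be 1-32, got %d", c.IPv4BucketSize)
	}
	if c.IPv6BucketSize < 1 || c.IPv6BucketSize > 128 {
		return fmt.Errorf("ipv6_bucket_size must be 1-128, got %d", c.IPv6BucketSize)
	}
	return nil
}

func main() {
	var c BucketConfig
	c.ApplyDefaults()
	fmt.Println(c.IPv4BucketSize, c.IPv6BucketSize, c.Validate())
}
```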
Add a new `network_bucket` network column type for Parquet output that enables efficient IP lookups in BigQuery and other analytics platforms. When a network spans multiple buckets, rows are duplicated with different bucket values while preserving the original network info in start_int/end_int.

Key changes:
- Add SplitPrefix() function to split prefixes into bucket-sized pieces
- Add IPv4BucketSize and IPv6BucketSize config options (default: 16)
- Implement row duplication in Parquet writer for networks spanning buckets
- Bucket type: int64 for IPv4, hex string for IPv6
- Require split files (ipv4_file + ipv6_file) for network_bucket column

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This extends the network_bucket column type, previously only available for Parquet output, to also work with CSV output. The implementation mirrors the Parquet approach:
- Add bucket configuration to CSVConfig (ipv4_bucket_size, ipv6_bucket_size, ipv6_bucket_type)
- Implement bucketing logic in CSV writer
- Support both hex string and integer formats for IPv6 buckets
- Require split files when using network_bucket (same as Parquet)

Also refactors shared code:
- Move hasNetworkBucketColumn() and network column constants to new writer.go file
- Rewrite CSV network_bucket tests to mirror Parquet test structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Looks great! 🚀