
Add network_bucket column type for efficient BigQuery lookups #12

Merged

horgh merged 6 commits into main from wstorey/bucket-column on Dec 24, 2025

Conversation

Contributor

@horgh horgh commented Dec 19, 2025

Add a new network_bucket network column type for Parquet output that enables efficient IP lookups in BigQuery and other analytics platforms. When a network spans multiple buckets, rows are duplicated with different bucket values while preserving the original network info in start_int/end_int.

Key changes:

  • Add SplitPrefix() function to split prefixes into bucket-sized pieces
  • Add IPv4BucketSize and IPv6BucketSize config options (default: 16)
  • Implement row duplication in Parquet writer for networks spanning buckets
  • Bucket type: int64 for IPv4, hex string for IPv6
  • Require split files (ipv4_file + ipv6_file) for network_bucket column

🤖 Generated with Claude Code

Contributor Author

horgh commented Dec 19, 2025

I was thinking about whether the bucket column should be 4 raw bytes for IPv4 instead of an integer, as I have it currently. I decided to go with an integer, as that seemed more consistent with what we do for start_ip and end_ip.

One downside is that we have to use the bucket column differently depending on whether we have IPv4 or IPv6. However, that is already the case with start_ip and end_ip.

Another consideration is that this is pretty BigQuery-specific. It is conceivable that in other cases having the bucket in a different format could be convenient. However, we could always add further types/options if that turns out to be the case.

Contributor Author

horgh commented Dec 19, 2025

Another consideration is whether to duplicate the other network columns when a row needs to be added to multiple buckets. Currently I duplicate everything except the bucket column.

I think either way could make sense. I have opted for duplicating them, as that retains the original network size, which could be interesting knowledge. However, I could also see a case for making the other network columns match the bucket network.

e.g. right now we go from:

network,country
2.0.0.0/15,CA

to

network,country,network_bucket
2.0.0.0/15,CA,2.0.0.0/16
2.0.0.0/15,CA,2.1.0.0/16

But we could instead go to:

network,country,network_bucket
2.0.0.0/16,CA,2.0.0.0/16
2.1.0.0/16,CA,2.1.0.0/16

I think duplicating the data is the simpler implementation, but it could be surprising too.

Contributor Author

horgh commented Dec 20, 2025

Thinking about the duplication issue I mention above, I am leaning towards changing it. I'll think about it some more.

@horgh horgh force-pushed the wstorey/bucket-column branch from 03f583c to ba3e244 on December 21, 2025 at 17:36
Contributor Author

horgh commented Dec 21, 2025

After thinking about it a bit, I am now inclined to think it is okay to duplicate the network columns. It might even be desirable in some use cases, as knowing the original network could be more interesting than a network created only for the bucket, which is not meaningful beyond enabling faster queries.

@horgh horgh force-pushed the wstorey/bucket-column branch 2 times, most recently from 965783e to ff9a76c on December 21, 2025 at 19:55
@horgh horgh requested a review from Copilot on December 21, 2025 at 19:57

Copilot AI left a comment


Pull request overview

This PR adds a network_bucket column type for Parquet output to enable efficient IP lookups in BigQuery and other analytics platforms. The bucketing strategy partitions networks into fixed-size buckets, allowing queries to filter by bucket before checking range conditions, dramatically reducing scan size.

Key changes:

  • New SplitPrefix() function that splits IP prefixes into bucket-aligned sub-prefixes
  • Configuration options ipv4_bucket_size and ipv6_bucket_size (default: 16) for Parquet output
  • Row duplication logic in Parquet writer for networks spanning multiple buckets

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Summary per file:

  • internal/network/utils.go: Adds SplitPrefix() function to split prefixes into bucket-sized pieces, with protection against infinite loops
  • internal/network/utils_test.go: Comprehensive test coverage for SplitPrefix(), including edge cases at IP space boundaries
  • internal/writer/parquet.go: Implements bucketing logic with row duplication and bucket column value generation
  • internal/writer/parquet_test.go: Tests for network bucket functionality covering split/no-split cases and type validation
  • internal/writer/csv.go: Adds NetworkColumnBucket constant for consistency
  • internal/config/config.go: Adds bucket size configuration fields with defaults and validation for bucket column requirements
  • internal/config/config_test.go: Test coverage for bucket configuration parsing and validation
  • docs/parquet-queries.md: Documents BigQuery query patterns with bucketing examples
  • docs/config.md: Documents bucket size configuration options in tabular format
  • README.md: Explains network bucketing feature with configuration examples
  • CHANGELOG.md: Documents new feature in unreleased section


@horgh horgh force-pushed the wstorey/bucket-column branch 6 times, most recently from f736930 to eb99657 on December 22, 2025 at 17:39
@horgh horgh force-pushed the wstorey/bucket-column branch 2 times, most recently from c7fa37b to 6064bbf on December 24, 2025 at 17:12
horgh and others added 3 commits December 24, 2025 17:24
Add a new `network_bucket` network column type for Parquet output that
enables efficient IP lookups in BigQuery and other analytics platforms.
When a network spans multiple buckets, rows are duplicated with different
bucket values while preserving the original network info in start_int/end_int.

Key changes:
- Add SplitPrefix() function to split prefixes into bucket-sized pieces
- Add IPv4BucketSize and IPv6BucketSize config options (default: 16)
- Implement row duplication in Parquet writer for networks spanning buckets
- Bucket type: int64 for IPv4, hex string for IPv6
- Require split files (ipv4_file + ipv6_file) for network_bucket column

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This extends the network_bucket column type, previously only available
for Parquet output, to also work with CSV output. The implementation
mirrors the Parquet approach:

- Add bucket configuration to CSVConfig (ipv4_bucket_size,
  ipv6_bucket_size, ipv6_bucket_type)
- Implement bucketing logic in CSV writer
- Support both hex string and integer formats for IPv6 buckets
- Require split files when using network_bucket (same as Parquet)

Also refactors shared code:
- Move hasNetworkBucketColumn() and network column constants to new
  writer.go file
- Rewrite CSV network_bucket tests to mirror Parquet test structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@horgh horgh force-pushed the wstorey/bucket-column branch 2 times, most recently from 936f17a to b9b28df on December 24, 2025 at 19:47
@horgh horgh force-pushed the wstorey/bucket-column branch from b9b28df to e498ee4 on December 24, 2025 at 20:05
@marselester
Contributor

Looks great! 🚀

@horgh horgh merged commit 62e7a0d into main Dec 24, 2025
12 checks passed
@horgh horgh deleted the wstorey/bucket-column branch December 24, 2025 22:04