Add network_bucket column type for efficient BigQuery lookups#12
Conversation
I was thinking about whether the bucket column should be 4 bytes for IPv4 instead of an integer like I have it currently. I decided to go with an integer as it seemed more consistent with what we do for start_ip and end_ip. One downside is that we have to use the bucket column differently depending on whether we have IPv4 or IPv6, but that is already the case with start_ip and end_ip. Another consideration is that this is pretty BigQuery specific. It is conceivable that in other cases, having the bucket in a different format could be convenient, but we could always add further types/options if that turns out to be the case.
Another consideration is whether to duplicate the other network columns when a row needs to be added to multiple buckets. Currently I duplicate everything except the bucket column. I think either way could make sense. However I have opted for duplicating them as it retains the original network size, which could be interesting knowledge. However I could also see a case for making the other network columns match the bucket network. E.g., right now we go from:

```
network,country
2.0.0.0/15,CA
```

to:

```
network,country,network_bucket
2.0.0.0/15,CA,2.0.0.0/16
2.0.0.0/15,CA,2.1.0.0/16
```

But we could instead go to:

```
network,country,network_bucket
2.0.0.0/16,CA,2.0.0.0/16
2.1.0.0/16,CA,2.1.0.0/16
```

I think it is a simpler implementation to duplicate the data. But it could be surprising too.
Thinking about the duplication issue I mention above, I am leaning towards changing it. I'll think about it some more.
After thinking about it a bit, I am now thinking it is likely okay to duplicate the network columns. It might even be desirable in some use cases, as knowing the original network could be more interesting than one created for the bucket, which is not meaningful other than for enabling faster queries.
Pull request overview
This PR adds a network_bucket column type for Parquet output to enable efficient IP lookups in BigQuery and other analytics platforms. The bucketing strategy partitions networks into fixed-size buckets, allowing queries to filter by bucket before checking range conditions, dramatically reducing scan size.
Key changes:
- New `SplitPrefix()` function that splits IP prefixes into bucket-aligned sub-prefixes
- Configuration options `ipv4_bucket_size` and `ipv6_bucket_size` (default: 16) for Parquet output
- Row duplication logic in Parquet writer for networks spanning multiple buckets
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/network/utils.go | Adds `SplitPrefix()` function to split prefixes into bucket-sized pieces with protection against infinite loops |
| internal/network/utils_test.go | Comprehensive test coverage for `SplitPrefix()` including edge cases at IP space boundaries |
| internal/writer/parquet.go | Implements bucketing logic with row duplication and bucket column value generation |
| internal/writer/parquet_test.go | Tests for network bucket functionality covering split/no-split cases and type validation |
| internal/writer/csv.go | Adds `NetworkColumnBucket` constant for consistency |
| internal/config/config.go | Adds bucket size configuration fields with defaults and validation for bucket column requirements |
| internal/config/config_test.go | Test coverage for bucket configuration parsing and validation |
| docs/parquet-queries.md | Documents BigQuery query patterns with bucketing examples |
| docs/config.md | Documents bucket size configuration options in tabular format |
| README.md | Explains network bucketing feature with configuration examples |
| CHANGELOG.md | Documents new feature in unreleased section |
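The bucket-size configuration summarized above could look roughly like the sketch below. Only `ipv4_bucket_size` and `ipv6_bucket_size` with a default of 16 are named in the PR; the struct shape, tag names, and validation bounds are assumptions for illustration.

```go
package main

import "fmt"

// BucketConfig sketches the bucket-size configuration with defaults and
// validation. The field names and bounds are assumptions, not the PR's
// actual config.go.
type BucketConfig struct {
	IPv4BucketSize int `yaml:"ipv4_bucket_size"`
	IPv6BucketSize int `yaml:"ipv6_bucket_size"`
}

// ApplyDefaults fills in the documented default of 16 for unset sizes.
func (c *BucketConfig) ApplyDefaults() {
	if c.IPv4BucketSize == 0 {
		c.IPv4BucketSize = 16
	}
	if c.IPv6BucketSize == 0 {
		c.IPv6BucketSize = 16
	}
}

// Validate rejects sizes outside the address-family prefix range
// (assumed bounds: 1-32 for IPv4, 1-128 for IPv6).
func (c *BucketConfig) Validate() error {
	if c.IPv4BucketSize < 1 || c.IPv4BucketSize > 32 {
		return fmt.Errorf("ipv4_bucket_size must be 1-32, got %d", c.IPv4BucketSize)
	}
	if c.IPv6BucketSize < 1 || c.IPv6BucketSize > 128 {
		return fmt.Errorf("ipv6_bucket_size must be 1-128, got %d", c.IPv6BucketSize)
	}
	return nil
}

func main() {
	var c BucketConfig
	c.ApplyDefaults()
	fmt.Println(c.IPv4BucketSize, c.IPv6BucketSize, c.Validate())
}
```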
Add a new `network_bucket` network column type for Parquet output that enables efficient IP lookups in BigQuery and other analytics platforms. When a network spans multiple buckets, rows are duplicated with different bucket values while preserving the original network info in start_int/end_int.

Key changes:
- Add SplitPrefix() function to split prefixes into bucket-sized pieces
- Add IPv4BucketSize and IPv6BucketSize config options (default: 16)
- Implement row duplication in Parquet writer for networks spanning buckets
- Bucket type: int64 for IPv4, hex string for IPv6
- Require split files (ipv4_file + ipv6_file) for network_bucket column

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This extends the network_bucket column type, previously only available for Parquet output, to also work with CSV output. The implementation mirrors the Parquet approach:
- Add bucket configuration to CSVConfig (ipv4_bucket_size, ipv6_bucket_size, ipv6_bucket_type)
- Implement bucketing logic in CSV writer
- Support both hex string and integer formats for IPv6 buckets
- Require split files when using network_bucket (same as Parquet)

Also refactors shared code:
- Move hasNetworkBucketColumn() and network column constants to new writer.go file
- Rewrite CSV network_bucket tests to mirror Parquet test structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Looks great! 🚀