Commit 9ae7146

horgh and claude committed
Add network_bucket column support for CSV output
This extends the network_bucket column type, previously only available for Parquet output, to also work with CSV output.

The implementation mirrors the Parquet approach:
- Add bucket configuration to CSVConfig (ipv4_bucket_size, ipv6_bucket_size, ipv6_bucket_type)
- Implement bucketing logic in CSV writer
- Support both hex string and integer formats for IPv6 buckets
- Require split files when using network_bucket (same as Parquet)

Also refactors shared code:
- Move hasNetworkBucketColumn() and network column constants to new writer.go file
- Rewrite CSV network_bucket tests to mirror Parquet test structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 32bc23a commit 9ae7146

9 files changed, +843 -71 lines changed

CHANGELOG.md

Lines changed: 5 additions & 5 deletions

@@ -15,17 +15,17 @@ and this project adheres to
   declaring that rows are sorted by start_int in ascending order. This enables
   query engines like DuckDB, Spark, and Trino to use the sort order for
   potential optimizations like binary search.
-- New `network_bucket` network column type for Parquet output, enabling
+- New `network_bucket` network column type for CSV and Parquet output, enabling
   efficient IP lookups in BigQuery and other analytics platforms. When a network
   spans multiple buckets, rows are duplicated with different bucket values while
   preserving original network info. For IPv4, the bucket is an integer. For
   IPv6, the bucket is either a hex string (e.g.,
   "200f0000000000000000000000000000") or an integer depending on
   `ipv6_bucket_type`. Requires split output files (`ipv4_file` and `ipv6_file`).
-- New Parquet options `ipv4_bucket_size` and `ipv6_bucket_size` to configure
-  bucket prefix lengths (default: 16).
-- New Parquet option `ipv6_bucket_type` to configure the IPv6 network bucket
-  column format (default: string).
+- New CSV and Parquet options `ipv4_bucket_size` and `ipv6_bucket_size` to
+  configure bucket prefix lengths (default: 16).
+- New CSV and Parquet option `ipv6_bucket_type` to configure the IPv6 network
+  bucket column format (default: string).
 
 ## [0.1.0] - 2025-11-07

README.md

Lines changed: 2 additions & 2 deletions

@@ -423,7 +423,7 @@ split your output into separate IPv4/IPv6 files via `output.ipv4_file` and
 `output.ipv6_file`. For single-file outputs that include IPv6 data, use string
 columns (`start_ip`, `end_ip`, `cidr`).
 
-**Note:** `network_bucket` is currently only supported for Parquet output.
+**Note:** `network_bucket` is supported for CSV and Parquet output.
 
 ### Network Bucketing for Analytics (BigQuery, etc.)
 
@@ -466,7 +466,7 @@ For IPv4, the bucket is an integer. For IPv6, the bucket is either a hex string
 buckets), the row is duplicated for each bucket it spans. This ensures queries
 find the correct network regardless of which bucket the IP falls into.
 
-**Note:** `network_bucket` is currently only supported for Parquet output.
+**Note:** `network_bucket` is supported for CSV and Parquet output.
 
 ### Data Type Hints

docs/config.md

Lines changed: 19 additions & 8 deletions

@@ -72,8 +72,19 @@ When `format = "csv"`, you can specify CSV-specific options:
 [output.csv]
 delimiter = "," # Field delimiter (default: ",")
 include_header = true # Include column headers (default: true)
+ipv4_bucket_size = 16 # Bucket prefix length for IPv4 (default: 16)
+ipv6_bucket_size = 16 # Bucket prefix length for IPv6 (default: 16)
+ipv6_bucket_type = "string" # IPv6 bucket value type: "string" or "int" (default: "string")
 ```
 
+| Option | Description | Default |
+| ------------------ | -------------------------------------------------------------------------- | -------- |
+| `delimiter` | Field delimiter character | "," |
+| `include_header` | Include column headers in output | true |
+| `ipv4_bucket_size` | Prefix length for IPv4 buckets (1-32, when `network_bucket` column used) | 16 |
+| `ipv6_bucket_size` | Prefix length for IPv6 buckets (1-60, when `network_bucket` column used) | 16 |
+| `ipv6_bucket_type` | IPv6 bucket value type: "string" (hex) or "int" (first 60 bits as integer) | "string" |
+
 #### Parquet Options
 
 When `format = "parquet"`, you can specify Parquet-specific options:
@@ -146,14 +157,14 @@ type = "cidr" # Output type
 
 **Available types:**
 
-| Type | Description |
-| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `cidr` | CIDR notation (e.g., "203.0.113.0/24") |
-| `start_ip` | Starting IP address (e.g., "203.0.113.0") |
-| `end_ip` | Ending IP address (e.g., "203.0.113.255") |
-| `start_int` | Starting IP as integer |
-| `end_int` | Ending IP as integer |
-| `network_bucket` | Bucket for efficient lookups. IPv4: integer. IPv6: hex string (default) or integer (with `ipv6_bucket_type = "int"`). Requires split files (Parquet only). |
+| Type | Description |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `cidr` | CIDR notation (e.g., "203.0.113.0/24") |
+| `start_ip` | Starting IP address (e.g., "203.0.113.0") |
+| `end_ip` | Ending IP address (e.g., "203.0.113.255") |
+| `start_int` | Starting IP as integer |
+| `end_int` | Ending IP as integer |
+| `network_bucket` | Bucket for efficient lookups. IPv4: integer. IPv6: hex string (default) or integer (with `ipv6_bucket_type = "int"`). Requires split files (CSV and Parquet only). |
 
 **Default behavior:** If no `[[network.columns]]` sections are defined:

internal/config/config.go

Lines changed: 59 additions & 29 deletions

@@ -46,8 +46,11 @@ type OutputConfig struct {
 
 // CSVConfig defines CSV output options.
 type CSVConfig struct {
-	Delimiter     string `toml:"delimiter"`      // Field delimiter (default: ",")
-	IncludeHeader *bool  `toml:"include_header"` // Include column headers (default: true)
+	Delimiter      string `toml:"delimiter"`        // Field delimiter (default: ",")
+	IncludeHeader  *bool  `toml:"include_header"`   // Include column headers (default: true)
+	IPv4BucketSize int    `toml:"ipv4_bucket_size"` // Bucket prefix length for IPv4 (default: 16)
+	IPv6BucketSize int    `toml:"ipv6_bucket_size"` // Bucket prefix length for IPv6 (default: 16)
+	IPv6BucketType string `toml:"ipv6_bucket_type"` // "string" or "int" (default: "string")
 }
 
 // ParquetConfig defines Parquet output options.
@@ -174,6 +177,15 @@ func applyDefaults(config *Config) {
 	if config.Output.CSV.IncludeHeader == nil {
 		config.Output.CSV.IncludeHeader = boolPtr(true)
 	}
+	if config.Output.CSV.IPv4BucketSize == 0 {
+		config.Output.CSV.IPv4BucketSize = 16
+	}
+	if config.Output.CSV.IPv6BucketSize == 0 {
+		config.Output.CSV.IPv6BucketSize = 16
+	}
+	if config.Output.CSV.IPv6BucketType == "" {
+		config.Output.CSV.IPv6BucketType = IPv6BucketTypeString
+	}
 
 	// Parquet defaults
 	if config.Output.Parquet.Compression == "" {
@@ -360,9 +372,9 @@ func validate(config *Config) error {
 	}
 
 	if hasBucketColumn {
-		if config.Output.Format != formatParquet {
+		if config.Output.Format == formatMMDB {
 			return errors.New(
-				"network_bucket column type is only supported for Parquet output",
+				"network_bucket column type is only supported for CSV and Parquet output",
 			)
 		}
 
@@ -374,31 +386,8 @@ func validate(config *Config) error {
 			)
 		}
 
-		// Validate bucket sizes when network_bucket column is used
-		if config.Output.Parquet.IPv4BucketSize < 1 ||
-			config.Output.Parquet.IPv4BucketSize > 32 {
-			return fmt.Errorf(
-				"ipv4_bucket_size must be between 1 and 32, got %d",
-				config.Output.Parquet.IPv4BucketSize,
-			)
-		}
-		// IPv6 bucket size capped at 60 to support int type (60-bit values fit in
-		// positive int64, simplifying BigQuery queries)
-		if config.Output.Parquet.IPv6BucketSize < 1 ||
-			config.Output.Parquet.IPv6BucketSize > 60 {
-			return fmt.Errorf(
-				"ipv6_bucket_size must be between 1 and 60, got %d",
-				config.Output.Parquet.IPv6BucketSize,
-			)
-		}
-
-		// Validate IPv6 bucket type
-		if config.Output.Parquet.IPv6BucketType != IPv6BucketTypeString &&
-			config.Output.Parquet.IPv6BucketType != IPv6BucketTypeInt {
-			return fmt.Errorf(
-				"ipv6_bucket_type must be 'string' or 'int', got '%s'",
-				config.Output.Parquet.IPv6BucketType,
-			)
+		if err := validateBucketConfig(config); err != nil {
+			return err
 		}
 	}
 
@@ -451,3 +440,44 @@ func validate(config *Config) error {
 
 	return nil
 }
+
+// validateBucketConfig validates bucket configuration for CSV or Parquet output.
+func validateBucketConfig(config *Config) error {
+	var ipv4BucketSize, ipv6BucketSize int
+	var ipv6BucketType string
+
+	if config.Output.Format == formatCSV {
+		ipv4BucketSize = config.Output.CSV.IPv4BucketSize
+		ipv6BucketSize = config.Output.CSV.IPv6BucketSize
+		ipv6BucketType = config.Output.CSV.IPv6BucketType
+	} else {
+		ipv4BucketSize = config.Output.Parquet.IPv4BucketSize
+		ipv6BucketSize = config.Output.Parquet.IPv6BucketSize
+		ipv6BucketType = config.Output.Parquet.IPv6BucketType
+	}
+
+	if ipv4BucketSize < 1 || ipv4BucketSize > 32 {
+		return fmt.Errorf(
+			"ipv4_bucket_size must be between 1 and 32, got %d",
+			ipv4BucketSize,
+		)
+	}
+
+	// IPv6 bucket size capped at 60 to support int type (60-bit values fit in
+	// positive int64, simplifying BigQuery queries)
+	if ipv6BucketSize < 1 || ipv6BucketSize > 60 {
+		return fmt.Errorf(
+			"ipv6_bucket_size must be between 1 and 60, got %d",
+			ipv6BucketSize,
+		)
+	}
+
+	if ipv6BucketType != IPv6BucketTypeString && ipv6BucketType != IPv6BucketTypeInt {
+		return fmt.Errorf(
+			"ipv6_bucket_type must be 'string' or 'int', got '%s'",
+			ipv6BucketType,
+		)
+	}
+
+	return nil
+}
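How the new defaults and the extracted validation interact can be sketched in a self-contained form. The type `bucketOptions` and the functions `withDefaults` and `check` are hypothetical stand-ins for the three bucket fields and for `applyDefaults` and `validateBucketConfig`. The sketch assumes that defaults are applied before validation, which this diff does not show, and that `IPv6BucketTypeString` and `IPv6BucketTypeInt` equal "string" and "int".

```go
package main

import "fmt"

// bucketOptions is a hypothetical stand-in for the three bucket fields that
// CSVConfig and ParquetConfig both carry.
type bucketOptions struct {
	IPv4BucketSize int
	IPv6BucketSize int
	IPv6BucketType string
}

// withDefaults mirrors the zero-value checks added to applyDefaults.
func withDefaults(o bucketOptions) bucketOptions {
	if o.IPv4BucketSize == 0 {
		o.IPv4BucketSize = 16
	}
	if o.IPv6BucketSize == 0 {
		o.IPv6BucketSize = 16
	}
	if o.IPv6BucketType == "" {
		o.IPv6BucketType = "string"
	}
	return o
}

// check mirrors the range and enum checks in validateBucketConfig.
func check(o bucketOptions) error {
	if o.IPv4BucketSize < 1 || o.IPv4BucketSize > 32 {
		return fmt.Errorf("ipv4_bucket_size must be between 1 and 32, got %d", o.IPv4BucketSize)
	}
	if o.IPv6BucketSize < 1 || o.IPv6BucketSize > 60 {
		return fmt.Errorf("ipv6_bucket_size must be between 1 and 60, got %d", o.IPv6BucketSize)
	}
	if o.IPv6BucketType != "string" && o.IPv6BucketType != "int" {
		return fmt.Errorf("ipv6_bucket_type must be 'string' or 'int', got '%s'", o.IPv6BucketType)
	}
	return nil
}

func main() {
	cases := []bucketOptions{
		{},                      // all unset: defaults apply
		{IPv4BucketSize: 33},    // above the IPv4 limit
		{IPv6BucketSize: 61},    // above the 60-bit cap
		{IPv4BucketSize: -1},    // negative values are not defaulted
		{IPv6BucketType: "hex"}, // unknown bucket type
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, check(withDefaults(c)))
	}
}
```

Under the stated ordering assumption, an explicit `0` cannot be told apart from an unset field, so it is replaced by 16 rather than rejected. The lower bound in the range checks then only catches negative values.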
