Skip to content

Update parquet-go and drop string workaround#143

Merged
eest merged 1 commit intomainfrom
fix-parquet-hll-bytes
Dec 9, 2025
Merged

Update parquet-go and drop string workaround#143
eest merged 1 commit intomainfrom
fix-parquet-hll-bytes

Conversation

@eest
Copy link
Member

@eest eest commented Dec 9, 2025

With parquet-go v0.26.0 it now properly handles optional []byte data so we no longer need to store it as a string.

go get github.com/parquet-go/parquet-go
go mod tidy

Using a []byte also means we can make the test code check for nil rather than a zero length string which is what we actually want.

Summary by CodeRabbit

  • Chores

    • Updated parquet-go dependency to version 0.26.0 with additional indirect dependencies.
  • Refactor

    • Optimized internal data structure handling for improved efficiency in byte storage operations.

✏️ Tip: You can customize this high-level summary in your review settings.

With parquet-go v0.26.0 it now properly handles optional []byte data so
we no longer need to store it as a string.

```
go get github.com/parquet-go/parquet-go
go mod tidy
```

Using a []byte also means we can make the test code check for nil rather
than a zero length string which is what we actually want.
@eest eest requested a review from a team as a code owner December 9, 2025 11:28
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

This PR upgrades parquet-go from v0.25.1 to v0.26.0 and refactors histogram data storage by changing HLL bytes fields from string to []byte type, eliminating unnecessary conversion steps when writing parquet data.

Changes

Cohort / File(s) Summary
Dependency upgrade
go.mod
Upgraded parquet-go to v0.26.0; added indirect dependencies parquet-go/bitpack v0.2.0 and parquet-go/jsonlite v0.8.1
Type refactoring
pkg/runner/runner.go
Changed histogramData.V4ClientCountHLLBytes and V6ClientCountHLLBytes from string to []byte type; updated parquet struct tags to remove redundant bytes directive
Test updates
pkg/runner/runner_test.go
Updated test cases to work with []byte type instead of string, including nil checks and byte slice length assertions

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Type changes are straightforward and consistent across files
  • Dependency upgrade is minimal with no API changes evident
  • Test updates follow predictable patterns aligned with type changes
  • No complex logic modifications or control flow changes

Possibly related PRs

Suggested reviewers

  • zluudg

Poem

🐰 Bytes were hiding in strings, oh what a sight,
Now they hop freely in slices so light!
Parquet upgraded, no conversion delays,
The rabbit rejoices through data's new ways! 🎉

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: upgrading parquet-go and removing the string workaround for HLL bytes fields.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-parquet-hll-bytes

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d4bae0 and e6a894d.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (3)
  • go.mod (2 hunks)
  • pkg/runner/runner.go (2 hunks)
  • pkg/runner/runner_test.go (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: eest
Repo: dnstapir/edm PR: 93
File: pkg/runner/runner.go:164-178
Timestamp: 2025-09-19T15:16:16.760Z
Learning: The parquet-go library has a known issue (#303) where optional []byte fields don't work correctly, so using string with "bytes,optional" tags is the recommended workaround for optional byte array fields.
Learnt from: eest
Repo: dnstapir/edm PR: 93
File: pkg/runner/runner.go:164-178
Timestamp: 2025-09-19T15:14:01.687Z
Learning: Empty strings in optional parquet fields represent NULL values when consumed by tools like DuckDB, so using `string` with `"bytes,optional"` tags works correctly for conditional HLL byte storage.
📚 Learning: 2025-09-19T15:16:16.760Z
Learnt from: eest
Repo: dnstapir/edm PR: 93
File: pkg/runner/runner.go:164-178
Timestamp: 2025-09-19T15:16:16.760Z
Learning: The parquet-go library has a known issue (#303) where optional []byte fields don't work correctly, so using string with "bytes,optional" tags is the recommended workaround for optional byte array fields.

Applied to files:

  • go.mod
  • pkg/runner/runner.go
📚 Learning: 2025-09-19T15:14:01.687Z
Learnt from: eest
Repo: dnstapir/edm PR: 93
File: pkg/runner/runner.go:164-178
Timestamp: 2025-09-19T15:14:01.687Z
Learning: Empty strings in optional parquet fields represent NULL values when consumed by tools like DuckDB, so using `string` with `"bytes,optional"` tags works correctly for conditional HLL byte storage.

Applied to files:

  • pkg/runner/runner.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (1.25.5)
🔇 Additional comments (4)
go.mod (1)

15-15: parquet-go v0.26.0 bump and new indirects align with dropping the []byte workaround

Upgrading to github.com/parquet-go/parquet-go v0.26.0 is the right precondition for using optional []byte fields directly: this release includes the “fix handling of optional byte slice” change (PR #329) which closes issue #303 where optional []byte columns were always written as NULL. (github.com) The added parquet-go/bitpack and parquet-go/jsonlite indirects are part of that same release train.

No further adjustments needed here; just ensure CI and downstream consumers (e.g., DuckDB queries) are re-run against files produced with the new schema to confirm expected NULL vs non-NULL behaviour for the HLL columns.

Also applies to: 68-69

pkg/runner/runner_test.go (2)

1864-1884: HLL test/benchmark now writing raw []byte payloads – consistent with new field types

Using v4hll.ToBytes() / v6hll.ToBytes() directly for V4ClientCountHLLBytes and V6ClientCountHLLBytes matches the switch to []byte fields and avoids redundant string conversions. The TestHistogramWriter and BenchmarkHistogramWriter still exercise the same parquet path, just with the more natural representation.

No issues spotted here.

Also applies to: 1942-1962


2133-2144: Nil vs non-nil assertions for HLL bytes correctly exercise optional []byte semantics

The updated checks

  • test.ipv4HllIsNull && row.V4ClientCountHLLBytes != nil
  • !test.ipv4HllIsNull && len(row.V4ClientCountHLLBytes) == 0

(and the IPv6 equivalents) are a good fit for the new optional []byte columns: they verify that NULLs round-trip as nil slices, and that non-NULL values have non-zero length.

Since this relies on parquet-go’s mapping of optional BYTE_ARRAY columns to []byte (and the v0.26.0 fix for optional byte slices), please rerun these tests against the upgraded dependency and, if relevant, inspect a generated file in DuckDB to confirm the expected NULL/non-NULL behaviour of the HLL columns.

pkg/runner/runner.go (1)

170-174: Switching HLL payloads from string+bytes to optional []byte is correct with parquet-go v0.26.0

Changing

  • V4ClientCountHLLBytes / V6ClientCountHLLBytes from string with parquet:"…,bytes,optional" to []byte with parquet:"…,optional", and
  • assigning v4HLLBytes / v6HLLBytes slices directly in writeHistogramParquet when the HLL storage type is sparse/dense,

does what this PR intends:

  • With parquet-go v0.26.0, optional []byte fields are now handled correctly (fix for “optional byte slice always NULL”, issue #303 / PR #329). (github.com)
  • For []byte fields, omitting the string/bytes tag leaves the Parquet logical type unset, so you still get a BYTE_ARRAY column, matching the previous schema produced by string + bytes. (pkg.go.dev)
  • Leaving these fields at their zero value (nil) when the HLL is in explicit mode, and only setting them for sparse/dense, preserves the “NULL for explicit, bytes for probabilistic” contract, now using true NULLs instead of an empty-string sentinel.

I don’t see any functional issues with the change; just make sure to validate that:

  • newly written histogram Parquet files show v4client_count_hll / v6client_count_hll as BLOB columns with proper NULL vs non-NULL values in DuckDB (or your downstream tooling), and
  • any code that previously expected string-typed HLL fields in Go has been updated to use []byte (tests in this PR already do so).

Also applies to: 2716-2723


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@zluudg zluudg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eest eest merged commit e94870a into main Dec 9, 2025
5 checks passed
@eest eest deleted the fix-parquet-hll-bytes branch December 9, 2025 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants