lazy type decoding for CSUP by mccanne · Pull Request #5778 · brimdata/super

mccanne · 2025-04-08T03:18:30Z

This commit addresses a scaling challenge for CSUP with respect to large numbers of types by arranging for the per-type metadata to be decoded on an as-needed basis. To do so, we changed the metadata representation from a single super value to an array of flattened metadata records indexed by ID. This allows us to incrementally unmarshal and build the shadow vectors for only the types needed by a query. Additionally, the metadata object filter now performs projection too so that only the metadata values from the types needed are deserialized and acted upon by the metadata filter.
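The lazy scheme described above can be illustrated with a minimal sketch. It is not the code from this PR: `typeMeta`, `lazyMetadata`, and `lookup` are hypothetical names, and JSON stands in for the real serialized metadata format. It only shows the idea of keeping one serialized record per type ID and decoding a record the first time a query asks for that ID.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// typeMeta is a hypothetical stand-in for one flattened metadata record.
type typeMeta struct {
	Name   string   `json:"name"`
	Fields []string `json:"fields"`
}

// lazyMetadata keeps the serialized records indexed by type ID and
// decodes each record only the first time it is requested.
type lazyMetadata struct {
	raw     [][]byte    // one serialized record per type ID (JSON here, as an assumption)
	decoded []*typeMeta // nil until the corresponding ID is first looked up
	decodes int         // how many records have actually been decoded
}

func newLazyMetadata(raw [][]byte) *lazyMetadata {
	return &lazyMetadata{raw: raw, decoded: make([]*typeMeta, len(raw))}
}

// lookup returns the decoded record for id, decoding it on first use.
// It is not safe for concurrent use.
func (l *lazyMetadata) lookup(id int) (*typeMeta, error) {
	if id < 0 || id >= len(l.raw) {
		return nil, fmt.Errorf("type ID %d out of range", id)
	}
	if m := l.decoded[id]; m != nil {
		return m, nil
	}
	m := new(typeMeta)
	if err := json.Unmarshal(l.raw[id], m); err != nil {
		return nil, err
	}
	l.decoded[id] = m
	l.decodes++
	return m, nil
}

func main() {
	meta := newLazyMetadata([][]byte{
		[]byte(`{"name":"post","fields":["did","text"]}`),
		[]byte(`{"name":"like","fields":["did","subject"]}`),
		[]byte(`{"name":"follow","fields":["did","target"]}`),
	})
	// A query that touches only type ID 1 decodes only that record,
	// even when it looks the type up more than once.
	for i := 0; i < 2; i++ {
		if _, err := meta.lookup(1); err != nil {
			panic(err)
		}
	}
	m, _ := meta.lookup(1)
	fmt.Println(m.Name, meta.decodes) // prints "like 1"
}
```

With many types in a file, the cost of a query then scales with the number of types it touches rather than with the total number of types stored.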

These changes affect the CSUP file format so we bumped its version number from 8 to 9.

The code that unmarshals the shadow vectors is not currently reentrant. If/when we want to allow concurrent updates (e.g., vcache used by parallel lake requests) we will need to invoke the proper locking protocol.

Some rough perf measurements indicate a 5X speedup for the Bluesky Million data set on a simple query projecting a single column.

Fixes #5550


nwt

Nice!

runtime/vcache/object.go

runtime/vcache/shadow.go

nwt · 2025-04-08T15:58:01Z

csup/context.go

+
+type Context struct {
+	mu     sync.Mutex
+	local  *super.Context // holds the types for the Metadata values


Maybe call this sctx for consistency?

This raises an interesting point. I find it helpful to know it's different than the shared query context (i.e., created locally vs passed in) so that you need to do translations.

mccanne requested review from mattnibs and nwt April 8, 2025 03:21

nwt approved these changes Apr 8, 2025


address PR feedback

eadb8e1

mccanne merged commit de8a72e into main Apr 8, 2025
3 checks passed

mccanne deleted the csup-context branch April 8, 2025 18:28

mccanne merged 2 commits into main from csup-context

mccanne commented Apr 8, 2025 • edited by philrz
