lazy type decoding for CSUP #5778

Merged: mccanne merged 2 commits into main from csup-context on Apr 8, 2025
@mccanne (Collaborator) commented Apr 8, 2025

This commit addresses a scaling challenge for CSUP with respect to large numbers of types by arranging for the per-type metadata to be decoded on an as-needed basis. To do so, we changed the metadata representation from a single super value to an array of flattened metadata records indexed by ID. This allows us to incrementally unmarshal and build the shadow vectors for only the types needed by a query. Additionally, the metadata object filter now performs projection too so that only the metadata values from the types needed are deserialized and acted upon by the metadata filter.
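
As a rough illustration of the as-needed decoding described above, here is a minimal sketch, not the actual CSUP code. The names `typeMeta`, `lazyTable`, and `lookup` are hypothetical, and JSON stands in for the real serialization purely for convenience. The point is that records are kept serialized in an array indexed by type ID and decoded only when a query asks for that ID.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// typeMeta is a hypothetical stand-in for one flattened per-type metadata record.
type typeMeta struct {
	Name   string   `json:"name"`
	Fields []string `json:"fields"`
}

// lazyTable holds one serialized record per type ID and decodes each
// record only the first time it is requested.
type lazyTable struct {
	raw     [][]byte    // serialized records, indexed by type ID
	decoded []*typeMeta // cache of decoded records; nil means "not yet decoded"
	decodes int         // number of records actually decoded (for illustration)
}

func newLazyTable(raw [][]byte) *lazyTable {
	return &lazyTable{raw: raw, decoded: make([]*typeMeta, len(raw))}
}

// lookup returns the metadata for id, decoding it on first use.
// It is not safe for concurrent use.
func (t *lazyTable) lookup(id int) (*typeMeta, error) {
	if id < 0 || id >= len(t.raw) {
		return nil, fmt.Errorf("type ID %d out of range", id)
	}
	if m := t.decoded[id]; m != nil {
		return m, nil
	}
	m := new(typeMeta)
	if err := json.Unmarshal(t.raw[id], m); err != nil {
		return nil, fmt.Errorf("decoding type %d: %w", id, err)
	}
	t.decoded[id] = m
	t.decodes++
	return m, nil
}

func main() {
	tbl := newLazyTable([][]byte{
		[]byte(`{"name":"post","fields":["did","text"]}`),
		[]byte(`{"name":"like","fields":["did","subject"]}`),
		[]byte(`{"name":"follow","fields":["did","target"]}`),
	})
	// A query that touches only type 1 decodes only that record.
	m, err := tbl.lookup(1)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Name, tbl.decodes)
}
```

With a single serialized value, every type would have to be decoded before the first lookup; with per-ID records, the cost scales with the types a query touches.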

These changes affect the CSUP file format so we bumped its version number from 8 to 9.

The code that unmarshals the shadow vectors is not currently reentrant. If/when we want to allow concurrent updates (e.g., vcache used by parallel lake requests) we will need to invoke the proper locking protocol.
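
One possible shape for such a locking protocol, sketched under assumptions: `shadowCache`, `shadow`, and `get` are hypothetical names, and a single coarse mutex is used for simplicity. The actual design may well differ (for example, finer-grained locking so that slow builds do not block unrelated lookups).

```go
package main

import (
	"fmt"
	"sync"
)

// shadow is a hypothetical stand-in for a shadow vector built from metadata.
type shadow struct{ id int }

// shadowCache guards lazy construction with a mutex so that concurrent
// callers never build the same entry twice or race on the map.
type shadowCache struct {
	mu     sync.Mutex
	built  map[int]*shadow
	builds int // number of times an entry was built (for illustration)
}

func newShadowCache() *shadowCache {
	return &shadowCache{built: make(map[int]*shadow)}
}

// get returns the shadow for id, building it under the lock on first use.
func (c *shadowCache) get(id int) *shadow {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.built[id]; ok {
		return s
	}
	s := &shadow{id: id}
	c.built[id] = s
	c.builds++
	return s
}

func main() {
	c := newShadowCache()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.get(42) // all goroutines request the same type ID
		}()
	}
	wg.Wait()
	fmt.Println(c.builds)
}
```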

Some rough perf measurements indicate a 5X speedup for the Bluesky Million data set on a simple query projecting a single column.

Fixes #5550

@mccanne mccanne requested review from mattnibs and nwt April 8, 2025 03:21
@nwt (Member) left a comment

Nice!


type Context struct {
	mu    sync.Mutex
	local *super.Context // holds the types for the Metadata values
@nwt (Member):

Maybe call this sctx for consistency?

@mccanne (Collaborator, Author):

This raises an interesting point. I find it helpful for the name to signal that this context is different from the shared query context (i.e., created locally vs. passed in), so it's clear that types need to be translated between the two.
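
To make the translation point concrete, here is a toy model, not the `super.Context` API. `typeContext`, `intern`, and `translate` are hypothetical names. It shows why an ID handed out by a locally created context cannot be used directly in a context that was passed in.

```go
package main

import "fmt"

// typeContext is a toy stand-in for a type context: it interns type
// descriptions and hands out IDs that are meaningful only within it.
type typeContext struct {
	ids   map[string]int
	names []string
}

func newTypeContext() *typeContext {
	return &typeContext{ids: make(map[string]int)}
}

// intern returns the ID for name in this context, allocating one if needed.
func (c *typeContext) intern(name string) int {
	if id, ok := c.ids[name]; ok {
		return id
	}
	id := len(c.names)
	c.ids[name] = id
	c.names = append(c.names, name)
	return id
}

// translate maps an ID valid in src to the equivalent ID in dst.
func translate(dst, src *typeContext, id int) (int, error) {
	if id < 0 || id >= len(src.names) {
		return 0, fmt.Errorf("type ID %d unknown in source context", id)
	}
	return dst.intern(src.names[id]), nil
}

func main() {
	local := newTypeContext()  // created locally, e.g. for metadata types
	shared := newTypeContext() // passed in, e.g. the query's context
	shared.intern("int64")
	shared.intern("string")

	localID := local.intern("{a:int64}")
	sharedID, err := translate(shared, local, localID)
	if err != nil {
		panic(err)
	}
	// The same type has different IDs in the two contexts.
	fmt.Println(localID, sharedID)
}
```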

@mccanne mccanne merged commit de8a72e into main Apr 8, 2025
3 checks passed
@mccanne mccanne deleted the csup-context branch April 8, 2025 18:28

Development

Successfully merging this pull request may close these issues.

Vector CSUP search query consumes 75+ GB memory

2 participants