Efficient encoding for List<Categorical> #5772
gatesn
started this conversation in
Feature Requests
Replies: 2 comments 6 replies
So here's a proposal: Take the array: We can encode as a
@agola11 - yes, I would say this logic should be wrapped up in a specific compression strategy: essentially a custom builder that you push arrays into, which converts them into this form. Then we need some way to either infer this strategy or, probably easier, prescribe it, e.g. use a CompressorWriteStrategy that you have pre-configured to run this specific encoding function.
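To make the "custom builder that you push arrays into" idea concrete, here is a minimal Python sketch. It is not Vortex code: `TagListDictBuilder`, `push`, `finish`, and `decode` are hypothetical names, and the target form is an assumption (each distinct tag list is stored once and rows become integer codes), which may differ from the form proposed earlier in the thread.

```python
class TagListDictBuilder:
    """Hypothetical builder: dictionary-encodes whole tag lists (assumed form)."""

    def __init__(self):
        self._index = {}   # tuple of tags -> code
        self._values = []  # code -> list of tags, in first-seen order
        self._codes = []   # one code per pushed row

    def push(self, tags):
        """Append one row (a list of tags) to the builder."""
        key = tuple(tags)  # order-sensitive: ["a", "b"] != ["b", "a"]
        code = self._index.get(key)
        if code is None:
            code = len(self._values)
            self._index[key] = code
            self._values.append(list(tags))
        self._codes.append(code)

    def finish(self):
        """Return (codes, values): per-row codes and the distinct tag lists."""
        return self._codes, self._values


def decode(codes, values):
    """Expand the encoded form back into one tag list per row."""
    return [list(values[c]) for c in codes]


# Made-up sample rows, for illustration only.
rows = [["PASS"], ["PASS"], ["q10", "s50"], ["PASS"]]
builder = TagListDictBuilder()
for row in rows:
    builder.push(row)
codes, values = builder.finish()
```

Prescribing the strategy, as suggested above, would then amount to configuring the writer to route this column through such a builder; how that hooks into CompressorWriteStrategy is not shown here.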
A common use case is storing a column of "tags", where the set of possible tags has relatively low cardinality and each row holds a list of those tags,
e.g. the statpopgen FILTER column.
In this case, there are actually very few unique sets of tags:
But I don't think this generally holds true?
So what should a good Vortex encoding be for this column? And is there a sensible Arrow data type that we can decompress into?
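Whether "very few unique sets of tags" holds for a given column can be checked before choosing an encoding. The sketch below is one way to measure it in plain Python; `tag_set_stats` is a hypothetical helper and the sample rows are made up, not taken from the statpopgen data.

```python
from collections import Counter


def tag_set_stats(rows):
    """Return (row_count, distinct_tags, distinct_tag_lists) for a tags column."""
    distinct_tags = {tag for row in rows for tag in row}
    # Tuples preserve order, so ["a", "b"] and ["b", "a"] count as different lists.
    distinct_lists = Counter(tuple(row) for row in rows)
    return len(rows), len(distinct_tags), len(distinct_lists)


# Made-up sample values, for illustration only.
sample = [["PASS"], ["PASS"], ["q10", "s50"], ["PASS"], ["q10"], ["q10", "s50"]]
stats = tag_set_stats(sample)
print(stats)  # (6, 3, 3)
```

If the number of distinct lists stays small relative to the row count, encoding each row as a reference to a whole list looks attractive; if it grows towards the row count, encoding the individual tags is probably the safer default.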