Hi all, I primarily work in the big data Java ecosystem and am new to Vortex. I was wondering what the Vortex team recommends for storing large binary content (such as images and videos) as well as vector embeddings within Vortex files.

When checking the Vortex type system (https://docs.vortex.dev/concepts/dtypes#logical-types), I see there is a […]

When looking at the Spark integration, I saw these mappings: https://github.com/vortex-data/vortex/blob/develop/java/vortex-spark/src/main/java/dev/vortex/spark/SparkTypes.java, which seem to align with what I mentioned, but I wanted to know if there are any other insights on how users can best leverage Vortex for storing this kind of data.
Replies: 2 comments 5 replies
Hey @rahil-c! Welcome to the orange side of the big data ecosystem 😉

Indeed, Vortex supports bytes arrays, list arrays, and also fixed-length-list arrays. You probably want the last of these for vectors. Looks like we need to add support for fixed-length-list to the Java bindings.

Below, I share some first-hand information about a dataset with vectors and matrices, but, to best answer your question, I think I need to know what kinds of operations you want to perform on this table of vectors / images / videos. Vortex expressions are extensible, but we do not currently have anything like "slice video" or "slice image", which, I imagine, are important to you. I suspect storing a PNG as binary is not a very useful thing to do (because you're treating very meaningful bytes as meaningless!).

PS: if you do choose to store raw binary in Vortex, you probably want to enable the compact compressor, which enables paged ZSTD compression for binary columns.

FWIW, we have a benchmark which stores a matrix of human genomes in a Vortex array. Each row of the array contains a few thousand metadata fields (about that genetic position) as well as nine vector or matrix fields (which have an entry or column per sample, of which there are ~4k). The types of these fields look like this (from […]): […]

Vortex, by default, does not use row groups. These vector and matrix columns, which are significantly larger than the metadata, all end up at the beginning of the file (from […]): […]

The metadata arrays are stored near the end of the file without chunking due to their extremely small size: […]

If you run […]
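To make the fixed-length-list suggestion for vectors a bit more concrete, here is a minimal, library-free sketch of the general idea behind such a layout. It does not use or describe the Vortex API: `FixedSizeListColumn` and `VariableListColumn` are hypothetical toy classes written only for illustration. When every row has the same dimension, row `i` can be located by arithmetic (`i * dim`) in one flat buffer, whereas a variable-length list also has to carry an offsets array.

```python
from array import array


class FixedSizeListColumn:
    """Toy column of float32 vectors that all share one dimension (hypothetical)."""

    def __init__(self, dim):
        if dim <= 0:
            raise ValueError("dim must be positive")
        self.dim = dim
        self.values = array("f")  # one flat buffer, no per-row offsets

    def append(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} elements, got {len(vector)}")
        self.values.extend(vector)

    def __len__(self):
        return len(self.values) // self.dim

    def __getitem__(self, row):
        if not 0 <= row < len(self):
            raise IndexError(row)
        start = row * self.dim  # row position follows from arithmetic alone
        return self.values[start:start + self.dim]


class VariableListColumn:
    """Toy column of variable-length float32 lists (hypothetical)."""

    def __init__(self):
        self.offsets = array("q", [0])  # one extra integer per row
        self.values = array("f")

    def append(self, vector):
        self.values.extend(vector)
        self.offsets.append(len(self.values))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, row):
        if not 0 <= row < len(self):
            raise IndexError(row)
        return self.values[self.offsets[row]:self.offsets[row + 1]]


embeddings = FixedSizeListColumn(dim=3)
embeddings.append([1.0, 2.0, 3.0])
embeddings.append([4.0, 5.0, 6.0])
second = list(embeddings[1])  # read row 1 without consulting any offsets
```

For embeddings, where the dimension is known up front, the fixed-size form also lets a writer reject a malformed row at append time, which the variable-length form cannot do on its own.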
We are still missing a list layout that would properly partition large strings and lists if they don’t end up dictionary encoded. There are also some harder-to-implement features, like storing big values outside of the file, which we need for values larger than 4 GB.
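As a rough illustration of what "partitioning" a large value can buy you, here is a small sketch that splits one big binary value into fixed-size chunks and then serves a byte range by touching only the chunks that overlap it. This is not how Vortex lays out data, and it says nothing about the planned list layout: `split_into_chunks`, `read_range`, and the chunk size are hypothetical names and values chosen for the example.

```python
def split_into_chunks(blob, chunk_size):
    """Split one large binary value into fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]


def read_range(chunks, chunk_size, start, length):
    """Return `length` bytes beginning at `start`, reading only overlapping chunks."""
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    if length == 0:
        return b""
    first = start // chunk_size                 # first chunk that overlaps the range
    last = (start + length - 1) // chunk_size   # last chunk that overlaps the range
    joined = b"".join(chunks[first:last + 1])
    offset = start - first * chunk_size         # position of `start` inside `joined`
    return joined[offset:offset + length]


# Example: a 2,560-byte value stored as 100-byte chunks.
blob = bytes(range(256)) * 10
chunks = split_into_chunks(blob, 100)
window = read_range(chunks, 100, start=95, length=20)  # spans chunks 0 and 1 only
```

In a real file format each chunk would be a separately addressable (and separately compressed) unit, so a reader interested in a small slice would not need to fetch or decompress the whole value.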