Overview
Adding Tensor support would help position Arrow.jl as a strong library for machine learning and scientific computing. These multi-dimensional data structures are defined in the Arrow format specification but are currently unsupported in Arrow.jl. The specification provides distinct formats for dense and sparse multi-dimensional arrays; this issue focuses on dense multi-dimensional arrays.
Dense Tensors
The canonical extension for dense tensors is arrow.fixed_shape_tensor. This format is designed to represent a column where every element is a tensor of the same shape and element type.
- Storage Type: The underlying storage is a FixedSizeList. The elements of the tensor are flattened into a single, contiguous list in row-major (C-style) order. The list_size of the FixedSizeList is therefore the total number of elements in the tensor (i.e., the product of its dimensions).
- Metadata: The multi-dimensional structure is described in the extension metadata. This metadata is a JSON string containing:
  - shape: A required array of integers defining the dimensions of the tensor.
  - dim_names: An optional array of strings providing names for each dimension.
  - permutation: An optional array of integers to describe a logical layout that is a permutation of the physical row-major layout.
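The storage rules above can be illustrated with a small, self-contained sketch. It uses only Base Julia and no Arrow.jl API; the helper names `rowmajor_offset` and `from_rowmajor`, the example shape, and the metadata string are all hypothetical and chosen only for illustration. Because Julia arrays are column-major, a row-major flat list has to be reshaped with reversed dimensions and then permuted back.

```julia
# Illustrative metadata string (hypothetical values), shown only for its shape;
# it is not parsed here.
metadata = raw"""{"shape": [2, 3], "dim_names": ["row", "col"]}"""

shape = (2, 3)               # tensor dimensions from the "shape" field
list_size = prod(shape)      # list_size of the FixedSizeList: 2 * 3 = 6

# One column element, flattened in row-major (C-style) order.
flat = [1, 2, 3, 4, 5, 6]
@assert length(flat) == list_size

# Zero-based offset into the flat list for a one-based index `idx`,
# assuming a row-major layout (hypothetical helper).
function rowmajor_offset(shape::NTuple{N,Int}, idx::NTuple{N,Int}) where {N}
    offset = 0
    for d in 1:N
        offset = offset * shape[d] + (idx[d] - 1)
    end
    return offset
end

# Materialize a Julia array from row-major data: reshape with reversed
# dimensions, then reverse the dimension order (hypothetical helper).
# Note that permutedims copies; this is for illustration, not zero-copy access.
function from_rowmajor(flat::AbstractVector, shape::NTuple{N,Int}) where {N}
    return permutedims(reshape(flat, reverse(shape)), ntuple(d -> N - d + 1, N))
end

tensor = from_rowmajor(flat, shape)   # 2×3 matrix with rows [1 2 3] and [4 5 6]
```

Here `tensor[2, 1]` and `flat[rowmajor_offset(shape, (2, 1)) + 1]` refer to the same element. The optional permutation field is not handled in this sketch.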
Proposed Design
A new struct type could be defined that acts as a zero-copy view over the underlying Arrow memory buffers.
This struct would not own the data itself; it would provide a multi-dimensional interpretation of an underlying Arrow.FixedSizeList.
    struct DenseTensor{T, N} <: AbstractArray{T, N}
        parent::Arrow.FixedSizeList{T}
        shape::NTuple{N, Int}
        dim_names::Union{Nothing, NTuple{N, Symbol}}
    end
This design allows DenseTensor to leverage Julia's rich AbstractArray interface for indexing, slicing, and other operations, while the wrapper itself reads directly from the Arrow buffer rather than copying it.
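As a rough sketch of how such a view might satisfy the AbstractArray interface, the example below defines only `size` and `getindex` over flat row-major storage. It is not the Arrow.jl implementation: `FlatTensorView` is a hypothetical stand-in for the proposed DenseTensor, and a plain `Vector` stands in for the memory of one Arrow.FixedSizeList element so that the block runs without Arrow.jl.

```julia
# Hypothetical stand-in for the proposed DenseTensor; `data` is any flat
# vector holding the tensor elements in row-major order.
struct FlatTensorView{T,N,V<:AbstractVector{T}} <: AbstractArray{T,N}
    data::V                                     # referenced, never copied
    shape::NTuple{N,Int}
    dim_names::Union{Nothing,NTuple{N,Symbol}}

    function FlatTensorView(data::V, shape::NTuple{N,Int},
                            dim_names=nothing) where {T,N,V<:AbstractVector{T}}
        length(data) == prod(shape) ||
            throw(DimensionMismatch("data has $(length(data)) elements, shape needs $(prod(shape))"))
        return new{T,N,V}(data, shape, dim_names)
    end
end

# Minimal AbstractArray interface: size plus scalar Cartesian indexing.
Base.size(t::FlatTensorView) = t.shape

function Base.getindex(t::FlatTensorView{T,N}, idx::Vararg{Int,N}) where {T,N}
    @boundscheck checkbounds(t, idx...)
    # Translate the one-based Cartesian index into a row-major offset.
    offset = 0
    for d in 1:N
        offset = offset * t.shape[d] + (idx[d] - 1)
    end
    return t.data[firstindex(t.data) + offset]
end

buffer = collect(1.0:6.0)                        # stand-in for Arrow-owned memory
t = FlatTensorView(buffer, (2, 3), (:row, :col))
```

With these two methods, generic operations such as `sum(t)`, iteration, and comparison with a `Matrix` work through Julia's fallbacks, and a write to `buffer` is visible through `t` because the view holds a reference. One caveat under this design: ranged indexing such as `t[1, :]` allocates a new array in Julia, so `view(t, 1, :)` would be needed to keep slices copy-free.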