
Dense Tensor support #564

@ollemartensson

Overview

Adding tensor support would make Arrow.jl more useful for machine learning and scientific computing. Arrow.jl does not currently support these multi-dimensional data structures, although they are defined in the Arrow format specification. The specification provides distinct formats for dense and sparse multi-dimensional arrays; this issue focuses on dense multi-dimensional arrays.

Dense Tensors

The canonical extension for dense tensors is arrow.fixed_shape_tensor. This format is designed to represent a column where every element is a tensor of the same shape and element type.

  • Storage Type: The underlying storage is a FixedSizeList. The elements of the tensor are flattened into a single, contiguous list in row-major (C-style) order. The list_size of the FixedSizeList is therefore the total number of elements in the tensor (i.e., the product of its dimensions).
  • Metadata: The multi-dimensional structure is described in the extension metadata. This metadata is a JSON string containing:
    • shape: A required array of integers defining the dimensions of the tensor.
    • dim_names: An optional array of strings providing names for each dimension.
    • permutation: An optional array of integers to describe a logical layout that is a permutation of the physical row-major layout.
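As a minimal sketch of the storage and metadata rules above: Julia arrays are column-major, so a tensor has to be reordered before its elements match the row-major layout of the FixedSizeList values. The helper names flatten_row_major and tensor_metadata are hypothetical and are not part of Arrow.jl; the JSON string is assembled by hand only to keep the example free of dependencies.

```julia
# Hypothetical helpers for illustration; nothing here is part of Arrow.jl.

# Flatten a Julia array (column-major in memory) into row-major (C-style) order,
# the element order used by the FixedSizeList storage.
function flatten_row_major(A::AbstractArray)
    # Reversing the dimension order and then reading column-major
    # visits the elements in row-major order of the original array.
    reversed_dims = reverse(ntuple(identity, ndims(A)))
    return vec(permutedims(A, reversed_dims))
end

# Assemble the extension metadata as a JSON string.
# Only `shape` is required; `dim_names` is optional.
function tensor_metadata(shape; dim_names=nothing)
    json = "{\"shape\":[" * join(shape, ",") * "]"
    if dim_names !== nothing
        json *= ",\"dim_names\":[" * join(("\"$n\"" for n in dim_names), ",") * "]"
    end
    return json * "}"
end

A = [1 2 3; 4 5 6]                 # one 2x3 tensor
flat = flatten_row_major(A)        # [1, 2, 3, 4, 5, 6]
list_size = prod(size(A))          # 6, the list_size of the FixedSizeList
meta = tensor_metadata(size(A); dim_names=("row", "col"))
```

Here meta is the string {"shape":[2,3],"dim_names":["row","col"]}, and flat holds the six values in the order a reader of the Arrow buffer would see them.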

Proposed Design

A new struct type could be defined that acts as a zero-copy view over the underlying Arrow memory buffers.
The struct would not own the data itself but would provide a multi-dimensional interpretation of an underlying Arrow.FixedSizeList.

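struct DenseTensor{T, N} <: AbstractArray{T, N}
    # Underlying Arrow storage; referenced, never copied
    parent::Arrow.FixedSizeList{T}
    # Size of each dimension, taken from the extension metadata
    shape::NTuple{N, Int}
    # Optional dimension names from the extension metadata
    dim_names::Union{Nothing, NTuple{N, Symbol}}
end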

This design lets DenseTensor use Julia's AbstractArray interface for indexing, slicing, and other operations, and constructing a DenseTensor never copies data out of the Arrow buffer.
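To illustrate how little of the AbstractArray interface such a view needs, here is a simplified sketch. It is an assumption-laden stand-in, not the proposed implementation: DenseTensorSketch is a hypothetical name, its parent is any AbstractVector holding one tensor's elements in row-major order (so the example runs without Arrow.jl), and dim_names and permutation are left out.

```julia
# Hypothetical, simplified variant of the proposed struct (not part of Arrow.jl).
struct DenseTensorSketch{T, N, P <: AbstractVector{T}} <: AbstractArray{T, N}
    parent::P                 # flattened row-major elements; referenced, not copied
    shape::NTuple{N, Int}     # size of each dimension

    function DenseTensorSketch(parent::P, shape::NTuple{N, Int}) where {T, N, P <: AbstractVector{T}}
        length(parent) == prod(shape) ||
            throw(DimensionMismatch("parent length must equal prod(shape)"))
        return new{T, N, P}(parent, shape)
    end
end

# `size` plus scalar `getindex` are enough for a read-only AbstractArray.
Base.size(t::DenseTensorSketch) = t.shape

function Base.getindex(t::DenseTensorSketch{T, N}, I::Vararg{Int, N}) where {T, N}
    @boundscheck checkbounds(t, I...)
    # Convert 1-based N-dimensional indices to a row-major offset:
    # the last dimension varies fastest.
    offset = 0
    for d in 1:N
        offset = offset * t.shape[d] + (I[d] - 1)
    end
    return t.parent[offset + 1]
end

buffer = collect(1:6)                    # stands in for the flattened Arrow values
t = DenseTensorSketch(buffer, (2, 3))    # logical view: [1 2 3; 4 5 6]
row = t[1, :]                            # slicing falls back to generic AbstractArray code
buffer[4] = 40                           # mutate the parent storage
shared = t[2, 1]                         # the view reflects the change
```

Constructing t does not copy buffer, which is why shared reads 40. Note that a slice such as t[1, :] allocates a new Array in Julia; a lazy slice would go through view(t, 1, :) instead.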
