feat(r): Add experimental nanoarrow_vctr to wrap a list of arrays (#461)
This PR adds the `nanoarrow_vctr`, which is an R translation of the
Python `Array` class in nanoarrow's Python bindings. This is implemented
like an R `factor()` in the sense that under the hood it is a sequence
of integers (`0:(array$length - 1)` at the beginning) with attributes
that give those integers context.
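The factor analogy can be sketched in base R. This is a toy illustration only, not nanoarrow's implementation: `toy_vctr`, `toy_values()`, and `toy_slice()` are hypothetical names, and a plain integer vector stands in for an Arrow array. The `chunks` attribute name mirrors the one used in the demo below.

```r
# A factor is an integer vector plus attributes that give the integers meaning
f <- factor(c("b", "a", "b"))
typeof(f)       # "integer"
attributes(f)   # levels and class

# Hypothetical toy analogue: 0-based positions plus a "chunks" attribute
# holding the underlying data (a plain vector here, not an Arrow array)
toy_vctr <- structure(
  0:4,
  chunks = list(c(10L, 20L, 30L, 40L, 50L)),
  class = "toy_vctr"
)

# Resolving the integers against the attribute recovers the values
toy_values <- function(x) {
  attr(x, "chunks")[[1]][as.integer(x) + 1L]
}

# Slicing only touches the integers; the underlying data is shared, not copied
toy_slice <- function(x, i) {
  structure(as.integer(x)[i], chunks = attr(x, "chunks"), class = "toy_vctr")
}

toy_values(toy_vctr)
toy_values(toy_slice(toy_vctr, 2:3))
```

Because a slice only rewrites the integer positions, the wrapped data never has to be converted until something asks for the values.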
This is implemented in such a way that it is "tacked on" to the existing
conversions. The existing conversions do need a refactoring (#392), but
that is a heavy change for this point in the release cycle.
The only change needed to the existing conversion was a slight refactor
of the "consume array stream" code so that it correctly gives each array
in the stream its own R object to manage its lifecycle. Previously, each
array was "materialized" and then immediately released because no
conversion code required an ArrowArray to live beyond the conversion.
The motivation for this change is converting GeoArrow extension types.
In the geoarrow package, we implement an efficient conversion from a
stream of arrays to various types of R-spatial objects (e.g., sf);
however, we really don't want to invoke the default conversion for those
types because it has awful performance (e.g., a multipolygon would
be a `list(list(list(data.frame)))`) and there's no need to create that
many intermediate R objects between the initial state (an Arrow
array) and the final state (an sfc column). The nanoarrow_vctr allows
something like:
```r
df <- convert_array(some_array_containing_a_geoarrow_col)
st_as_sfc(df$geometry) # or s2::as_s2_geography(df$geometry), or something else
```
A side-effect of this change is that we have an escape hatch for
conversions that are lossy or contain types with no R equivalent.
A quick demo:
``` r
library(nanoarrow)
arrays <- lapply(
  list(1:5, 6:10, 11:13),
  as_nanoarrow_array
)
# A vctr can be created from any stream
(vctr <- as_nanoarrow_vctr(basic_array_stream(arrays)))
#> <nanoarrow_vctr int32[13]>
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
# Under the hood this is something like a factor() where levels are
# a list of arrays with cached offsets. This is like an Arrow ChunkedArray
str(vctr)
#> <nanoarrow_vctr int32[13]>
#> List of 3
#> $ :<nanoarrow_array int32[5]>
#> ..$ length : int 5
#> ..$ null_count: int 0
#> ..$ offset : int 0
#> ..$ buffers :List of 2
#> .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#> ..$ dictionary: NULL
#> ..$ children : list()
#> $ :<nanoarrow_array int32[5]>
#> ..$ length : int 5
#> ..$ null_count: int 0
#> ..$ offset : int 0
#> ..$ buffers :List of 2
#> .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `6 7 8 9 10`
#> ..$ dictionary: NULL
#> ..$ children : list()
#> $ :<nanoarrow_array int32[3]>
#> ..$ length : int 3
#> ..$ null_count: int 0
#> ..$ offset : int 0
#> ..$ buffers :List of 2
#> .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. ..$ :<nanoarrow_buffer data<int32>[3][12 b]> `11 12 13`
#> ..$ dictionary: NULL
#> ..$ children : list()
# vctrs can be sliced:
head(vctr)
#> <nanoarrow_vctr int32[6]>
#> [1] 1 2 3 4 5 6
# ...and can live in a data.frame
head(tibble::tibble(x = vctr))
#> # A tibble: 6 × 1
#> x
#> <nnrrw_vc>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
# They can be used as zero-copy conversion targets
array <- as_nanoarrow_array(1:5)
convert_array(array, nanoarrow_vctr())
#> <nanoarrow_vctr int32[5]>
#> [1] 1 2 3 4 5
# ...also works in a nested ptype
array <- as_nanoarrow_array(data.frame(x = 1:5))
convert_array(array, tibble::tibble(x = nanoarrow_vctr()))
#> # A tibble: 5 × 1
#> x
#> <nnrrw_vc>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
# For nested list output, it will give a slice of the original array for
# each list item
array <- as_nanoarrow_array(
  list(1:5, 6:10, NULL, 11:13),
  schema = na_list(na_int32())
)
(lst_of <- convert_array(array, vctrs::list_of(nanoarrow_vctr())))
#> <list_of<nanoarrow_vctr>[4]>
#> [[1]]
#> <nanoarrow_vctr int32[5]>
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> <nanoarrow_vctr int32[5]>
#> [1] 6 7 8 9 10
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> <nanoarrow_vctr int32[3]>
#> [1] 11 12 13
for (i in seq_along(lst_of)) {
  array <- attr(lst_of[[i]], "chunks")[[1]]
  cat(sprintf("offset: %d, length: %d\n", array$offset, array$length))
}
#> offset: 0, length: 5
#> offset: 5, length: 5
#> offset: 10, length: 3
```
<sup>Created on 2024-05-10 with [reprex
v2.1.0](https://reprex.tidyverse.org)</sup>
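The "list of arrays with cached offsets" idea from the demo can be sketched in base R. This is only an assumption about the general technique (a cumulative-length lookup, as used for chunked arrays in general); nanoarrow's actual lookup may differ, and `resolve_index()` is a hypothetical helper, with plain integer vectors standing in for the arrays.

```r
# Plain integer vectors stand in for the three arrays in the demo
chunks <- list(1:5, 6:10, 11:13)

# Cached offsets: the 0-based start of each chunk, followed by the total length
offsets <- c(0L, cumsum(lengths(chunks)))

# Map a 0-based global index to (chunk number, 0-based index within that chunk)
resolve_index <- function(i, offsets) {
  chunk <- findInterval(i, offsets)
  list(chunk = chunk, index_in_chunk = i - offsets[chunk])
}

# Global position 7 (0-based) lands in the second chunk at position 2
loc <- resolve_index(7L, offsets)
chunks[[loc$chunk]][loc$index_in_chunk + 1L]
```

Caching the offsets means each lookup is a binary search over the chunk boundaries rather than a scan over every chunk.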