
Commit cdf4237

Update docs (#28)
1 parent 5716faf commit cdf4237

4 files changed: 118 additions, 70 deletions

README.md

Lines changed: 104 additions & 2 deletions
@@ -1,2 +1,104 @@
1-
# arrow-js-wasm-ffi
2-
Exploration of zero-copy reading of Arrow data from WebAssembly
1+
# arrow-js-ffi
2+
3+
Interpret [Arrow](https://arrow.apache.org/) memory across the WebAssembly boundary without serialization.
4+
5+
## Why?
6+
7+
Arrow is a high-performance memory layout for analytical programs. Because Arrow's memory layout is defined to be the same in every implementation, programs that use Arrow in WebAssembly use exactly the same layout that [Arrow JS](https://arrow.apache.org/docs/js/) implements. This means we can use plain [`ArrayBuffer`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer)s to move highly structured data to and from WebAssembly memory, entirely avoiding serialization.
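
To make the zero-copy idea concrete, here is a minimal sketch that uses only built-in APIs. It does not use this library: the function name `readThroughSharedView`, the byte offset, and the values are made up for illustration. A typed array constructed over `memory.buffer` is a view, so a write made through one handle is visible through the other without any bytes being copied.

```ts
// Hypothetical demo, not part of this package.
function readThroughSharedView(): number[] {
  // A standalone one-page (64 KiB) Wasm memory; no Wasm module is needed here.
  const memory = new WebAssembly.Memory({ initial: 1 });

  // Made-up location of an Int32 data buffer inside Wasm memory.
  const byteOffset = 64;

  // Pretend that Wasm code wrote four little-endian Int32 values here.
  const writer = new DataView(memory.buffer);
  [10, 20, 30, 40].forEach((value, i) => {
    writer.setInt32(byteOffset + i * 4, value, true);
  });

  // A typed array over the same ArrayBuffer is a view: nothing is copied.
  const view = new Int32Array(memory.buffer, byteOffset, 4);

  // A later write to the underlying memory is visible through the view.
  writer.setInt32(byteOffset, 99, true);

  return Array.from(view);
}
```

The sketch assumes a little-endian host, which is what `Int32Array` uses on the platforms where WebAssembly typically runs.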
8+
9+
I wrote an [interactive blog post](https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly) that goes into more detail on why this is useful and how this library implements Arrow's [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) in JavaScript.
10+
11+
## Usage
12+
13+
This package exports two functions: `parseField`, which parses the `ArrowSchema` struct into an `arrow.Field`, and `parseVector`, which parses the `ArrowArray` struct into an `arrow.Vector`.
14+
15+
### `parseField`
16+
17+
Parse an [`ArrowSchema`](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure) C FFI struct into an `arrow.Field` instance. The resulting `Field` is required when calling `parseVector` below.
18+
19+
- `buffer` (`ArrayBuffer`): The underlying buffer of the [`WebAssembly.Memory`](https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory) instance to read from, that is, `memory.buffer`.
20+
- `ptr` (`number`): The numeric pointer in `buffer` where the C struct is located.
21+
22+
```ts
23+
const WASM_MEMORY: WebAssembly.Memory = ...
24+
const field = parseField(WASM_MEMORY.buffer, fieldPtr);
25+
```
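
As a rough illustration of what reading such a struct involves, the sketch below pulls a few members of an `ArrowSchema` out of an `ArrayBuffer` with a `DataView`. It is not this package's implementation. The helper names (`readCString`, `readSchemaHeader`, `buildFakeSchema`) are hypothetical, and the offsets assume a wasm32 C ABI with 4-byte pointers and 8-byte-aligned `int64_t` members. The struct is faked in a plain `ArrayBuffer` so the example runs without a Wasm module.

```ts
// Member offsets of `struct ArrowSchema` under the assumed wasm32 layout.
const SCHEMA_OFFSETS = {
  format: 0, // const char*
  name: 4, // const char*
  metadata: 8, // const char*
  flags: 16, // int64_t (4 bytes of padding precede it)
  nChildren: 24, // int64_t
};

// Read a null-terminated UTF-8 C string starting at `ptr`.
function readCString(buffer: ArrayBuffer, ptr: number): string {
  const bytes = new Uint8Array(buffer);
  let end = ptr;
  while (bytes[end] !== 0) end++;
  return new TextDecoder().decode(bytes.subarray(ptr, end));
}

// Hypothetical helper: read a few ArrowSchema members located at `ptr`.
function readSchemaHeader(buffer: ArrayBuffer, ptr: number) {
  const view = new DataView(buffer);
  // Pointers are 32-bit little-endian integers in wasm32.
  const formatPtr = view.getUint32(ptr + SCHEMA_OFFSETS.format, true);
  const namePtr = view.getUint32(ptr + SCHEMA_OFFSETS.name, true);
  return {
    format: readCString(buffer, formatPtr),
    name: readCString(buffer, namePtr),
    flags: view.getBigInt64(ptr + SCHEMA_OFFSETS.flags, true),
    nChildren: view.getBigInt64(ptr + SCHEMA_OFFSETS.nChildren, true),
  };
}

// Build a fake struct so the sketch is runnable on its own.
function buildFakeSchema(): { buffer: ArrayBuffer; ptr: number } {
  const buffer = new ArrayBuffer(256);
  const bytes = new Uint8Array(buffer);
  const view = new DataView(buffer);
  const encoder = new TextEncoder();
  bytes.set(encoder.encode("i\0"), 8); // format string "i" means int32
  bytes.set(encoder.encode("col1\0"), 16); // field name
  const ptr = 64; // made-up address of the struct itself
  view.setUint32(ptr + SCHEMA_OFFSETS.format, 8, true);
  view.setUint32(ptr + SCHEMA_OFFSETS.name, 16, true);
  view.setBigInt64(ptr + SCHEMA_OFFSETS.flags, BigInt(2), true); // ARROW_FLAG_NULLABLE
  view.setBigInt64(ptr + SCHEMA_OFFSETS.nChildren, BigInt(0), true);
  return { buffer, ptr };
}

const fakeSchema = buildFakeSchema();
const schemaHeader = readSchemaHeader(fakeSchema.buffer, fakeSchema.ptr);
```

With real Wasm memory, `buffer` would be `WASM_MEMORY.buffer` and `ptr` the address returned by the Wasm side.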
26+
27+
### `parseVector`
28+
29+
Parse an [`ArrowArray`](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowarray-structure) C FFI struct into an [`arrow.Vector`](https://arrow.apache.org/docs/js/classes/Arrow_dom.Vector.html) instance. Multiple `Vector` instances can be joined to make an [`arrow.Table`](https://arrow.apache.org/docs/js/classes/Arrow_dom.Table.html).
30+
31+
- `buffer` (`ArrayBuffer`): The underlying buffer of the [`WebAssembly.Memory`](https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory) instance to read from, that is, `memory.buffer`.
32+
- `ptr` (`number`): The numeric pointer in `buffer` where the C struct is located.
33+
- `dataType` (`arrow.DataType`): The type of the vector to parse. This is retrieved from `field.type` on the result of `parseField`.
34+
- `copy` (`boolean`): If `true`, _copies_ the data across the Wasm boundary, so the original on the Wasm side can be freed. If `false`, the resulting `arrow.Vector` objects are _views_ on Wasm memory. Views require careful use: they become invalid if the underlying memory region in Wasm changes, for example when the data is freed or when the Wasm memory grows and the previous `ArrayBuffer` is detached.
35+
36+
```ts
37+
const WASM_MEMORY: WebAssembly.Memory = ...
38+
const wasmVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type);
39+
// Copy arrays into JS instead of creating views
40+
const copiedVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type, true);
41+
```
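
The difference between a view and a copy can be shown with built-in APIs alone. The sketch below does not call `parseVector`. The function name `viewVersusCopyAfterGrow` is hypothetical, and plain typed arrays stand in for Arrow vectors. It assumes a non-shared `WebAssembly.Memory`, where growing the memory detaches the previous `ArrayBuffer`.

```ts
// Hypothetical demo: compare a view and a copy after Wasm memory grows.
function viewVersusCopyAfterGrow(): { viewLength: number; copy: number[] } {
  const memory = new WebAssembly.Memory({ initial: 1 });

  // A view shares its bytes with Wasm memory.
  const view = new Float64Array(memory.buffer, 0, 3);
  view.set([1.5, 2.5, 3.5]);

  // `slice()` allocates an independent copy on the JS side.
  const copy = view.slice();

  // Growing the memory detaches the old ArrayBuffer, which empties the view.
  memory.grow(1);

  return { viewLength: view.length, copy: Array.from(copy) };
}
```

After the call, the view reports a length of zero while the copy still holds its three values. This is the kind of invalidation that `copy: true` guards against.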
42+
43+
## Type Support
44+
45+
Most of the unsupported types should be pretty straightforward to implement; they just need some testing.
46+
47+
### Primitive Types
48+
49+
- [x] Null
50+
- [x] Boolean
51+
- [x] Int8
52+
- [x] Uint8
53+
- [x] Int16
54+
- [x] Uint16
55+
- [x] Int32
56+
- [x] Uint32
57+
- [x] Int64
58+
- [x] Uint64
59+
- [x] Float16
60+
- [x] Float32
61+
- [x] Float64
62+
63+
### Binary & String
64+
65+
- [x] Binary
66+
- [ ] Large Binary (uses int64 offsets; not supported by Arrow JS, but downcasting could be implemented in the future)
67+
- [x] String
68+
- [ ] Large String (uses int64 offsets; not supported by Arrow JS, but downcasting could be implemented in the future)
69+
- [x] Fixed-width Binary
70+
71+
### Decimal
72+
73+
- [ ] Decimal128 (failing a test)
74+
- [ ] Decimal256 (failing a test)
75+
76+
### Temporal Types
77+
78+
- [x] Date32
79+
- [x] Date64
80+
- [x] Time32
81+
- [x] Time64
82+
- [x] Timestamp (with timezone)
83+
- [ ] Duration
84+
- [ ] Interval
85+
86+
### Nested Types
87+
88+
- [x] List
89+
- [ ] Large List (uses int64 offsets; not supported by Arrow JS, but downcasting could be implemented in the future)
90+
- [x] Fixed-size List
91+
- [x] Struct
92+
- [ ] Map
93+
- [ ] Dense Union
94+
- [ ] Sparse Union
95+
- [ ] Dictionary-encoded arrays
96+
97+
### Extension Types
98+
99+
- [x] Field metadata is preserved.
100+
101+
## TODO:
102+
103+
- Call the release callback on the C structs. This requires figuring out how to call C function pointers from JS.
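
One possible direction, sketched under assumptions: in wasm32 modules built with LLVM-based toolchains, a C function pointer is usually an index into the module's function table, so JS can call it through a `WebAssembly.Table`. Whether and under what name the table is exported depends on the toolchain. The tiny hand-assembled module, its export name `t`, and the helper `callFunctionPointer` below are made up so that the example runs on its own. A real release callback would also take the struct pointer as its argument.

```ts
// Bytes of a tiny hand-assembled Wasm module (illustration only). It defines one
// function `() -> i32` that returns 42, places it in slot 0 of a function table,
// and exports that table as "t".
const TINY_MODULE = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f, // type section: () -> i32
  0x03, 0x02, 0x01, 0x00, // function section: func 0 has type 0
  0x04, 0x04, 0x01, 0x70, 0x00, 0x01, // table section: one funcref slot
  0x07, 0x05, 0x01, 0x01, 0x74, 0x01, 0x00, // export section: table 0 as "t"
  0x09, 0x07, 0x01, 0x00, 0x41, 0x00, 0x0b, 0x01, 0x00, // element: slot 0 = func 0
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x41, 0x2a, 0x0b, // code: i32.const 42; end
]);

// Hypothetical helper: treat `functionPtr` as a table index and call it.
function callFunctionPointer(functionPtr: number): number {
  const wasmModule = new WebAssembly.Module(TINY_MODULE);
  const instance = new WebAssembly.Instance(wasmModule);
  const table = instance.exports.t as WebAssembly.Table;
  const fn = table.get(functionPtr) as () => number;
  return fn();
}
```

In this sketch `callFunctionPointer(0)` invokes the function stored in slot 0 of the table.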
104+

src/field.ts

Lines changed: 6 additions & 3 deletions
@@ -41,9 +41,12 @@ const formatMapping: Record<string, arrow.DataType | undefined> = {
4141
};
4242

4343
/**
44-
* Parse Field from Arrow C Data Interface
44+
Parse an [`ArrowSchema`](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure) C FFI struct into an `arrow.Field` instance. The resulting `Field` is required when calling `parseVector`.
45+
46+
- `buffer` (`ArrayBuffer`): The underlying buffer of the [`WebAssembly.Memory`](https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory) instance to read from, that is, `memory.buffer`.
47+
- `ptr` (`number`): The numeric pointer in `buffer` where the C struct is located.
4548
*/
46-
export function parseField(buffer: ArrayBuffer, ptr: number) {
49+
export function parseField(buffer: ArrayBuffer, ptr: number): arrow.Field {
4750
const dataView = new DataView(buffer);
4851

4952
const formatPtr = dataView.getUint32(ptr, true);
@@ -147,7 +150,6 @@ export function parseField(buffer: ArrayBuffer, ptr: number) {
147150
throw new Error(`Unsupported format: ${formatString}`);
148151
}
149152

150-
// https://stackoverflow.com/a/9954810
151153
function parseFlags(flag: bigint): Flags {
152154
if (flag === 0n) {
153155
return {
@@ -157,6 +159,7 @@ function parseFlags(flag: bigint): Flags {
157159
};
158160
}
159161

162+
// https://stackoverflow.com/a/9954810
160163
let parsed = flag.toString(2);
161164
return {
162165
nullable: parsed[0] === "1" ? true : false,

src/vector.ts

Lines changed: 8 additions & 0 deletions
@@ -3,6 +3,14 @@ import { DataType } from "apache-arrow";
33

44
type NullBitmap = Uint8Array | null | undefined;
55

6+
/**
7+
Parse an [`ArrowArray`](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowarray-structure) C FFI struct into an [`arrow.Vector`](https://arrow.apache.org/docs/js/classes/Arrow_dom.Vector.html) instance. Multiple `Vector` instances can be joined to make an [`arrow.Table`](https://arrow.apache.org/docs/js/classes/Arrow_dom.Table.html).
8+
9+
- `buffer` (`ArrayBuffer`): The underlying buffer of the [`WebAssembly.Memory`](https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory) instance to read from, that is, `memory.buffer`.
10+
- `ptr` (`number`): The numeric pointer in `buffer` where the C struct is located.
11+
- `dataType` (`arrow.DataType`): The type of the vector to parse. This is retrieved from `field.type` on the result of `parseField`.
12+
- `copy` (`boolean`): If `true`, _copies_ the data across the Wasm boundary, so the original on the Wasm side can be freed. If `false`, the resulting `arrow.Vector` objects are _views_ on Wasm memory. Views require careful use: they become invalid if the underlying memory region in Wasm changes, for example when the data is freed or when the Wasm memory grows and the previous `ArrayBuffer` is detached.
13+
*/
614
export function parseVector<T extends DataType>(
715
buffer: ArrayBuffer,
816
ptr: number,

tests/ffi.ts

Lines changed: 0 additions & 65 deletions
@@ -499,68 +499,3 @@ test.skip("timestamp", (t) => {
499499
}
500500
t.end();
501501
});
502-
503-
// console.log(originalVector.getChildAt(0)?.toArray());
504-
// console.log(wasmVector.toJSON());
505-
506-
// test.skip("utf8 non-null", (t) => {
507-
// const table = arrow.tableFromArrays({
508-
// col1: ["a", "b", "c", "d"],
509-
// });
510-
// const ffiTable = arrowTableToFFI(table);
511-
// console.log("test");
512-
// const fieldPtr = ffiTable.schemaAddr(0);
513-
// const field = parseField(WASM_MEMORY.buffer, fieldPtr);
514-
515-
// t.equals(field.name, "col1");
516-
// t.equals(field.typeId, new arrow.Utf8().typeId);
517-
// t.equals(field.nullable, false);
518-
519-
// const arrayPtr = ffiTable.arrayAddr(0, 0);
520-
// const wasmVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type);
521-
// t.equals(wasmVector, table.getChildAt(0));
522-
523-
// console.log("table", table);
524-
// console.log("table", table.schema.fields);
525-
// console.log(table.toString());
526-
527-
// t.end();
528-
529-
// // const builder = arrow.makeBuilder({
530-
// // type: new arrow.Utf8(),
531-
// // nullValues: null
532-
// // });
533-
// // builder.append("a");
534-
// // builder.append("b");
535-
// // builder.append("c");
536-
// // builder.append("d");
537-
538-
// // const vector = builder.finish().toVector();
539-
// // const schema = new arrow.Schema([new arrow.Field("col1", new arrow.Utf8(), false)]);
540-
541-
// // // new arrow.RecordBatch()
542-
// // // arrow.makeData<arrow.Struct>()
543-
// // // @ts-ignore
544-
// // const recordBatchData = arrow.makeData<arrow.Struct>({ type: new arrow.Struct(), children: [vector.data] });
545-
// // const recordBatch = new arrow.RecordBatch(
546-
// // schema,
547-
// // recordBatchData
548-
// // );
549-
// // const table = new arrow.Table(schema, recordBatch);
550-
// // console.log('table', table.schema.fields)
551-
552-
// // const ffiTable = arrowTableToFFI(table);
553-
// // const fieldPtr = ffiTable.schemaAddr(0);
554-
// // const field = parseField(WASM_MEMORY.buffer, fieldPtr);
555-
556-
// // t.equals(field.name, "col1");
557-
// // t.equals(field.typeId, new arrow.Utf8().typeId);
558-
// // t.equals(field.nullable, false);
559-
560-
// // const arrayPtr = ffiTable.arrayAddr(0, 0);
561-
// // const wasmVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type);
562-
// // console.log(wasmVector.toString())
563-
564-
// // t.end();
565-
// // new arrow.RecordBatch(schema, )
566-
// });
