-
Notifications
You must be signed in to change notification settings - Fork 563
Basic Xorb creation + Add xet-core WASM bindings #1616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewrote the file generated by wasmbindgen to TS + make it work in node & web + add init() function
| } | ||
| } finally { | ||
| chunker.free(); | ||
| // ^ is this really needed ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WASM module can allocate memory that exists outside js garbage collector, so it depends really on the wasm Chunker code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some code in chunker_wasm_bg.js but I can't tell for sure, so being prudent atm
| chunkToCopy = sourceChunks[0].subarray(0, chunk.length); | ||
| sourceChunks[0] = sourceChunks[0].subarray(chunk.length); | ||
| } else { | ||
| chunkToCopy = new Uint8Array(chunk.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be optimized for less memory allocation btw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thing we could do is have a permanent MAX_CHUNK_SIZE Uint8 array that we reuse?
(since we only have one chunkToCopy at a time)
| let chunkToCopy: Uint8Array; | ||
| if (chunk.length === sourceChunks[0].length) { | ||
| chunkToCopy = sourceChunks[0]; | ||
| sourceChunks.shift(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks more inefficient than using an index approach no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just the leftover data from the chunking process, there should be at most 128kB of it I think and even in 16kB chunks it's only 8 chunks, eg sourceChunks.length <= 8
So with those params I don't think we care too much
| initPromise = new Promise((_resolve, _reject) => { | ||
| resolve = _resolve; | ||
| reject = _reject; | ||
| }); | ||
|
|
||
| await Promise.resolve(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit suspicious as a method to create a singleton you can't recreate if the wasm import is failing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this await in particular is to enforce resolve and reject are assigned before running the rest of the code
not sure it's necessary
Co-authored-by: Sylvestre Bouchot <[email protected]>
|
Merging for now, just added wasm generation directly from xet-core, thanks @assafvayner and @hoytak And testing bg4 compression Will open another PR for shard generation don't worry it's not exported / availalbe in the published lib yet |
cc @Kakulukian @assafvayner for viz, follow up to #1616 Based on https://github.com/huggingface/xet-core/blob/7e41fb0dd7cfb276222b9668d0b97a984647721e/spec/shard.md Need to handle: - split into multiple shards when xorb or file info grows too big - uploading xorbs & shards (and we need to upload xorbs before shards referencing them)
This doesn't handle dedup at all yet, will not be used as is even if I merge it.
What it does:
Questions:
Are there xorb headers? I'm writing chunk headers before each chunk, but is there a xorb header at the beginning of the xorb to write?No, thanks @assafvaynercc @assafvayner @seanses @hoytak @sirahd @rajatarya for viz and if you have answers to the questions :)