coyotte508 commented Jul 11, 2025

This doesn't handle dedup at all yet, so it will not be used as is even if I merge it.

What it does:

  • Write chunks into xorb
  • Write chunk headers
  • Handle compression
  • Handle xorb size limit and xorb chunk count limit
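As a rough illustration of the list above, here is a minimal sketch of an in-memory xorb builder that writes a header before each chunk and refuses a chunk once a size or chunk-count limit would be exceeded. The 8-byte header layout, the limit values, and all names (`XorbBuilder`, `MAX_XORB_SIZE`, `MAX_XORB_CHUNKS`, `writeUint24LE`) are assumptions for illustration, not taken from this PR; the xet-core spec is the reference for the real format.

```typescript
// Hypothetical limits; the real values are defined by xet-core.
const MAX_XORB_SIZE = 64 * 1024 * 1024;
const MAX_XORB_CHUNKS = 8 * 1024;
const HEADER_SIZE = 8;

function writeUint24LE(target: Uint8Array, offset: number, value: number): void {
  target[offset] = value & 0xff;
  target[offset + 1] = (value >> 8) & 0xff;
  target[offset + 2] = (value >> 16) & 0xff;
}

class XorbBuilder {
  private readonly maxSize: number;
  private readonly maxChunks: number;
  private parts: Uint8Array[] = [];
  private chunkCount = 0;
  private size = 0;

  constructor(maxSize: number = MAX_XORB_SIZE, maxChunks: number = MAX_XORB_CHUNKS) {
    this.maxSize = maxSize;
    this.maxChunks = maxChunks;
  }

  /** Returns false (and writes nothing) if the chunk does not fit in this xorb. */
  addChunk(payload: Uint8Array, uncompressedLength: number, scheme: number): boolean {
    const newSize = this.size + HEADER_SIZE + payload.length;
    if (this.chunkCount >= this.maxChunks || newSize > this.maxSize) {
      return false;
    }
    // Assumed header layout: version (1 byte), stored length (3 bytes LE),
    // compression scheme (1 byte), uncompressed length (3 bytes LE).
    // Lengths must therefore stay below 2^24.
    const header = new Uint8Array(HEADER_SIZE);
    header[0] = 0;
    writeUint24LE(header, 1, payload.length);
    header[4] = scheme;
    writeUint24LE(header, 5, uncompressedLength);
    this.parts.push(header, payload);
    this.chunkCount++;
    this.size = newSize;
    return true;
  }

  /** Concatenates all headers and payloads into one buffer. */
  finish(): Uint8Array {
    const out = new Uint8Array(this.size);
    let offset = 0;
    for (const part of this.parts) {
      out.set(part, offset);
      offset += part.length;
    }
    return out;
  }
}
```

When `addChunk` returns false, the caller would finish the current xorb and start a new builder for the rejected chunk.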

Questions:

  • Are there xorb headers? I'm writing chunk headers before each chunk, but is there a xorb header to write at the beginning of the xorb? (Answered: no, thanks @assafvayner.)
  • For now I only compare the lz4-compressed length against the uncompressed length; I do not bother with bg4. Should I test both?
  • Is the xet backend hardened against invalid chunks/xorbs?
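On the compression question, one way to "test both" is to run every candidate scheme and keep the smallest result, falling back to the raw bytes when nothing helps. The sketch below assumes compressors are passed in as plain functions; `pickEncoding`, `Candidate`, and the scheme labels are hypothetical names, and no real lz4 or bg4 library is called here.

```typescript
type Compressor = (data: Uint8Array) => Uint8Array;

interface Candidate {
  scheme: string; // hypothetical label, e.g. "lz4" or "bg4_lz4"
  compress: Compressor;
}

interface Encoded {
  scheme: string;
  bytes: Uint8Array;
}

/** Tries every candidate and keeps the smallest output; "none" means stored as is. */
function pickEncoding(data: Uint8Array, candidates: Candidate[]): Encoded {
  let best: Encoded = { scheme: "none", bytes: data };
  for (const candidate of candidates) {
    const bytes = candidate.compress(data);
    // Only switch when the candidate is strictly smaller than the current best.
    if (bytes.length < best.bytes.length) {
      best = { scheme: candidate.scheme, bytes };
    }
  }
  return best;
}
```

The trade-off is CPU time: each extra candidate means compressing every chunk one more time.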

cc @assafvayner @seanses @hoytak @sirahd @rajatarya for viz and if you have answers to the questions :)

Member Author

Rewrote the file generated by wasm-bindgen in TS, made it work in Node & web, and added an init() function

coyotte508 changed the title from "Basic Xorb creation" to "Basic Xorb creation + Add xet-core WASM bindings" on Jul 11, 2025
  }
} finally {
  chunker.free();
  // ^ is this really needed ?
Member

A WASM module can allocate memory that lives outside the JS garbage collector, so it really depends on the wasm Chunker code

Member Author

There's some code in chunker_wasm_bg.js but I can't tell for sure, so being prudent atm
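For context on the prudent approach: bindings generated by wasm-bindgen typically expose a `free()` method on exported classes, so releasing in a `finally` block is a common pattern. The sketch below wraps that pattern in a helper; `Freeable`, `withFreeable`, and `FakeChunker` are hypothetical names, and `FakeChunker` is only a stand-in for the real wasm Chunker.

```typescript
interface Freeable {
  free(): void;
}

/** Runs `use` and always calls `free()` afterwards, even if `use` throws. */
function withFreeable<T extends Freeable, R>(resource: T, use: (resource: T) => R): R {
  try {
    return use(resource);
  } finally {
    resource.free();
  }
}

// Stand-in for a wasm-backed object; it only records whether free() was called.
class FakeChunker implements Freeable {
  freed = false;
  free(): void {
    this.freed = true;
  }
}
```

If the wasm object turns out to hold no memory outside the JS heap, the extra `free()` call costs next to nothing.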

  chunkToCopy = sourceChunks[0].subarray(0, chunk.length);
  sourceChunks[0] = sourceChunks[0].subarray(chunk.length);
} else {
  chunkToCopy = new Uint8Array(chunk.length);
Member

Could be optimized for less memory allocation btw

Member Author

coyotte508, Jul 15, 2025

one thing we could do is keep a permanent Uint8Array of MAX_CHUNK_SIZE bytes that we reuse?

(since we only have one chunkToCopy at a time)
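A minimal sketch of that reuse idea, assuming a single scratch buffer sized to `MAX_CHUNK_SIZE` and a hypothetical `takeChunk` helper. The 128 KiB value is an assumption, and the function is an illustration, not the code in this PR.

```typescript
const MAX_CHUNK_SIZE = 128 * 1024; // assumed value
const scratch = new Uint8Array(MAX_CHUNK_SIZE);

/**
 * Copies the next `length` bytes out of `sourceChunks` into the shared scratch
 * buffer and returns a view on it. The view is only valid until the next call.
 */
function takeChunk(sourceChunks: Uint8Array[], length: number): Uint8Array {
  if (length > scratch.length) {
    throw new RangeError("chunk larger than MAX_CHUNK_SIZE");
  }
  let copied = 0;
  while (copied < length) {
    const head = sourceChunks[0];
    if (head === undefined) {
      throw new RangeError("not enough source data");
    }
    const n = Math.min(head.length, length - copied);
    scratch.set(head.subarray(0, n), copied);
    copied += n;
    if (n === head.length) {
      sourceChunks.shift(); // head fully consumed
    } else {
      sourceChunks[0] = head.subarray(n); // keep the unread tail
    }
  }
  return scratch.subarray(0, length);
}
```

The catch is aliasing: every returned view shares the same memory, so a consumer that keeps the chunk past the next call has to copy it first.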

let chunkToCopy: Uint8Array;
if (chunk.length === sourceChunks[0].length) {
  chunkToCopy = sourceChunks[0];
  sourceChunks.shift();
Member

This looks less efficient than an index-based approach, no?

Member Author

this is just the leftover data from the chunking process. There should be at most 128kB of it I think, and even in 16kB chunks that's only 8 chunks, i.e. sourceChunks.length <= 8

So with those params I don't think we care too much
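For comparison, the index-based approach raised in the review could look roughly like the sketch below: a cursor (array index plus byte offset) advances through the source arrays instead of calling `shift()` and re-slicing. `SourceCursor` is a hypothetical name, and with at most 8 source arrays the difference is likely negligible, as noted above.

```typescript
class SourceCursor {
  private readonly chunks: Uint8Array[];
  private index = 0; // index of the first array that still has unread bytes
  private offset = 0; // bytes already consumed from chunks[index]

  constructor(chunks: Uint8Array[]) {
    this.chunks = chunks;
  }

  /** Copies the next `length` bytes into a new array and advances the cursor. */
  take(length: number): Uint8Array {
    const out = new Uint8Array(length);
    let copied = 0;
    while (copied < length) {
      const current = this.chunks[this.index];
      if (current === undefined) {
        throw new RangeError("not enough source data");
      }
      const n = Math.min(current.length - this.offset, length - copied);
      out.set(current.subarray(this.offset, this.offset + n), copied);
      copied += n;
      this.offset += n;
      if (this.offset === current.length) {
        // Move on to the next array without mutating the list.
        this.index++;
        this.offset = 0;
      }
    }
    return out;
  }
}
```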

Comment on lines +16 to +21
initPromise = new Promise((_resolve, _reject) => {
  resolve = _resolve;
  reject = _reject;
});

await Promise.resolve();
Member

This is a slightly suspicious way to create a singleton: you can't recreate it if the wasm import fails

Member Author

coyotte508, Jul 15, 2025

this await in particular is to ensure resolve and reject are assigned before running the rest of the code

not sure it's necessary
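Two points from this thread can be sketched together. First, the executor passed to `new Promise` runs synchronously, so `resolve` and `reject` are already assigned when the constructor returns. Second, a cached init promise can be cleared on failure so a later call can retry. In the sketch below, `loadWasm` is a hypothetical stand-in for the real wasm import that fails on its first call; none of this is the code from the PR.

```typescript
let initPromise: Promise<void> | undefined;

// Hypothetical loader standing in for the wasm import; it fails on the first attempt.
let attempts = 0;
async function loadWasm(): Promise<void> {
  attempts++;
  if (attempts === 1) {
    throw new Error("simulated import failure");
  }
}

/** Returns the shared init promise, creating it on first use or after a failure. */
function init(): Promise<void> {
  if (!initPromise) {
    initPromise = loadWasm().catch((err: unknown) => {
      initPromise = undefined; // drop the failed promise so the next call retries
      throw err;
    });
  }
  return initPromise;
}

// The executor runs synchronously inside the Promise constructor.
let resolveFn: (() => void) | undefined;
new Promise<void>((resolve) => {
  resolveFn = resolve;
});
const assignedSynchronously = typeof resolveFn === "function";
```

Concurrent callers still share one in-flight promise; only a rejected attempt is discarded.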

coyotte508 commented Jul 16, 2025

Merging for now, just added wasm generation directly from xet-core, thanks @assafvayner and @hoytak

And testing bg4 compression

Will open another PR for shard generation

don't worry it's not exported / available in the published lib yet

coyotte508 merged commit 762ef41 into main Jul 16, 2025
3 of 5 checks passed
coyotte508 deleted the create-xorb branch July 16, 2025 13:47
coyotte508 added a commit that referenced this pull request Aug 26, 2025
cc @Kakulukian @assafvayner for viz, follow up to #1616 

Based on
https://github.com/huggingface/xet-core/blob/7e41fb0dd7cfb276222b9668d0b97a984647721e/spec/shard.md

Need to handle:

- split into multiple shards when xorb or file info grows too big
- uploading xorbs & shards (and we need to upload xorbs before shards referencing them)
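The ordering constraint in the last item can be sketched as follows: wait for every xorb upload to succeed before sending the shard that references them. `Uploader`, `uploadXorb`, `uploadShard`, and `uploadInOrder` are hypothetical names for illustration, not an existing API.

```typescript
interface Uploader {
  uploadXorb(xorb: Uint8Array): Promise<void>;
  uploadShard(shard: Uint8Array): Promise<void>;
}

/** Uploads all xorbs (in parallel), then the shard that references them. */
async function uploadInOrder(
  uploader: Uploader,
  xorbs: Uint8Array[],
  shard: Uint8Array,
): Promise<void> {
  // If any xorb upload rejects, the shard is never sent.
  await Promise.all(xorbs.map((xorb) => uploader.uploadXorb(xorb)));
  await uploader.uploadShard(shard);
}
```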