kaiyuan-li
Contributor

Integrate TSSD into torchstore. It is controlled by the environment variable TORCHSTORE_TSSD_ENABLED.

For put, once the storage volume detects a TSSD key, it checks whether all three keys (blob, flattened sd, and mapping) are available. If so, it dereferences the flattened sd against the blob, removes the flattened sd and the blob from the kv store, and commits each entry of the flattened, dereferenced sd back into the kv store.

This makes the storage representation identical to what the existing put path produces, so the get() code doesn't have to change.
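A rough sketch of the consolidation step described above, not the code in this PR. It uses a plain dict as the kv store and bytes as the blob. The key suffixes, `BlobRef`, and `try_consolidate` are hypothetical names made up for illustration. Leaving the mapping key in place is also an assumption, based only on the description mentioning removal of the other two.

```python
from typing import Any, Dict, NamedTuple

# Hypothetical key suffixes; the real TSSD key layout is not shown here.
BLOB_SUFFIX = "/__blob"
FLAT_SUFFIX = "/__flat_sd"
MAPPING_SUFFIX = "/__mapping"


class BlobRef(NamedTuple):
    """Hypothetical stand-in for a flattened entry that points into the blob."""
    offset: int
    length: int


def try_consolidate(kv: Dict[str, Any], prefix: str) -> bool:
    """Expand a TSSD triple under `prefix` into one kv entry per sd key.

    Returns False and leaves `kv` untouched until all three keys are present.
    """
    blob_key = prefix + BLOB_SUFFIX
    flat_key = prefix + FLAT_SUFFIX
    mapping_key = prefix + MAPPING_SUFFIX
    if not all(key in kv for key in (blob_key, flat_key, mapping_key)):
        return False

    # Remove the two internal keys; the mapping is left in place (assumption).
    blob = kv.pop(blob_key)
    flat_sd = kv.pop(flat_key)

    for name, value in flat_sd.items():
        if isinstance(value, BlobRef):
            # Dereference: replace the pointer with the bytes it refers to.
            value = blob[value.offset : value.offset + value.length]
        # Commit each entry under its own key, as a regular put would.
        kv[f"{prefix}/{name}"] = value
    return True
```

In this sketch the function is called after every put of a TSSD key and only does work once the third key arrives, so the order in which the three keys land does not matter.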

TODO:

  1. Controller notification also has to change. We need to consolidate that part, especially since the old TSSD internal keys are not removed in the controller.
  2. Verify the DTensor code path. It likely just works.

@kaiyuan-li kaiyuan-li requested a review from LucasLLC October 10, 2025 15:06
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 10, 2025
@casteryh
Contributor

casteryh commented Oct 10, 2025

Just one thought: we should either

  • keep two separate util functions (put_state_dict / put_state_dict_batch)
  • or get rid of the original implementation (you might want to keep it for benchmark, verifying correctness etc.)

instead of adding yet another flag.

I am not a fan of env variables, because keeping them consistent across multiple hosts is a fragile and non-trivial process.

