
Conversation

@shenzheyu

No description provided.

@GuanhuaWang
Contributor

@hwchen2017, please follow up on this PR. Thank you!

loadams and others added 19 commits March 5, 2025 17:55
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Propagate API change.

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
- Add ZeRO-2 test
- Minor fixes for the transformers version update and DeepSpeed master merge.

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
With bf16 and MoE, refreshing optimizer state from a bf16 checkpoint raises:
IndexError: list index out of range

Signed-off-by: shaomin <wukon1992@gmail.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.4
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
@jeffra and I fixed this many years ago, so this brings the doc to a
correct state.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Description
This PR adds Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as a backend for training tasks.

---------

Signed-off-by: siqi <siqi@tecorigin.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
More information on libuv in PyTorch:
https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html
Issue tracking the prevalence of the error on Windows (unresolved at the
time of this PR): pytorch/pytorch#139990
libuv GitHub: https://github.com/libuv/libuv

Windows error:
```
  File "C:\hostedtoolcache\windows\Python\3.12.7\x64\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
```

`use_libuv` isn't well supported on Windows in PyTorch < 2.4, so we need to
guard against this case.
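A minimal sketch of such a guard, assuming we key off the PyTorch version and platform (the helper name and its callers are illustrative, not the PR's actual code):

```python
def tcp_store_kwargs(torch_version: str, platform: str) -> dict:
    """Return extra TCPStore kwargs, disabling libuv where it is unsupported.

    Hypothetical helper; the real guard in this PR lives in DeepSpeed's
    distributed init path.
    """
    major, minor = (int(part) for part in torch_version.split(".")[:2])
    if platform == "win32" and (major, minor) < (2, 4):
        # PyTorch wheels before 2.4 were often built without libuv support
        # on Windows, so fall back to the legacy TCPStore backend.
        return {"use_libuv": False}
    return {}
```

The returned dict can then be splatted into the `TCPStore(...)` call so newer PyTorch versions keep their default libuv behavior.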

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
@fukun07 and I discovered a bug when using the `offload_states` and
`reload_states` APIs of the ZeRO-3 optimizer. When using grouped
parameters (for example, in weight decay or grouped LR scenarios), the
order of the parameter mapping in `reload_states`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953))
does not match the order used to initialize `self.lp_param_buffer`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)),
which leads to misaligned parameter loading. This issue was missed
by the corresponding unit tests
([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)),
so this PR fixes the bug and adds the corresponding unit tests.
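A toy sketch of the failure mode (all names illustrative, not DeepSpeed internals): offsets into a flat buffer are only valid if the reload walks parameters in exactly the order the buffer was built in.

```python
# Parameters partitioned into groups, e.g. for weight decay vs. no decay.
param_groups = {"decay": ["w1", "w2"], "no_decay": ["b1"]}

# Buffer build order: flatten groups in their original order -> w1, w2, b1.
build_order = [p for group in param_groups.values() for p in group]

# A reload that iterates in any other order (here, sorted by name) reads
# each parameter's bytes from another parameter's offset in the buffer.
buggy_reload_order = sorted(build_order)

assert build_order != buggy_reload_order  # -> misaligned parameter loading
```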

---------

Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Following changes in PyTorch trace rules, my previous PR to avoid graph
breaks caused by the logger is no longer relevant. Instead, I've added
this functionality to torch dynamo:
pytorch/pytorch@16ea0dd
This commit lets the user configure torch to ignore logger methods and
avoid the associated graph breaks.

To ignore all logger methods, set
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1".
To ignore logger methods except for specific ones (for
example, `info` and `isEnabledFor`), additionally set
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info,
isEnabledFor".
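The two settings above as a usage sketch (the environment variable names come from this PR; set them before compilation starts):

```python
import os

# Ignore all logger methods while torch.compile is tracing, avoiding the
# associated graph breaks.
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"

# Optionally keep selected methods active; a comma-separated list.
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info,isEnabledFor"
```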

Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Co-authored-by: snahir <snahir@habana.ai>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
The partition tensor doesn't need to be moved to the current device when
meta load is used.

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
…t` (deepspeedai#7069)

With future changes coming to pip/python/etc., we need to stop calling
`python setup.py ...` and replace those invocations; see:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted


![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the `build` package, which is added here as
well.

Additionally, we pass the `--sdist` flag to build only the sdist rather
than the wheel.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
Add DeepSeek-V3 AutoTP support.

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Zheyu SHEN <zyshen@umd.edu>
@loadams
Collaborator

loadams commented Aug 11, 2025

@shenzheyu - could you please resolve merge conflicts and then we can get this reviewed? Thanks!

@GuanhuaWang
Contributor

@shenzheyu, please help here. Thanks!

@hwchen2017 hwchen2017 marked this pull request as draft September 3, 2025 05:45
