Issue: Missing README and Incompatibility with Habana Containers for GPT-2 DeepSpeed Training #114
Description
- Missing README.md for Installation Instructions
I followed the instructions to run the Gaudi tutorials using the Habana container, but I noticed that this repository (Gaudi-tutorials/Lightning/DeepSpeed_Lightning) is missing a README.md file. Without clear instructions, it's unclear how to set up and run the tutorial properly.
Specifically, it is not possible to run
pip install -e .
because the README.md is missing from this folder.
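A possible workaround on the packaging side, sketched under an assumption: the issue does not show the packaging files, so this assumes the install fails because setup.py opens README.md to fill long_description. The helper name read_long_description is hypothetical and not part of the repository.

```python
from pathlib import Path


def read_long_description(folder: str, filename: str = "README.md") -> str:
    """Return the README text if the file exists, otherwise an empty string.

    Hypothetical helper: assumes setup.py reads README.md for long_description.
    """
    readme = Path(folder) / filename
    if readme.is_file():
        return readme.read_text(encoding="utf-8")
    # A missing README no longer aborts the install; the description is just empty.
    return ""
```

In a setup.py this could be passed as long_description=read_long_description("."), so that an editable install does not depend on the file being present.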
- requirements.txt Enforces Non-Habana Versions
The requirements.txt file enforces:
lightning>=2.2.0
torch>=2.2.0
habana-lightning==1.4.0
However, when using the Habana container, installing this version of torch will overwrite Habana’s custom torch package, leading to version mismatches and potential runtime errors.
Also, if the container keeps Habana's torch version, would it be more appropriate to match the habana-lightning version to it as well?
Currently, I am using habana-lightning==1.7.0rc0 because of API changes in PyTorch 2.5.0 (I am not sure in which version they were introduced), but the tutorial does not specify compatible versions.
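One way to avoid overwriting the container's packages is to drop the conflicting entries from requirements.txt before installing. The following is a minimal sketch of that idea, not the tutorial's method; the function names (filter_requirements, requirement_name, is_installed) are hypothetical, and the name parsing only handles simple lines such as the three listed above.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a package name so 'Habana_Lightning' equals 'habana-lightning'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str):
    """Extract the package name from a simple requirement line, or None."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", line)
    return normalize(match.group(1)) if match else None


def is_installed(package: str) -> bool:
    """Report whether a distribution is already present in the environment."""
    try:
        metadata.version(package)
        return True
    except metadata.PackageNotFoundError:
        return False


def filter_requirements(lines, protected):
    """Drop requirement lines whose package is in the protected set."""
    protected = {normalize(p) for p in protected}
    return [line for line in lines if requirement_name(line) not in protected]


requirements = ["lightning>=2.2.0", "torch>=2.2.0", "habana-lightning==1.4.0"]
# Keep the container's torch untouched by removing it from the install list.
print(filter_requirements(requirements, {"torch"}))
```

Inside the container, the protected set could be limited to packages for which is_installed returns True. pip may still try to replace torch if another requirement declares a torch constraint that the pre-installed build does not satisfy, so this is only a partial safeguard.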
- Compatibility Issue with DeepSpeed and PyTorch Elastic API
Because of the API changes in PyTorch's elastic module, running the tutorial in newer environments results in an error:
from torch.distributed.elastic.utils.logging import get_logger
from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
The name log can no longer be imported from torch.distributed.elastic.agent.server.api in newer versions of PyTorch. Will this tutorial pin a specific version of deepspeed, such as <1.7, to maintain compatibility?
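Until versions are pinned, the failing import could be guarded on the caller's side. The sketch below is an assumption-laden workaround, not the fix used by any of the libraries involved: it assumes that newer PyTorch releases expose the module-level logger under the name logger instead of log (I have not verified this across versions), and the helper resolve_elastic_logger is hypothetical. Since the failing import sits in dependency code, applying it would mean patching that dependency.

```python
import importlib
import logging


def resolve_elastic_logger(module_name="torch.distributed.elastic.agent.server.api"):
    """Return the module-level logger under either name, or a plain fallback."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # torch (or this submodule) is unavailable; use a standard logger.
        return logging.getLogger(module_name)
    # 'log' is the older name from the traceback; 'logger' is an assumed newer name.
    for attr in ("log", "logger"):
        candidate = getattr(module, attr, None)
        if isinstance(candidate, logging.Logger):
            return candidate
    return logging.getLogger(module_name)


log = resolve_elastic_logger()
```

This keeps the rest of the calling code unchanged, since it still receives an object named log with the usual logging methods.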
Expected Fixes
Add a README.md with installation steps.
Ensure requirements.txt is compatible with Habana containers, avoiding conflicts with pre-installed Habana versions of PyTorch and Lightning.
Clarify DeepSpeed and PyTorch compatibility to prevent version mismatches.
Thanks for maintaining this tutorial! Let me know if any additional details are needed.
I am sorry that I did not create a pull request and am reporting this through a GitHub issue instead. I am not sure whether the setup described in the Jupyter notebook should be changed to use a non-container environment.
Major errors:
- PyTorch API incompatible with dependency
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py)
- Native PyTorch versus Habana PyTorch
    if args.compile:
        if not hasattr(torch, "compile"):
            raise RuntimeError(
                f"The current torch version ({torch.__version__}) does not have support for compile."
                "Please install torch >= 1.14 or disable compile."
            )