This repository was archived by the owner on Sep 18, 2025. It is now read-only.

Issue: Missing README and Incompatibility with Habana Containers for gpt2 deep speed training #114

@kahlun

Description

  1. Missing README.md for Installation Instructions
    I followed the instructions to run the Gaudi tutorials in the Habana container, but this folder (Gaudi-tutorials/Lightning/DeepSpeed_Lightning) has no README.md file. Without instructions, it is unclear how to set up and run the tutorial.

Specifically, it is not possible to run
pip install -e .
because the README.md is missing from this folder.

  2. requirements.txt Enforces Non-Habana Versions
    The requirements.txt file enforces:

lightning>=2.2.0
torch>=2.2.0
habana-lightning==1.4.0
However, in the Habana container, installing this version of torch overwrites Habana's custom torch package, leading to version mismatches and potential runtime errors.
If the container's Habana torch build is kept instead, would it be more appropriate to pin habana-lightning to a version that matches it as well?

I am currently using habana-lightning==1.7.0rc0 because of API changes in PyTorch 2.5.0 (I am not sure in which version they were introduced), but the tutorial does not specify compatible versions.
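One possible workaround, sketched below, is to filter requirements.txt before installing so that packages the container already ships are left alone. This is only an illustration: `filter_requirements` and `normalize` are hypothetical helpers, and treating torch as the only preinstalled package is an assumption. pip may still replace torch as a transitive dependency if the installed build does not satisfy another package's constraints.

```python
import re

# Matches the distribution name at the start of a requirement line.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Normalize a distribution name so 'Habana_Lightning' equals 'habana-lightning'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def filter_requirements(lines, preinstalled):
    """Return the requirement lines whose package is not in `preinstalled`.

    Hypothetical helper: blank lines and comments are dropped as well.
    """
    skip = {normalize(p) for p in preinstalled}
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = _NAME_RE.match(stripped)
        if match and normalize(match.group(1)) in skip:
            continue  # leave the container's own build untouched
        kept.append(stripped)
    return kept


# The three lines quoted from requirements.txt in this issue.
requirements = ["lightning>=2.2.0", "torch>=2.2.0", "habana-lightning==1.4.0"]

# Assumption: torch is the package provided by the container.
print(filter_requirements(requirements, preinstalled={"torch"}))
```

The remaining lines could then be written to a temporary file and passed to pip install -r.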

  3. Compatibility Issue with DeepSpeed and PyTorch Elastic API
    Because of API changes in PyTorch's elastic module, running the tutorial in newer environments fails on these imports:

from torch.distributed.elastic.utils.logging import get_logger
from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
The name log can no longer be imported from torch.distributed.elastic.agent.server.api in newer versions of PyTorch. Will this tutorial pin a specific version of deepspeed, such as <1.7, to maintain compatibility?
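To confirm which of the imported names are still available in a given environment, a small check like the following might help. It is a minimal sketch: `has_symbol` is a hypothetical helper, and it simply reports False when torch is not installed at all.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if `module_name` can be imported and exposes `symbol`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its parents) is missing in this environment.
        return False
    return hasattr(module, symbol)


ELASTIC_API = "torch.distributed.elastic.agent.server.api"

# The two names imported in the failing line quoted above.
for name in ("log", "_get_socket_with_port"):
    print(f"{ELASTIC_API}.{name}: {has_symbol(ELASTIC_API, name)}")
```

Running this inside the Habana container before and after installing requirements.txt would show whether the installed torch still exposes both names.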

Expected Fixes
Add a README.md with installation steps.
Ensure requirements.txt is compatible with Habana containers, avoiding conflicts with pre-installed Habana versions of PyTorch and Lightning.
Clarify DeepSpeed and PyTorch compatibility to prevent version mismatches.
Thanks for maintaining this tutorial! Let me know if any additional details are needed.

Sorry that I raised this as a GitHub issue rather than a pull request: I am not sure whether the approach described in the Jupyter notebook should be changed, for example to a non-container setup.

Major errors:

  1. PyTorch API incompatible with dependency
    ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py)

  2. Native PyTorch vs. Habana PyTorch
    if args.compile:
        if not hasattr(torch, "compile"):
            raise RuntimeError(
                f"The current torch version ({torch.__version__}) does not have support for compile."
                "Please install torch >= 1.14 or disable compile."
            )
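To see whether the container's torch was replaced, it may be enough to record the installed versions before and after running pip. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper, and the distribution names listed are the ones mentioned in this issue.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for dist in distributions:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None
    return versions


# Packages discussed in this issue; adjust the list as needed.
packages = ["torch", "lightning", "habana-lightning", "deepspeed"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing the two outputs should make any change to the torch build visible.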
