@anhuong
Collaborator

@anhuong anhuong commented Dec 3, 2024

Description of the change

To support Mamba2ForCausalLM and JambaForCausalLM models, we needed to install the dependencies mamba_ssm and transformers with changes from Fabian's fork. Installing mamba_ssm requires the package cudnn9-cuda-12; without it we hit the error ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory. After that, causal-conv1d (pulled in by mamba_ssm) failed to install with ModuleNotFoundError: No module named 'torch', which is why we install mamba_ssm as a separate dependency after the base deps are installed.
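The two failures described above (a missing libcudnn.so.9 and a missing torch module) can be checked before attempting the mamba_ssm install. The following is only a minimal sketch, not part of this PR: the helper name missing_prerequisites is hypothetical, and it assumes that being able to load the shared library and find the torch module is a reasonable proxy for the install succeeding.

```python
import ctypes
import importlib.util


def missing_prerequisites(modules=("torch",), shared_libs=("libcudnn.so.9",)):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: the defaults mirror the two errors quoted in the
    PR description, but the function itself is an illustration only.
    """
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")
    for lib in shared_libs:
        try:
            # CDLL raises OSError when the dynamic loader cannot open the library.
            ctypes.CDLL(lib)
        except OSError as exc:
            problems.append(f"shared library not loadable: {lib} ({exc})")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Running this in the image after the base deps are installed, and before the separate mamba_ssm step, should surface either problem early instead of partway through a build.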

Related issue number

How to verify the PR

Built image with these changes and was able to run tuning on Mamba and Jamba models.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Ssukriti and others added 7 commits November 21, 2024 15:19
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@github-actions

github-actions bot commented Dec 3, 2024

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the build label Dec 3, 2024
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@fabianlim
Collaborator

@anhuong just wanted to point out that bamba is now released so no need to use my forks anymore

@Ssukriti
Collaborator

ya thanks @fabianlim. I have updated transformers. Just need to clean up the cuDNN versions tomorrow and then it should be ready

Ssukriti and others added 4 commits March 10, 2025 19:47
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@Ssukriti
Collaborator

Ssukriti commented Mar 13, 2025

Installing "mamba_ssm[causal-conv1d]" is resulting in a lot of errors in the Travis build. Unable to figure out the cause: https://v3.travis.ibm.com/github/ai-foundation/sft-trainer-image/builds/36206057 . Posting an update that this is where the PR is at; we might have to try different ways of installing.

The strange part is that the Travis build did pass once, seemingly at random, and I was able to build an image. The download links are correct for these CUDA libraries. These failures are only happening on this branch and not on main.

@aluu317 was able to build the image locally and verified it has all deps installed. It also builds on GitHub. No idea what's happening with the Travis builds.
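One possible way to do the "all deps installed" check inside a built image is to print the installed version of each package. This is a rough sketch rather than the check that was actually run: the function name installed_versions is hypothetical, and the default distribution names are assumptions based on the packages mentioned in this thread.

```python
from importlib import metadata


def installed_versions(
    distributions=("torch", "transformers", "mamba_ssm", "causal-conv1d"),
):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper; the default names are assumed from this thread.
    """
    report = {}
    for dist in distributions:
        try:
            report[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            report[dist] = None
    return report


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Comparing this output between a local build and a Travis build might help narrow down where the two environments differ.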

@Ssukriti Ssukriti marked this pull request as ready for review March 18, 2025 20:53
Collaborator

@Ssukriti Ssukriti left a comment

Thank you @aluu317 for trying different things on Travis. We now have passing builds, though we aren't sure why pinning the version of libcusparselt was causing it to fail earlier. We will keep an eye on it, and if failures resume, we will post on the Travis channel. We can probably merge for now.

@Ssukriti Ssukriti merged commit c963595 into main Mar 18, 2025
11 checks passed
@dushyantbehl dushyantbehl deleted the mamba_model branch December 18, 2025 05:42
5 participants