Zblocker64

No description provided.

@andy108369
Contributor

andy108369 commented Mar 18, 2024

Thank you for the PR!

This has only been tested up to the point where the SHM-related error occurs.

It is blocked on akash-network/support#179.

If you have access to the provider, you can run it by setting up /dev/shm as a memory-backed K8s path, as explained in #507 (comment).
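To confirm from inside the container that /dev/shm is actually memory-backed and large enough, a quick check along these lines may help. This is only a sketch: the helper names mount_fstype and fs_size_kb are made up for illustration, and the size the model needs is not stated in this thread. For context, container runtimes often default to a small /dev/shm (Docker's default is 64 MiB), and in Kubernetes a memory-backed path is typically an emptyDir volume with medium set to Memory; whether that matches the exact setup in #507 is an assumption.

```shell
#!/bin/sh
# Print the filesystem type mounted at a given mount point.
# $1: mount point (e.g. /dev/shm)
# $2: mounts table to read (defaults to /proc/mounts)
mount_fstype() {
    awk -v target="$1" '$2 == target { fstype = $3 } END { print fstype }' \
        "${2:-/proc/mounts}"
}

# Print the total size, in KiB, of the filesystem holding a path.
# $1: path to inspect (defaults to /dev/shm)
fs_size_kb() {
    df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}

# Usage inside the container:
#   mount_fstype /dev/shm    # a memory-backed mount reports tmpfs
#   fs_size_kb /dev/shm      # compare against what the workload needs
```

A tmpfs result alone does not prove the volume is big enough, since the small runtime default is also tmpfs, so the size is the more useful number.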

@andy108369
Contributor

@Zblocker64 it appears you are using the /dev/shm => /root/shm workaround; please remove it:

root@grok-1-596d68d5c7-5cq9f:/app# ps auxwwf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          32  0.0  0.0   4608  2016 pts/0    Ss   20:14   0:00 bash
root         206  0.0  0.0   8480  2016 pts/0    R+   20:15   0:00  \_ ps auxwwf
root           1  0.0  0.0   2576     0 ?        Ss   20:12   0:00 /bin/sh -c pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user ; huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False ;  mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0 ; mkdir /root/shm ; sed -i "s;/dev/shm/;/root/shm/;g" /app/checkpoint.py ; pip install -r requirements.txt ; python run.py
root          22  284  0.0 715020 323064 ?       Sl   20:13   6:51 /usr/local/bin/python /usr/local/bin/huggingface-cli download xai-org/grok-1 --repo-type model --include ckpt-0/* --local-dir /app/checkpoints --local-dir-use-symlinks False

Additionally, it is suggested to use pip install -r requirements.txt instead of pip install <one-by-one-manually>

refs.

  1. Readme https://github.com/xai-org/grok-1
  2. https://github.com/xai-org/grok-1/issues/164#issuecomment-2004750281
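For illustration, the start-up sequence from the process list above could look roughly like this once the workaround is dropped. This is a sketch, not the image's actual entrypoint: the commands are copied from the ps output with only the mkdir /root/shm and sed steps removed, the function names start_grok and has_shm_workaround are hypothetical, and it assumes the working directory is /app.

```shell
#!/bin/sh
# Return 0 if the given file still contains the /root/shm workaround path.
# $1: file to check, e.g. /app/checkpoint.py
has_shm_workaround() {
    grep -q '/root/shm/' "$1"
}

# Start-up sequence without the workaround. The body runs in a subshell
# with `set -e`, so the first failing step stops the sequence.
start_grok() (
    set -e
    pip install -U "jax[cuda12_pip]" \
        -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --user
    # The include pattern is quoted so the shell passes it through unexpanded.
    huggingface-cli download xai-org/grok-1 --repo-type model \
        --include 'ckpt-0/*' --local-dir /app/checkpoints \
        --local-dir-use-symlinks False
    # Keep the rename from the original command, but only when it applies.
    if [ -d /app/checkpoints/ckpt ] && [ ! -e /app/checkpoints/ckpt-0 ]; then
        mv /app/checkpoints/ckpt /app/checkpoints/ckpt-0
    fi
    pip install -r requirements.txt
    python run.py
)

# Usage:
#   has_shm_workaround /app/checkpoint.py && echo "workaround still applied"
#   start_grok
```

Unlike the original `;`-separated command, this version stops at the first failed step instead of carrying on to python run.py.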

@Zblocker64
Author

Zblocker64 commented Mar 18, 2024

Just pushed an update to Docker Hub. You can use latest or 1.0 as the tag.

@andy108369
Contributor

andy108369 commented Mar 18, 2024

@andy108369
Contributor

Please do not use this image (or any xai-org grok-1 image) on H100s!
It still locks up the latest NVIDIA driver (550.54.15), which then forces us to reboot the affected nodes.

Details
https://github.com/xai-org/grok-1/issues/164#issuecomment-2022572399
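One way to avoid starting the workload on an affected node by accident is a pre-flight check on the GPU model. The following is a minimal sketch under stated assumptions: the helpers is_h100 and refuse_on_h100 are hypothetical and not part of the image, and matching the string "H100" in the name reported by nvidia-smi is an assumption about how these GPUs identify themselves.

```shell
#!/bin/sh
# Return 0 if the given text (a GPU name, or several lines of them)
# mentions an H100.
is_h100() {
    case "$1" in
        *H100*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical guard: return non-zero before the workload starts if any
# visible GPU is an H100. Returns 2 if nvidia-smi itself fails.
refuse_on_h100() {
    gpus=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader) || return 2
    if is_h100 "$gpus"; then
        printf 'refusing to start grok-1 on H100:\n%s\n' "$gpus" >&2
        return 1
    fi
}

# Usage:
#   refuse_on_h100 && python run.py
```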
