@@ -30,26 +30,14 @@ source .venv/bin/activate
Install python dependencies:
```bash
- # Hivemind
- cd hivemind_source
- pip install .
- cp build/lib/hivemind/proto/* hivemind/proto/.
- pip install -e ".[all]"
- cd ..
- # Requirements
- pip install -r requirements.txt
- # Others
- pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/cpu
- pip install -e ./pydantic_config
- # OpenDiLoCo
pip install .
```
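To confirm the simplified `pip install .` step took effect, a quick import probe can help. A minimal sketch, assuming the package registers under the name `open_diloco` (an assumption; adjust it to whatever name your install actually uses):

```shell
# Probe for the installed package without importing heavy dependencies.
# "open_diloco" is an assumed package name; replace it with yours if different.
STATUS=$(python3 -c "import importlib.util as u; print('ok' if u.find_spec('open_diloco') else 'missing')")
echo "open_diloco import check: $STATUS"
```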

Optionally, you can install flash-attn to use Flash Attention 2.
This requires your system to have the CUDA compiler set up.
```
# (Optional) flash-attn
- pip install flash-attn==2.5.8
+ pip install flash-attn>=2.5.8
```

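Since flash-attn builds CUDA kernels at install time, it can save a failed build to verify that `nvcc` is visible first. A minimal sketch (the messages are illustrative):

```shell
# Check for the CUDA compiler before attempting the flash-attn build;
# flash-attn compiles CUDA kernels at install time, so nvcc must be on PATH.
if command -v nvcc >/dev/null 2>&1; then
  HAS_NVCC=1
  echo "nvcc found: $(nvcc --version | tail -n 1)"
else
  HAS_NVCC=0
  echo "nvcc not found; install the CUDA toolkit or skip flash-attn"
fi
```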
## Docker container
@@ -305,20 +293,10 @@ We recommend using `bf16` to avoid scaling and desynchronization issues with hiv
# Debugging Issues
- 1. `hivemind` or `pydantic_config`
- If you are having issues with `hivemind` or `pydantic_config`, the issue could be related to submodules.
- You can clean and reinitialize the submodules from the root of the repository with the following commands:
-
- ```
- git submodule deinit -f .
- git clean -xdf
- git submodule update --init --recursive
- ```
-
- 2. `RuntimeError: CUDA error: invalid device ordinal`
+ 1. `RuntimeError: CUDA error: invalid device ordinal`

A possible culprit is that your `--nproc-per-node` argument for the torchrun launcher is set incorrectly.
Please set it to an integer less than or equal to the number of GPUs you have on your machine.

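One way to avoid guessing this value is to derive `--nproc-per-node` from the number of visible GPUs. A minimal sketch, assuming `nvidia-smi` is the query tool and falling back to 1 when it is unavailable; the actual launch line is left as a commented placeholder:

```shell
# Count visible GPUs; fall back to 1 when nvidia-smi is unavailable.
NUM_GPUS=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
[ "$NUM_GPUS" -ge 1 ] || NUM_GPUS=1
echo "would launch with --nproc-per-node=$NUM_GPUS"
# torchrun --nproc-per-node="$NUM_GPUS" <your training script>
```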
- 3. `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate...`
+ 2. `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate...`

A possible culprit is that your `--per-device-train-batch-size` is too high.
Try a smaller value.
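When tuning the batch size down, a simple halving search is usually enough. A sketch under stated assumptions: the starting value 32 and the training-script placeholder are purely illustrative, and the real torchrun line is left commented out:

```shell
# Halve --per-device-train-batch-size until the run fits in GPU memory.
# Starting value 32 is an assumption; the torchrun line is a placeholder.
BATCH=32
while [ "$BATCH" -ge 1 ]; do
  echo "trying --per-device-train-batch-size $BATCH"
  # torchrun --nproc-per-node=1 <your training script> --per-device-train-batch-size "$BATCH" && break
  BATCH=$((BATCH / 2))
done
```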