@@ -30,26 +30,16 @@ source .venv/bin/activate

Install python dependencies:
``` bash
-# Hivemind
-cd hivemind_source
pip install .
-cp build/lib/hivemind/proto/* hivemind/proto/.
-pip install -e ".[all]"
-cd ..
-# Requirements
-pip install -r requirements.txt
-# Others
pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/cpu
-pip install -e ./pydantic_config
-# OpenDiLoCo
-pip install .
```

Optionally, you can install flash-attn to use Flash Attention 2.
This requires your system to have the CUDA compiler set up.
-```
+
+``` bash
# (Optional) flash-attn
-pip install flash-attn==2.5.8
+pip install flash-attn>=2.5.8
```

## Docker container
@@ -305,20 +295,10 @@ We recommend using `bf16` to avoid scaling and desynchronization issues with hivemind.


# Debugging Issues
-1. `hivemind` or `pydantic_config`
-If you are having issues with `hivemind` or `pydantic_config`, the issue could be related to submodules.
-You can clean and reinitialize the submodules from the root of the repository with the following commands:
-
-```
-git submodule deinit -f .
-git clean -xdf
-git submodule update --init --recursive
-```
-
-2. `RuntimeError: CUDA error: invalid device ordinal`
+1. `RuntimeError: CUDA error: invalid device ordinal`
A possible culprit is that your `--nproc-per-node` argument for the torchrun launcher is set incorrectly.
Please set it to an integer less than or equal to the number of GPUs you have on your machine.

-3. `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate...`
+2. `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate...`
A possible culprit is that your `--per-device-train-batch-size` is too high.
Try a smaller value.
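
To illustrate the out-of-memory remedy above: a common trick is to halve the per-device batch while doubling gradient accumulation, which lowers peak memory but keeps the effective batch per GPU unchanged. A minimal sketch of the arithmetic — the flag name comes from the section above, the numbers are hypothetical:

``` bash
# Hypothetical values: effective batch per GPU = per-device batch * accumulation steps.
per_device=16
accum=4
echo $((per_device * accum))   # effective batch: 64

# Halve --per-device-train-batch-size and double accumulation:
# peak memory drops, effective batch size stays the same.
per_device=$((per_device / 2))
accum=$((accum * 2))
echo $((per_device * accum))   # still 64
```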