RuntimeError: CUDA out of memory. Tried to allocate 494.00 MiB (GPU 0; 39.44 GiB total capacity; 13.79 MiB already allocated; 168.62 MiB free; 22.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: CUDA error: out of memory
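When reserved memory is much larger than allocated memory, the first error message itself suggests setting `max_split_size_mb`. A minimal sketch of how to do that from Python via the `PYTORCH_CUDA_ALLOC_CONF` environment variable (the value 128 is an illustrative choice, not a recommendation; you can equally export the variable in your job script):

```python
import os

# The CUDA caching allocator reads this variable the first time CUDA is
# initialized, so it must be set before `import torch` (or at least before
# the first CUDA call). 128 is an illustrative value in MiB.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```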
-
Check that your batch size is smaller than the smallest cluster of test data you have; otherwise the loader will try to draw more cases than are available and crash.
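The check above can be sketched in plain Python; the cluster sizes and batch size here are hypothetical, and in practice you would read the counts from your own train/validation/test splits:

```python
# Hypothetical case counts per data cluster.
cluster_sizes = {"train": 5000, "valid": 800, "test": 120}
batch_size = 64

# The batch size must not exceed the smallest cluster (here: test, 120 cases).
smallest = min(cluster_sizes.values())
assert batch_size <= smallest, (
    f"batch_size={batch_size} exceeds the smallest cluster ({smallest} cases)"
)
```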
-
Did you get a "ValueError: No avaiable training data after filtering" error? You may simply have entered the wrong data path, so no HDF5 file is found and there is no data to load.
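A quick way to rule out a wrong path before training is to check that the directory actually contains HDF5 files. A sketch with a hypothetical directory (the dummy file is created here only so the check has something to find):

```python
import glob
import os

# Hypothetical data directory; replace with the path you pass to DeepRank.
data_path = "/tmp/deeprank_demo_data"
os.makedirs(data_path, exist_ok=True)

# Create a dummy file so the check below succeeds in this self-contained demo.
open(os.path.join(data_path, "features.hdf5"), "w").close()

hdf5_files = glob.glob(os.path.join(data_path, "*.hdf5"))
if not hdf5_files:
    raise ValueError(f"No .hdf5 files found under {data_path!r}; check the path")
```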
-
Use the num_workers argument of torch.utils.data.DataLoader to have multiple workers pre-load the data during training, especially if you train on GPU. In DeepRank, you specify it in model.train(). Do not assign more num_workers than the number of CPU cores you have available (on Snellius, 18 CPU cores per GPU card).
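A small sketch of capping the worker count at the available cores; the requested value is hypothetical, and the Snellius figure of 18 cores per GPU card comes from the note above:

```python
import os

# Cap the requested worker count at the number of CPU cores the OS reports.
# On Snellius this is 18 cores per GPU card; elsewhere, ask the OS.
requested_workers = 24  # hypothetical request
available_cores = os.cpu_count() or 1
num_workers = min(requested_workers, available_cores)

# This value would then be passed on, e.g.:
#   model.train(..., num_workers=num_workers)              # DeepRank
#   DataLoader(dataset, num_workers=num_workers)           # plain PyTorch
```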
-
In your model you want to set one shape (input or output) as a fraction of a variable (e.g. input_shape/2)? You might encounter the following issue:
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of: * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
To prevent it, write your shape with integer division ('//') wrapped in 'int', as in the following example:
nn.Conv3d(input_shape[0], int(input_shape[0]//2), kernel_size=1)
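The reason the wrapping matters can be seen without touching torch at all: '/' always yields a Python float, which the layer constructors reject as a channel count, while int(... // 2) yields a plain int. The input_shape below is hypothetical:

```python
# Hypothetical grid shape: (channels, x, y, z).
input_shape = (8, 30, 30, 30)

bad = input_shape[0] / 2         # 4.0 -> float, triggers the TypeError above
good = int(input_shape[0] // 2)  # 4   -> int, accepted by nn.Conv3d

print(type(bad).__name__, type(good).__name__)  # float int
```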
-
You can check your GPU memory consumption in real time by submitting a job, connecting through ssh to the node running that job (nodename@surf.nl) and running nvidia-smi