/* All the matrices/tensors are stored in the row major format

   NOTES for the conv layers

   -> The conv1d & conv1d_lr layers work for all cases and can be used unconstrained.
      There are no hard constraints for the parallel version, but a few points regarding the optimal usage are given below
   -> Dilation = 1 (no dilation) for all cases
   -> For the non-depthwise cases, store the matrices as described below. Permutation might be necessary
   -> The low-rank decomposition cannot be applied to the depthwise weight matrices. This is due to the out_channels % in_channels = 0 constraint imposed by the depthwise convolution.
      For full-rank this is satisfied, since out_channels = in_channels
      But when the matrix is decomposed, the constraint is violated (since rank < out_channels, and rank is not divisible by in_channels)
      Hence, as the decomposition is theoretically impossible, we have not provided the support
      However, we suggest a less-efficient alternative => first pre-compute the weights W = W2 * W1 and then use a regular conv (a sketch follows below)
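      A minimal sketch of this alternative, assuming row-major factors as used
      throughout this file (the function name and exact signature here are
      illustrative, not part of this header):

        // Pre-compute W = W2 * W1 so that a regular conv can be used instead
        // W2: out_channels x rank, W1: rank x cols, W: out_channels x cols
        // (cols = kernel_size * in_channels), all row-major
        void precompute_lr_weights(float* W, const float* W2, const float* W1,
                                   unsigned out_channels, unsigned rank, unsigned cols) {
          for (unsigned i = 0; i < out_channels; i++) {
            for (unsigned j = 0; j < cols; j++) {
              float sum = 0.0f;
              for (unsigned r = 0; r < rank; r++) {
                sum += W2[i * rank + r] * W1[r * cols + j];
              }
              W[i * cols + j] = sum;
            }
          }
        }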
   -> For the parallel cases, the non-overlapping placements of the filter are computed in parallel using MatMul (since the blocked MatMul is faster)
      This, however, is only valid when the filter lies fully within the input; there are no such non-overlapping placements for the edge cases
      Hence the MatVec code (regular code) is used to compute these cases (a sketch of the non-overlapping idea follows below)
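      For intuition, a minimal sketch of the idea in the simplest case,
      stride == kernel_size with no padding, where every filter placement is
      disjoint and the whole conv collapses into one MatMul (illustrative
      names; the real parallel kernel also handles overlapping strides):

        // input: in_time x in_channels, filters: out_channels x (kernel_size * in_channels)
        // output: (in_time / kernel_size) x out_channels, all row-major
        void conv1d_as_matmul_sketch(float* output, const float* input, const float* filters,
                                     unsigned in_time, unsigned in_channels,
                                     unsigned kernel_size, unsigned out_channels) {
          unsigned out_time = in_time / kernel_size;
          unsigned cols = kernel_size * in_channels;
          for (unsigned t = 0; t < out_time; t++) {
            const float* window = input + t * cols; // disjoint windows are contiguous in row-major
            for (unsigned o = 0; o < out_channels; o++) {
              float sum = 0.0f;
              for (unsigned c = 0; c < cols; c++) {
                sum += window[c] * filters[o * cols + c];
              }
              output[t * out_channels + o] = sum;
            }
          }
        }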

   Important points regarding the parallel versions
   -> Due to the above reason, the parallel layers are only recommended for large in_time inputs
      Typically, in_time (without the padding) should be > 2 * (kernel_size + stride); else there would not be enough time-steps to parallelize efficiently
      For shorter inputs, the code will skip the MatMul computation and use MatVec instead (but the overhead of computing the MatMul-auxiliary variables would remain)
      For such cases, the MatVec code (conv1d and conv1d_lr) would work more efficiently
      The RAM usage would be lower and the function would not have the overheads (calculation of the iterators and the MatMul-auxiliary variables); a selection sketch follows below
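      A hedged sketch of this selection rule (the helper below is illustrative
      and not part of this header):

        // Decide whether the parallel (MatMul-based) layer is worthwhile,
        // using the in_time threshold recommended above
        int use_parallel_conv(unsigned in_time_unpadded, unsigned kernel_size, unsigned stride) {
          return in_time_unpadded > 2 * (kernel_size + stride);
        }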
   -> There is no depthwise support for conv1d_parallel
      The regular convolution acts on all the channels, while the depthwise version acts on only one channel at a time
      This results in non-contiguous memory access. MatMul would need to process multiple such time-steps, while the MatVec would only need to process one
      Hence, the MatVec would be able to move on to the next channel earlier and would work much faster,
      while the MatMul would incur cache misses (given the small cache size of edge devices); the strided access pattern is sketched below
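      A minimal sketch of the strided depthwise access (illustrative names;
      for a fixed channel c, consecutive taps are in_channels floats apart):

        // One depthwise output element: channel c, output time-step t
        // input: in_time x in_channels row-major, filter: the kernel_size taps for channel c
        void depthwise_tap_sketch(float* output, const float* input, const float* filter,
                                  unsigned in_channels, unsigned kernel_size,
                                  unsigned t, unsigned c) {
          float sum = 0.0f;
          for (unsigned k = 0; k < kernel_size; k++) {
            sum += input[(t + k) * in_channels + c] * filter[k]; // jump of in_channels per tap
          }
          output[t * in_channels + c] = sum;
        }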
*/

/**
|
|