
Commit 2b712a1

Short sequence handling for conv1d parallel versions
1 parent: d8471f5

File tree: 2 files changed, +124 -100 lines


c_reference/include/conv1d.h

Lines changed: 19 additions & 9 deletions
@@ -7,20 +7,30 @@
 /* All the matrices/tensors are stored in the row major format
 
    NOTES for the conv layers
+   -> The conv1d & conv1d_lr layers work for all cases and can be used unconstrained.
+      There are no hard constraints for the parallel version, but a few points regarding the optimal usage are given below
+   -> Dilation = 1 (no dilation) for all cases
    -> For the non-depthwise cases, store the matrices as described below. Permutation might be necessary
-   -> The low-rank conv layers don't support depthwise computation. This is due to the out_channels % in_channels = 0 constraint.
+   -> The low-rank decomposition cannot be applied to the depthwise weight matrices. This is due to the out_channels % in_channels = 0 constraint imposed by the depthwise convolution.
       For full-rank this is satisfied since out_channels = in_channels
-      When the weight matrix is decomposed, the constraint is violated (since rank < out_channels; and out_channels = in_channels for depthwise)
+      But, when the matrix is decomposed, the constraint is violated (since rank < out_channels; rank is not divisible by in_channels)
+      Hence, since the decomposition is theoretically impossible, we have not provided the support
+      However, we suggest a less-efficient alternative => first pre-compute the weights W = W2 * W1 and then use a regular conv
    -> For the parallel cases, the non-overlapping cases of the convolution are computed in parallel using MatMul (since the blocked MatMul is faster)
-      This however is only valid when the filter is fully inside the input. There would be no non-overlapping filters for the edge cases
+      This however is only valid when the filter is fully inside the input. There is no non-overlapping structure for the edge cases
       Hence the MatVec code (regular code) is used to calculate these cases
 
-   Constraint
-   -> Due to the above reason, the parallel layers have to be used only for large in_time inputs
-      This should typically be for in_time (without the padding) greater than 3 times the kernel_size
-      For such short input cases, the code will either yield index-mismatched output or display a segmentation fault
-   -> This constraint is due to a lack of time steps to parallelize into a matrix
-      For such cases, the MatVec would need to be used
+   Important points regarding the parallel versions
+   -> Due to the above reason, the parallel layers are only recommended for large in_time inputs
+      This should typically be for in_time (without the padding) > 2 * (kernel_size + stride). Else there would not be enough time steps to parallelize efficiently
+      For other, shorter inputs the code will skip the MatMul computation and use MatVec instead (but the overhead of setting up the MatMul variables would remain)
+      For such cases, the MatVec code (conv1d and conv1d_lr) would work more efficiently
+      The RAM usage would be lower and the function would not have any overheads (calculation of the iterators and the MatMul auxiliary variables)
+   -> There is no support for depthwise for conv1d_parallel
+      The regular convolution acts on all the channels, while the depthwise acts on only one channel at a time
+      This results in non-contiguous memory access. MatMul would need to process multiple such time steps, while MatVec would only need to process one
+      Hence, MatVec can move on to the next channel earlier and works much faster
+      MatMul, in contrast, would suffer cache misses (given the small cache size of edge devices)
 */
 
 /**
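
Three of the notes in the diff above lend themselves to small illustrations. First, the suggested workaround for low-rank depthwise: pre-compute W = W2 * W1 and pass the result to a regular conv. This is a minimal sketch, not the library's API; the function name compose_lr_weights and the factor shapes (W2: out_channels x rank, W1: rank x kernel_size * in_channels) are assumptions for illustration, with only the row-major layout taken from the header.

#include <stddef.h>

/* Sketch: collapse the low-rank factors into one full-rank weight,
 * W = W2 * W1, so a regular conv can be used where the low-rank
 * depthwise path is unsupported. Assumed row-major shapes:
 * W2 [out_channels x rank], W1 [rank x cols] with
 * cols = kernel_size * in_channels, W [out_channels x cols]. */
void compose_lr_weights(float *W, const float *W2, const float *W1,
                        size_t out_channels, size_t rank, size_t cols) {
  for (size_t i = 0; i < out_channels; i++) {
    for (size_t j = 0; j < cols; j++) {
      float sum = 0.0f;
      for (size_t r = 0; r < rank; r++) {
        sum += W2[i * rank + r] * W1[r * cols + j];
      }
      W[i * cols + j] = sum; /* row-major: element (i, j) at i * cols + j */
    }
  }
}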
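Second, the in_time recommendation can be encoded as a dispatch guard. Again a sketch: the threshold is the rule of thumb from the note, and use_parallel_conv1d is a hypothetical helper, not a function declared in this header.

#include <stdbool.h>

/* Rule of thumb from the notes: the parallel layers pay off only when
 * in_time (without padding) > 2 * (kernel_size + stride). Below that,
 * the MatMul path is skipped anyway and its setup overhead is wasted,
 * so the plain conv1d / conv1d_lr calls are the better choice. */
static bool use_parallel_conv1d(unsigned in_time, unsigned kernel_size,
                                unsigned stride) {
  return in_time > 2 * (kernel_size + stride);
}

A caller would branch on this to pick the parallel layer over the regular one; the exact call signatures are not shown in this diff, so they are left out here.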
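Third, the depthwise note comes down to the memory layout: assuming the input is stored row-major as [in_time x in_channels] (consistent with the header's note), one channel's samples sit in_channels floats apart, so a depthwise filter walks memory with a stride rather than contiguously. A sketch of that access pattern; depthwise_tap and its signature are illustrative only.

/* One depthwise output tap for a single channel. Consecutive reads are
 * in_channels floats apart (input is [in_time x in_channels], row-major),
 * so the loop is strided, not contiguous. Batching many such strided time
 * steps into a MatMul causes misses on a small edge-device cache, while a
 * MatVec finishes the channel and moves on. */
float depthwise_tap(const float *input, unsigned in_channels, unsigned channel,
                    unsigned t0, unsigned kernel_size, const float *w) {
  float acc = 0.0f;
  for (unsigned k = 0; k < kernel_size; k++) {
    acc += w[k] * input[(t0 + k) * in_channels + channel]; /* stride = in_channels */
  }
  return acc;
}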
