Description
To make function evaluation properly lazy, all functions (e.g. matmul) should be implemented as LazyUDFs.
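As a rough illustration, here is a minimal sketch of how matmul could be wrapped as a LazyUDF, assuming blosc2.lazyudf accepts an empty inputs tuple plus an explicit shape and calls the UDF per output chunk with (inputs, output, offset); the names lazy_matmul and matmul_udf are hypothetical, not existing API:

```python
import numpy as np
import blosc2

def lazy_matmul(a, b):
    """Hypothetical sketch: matmul as a LazyUDF that fills only the
    requested output chunk (operands captured via closure)."""
    def matmul_udf(inputs, output, offset):
        # `output` is the buffer for the current result chunk and
        # `offset` its position within the full result.
        i0, j0 = offset
        i1, j1 = i0 + output.shape[0], j0 + output.shape[1]
        # Only the row block of `a` and the column block of `b` that
        # contribute to this chunk are sliced (and hence decompressed).
        output[:] = a[i0:i1, :] @ b[:, j0:j1]

    shape = (a.shape[0], b.shape[1])
    dtype = np.result_type(a.dtype, b.dtype)
    return blosc2.lazyudf(matmul_udf, (), dtype, shape=shape)

a = blosc2.asarray(np.random.rand(1000, 800))
b = blosc2.asarray(np.random.rand(800, 500))
c = lazy_matmul(a, b)   # nothing is computed yet
block = c[0:100]        # ideally only the chunks covering these rows get computed
```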
Secondly, the lazyexpr machinery in compute should loop over the chunks of the result. Each function must then decide which slices of its operands are needed to form the corresponding result chunk (as matmul currently does internally). Upon evaluation, even though the expression is still evaluated term by term, a function such as matmul would no longer compute its full result before the next term is processed, but only the slice(s) needed for the current output chunk. This gives a higher chance of cache hits (which is not the case today with eager execution of linalg functions and reductions).
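To make that chunk loop concrete, here is a small, self-contained sketch (plain NumPy, illustrative names only, not an existing blosc2 API) of the kind of operand_slices hook each function could expose so the engine knows which operand slices to fetch for every result chunk:

```python
import numpy as np

# Hypothetical hook: map a result region (rows, cols) of matmul(a, b)
# to the operand slices needed to produce it.
def matmul_operand_slices(out_region):
    rows, cols = out_region
    return (rows, slice(None)), (slice(None), cols)   # a[rows, :], b[:, cols]

# Engine loop, illustrated with plain NumPy arrays standing in for operands:
def chunked_matmul(a, b, chunk=(4, 4)):
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    for i in range(0, out.shape[0], chunk[0]):
        for j in range(0, out.shape[1], chunk[1]):
            region = (slice(i, min(i + chunk[0], out.shape[0])),
                      slice(j, min(j + chunk[1], out.shape[1])))
            a_sl, b_sl = matmul_operand_slices(region)
            # Only the slices that contribute to this chunk are read.
            out[region] = a[a_sl] @ b[b_sl]
    return out

a = np.random.rand(10, 7)
b = np.random.rand(7, 9)
assert np.allclose(chunked_matmul(a, b), a @ b)
```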
It should even be possible to handle something like "matmul(sum(a, axis=1), b)" by passing the desired slice from matmul down to sum, which would treat the requested slice as a desired output slice and restrict its own computation accordingly.
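A toy sketch of that slice propagation is below. Since sum(a, axis=1) in the example above is 1-D (so matmul would always need the whole vector), the sketch instead reduces a 3-D operand over its last axis so the propagated slice is non-trivial; the class names are illustrative, not blosc2 classes:

```python
import numpy as np

class SumLastAxis:
    """Lazy sum over the last axis of a 3-D operand -> 2-D result."""
    def __init__(self, x):
        self.x = x
        self.shape = x.shape[:-1]

    def compute_slice(self, region):
        rows, cols = region
        # Treat the requested region as a desired output slice and reduce
        # only the part of `x` that contributes to it.
        return self.x[rows, cols, :].sum(axis=-1)

class MatMul:
    def __init__(self, a, b):
        self.a, self.b = a, b            # `a` is itself a lazy node
        self.shape = (a.shape[0], b.shape[1])

    def compute_slice(self, region):
        rows, cols = region
        # matmul only needs the row block of `a` for this output block,
        # so it forwards exactly that request to its operand.
        left = self.a.compute_slice((rows, slice(None)))
        return left @ self.b[:, cols]

x = np.random.rand(6, 5, 4)              # sum over last axis -> (6, 5)
b = np.random.rand(5, 7)
expr = MatMul(SumLastAxis(x), b)
block = expr.compute_slice((slice(0, 2), slice(0, 3)))
assert np.allclose(block, (x.sum(axis=-1) @ b)[0:2, 0:3])
```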
This would avoid the large in-memory temporaries that currently arise from eagerly executed linear algebra functions, e.g. in "matmul(a, b) + b".
Problems:
1 - Reductions could be a problem: in "sum(a) + a", evaluating chunk by chunk would recompute the scalar "sum(a)" for every output chunk. This could be avoided with some kind of result cache for reductions (see the sketch after this list).
2 - Naturally, numexpr will always be faster for the expressions it can handle, since it compiles them into bytecode. We should therefore make sure that, whenever possible, numexpr is still used preferentially (i.e. for most elementwise functions).
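For problem 1, a minimal sketch of such a reduction result cache (illustrative names, plain NumPy standing in for the real engine) could look like this:

```python
import numpy as np

class ReductionCache:
    """Hypothetical cache so that a full reduction like sum(a) is computed
    only once, even when the expression is evaluated chunk by chunk."""
    def __init__(self):
        self._results = {}

    def get(self, op_name, operand_key, compute):
        key = (op_name, operand_key)
        if key not in self._results:
            self._results[key] = compute()   # computed on first request only
        return self._results[key]

cache = ReductionCache()
a = np.random.rand(1000, 1000)

def eval_chunk(rows):
    # The cached scalar is reused for every output chunk of "sum(a) + a".
    total = cache.get("sum", id(a), lambda: a.sum())
    return total + a[rows, :]

out = np.vstack([eval_chunk(slice(i, i + 100)) for i in range(0, 1000, 100)])
assert np.allclose(out, a.sum() + a)
```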