fix: DH-21732: Compromise for stateful partition filter reordering.#7697
fix: DH-21732: Compromise for stateful partition filter reordering.#7697cpwright merged 14 commits intodeephaven:mainfrom
Conversation
|
|
||
| #### Stateful Partition Filters | ||
|
|
||
| If Deephaven is configured to use stateless filters by default (the default setting in 41.0 and later), then a _partition |
There was a problem hiding this comment.
If Deephaven is configured to use stateless filters by default (the default setting in 41.0 and later), then a partition filter (a filter that only accesses partitioning columns of the data) is only stateful if the application developer explicitly marks it as such. When Deephaven is configured to use stateless filters by default, stateful partition filters are not reordered and must be evaluated on all rows of the table.
When Deephaven is configured to use stateful filters by default, the engine is permitted to treat stateful partition filters as if they were stateless for pragmatic, performance-oriented reasons.
In particular, the ordering constraints for filters on partitioning columns may be relaxed, and rather than
evaluating the filter on every row in the table, it may only be evaluated per location. This allows common partition filters to be reordered ahead of other filters and to avoid repeated evaluation against the same value. For example, the formula filter Date=today() is stateful if filters are stateful by default, but in nearly every case, users would prefer this to be evaluated early, location-by-location.
To ensure that partition filters are not reordered and are evaluated row-by-row, you can coalesce the table prior to applying the filter. When possible, specify filters in the order they should be applied rather than relying on the engine to reorder them.
...va/io/deephaven/engine/table/impl/select/vectorchunkfilter/VectorComponentFilterWrapper.java
Show resolved
Hide resolved
| /** | ||
| * Return true if this filter is a serial filter. | ||
| */ | ||
| default boolean isSerial() { |
There was a problem hiding this comment.
This feels insufficient right now.
- Right now, most of the code that uses filters only cares about
permitParallelization(), notisSerial(). Shouldn't we have had to change something in the code that adds implicit barriers? - I think we need to clarify the distinction between
permitParallelization()andisSerial(). - Shouldn't there be more overrides? Condition filters, for example?
io.deephaven.engine.table.impl.select.ConditionFilter#permitParallelizationpays attention toQueryTable.STATELESS_FILTERS_BY_DEFAULT. I would want you to look at everyWhereFilterimpl that overridespermitParallelization(), and consider whether we need to be instead/also serial.
There was a problem hiding this comment.
The serial implicit barriers are only a select thing; not a filter thing.
There was a problem hiding this comment.
I did look at the permitParallelization overrides.
Aside from the condition filter they can be divided into two main categories:
- Delegation, which deserves the same treatment (and exists).
- Turning off parallelism for filters that can't handle it (e.g. an incremental release filter).
The only particularly relevant one that is special is ConditionFilter. The point of the change is that even though a ConditionFilter might not permit parallelization (because it is stateful); we should not consider it serial unless the user explicitly said so and permit the partition aware source table to reorder it. This requires no override for a condition filter.
There was a problem hiding this comment.
To expand upon the bit about implicit barriers; any filter that does not permit parallelism is treated as a stateful filter in AbstractFilterExecution so it is effectively serial and executed on it's own.
The fundamental difference is that in Filters, we try to evaluate F_a, F_b, F_c generally in some sequence of a; b; and c such that we use the restriction of the inputs from one filter to prevent us from evaluating unnecessary rows in the next filter. In select, if there are no explicit column dependencies between F_a, F_b and F_c we may evaluate all three columns simultaneously; because there would be no savings by evaluating one before the other.
No description provided.