Commit 0f9ac9d

Use SearchStats instead of field.isAggregatable in data node planning (#115744)
Since ES|QL makes use of field-caps and only considers `isAggregatable` during Lucene pushdown, turning off doc-values disables Lucene pushdown. This is incorrect. The physical planning decision for Lucene pushdown is made during local planning on the data node, at which point `SearchStats` are known, and both `isIndexed` and `hasDocValues` are separately knowable. The Lucene pushdown should happen for `isIndexed` and not consider `hasDocValues` at all.

This PR adds `hasDocValues` to `SearchStats` and then uses `isIndexed` and `hasDocValues` separately during local physical planning on the data nodes. This immediately cleared up one issue for spatial data, which could not push down a Lucene query when doc-values was disabled.

Summary of what `isAggregatable` means for different implementations of `MappedFieldType`:

* The default implementation of `isAggregatable` in `MappedFieldType` is `hasDocValues`, and does not consider `isIndexed`.
* All classes that extend `AbstractScriptFieldType` (eg. `LongScriptFieldType`) hard code `isAggregatable` to `true`. This presumably means Lucene is happy to mimic having doc-values.
* `TextFieldType`, and classes that extend it, return the value of `fielddata`, so the field is considered aggregatable if there is field-data.
* `AggregateDoubleMetricFieldType` and `ConstantFieldType` are hard coded to `true`.
* `DenseVectorFieldType` is hard coded to `false`.
* `IdFieldType` returns the value of `fieldDataEnabled.getAsBoolean()`.

In no case is `isIndexed` used for `isAggregatable`. However, for our Lucene pushdown of filters, `isIndexed` would make a lot more sense. But for pushdown of TopN, `hasDocValues` makes more sense.

Summarising the results of the various options for the various field types, where `?` means configurable:

| Class | isAggregatable | isIndexed | isStored | hasDocValues |
| --- | --- | --- | --- | --- |
| AbstractScriptFieldType | true | false | false | false |
| AggregateDoubleMetricFieldType | true | true | false | false |
| DenseVectorFieldType | false | ? | false | !indexed |
| IdFieldType | fieldData | true | true | false |
| TsidExtractingIdField | false | true | true | false |
| TextFieldType | fieldData | ? | ? | false |
| ? (the rest) | hasDocValues | ? | ? | ? |

It has also been observed that we cannot push filters to source without checking `hasDocValues` when we use the `SingleValueQuery`. So this leads to three groups of conditions:

| Category | require `indexed` | require `docValues` |
| --- | --- | --- |
| Filters(single-value) | true | true |
| Filters(multi-value) | true | false |
| TopN | true | true |

And for all cases we will also consider `isAggregatable` as a disjunction to cover the script field types, leading to two possible combinations:

* `fa.isAggregatable() || searchStats.isIndexed(fa.name()) && searchStats.hasDocValues(fa.name())`
* `fa.isAggregatable() || searchStats.isIndexed(fa.name())`
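The two combinations above can be sketched as a pair of standalone predicates. This is only an illustration under stated assumptions: `PushdownSketch`, `Stats` and `FixedStats` are hypothetical names invented for the example, and `Stats` stands in for just the two `SearchStats` lookups mentioned in the description. It is not the code added by this commit.

```java
/** Sketch of the two pushdown predicates; every name here is illustrative, not the Elasticsearch API. */
final class PushdownSketch {

    /** Hypothetical stand-in for the two SearchStats lookups the predicates need. */
    interface Stats {
        boolean isIndexed(String field);

        boolean hasDocValues(String field);
    }

    /** Trivial Stats that gives the same answer for every field name. */
    static final class FixedStats implements Stats {
        private final boolean indexed;
        private final boolean docValues;

        FixedStats(boolean indexed, boolean docValues) {
            this.indexed = indexed;
            this.docValues = docValues;
        }

        @Override
        public boolean isIndexed(String field) {
            return indexed;
        }

        @Override
        public boolean hasDocValues(String field) {
            return docValues;
        }
    }

    /** Single-value filters and TopN: aggregatable, or indexed with doc-values. */
    static boolean isIndexedAndHasDocValues(boolean aggregatable, String field, Stats stats) {
        return aggregatable || (stats.isIndexed(field) && stats.hasDocValues(field));
    }

    /** Multi-value filters: aggregatable, or indexed (doc-values are not consulted). */
    static boolean isIndexed(boolean aggregatable, String field, Stats stats) {
        return aggregatable || stats.isIndexed(field);
    }

    public static void main(String[] args) {
        // A field that is indexed but has doc-values turned off.
        Stats noDocValues = new FixedStats(true, false);
        System.out.println(isIndexed(false, "location", noDocValues));                // true
        System.out.println(isIndexedAndHasDocValues(false, "location", noDocValues)); // false
        // A script-like field: neither indexed nor doc-values, but aggregatable.
        System.out.println(isIndexedAndHasDocValues(true, "location", new FixedStats(false, false))); // true
    }
}
```

The explicit parentheses in the first predicate only spell out what Java precedence already does, since `&&` binds tighter than `||`.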
1 parent f7ee3fb commit 0f9ac9d

26 files changed

+1049
-565
lines changed

docs/changelog/115744.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+pr: 115744
+summary: Use `SearchStats` instead of field.isAggregatable in data node planning
+area: ES|QL
+type: bug
+issues:
+ - 115737

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

Lines changed: 14 additions & 2 deletions
@@ -968,15 +968,27 @@ public boolean isAggregatable() {
             return fielddata;
         }
 
-        public boolean canUseSyntheticSourceDelegateForQuerying() {
+        /**
+         * Returns true if the delegate sub-field can be used for loading and querying (ie. either isIndexed or isStored is true)
+         */
+        public boolean canUseSyntheticSourceDelegateForLoading() {
             return syntheticSourceDelegate != null
                 && syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
                 && (syntheticSourceDelegate.isIndexed() || syntheticSourceDelegate.isStored());
         }
 
+        /**
+         * Returns true if the delegate sub-field can be used for querying only (ie. isIndexed must be true)
+         */
+        public boolean canUseSyntheticSourceDelegateForQuerying() {
+            return syntheticSourceDelegate != null
+                && syntheticSourceDelegate.ignoreAbove() == Integer.MAX_VALUE
+                && syntheticSourceDelegate.isIndexed();
+        }
+
         @Override
         public BlockLoader blockLoader(BlockLoaderContext blContext) {
-            if (canUseSyntheticSourceDelegateForQuerying()) {
+            if (canUseSyntheticSourceDelegateForLoading()) {
                 return new BlockLoader.Delegating(syntheticSourceDelegate.blockLoader(blContext)) {
                     @Override
                     protected String delegatingTo() {
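
The split between the loading check and the querying check in this hunk can be illustrated in isolation. The sketch below is an assumption-laden stand-in: `DelegateChecksSketch` and `Delegate` are hypothetical names, and `Delegate` only models the three settings the two methods read (`isIndexed`, `isStored`, `ignoreAbove`). It shows why a delegate that is stored but not indexed is acceptable for loading values yet not for querying.

```java
/** Sketch of the loading vs. querying checks; names are illustrative, not the TextFieldMapper API. */
final class DelegateChecksSketch {

    /** Hypothetical stand-in for the keyword sub-field settings the checks read. */
    static final class Delegate {
        final boolean indexed;
        final boolean stored;
        final int ignoreAbove;

        Delegate(boolean indexed, boolean stored, int ignoreAbove) {
            this.indexed = indexed;
            this.stored = stored;
            this.ignoreAbove = ignoreAbove;
        }
    }

    /** Loading values works if the delegate keeps every value and is either indexed or stored. */
    static boolean canUseForLoading(Delegate d) {
        return d != null && d.ignoreAbove == Integer.MAX_VALUE && (d.indexed || d.stored);
    }

    /** Querying additionally needs the delegate to be indexed; being stored is not enough. */
    static boolean canUseForQuerying(Delegate d) {
        return d != null && d.ignoreAbove == Integer.MAX_VALUE && d.indexed;
    }

    public static void main(String[] args) {
        Delegate storedOnly = new Delegate(false, true, Integer.MAX_VALUE);
        System.out.println(canUseForLoading(storedOnly));  // true
        System.out.println(canUseForQuerying(storedOnly)); // false
    }
}
```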

test/framework/src/main/java/org/elasticsearch/index/mapper/TextFieldFamilySyntheticSourceTestSetup.java

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ public static MapperTestCase.BlockReaderSupport getSupportedReaders(MapperServic
             TextFieldMapper.TextFieldType text = (TextFieldMapper.TextFieldType) ft;
             boolean supportsColumnAtATimeReader = text.syntheticSourceDelegate() != null
                 && text.syntheticSourceDelegate().hasDocValues()
-                && text.canUseSyntheticSourceDelegateForQuerying();
+                && text.canUseSyntheticSourceDelegateForLoading();
             return new MapperTestCase.BlockReaderSupport(supportsColumnAtATimeReader, mapper, loaderFieldName);
         }
         MappedFieldType parent = mapper.fieldType(parentName);

x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java

Lines changed: 7 additions & 0 deletions
@@ -72,6 +72,10 @@ public class CsvTestsDataLoader {
     private static final TestsDataset DECADES = new TestsDataset("decades");
     private static final TestsDataset AIRPORTS = new TestsDataset("airports");
     private static final TestsDataset AIRPORTS_MP = AIRPORTS.withIndex("airports_mp").withData("airports_mp.csv");
+    private static final TestsDataset AIRPORTS_NO_DOC_VALUES = new TestsDataset("airports_no_doc_values").withData("airports.csv");
+    private static final TestsDataset AIRPORTS_NOT_INDEXED = new TestsDataset("airports_not_indexed").withData("airports.csv");
+    private static final TestsDataset AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES = new TestsDataset("airports_not_indexed_nor_doc_values")
+        .withData("airports.csv");
     private static final TestsDataset AIRPORTS_WEB = new TestsDataset("airports_web");
     private static final TestsDataset DATE_NANOS = new TestsDataset("date_nanos");
     private static final TestsDataset COUNTRIES_BBOX = new TestsDataset("countries_bbox");
@@ -105,6 +109,9 @@ public class CsvTestsDataLoader {
         Map.entry(DECADES.indexName, DECADES),
         Map.entry(AIRPORTS.indexName, AIRPORTS),
         Map.entry(AIRPORTS_MP.indexName, AIRPORTS_MP),
+        Map.entry(AIRPORTS_NO_DOC_VALUES.indexName, AIRPORTS_NO_DOC_VALUES),
+        Map.entry(AIRPORTS_NOT_INDEXED.indexName, AIRPORTS_NOT_INDEXED),
+        Map.entry(AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES.indexName, AIRPORTS_NOT_INDEXED_NOR_DOC_VALUES),
         Map.entry(AIRPORTS_WEB.indexName, AIRPORTS_WEB),
         Map.entry(COUNTRIES_BBOX.indexName, COUNTRIES_BBOX),
         Map.entry(COUNTRIES_BBOX_WEB.indexName, COUNTRIES_BBOX_WEB),

x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/EsqlTestUtils.java

Lines changed: 93 additions & 9 deletions
@@ -89,6 +89,8 @@
 import java.time.Period;
 import java.util.ArrayList;
 import java.util.EnumSet;
+import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Iterator;
 import java.util.LinkedHashMap;
 import java.util.List;
@@ -206,9 +208,30 @@ public static EsRelation relation() {
         return new EsRelation(EMPTY, new EsIndex(randomAlphaOfLength(8), emptyMap()), IndexMode.STANDARD, randomBoolean());
     }
 
-    public static class TestSearchStats extends SearchStats {
-        public TestSearchStats() {
-            super(emptyList());
+    /**
+     * This version of SearchStats always returns true for all fields for all boolean methods.
+     * For custom behaviour either use {@link TestConfigurableSearchStats} or override the specific methods.
+     */
+    public static class TestSearchStats implements SearchStats {
+
+        @Override
+        public boolean exists(String field) {
+            return true;
+        }
+
+        @Override
+        public boolean isIndexed(String field) {
+            return exists(field);
+        }
+
+        @Override
+        public boolean hasDocValues(String field) {
+            return exists(field);
+        }
+
+        @Override
+        public boolean hasExactSubfield(String field) {
+            return exists(field);
         }
 
         @Override
@@ -226,11 +249,6 @@ public long count(String field, BytesRef value) {
             return exists(field) ? -1 : 0;
         }
 
-        @Override
-        public boolean exists(String field) {
-            return true;
-        }
-
         @Override
         public byte[] min(String field, DataType dataType) {
             return null;
@@ -245,10 +263,76 @@ public byte[] max(String field, DataType dataType) {
         public boolean isSingleValue(String field) {
             return false;
         }
+    }
+
+    /**
+     * This version of SearchStats can be preconfigured to return true/false for various combinations of the four field settings:
+     * <ol>
+     * <li>exists</li>
+     * <li>isIndexed</li>
+     * <li>hasDocValues</li>
+     * <li>hasExactSubfield</li>
+     * </ol>
+     * The default will return true for all fields. The include/exclude methods can be used to configure the settings for specific fields.
+     * If you call 'include' with no fields, it will switch to return false for all fields.
+     */
+    public static class TestConfigurableSearchStats extends TestSearchStats {
+        public enum Config {
+            EXISTS,
+            INDEXED,
+            DOC_VALUES,
+            EXACT_SUBFIELD
+        }
+
+        private final Map<Config, Set<String>> includes = new HashMap<>();
+        private final Map<Config, Set<String>> excludes = new HashMap<>();
+
+        public TestConfigurableSearchStats include(Config key, String... fields) {
+            // If this method is called with no fields, it is interpreted to mean include none, so we include a dummy field
+            for (String field : fields.length == 0 ? new String[] { "-" } : fields) {
+                includes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
+                excludes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
+            }
+            return this;
+        }
+
+        public TestConfigurableSearchStats exclude(Config key, String... fields) {
+            for (String field : fields) {
+                includes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
+                excludes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
+            }
+            return this;
+        }
+
+        private boolean isConfigationSet(Config config, String field) {
+            Set<String> in = includes.getOrDefault(config, Set.of());
+            Set<String> ex = excludes.getOrDefault(config, Set.of());
+            return (in.isEmpty() || in.contains(field)) && ex.contains(field) == false;
+        }
+
+        @Override
+        public boolean exists(String field) {
+            return isConfigationSet(Config.EXISTS, field);
+        }
 
         @Override
         public boolean isIndexed(String field) {
-            return exists(field);
+            return isConfigationSet(Config.INDEXED, field);
+        }
+
+        @Override
+        public boolean hasDocValues(String field) {
+            return isConfigationSet(Config.DOC_VALUES, field);
+        }
+
+        @Override
+        public boolean hasExactSubfield(String field) {
+            return isConfigationSet(Config.EXACT_SUBFIELD, field);
+        }
+
+        @Override
+        public String toString() {
+            return "TestConfigurableSearchStats{" + "includes=" + includes + ", excludes=" + excludes + '}';
         }
     }
 
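
The include/exclude lookup rule in `TestConfigurableSearchStats` has a few non-obvious cases (no configuration means true, `include` with no fields means false for everything). The standalone sketch below re-creates just that rule so the cases can be run without the Elasticsearch test framework. `IncludeExcludeSketch` and its string keys are hypothetical simplifications; the real class uses the `Config` enum and overrides the `SearchStats` methods.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Standalone re-creation of the include/exclude lookup rule; names are illustrative only. */
final class IncludeExcludeSketch {
    private final Map<String, Set<String>> includes = new HashMap<>();
    private final Map<String, Set<String>> excludes = new HashMap<>();

    IncludeExcludeSketch include(String key, String... fields) {
        // No fields means "include none": a dummy entry makes the include set non-empty.
        for (String field : fields.length == 0 ? new String[] { "-" } : fields) {
            includes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
            excludes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
        }
        return this;
    }

    IncludeExcludeSketch exclude(String key, String... fields) {
        for (String field : fields) {
            includes.computeIfAbsent(key, k -> new HashSet<>()).remove(field);
            excludes.computeIfAbsent(key, k -> new HashSet<>()).add(field);
        }
        return this;
    }

    /** True when the field is not excluded and the include set is either empty or contains it. */
    boolean isSet(String key, String field) {
        Set<String> in = includes.getOrDefault(key, Set.of());
        Set<String> ex = excludes.getOrDefault(key, Set.of());
        return (in.isEmpty() || in.contains(field)) && ex.contains(field) == false;
    }

    public static void main(String[] args) {
        // Default: everything is true.
        System.out.println(new IncludeExcludeSketch().isSet("INDEXED", "location"));                                   // true
        // Exclude one field: only that field turns false.
        System.out.println(new IncludeExcludeSketch().exclude("DOC_VALUES", "location").isSet("DOC_VALUES", "location")); // false
        // Include with no fields: every field turns false for that key.
        System.out.println(new IncludeExcludeSketch().include("INDEXED").isSet("INDEXED", "location"));                // false
    }
}
```

A test that wants "indexed but no doc-values" for a single field would, under this rule, only need one `exclude` call for the doc-values key and can leave the other keys at their defaults.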

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+{
+  "properties": {
+    "abbrev": {
+      "type": "keyword"
+    },
+    "name": {
+      "type": "text"
+    },
+    "scalerank": {
+      "type": "integer"
+    },
+    "type": {
+      "type": "keyword"
+    },
+    "location": {
+      "type": "geo_point",
+      "index": true,
+      "doc_values": false
+    },
+    "country": {
+      "type": "keyword"
+    },
+    "city": {
+      "type": "keyword"
+    },
+    "city_location": {
+      "type": "geo_point"
+    }
+  }
+}

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+{
+  "properties": {
+    "abbrev": {
+      "type": "keyword"
+    },
+    "name": {
+      "type": "text"
+    },
+    "scalerank": {
+      "type": "integer"
+    },
+    "type": {
+      "type": "keyword"
+    },
+    "location": {
+      "type": "geo_point",
+      "index": false,
+      "doc_values": true
+    },
+    "country": {
+      "type": "keyword"
+    },
+    "city": {
+      "type": "keyword"
+    },
+    "city_location": {
+      "type": "geo_point"
+    }
+  }
+}

x-pack/plugin/esql/qa/testFixtures/src/main/resources/spatial.csv-spec

Lines changed: 36 additions & 0 deletions
@@ -484,6 +484,42 @@ centroid:geo_point | count:long
 POINT (42.97109629958868 14.7552534006536) | 1
 ;
 
+centroidFromAirportsAfterIntersectsCompoundPredicateNoDocValues
+required_capability: st_intersects
+
+FROM airports_no_doc_values
+| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
+| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
+;
+
+centroid:geo_point | count:long
+POINT (42.97109629958868 14.7552534006536) | 1
+;
+
+centroidFromAirportsAfterIntersectsCompoundPredicateNotIndexedNorDocValues
+required_capability: st_intersects
+
+FROM airports_not_indexed_nor_doc_values
+| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
+| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
+;
+
+centroid:geo_point | count:long
+POINT (42.97109629958868 14.7552534006536) | 1
+;
+
+centroidFromAirportsAfterIntersectsCompoundPredicateNotIndexed
+required_capability: st_intersects
+
+FROM airports_not_indexed
+| WHERE scalerank == 9 AND ST_INTERSECTS(location, TO_GEOSHAPE("POLYGON((42 14, 43 14, 43 15, 42 15, 42 14))")) AND country == "Yemen"
+| STATS centroid=ST_CENTROID_AGG(location), count=COUNT()
+;
+
+centroid:geo_point | count:long
+POINT (42.97109629958868 14.7552534006536) | 1
+;
+
 ###############################################
 # Tests for ST_INTERSECTS on GEO_POINT type
 
