Skip to content

Conversation

@smalyshev
Copy link
Contributor

@smalyshev smalyshev commented Apr 14, 2025

This is a partial fix for how ESQL works with CCS and filters. This contains:

  • Track the clusters that have been excluded by the filters or empty wildcards
  • Do not error out on filtered-out indices
  • Check for disconnects and missing indices on both lookups when filter is involved
  • Add tests for filter interactions with CCS mechanics

Missing index handling for index expressions including existing/non-existing mix, and some LIMIT 0 scenarios, are still broken, not intending to fix it in this patch.

Implements tests for #118054

@smalyshev smalyshev changed the title Test for CCS with filters Improve filter handling for ESQL CCS Apr 17, 2025
@smalyshev smalyshev force-pushed the fix-filter-ccs branch 2 times, most recently from 26a7147 to 7fea91d Compare April 18, 2025 19:30
Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass review. I didn't review the tests yet and I need to do a deeper round on the core logic, but these are my initial thoughts.

@smalyshev smalyshev requested review from astefan and dnhatn April 21, 2025 16:26
@smalyshev smalyshev added :Analytics/ES|QL AKA ESQL :Search Foundations/CCS v8.19.0 auto-backport Automatically create backport pull requests when merged >bug >non-issue labels Apr 21, 2025
return originalTypes;
}

public String toString() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required for the fix strictly speaking, but it improves observability of columns and makes it much easier to inspect them when debugging.

@smalyshev smalyshev marked this pull request as ready for review April 21, 2025 18:02
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Apr 21, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)


private String getInexistentIndexErrorMessage() {
return "\"reason\" : \"Found 1 problem\\nline 1:1: Unknown index ";
return "Unknown index ";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any valid reason for this change?

Fyi, the line:column information there is relevant for the accuracy of the parsing error messaging system. Those two numbers are linked to the Node itself (which comes from the ANTLR parser) and is used by the Verifier to build a consistent error message, no matter the error message itself and which Node it comes from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is too strict as it doesn't allow changing any details of how the errors are reported, and that led to some failures, but this may be older versions of the patch and I'll re-check why exactly it is needed. In general, I think our tests relying on specific error messages makes some things hard to refactor or fix (i.e. we have some parts of code produce "unknown index" and some "no such index" and we have a lot of tests that rely on that, and that means we essentially are testing implementation details instead of functionality, down to exact wording of the error message). In this particular case, beyond checking that there's an unknown index, it also checks that this is the only problem and that it is reported in a very specific way - which means if we changed how exactly unknown indices are handled in a particular case (e.g., say, moved the check from runtime to planning time, as we may have to do), this test would likely break. It may not be necessary now for this patch - going to check that - but I think this is something we may want to consider.

assertThat(clusterMetatata.getFailedShards(), equalTo(0));
}

protected void assertClusterMetadataNoShards(EsqlExecutionInfo.Cluster clusterMetatata, int shards, long took, String indexExpression) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be simplified by calling the one above assertClusterMetadataSuccess and passing 0 as the shards parameter. Also, this method doesn't even use the int shards parameter and can be completely removed from the method signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can clean up those. I left them somewhat verbose to make the debugging easier but now I can condense them. I do want to keep the different methods because that makes it easier to see which case is happening from looking at the test code, but I certainly can make it less copy-pasty.

}

protected EsqlQueryResponse runQuery(String query, Boolean ccsMetadataInResponse, QueryBuilder filter) {
EsqlQueryRequest request = EsqlQueryRequest.syncEsqlQueryRequest();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this, also, be execute with async search as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I didn't even look into async side yet, had my hands full with the sync case. I'll look into that now.

try (EsqlQueryResponse resp = runQuery("from logs-*,c*:logs-*", randomBoolean(), filter)) {
List<List<Object>> values = getValuesList(resp);
assertThat(values, hasSize(docsTest1));
// FIXME: this is currently inconsistent with the non-wildcard case, since empty wildcard is not an error,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the issue here and the expected behavior.

I will describe a test I performed in non-CCS scenario where, from my point of view, things look ok. Data and index names are the one we use in CSV unit and IT tests.

  1. from airports*,emp* metadata _index | keep *name*, _index

This returns columns as first_name | last_name | name | _index and data from indices

airports_mp                        
employees_incompatible             
employees                          
airports_web                       
airports_not_indexed_nor_doc_values
airports_not_indexed               
airports_no_doc_values             
airports
  1. I am adding a filter to the request to filter out all the airports* indices:
    "query":"from airports*,emp* metadata _index | keep *name*, _index",
    "filter": {
        "bool": {
            "filter": [
                {
                    "exists": {
                        "field": "emp_no"
                    }
                }
            ]
        }
    }

and I get back columns as first_name | last_name | _index and data from indices

employees_incompatible
employees 

If I understand the FIXME right, the issue is that the remote columns are not added to the response, and this is consistent with the existent non-CCS behavior. And those columns shouldn't be added to the response, since the filter eliminates completely those indices (and, thus, their columns).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple, reproduceable scenario is appreciated.
Would be worth having this scenario reproduced (and compared with) in both CCS and non-CCS scenarios.

Thank you 🙏

Copy link
Contributor Author

@smalyshev smalyshev Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that the columns are sometimes added and sometimes aren't, depending on obscure factors like presence of wildcards (even though the underlying index is the same), other indices, structure of the query, etc. Basically, if anything is not resolved on the first lookup, the second lookup will add the fields - but whether or not second lookup happens may be completely unrelated to the index in question (e.g. second lookup can happen because of something related to another index), and I think this would look completely random to the user. IMO there should be a consistent system of when the fields are there, and it should not depend on the implementation details of two lookups that it is now.
I think it doesn't really depend on CCS, just easier to set up and reproduce in CCS scenarios. This patch is not intending to fix it, just make the more serious failures (like filtered queries resulting in unknown index errors) go away.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smalyshev I think it is important to see those very specific scenarios of inconsistency, since this is a new functionality that haven't been tried by our users much.

To give a bit more context, the way field names are resolved using field_caps has been a functionality that didn't change for at least 5 years in various projects (but using more or less the same code).

ES|QL is the first one where this field_caps filtering is happening and the possible inconsistency of the columns in the response is exactly a likely side-effect/bug/known issue of this change. On one side, ES|QL gains (in most cases) performance improvement by having the filter applied to field_caps, but there may be unknown inconsistencies we might not have foreseen, thus my previous asks about a specific use case without CCS where this might be an inconsistency. If we know of the specific use cases we could either improve the code or update our documentation with this information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into whether I can reproduce the inconsistency without CCS.

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor requests/questions. LGTM

&& executionInfo.getClusterStates(EsqlExecutionInfo.Cluster.Status.RUNNING).findAny().isEmpty()) {
// for a CCS, if all clusters have been marked as SKIPPED, nothing to search so send a sentinel Exception
// to let the LogicalPlanActionListener decide how to proceed
LOGGER.debug("No more clusters to search, ending analysis stage");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logging useful? If we see that in the logs, how does it help? Is it useful for debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is as it shows what is happening in the debug log - and yes, this is mainly for debugging.

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @smalyshev

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@smalyshev smalyshev enabled auto-merge (squash) April 23, 2025 14:04
@smalyshev smalyshev merged commit 12451b6 into elastic:main Apr 23, 2025
16 of 17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126807

@smalyshev
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

smalyshev added a commit to smalyshev/elasticsearch that referenced this pull request Apr 23, 2025
* Test for CCS with filters
* Partial fix for CCS/filters problems

(cherry picked from commit 12451b6)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/ColumnInfoImpl.java
elasticsearchmachine pushed a commit that referenced this pull request Apr 23, 2025
* Test for CCS with filters
* Partial fix for CCS/filters problems

(cherry picked from commit 12451b6)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/ColumnInfoImpl.java
@smalyshev smalyshev deleted the fix-filter-ccs branch April 24, 2025 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug >non-issue :Search Foundations/CCS Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.19.0 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants