Skip to content

Conversation

@ioanatia
Copy link
Contributor

Should close #127208

Part of #121950

AFAICS we don't need to ask for all fields when FORK is used and the current field resolution should work since FORK is a N-ary plan.

@ioanatia ioanatia added >non-issue :Analytics/ES|QL AKA ESQL Team:SearchOrg Meta label for the Search Org (Enterprise Search) v9.1.0 labels May 20, 2025
@ioanatia ioanatia marked this pull request as ready for review May 20, 2025 18:24
@ioanatia ioanatia requested a review from ChrisHegarty May 20, 2025 18:24
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed Team:SearchOrg Meta label for the Search Org (Enterprise Search) labels May 20, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From code and tests in csv-spec files I noticed that FORK is "generating" a _fork (sort of hidden) field. Can you, please, add tests with queries that make use of this _fork field?

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query below (from rerank.csv-spec) is using as a field name _fork which is strange (many other queries using fork and rrf do this). This is the full list of field names it's asking for: book_no, author, title, _fork, title.*, _fork.*, author.*, book_no.*.

FROM books METADATA _id, _index, _score
            | FORK ( WHERE title:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
            ( WHERE author:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
            | RRF
            | RERANK "Tolkien" ON title WITH test_reranker
            | LIMIT 2
            | KEEP book_no, title, author

EsqlSession.fieldNames automatically ignores MetadataAttributes (like _id, _score, _index and others). I'm sure you've discussed about this: what was the reason for not considering _fork as a MetadataAttribute?

| WHERE a > 2000
| EVAL b = a + 100
| FORK (WHERE c > 1 AND a < 10000 | EVAL d = a + 500)
(STATS x = count(*), y=min(z))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm - now that I am looking at this more - I think we should return all fields, because one branch is not bounded by any command like KEEP or STATS that resets the output to a known list of fields. will fix 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one more query where I am confused, in part because of probably not knowing what is the expectation for fork.

FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 )

This shows these columns:

min(salary) | gender | _fork | salary

Should these be a "union" kind of set of columns and fieldNames is only limiting it to what it "sees" in the fork's first branch? If that's true, this means that fieldNames should consider a union kind of field names from all the branches of fork. As a shortcut, the first branch that it finds that's the "widest" it should stop checking the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FORK output should be a union of the outputs of the fork branches.

If we detect that the same field exists in multiple fork branches outputs, but the field has different type, we fail with a validation error.

FORK will pad with null columns the output of FORK branches that are missing fields.
For example:

ROW a = [1, 2, 3], b = "foo"
| MV_EXPAND a
| FORK (WHERE true | LIMIT 2)
        (STATS x = count(*))
        (WHERE a % 2 == 0 | EVAL y = 2)
| SORT _fork, a
| KEEP _fork, a, b, x, y

Will produce:

     _fork     |       a       |       b       |       x       |       y       
---------------+---------------+---------------+---------------+---------------
fork1          |1              |foo            |null           |null           
fork1          |2              |foo            |null           |null           
fork2          |null           |null           |3              |null           
fork3          |2              |foo            |null           |2          

I think fixed the field resolution for this case.
In this test example here we should ask for all fields, since not all branches are bounded by an Aggregate or Project.

@ioanatia
Copy link
Contributor Author

ioanatia commented May 21, 2025

EsqlSession.fieldNames automatically ignores MetadataAttributes (like _id, _score, _index and others). I'm sure you've discussed about this: what was the reason for not considering _fork as a MetadataAttribute?

We could consider _fork a metadata attribute - but unlike _id, _index, you could actually have a _fork mapped field.
Promise to take this as a separate question that we can address, but for now we end up asking field_caps for the _fork field when we see it referenced.

Copy link
Contributor

@tteofili tteofili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Please, also, add that query I mentioned in one of my comments: FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 ) to IndexResolverFieldNamesTests. Thank you.

@ioanatia ioanatia merged commit c8581b0 into elastic:main May 21, 2025
17 of 18 checks passed
@ioanatia ioanatia deleted the fork_field_names branch May 21, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ES|QL: tests for FORK's evaluation of field names used in field_caps resolve calls

5 participants