-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Support geohash, geotile and geohex grid types #129581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support geohash, geotile and geohex grid types #129581
Conversation
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Hi @craigtaverner, I've created a changelog YAML for you. |
Added EsqlCapabily to block older BWC Changed license IT to new API
long geoGridId = getGridId(point, gridId, gridType); | ||
return gridId == geoGridId; | ||
} else { | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be easy to implement though. It is ok to do it in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More over, I think we should do it generic, no point in distinguish a point from another geometry, we have code that that this already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like a lot this approach as a grid cell cannot be really represented by a geometry. I am +1 to this approach as far as ESQL folks are good with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to carefully consider how to deal with the breaking change that this proposes, because we remove some functions altogether. Update: Breaking change itself was performed in #129839, this PR isn't breaking anymore but adds functionality.
Additionally, the new functions added by this aren't snapshot-only, either, and will be made available immediately. It's better to have them bake a little before un-snapshotting them in a dedicated PR.
Finally, this adds multiple data types and wires them immediately - I much prefer we do that in a dedicated PR, this is a big change and can easily introduce bugs.
...sql/src/main/java/org/elasticsearch/xpack/esql/expression/function/EsqlFunctionRegistry.java
Outdated
Show resolved
Hide resolved
#125143 added 9 spatial grid functions and released them into Serverless. We think this is not the best long-term approach and the functions in #129581 are likely better. As a first step, rmove the spatial grid functions added in #125143 from release builds so they don't get released into 8.19/9.1. --------- Co-authored-by: Craig Taverner <[email protected]>
elastic#125143 added 9 spatial grid functions and released them into Serverless. We think this is not the best long-term approach and the functions in elastic#129581 are likely better. As a first step, rmove the spatial grid functions added in elastic#125143 from release builds so they don't get released into 8.19/9.1. --------- Co-authored-by: Craig Taverner <[email protected]> (cherry picked from commit efb1397) # Conflicts: # docs/reference/query-languages/esql/_snippets/lists/spatial-functions.md # docs/reference/query-languages/esql/functions-operators/spatial-functions.md
The simplification favoured placing the error on the first parameter, but we'd rather place it on the second.
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/type/DataType.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is lovely. Thanks for splitting this @craigtaverner, much much easier to digest this way.
I found one inconsistency in the behavior of the conversion functions that should be addressed: you cannot convert a geotile type to itself, in contrast to our other data types.
Otherwise I only have minor test suggestions.
Please proceed at your own discretion and feel free to address anything in follow-up PRs if it's easier to get this chonky PR merged, first; I imagine the merge conflicts are not fun to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to support count_distinct
in a follow-up? It currently raises a verification exception when used with geohash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I investigated and it appears CountDistinct uses isExact
to ensure that only actual fields (not reference attributes) are ever used. Not sure why this distinction is there, but it does block all types that are not fields (and the grid types are never es fields). So if we ever want to support this in future it is a bigger lift because it means supported non-fields types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my mistake, I read the code wrong. It does not require fields, but does have explicit support for each type, unlike Count. However, I can see that none of the spatial types are supported. It would be easy to support all spatial types as BytesRef, and grid types as long, but I think we should do that in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we add support for this, let's also add a csv test?
We could index some documents with keywords representing geohashes, sometimes being null. Then from idx | eval x = hash_as_keyword::geohash | stats count(x)
should return the non-null count. And similarly for hex and tiles.
Also applies to the other agg functions that get geotile/hash/hex support in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Updated the existing tests to use COUNT(grid)
instead of COUNT(*)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a csv test that uses this with geotiles/hex/hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The grid converters are missing the trivial conversion for the same type to itself. Our other converters support this, but for instance row x = (\"u3bu\"::geohash)::geohash
raises a verification exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add csv tests with the MV_... functions and the new data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Added an MV test for each testing a few MV functions, including the new MV_CONTAINS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here, a csv test to confirm this working e2e would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one doesn't have a corresponding test update, neither in union tests nor in csv tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, dear, this one is a bit of a can of worms, as there seems to be wider support in the operator than in covered by the tests, outside of grid support. Notably datetime. I think a separate issue to resolve all inconsistencies is needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR at #129581 did not carefully check that all tests also work as release tests. This lead to some CI failures, which we are fixing here.
) In the work at #129581 we added new geo-grid types: `GEOHASH`, `GEOTILE` and `GEOHEX`, as well as support functions for creating these from `geo_point`, `long` and `keyword`. However, one of the key use cases we wish to replicate from the Query DSL is the grid search, and this involves the need to include the grid id inside a search predicate, in particular the `ST_INTERSECTS` function. For example: ``` FROM airports | WHERE ST_INTERSECTS(location, TO_GEOTILE("3/4/3")) | STATS count = COUNT(*), centroid = ST_CENTROID_AGG(location) ``` Since ST_INTERSECTS and ST_DISJOINT are converse of each other, we also added this support to that function.
The PR at #129581 added three new types: `geohash`, `geotile` and `geohex`, and support functions for creating these from strings, longs and geo_point fields. However all this was done under SNAPSHOT. Now it's time to move it into `tech-preview`.
The PR at #125143 added support for
ST_GEOHASH
,ST_GEOTILE
andST_GEOHEX
. However, since it usedlong
as the internal type for the grid id, there was need for many additional functions for converting thelong
to and fromkeyword
as well as generatinggeo_shape
cell bounds for map display. Each pf these involved many more files (for the functions, their docs and the generated evaluators). With inspiration from PostGIS we decided to take a different direction, and instead use a new internal type for each grid:geohash
for theST_GEOHASH
function, created from literal using eitherTO_GEOHASH(hash)
orhash::geohash
geotile
for theST_GEOTILE
function, created from literal using eitherTO_GEOTILE(tile)
ortile::geotile
geohex
for theST_GEOHEX
function, created from literal using eitherTO_GEOHEX(h3)
orh3::geohex
This also leads to much stricter type checking as we can no longer use the
long
as a plainlong
in all functions that accept longs, or inadvertently using a geohash in a geotile function. However, the addition of new types involves a lot of boilerplate, especially considering the large number of functions that operate on all types, and need to be informed of the existence of these three new types.One of the main goals of this work was to also support the concept of a geogrid search. This means "find all documents that intersect a grid cell", similar to the Query DSL version. This is achieved by enabling the use of these three new types in the
ST_INTERSECTS
function, but that will be done in a followup PR.Checklist