-
Notifications
You must be signed in to change notification settings - Fork 202
feat: add tiledb_query_add_predicate API to parse a SQL expression string into a QueryCondition
#5566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…han a copy in order to add Rust binding (#5567) Differences in copy and move semantics between Rust and C++ mean that they are generally not very good at passing values across the boundary - most of the time it is necessary to pass references instead. #5566 intends to call `Attribute::get_enumeration_name` from Rust. Prior to these changes, that function returns `std::optional<std::string>`, which cannot be passed across the boundary (neither `std::optional` nor `std::string` can be). We can either pass a pointer to a string, or pass a string contained within a smart pointer. To avoid the additional memory allocations required for both copying the `std::string` and placing it in a `unique_ptr`, we choose the former. However, `std::optional` does not naturally support references, so we cannot do `std::optional<std::string&>`. Instead we change the return type to `std::optional<std::reference_wrapper<std::string>>`. In addition to enabling passing the result of this function to Rust without allocating additional memory, this also avoids making additional copies of the string. This is not likely to matter from a performance perspective, but is still nice! --- TYPE: NO_HISTORY DESC: Rust binding for `Attribute::get_enumeration_name`
3e877a3 to
6017582
Compare
teo-tsirpanis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preliminary review
tiledb/sm/query/query_condition.cc
Outdated
| } | ||
| } else { | ||
| throw std::logic_error( | ||
| throw QueryConditionException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, the trend of our exception (and previously status) types is to indicate where the error occured, rather than what the error was, putting us against the paradigm of exception types in C++ and other languages. This is not a good thing, and leads to C APIs almost always returning the monolithic TILEDB_ERR status code when they fail, which leads to poor error handling and reporting practices downstream, including the very fragile parsing of the error message, which I have seen doing both in testing and production code.
Of course we cannot fix this problem at once and in this PR, but what we can do is prevent the "where"-typed exceptions from proliferating in new code, or at the very least in existing code. Which is a long-winded way for me to say "I think we should revert this line (and the one below)".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree with the root notion that the error reporting needs improvement. I prefer leaving this in however. What proper error handling looks like is a question for the future; and what this change does today is at least make it easier to identify all the instances of thrown exceptions in this file by searching on a common string.
| #include "tiledb/sm/subarray/subarray.h" | ||
|
|
||
| #ifdef HAVE_RUST | ||
| #include "tiledb/oxidize/rust.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How big is rust.h? If it adds large amounts of code to the compilation, we should avoid including it in other headers, for example by converting LogicalExpr and Session to self-contained types, and storing tdb_unique_ptrs of them in the class. There is also the concern that it will affect developers' inner loop, if the header's content is too sensitive to changes in the Rust code's implementation details, it will force recompilations of the C++ code, which will be limited by moving the include to the implementation file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contents of this header are pinned to the version of our cxx crate dependency. I wouldn't be shocked if it never changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This header is generated by the cxxbridge dependency and provides C++ definition which correspond to Rust standard library types, e.g. rust::Box and rust::Vec and etc. It isn't generated by compiling any of our Rust crates.
| fn attribute_by_name(&self, name: &CxxString) -> *const Attribute; | ||
|
|
||
| #[cxx_name = "get_enumeration"] | ||
| fn const_enumeration_cxx(&self, name: &CxxString) -> SharedPtr<ConstEnumeration>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have ConstSharedPtr instead of const editions of the individual classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not. The #[cxx::bridge] macro is pretty strict about what types are allowed. SharedPtr is a special type which it knows how to expand to code on both the Rust and C++ side; custom opaque types are never allowed to be returned or passed by value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that an example in C is useful to have. As I have said in the past, the C API is hard to use directly, and users should prefer using a higher-level language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the C examples are useful for developers, if not customers (who I agree are better off using higher-level APIs). I have referred to them often.
| // Sanity check | ||
| if (sanity_check(ctx, query) == TILEDB_ERR) { | ||
| return TILEDB_ERR; | ||
| } else if (predicate == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } else if (predicate == nullptr) { | |
| } | |
| if (predicate == nullptr) { |
tiledb/sm/cpp_api/query.h
Outdated
| /** | ||
| * Adds a predicate. The predicate will be analyzed and evaluated | ||
| * in the subarray step, query condition step, or both. | ||
| * | ||
| * The predicate is parsed as a SQL expression and must evaluate | ||
| * to a boolean. | ||
| * | ||
| * @param predicate a SQL representation of the predicate | ||
| * @return Reference to this Query | ||
| */ | ||
| Query& add_predicate(const std::string& predicate) { | ||
| auto& ctx = ctx_.get(); | ||
| ctx.handle_error(tiledb_query_add_predicate( | ||
| ctx.ptr().get(), query_.get(), predicate.c_str())); | ||
| return *this; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /** | |
| * Adds a predicate. The predicate will be analyzed and evaluated | |
| * in the subarray step, query condition step, or both. | |
| * | |
| * The predicate is parsed as a SQL expression and must evaluate | |
| * to a boolean. | |
| * | |
| * @param predicate a SQL representation of the predicate | |
| * @return Reference to this Query | |
| */ | |
| Query& add_predicate(const std::string& predicate) { | |
| auto& ctx = ctx_.get(); | |
| ctx.handle_error(tiledb_query_add_predicate( | |
| ctx.ptr().get(), query_.get(), predicate.c_str())); | |
| return *this; | |
| } |
Remove experimental API from stable header?
| return Status::Ok(); | ||
| } | ||
|
|
||
| Status Query::add_predicate([[maybe_unused]] const char* predicate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code should not use Status.
tiledb/sm/query/query.cc
Outdated
| if (!expr->is_predicate(array_schema())) { | ||
| return Status_QueryError("Expression does not return a boolean value"); | ||
| } | ||
| if (expr->has_aggregate_functions()) { | ||
| return Status_QueryError( | ||
| "Aggregate functions in predicate is not supported"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would slightly prefer to have Rust do the expression's validation. Might make the FFI simpler.
tiledb/sm/query/query_condition.h
Outdated
| #ifdef HAVE_RUST | ||
| /** Constructor from a datafusion expression tree */ | ||
| QueryCondition( | ||
| const ArraySchema& array_schema, | ||
| rust::Box<tiledb::oxidize::datafusion::logical_expr::LogicalExpr>&& expr); | ||
| #endif | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it should better belong in a separate DataFusionQueryCondition class, and either have a common ancestor with QueryCondition, or be stored as an std::variant<QueryCondition, DataFusionQueryCondition>.
From the looks of the Datafusion nested class, it looks like we are halfway there.
tiledb/oxidize/CMakeLists.txt
Outdated
|
|
||
| cxxbridge( | ||
| NAME | ||
| session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
session is ambiguous; I would call it df-session.
tiledb/sm/query/query_condition.h
Outdated
| bool rewrite_to_datafusion( | ||
| const ArraySchema& array_schema, | ||
| tiledb::oxidize::arrow::schema::WhichSchema which); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the implementation of a query condition in-place, looks like a code smell to me. 🤔
| /** | ||
| * Adds a predicate to be applied to a read query. The added predicate | ||
| * will be analyzed and evaluated in the subarray step, query condition | ||
| * step, or both. | ||
| * | ||
| * The predicate is parsed as an Apache DataFusion SQL expression and must | ||
| * evaluate to a boolean. | ||
| * | ||
| * **Example:** | ||
| * | ||
| * @code{.c} | ||
| * const char* pred = "(row BETWEEN 1 AND 10) OR (column BETWEEN 1 AND 10)"; | ||
| * tiledb_query_add_predicate(ctx, query, pred); | ||
| * @endcode | ||
| * | ||
| * @param ctx The TileDB context. | ||
| * @param query The TileDB query. | ||
| * @param predicate A text representation of the desired predicate. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document here when query predicates are not supported (such as dense queries).
davisp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a first pass review, but skimming things there's a few things that strike me as things that could be improved.
I think the enumeration stuff could be improved if we split the validation logic up. The reason for enumerations to be lazy loaded is so that they're not sent back and forth when making REST API calls. Loading them on the server during query execution is a thing that already happens for normal query conditions, so I don't see why we can't do the same here. The downside here is that we're moving validation errors from tiledb_query_add_predicate to tiledb_query_submit, but I think that's preferable to not supporting enumerations. I'll also add after I wrote the rest of my comment, we're actually not able to shoot our Rust state across the wire which means our evaluation stuff has to live server side where we can easily load the enumerations.
My bigger issue though is that the FFI interface seems to be getting unnecessarily complex. There were a lot of comments about safety of transmuting that I was giving some side eye in terms of "That's seems plausibly true, but I can't prove it" which doesn't give me the warm fuzzies. However, it sure seems like we could make a lot of that go away by using a simpler interface that only exposes the required APIs. Something like such maybe?
extern "Rust" {
type QueryEvaluator;
fn new_query_evaluator(schema: ArraySchemaRef) -> Box<QueryEvaluator>;
fn add_predicates(&self, expr: Vec<CxxString>) -> Result<()>;
fn evaulate(&self, data: RecordBatch) -> Result<RecordBatch>; // Or the bitmap or w/e
fn validate_expr(schema: ArraySchemaRef, expr: &str) -> Result<()>; // Used by add_predicate to do any initial verification
}On the less important/hand wavy side, it seems like mixing query conditions and predicates is just adding complexity when we don't really need to. I'd just enforce one or the other rather than attempting to convert from query conditions to expressions. It might still be useful in tests for quick validation, but it feels like the sort of thing that'll be easier to not support long term (plus, what's the use case of mixing?).
test/src/unit-query-add-predicate.cc
Outdated
| "received NativeType::UInt64, DataType: UInt64.\nThis was likely " | ||
| "caused by a bug in DataFusion's code and we would welcome that you " | ||
| "file an bug report in our issue tracker"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a not great error message to be spitting out from our API.
This is a good requirement (i.e. that the enumeration is not loaded on the client side) to add and test for, good (implicit) suggestion.
Here's what I believed a few months ago (in the PR description):
For query conditions this didn't seem to matter because the expression isn't really validated until it is evaluated. But for datafusion we need the schema to even parse the expression into a I was wondering if With respect to enumerations is there a concern for this pull request, or are these details you would like to see for CORE-287? |
I have this in progress and it does simplify quite a lot. The code here ended up this way as I was a little fixated on extending the "evaluator" notion started in #5546 but I agree it is a lot cleaner to step away from that instead. |
Resolves CORE-25.
#5546 enabled
QueryConditionto evaluate datafusion expression trees in itsapply_sparsefunction. This was used only in test scenarios, with a configuration parameter which would haveQueryConditiontranslate its syntax tree into a datafusionExprand then evaluate it.This pull request builds upon this capability and connects it to C and C++ level APIs. We add
tiledb_query_add_predicate(tiledb_ctx_t*, tiledb_query_t*, const char* predicate)which uses datafusion to parsepredicateinto anExprwhich can be evaluated byQueryCondition.This enables users to add a much more expressive set of predicates to their queries, which can perform arithmetic, function calls, conditional (
CASE) expressions, and so on.Design
Datafusion has a
SessionContextwhich contains the resources and catalog of a Datafusion session. This object is used to parse and evaluate expressions and is where one registers tables, UDFs, etc.We will lazily attach a
SessionContextto aQuerywhentiledb_query_add_predicateis first invoked. Doing so lazily means that existing applications should observe no changes.The
tiledb_query_add_predicateinitializes theSessionContextif needed and uses it to parse the expression intoLogicalExpr. We accumulate a list ofLogicalExprand then combine them into a single conjunctive expression tree (possibly together with aQueryConditionsyntax tree) and embed that in a newQueryCondition.Enumerations
To reduce the scope of this pull request, we hoped to defer supporting attributes with enumerations in predicates to a future pull request (see CORE-287). However, #5546 demonstrated some examples of rewriting a QueryCondition into a DataFusion expression tree when attributes with enumerations were used in the predicates. This occurred after
QueryCondition::rewrite_for_schema, which resolves the literal into a key for comparison with the key stored in the tile. Continuing to pass these tests requires some amount of enumeration support to be implemented here.The "obvious" thing to do would be to represent enumerations in the schema using the Arrow
Dictionarydata type. However, this notion has a few shortcomings:DictionaryArrayrequires that each of the keys is a valid index into the dictionary values, whereas TileDB enumerations specifically do not have this requirement.To work around these things, we sometimes need the Arrow schema that we generate to use the enumeration key type, and sometimes we will want it to use the enumeration value type.
This pull request adds a switch to
tiledb::oxidize::arrow::schema::createand the functions coming afterwards which implement this toggle, for the generated schema to be the "storage" schema or the "view" schema. We use the "storage" schema to support the existing tests. CORE-287 will use the "view" schema to support text predicates on attributes with enumerations.There is some more code related to enumerations added here which was in draft along the way to implementing a correct "view" schema. I left it in but we can remove it if preferred.
Testing
query_condition_sparseAPI examples, as well as some more complicated predicates which cannot be expressed as a query condition.unit-query-add-predicate.cc.unit-sparse-global-order-reader.ccto optional render the generatedQueryConditionsyntax tree as a SQL string, which we then add as a predicate instead of applying it to the query normally.Unsupported
Next Steps
This does not analyze predicates for attributes/dimensions to improve R-tree traversal (CORE-28) or add the desired intersection function (CORE-26).
TYPE: FEATURE
DESC: add
tiledb_query_add_predicateAPI to parse a SQL expression string into aQueryCondition