-
Notifications
You must be signed in to change notification settings - Fork 26
INTPYTHON-635 Improve join performance by pushing simple filter conditions to $lookup #356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INTPYTHON-635 Improve join performance by pushing simple filter conditions to $lookup #356
Conversation
f1ce6d4
to
9728f2b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need some code comments please! There's no description of what a "pushed expression" is. :-)
Can you elaborate? Is this demonstrated by one of the tests? |
I added some docstrings, let me know if it explains well the idea. |
Well, maybe I messed up trying to explain all the things that I forgot to structure well the text. There was a bug in the join, I will make a reference but I took the previous for iterator to generate some columns, all the test worked because they don't have more than one extra expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One non-blocking comment on making a JIRA ticket. I'm excited for this optimization!
def test_negated_related_filter_is_not_pushable(self): | ||
with self.assertNumQueries(1) as ctx: | ||
list(Book.objects.filter(~models.Q(author__name="John"))) | ||
query = ctx.captured_queries[0]["sql"] | ||
self.assertEqual( | ||
query, | ||
"db.queries__book.aggregate([{'$lookup': {'from': " | ||
"'queries__author', 'let': {'parent__field__0': '$author_id'}, " | ||
"'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': " | ||
"['$$parent__field__0', '$_id']}]}}}], 'as': 'queries__author'}}, " | ||
"{'$unwind': '$queries__author'}, {'$match': {'$expr': " | ||
"{'$not': {'$eq': ['$queries__author.name', 'John']}}}}])", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make a JIRA ticket for this case? I think it can and should be solved for, but not in the scope of this ticket. As well, could that JIRA ticket be added as a comment to this function? An INTPYTHON
ticket. If you need help making it, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 what would be the scope of the follow up ticket? I mean, it cold handle simple negated expressions or any kind of expressions (or with negations and complex scenarios)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, let's start small for this ticket and leave it to "handling negations".
It's feature-level work to handle something more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here you are:
https://jira.mongodb.org/browse/INTPYTHON-713
django_mongodb_backend/compiler.py
Outdated
where = self.get_where() | ||
promote_filters = defaultdict(list) | ||
for expr in where.children if where and where.connector == AND else (): | ||
if ( | ||
isinstance(expr, Lookup) | ||
and isinstance(expr.lhs, Col) | ||
and (is_direct_value(expr.rhs) or isinstance(expr.rhs, Value | Col)) | ||
): | ||
promote_filters[expr.lhs.alias].append(expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a comment that explains the particularities of all these conditions and the general purpose of this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to promote (push) only basic lookups. No subqueries or complex conditions.
First, we check that the LHS (left-hand side) is a column, not another expression. The RHS (right-hand side) must also be simple for the same reason.
If both sides meet these criteria, we push the condition.
We only push the LHS alias to avoid applying the same condition twice in two different subqueries. In most cases, the join Django performs on the LHS already covers the main improvements.
This approach is a bit tricky. The pushed conditions are applied only to INNER JOINs, because with LEFT OUTER JOINs, null checks are sometimes used to detect if the RHS does not exist in the join.
In short, the following lines annotate the simple expressions to be pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, we check that the LHS (left-hand side) is a column, not another expression. The RHS (right-hand side) must also be simple for the same reason.
If both sides meet these criteria, we push the condition.
We only push the LHS alias to avoid applying the same condition twice in two different subqueries. In most cases, the join Django performs on the LHS already covers the main improvements.
This approach is a bit tricky. The pushed conditions are applied only to INNER JOINs, because with LEFT OUTER JOINs, null checks are sometimes used to detect if the RHS does not exist in the join.
In short, the following lines annotate the simple expressions to be pushed.
Could this be the code comment?
django_mongodb_backend/query.py
Outdated
extra.replace_expressions(replacements).as_mql(compiler, connection) | ||
) | ||
|
||
# pushed_expression is a filter expression from the outer WHERE clause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with docs, please wrap comments at 79 characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of them are 79 or less. Only the function docstring is bigger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that some lines are wrapped early and could be flowed a little longer. I'll make the edits since I might tweak some wording anyway.
django_mongodb_backend/compiler.py
Outdated
@@ -548,10 +550,25 @@ def get_combinator_queries(self): | |||
|
|||
def get_lookup_pipeline(self): | |||
result = [] | |||
where = self.get_where() | |||
promote_filters = defaultdict(list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So promote_filters
are used to create pushed_expression
. Is there a reason to use two different terms (promote, pushed)? I know there is some usage of "promote" in Django's SQL generation. My understanding is that Django uses the term "join promotion" to describe the conversion of left outer join to inner join, where possible.
I don't know whether it makes any difference to use "promote" or "push" here, but I tentatively propose pushed_filters
and pushed_filter_expression
and will make that edit in my language review, if okay.
self.assertEqual( | ||
query, | ||
"db.queries__book.aggregate([" | ||
"{'$lookup': {'from': 'queries__author', " | ||
"'let': {'parent__field__0': '$author_id'}, 'pipeline': [" | ||
"{'$match': {'$expr': {'$and': [{'$eq': ['$$parent__field__0', " | ||
"'$_id']}, {'$eq': ['$name', 'John']}]}}}], 'as': " | ||
"'queries__author'}}, {'$unwind': '$queries__author'}, {'$match': " | ||
"{'$expr': {'$and': [{'$eq': ['$queries__author.name', 'John']}, " | ||
"{'$eq': ['$title', 'Don']}]}}}])", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that we need a ticket for it (maybe you'd have time next week after code freeze), but how much nicer would an assertion helper like this be:
assertAggregateQuery(table, query)
self.assertAggregateQuery(
"queries__book",
{
{
'$lookup': ...
}
)
I think the diffs when the current version fails are not so easy to debug, not to mention the readability and tedious formatting of the current query strings!
Emanuel will add a |
Added new unit test. I handle a way to write a more human readable expected_queries. The bad thing is, the test file gets very long. |
495b009
to
9831e9b
Compare
This is the continuation of #345
With some bug fixes and unit test
In this PR we tackle the slow join pushing the simple conditions inside the $lookup. the condition is also fixing this bug in the col target cloning. the previous version copy from the last expr that loop in the for.
Also it fixes lhs_fields that only take fields from lhs, now it can take from lhs and rhs. (was expr.lhs.as_mql instead of hand_side_value.as_mql)
New query:
Previous query
fixes #309