Skip to content

INTPYTHON-635 Improve join performance by pushing simple filter conditions to $lookup #356

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

WaVEV
Copy link
Collaborator

@WaVEV WaVEV commented Aug 5, 2025

This is the continuation of #345
With some bug fixes and unit test

In this PR we tackle the slow join pushing the simple conditions inside the $lookup. the condition is also fixing this bug in the col target cloning. the previous version copy from the last expr that loop in the for.
Also it fixes lhs_fields that only take fields from lhs, now it can take from lhs and rhs. (was expr.lhs.as_mql instead of hand_side_value.as_mql)

New query:

[{'$lookup': {'as': 'm2m_regress_selfrefer_references',
              'from': 'm2m_regress_selfrefer_references',
              'let': {'parent__field__0': '$_id'},
              'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$parent__field__0',
                                                                   '$to_selfrefer_id']},
                                                          {'$eq': ['$from_selfrefer_id',
                                                                   ObjectId('687ad601daccf7b757ff46ec')]}]}}}]}},
 {'$unwind': '$m2m_regress_selfrefer_references'},
 {'$match': {'$expr': {'$eq': ['$m2m_regress_selfrefer_references.from_selfrefer_id',
                               ObjectId('687ad601daccf7b757ff46ec')]}}}]

Previous query

[{'$lookup': {'as': 'm2m_regress_selfrefer_references',
              'from': 'm2m_regress_selfrefer_references',
              'let': {'parent__field__0': '$_id'},
              'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': ['$$parent__field__0',
                                                                   '$to_selfrefer_id']}]}}}]}},
 {'$unwind': '$m2m_regress_selfrefer_references'},
 {'$match': {'$expr': {'$eq': ['$m2m_regress_selfrefer_references.from_selfrefer_id',
                               ObjectId('687ad676396a5de149f2ea4d')]}}}]

fixes #309

@WaVEV WaVEV force-pushed the INTPYTHON-635-Slow-joins-for-ManyToMany-fields branch from f1ce6d4 to 9728f2b Compare August 5, 2025 13:33
Copy link
Collaborator

@timgraham timgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some code comments please! There's no description of what a "pushed expression" is. :-)

@timgraham
Copy link
Collaborator

the condition is also fix a bug.

Can you elaborate? Is this demonstrated by one of the tests?

@timgraham timgraham changed the title Intpython 635 slow joins for many to many fields INTPYTHON-635 Improve join performance by pushing simple conditions to $lookup Aug 5, 2025
@WaVEV
Copy link
Collaborator Author

WaVEV commented Aug 5, 2025

Need some code comments please! There's no description of what a "pushed expression" is. :-)

I added some docstrings, let me know if it explains well the idea.

@WaVEV
Copy link
Collaborator Author

WaVEV commented Aug 5, 2025

the condition is also fix a bug.

Can you elaborate? Is this demonstrated by one of the tests?

Well, maybe I messed up trying to explain all the things that I forgot to structure well the text. There was a bug in the join, I will make a reference but I took the previous for iterator to generate some columns, all the test worked because they don't have more than one extra expression

Copy link
Contributor

@Jibola Jibola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One non-blocking comment on making a JIRA ticket. I'm excited for this optimization!

Comment on lines +130 to +157
def test_negated_related_filter_is_not_pushable(self):
with self.assertNumQueries(1) as ctx:
list(Book.objects.filter(~models.Q(author__name="John")))
query = ctx.captured_queries[0]["sql"]
self.assertEqual(
query,
"db.queries__book.aggregate([{'$lookup': {'from': "
"'queries__author', 'let': {'parent__field__0': '$author_id'}, "
"'pipeline': [{'$match': {'$expr': {'$and': [{'$eq': "
"['$$parent__field__0', '$_id']}]}}}], 'as': 'queries__author'}}, "
"{'$unwind': '$queries__author'}, {'$match': {'$expr': "
"{'$not': {'$eq': ['$queries__author.name', 'John']}}}}])",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make a JIRA ticket for this case? I think it can and should be solved for, but not in the scope of this ticket. As well, could that JIRA ticket be added as a comment to this function? An INTPYTHON ticket. If you need help making it, let me know.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 what would be the scope of the follow up ticket? I mean, it cold handle simple negated expressions or any kind of expressions (or with negations and complex scenarios)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, let's start small for this ticket and leave it to "handling negations".

It's feature-level work to handle something more robust.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 553 to 561
where = self.get_where()
promote_filters = defaultdict(list)
for expr in where.children if where and where.connector == AND else ():
if (
isinstance(expr, Lookup)
and isinstance(expr.lhs, Col)
and (is_direct_value(expr.rhs) or isinstance(expr.rhs, Value | Col))
):
promote_filters[expr.lhs.alias].append(expr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a comment that explains the particularities of all these conditions and the general purpose of this code?

Copy link
Collaborator Author

@WaVEV WaVEV Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to promote (push) only basic lookups. No subqueries or complex conditions.
First, we check that the LHS (left-hand side) is a column, not another expression. The RHS (right-hand side) must also be simple for the same reason.
If both sides meet these criteria, we push the condition.
We only push the LHS alias to avoid applying the same condition twice in two different subqueries. In most cases, the join Django performs on the LHS already covers the main improvements.

This approach is a bit tricky. The pushed conditions are applied only to INNER JOINs, because with LEFT OUTER JOINs, null checks are sometimes used to detect if the RHS does not exist in the join.

In short, the following lines annotate the simple expressions to be pushed.

Copy link
Contributor

@Jibola Jibola Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, we check that the LHS (left-hand side) is a column, not another expression. The RHS (right-hand side) must also be simple for the same reason.
If both sides meet these criteria, we push the condition.
We only push the LHS alias to avoid applying the same condition twice in two different subqueries. In most cases, the join Django performs on the LHS already covers the main improvements.

This approach is a bit tricky. The pushed conditions are applied only to INNER JOINs, because with LEFT OUTER JOINs, null checks are sometimes used to detect if the RHS does not exist in the join.
In short, the following lines annotate the simple expressions to be pushed.

Could this be the code comment?

extra.replace_expressions(replacements).as_mql(compiler, connection)
)

# pushed_expression is a filter expression from the outer WHERE clause
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with docs, please wrap comments at 79 characters.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of them are 79 or less. Only the function docstring is bigger

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that some lines are wrapped early and could be flowed a little longer. I'll make the edits since I might tweak some wording anyway.

@@ -548,10 +550,25 @@ def get_combinator_queries(self):

def get_lookup_pipeline(self):
result = []
where = self.get_where()
promote_filters = defaultdict(list)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So promote_filters are used to create pushed_expression. Is there a reason to use two different terms (promote, pushed)? I know there is some usage of "promote" in Django's SQL generation. My understanding is that Django uses the term "join promotion" to describe the conversion of left outer join to inner join, where possible.

I don't know whether it makes any difference to use "promote" or "push" here, but I tentatively propose pushed_filters and pushed_filter_expression and will make that edit in my language review, if okay.

Comment on lines +37 to +62
self.assertEqual(
query,
"db.queries__book.aggregate(["
"{'$lookup': {'from': 'queries__author', "
"'let': {'parent__field__0': '$author_id'}, 'pipeline': ["
"{'$match': {'$expr': {'$and': [{'$eq': ['$$parent__field__0', "
"'$_id']}, {'$eq': ['$name', 'John']}]}}}], 'as': "
"'queries__author'}}, {'$unwind': '$queries__author'}, {'$match': "
"{'$expr': {'$and': [{'$eq': ['$queries__author.name', 'John']}, "
"{'$eq': ['$title', 'Don']}]}}}])",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we need a ticket for it (maybe you'd have time next week after code freeze), but how much nicer would an assertion helper like this be:
assertAggregateQuery(table, query)

self.assertAggregateQuery(
    "queries__book",
    {
        {
            '$lookup': ...
    }
)

I think the diffs when the current version fails are not so easy to debug, not to mention the readability and tedious formatting of the current query strings!

@timgraham
Copy link
Collaborator

Emanuel will add a ManyToManyField test to confirm this fixes the original report.

@WaVEV
Copy link
Collaborator Author

WaVEV commented Aug 9, 2025

Emanuel will add a ManyToManyField test to confirm this fixes the original report.

Added new unit test. I handle a way to write a more human readable expected_queries. The bad thing is, the test file gets very long.

@timgraham timgraham changed the title INTPYTHON-635 Improve join performance by pushing simple conditions to $lookup INTPYTHON-635 Improve join performance by pushing simple filter conditions to $lookup Aug 9, 2025
@timgraham timgraham force-pushed the INTPYTHON-635-Slow-joins-for-ManyToMany-fields branch from 495b009 to 9831e9b Compare August 9, 2025 18:19
@timgraham timgraham merged commit 9831e9b into mongodb:main Aug 9, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

INTPYTHON-635 Slow joins for ManyToMany fields
4 participants