Skip to content

Conversation

WaVEV
Copy link
Collaborator

@WaVEV WaVEV commented Sep 9, 2025

Doc here

In this PR a unified approach for generating MQL from Django expressions was implemented. The core idea is to centralize the control flow in a base_expression method, which decides whether the expression can be translated into a direct field: value match (index-friendly) or must fall back to $expr. This keeps the logic for wrapping and dispatching in one place, while each lookup/function only defines its own expression-building logic.

This approach also allows mixing direct field: value matches with $expr clauses within the same $match. As a result, multiple $expr entries may coexist alongside index-optimized conditions, depending on the shape of the query.

Most lookups now follow this pattern by simply implementing as_mql_expr (and optionally as_mql_path when a match-based translation is possible). Only a few special cases like Col, Func operators (except the KeyTransform) , and many more, override the base behavior directly. This structure also leaves room for future optimizations (e.g. constant folding) without having to change the overall flow.

Additionally, since MongoDB 6 does not allow nesting $expr inside another $expr, the flow in base_expression ensures that such cases are flattened. In practice, expressions are generated without redundant wrapping, so the final MQL never contains $expr within $expr.

NOTE: Some polish will be made, but the main idea and the majority of the code is already rendered.

@WaVEV WaVEV force-pushed the lookup-refactor branch 3 times, most recently from 529e0ff to a78f26b Compare September 15, 2025 21:21
@timgraham timgraham changed the title WIP lookup refactor INTPYTHON-751 Make query generation omit $expr unless required Sep 20, 2025
Substr.as_mql = substr
Trim.as_mql = trim("trim")
TruncBase.as_mql = trunc
Cast.as_mql_expr = cast
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the function does not support as_mql_path. It could be added latter if we try to simplify constants expressions

return value


def base_expression(self, compiler, connection, as_path=False, **extra):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the common handler for all expressions. It defines if an expr is needed or not.

@WaVEV WaVEV marked this pull request as ready for review September 26, 2025 04:52
@WaVEV WaVEV requested review from Jibola and timgraham and removed request for Jibola September 26, 2025 04:52
Comment on lines 258 to 259
KeyTransformExact.as_mql_expr = key_transform_exact_expr
KeyTransformExact.as_mql_path = key_transform_exact_path
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check alphabetization of the classes and functions (not only in this file).

}

def range_match(a, b):
## TODO: MAKE A TEST TO TEST WHEN BOTH ENDS ARE NONE. WHAT SHALL I RETURN?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI says, "If either the start or end value provided to the BETWEEN operator is NULL, the entire BETWEEN condition will typically evaluate to UNKNOWN (and thus FALSE in a WHERE clause), unless explicitly handled." (I confirmed this for SQLite and PostgreSQL)

However, to match the semantics implemented here where None is treated as min/max date, I would expect __range=[None, None] not to filter any values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raise FullResultSet is the way to go, and EmptyResultSet if the low is bigger than high.

Comment on lines +69 to +79
Aggregate.as_mql_expr = aggregate
Count.as_mql_expr = count
StdDev.as_mql_expr = stddev_variance
Variance.as_mql_expr = stddev_variance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we have as_mql_expr(), as_mql_path(), and as_mql(..., as_path=...). If this is the way we keep it, it would be good to explain in the design document which objects (aggregate, func, expression, etc.) get which.

I wonder about renaming as_mql_expr() or as_mql_path() to as_mql() (i.e. treating one of paths as the default). Do you think it would be more or less confusing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was the idea. I’ll explain it in the docs, and we might also consider renaming some methods. The core concept is:

  • Every expression has an as_mql method.
  • In some cases, it’s simpler to implement as_mql directly, so those methods don’t follow the common expression flow.
  • For other expressions, as_mql is a composite function that delegates to as_path or as_expr when applied.
  • The base_expression.as_mql method controls when these are called and performs boilerplate checks to prevent nesting an expr inside another expr (a MongoDB 6 restriction).

In short: every object has as_mql. Some also define as_path and as_expr. The base_expression coordinates how these methods are used, except for cases where as_mql is defined directly.

Copy link
Collaborator Author

@WaVEV WaVEV Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc here: link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timgraham I actually like the decoupling of as_mql and as_mql_(path | expr). I view it as, you need to define at least two:
as_mql and as_mql_expr.

Then if you want to have the more optimized function you define as_mql_path. It feels less confusing to me that way.

Copy link
Collaborator Author

@WaVEV WaVEV Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 An expression needs at least one of the methods, as_mql_expr is enough (not optimal but functional). The aggregation module is a good example. Because the as_mql is defined in baseExpression class. In this approach most of the expressions don't need as_mql (if as_mql is defined, expr or path aren't needed).
EDIT: Sorry, I rushed the answer before read all the text, you got the point well. 😬

@WaVEV WaVEV force-pushed the lookup-refactor branch 4 times, most recently from 5827580 to 5b9fa93 Compare October 4, 2025 02:50
raise FullResultSet
return {"$and": conditions}

# match, path, find? don't know which name use.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, mongo_??_operators

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mongo_match_operators does sound good. It still needs a comment have them briefly explained

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we consider that "match" is the default path unless expr is needed, I see an argument from simply calling it mongo_operators and leaving mongo_expr_operators as the name that needs some distinction.

Similarly, the as_path=False default argument might become as_expr=False. (Would that result in more or fewer cases to be handled with as_expr=True?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 yes, as_expr=False seems more consistent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

conditions.append({a: {"$lte": b[1]}})
if start is not None and end is not None:
if isinstance(start, Decimal128):
start = start.to_decimal()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potencial overflow risk here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a failing test without the to_decimal() calls? I'm not sure why there would be a mix of Decimal/Decimal128 here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a BSON type held at the database and returned locally. I think it should be good.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there is: model_fields_.test_polymorphic_embedded_model.QueryingTests.test_range. Maybe this case shouldn't be handled here, and let the query optimizer to remove those False cases.

Copy link
Contributor

@Jibola Jibola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left my first round of comments. I'm going to review the tests as a second phase of the PR.

Comment on lines 113 to 114
if b:
return {"$or": [{a: {"$exists": False}}, {a: None}]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does b represent in this case? Not negating?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 well, it is a bit confusing here. but this function checks for nullability. So, if b is True then we check null. maybe the parameter should be: field, null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field and is_null are good clarifiers

raise FullResultSet
return {"$and": conditions}

# match, path, find? don't know which name use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mongo_match_operators does sound good. It still needs a comment have them briefly explained

Comment on lines +69 to +79
Aggregate.as_mql_expr = aggregate
Count.as_mql_expr = count
StdDev.as_mql_expr = stddev_variance
Variance.as_mql_expr = stddev_variance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timgraham I actually like the decoupling of as_mql and as_mql_(path | expr). I view it as, you need to define at least two:
as_mql and as_mql_expr.

Then if you want to have the more optimized function you define as_mql_path. It feels less confusing to me that way.

conditions.append({a: {"$lte": b[1]}})
if start is not None and end is not None:
if isinstance(start, Decimal128):
start = start.to_decimal()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a BSON type held at the database and returned locally. I think it should be good.

Comment on lines 936 to 945
def as_mql_expr(self, compiler, connection):
lhs_mql = process_lhs(self, compiler, connection, as_path=False)
value = process_rhs(self, compiler, connection, as_path=False)
return {"$gte": [lhs_mql, value]}

def as_mql_path(self, compiler, connection):
lhs_mql = process_lhs(self, compiler, connection, as_path=True)
value = process_rhs(self, compiler, connection, as_path=True)
return {lhs_mql: {"$gte": value}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a $gte query in a search.text lookup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To convert a score function into a filter I decided to express the following proposition: score_func(...) > 0.

return self.is_simple_column

@cached_property
def is_simple_column(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this validation for embedded models used in multiple places. Can you consolidate this to the query_utils file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 will try. The overall structure of the function is similar, but the type varies. I could create a meta-function that generates the appropriate function for a given type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get better. 😬 . I will let as it is. there is some details like:
previous._field.column, previous.key_name when extract the field_name or the path. Then are different from EMFA and EMF the return, one has to validate the inner transform while the other doesn't

where = self.get_where()
try:
expr = where.as_mql(self, self.connection) if where else {}
match = where.as_mql(self, self.connection, as_path=True) if where else {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we enforce as_path=True here, can we set that as the default on WhereNode.as_mql(..., as_path=True)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will refactor. Also it may change the name, as_expr=False. but it is scenically the same



def process_lhs(node, compiler, connection):
def process_lhs(node, compiler, connection, as_path=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the delegation logic, should as_path always default to True?

If the caller requests a path filter (as_path=True) and the expression supports it, as_mql_path is executed.
If the expression cannot be represented as a path filter but as_path=True, it is automatically encapsulated in $expr.
If the caller requests an expression (as_path=False), as_mql_expr is executed directly. No additional encapsulation is needed, since the query generation ensures proper nesting of $expr.

Comment on lines +68 to +70
if isinstance(value, CombinedExpression):
# Temporary: treat all CombinedExpressions as non-constant until
# constant cases are handled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a follow-up ticket for this? If it already exists, could you link it in the comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def valid_path_key_name(key_name):
return bool(re.fullmatch(r"[A-Za-z0-9_]+", key_name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.mongodb.com/docs/manual/core/dot-dollar-considerations/

Values like hashtags are also valid for path names and don't require $expr. To my knowledge so long as it's not (.) or ($) it's good.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will adjust the expression.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there is some emoji or some non ascii character? 🤔

Comment on lines 211 to 215
def as_mql_expr(self, compiler, connection):
columns, parent_field = self._get_target_path()
mql = parent_field.as_mql(compiler, connection)
for key in columns:
mql = {"$getField": {"input": mql, "field": key}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially out of scope:
https://github.com/mongodb/django-mongodb-backend/pull/392/files#diff-0a6ce30a131a00fa88086c4c4d0d6e6232845fd11ef2bc67891fdf92e10c3743R18-R45

Is it possible to still remove $getField in as_mql_expr or is it expected that routing to as_mql_expr for embedded model queries is because of needing a getField call?

Comment on lines +141 to +147
@property
def can_use_path(self):
simple_column = getattr(self.lhs, "is_simple_column", False)
constant_value = is_constant_value(self.rhs)
return simple_column and constant_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Comment on lines 113 to 114
if b:
return {"$or": [{a: {"$exists": False}}, {a: None}]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field and is_null are good clarifiers

self.assertAggregateQuery(
query,
"model_fields__nullableintegerarraymodel",
[{"$match": {"field": {"$in": ([1], [2])}}}],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does $in now expect a tuple?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it wasn’t a new convention or expectation. There was already a test that checks the RHS $in as a tuple, so I just followed that convention. It doesn’t affect the query behavior.

"$match": {
"$expr": {
"$eq": [
{"$getField": {"input": "$data", "field": "integer_"}},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of a value that could get rid of the getField.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could get rid of those getField if we use $data.integer_ but I thought it was out of scope for this refactor. This behavior is the current behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's out of scope!

[
{
"$match": {
"$expr": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to the callback chain on this one since the null-check could actually be converted.

May best be an improvement added later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants