Skip to content

add support for QuerySet.aggregate() #84

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 8, 2024

Conversation

WaVEV
Copy link
Collaborator

@WaVEV WaVEV commented Jul 16, 2024

Fixes #12 and #79

Copy link
Collaborator

@timgraham timgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use git rebase -i HEAD~20 to remove the commits that shouldn't be in here.

@WaVEV WaVEV force-pushed the support-aggregate branch from 54de66a to 17dfa68 Compare July 16, 2024 15:46
@timgraham timgraham changed the title Support aggregate add support for QuerySet.aggregate() Jul 16, 2024
@WaVEV WaVEV force-pushed the support-aggregate branch 2 times, most recently from c43a1df to 9ff6438 Compare July 24, 2024 04:55
@WaVEV WaVEV force-pushed the support-aggregate branch from a528112 to 7e80c1a Compare July 26, 2024 04:56
@WaVEV WaVEV marked this pull request as ready for review July 26, 2024 04:57
@WaVEV WaVEV requested a review from timgraham July 29, 2024 13:41
@WaVEV WaVEV force-pushed the support-aggregate branch from f1e9281 to f6b0a9d Compare July 30, 2024 00:10
@WaVEV WaVEV force-pushed the support-aggregate branch from 10fd0ca to 3f85f34 Compare July 30, 2024 03:19
@@ -17,15 +19,190 @@ class SQLCompiler(compiler.SQLCompiler):
"""Base class for all Mongo compilers."""

query_class = MongoQuery
SEPARATOR = "10__MESSI__3"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Collaborator Author

@WaVEV WaVEV Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we have a problem, We cannot define a separator and be sure that it wouldn't have any collision in the future.
This things is used in the case that we are grouping with a foreign field. like

select max(A.id), B.name
from A
join B
group by B.name

In the grouping stage I cannot set names like table.field, so I have to use a separator. But any string could be a collision with the table name or field name.
Then I think in two options:

  1. Generate a random separator (and regenerate if it has a collision)
  2. Rename all the columns (as Django does __col#),

For now, I think we can be ok with some random separator like 'GROUP_FOREIGN_SEPARATOR'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about double underscore: __. That seems fairly unlikely to collide. I think it's important to keep the queries readable and inserting random strings doesn't seem so nice. Django field names cannot include double underscore due to a collision with Djanog's lookup separator. A developer could set db_column="something__something" but this seems unlikely (famous last words). We could just say, "don't do that!"

Also, I think we should use a more descriptive variable name than SEPARATOR. Django has LOOKUP_SEP = "__". I think GROUP_SEP could be fine here.

Copy link
Collaborator Author

@WaVEV WaVEV Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! but I decide to add three, we use __aggregation so, __ was wrong.

@timgraham timgraham force-pushed the support-aggregate branch 2 times, most recently from 554dacb to 4a70fd0 Compare August 5, 2024 18:26
group[alias] = sub_expr.as_mql(self, self.connection)
replacing_expr = inner_column
# Count must return 0 rather than null.
if isinstance(sub_expr, Count):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this special handling of Count/Variance can live in aggregations.py?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so as well, but I couldn't find a solution. Let's see the following example:

select count(*) from T1 where 1 = 0

The where clause is executed first, resulting in an empty set, so the count will be 0. But if in MongoDB I do:

[
  {
    "$match": {"$expr": {"$eq": [1, 0]}}
  },
  {
    "$group": {"_id": null, "group": {"$sum": 1}}
  }
]

It will return an empty list. To handle this, I created a $facet for the group by null elements. Then post-processing is needed, and this way is shorter.

return Col(self.collection_name, column_target)

def _prepare_expressions_for_pipeline(self, expression, target, count):
"""Prepare expressions for the aggregation pipeline."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods would benefit from some more inline comments or at least an explanation of what it means to "prepare an expression".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Indeed. Will add

if resolve_inner_expression:
return inner_expression
return {"$sum": inner_expression}
# If distinct=True or resolve_inner_expression=False=False, sum the size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited this from "When count is called with distinct without the flag". I think "the flag" meant resolve_inner_expression=False but please confirm.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is.

@mongodb mongodb deleted a comment from timgraham Aug 5, 2024
source_expressions = node.get_source_expressions()
filter_ = deepcopy(self.filter)
filter_.add(
WhereNode([Exact(source_expressions[0], Value(None))], negated=True),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @timgraham I deleted your comment by mistake, the answer is:
The count only counts values if they aren't none. if the expression result is a string, number or something it sums as 1. Looking again the code, maybe there is a bug when there is filter and distinct options. Will check.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that both branches are excluding null values. My question is how does the query end up with null values? I could probably work through some tests to understand it better. Thought you might be able to give a quick example.

Copy link
Collaborator Author

@WaVEV WaVEV Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, there are like 3 places with nulls. 😬 or maybe two.
The example when this code is use is:

Select count(name) filter (where surname = 'Lupi')
from T1

Here we have to sum of the elements that fulfill the filter, we handle it with a case and the first (and only) source_expression would be the transformed (idk if there is others, but I just copy the mechanism from source)

I will explain it one by one.

But we have to change the exact(value, None) for IsNull(value, True) they are not the same.


def count(self, compiler, connection, resolve_inner_expression=False, **extra_context): # noqa: ARG001
"""
When resolve_inner_expression is True, return the argument as MQL that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give some examples to help me understand this docstring?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, maybe it should be in another function, and I forgot to support that flag in aggregates (there isn't any test covering it 😬). But I can explain the idea: I translated SELECT COUNT(DISTINCT name) FROM T1 into addToSet and then the aggregation operation. So, I only need to calculate the RHS to store its values in a set. After storing, we can calculate the sum, size, variance, or whatever is needed.

source_expressions = node.get_source_expressions()
filter_ = deepcopy(self.filter)
filter_.add(
WhereNode([Exact(source_expressions[0], Value(None))], negated=True),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that both branches are excluding null values. My question is how does the query end up with null values? I could probably work through some tests to understand it better. Thought you might be able to give a quick example.

@@ -17,15 +19,190 @@ class SQLCompiler(compiler.SQLCompiler):
"""Base class for all Mongo compilers."""

query_class = MongoQuery
SEPARATOR = "10__MESSI__3"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about double underscore: __. That seems fairly unlikely to collide. I think it's important to keep the queries readable and inserting random strings doesn't seem so nice. Django field names cannot include double underscore due to a collision with Djanog's lookup separator. A developer could set db_column="something__something" but this seems unlikely (famous last words). We could just say, "don't do that!"

Also, I think we should use a more descriptive variable name than SEPARATOR. Django has LOOKUP_SEP = "__". I think GROUP_SEP could be fine here.

else:
group["_id"] = ids
pipeline.append({"$group": group})
sets = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename sets -> add_fields ?

value = f"$_id.{key}"
if self.SEPARATOR in key:
subtable, field = key.split(self.SEPARATOR)
if subtable not in sets:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about sets = defaultdict(dict) so this isn't necessary?

try:
query.mongo_query = {"$expr": self.query.where.as_mql(self, self.connection)}
where = getattr(self, "where", self.query.where)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fallback is a little opaque. It looks to me self.query.where is used for Update and Delete compilers. I would either add a comment about this or add a get_where() hook to the compilers that returns the appropriate value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the where from query is very confusing, sometimes it is the same where that camper has, other times it has a where with the having expressions. The compiler (the main one) uses self.where, that has the expressions separated. but add a get_where in the compiler could be good enough.

stack.extend(expr.get_source_expressions())

def get_aggregation_pipeline(self):
return self._group_pipeline
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be a private variable with a get_ method? Could it be self.aggregation_pipeline like other variables in SQLCompiler.__init__()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, It can

"aggregation.tests.AggregateTestCase.test_aggregation_default_passed_another_aggregate",
"aggregation.tests.AggregateTestCase.test_annotation_expressions",
"aggregation.tests.AggregateTestCase.test_reverse_fkey_annotate",
# Manage empty result when the flag elide_empty is False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a TODO even though the test probably won't work with SQL-specific: Func("book", function="COUNT")?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a TODO, but maybe this test is not the right to do that. I think, this should be moved to the raw sql queries...

Also add support for Count() in QuerySet.annotate().
@timgraham timgraham merged commit 6f409db into mongodb:main Aug 8, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add QuerySet.aggregate() support
2 participants