-
Notifications
You must be signed in to change notification settings - Fork 26
INTPYTHON-522, INTPYTHON-524 Add support for Atlas and vector search queries #325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
449b6a3
to
ca8a7cf
Compare
9935b25
to
a467a57
Compare
ea2118b
to
206b554
Compare
456028d
to
65f22e6
Compare
eb6eb07
to
e7f4d22
Compare
eed2499
to
99f6548
Compare
docs/source/ref/models/search.rst
Outdated
``SearchEquals`` objects can be reused and combined with other search | ||
expressions. | ||
|
||
See :ref:`search-operations-combinable` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could structure things so we don't need to repeat this boilerplate on every(?) expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I think we cannot scape, unless we list the operations that could be combined in the section of combinable operations. I like to have this link meanwhile I am reading the docs, so it gives an introduction of some (cool?) behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to adjust our sphinx theme (or find a new one) that has a "Contents" on the right of the page like Django's docs do (see https://docs.djangoproject.com/en/dev/ref/models/fields/). It will help the browseabilty greatly and the "combined expressions" section won't be buried.
I'm trying to improve the heading structure by adding "Atlas Search expressions" top-level heading, then "combined expressions" might also be at the top-level (to address your concern and make it more visible). Maybe CombinedSearchExpression should be a subsection of "combined expressions" since it's more of a private/advanced API compared to bitwise operators?
Probably "Vector search queries" is another top-level (for SearchVector) and SearchScoreOption is like a utility/helper class?
I didn't make my way through the entire documentation page, but if you're going to work more this weekend, please take a look at my edits and try to make similar adjustments, like adding .. class::
under each heading.
tests/queries_/test_search.py
Outdated
delayedAssertCountEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertCountEqual) | ||
delayedAssertListEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertListEqual) | ||
delayedAssertQuerySetEqual = _delayed_assertion(timeout=2)( | ||
TransactionTestCase.assertQuerySetEqual | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the non-delayed versions ever used? Maybe it's better to overwrite the original names so we don't have to write "delayedXXXXX" everywhere. Or maybe the waiting could be done in setUp()
after data is inserted? Unless some test inserts more data, essentially only the first test's waiting is needed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I all the checks are delayed...
Regarding to the second question: right, any test that insert data need to wait. If the data is inserted in the init class, we could only wait once. So If we want to get rid of those delayed, we can wait in the creation part.
tests/queries_/test_search.py
Outdated
|
||
|
||
@skipUnlessDBFeature("supports_atlas_search") | ||
class SearchEqualsTest(SearchUtilsMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to be consistent in this project about using "Tests" (plural) in the class names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 mmh I didn't notice that. will change.
tests/queries_/test_search.py
Outdated
boost_score = SearchScoreOption({"boost": {"value": 3}}) | ||
|
||
qs = Article.objects.annotate( | ||
score=SearchEquals(path="headline", value="cross", score=boost_score) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd inline boost_score
, or at least omit the blank line. (Only some tests are inconsistent.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things look great, but I've gone through about half of the code (due to size). I will check the test code tomorrow!
django_mongodb_backend/compiler.py
Outdated
if not has_search: | ||
raise ValueError( | ||
"Cannot combine two `$vectorSearch` operator. " | ||
"If you need to combine them, consider restructuring your query logic or " | ||
"running them as separate queries." | ||
) | ||
raise ValueError( | ||
"Only one $search operation is allowed per query. " | ||
f"Received {len(search_replacements)} search expressions. " | ||
"To combine multiple search expressions, use either a CompoundExpression for " | ||
"fine-grained control or CombinedSearchExpression for simple logical combinations." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these two ValueErrors
need to be switched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 the second is the case when:
has_vector_search but it does not has search. I think I should refactor this. It is a bit confusing. the not at the beginning is not helping.
# Apply De Morgan's Laws. | ||
operator = node.operator.negate() if negated else node.operator | ||
negated = negated != (node.operator == Operator.NOT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is a little confusing because it requires some understanding of negate
and the state changes.
I'll leave this as a comment here to be reviewed later.
What's an example of a NOT combinable?
I.e., how would I construct NOT (A AND B)
or can this only be done via negate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I applied De Morgan's Law to get something in the scope of A' operator B'. So:
NOT (A AND B) = Not A or Not B => {SHOULD: [MUST_NOT(A), MUST_NOT(B)]
with minimum should in 1.
The other way to handle this is push everything in a must not, but in order to handle: NOT (NOT A))
as A I decided to apply this kind of simplifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, I almost forgot, we could handle double negation on should. (That are the ors if the minimumShouldMatch is 1 ). long story short
A and B => MUST
not C => MUST_NOT
A or B => SHOULD with minimumShouldMatch is 1
not (A or B) => not A and not B => MUST(MUST_NOT(A), MUST_NOT(B))
When A, B, C are atomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored it a bit. It is simpler now, don't know if it is simple enough 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall PR looks great! I've got some minor corrections, but other than that, it is good to merge from me. Great work! 🚀
It also looks like there's a ReadTheDocs error:
/home/docs/checkouts/readthedocs.org/user_builds/django-mongodb-backend/checkouts/325/docs/source/ref/models/search.rst:654: WARNING: unknown document: 'atlas:atlas-search/scoring/' [ref.doc]
tests/queries_/test_search.py
Outdated
def create_search_index(cls, model, index_name, definition, type="search"): | ||
collection = cls._get_collection(model) | ||
idx = SearchIndexModel(definition=definition, name=index_name, type=type) | ||
collection.create_search_index(idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: For the sake of testing, we can make this a blocking call and check for the index before continuing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 If I don't understand wrong, I have to add a wait for predicate. Right?
docs/source/ref/models/search.rst
Outdated
<QuerySet [<Article: headline: title>]> | ||
The ``path`` argument can be either the name of a field (as a string), or a | ||
:class:`~django.db.models.F` instance. The ``value`` argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it be preferable to wrap the field name in an F object? I don't see F in any tests. (and similarly why wrap strings in Value?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I can add that in the test. For the values, I copied the idea from here. It could work if we don't wrap it, because MongoDB supports directs values here. In the case of the path, it should be an F object. Sometimes the string makes reference to an embedded model, or a column. In that case when F is resolved returns the corresponding column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Django's version: "The arguments to SearchVector can be any Expression or the name of a field."
I don't see any F usage, but there is: SearchVector(Value("This week everything is 10% off")
. For paths, they are strings: SearchVector("scene__setting", "dialogue")
.
I think the point is that while F
could be passed (since it's an expression), that's not really something that needs to be mentioned because it doesn't make much sense to add the extra complication compared to a raw string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I see. So remove then the models.F from the documentation is the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, isn't wrapping in F
not semantically correct? An F
object represents the value of field, not its path. That's why you had to add the as_path
argument as a workaround. Should the path strings be wrapped in Col
instead?
docs/source/ref/models/search.rst
Outdated
``{"maxEdits": 1}``. | ||
- ``token_order``: Controls token sequence behavior. Accepts values like | ||
``"sequential"`` or ``"any"``. | ||
- ``score``: An optional score expression such as ``{"boost": {"value": 5}}``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit redundant to say "optional" in the section "Optional arguments". ;-)
Isn't the expression actually SearchScoreOption({...}
, not dictionary?
Maybe something like:
A :class:`SearchScoreOption` to tune the relevance score.
There's a lot of wording variations:
"An optional score
argument can be used to customize relevance scoring."
"An optional score expression to adjust relevance."
"An optional score
argument may be used to adjust relevance scoring."
"An optional score expression to influence relevance"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It was a dictionary. It should be changed
) | ||
) | ||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording of "Args" is inconsistent. Sometimes we have "Required arguments", "Arguments", "Args", "Optional", "Optional arguments", sometimes no heading. I don't demand absolute adherence to one standard if it doesn't make sense, but using so many different styles doesn't help the reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the size of this PR, I'm okay with having that consistency be made in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't read this comment. Sorry to reaching it so late. The idea behind was:
if the parameters (optional or not) is only one, do not make any title, just listed it and that it. But it is inconsistent, I forgot to add the title in some sections 😬.
Now I am wondering if any class should be documented, because this one is a private class.
docs/source/ref/models/search.rst
Outdated
|
||
This expression is used internally when combining search expressions with | ||
Python’s bitwise operators (``&``, ``|``, ``~``), and corresponds to | ||
logical operators such as ``and``, ``or``, and ``not``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"such as" ... Are the any more not listed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they are all.
params["score"] = self.score.as_mql(compiler, connection) | ||
if self.fuzzy is not None: | ||
if self.fuzzy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should still be if self.fuzzy is not None
because fuzzy={}
is valid input
@@ -548,30 +544,31 @@ def search_operator(self, compiler, connection): | |||
} | |||
if self.score: | |||
params["score"] = self.score.as_mql(compiler, connection) | |||
if self.fuzzy is not None: | |||
if self.fuzzy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.fuzzy: | |
if self.fuzzy is not None: |
) | ||
) | ||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to the size of this PR, I'm okay with having that consistency be made in a separate PR.
b7ea348
to
87c2ecd
Compare
docs/source/ref/models/search.rst
Outdated
An optional ``score`` :class:`SearchScoreOption` argument to tune the | ||
relevance score. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a standalone sentence. Suggestion: "The optional score argument is a SearchScoreOption that tunes the relevance score."
``SearchAutocomplete`` | ||
---------------------- | ||
|
||
.. class:: SearchAutocomplete(path, query, *, fuzzy=None, token_order=None, score=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add these for the rest of the file.
docs/source/ref/models/search.rst
Outdated
``SearchExists`` | ||
---------------- | ||
|
||
Atlas Search expression that matches documents where a field exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjust the style:
chop "Atlas Search expression that ..." prefixes in favor of "Matches..." and "Uses..." (next paragraph).
docs/source/ref/models/search.rst
Outdated
This expression uses the | ||
:doc:`moreLikeThis operator <atlas:atlas-search/morelikethis>` to retrieve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reflow cases like this. You can break anywhere inside a refs/doc/, e.g.
This expression uses the :doc:`moreLikeThis operator
<atlas:atlas-search/morelikethis>` to retrieve...
self.assertAlmostEqual(scored.score, 10.0, places=2) | ||
|
||
|
||
@unittest.expectedFailure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the problem? (add a comment...)
@@ -11,3 +11,4 @@ Model API reference. | |||
querysets | |||
models | |||
indexes | |||
search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a mention on docs/source/index.rst too (have forgotten this on some recent changes)
def __str__(self): | ||
cls = self.identity[0] | ||
kwargs = dict(self.identity[1:]) | ||
arg_str = ", ".join(f"{k}={v!r}" for k, v in kwargs.items()) | ||
return f"<{cls.__name__}({arg_str})>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it tested? (I'm trying to get a coverage report working but I didn't spot any str(
in tests.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not. Will add a test.
assertCountEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertCountEqual) | ||
assertListEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertListEqual) | ||
assertQuerySetEqual = _delayed_assertion(timeout=2)(TransactionTestCase.assertQuerySetEqual) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future readers, please a a comment explaining the reasons the delayed assertions are needed (as discussed in PR comments). Likewise for index waiting.
tests/atlas_search_/test_search.py
Outdated
from .models import Article, Location, Writer | ||
|
||
|
||
def wait_until_index_ready(collection, index_name, timeout: float = 30, interval: float = 0.5): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30 seconds is a long time. It would be nice to explain if it really could take that long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so and I hope it never takes that long, will set 5 seconds as a default.
tests/atlas_search_/test_search.py
Outdated
raise TimeoutError(f"Index {index_name} not ready after {timeout} seconds") | ||
|
||
|
||
def _delayed_assertion(timeout: float = 120, interval: float = 0.5): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it really take 120 seconds... or 2 minutes?
To generate a coverage report:
Pretty good. Tests for some methods and some expression parameters are missing. No tests for |
token_order: Optional value for `"tokenOrder"`; controls sequential vs. | ||
any-order token matching. | ||
score: Optional[SearchScore] expression to adjust score relevance | ||
(e.g., `{"boost": {"value": 5}}`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SearchScore({...}
(or maybe the example isn't needed since none of the other docstrings have it. (I don't think docstrings should be very elaborate documentation, more just to help developers working on the code... they can always consult the proper docs if need be.)
docs/source/ref/models/search.rst
Outdated
<QuerySet [<Article: headline: title>]> | ||
The ``path`` argument can be either the name of a field (as a string), or a | ||
:class:`~django.db.models.F` instance. The ``value`` argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, isn't wrapping in F
not semantically correct? An F
object represents the value of field, not its path. That's why you had to add the as_path
argument as a workaround. Should the path strings be wrapped in Col
instead?
Added. |
No description provided.