-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-1256: specify how drivers set timeouts on explain commands #1640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f6e6cc5
to
9acd138
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! mostly small clarifications
source/client-side-operations-timeout/client-side-operations-timeout.md
Outdated
Show resolved
Hide resolved
source/client-side-operations-timeout/client-side-operations-timeout.md
Outdated
Show resolved
Hide resolved
|
||
If a driver provides an explain helper, drivers MUST take care to ensure that timeoutMS is correctly applied to the | ||
explain command, when specified. Care should be taken by drivers with a fluent API - the following example | ||
should apply a timeoutMS of 1000 to the `explain` command: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this example. Per the discussion with product, the most desirable behavior would be to ignore the timeout passed to the find
in favor of a timeout given explicitly to explain
.
I don't see any guidance in the CSOT spec on fluent API handling for other commands, but I imagine we want to do the same thing for all of them (so for us, if we chain a count
, we would expect the maxTimeMS
count
option or CSOT equivalent to be respected, and not inherit it from the find
). But I could be mistaken, so perhaps a CSOT expert could chime in here (@ShaneHarvey maybe?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maxTimeMS needs special handling for explain because maxTimeMS takes two separate meanings: it can mean the value of maxTimeMS applied to the explained command, or it can be the value applied to the explain command. because maxTimeMS impacts the command being built by the driver, it leads to ambiguity in a fluent-api when used with explain. Does collection.find(.., { maxTimeMS: 1000 }).explain()
build { explain: { find: ..., maxTimeMS: 1000 } }
or { explain: { find: .. }, maxTimeMS: 1000 }
? That's why it's preferable to make users explicitly state the maxTimeMS they want to use for the explain. In Node, that looks like collection.find({}, { explain: { maxTimeMS: 1000, .. })
or collection.find({}).explain({ maxTimeMS: ... })
.
On the other hand, timeoutMS is only ever applied to the actual operation that is executed and doesn't impact the command the driver constructs. The outcome of collection.find({}, { timeoutMS: 1000 }).explain()
should be the same as collection.find({}).explain({ timeoutMS: 1000 })
because 1. the command being constructed and executed doesn't change based on timeoutMS in either example and 2. there's a single command being executed, which should be governed by the timeoutMS 1000 setting.
It seems like a separate question as to whether fluent-api helpers should accept a timeoutMS option. I think yes, and I think that's already covered in the Operations section of the spec - fluent APIs fall under CRUD or other specs mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The outcome of
collection.find({}, { timeoutMS: 1000 }).explain()
should be the same ascollection.find({}).explain({ timeoutMS: 1000 })
That's the part I would like a second opinion on.
If I didn't know anything about the driver's implementation, I personally would expect collection.find({}, { timeoutMS: 1000 }).explain()
to ignore the timeoutMS
set in the options of the find
stage (because if I am writing this code, then I probably copy-pasted my .find
from production code in its entirety with the expectation that when I run explain that it will attach any applicable options to the explained command the same way as in production code and ignore ones that aren't relevant).
If I understand correctly, it sounds like you are thinking of the command builder chain as .find(<find predicate>, <chain options>).explain(<chain options>)
; whereas I am thinking of it as .find(<find predicate>, <find execution options>).explain(<explain execution options>)
. Is that correct? Or are you thinking of it as .find(<find predicate>, <find command or chain execution options>).explain(<explain command or chain execution options>)
, where timeoutMS
is always considered a "chain execution" option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was I was thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a some cases where the proposed behavior is difficult to implement or leads to unexpected results:
Non-fluent APIs
The proposed behavior is suited to fluent APIs, but could be impractically difficult to implement for drivers that do not use fluent APIs for explain
helpers. For example, consider a possible Go explain
helper:
client.Explain(context.Background(), func(ctx context.Context) error {
ctx, cancel := context.WithTimeout(ctx, 10 * time.Second)
defer cancel()
return collection.FindOne(ctx, bson.D{}).Err()
})
It may be impractically difficult to detect and apply the FindOne
timeout to Explain
.
Calculated maxTimeMS
Consider the following operation:
collection.find({ name: 'john doe' }, { timeoutMS: 1000 });
The CSOT spec requires drivers to append a calculated maxTimeMS
value to commands:
If
timeoutMS
is set, drivers MUST append amaxTimeMS
field to commands executed against a MongoDB server using theminRoundTripTime
field of the selected server.
So the driver should send a command like:
{ find: <collection>, query: { name: 'john doe' }, maxTimeMS: <1000 - min RTT> }
Now consider an explain
for the same operation:
collection.find({ name: 'john doe' }, { timeoutMS: 1000 }).explain();
If we apply the timeout to the explain
operation, the driver would change the command to be explained:
{
explain: { find: <collection>, query: { name: 'john doe' } },
maxTimeMS: <1000 - min RTT>
}
Instead, the driver should send the same command, but as the body of an explain
command:
{
explain: { find: <collection>, query: { name: 'john doe' }, maxTimeMS: <1000 - min RTT> },
}
Low timeout
If a user is running explain
on an operation with a very low timeout, applying the operation timeout to the explain
command may cause it to always time-out.
For example, consider the explain operation:
collection.find({}, { timeoutMS: 10 }).explain();
A 10ms timeout may be too short to get a reply, especially if the user is trying to run the explain
from a different host than the base operation. The user would have to change the operation to be explained and hope the result is still valid.
Considering those cases, I think we should not apply the timeout on the base operation in explain
helpers. Users should specify a separate timeout if that's the behavior they want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed response, Matt.
Non-fluent APIs
Yeah, this is a good point and something I hadn't considered. I'm also not familiar enough with Go to even really understand how this code snippet would work (would the explain context somehow override command execution logic in the driver to intercept any commands being sent on it and wrap them in explain?).
Calculated maxTimeMS
So, you've actually outlined exactly what I understood as the expected behavior.
First, the find
is not being executed. The explain
is the command being executed. I wouldn't expect CSOT logic to apply to apply to the decorated command.
Second, this may be related to differences in explain implementations, but it's infeasible to attach maxTimeMS to the internal command (find in this case). When the Node driver runs a find().explain()
, we first construct the find command, and then wrap it in explain. This happens before connection checkout. maxTimeMS
is calculated and attached before writing to the socket - we'd need special logic to determine if the command being decorated was an explain command, and if we should decorate the internal explain command.
Low timeout
Conversely, it may be a better UX to default to applying a timeout than to default to not applying a timeout. Suppose a user copy-pasted code directly from their prod environment into a test environment and explained it (this is more-or-less an actual use case for the Node driver via mongosh). The user might be surprised to run an explain and have their timeout ignored, and need to resort to killOp to cancel it if they need to. Defaulting to a timeoutMS could result in timeouts if timeoutMS is set too low, but that at least immediately informs the user of the issue and they can rectify it and try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't expect CSOT logic to apply to apply to the decorated command.
I'm more concerned about the command document sent to explain
being different than what's sent when the driver runs the op, which could make the explain output invalid. Can the presence or value of maxTimeMS
on a command ever change the execution plan?
I'm also not familiar enough with Go to even really understand how this code snippet would work
That API is purely speculative, but is based on the pattern the Go Driver uses for WithTransaction
and could be implemented in a similar way (i.e. by modifying op behavior by checking for special values on ctx
). Implementing the proposed timeout behavior in that API would be difficult and nonintuitive.
Conversely, it may be a better UX to default to applying a timeout than to default to not applying a timeout.
I think whether or not the UX is intuitive depends a lot on the driver/language. IMO requiring all drivers to "pass-forward" the operation timeout to explain
is too prescriptive, but I can see where it makes sense for some drivers. Should we change this requirement to "MAY" instead of "MUST"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the presence or value of maxTimeMS on a command ever change the execution plan?
I don't think so, but I started a thread with the query team to confirm: https://mongodb.slack.com/archives/CKABWR2CT/p1726167755099669.
I think whether or not the UX is intuitive depends a lot on the driver/language. IMO requiring all drivers to "pass-forward" the operation timeout to explain is too prescriptive, but I can see where it makes sense for some drivers. Should we change this requirement to "MAY" instead of "MUST"?
The ask from product is to ensure that drivers provide users a way to timeout the explain command specifically. I think "MUST" is necessary to satisfy that requirement. I rephrased this section a bit - I think it's a bit more flexible but still mandates that drivers have a timeout for explain
. Let me know what you think of the new phrasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated description sounds great! Also, thanks for starting the conversation about maxTimeMS
effect on explain
!
source/client-side-operations-timeout/client-side-operations-timeout.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
batchable suggestions
95f26d1
to
6801ed1
Compare
source/crud/crud.md
Outdated
// sends: | ||
{ | ||
explain: { find: <collection>, query: { name: 'john doe' } }, | ||
maxTimeMS: <1000 - average rtt> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calculation of maxTimeMS
based on timeoutMS
uses minimum round-trip time (see here).
maxTimeMS: <1000 - average rtt> | |
maxTimeMS: <1000 - min rtt> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! 👍
This PR specifies that drivers with explain helpers must provide a mechanism to set timeouts on the explain command.
Node POC: mongodb/node-mongodb-native#4207
Please complete the following before merging:
clusters, and serverless).