-
Notifications
You must be signed in to change notification settings - Fork 35
Improve node upsert performances #5966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CodSpeed Performance ReportMerging #5966 will degrade performances by 20.7%Comparing Summary
Benchmarks breakdown
|
fdf3acd to
f653f07
Compare
98a5e4d to
5914a60
Compare
5914a60 to
ccaee6d
Compare
0227a88 to
7ec17d7
Compare
ca98d9d to
2eb8923
Compare
40c6ad2 to
ea953d1
Compare
| for constraint_group in schema_constraint_path_groups: | ||
| for schema_attribute_path in constraint_group: | ||
| for uniqueness_constraint_path in uniqueness_constraint_paths: | ||
| for schema_attribute_path in uniqueness_constraint_path.attributes_paths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that we generally leave migrations alone once they are merged, but not sure if others agree with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think code imported by the migration might change as well, so migration behavior might be slightly different anyway
| if uniqueness_constraint == hfid: | ||
| return UniquenessConstraintType.HFID | ||
| if all(path in hfid for path in uniqueness_constraint): | ||
| return UniquenessConstraintType.SUBSET_OF_HFID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have thought this was not possible.
I thought that with your latest changes to how we build the HFID that the uniqueness_constraints always contain the HFID. do you have an example of when a uniqueness constraint is a subset of the HFID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the uniqueness_constraints always contain the HFID.
Yes, but nothing prevents from having another uniqueness constraint that is a subset of the hfid. For instance the schema {"uniqueness_constraints": [["name__value"]], "human_friendly_id": ["name__value, color__value"]} will contain 2 uniqueness constraints after schema processing: [["name__value", "name__value, color__value"]]. While checking the name__value one, we would hit above condition
| class UniquenessConstraintViolation: | ||
| nodes_ids: set[str] | ||
| fields: list[str] | ||
| typ: UniquenessConstraintType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is out of scope for this PR, but it would be more SOLID to move all of this logic for uniqueness constraints and parsing schema paths into their own separate module(s) instead of putting it all on the BaseNodeSchema
I believe I tried to do this at some point, but did not succeed for a reason that I do not recall
| # Currently support for creating a node with a given id is not supported, | ||
| # so this will raise an error if node id does not exist in db. | ||
| # Note that upserting with an `id` in the payload should not happen then, as it would mean the node already | ||
| # exists, in which case we should update client side instead of upsert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments should only explain what this code is doing, anything referencing how the client should call this mutation belongs in a separate issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe comments should more explain why over what, it is important when reading this code to know why we need to handle id here and that this code could be probably removed. Maybe what you suggest is to create a dedicated sdk issue, and to link this issue here? But imo we need this information here, it would have saved me some time if I had them in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a change is required on the SDK, then this should be an issue on the SDK
this code should not be concerned with how the SDK builds its GraphQL requests
the first two lines of the comment are just describing the behavior of NodeManager.get_one(..., raise_on_error=True), which is unnecessary
the final two lines should be covered by an issue on the SDK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right the first two lines are unnecessary, and I definitely agree we need dedicated issues for clients
What I mean is that here it's worth to indicate that we need this path handling id because of how clients currently use this API, otherwise we could be tempted to remove this code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you're worried about someone mistakenly thinking this code is unnecessary and deleting it, then I'd say add a unit test that covers it and include the comments there
| if "hfid" in data: | ||
| # Node might already exist or not. Note that if it exists, an extra query is performed | ||
| # thus client should avoid specifying `hfid` as input. | ||
| # It is supposed to be pointless as payload already contains fields composing hfid (mandatory fields while creating). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. comments should only reference this code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same response than above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also should just be an issue on the SDK
ea953d1 to
1c89f5e
Compare
1c89f5e to
b1ef45c
Compare
| if violation.typ == UniquenessConstraintType.HFID: | ||
| error_msg = f"Violates uniqueness constraint '{'-'.join(violation.fields)}'" | ||
| errors = [ValidationError({field_name: error_msg}) for field_name in violation.fields] | ||
| raise HFIDViolatedError(errors, matching_nodes_ids=violation.nodes_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add unit tests for this, probably in TestNodeGroupedUniquenessConstraint
and graphql-level tests for upserting using an HFID would be good, but they might already exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some unit tests, we already have multiple ones for GraphQL upserts.
| # Currently support for creating a node with a given id is not supported, | ||
| # so this will raise an error if node id does not exist in db. | ||
| # Note that upserting with an `id` in the payload should not happen then, as it would mean the node already | ||
| # exists, in which case we should update client side instead of upsert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a change is required on the SDK, then this should be an issue on the SDK
this code should not be concerned with how the SDK builds its GraphQL requests
the first two lines of the comment are just describing the behavior of NodeManager.get_one(..., raise_on_error=True), which is unnecessary
the final two lines should be covered by an issue on the SDK
| if "hfid" in data: | ||
| # Node might already exist or not. Note that if it exists, an extra query is performed | ||
| # thus client should avoid specifying `hfid` as input. | ||
| # It is supposed to be pointless as payload already contains fields composing hfid (mandatory fields while creating). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also should just be an issue on the SDK
163479b to
6923eb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this all looks good. only outstanding thing it the comments that I've commented on
| # Currently support for creating a node with a given id is not supported, | ||
| # so this will raise an error if node id does not exist in db. | ||
| # Note that upserting with an `id` in the payload should not happen then, as it would mean the node already | ||
| # exists, in which case we should update client side instead of upsert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you're worried about someone mistakenly thinking this code is unnecessary and deleting it, then I'd say add a unit test that covers it and include the comments there
6923eb9 to
e294000
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I think we are just missing a newsfragment for the release note
Improve node upsert performances by going from a check-if-exist-then-create-or-update to a try-create-else-update behavior. This avoids the extra cost of checking whether the node already exists in the database. Some modifications around
NodeGroupedUniquenessConstraintnow differentiate hfid violation than the other ones, as violating hfid uniqueness constraint would mean the node already exists in the database.Note that the upsert mutation does not benefit from this performance boost if:
Regarding performances, the current benchmarks do not work properly yet, so they should be ignored (they are also failing here: #6109). I tested locally to create 5000 nodes on non-main branch with this fix versus release-1.2 and performance boost is 12% (76s vs 86s). Based on previous real scenarios observations (cf a slack discussion) I believe this may vary up to 15-20% depending on database state / schemas.