Skip to content

Conversation

@AlexVelezLl
Copy link
Member

Summary

When many sync sessions were happening concurrently, there were occasions where two certificates could have the same lft and rght. This was due to some concurrency problems in the internals of MTTP. Wrapping these Certificate creations on the remote server and on the local server (after the remote server's response) solved the problem.

TODO

  • Have tests been written for the new code?
  • Has documentation been written/updated?
  • New dependencies (if any) added to requirements file

Reviewer guidance

  • Try running multiple concurrent sync requests and check that the tree structure is not corrupted.

Issues addressed

Needed to fix learningequality/kolibri#13821

Documentation

If the PR has documentation, link the file here (either .rst in your repo or if built on Read The Docs)

@AlexVelezLl AlexVelezLl marked this pull request as ready for review January 7, 2026 22:55
Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly targeted fix that should address the specific observed issue well. Although, there are a few more ways and areas that Certificate is vulnerable to this issue. I'm suggesting a few revisions on what you've done so far and also suggest a way we can address this holistically for more complex scenarios.

We should also address this within push_signed_client_certificate_chain which is very similar to certificate_signing_request

Comment on lines 150 to 153
with transaction.atomic():
# lock the parent certificate row for update
parent = Certificate.objects.select_for_update().get(pk=serialized_cert.validated_data['parent'].pk)
serialized_cert.validated_data['parent'] = parent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like you did in the other case, where you wrapped a smaller amount of code, the same can be done here. The validation and certificate signing takes a non-negligible amount of time, since it performs cryptographic operations. So limiting what's wrapped in this transaction will reduce the time that this is actually blocking.

It should be safe to move your additions and the transaction to wrap only around the certificate.save() where MPTT should be involved.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried that, but for some reason, I was still getting the concurrency errors if I just wrapped the save method in the transaction. Will give it another try and see if there was something else happening around there!

Comment on lines 351 to 359
if not Certificate.objects.filter(id=parent_cert.id).exists():
cert_chain_response = self._get_certificate_chain(
params={"ancestors_of": parent_cert.id}
)

# upon receiving cert chain from server, we attempt to save the chain into our records
Certificate.save_certificate_chain(
cert_chain_response.json(), expected_last_id=parent_cert.id
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs similar protection. Although, this is dealing with a situation where the ancestors of the certificate may not exist. Since Kolibri doesn't currently use more than 2 levels of certificates, this shouldn't encounter the same MPTT issue. Nevertheless, with 3 or more, it has potential for it.

The difference here though is that we don't have those records to lock via select_for_update. For SQLite, we can trust that the DB is configured to help us enforce this simply by using a transaction (what Richard recently fixed in Kolibri, although we should document this expectation for Morango). For PostgreSQL, we can use advisory locks to block a transaction using something we'll know to be common between parallel executions. This is something we used on studio. The advisory lock pattern is therefore a bit more flexible and allows us to create a reusable pattern for this particular situation.

We should be able to use the same lock_partitions function that is used for sync locking. We'll just need to lock based off the root certificate.

So this would become something like:

if not Certificate.objects.filter(id=parent_cert.id).exists():
    cert_chain_response = self._get_certificate_chain(
        params={"ancestors_of": parent_cert.id}
    )
    with transaction.atomic():
        lock_partitions(backend, sync_filter=cert_chain_response[0]["id"])
        # check again, now that we have a lock
        if not Certificate.objects.filter(id=parent_cert.id).exists():
            Certificate.save_certificate_chain(
                cert_chain_response.json(), expected_last_id=parent_cert.id
            )

We could use this advisory lock pattern everywhere, too, instead of select_for_update in the case we actually have records. It also has the advantage of protecting more complex certificate structures with 3+ levels, since that opens up more surface area for where the select_for_update lock may not protect the integrity of the tree during concurrent writes across it, but like I said, Kolibri doesn't leverage that currently.

@AlexVelezLl AlexVelezLl force-pushed the prevent-certificate-tree-corruption branch from bb4d936 to 8523ad3 Compare January 8, 2026 17:21
@AlexVelezLl
Copy link
Member Author

Thanks @bjester! I have moved to use lock_partitions in all cases instead of select_for_update. How does it look? The only other thing I am not entirely sure about is whether we should lock all Certificate saves? Or should we be calling this lock_mptt context manager outside the Model class?

@AlexVelezLl AlexVelezLl force-pushed the prevent-certificate-tree-corruption branch from 8523ad3 to a9bf52c Compare January 8, 2026 17:28
@bjester
Copy link
Member

bjester commented Jan 8, 2026

How does it look? The only other thing I am not entirely sure about is whether we should lock all Certificate saves? Or should we be calling this lock_mptt context manager outside the Model class?

It looks pretty great! I think what you did, wrapping all saves, is fine. I think it's fine primarily because certificates are rarely created or updated. Having the lock managed outside of the model class is a little more portable and consistent, but I don't think it's necessary to change. I'm going to try running the morango-integration tests on Kolibri with this change. If it all looks good, I will approve and merge. Thanks!

Copy link
Member

@bjester bjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AlexVelezLl

@bjester bjester merged commit 92278a5 into learningequality:release-v0.8.x Jan 8, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants