ProjectIpAccessList: Delete-all-recreate-all update strategy causes downtime and race conditions #1489

@kundkingan

Summary

The MongoDB::Atlas::ProjectIpAccessList resource uses a delete-all-then-recreate-all strategy during updates, which causes:

  1. Potential connectivity downtime during deployments
  2. Race conditions when deploying to multiple regions simultaneously

Current Behavior

Looking at the update handler source code, the update operation:

  1. Deletes ALL entries (both previous and current model entries)
  2. Recreates only the entries in the current configuration

    entriesToDelete := currentModel.AccessList
    entriesToDelete = append(entriesToDelete, prevModel.AccessList...)

    progressEvent := deleteEntriesForUpdate(entriesToDelete, ...)

Expected Behavior

The update handler should compute a diff and only:

  • Delete entries that were removed from the configuration
  • Add entries that are new
  • Update comments on unchanged IPs (if applicable)

Unchanged entries would then never be deleted, so they would stay in effect for the whole update.
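
As a rough sketch of what such a diff could look like (this is not the provider's actual code): `Entry`, `Diff` and `diffEntries` are made-up names, entries are keyed by CIDR block only, and the example values in `main` are illustrative. A real handler would also need to account for the other entry types the resource supports.

```go
package main

import "fmt"

// Entry is a simplified, hypothetical stand-in for one access list entry.
type Entry struct {
	CIDRBlock string
	Comment   string
}

// Diff holds the changes needed to move from the previous model to the current one.
type Diff struct {
	ToAdd           []Entry // in curr but not in prev
	ToRemove        []Entry // in prev but not in curr
	ToUpdateComment []Entry // same CIDR in both, comment differs
}

// diffEntries compares two configurations keyed by CIDR block.
// Entries that are identical in both are left out of the result entirely.
func diffEntries(prev, curr []Entry) Diff {
	prevByCIDR := make(map[string]Entry, len(prev))
	for _, e := range prev {
		prevByCIDR[e.CIDRBlock] = e
	}
	currCIDRs := make(map[string]bool, len(curr))

	var d Diff
	for _, e := range curr {
		currCIDRs[e.CIDRBlock] = true
		old, existed := prevByCIDR[e.CIDRBlock]
		switch {
		case !existed:
			d.ToAdd = append(d.ToAdd, e)
		case old.Comment != e.Comment:
			d.ToUpdateComment = append(d.ToUpdateComment, e)
		}
	}
	for _, e := range prev {
		if !currCIDRs[e.CIDRBlock] {
			d.ToRemove = append(d.ToRemove, e)
		}
	}
	return d
}

func main() {
	// Illustrative values only.
	prev := []Entry{
		{CIDRBlock: "10.41.0.0/16", Comment: "prod CIDR (eu-north-1)"},
		{CIDRBlock: "192.0.2.0/24", Comment: "office"},
	}
	curr := []Entry{
		{CIDRBlock: "10.41.0.0/16", Comment: "prod CIDR (eu-north-1)"},
		{CIDRBlock: "198.51.100.0/24", Comment: "vpn"},
	}
	d := diffEntries(prev, curr)
	fmt.Println("add:", d.ToAdd)
	fmt.Println("remove:", d.ToRemove)
	fmt.Println("update comment:", d.ToUpdateComment)
}
```

With this shape, the shared VPC CIDR appears in none of the three lists, so the handler would have no reason to touch it.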

Impact

1. Downtime Window

During the deletion phase, the IP access list is temporarily empty or incomplete, blocking legitimate connections until recreation completes.
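
To make the window concrete, here is a small in-memory simulation (an illustration under assumptions, not the handler code): `accessList` and `deleteAllRecreateAll` are hypothetical names, the list is a plain map, and each step is applied one at a time the way sequential API calls would be. The example CIDRs other than the VPC one are illustrative.

```go
package main

import "fmt"

// accessList is a hypothetical in-memory stand-in for the project's IP access list.
type accessList map[string]bool

// deleteAllRecreateAll mimics the strategy described in this issue: delete every
// entry from both models, then add back the current ones.
// observe is called after every single change so intermediate states can be inspected.
func deleteAllRecreateAll(list accessList, prev, curr []string, observe func(accessList)) {
	// Same order as the excerpt: current model entries first, then previous ones.
	toDelete := append(append([]string{}, curr...), prev...)
	for _, cidr := range toDelete {
		delete(list, cidr)
		observe(list)
	}
	for _, cidr := range curr {
		list[cidr] = true
		observe(list)
	}
}

func main() {
	const vpc = "10.41.0.0/16"
	prev := []string{vpc, "192.0.2.0/24"}
	curr := []string{vpc, "198.51.100.0/24"} // vpc is unchanged between models
	list := accessList{vpc: true, "192.0.2.0/24": true}

	missing := 0
	deleteAllRecreateAll(list, prev, curr, func(l accessList) {
		if !l[vpc] {
			missing++
		}
	})
	fmt.Printf("intermediate states without %s: %d\n", vpc, missing)
}
```

Even though the VPC CIDR is identical in both models, it is absent in several intermediate states. Against the real API each of those states lasts as long as the calls that follow it.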

2. Race Condition with Multi-Region Deployments

We experienced a critical issue when running two CDK deployments simultaneously in different AWS regions. Both deployments included a shared IP access list entry:

{
  cidrBlock: vpcCidrBlock,
  comment: `${deployEnvironment} CIDR (${this.region})`,
}

The delete-all-recreate-all strategy caused a race condition where both deployments were deleting and recreating entries concurrently. Here's the MongoDB Atlas Activity Feed showing the race condition:

Timestamp                 Action     IP/CIDR         User
11/26/25 - 01:09:33 PM    Added      10.41.0.0/16    iyawkvot
11/26/25 - 01:09:32 PM    Removed    10.41.0.0/16    iyawkvot
11/26/25 - 01:09:28 PM    Removed    10.41.0.0/16    iyawkvot
11/26/25 - 01:09:24 PM    Added      10.41.0.0/16    iyawkvot
11/26/25 - 01:09:24 PM    Removed    10.41.0.0/16    iyawkvot

Reading the feed from bottom to top (oldest first), the same entry was removed three times and added twice within ten seconds, because the two concurrent deployments were fighting over the same shared entry.

Result: The VPC CIDR entry ended up being deleted, breaking connectivity for services in that VPC.

Suggested Fix

Implement a diff-based update strategy:

  1. Compute entries to add (in current model but not in Atlas)
  2. Compute entries to remove (in Atlas but not in current model)
  3. Only delete removed entries
  4. Only add new entries
  5. Leave unchanged entries untouched

This would eliminate both the downtime window and the race condition issue.
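
A minimal sketch of those five steps, assuming a hypothetical `AccessListAPI` interface in place of the real Atlas client (the actual SDK calls and the handler's model types differ). `fakeAtlas` is an in-memory test double that records every call, so it is easy to see that unchanged entries are never touched.

```go
package main

import (
	"fmt"
	"sort"
)

// AccessListAPI is a hypothetical interface; it is not the Atlas SDK.
type AccessListAPI interface {
	List() []string
	Add(cidr string)
	Delete(cidr string)
}

// fakeAtlas is an in-memory implementation that records each mutating call.
type fakeAtlas struct {
	entries map[string]bool
	calls   []string
}

func (f *fakeAtlas) List() []string {
	out := make([]string, 0, len(f.entries))
	for c := range f.entries {
		out = append(out, c)
	}
	sort.Strings(out) // deterministic order for the example
	return out
}

func (f *fakeAtlas) Add(cidr string) {
	f.entries[cidr] = true
	f.calls = append(f.calls, "add "+cidr)
}

func (f *fakeAtlas) Delete(cidr string) {
	delete(f.entries, cidr)
	f.calls = append(f.calls, "delete "+cidr)
}

// reconcile deletes entries that exist remotely but are not desired, adds entries
// that are desired but missing, and issues no call for entries present on both sides.
func reconcile(api AccessListAPI, desired []string) {
	want := make(map[string]bool, len(desired))
	for _, c := range desired {
		want[c] = true
	}
	existing := api.List()
	have := make(map[string]bool, len(existing))
	for _, c := range existing {
		have[c] = true
	}

	for _, c := range existing {
		if !want[c] {
			api.Delete(c)
		}
	}
	for _, c := range desired {
		if !have[c] {
			api.Add(c)
		}
	}
}

func main() {
	// Illustrative values only.
	atlas := &fakeAtlas{entries: map[string]bool{"10.41.0.0/16": true, "192.0.2.0/24": true}}
	reconcile(atlas, []string{"10.41.0.0/16", "198.51.100.0/24"})
	fmt.Println(atlas.calls)
}
```

One design caveat: taken literally, step 2 removes anything in Atlas that is missing from the current model, which would also remove entries managed outside this stack. A real handler would probably want to limit removals to entries that were in the previous model.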

Environment

  • Using AWS CDK with awscdk-resources-mongodbatlas
  • Multi-region deployments (eu-north-1, eu-west-1, etc.)
