Metadata port provisioning failures #3628

@okozachenko1203

Description

Summary

When multiple Neutron API workers update the same metadata port concurrently, the Neutron and OVN databases can diverge: OVN ends up with fewer IP addresses than Neutron even though both report the same revision number.

Environment

  • OpenStack with Neutron using OVN backend
  • Multiple Neutron API workers (distributed across multiple nodes)
  • Galera cluster for Neutron database
  • OVN with clustered OVSDB (Raft consensus)

Problem Description

Observed Behavior

When creating multiple subnets on a network in rapid succession, the metadata port gets updated by multiple workers simultaneously. This results in:

  • OVN database having only 2 IP addresses for the metadata port
  • Neutron database having 4 IP addresses for the same port
  • Both databases showing the same revision number (5)
  • No RevisionConflict exceptions being raised

All updates happened within 29ms across different controller nodes.
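To make the symptom concrete, here is a minimal sketch of a drift check between the two views of the port. The dict shapes, the `find_ip_drift` helper, and the IP addresses are made up for illustration; this is not how Neutron or OVN expose the data.

```python
def find_ip_drift(neutron_port, ovn_port):
    """Compare two simplified port records (hypothetical shapes)."""
    neutron_ips = {ip["ip_address"] for ip in neutron_port["fixed_ips"]}
    ovn_ips = set(ovn_port["ips"])
    return {
        "same_revision":
            neutron_port["revision_number"] == ovn_port["revision_number"],
        "missing_in_ovn": sorted(neutron_ips - ovn_ips),
        "extra_in_ovn": sorted(ovn_ips - neutron_ips),
    }


# Illustrative data only: 4 IPs in Neutron, 2 in OVN, both at revision 5.
neutron_port = {
    "revision_number": 5,
    "fixed_ips": [
        {"ip_address": "10.0.0.2"},
        {"ip_address": "10.0.1.2"},
        {"ip_address": "10.0.2.2"},
        {"ip_address": "10.0.3.2"},
    ],
}
ovn_port = {"revision_number": 5, "ips": ["10.0.0.2", "10.0.1.2"]}

drift = find_ip_drift(neutron_port, ovn_port)
print(drift["same_revision"])   # True
print(drift["missing_in_ovn"])  # ['10.0.2.2', '10.0.3.2']
```

The point is that comparing revision numbers alone reports the port as in sync; only comparing the IP sets exposes the gap.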

Root Cause Analysis

Transaction Isolation Issue

  • Neutron uses REPEATABLE READ isolation level
  • Each worker starts a transaction and reads the metadata port state
  • Under REPEATABLE READ, each worker's reads come from a snapshot taken at its first read, so updates committed by other workers afterwards are invisible to it
  • All workers read the same initial state and proceed with stale data
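The snapshot behaviour described above can be sketched with a toy in-memory model. `ToyDatabase` and `ToyTransaction` are hypothetical stand-ins, not Neutron or SQLAlchemy classes, and the model covers only plain reads.

```python
import copy


class ToyDatabase:
    """Holds the latest committed port state (hypothetical stand-in)."""

    def __init__(self, port):
        self.committed = copy.deepcopy(port)

    def commit(self, port):
        self.committed = copy.deepcopy(port)


class ToyTransaction:
    """Reads come from a snapshot taken at the first read."""

    def __init__(self, db):
        self._db = db
        self._snapshot = None

    def read_port(self):
        if self._snapshot is None:
            self._snapshot = copy.deepcopy(self._db.committed)
        return copy.deepcopy(self._snapshot)


db = ToyDatabase({"fixed_ips": ["subnet-a"]})
worker_1 = ToyTransaction(db)
worker_2 = ToyTransaction(db)

# Both workers read the same initial state.
first_read_1 = worker_1.read_port()
first_read_2 = worker_2.read_port()

# Worker 1 adds an IP and commits.
updated = worker_1.read_port()
updated["fixed_ips"].append("subnet-b")
db.commit(updated)

# Worker 2 reads again inside its still-open transaction.
second_read_2 = worker_2.read_port()
print(db.committed["fixed_ips"])   # ['subnet-a', 'subnet-b']
print(second_read_2["fixed_ips"])  # ['subnet-a']
```

Worker 2 keeps seeing the pre-commit state for as long as its transaction stays open, which is the stale read the rest of this report builds on.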

Code Flow

The race occurs in update_metadata_port (neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py:2784-2849):

def update_metadata_port(self, context, network, subnet=None):
    # Worker reads metadata port once at the beginning
    metadata_port = self.create_metadata_port(context, network)  # Line 2814

    # Uses this stale metadata_port throughout the function
    port_subnet_ids = {ip['subnet_id'] for ip in metadata_port['fixed_ips']}  # Line 2820

    # Updates based on stale data
    if subnet_ids != port_subnet_ids:
        update_metadata_port_fixed_ips(metadata_port,  # Passes stale port
                                       subnet_ids - port_subnet_ids,
                                       port_subnet_ids - subnet_ids)
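A simplified sketch of how the set arithmetic above behaves when two workers start from the same stale port. The helper names, the subnet IDs, and the last-writer-wins backend are assumptions for illustration, not the actual Neutron code path.

```python
def plan_fixed_ip_changes(subnet_ids, port):
    """Return (subnets to add, subnets to remove) for a port snapshot."""
    port_subnet_ids = {ip["subnet_id"] for ip in port["fixed_ips"]}
    return subnet_ids - port_subnet_ids, port_subnet_ids - subnet_ids


def apply_changes(port, to_add, to_remove):
    """Build the new port state from a snapshot plus the planned changes."""
    kept = [ip for ip in port["fixed_ips"] if ip["subnet_id"] not in to_remove]
    added = [{"subnet_id": s} for s in sorted(to_add)]
    return {"fixed_ips": kept + added}


# Both workers hold the same stale snapshot of the metadata port.
stale_port = {"fixed_ips": [{"subnet_id": "s1"}]}

# Worker A only sees its own new subnet s2; worker B only sees s3.
add_a, remove_a = plan_fixed_ip_changes({"s1", "s2"}, stale_port)
result_a = apply_changes(stale_port, add_a, remove_a)

add_b, remove_b = plan_fixed_ip_changes({"s1", "s3"}, stale_port)
result_b = apply_changes(stale_port, add_b, remove_b)

# Assumed backend behaviour: each write replaces the full IP list.
backend = result_a
backend = result_b

final = sorted(ip["subnet_id"] for ip in backend["fixed_ips"])
print(final)  # ['s1', 's3']
```

Each worker's diff is correct relative to its own snapshot, yet the second write silently drops `s2`. Neither worker has any way to notice, because neither re-reads the port before writing.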

Why StaleDataError Isn't Raised

SQLAlchemy's version_id_col support (via revision_number) does not catch this race, for several reasons:

  • Port object not directly modified: Updates to fixed_ips modify related IPAllocation objects, not the Port object itself
  • Indirect revision bumping: IPAllocation has revises_on_change = ('port',) which bumps Port revision indirectly
  • REPEATABLE READ prevents version detection: Even when revision is bumped, other workers can't see it due to transaction isolation
  • Insufficient revision checking: CheckRevisionNumberCommand only prevents the revision from going backwards; it does not check whether the revision has changed since the initial read
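The last point can be illustrated with a toy model that contrasts a "not older" check with a "unchanged since read" check. `ToyBackend` and its methods are hypothetical and only mimic the behaviour described above; they are not the real CheckRevisionNumberCommand. The write ordering and IP names are assumed for the example.

```python
class RevisionConflict(Exception):
    pass


class ToyBackend:
    """Hypothetical backend row holding a revision and an IP list."""

    def __init__(self, revision, fixed_ips):
        self.revision = revision
        self.fixed_ips = list(fixed_ips)

    def write_if_not_older(self, revision, fixed_ips):
        # Rejects only writes whose revision is lower than the stored one.
        if revision < self.revision:
            raise RevisionConflict("revision went backwards")
        self.revision = revision
        self.fixed_ips = list(fixed_ips)

    def write_if_unchanged(self, expected_revision, revision, fixed_ips):
        # Rejects writes if the row changed since the caller read it.
        if self.revision != expected_revision:
            raise RevisionConflict("row changed since it was read")
        self.revision = revision
        self.fixed_ips = list(fixed_ips)


# Weak check: a stale write at the same revision is accepted.
weak = ToyBackend(4, ["ip-1"])
weak.write_if_not_older(5, ["ip-1", "ip-2", "ip-3", "ip-4"])
weak.write_if_not_older(5, ["ip-1", "ip-2"])  # stale data, same revision
print(weak.revision, len(weak.fixed_ips))  # 5 2

# Stricter check: the second writer read revision 4, but the row is now at 5.
strict = ToyBackend(4, ["ip-1"])
strict.write_if_unchanged(4, 5, ["ip-1", "ip-2", "ip-3", "ip-4"])
conflict_raised = False
try:
    strict.write_if_unchanged(4, 5, ["ip-1", "ip-2"])
except RevisionConflict:
    conflict_raised = True
print(conflict_raised, len(strict.fixed_ips))  # True 4
```

With the weak check the backend ends at revision 5 with two IPs and no exception, which resembles the observed state. Whether the real writes arrived in this order is an assumption.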

Call Stack

create_subnet (API call from different workers)
↓
_create_subnet_postcommit (ML2 plugin)
↓
create_subnet (OVN client)
↓
update_metadata_port (reads port once, uses stale data)
↓
update_port (ML2 plugin with REPEATABLE READ transaction)
↓
OVN database update (with stale fixed_ips)
