Possibility for a data race in the view? #185
-
You are correct that race conditions can occur in queries even when the command side is completely protected. This can even be common in some scenarios, because CQRS is designed explicitly to allow these two sides to be very far apart, both logically and physically. For instance, you might have an Account aggregate but a query that collects Customer information by gathering any relevant events and placing them on a Kafka stream.

This is similar to the problem that CQRS provides no guarantee that an event is successfully processed by all queries: its guarantee ends at the event committal, and further protections are needed within the query itself. It's good practice to always pass ordering information (usually aggregate type + aggregate ID + sequence number) all the way down to the lowest level of logic, so that queries can recognize when events arrive out of order. With that information there are a variety of ways to deal with the problem.
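As a minimal sketch of that idea (the names here are hypothetical, not part of this crate's API): a view records the last sequence number it applied and rejects anything that doesn't follow directly, leaving the caller free to buffer, replay, or alert.

```rust
/// Hypothetical view state that tracks the last sequence number it applied.
struct AccountView {
    last_sequence: usize,
    balance: i64,
}

/// An incoming event envelope carrying the ordering information
/// (aggregate type + aggregate ID + sequence number) mentioned above.
struct Envelope {
    aggregate_type: String,
    aggregate_id: String,
    sequence: usize,
    event: AccountEvent,
}

enum AccountEvent {
    Deposited { amount: i64 },
    Withdrawn { amount: i64 },
}

enum ApplyError {
    /// The event did not arrive with sequence `last_sequence + 1`; the
    /// caller can buffer it, re-read the event stream, or raise an alert.
    OutOfOrder { expected: usize, got: usize },
}

impl AccountView {
    fn apply(&mut self, envelope: &Envelope) -> Result<(), ApplyError> {
        let expected = self.last_sequence + 1;
        if envelope.sequence != expected {
            return Err(ApplyError::OutOfOrder { expected, got: envelope.sequence });
        }
        match envelope.event {
            AccountEvent::Deposited { amount } => self.balance += amount,
            AccountEvent::Withdrawn { amount } => self.balance -= amount,
        }
        self.last_sequence = envelope.sequence;
        Ok(())
    }
}
```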
-
Hi,
I've been trying to figure out if I'm seeing a potential for a data race. I could be wrong of course, so I'd appreciate your feedback~
This specific routine: cqrs/src/cqrs.rs, lines 166 to 190 in 5f31437.
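To make the discussion concrete, here is a simplified, self-contained model of the shape I believe this routine has (an illustration only, not the actual code). The comments mark what I'll refer to below as three separate consistency scopes:

```rust
struct EventStore;
struct ViewRepo;

impl EventStore {
    fn load_aggregate(&self, _id: &str) -> Vec<String> {
        vec![] // scope #1: read the aggregate's events
    }
    fn commit(&self, _id: &str, events: Vec<String>) -> Vec<String> {
        // scope #2: append events; the optimistic check here
        // protects the command side only
        events
    }
}

impl ViewRepo {
    fn load_view(&self, _id: &str) -> Vec<String> {
        vec![] // scope #3 begins: read the current view...
    }
    fn save_view(&self, _id: &str, _view: Vec<String>) {
        // ...and write it back; nothing ties this scope to scope #2
    }
}

fn execute(store: &EventStore, views: &ViewRepo, id: &str, command: &str) {
    let _history = store.load_aggregate(id);      // scope #1
    let new_events = vec![format!("event-for-{command}")];
    let committed = store.commit(id, new_events); // scope #2
    // <-- a second instance can interleave anywhere around here -->
    let mut view = views.load_view(id);           // scope #3
    view.extend(committed);
    views.save_view(id, view);
}

fn main() {
    execute(&EventStore, &ViewRepo, "account-1", "deposit");
}
```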
Consider the following flow, where two separate Lambda instances each handle a separate command at around the same time, and both commands end up updating the same aggregate. With incrementing timestamps, roughly:

1. Lambda 1 loads the aggregate, handles its command, and commits its events.
2. Lambda 1 experiences a network delay just before it's about to read the latest view.
3. In the meantime, Lambda 2 loads the aggregate, handles its own command, and commits its events.
Note that there was no lock contention here; the two commits succeeded in sequence. This is essentially saying that both lambdas are now in dispatch(), waiting to fetch the view (cqrs/src/cqrs.rs, line 187 in 5f31437).
In the meantime, Lambda 1's network delay has been resolved. Now we have a race: both lambdas read the current view, apply their own events to their copy, and write it back. Depending on the view repository implementation, this could either result in one of the two writes failing, or potentially silently allow both writes to happen, with the later write clobbering the earlier one.
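A tiny self-contained demonstration of the silent case, using a naive last-write-wins repository (purely for illustration):

```rust
use std::collections::HashMap;

/// A naive view repository with last-write-wins semantics (illustration only).
#[derive(Clone, Debug)]
struct View {
    applied_events: Vec<String>,
    version: usize,
}

struct NaiveViewRepo {
    views: HashMap<String, View>,
}

impl NaiveViewRepo {
    fn load(&self, id: &str) -> View {
        self.views.get(id).cloned().unwrap_or(View {
            applied_events: vec![],
            version: 0,
        })
    }
    /// No version check: a stale writer silently overwrites a newer view.
    fn save(&mut self, id: &str, view: View) {
        self.views.insert(id.to_string(), view);
    }
}

fn main() {
    let mut repo = NaiveViewRepo { views: HashMap::new() };

    // Both lambdas load the view before either has written (the race window).
    let mut view_in_lambda_1 = repo.load("account-1");
    let mut view_in_lambda_2 = repo.load("account-1");

    // Lambda 2 applies its event and writes first.
    view_in_lambda_2.applied_events.push("event-B".into());
    view_in_lambda_2.version += 1;
    repo.save("account-1", view_in_lambda_2);

    // Lambda 1's delayed write then clobbers it: event-B is lost.
    view_in_lambda_1.applied_events.push("event-A".into());
    view_in_lambda_1.version += 1;
    repo.save("account-1", view_in_lambda_1);

    // The view now records only event-A, though both events were committed.
    println!("{:?}", repo.load("account-1"));
}
```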
The dynamo-es crate has some protection against this in its update_view() routine, where it checks for the proper version / sequence. That could potentially make execute_with_metadata() return an error for one of the two lambdas. The client code could then try to re-apply the command that failed, which sounds good on the surface (as mentioned here). But there's a problem: we've already committed the event, so if we push the command again we'll end up with a duplicate event in the store.
It seems like views can drift and become out of sync with events.
I don't have a proof of concept for this yet though. I'd appreciate some feedback before I work on the PoC.
But the problems, or potential solutions, that I see here are:

- The execute_with_metadata() routine doesn't hold a single event store lock for the aggregate throughout its running time. Instead, three separate locks are held at different times: when the aggregate is being loaded, when the events are being committed, and when the view is being projected (the three scopes marked in the sketch above).

Thanks a lot~