Spike Testing

For spike testing, we reused the JMeter scripts created for load testing the service and added a Synchronizing Timer to each request. The timer holds every thread until all of them have reached it and then releases them together, causing a sudden spike in the requests sent to the server.
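The release-together behaviour described above can be illustrated outside JMeter with a barrier. This is only a minimal Python sketch of the idea, not the JMeter configuration used in the tests; `run_spike` and `fake_request` are hypothetical names, and `fake_request` stands in for a real call to the service.

```python
import threading
import time


def run_spike(num_threads, send_request):
    """Release num_threads workers at the same moment and collect their results."""
    barrier = threading.Barrier(num_threads)
    start_times = [0.0] * num_threads
    results = [None] * num_threads

    def worker(index):
        # Each worker blocks here until all num_threads workers have arrived,
        # then every worker is released at once, which produces the spike.
        barrier.wait()
        start_times[index] = time.monotonic()
        results[index] = send_request(index)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return start_times, results


def fake_request(user_id):
    # Hypothetical stand-in for a real HTTP request to the system under test.
    return f"ok-{user_id}"


if __name__ == "__main__":
    starts, responses = run_spike(50, fake_request)
    spread = max(starts) - min(starts)
    print(f"{len(responses)} requests released within {spread:.4f}s of each other")
```

Without the barrier, each thread would send its request as soon as it started, so the load would ramp up gradually instead of arriving as a single burst.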

We tested each functionality provided by the system with up to 10000 concurrent users. We then scaled the system to two replicas and saw a significant improvement in the results with the additional replica.

Summary

Based on the tests conducted, we concluded that our system can easily handle a spike of 1000 concurrent users with an error rate of less than 0.5% for any of the requests made to the service. The system's capacity can be increased by adding replicas to handle workloads of around 10000 users, for which the error rate is quite high on a single replica. Some parts of the system, such as the image-service, had to be adapted so that they can scale with replicas.

Some implementation logic was also changed to handle the requests better. For example, the async/await calls in the gateway-service were changed so that it no longer waits unnecessarily on the session-log service to log the session.
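The kind of change described above can be sketched as follows. This is an illustrative Python `asyncio` example under stated assumptions, not the gateway-service code: the function names are hypothetical, the session-log call is simulated with a short sleep, and the actual gateway may be written in a different language.

```python
import asyncio


async def log_session(user_id, log):
    # Hypothetical stand-in for the call to the session-log service.
    await asyncio.sleep(0.05)
    log.append(user_id)


async def handle_blocking(user_id, log):
    # Before: the response is delayed until the session has been logged.
    await log_session(user_id, log)
    return f"response-{user_id}"


async def handle_non_blocking(user_id, log, background):
    # After: schedule the logging call and return without waiting for it.
    task = asyncio.create_task(log_session(user_id, log))
    # Keep a reference so the task is not garbage-collected before it finishes.
    background.add(task)
    task.add_done_callback(background.discard)
    return f"response-{user_id}"


async def demo():
    log, background = [], set()
    response = await handle_non_blocking(7, log, background)
    logged_at_response = list(log)  # logging has not completed at this point
    await asyncio.gather(*background)  # drain pending log calls before shutdown
    return response, logged_at_response, log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off in this sketch is that a failed logging call no longer fails the request, so errors from the background task would need to be handled separately.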

We decided to test the system as a whole, one functionality at a time, because we were not able to find any good tools for spike testing gRPC services; JMeter does not support gRPC out of the box.

Sign In

The sign-in functionality works without any visible hiccups up to 1000 users. On a single replica, the failure rate is about 0.20% for 1000 users and around 28% for 10000 simultaneous users. With 2 replicas, the error rate for 1000 users falls to 0% and the error rate for 10000 users falls to just below 8%. Thus, the system can handle more concurrent users with more replicas, and the requests are being distributed properly among the pods.
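As a reading aid for these percentages, the error rate is the share of failed samples among all samples. The short Python sketch below assumes exactly one request per simulated user, which the report does not state; the helper name `error_rate` and the failure count are hypothetical and chosen only to reproduce the reported 0.20%.

```python
def error_rate(failed, total):
    """Return the percentage of failed samples out of all samples."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


if __name__ == "__main__":
    # Hypothetical counts: 2 failed samples out of 1000 correspond to 0.20%,
    # assuming each simulated user issues exactly one sign-in request.
    print(f"{error_rate(2, 1000):.2f}%")
```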

Here are screenshots of the results from the JMeter script executions. For 500 users with one replica,

500-signin-replica1 500-signin-replica1-summary

For 500 users with two replicas,

500-signin-replica2 500-signin-replica2-summary

For 1000 users with one replica,

1000-signin-replica1 1000-signin-replica1-summary

For 1000 users with two replicas,

1000-signin-replica2 1000-signin-replica2-summary

For 10K users with one replica,

10k-signin-replica1 10k-signin-replica1-summary

For 10K users with two replicas,

10k-signin-replica2

Signup

The sign-up functionality starts to break at a spike of around 1000 users, with an error rate of 3.40% on a single replica and 0.20% on 2 replicas. The service shows an error rate of just 0.57% for 10000 users on 2 replicas.

For 500 users with one replica,

500-signup-replica1 500-signup-replica1-summary

For 1000 users with one replica,

1000-signup-replica1 1000-signup-replica1-summary

For 1000 users with two replicas,

1000-signup-replica2 1000-signup-replica2-summary

For 10K users with two replicas,

10k-signup-replica1 10k-signup-replica1-summary

Image Functionalities

The image-related functionalities require significant time because they depend on external system calls. Hence the error rate for these functionalities is higher. The error rate for 10000 users on a single replica is around 38.30%, which goes down to 19.42% on 2 replicas. Surprisingly, however, these functionalities show no errors for 1000 users even on a single replica.

For 500 users with one replica,

500-image-replica1 500-image-replica1-summary