This repository was archived by the owner on Dec 19, 2025. It is now read-only.

Fault Tolerance Testing

Vishesh edited this page Apr 21, 2021 · 1 revision

Setup

To test fault tolerance, we need to observe how the system behaves when one of its services crashes.

One option was Kube-Monkey, a tool based on Netflix's Chaos Monkey that randomly crashes services according to configurable selection and scheduling parameters. We explored this option, but although the tool is simple in concept, it needed a fair amount of configuration and did not make much sense for our case. It is an excellent option for complex architectures with complicated dependencies between microservices. However, since our architecture is simple and its dependencies are easily identifiable, we decided to manually decrease the number of pods of a given service and study the behavior of the system.

Tool

We used kubectl's scale command to reduce the number of pods of a service, along with Postman to test whether the system remains available while that service is down.
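
The manual crash described above could be scripted roughly as follows. This is a minimal sketch, assuming each service runs as a Kubernetes Deployment; the deployment name `auth-service`, the `default` namespace, and the helper names `scale_command` and `scale` are hypothetical and not taken from the project. Scaling to zero replicas simulates a crash of every pod, and scaling back up restores the service.

```python
import subprocess
from typing import List


def scale_command(deployment: str, replicas: int, namespace: str = "default") -> List[str]:
    """Build the kubectl argument list that sets a deployment's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}",
        "--namespace", namespace,
    ]


def scale(deployment: str, replicas: int, namespace: str = "default") -> None:
    """Run the scale command; this needs kubectl and access to the cluster."""
    subprocess.run(scale_command(deployment, replicas, namespace), check=True)


if __name__ == "__main__":
    # Only print the commands, so the sketch is safe to run without a cluster.
    print(" ".join(scale_command("auth-service", 0)))  # simulate a full crash
    print(" ".join(scale_command("auth-service", 2)))  # restore two replicas
```

Separating command construction from execution keeps the sketch easy to inspect before anything is run against a real cluster.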

Observations

The following observations were made regarding the fault tolerance of the microservice architecture:

  1. When a service has multiple replicas and a pod is removed, subsequent requests are handled by the remaining replicas. This works seamlessly because all the services are stateless.

  2. If all replicas of the gateway or the React UI crash, the application as a whole becomes unavailable, as these two are the single points of failure of the system. This limitation will be addressed in subsequent milestones using blue-green deployment.

Gateway Service -- Working

When the service has crashed:

Gateway Service -- Crashed

  3. If all replicas of auth-service crash, the login and sign-up functionalities stop working; however, a user who is already logged in can still use the service to upload and download images.

GET /imageList Works

Auth Service -- Crashed -- GET Image List

POST /image Works

Auth Service -- Crashed -- POST Image

GET /image Works

Auth Service -- Crashed -- GET Image

  4. If all replicas of the user service crash, the sign-up functionality of the application stops working.

GET /imageList Works

User Service -- Crashed -- GET ImageList

POST /signUp Fails

User Service -- Crashed -- POST SignUp

  5. If all replicas of the image service crash, image upload, download, and listing stop working; however, login and sign-up continue to work independently of the failure.

POST /signin Works

ImageService -- Crashed -- POST SignIn

POST /image Fails

ImageService -- Crashed -- POST Image

  6. If all replicas of the session service are killed, the landing page and the sign-in and sign-up pages still load correctly; however, the application as a whole cannot provide its service because session validation fails.

GET /image Fails

SessionService -- Crashed -- GET Image

  7. If all replicas of the session log service are killed, the application's functionality remains unaltered, except that no session logs are available to re-create a lost session.

GET /imageList Works

SessionLogService -- Crashed -- GET ImageList

POST /signin Works

SessionLogService -- Crashed -- POST /signin
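
The per-request results listed above can be collected into a small lookup table, which makes it easier to compare a later test run against these observations. This is only a sketch: the service keys (for example `session-log-service`) and the helper names are hypothetical identifiers chosen for illustration, and the gateway and React UI are left out because no individual requests are recorded for them.

```python
from typing import Dict, List

# Observations recorded on this page: crashed service -> {request: did it work?}
# The service keys are illustrative names, not the project's actual identifiers.
OBSERVED: Dict[str, Dict[str, bool]] = {
    "auth-service": {"GET /imageList": True, "POST /image": True, "GET /image": True},
    "user-service": {"GET /imageList": True, "POST /signUp": False},
    "image-service": {"POST /signin": True, "POST /image": False},
    "session-service": {"GET /image": False},
    "session-log-service": {"GET /imageList": True, "POST /signin": True},
}


def failing_requests(crashed: str) -> List[str]:
    """Return the requests recorded as failing while the given service is down."""
    return sorted(req for req, ok in OBSERVED[crashed].items() if not ok)


def mismatches(crashed: str, measured: Dict[str, bool]) -> List[str]:
    """Return requests whose newly measured outcome differs from the record."""
    recorded = OBSERVED[crashed]
    return sorted(
        req for req, ok in measured.items()
        if req in recorded and recorded[req] != ok
    )


if __name__ == "__main__":
    for service in OBSERVED:
        print(service, "->", failing_requests(service) or "no recorded failures")
```

A new run with Postman (or any HTTP client) would fill in the `measured` dictionary by hand; `mismatches` then reports any request whose behavior has changed since these observations were made.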
