qsync testing plan

Sergey Bronnikov edited this page Jun 15, 2020 · 17 revisions

Current state of the RFC as of 15.06.2020 (https://github.com/tarantool/tarantool/commit/a0236e5891f97426a62634557560c4adf32fc967)

1st iteration

  • [RFC, summary] switch async replicas to sync ones and vice versa, expected success and data consistency on the leader and replicas
  • [RFC, summary] switch a node from leader to replica and vice versa, expected success and data consistency on the leader and replicas
  • [RFC, quorum commit] happy path: write/read data on the leader in a sync cluster, expected data consistency on the leader and replicas
  • happy path: read/write data in a sync cluster with the maximum allowed number of replicas, expected success and data consistency on the leader and replicas
  • [RFC, quorum commit] no quorum achieved, expected transaction rollback and data consistency on a leader and replicas
  • [RFC, quorum commit] check behaviour when a replica gives no answer during a write, expected that a failure answer is set
  • [RFC, quorum commit] check behaviour when a replica gives a failure answer during a write, expected that the replica is disconnected from replication
  • [RFC, quorum commit] write multiple transactions, expected the same order as on the client when the quorum is achieved
  • [RFC, quorum commit] write multiple transactions, expected that the latest transaction that collects the quorum is considered complete, along with all transactions prior to it
  • [RFC, quorum commit] failure when the leader confirms a transaction, expected rollback and data consistency on the leader and replicas
  • the leader gets a quorum, but one replica that participated in the quorum leaves the cluster right after answering the leader, expected (TBD)
  • [RFC, quorum commit] check the case where a transaction was written to the WAL and SUCCESS was returned, but the WAL was lost afterwards
  • read the rollback code ("guarantee of rollback on leader and sync replicas")
  • consistency on replicas when enabling and disabling sync replication (TBD)
  • [RFC, connection liveness] replication_connect_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_sync_lag works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_sync_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_synchro_quorum_timeout
  • [RFC, connection liveness] replication_synchro_quorum
  • [RFC, connection liveness] when the leader gets no response for another heartbeat interval, it should consider the replica lost
  • [RFC, connection liveness] when the leader does not have enough replicas to achieve a quorum, it should stop accepting write requests
  • [RFC, connection liveness] a leader that has stopped accepting write requests can be switched back to write mode when the cluster configuration is updated
  • [RFC, connection liveness] some replicas become unavailable during quorum collection, expected that the leader waits at most replication_synchro_quorum_timeout and then issues a rollback pointing to the oldest TXN in the waiting list
  • fault injections at different steps to make "WAL Ok" from a replica fail: network, disk, etc. (TBD)
  • test with a time difference between the leader and replicas, expected success
  • test with a leader and a single replica in a cluster, expected ??? (TBD)
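
The quorum-commit expectations in the list above can be written down as a small reference model that test assertions are checked against. The sketch below is only an illustration and is not Tarantool code: the `Leader` class, its methods, and the replica names are hypothetical. It assumes that the leader's own WAL write counts toward the quorum, that a replica acknowledging an LSN has also written every earlier LSN, and that a rollback pointing to the oldest waiting TXN discards everything queued after it as well.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Toy model of a leader's queue of synchronous transactions (hypothetical)."""
    quorum: int      # acks needed; assumption: the leader's own write counts
    timeout: float   # stand-in for replication_synchro_quorum_timeout
    pending: list = field(default_factory=list)      # (lsn, started_at, ackers), in LSN order
    committed: list = field(default_factory=list)
    rolled_back: list = field(default_factory=list)

    def write(self, lsn, now):
        # The leader's own WAL write is recorded as the first ack.
        self.pending.append((lsn, now, {"leader"}))

    def ack(self, replica, lsn):
        # Assumption: an ack for `lsn` covers every earlier LSN too.
        for entry_lsn, _, ackers in self.pending:
            if entry_lsn <= lsn:
                ackers.add(replica)
        self._confirm()

    def _confirm(self):
        # Find the latest transaction that collected the quorum and
        # confirm it together with all transactions prior to it.
        last = -1
        for i, (_, _, ackers) in enumerate(self.pending):
            if len(ackers) >= self.quorum:
                last = i
        self.committed += [lsn for lsn, _, _ in self.pending[:last + 1]]
        del self.pending[:last + 1]

    def tick(self, now):
        # If the oldest waiting transaction has timed out, roll back from it;
        # assumption: everything queued after it is rolled back as well.
        if self.pending and now - self.pending[0][1] >= self.timeout:
            self.rolled_back += [lsn for lsn, _, _ in self.pending]
            self.pending.clear()


leader = Leader(quorum=2, timeout=5.0)
for lsn in (1, 2, 3):
    leader.write(lsn, now=0.0)
leader.ack("replica_1", 2)   # confirms LSN 2 and, with it, LSN 1
leader.tick(now=5.0)         # LSN 3 never collected the quorum
```

A real test would drive an actual cluster and compare what it observes (committed and rolled-back transactions on the leader and each replica) with what such a model predicts for the same sequence of writes, acks, and timeouts.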

2nd iteration

  • test the new cluster CLI

Notes

  • Testing should be done with both engines: memtx and vinyl
  • How many nodes should be in a cluster?
    • Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems":

      "Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes. 84% will manifest on no more than 2 nodes…. It is not necessary to have a large cluster to test for and reproduce failures."
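
One way to act on the two notes above is to enumerate the test configurations up front. The sketch below is an illustration only: the cluster sizes (2 and 3 nodes) are an assumption taken from the quoted finding, the majority rule for the quorum is an assumption rather than something the plan fixes, and `test_matrix` and `majority` are hypothetical helper names.

```python
from itertools import product

ENGINES = ("memtx", "vinyl")   # both engines named in the notes
CLUSTER_SIZES = (2, 3)         # assumption: small clusters, per the quoted study


def majority(nodes):
    # Assumption: the quorum is a simple majority of the cluster.
    return nodes // 2 + 1


def test_matrix():
    # One configuration per (engine, cluster size) pair.
    return [
        {"engine": engine, "nodes": nodes, "quorum": majority(nodes)}
        for engine, nodes in product(ENGINES, CLUSTER_SIZES)
    ]


for config in test_matrix():
    print(config)
```

Each scenario from the first iteration would then run once per configuration, which keeps the total run count small while still covering both engines.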
