
Commit 9cb3776

Issue #244 Initial draft of performance testing page including summarized case study of DSPT use of APDEX (#245)
1 parent f639ed0 commit 9cb3776

11 files changed: +121 -3 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 .DS_Store
 *.code-workspace
 !project.code-workspace
+.vs/*
5 binary image files added (101 KB, 52.6 KB, 41.7 KB, 137 KB, 65.6 KB); previews not shown.

practices/observability.md

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@ Rather than blunt "uptime" measures, the paper proposes an interpretation of ava
 Monitoring dashboards are a primary way for people like support engineers and product managers to visualise and understand service health and behaviour.

-![Grafana Dashboard](grafana-dashboard.jpg)
+![Alt](./images/grafana-dashboard.jpg "Grafana Dashboard")

 Dashboards should be easy to understand at a glance. Some tips to achieve this are:
 * Limit the dashboard to a small set of individual graphs or charts, no more than 10.

practices/performance-testing.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# Performance Testing

## Context

* These notes are part of a broader set of [principles](../principles.md)
* This is related to [Engineering quality-checks](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/adopt-appropriate-cyber-security-standards)
* Related community of practice: [Test Automation Working Group](../communities/pd-test-automation-working-group.md)
* See also:
  * [Quality Metrics](../quality-checks.md)
  * [Continuous integration](continuous-integration.md)
  * [Governance as a side effect](../patterns/governance-side-effect.md)
  * [Testing](testing.md)

## Introduction

Performance testing has a somewhat ambiguous meaning across the IT industry and is often used interchangeably with other testing terms such as load testing, stress testing and soak testing.

For the sake of clarity, this page uses the definition of performance testing given on [Wikipedia](https://en.wikipedia.org/wiki/Software_performance_testing), namely:

> performance testing is in general a testing practice performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource usage.

## How to start?

### Know your audience

* Identify common user interactions or journeys with your system
* Identify how many users are typically accessing your system at any given moment
* Calculate or estimate what percentage of those users will be performing a given interaction or journey at any given moment
* This information can then be used to design your thread groups in JMeter, or a similar grouping of interactions in other testing tools
* The same information can also be used to determine a "typical" load and to scale up load realistically as part of your tests (see the sketch below)

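As an illustration only, a minimal sketch of turning a journey mix into thread group sizes; the journey names and percentages here are hypothetical, not taken from any real system:

```python
# Sketch: derive thread group sizes from journey percentages (hypothetical figures).

# Percentage of concurrent users performing each journey at any given moment.
journey_mix = {
    "browse_reports": 50,
    "submit_assessment": 30,
    "admin_tasks": 20,
}

def thread_group_sizes(total_users, mix):
    """Split a total concurrent user count across journeys by percentage."""
    return {journey: round(total_users * pct / 100) for journey, pct in mix.items()}

# A "typical" load and a scaled-up load, keeping the same proportions.
print(thread_group_sizes(100, journey_mix))  # {'browse_reports': 50, 'submit_assessment': 30, 'admin_tasks': 20}
print(thread_group_sizes(250, journey_mix))  # {'browse_reports': 125, 'submit_assessment': 75, 'admin_tasks': 50}
```
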
### What does good look like?

* Identify clear targets for performance: performance testing should be an **objective**, not subjective, exercise
* Examples of possible targets might be:
  * SLA based, e.g. all pages must respond within 4 seconds
  * Relative, e.g. any given release must not deteriorate performance by more than 5%
  * Weighted by interaction: if a user performs a particular interaction once every 3 months they are liable to be more accepting of an 8 second delay than for a task they perform many times a day
  * Weighted by load: in busy periods you may be willing to accept a slightly longer response time
* Consider how your targets may be influenced by your architecture: for example, if you are using a serverless "scale on demand" architecture your targets might be cost based

Ultimately, your targets act as a red flag: if they are missed, you need to investigate further.

## Use of the APDEX index

[APDEX](https://en.wikipedia.org/wiki/Apdex) is a simple formula for calculating performance against a target response time that would satisfy your users. It is useful because it yields a single figure between 0 and 1.0, where 1.0 means all of your users are happy and satisfied, and 0 means they are all unhappy and dissatisfied.

APDEX acts as a "smoothing" function and helps ameliorate the effect of outliers by classing response times purely in terms of whether the user is satisfied, tolerating or frustrated. If you have a strict SLA around every individual page response time it may therefore not be appropriate for you. It is also important to choose a realistic target response time: if it is overly strict or overly generous you will struggle to distinguish between different performance test runs, and repeated results of 0 or 1.0 aren't very useful.

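As a sketch of the standard calculation: for a chosen target time T, each sample is classed as satisfied (response time <= T), tolerating (<= 4T) or frustrated (> 4T), and the score is (satisfied + tolerating / 2) / total. The 500 ms target below is purely illustrative:

```python
# Sketch of the standard APDEX formula: score = (satisfied + tolerating / 2) / total,
# where satisfied means response time <= T and tolerating means T < response time <= 4T.

def apdex(response_times_ms, target_ms):
    """Return an APDEX score between 0.0 (all frustrated) and 1.0 (all satisfied)."""
    if not response_times_ms:
        raise ValueError("no samples to score")
    satisfied = sum(1 for t in response_times_ms if t <= target_ms)
    tolerating = sum(1 for t in response_times_ms if target_ms < t <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# With a hypothetical 500 ms target: two satisfied, one tolerating, one frustrated.
print(apdex([300, 450, 900, 2500], target_ms=500))  # 0.625
```
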
APDEX is a useful index for pipelines because it gives a definite figure and is therefore an objective measure, as opposed to the more subjective, manual interpretation of standard load testing reports such as the JMeter example below:

![Alt](./images/jmeter-report-sample.png "Sample JMeter Report")

As such, APDEX can help us answer, and take action on (e.g. fail the pipeline), fundamental questions such as:

* Are our users happy?
* Have we made performance worse?
* Would our users become unhappy under increased load?

## A case study

For the Data Security and Protection Toolkit (DSPT) we decided to use APDEX so that, prior to a fortnightly release, we could answer the question:

> Has this release made the performance of the system worse?

### A case study - know your audience

Previously we had defined a list of user scenarios for the typical actions undertaken on the system, which we named after Mr Men characters. We also defined, for every 100 users, how many (i.e. what percentage) would be likely to be performing a given Mr Man scenario.

We used these scenarios to define our thread groups within JMeter and decided we would run our performance tests with 250 users at a time, which represents a heavy load for the system.

### A case study - what does good look like?

Previously we had performed performance testing in a fairly ad hoc manner, and even once we had hooked the JMeter tests into our Release Candidate pipeline we often forgot to check the resulting report. When we did check, we found it hard to compare against previous reports (if we still had them) and the whole process felt rather loose and subjective.

The [quality-checks section of our Engineering Dashboard](../quality-checks.md) was reporting this gap in best practice: we needed to improve this working practice.

In order to have an automatic, quantifiable quality gate, we decided that we wanted to know if the performance of a particular scenario had degraded by more than 5% compared to its previous average performance. If it had, we wanted to fail the pipeline so we could investigate any new pieces of code further. The following approach allowed us to achieve this aim and is currently in use.

### A case study - approach

Although JMeter can produce APDEX figures, it only calculates them per endpoint, whereas we wanted to aggregate our APDEX figures at the Mr Man scenario level.

We therefore wrote a Python program which takes the raw JMeter results file (a sample of which is shown below) and, using regular expressions to group results by thread names matching a Mr Man scenario, calculates the aggregate APDEX score per scenario.

![Alt](./case-studies/performance-dspt/images/jmeter-output.png "Sample of raw JMeter result file")

The Python program produces a results file like the one below:

![Alt](./case-studies/performance-dspt/images/aggregated-apdex-scores.png "Aggregated APDEX results file")

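A minimal sketch of this kind of per-scenario aggregation (not the DSPT program itself), assuming a CSV-format JMeter results file with the standard `threadName` and `elapsed` columns; the scenario names, regular expressions and 500 ms target are hypothetical:

```python
# Sketch: aggregate an APDEX score per scenario from a CSV-format JMeter results (JTL) file.
# Assumes the standard CSV output with "threadName" and "elapsed" (ms) columns;
# the scenario names, patterns and target below are illustrative only.
import csv
import re
from collections import defaultdict

SCENARIOS = {
    "mr_bump": re.compile(r"Mr Bump", re.IGNORECASE),
    "mr_rush": re.compile(r"Mr Rush", re.IGNORECASE),
}
TARGET_MS = 500  # hypothetical "satisfied" threshold

def aggregate_apdex(jtl_path):
    """Group samples by the scenario whose pattern matches the thread name, then score each group."""
    samples = defaultdict(list)
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            for scenario, pattern in SCENARIOS.items():
                if pattern.search(row["threadName"]):
                    samples[scenario].append(float(row["elapsed"]))
                    break
    scores = {}
    for scenario, times in samples.items():
        satisfied = sum(1 for t in times if t <= TARGET_MS)
        tolerating = sum(1 for t in times if TARGET_MS < t <= 4 * TARGET_MS)
        scores[scenario] = (satisfied + tolerating / 2) / len(times)
    return scores

if __name__ == "__main__":
    for scenario, score in aggregate_apdex("results.jtl").items():
        print(f"{scenario},{score:.3f}")
```
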
These figures were then compared against the per-scenario average of previous results files, which had been stored in an S3 bucket, using an Athena database over that bucket and the following query:

> SELECT type, key, avg(apdex) AS average FROM "dspt"."performance_test_results" GROUP BY type, key

Using the results of this query we could calculate any deterioration and fail the pipeline if needed. If the results were within the 5% limit, the new results file was simply added to the S3 bucket. Additionally, for information, the results of the calculation are written to the Jenkins log, as shown below:

![Alt](./case-studies/performance-dspt/images/degradation-output.png "Degradation result in Jenkins log")

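As an illustration of this kind of quality gate (a sketch under the assumption of a 5% threshold, not the DSPT pipeline code), given the new per-scenario scores and the Athena-derived averages:

```python
# Sketch: fail the build if any scenario's APDEX has degraded by more than 5%
# relative to its historical average. All figures here are hypothetical.
import sys

MAX_DEGRADATION = 0.05  # 5% threshold

def check_degradation(new_scores, averages):
    """Return human-readable failures for scenarios that degraded beyond the threshold."""
    failures = []
    for scenario, new_score in new_scores.items():
        baseline = averages.get(scenario)
        if not baseline:
            continue  # no history yet for this scenario
        degradation = (baseline - new_score) / baseline
        if degradation > MAX_DEGRADATION:
            failures.append(f"{scenario}: {baseline:.3f} -> {new_score:.3f} ({degradation:.1%} worse)")
    return failures

if __name__ == "__main__":
    failures = check_degradation({"mr_bump": 0.82}, {"mr_bump": 0.91})
    for line in failures:
        print(line)
    if failures:
        sys.exit(1)  # a non-zero exit code fails the pipeline step
```
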
### A case study - some caveats

Whilst we have found this approach useful, there are certain caveats to it, for example:

* What if performance slowly degrades over time, but always by less than 5% per release?
* It can hide an individual page or endpoint whose performance has degraded, due to the aggregation/smoothing effect of APDEX.
* We recognise that in future we probably also want to apply an absolute target, e.g. an APDEX of >= 0.9.

### A case study - architecture

The following diagram summarises the DSPT approach to using APDEX:

![Alt](./case-studies/performance-dspt/images/architecture.png "DSPT Performance test architecture")

### A case study - impact

By integrating APDEX into our build pipelines we have significantly improved our performance testing: a working process that was previously subjective, manual and applied infrequently is now applied to every release candidate, consistently and objectively, and with no human effort.

practices/testing.md

Lines changed: 2 additions & 1 deletion
@@ -8,6 +8,7 @@
 * [Continuous integration](continuous-integration.md)
 * [Governance as a side effect](../patterns/governance-side-effect.md)
 * [Quality Metrics](../quality-checks.md)
+* [Performance Testing](performance-testing.md)

 ## General Testing Principles

@@ -78,7 +79,7 @@
 * BDD tools to encode acceptance criteria in business terms as automated tests where appropriate.
 * Chaos engineering / resilience testing e.g. using AWS Fault Injection Simulator (see [AWS FIS](../tools/aws-fis) for sample code)
-* Performance tools to check load, volume, soak and stress limits
+* Performance tools to check load, volume, soak and stress limits (see [Performance Testing practices](performance-testing.md) for further details)

 ## Further reading and resources
