DDIA/part-1.md

Tips for reliable systems tolerant to human errors, as quoted in DDIA:
- provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users
- test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests
- make it fast to roll back configuration changes, roll out new code gradually, and provide tools to recompute data
- set up detailed and clear monitoring, such as performance metrics and error rates
#### How Important Is Reliability?
Business implications - "outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation."
### Scalability
Scalability is the term we use to describe a system’s ability to cope with increased load.
#### Describing Load
Load Parameters may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else
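
As a toy illustration (hypothetical log format, all numbers made up), two of those load parameters — requests per second and the read/write ratio — can be read straight off a request log:

```python
from collections import Counter

# Hypothetical request log: (timestamp in seconds, operation) pairs
log = [
    (0, "read"), (0, "write"), (0, "read"),
    (1, "read"), (1, "read"), (2, "write"),
]

ops = Counter(op for _, op in log)
duration = log[-1][0] - log[0][0] + 1          # seconds covered by the log
requests_per_second = len(log) / duration      # one load parameter
read_write_ratio = ops["read"] / ops["write"]  # another load parameter
print(requests_per_second, read_write_ratio)   # 2.0 2.0
```

Which parameters matter depends on the architecture of your system; the point is that "load" is only meaningful once you pick numbers to describe it.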
#### Describing Performance
**Throughput** is the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
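
A minimal sketch of measuring throughput in both framings, using a stand-in batch job (hypothetical workload, not from the book):

```python
import time

def process(records):
    """Stand-in for a batch job over a dataset (hypothetical workload)."""
    return sum(r * r for r in records)

records = list(range(100_000))
start = time.perf_counter()
process(records)
elapsed = time.perf_counter() - start

throughput = len(records) / elapsed         # records processed per second
# Equivalent framing: total time for a job on a dataset of a given size
estimated_time = len(records) / throughput  # == elapsed
```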
Most important for an online system is the **response time**, being the time between a client sending a request and receiving a response. This can be affected by "a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack" etc.
The median, the 50th percentile, is abbreviated p50: half of user requests are served in less than this time and the other half in more. e.g. if p50 is 200 ms, then half of your users are served in under 200 ms and the other half take longer. The same reading applies to the 95th and 99th percentiles (p95, p99).
Nevertheless, it is important to examine and address some of your outliers, but you must also consider the diminishing returns. i.e. optimizing to address p95 and p99 can have critical business effects, but p999 (the 99.9th percentile — the slowest 1 in 1,000 requests) may have diminishing returns.
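
A sketch of reading those percentiles off a set of measured response times (nearest-rank method, simulated data — not from the book):

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

random.seed(1)
# Simulated response times in ms: mostly ~200 ms, with a slow 1% tail
times = [random.gauss(200, 30) for _ in range(990)] + \
        [random.gauss(2000, 300) for _ in range(10)]

p50, p95, p99, p999 = (percentile(times, p) for p in (50, 95, 99, 99.9))
```

The slow tail barely moves p50 but dominates p99 and p999, which is why averages alone hide exactly the requests your slowest users experience.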
*"Amazon has also observed that a 100 ms increase in response time reduces sales by 1%"*
#### Approaches for Coping with Load
**scaling up** - *vertical scaling*, moving to a more powerful machine. This is often simpler but more expensive.
**scaling out** - *horizontal scaling*, distributing the load across multiple smaller machines. Distributing load across multiple machines is also known as a *shared-nothing* architecture.

**elastic systems** - systems that automatically add computing resources when they detect a load increase; a more complex way of scaling, suited to highly unpredictable load.
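
A toy elastic-scaling rule (hypothetical numbers; real autoscalers also damp oscillation, respect cooldowns, and cap machine counts):

```python
import math

def desired_machines(total_load, capacity_per_machine, target_utilization=0.7):
    """Provision enough machines that average utilization stays <= target."""
    needed = total_load / (capacity_per_machine * target_utilization)
    return max(1, math.ceil(needed))

# e.g. 1,000 req/s against machines that each handle 100 req/s flat out:
machines = desired_machines(1000, 100)  # 15 machines at a 70% utilization target
```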
### Maintainability
**Maintainable Software**
- Operability: Ops should be able to keep system running smoothly
- Simplicity: barrier to entry to codebase should be low, i.e. new engineers should be able to hit the ground running
- Evolvability (modifiability, extensibility, plasticity): It should be easy for engineers to make changes to the codebase
#### Operability: Making Life Easy for Operations
**Ops Team Responsibilities**

- health monitoring
- performance and failure tracking, and troubleshooting
- keeping software up to date, including security patches
- examining the effects of systems on each other
- anticipating future problems
- tools and practices for deployment and configuration management
- maintenance tasks, such as migrating platforms
- maintaining the security of the system
- keeping the production environment stable
- preserving organizational knowledge about the system, e.g. providing good documentation
- self-healing where appropriate, while also giving administrators manual control over the system state when needed
#### Simplicity: Managing Complexity
Simplicity yields maintainability: removing *accidental* complexity (complexity not inherent in the problem the software solves) makes a system easier to understand and modify, and good abstractions are one of the best tools for doing so.
#### Evolvability: Making Change Easy
"The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability"
### Summary
Applications should meet:
**functional requirements:** what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways
**non-functional requirements:** general properties like security, reliability, compliance, scalability, compatibility, and maintainability
**Concerns in Most Software Systems**
- **Reliability:** The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
- **Scalability:** As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
- **Maintainability:** Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.