@@ -1,4 +1,3 @@
-
The Recipe
==========

@@ -9,21 +8,20 @@ The Recipe
.. warning::

   This chapter is going to be rewritten with reference to David Agans's
-    `Debugging book <https://debuggingrules.com/>`_. Until that time I have left
+    `Debugging book <https://debuggingrules.com/>`_. Until that time, I have left
   it intact in case it may be helpful. If you came here chasing a link to
-    ``understand the system`` or ``don't think, look`` then I refer you to his
+    ``understand the system`` or ``don't think, look``, then I refer you to his
   book for the time being.

-
This chapter presents a recipe for debugging performance regressions in Haskell.
- Often times when we debug code it becomes too easy to trace execution or use a
+ Often, when we debug code, it becomes too easy to trace execution or use a
shotgun approach; we apply a bunch of best-guess changes and retest to see if
our stimulus presents a response. You should do your best to avoid these urges.
Instead, use a scientific approach and develop a hypothesis and conceptual model
of the failure mode or bug. Every bug or performance regression is a learning
opportunity and should be considered as such. By treating regressions as
- learning opportunities you gain knowledge of your system and the systems it
- interacts with, and in turn become a better software engineer. This chapter
+ learning opportunities, you gain knowledge of your system and the systems it
+ interacts with, and in turn, become a better software engineer. This chapter
provides a sequence of questions and reminders to help you take a scientific
approach to performance regression debugging. We hope it aids you well.

@@ -32,7 +30,7 @@ Vocabulary

Unless otherwise noted, we use the following vocabulary to describe various
aspects of our optimization journey. Because these do not have a formal
- definition we present them here instead of in the :ref:`glossary`:
+ definition, we present them here instead of in the :ref:`glossary`:

1. *The system*: The system is the local infrastructure and computational
   edifice your program operates in. This includes your operating system, your
@@ -41,7 +39,7 @@ definition we present them here instead of in the :ref:`glossary`:
2. *The program*: The program is the program we are trying to optimize that runs
   on the system.

- 3. *The problem*: The problem is an observable phenomena of the program. It is
+ 3. *The problem*: The problem is an observable phenomenon of the program. It is
   the performance regression we are trying to characterize, understand, fix, and
   prevent.

@@ -59,7 +57,7 @@ Characterize the Problem

The first step to solving any kind of problem is characterization. The goal of
this step is to observe how the problem *presents* itself in the system in terms
- of the sub-systems the system uses. No phenomena exists without leaving a trail
+ of the sub-systems the system uses. No phenomenon exists without leaving a trail
of evidence, and our purpose in this step is to find this trail and re-state the
problem description *in terms* of the system. You should begin by asking
yourself the following questions:
@@ -68,10 +66,10 @@ yourself the following questions:
   why does it continue to happen? What is the chain of events that has caused
   it to occur again?

- #. Is the problem deterministic? Or is it stochastic? If it is stochastic what
-    is the rate at which we observe the problem phenomena?
+ #. Is the problem deterministic? Or is it stochastic? If it is stochastic, what
+    is the rate at which we observe the problem phenomenon? (A timing sketch
+    follows these questions.)

- #. Is there anything *unique* about the environment the problem manifest in?
+ #. Is there anything *unique* about the environment the problem manifests in?
   Specifically:

   - Does it manifest only on unique hardware?
   - Does it manifest at a particular time of day or only on one person's machine?
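
To help answer the determinism question, a small timing harness often
suffices. Below is a minimal sketch, assuming the ``time`` package and a
hypothetical ``suspectAction`` standing in for the code that exhibits the
problem:

.. code-block:: haskell

   import Control.Monad (forM)
   import Data.Time.Clock (diffUTCTime, getCurrentTime)

   -- Hypothetical stand-in for the code that exhibits the problem.
   suspectAction :: IO ()
   suspectAction = print (sum [1 .. 1000000 :: Int])

   main :: IO ()
   main = do
     timings <- forM [1 .. 10 :: Int] $ \_ -> do
       start <- getCurrentTime
       suspectAction
       end <- getCurrentTime
       pure (realToFrac (diffUTCTime end start) :: Double)
     let mean = sum timings / fromIntegral (length timings)
     putStrLn ("mean seconds: " ++ show mean)
     -- A spread that is large relative to the mean suggests a stochastic
     -- problem; a tight spread suggests a deterministic one.
     putStrLn ("max deviation: " ++ show (maximum [abs (t - mean) | t <- timings]))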
@@ -130,33 +128,33 @@ Of course, this is a toy example and the failure mode in a large complex system
may be very long. If this is the case, then begin writing the failure mode in
broad strokes, e.g.:

-    Input ``Foo`` is input to the system, it then propogates to sub-system
-    ``Bar``, is changed to ``FooFoo`` and then propogates to sub-system ``Baz``.
+    Input ``Foo`` is input to the system; it then propagates to sub-system
+    ``Bar``, is changed to ``FooFoo``, and then propagates to sub-system ``Baz``.

- In this style you are not overly concerned with the exact functions which do the
+ In this style, you are not overly concerned with the exact functions that do the
work. Rather, you are simply laying out the path the problem input takes through
the system. You can fill in the details as you gain insight into the failure
mode through testing.

This step is concluded when you have identified and written down one or more
hypothetical failure modes.

- Create The Smallest Reproducible Test of the Problem
+ Create the Smallest Reproducible Test of the Problem
----------------------------------------------------

- Once you have characterized the problem and have possible failure modes you
+ Once you have characterized the problem and identified possible failure modes, you
should try to create an isolated, minimal test to reproduce the problem. The
- idea is to try to capture the problem so you can begin analyzing it. A test is a
- light switch; the idea outcome of this step is that you have a light switch
+ idea is to capture the problem so you can begin analyzing it. A test is a
+ light switch; the ideal outcome of this step is that you have a light switch
where you can "turn on" and "turn off" the problem at will. Try to construct the
test such that it interacts with as few sub-systems and external systems as
possible to limit the scope of the investigation. At the end of the
- investigation, you can add this test to your testsuite to ensure the problem
- does not manifest again. If you have many possible failure modes, then try to
+ investigation, you can add this test to your test suite to ensure the problem
+ does not manifest again. If you have many possible failure modes, try to
have one test per failure mode.

Creating a reproducible test is never the easy part, but it is not impossible.
- To construct the test case try the following steps:
+ To construct the test case, try the following steps:

#. Try to isolate the sub-systems and external systems that you suspect are
   likely to be in the failure mode or failure modes.
@@ -167,27 +165,26 @@ To construct the test case try the following steps:

#. Try to isolate the code you believe to be in the failure mode. This should
   follow almost directly from characterizing the problem and defining the
-    failure mode or modes. Tools such as valgrind, which provide line by line
+    failure mode or modes. Tools such as Valgrind, which provide line-by-line
   information about source code, are helpful here if CPU cycle counts are a
   meaningful metric for your system.

#. Remove all domain-specific information. Think of the possible failure mode
   from the perspective of the system. Do not think in terms of your business
   logic, using concepts such as ``Customer``, ``Bank Account``, or ``Payment
   Information``. Instead, think in terms of the realization of these concepts
-    in your system. ``Customer`` is a ``String``, ``Bank Account`` is a
+    in your system. ``Customer`` is a ``String``, ``Bank Account`` is an
   ``Integer``, ``Payment information`` is a ``Text``. Now re-describe the
   failure mode in terms of the implementation: "When I send sub-system ``Foo``
-    a ``String`` that contains the character ``U+03BB`` I observe the problem".
+    a ``String`` that contains the character ``U+03BB``, I observe the problem".

- #. Create slightly different tests to test different code paths on the failure
+ #. Create slightly different tests to test different code paths of the failure
   mode. Run tests to see if you can deterministically observe the problem. You
-    should be able to state "When I input ``Foo`` with properties ``Bar`` I
-    observe the problem", and "When I input ``Baz`` with properties ``Qux`` I
+    should be able to state "When I input ``Foo`` with properties ``Bar``, I
+    observe the problem", and "When I input ``Baz`` with properties ``Qux``, I
   observe the baseline". You know you have found the right code path in the
-    failure mode when you can reproducibly force the problem to occur *and* to
-    not occur.
-
+    failure mode when you can reproducibly force the problem to occur *and* not to
+    occur. A sketch of such a "light switch" test follows this list.
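
As an illustration of the "light switch", below is a minimal sketch using the
``criterion`` library. The function ``processInput`` and both inputs are
hypothetical stand-ins for your suspect code path, the problem input, and the
baseline input:

.. code-block:: haskell

   import Criterion.Main (bench, defaultMain, nf)
   import qualified Data.Text as T

   -- Hypothetical stand-in for the code in the suspected failure mode.
   processInput :: T.Text -> Int
   processInput = T.length . T.filter (/= '\x03BB')

   main :: IO ()
   main = defaultMain
     [ -- "Switch on": an input with the property we believe triggers the problem.
       bench "problem: contains U+03BB" $
         nf processInput (T.replicate 10000 (T.pack "a\x03BB"))
       -- "Switch off": a baseline input without that property.
     , bench "baseline: plain ASCII" $
         nf processInput (T.replicate 10000 (T.pack "ab"))
     ]

If the two benchmarks diverge reproducibly, you have your light switch: the
problem can be turned on and off by toggling one property of the input.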

Define a Hypothesis
-------------------
@@ -197,10 +194,10 @@ The Objects of the Hypothesis

Think of each sub-system, external system, and component of your system as
characters in a story. Any system that takes an action to produce a result that
- your code interacts with or causes, is a character. Each data structure your
- code directly or indirectly uses, is a character. Each function you have
- written, is a character; and so on. These are the objects of your hypothesis;
- they are what the hypothesis makes a statement about, and define the sequence of
+ your code interacts with or causes is a character. Each data structure your
+ code directly or indirectly uses is a character. Each function you have
+ written is a character; and so on. These are the objects of your hypothesis;
+ they are what the hypothesis makes a statement about and define the sequence of
interactions that constitutes the failure mode.

Defining a Good Hypothesis
@@ -209,11 +206,11 @@ Defining a Good Hypothesis
Of course, not all hypotheses are equal. Good hypotheses have the following
properties:

- #. They make progress, i.e, they are *falsifiable*; a good hypothesis yields
+ #. They make progress, i.e., they are *falsifiable*; a good hypothesis yields
   information when confirmed *and* when invalidated. A bad hypothesis *keeps
   constant* the level of information you have about the phenomenon. In other
   words, a bad hypothesis is one where you only gain information if the
-    hypothesis is validated, not when the hypothesis validated *or* invalidated.
+    hypothesis is validated, not when it is either validated *or* invalidated.

#. They are *specific and testable*: Good hypotheses are specific enough *to be*
   invalidated. For example, the hypothesis "The total runtime of the system is
@@ -225,16 +222,16 @@ properties:
   the heap. But in addition to that, this hypothesis also adds information
   *even if* it is shown to be wrong. It could be the case that the runtime *is
   not* dominated by garbage collection, or it could be the case that the cache
-    *is not* storing thunks. Either way, by testing an invalidating the
+    *is not* storing thunks. Either way, by testing and invalidating the
   hypothesis we learn where runtime is spent, and what is stored in the cache.
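
   One way to test this kind of hypothesis on a real program is with GHC's
   runtime statistics and its heap profile by closure type. A minimal sketch,
   assuming the program was built with ``-rtsopts`` (``Main.hs`` is a
   placeholder name):

   .. code-block:: console

      $ ghc -O2 -rtsopts Main.hs
      $ ./Main +RTS -s    # prints allocation, GC time, and productivity
      $ ./Main +RTS -hT   # writes Main.hp, a heap profile by closure type

   A low productivity figure in the ``-s`` summary supports the garbage
   collection claim, and a growing ``THUNK`` band in the ``-hT`` profile
   supports the thunk claim. If neither appears, the hypothesis is invalidated
   and you have still learned where the runtime actually goes.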

Predict the Response and Test
-----------------------------

- Now that you have a hypothesis, a hypothetical failure mode and a minimal test
- case you can begin testing. Each change made to your code should be in pursuit
- of validating or invalidating the hypothesis. Do your best to resist the urge to
- begin shotgun debugging! [#]_ The work flow should be:
+ Now that you have a hypothesis, a hypothetical failure mode, and a minimal test
+ case, you can begin testing. Each change made to your code should be
+ in pursuit of validating or invalidating the hypothesis. Do your best to resist
+ the urge to begin shotgun debugging! [#]_ The workflow should be:

1. Review the hypothesis and predict the response. State "if the hypothesis is
   true, then ``Foo`` should happen, or I should observe ``Bar``".
@@ -252,7 +249,7 @@ begin shotgun debugging! [#]_ The work flow should be:
   hypothesis.

..
-    Let's consider the previous example again, our hypothesis was that that the
+    Let's consider the previous example again; our hypothesis was that the
   cache was accumulating thunks, and that these thunks were dominating runtime.

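
To make the commented example concrete, below is a minimal sketch of how a
cache can accumulate thunks. The single-key "cache" and the update loop are
hypothetical stand-ins for a real workload:

.. code-block:: haskell

   import Data.List (foldl')
   import qualified Data.Map.Lazy as L
   import qualified Data.Map.Strict as M

   -- With a lazy map, insertWith defers the (+): after many updates the value
   -- is a long chain of thunks whose cost surfaces only at the first lookup.
   lazyCache :: Int -> L.Map String Int
   lazyCache n = foldl' (\m i -> L.insertWith (+) "hits" i m) L.empty [1 .. n]

   -- The strict map forces each value on insert, so no chain accumulates.
   strictCache :: Int -> M.Map String Int
   strictCache n = foldl' (\m i -> M.insertWith (+) "hits" i m) M.empty [1 .. n]

   main :: IO ()
   main = do
     print (L.lookup "hits" (lazyCache 1000000))   -- pays the deferred cost here
     print (M.lookup "hits" (strictCache 1000000)) -- cost was paid incrementally

Running the lazy version with ``+RTS -hT`` shows a ``THUNK`` band that grows
with the input size; the strict version keeps it flat. This is exactly the kind
of "light switch" evidence the recipe is after.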
@@ -270,7 +267,7 @@ Summary
-------


- .. [#] Be sure to have a reproducible testing environment setup before you begin
+ .. [#] Be sure to have a reproducible testing environment set up before you begin
   gathering data. :ref:`Repeatable Measurements`

.. [#] Shotgun debugging is usually an indication that you have not properly
@@ -281,5 +278,5 @@ Summary
   then you know you have stumbled upon the failure mode of the problem. If
   you do not get a response, then you know that the sub-systems you've
   altered are not in the failure mode of the problem. This search for the
-    failure mode is characterization of the problem and thus so is shotgun
+    failure mode is characterization of the problem, and thus so is shotgun
   debugging.