Commit 6a37316

add new post for bug in OTP
1 parent e7abece commit 6a37316

2 files changed, +144 -0 lines changed

@@ -0,0 +1,144 @@
---
title: "The Case of Rejected Certificates: An OTP Mystery"
date: 2025-03-08 22:30:00 +1100
tags: Erlang OTP
header:
  image: /assets/images/2025-03-08/red_bug.jpg
  image_description: "Banff National Park"
  teaser: /assets/images/2025-03-08/red_bug.jpg
  overlay_image: /assets/images/2025-03-08/red_bug.jpg
  overlay_filter: 0.4
  caption: >
    Photo by [Jill Heyer](https://unsplash.com/@jillheyer)
    on [Unsplash](https://unsplash.com/photos/closeup-photography-of-ladybug-perched-on-green-leafed-plant-U9x5mG0pBiQ)
excerpt: Tracking down a production issue to a subtle bug in Erlang/OTP
---

## The Problem Arises

A couple of weeks ago, we attempted to upgrade our Erlang/OTP version from 24 to
25.3.2.16, the latest 25.x release at the time. Shortly after the release
containing this change was deployed to production, our Customer Service team
reported that a specific payment feature had stopped working. In fact, we had
stopped receiving this type of payment almost immediately after the release
went live. The timing was too close to be a coincidence.

## The Investigation
28+
29+
When investigating this issue, I had no idea what the cause was, but I did have
30+
significant time pressure due to the nature of the problem—payments not being
31+
processed is always urgent!
32+
33+
First thing I did was to understand where the failure was happening and managed
34+
to replicate it in my local environment. Next, I methodically went through the
35+
all the changes in this release, reverting suspicious-looking changes one by
36+
one. Surprisingly, none of our actual code changes was the culprit.
37+
38+
The OTP version upgrade had seemed like one of the most innocent changes with
39+
regard to the payment issue we were facing. However, after exhausting other
40+
possibilities, I tested against OTP 24 since the OTP upgrade was a relatively
41+
major change in the same release. I was quite shocked to discover that the new
42+
version of OTP was indeed the guilty party.
43+
44+
## The (Partial) Solution

Since I'm not an expert on certificate validation in Erlang, the error message
we got when making requests to the bank looked cryptic to me:

```
TLS :client: In state :wait_cert_cr at ssl_handshake.erl:2123 generated CLIENT ALERT: Fatal - Unsupported Certificate
- {:key_usage_mismatch,
{ {:Extension, {2, 5, 29, 15}, true, [:keyCertSign, :cRLSign]},
{:Extension, {2, 5, 29, 37}, false,
[{1, 3, 6, 1, 5, 5, 7, 3, 2}, {1, 3, 6, 1, 5, 5, 7, 3, 1}]}}}
```

Armed with this error message, though, I was able to find a
[GitHub issue][gh-issue] in the official OTP repository about the same problem.
Apparently, other developers making HTTP requests with Erlang/Elixir had
encountered the same issue.
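
The tuples in the alert are ASN.1 object identifiers (OIDs), and the boolean after each extension's OID is its "critical" flag. As a rough aid to reading the message, here is a small Python sketch (Python is used purely for illustration, and the helper names `dotted` and `describe` are made up for this post) that maps those OIDs to the names RFC 5280 gives them. Read this way, the first extension is Key Usage with `keyCertSign` and `cRLSign`, and the second is Extended Key Usage listing client and server authentication.

```python
# Well-known object identifiers (OIDs) defined in RFC 5280 that show up
# in the alert above, keyed by their tuple form.
KNOWN_OIDS = {
    (2, 5, 29, 15): "keyUsage",
    (2, 5, 29, 37): "extKeyUsage",
    (1, 3, 6, 1, 5, 5, 7, 3, 1): "serverAuth (TLS web server authentication)",
    (1, 3, 6, 1, 5, 5, 7, 3, 2): "clientAuth (TLS web client authentication)",
}


def dotted(oid):
    """Render an OID tuple in the usual dotted notation."""
    return ".".join(str(part) for part in oid)


def describe(oid):
    """Hypothetical helper: return 'dotted = name', or 'unknown' if unlisted."""
    name = KNOWN_OIDS.get(tuple(oid), "unknown")
    return f"{dotted(oid)} = {name}"


if __name__ == "__main__":
    for oid in KNOWN_OIDS:
        print(describe(oid))
```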

Thanks to Ingela Andin, the maintainer, and the community's efforts, a fix had
already been released for OTP 26 and 27. Unfortunately for us, the impression
at the time was that OTP 25 wasn't affected, so no fix had been made for it.
Given our urgent situation, we decided to revert to OTP 24 to restore payment
processing as quickly as possible.

It's worth noting that after I reported that OTP 25 was indeed affected by the
same issue, Ingela responded quickly and worked on backporting the fix. A new
patch version with the fix was released about two weeks ago, clearing our path
to safely upgrading to OTP 25.

Now that we had a solution, I wanted to better understand what caused the
problem in the first place.

## Understanding Digital Certificates

To understand the bug, we need a quick primer on SSL/TLS certificates. Digital
certificates are like digital ID cards that websites use to prove their
identity. Each certificate contains:

* The website's public key
* Information about the website (domain name, etc.)
* Information about how the certificate can be used
* A signature from a trusted Certificate Authority (CA)

Certificates have "extensions" that specify what they can be used for. Two
important ones are:

* Key Usage (KU): Broadly defines what the certificate's key can do (sign things, encrypt things, etc.)
* Extended Key Usage (EKU): More specifically defines the certificate's purpose (web server authentication, email, etc.)

## The Bug in OTP

The bug occurred because recent versions of OTP were enforcing a rule that isn't
actually specified in the certificate standard (RFC 5280).

In simple terms:

* The certificates from certain CAs like Entrust had a flag set indicating they could sign other certificates (keyCertSign)
* They also had flags set saying they could be used for web server authentication
* OTP thought these two purposes were contradictory and rejected the certificate

It's as if you were qualified as both a teacher and a restaurant chef, but a
bureaucrat refused to accept your credentials because "you can't possibly do
both these unrelated jobs." In reality, of course, there's no reason someone
couldn't be qualified for both roles independently.

The same goes for digital certificates. The certificate standard (RFC 5280)
allows certificates to serve multiple purposes simultaneously, but OTP's new
validation logic was too restrictive.
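
To make the difference concrete, here is a deliberately simplified Python sketch of the two attitudes described above. It is not OTP's code and does not reproduce its actual logic (that is in the linked PR); the `Cert` class and both function names are invented for illustration, and the example certificate only copies the KU and EKU values shown in the alert earlier in this post.

```python
from dataclasses import dataclass

# Extended Key Usage purposes related to TLS authentication.
TLS_PURPOSES = frozenset({"serverAuth", "clientAuth"})


@dataclass(frozen=True)
class Cert:
    """Toy model of a certificate: only the two extensions that matter here."""
    key_usage: frozenset
    extended_key_usage: frozenset


def overly_strict_ca_check(cert):
    """Assumed simplification of the buggy behaviour: a certificate that may
    sign other certificates is rejected if it also lists a TLS purpose."""
    can_sign_certs = "keyCertSign" in cert.key_usage
    has_tls_purpose = bool(cert.extended_key_usage & TLS_PURPOSES)
    return can_sign_certs and not has_tls_purpose


def independent_ca_check(cert):
    """Treat the extensions independently: to act as a CA the certificate
    needs keyCertSign, and extra EKU purposes do not disqualify it."""
    return "keyCertSign" in cert.key_usage


# KU and EKU values as reported in the alert.
cert_from_alert = Cert(
    key_usage=frozenset({"keyCertSign", "cRLSign"}),
    extended_key_usage=frozenset({"clientAuth", "serverAuth"}),
)

if __name__ == "__main__":
    print("strict check accepts:", overly_strict_ca_check(cert_from_alert))
    print("independent check accepts:", independent_ca_check(cert_from_alert))
```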

For those interested in further technical details, there are extensive
discussions in the [GitHub issue][gh-issue], and here is the [PR][gh-pr] that
fixed it.

## Takeaways

A few interesting lessons from this experience:

1. Hidden Complexity: Even mature, well-tested software like Erlang/OTP can have subtle bugs in complex areas like SSL/TLS.
2. Implementation vs. Specification: The bug wasn't a coding error but an overly strict interpretation of a technical standard.
3. Community Matters: The Erlang community identified and fixed this issue very quickly.

## Summary

In this post, we started with an unexpected payment issue in production caused
by upgrading to OTP 25. After identifying the new OTP version as the culprit,
we had to revert to OTP 24.

We also dove into how the bug happened, which was essentially an overly strict
interpretation of the certificate standard. Thanks to the responsive Erlang
community and OTP maintainers, a fix was backported to OTP 25, resolving the
bug.

For me this was quite an interesting experience, because the overwhelming
majority of bugs we face as developers are ones we introduce ourselves in the
application layer. Sometimes we do encounter bugs in a library or framework
that we use, but that's pretty rare. It is extremely rare to face a bug in the
underlying programming language. In fact, this was the very first one I had
encountered in my whole career as a developer, and I've been doing this for
almost 20 years.

[gh-issue]: https://github.com/erlang/otp/issues/9208
[gh-pr]: https://github.com/erlang/otp/pull/9286