-
Notifications
You must be signed in to change notification settings - Fork 20
Description
Due to a double encoding error in SecureRandom.urlsafe_base64 that method is unsafe. The fix is easy and posted by @cheerfulstoic in PR #13.
Description
This issue is more pernicious than a simple inefficiency—it is a major bug that effectively breaks the implicit contract of random ID generation. Base.encode64/1 outputs a Base64 string from the source entropy bytes. That's fine. But that output is piped into Base.url_encode64/1 that also "expects" raw random bytes (in this context), and that is not what it receives. As a result, the generated IDs are not only longer than expected, but their characters are no longer evenly distributed. This has serious implications for both the Shannon entropy of the IDs and the implicit assumption that the IDs themselves are uniformly distributed.
Shannon entropy is maximized when each possible character in a sequence is equally probable. However, passing the already Base64-encoded output of Base.encode64/1 into Base.url_encode64/1 skews those probabilities. For example, among the 64 possible Base64 characters (65 if you count the padding character =, which we can ignore here), some characters such as 'f', '7', '-', and '_' never appear, while others occur with vastly different frequencies (e.g., '0' occurs about 29 times more often than '8'). This happens because the output of Base.encode64/1 is not truly random—its bit patterns reflect the encoding scheme. Consequently, Base.url_encode64/1 never encounters certain bit patterns and over-encounters others, producing highly uneven character distributions. The result is that calculating the actual entropy of the generated IDs is tedious at best (I didn't attempt it), but it is certainly less than optimal.
Character distribution issues compound the problem. As noted, four characters never appear at all, and others occur with varying frequency. But it gets worse: suppose a consumer of this method expected an even distribution—for example, partitioning data by the first character of the ID. Because of the double-encoding bug, IDs can only ever start with 19 of the 64 possible characters. That means roughly 70% of the partitions would remain empty.