Optimize by reducing heap allocations #25
Open
jgaskins wants to merge 1 commit into comandeo:master
Conversation
This commit optimizes the number of heap allocations performed in each request. For example, `Array(UInt8).new(0)` allocates an array on the heap whereas `Bytes.empty` involves no heap allocations — the `Slice` is allocated on the stack with no buffer on the heap because it isn't necessary to allocate an empty buffer. Another (micro-)optimization used here is to replace the hash of opcodes with a NamedTuple. Indexing into a NamedTuple happens at compile time so there is no runtime performance cost at all.
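A minimal sketch of the two optimizations described above (the opcode values shown are illustrative placeholders, not necessarily the ones the client uses):

```crystal
# Empty buffer via Array: allocates an Array header (and bookkeeping) on the heap.
buffer = Array(UInt8).new(0)

# Empty buffer via Bytes.empty: the Slice struct lives on the stack and points
# at no heap buffer, so no allocation happens at all.
empty = Bytes.empty

# Opcode lookup via a Hash happens at runtime:
OPCODES_HASH = {:get => 0x00_u8, :set => 0x01_u8}
op = OPCODES_HASH[:get] # runtime hash lookup

# Opcode lookup via a NamedTuple with a symbol literal resolves at compile time:
OPCODES = {get: 0x00_u8, set: 0x01_u8}
op = OPCODES[:get] # no runtime lookup cost
```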
This PR optimizes the number of heap allocations performed in each request. For example, `Array(UInt8).new(0)` allocates an array on the heap whereas `Bytes.empty` involves no heap allocations — the `Slice` is allocated on the stack with no buffer on the heap, because it isn't necessary to allocate an empty buffer.

I wrote up a benchmark to show how much of a difference this made. It's difficult to benchmark because each query spends only a small fraction of its time actively using the CPU compared to the time it spends waiting on I/O. At scale, though, with several requests all hitting the cache at once, this can matter a lot, so I thought I'd try to optimize where I could.
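The actual benchmark code is in the collapsed section below. As a rough sketch of how allocations per call can be measured in Crystal, using the stdlib `Benchmark` module and `GC.stats` (here `fake_get` is a hypothetical stand-in for the real client call):

```crystal
require "benchmark"

# Hypothetical stand-in for the client operation being measured.
def fake_get
  Bytes.empty
end

# Cumulative bytes allocated on the heap while running the block,
# read from GC.stats.total_bytes before and after.
def allocated(&) : UInt64
  before = GC.stats.total_bytes
  yield
  GC.stats.total_bytes - before
end

puts allocated { 1_000.times { fake_get } }

# Throughput comparison via iterations-per-second.
Benchmark.ips do |x|
  x.report("get miss") { fake_get }
end
```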
Benchmark code
Raw output
Before
After
The summary is that we save 50-55% on heap allocations per query: 50% per `get` miss (320 bytes -> 160), 55% per `increment` (352 bytes -> 160), and `set` directly in the middle of the two (336 bytes -> 160). This results in saving 52% CPU time (183ms -> 88ms) and a 60% reduction in the amount of memory allocated (232MB -> 94MB) when fetching 1M keys at once with `get_multi`.

Another (micro-)optimization used here is to replace the hash of opcodes with a NamedTuple. Indexing into a NamedTuple happens at compile time, so there is no runtime performance cost at all. I considered using an enum, which offers the same benefit but doesn't require digging into a data structure at all (we would be able to pass `:get` instead of `OPCODES[:get]`), but maybe that could come in a followup PR.

Also, in the Crystal ecosystem, it seems to be a common convention to log routine queries and messages at the `debug` level and let app operators opt into seeing those logs by setting the `LOG_LEVEL` env var. For example:

- `HTTP::Client`
- `AMQP::Client`

In hindsight, that could've (and probably should've) been a separate PR, but I did it in order to run the benchmark, too. So it's not entirely unrelated.
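The enum alternative mentioned above might look something like this sketch (the names and opcode values are hypothetical, for illustration only):

```crystal
# Callers could pass Opcode::Get directly instead of OPCODES[:get].
# Backing the enum with UInt8 makes .value the exact wire byte.
enum Opcode : UInt8
  Get       = 0x00
  Set       = 0x01
  Increment = 0x05
end

def write_opcode(io : IO, opcode : Opcode)
  io.write_byte(opcode.value) # .value returns the underlying UInt8
end
```

Like the NamedTuple, this keeps the lookup out of runtime entirely, and it additionally gives type-checked opcode arguments instead of symbols.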