
Optimize by reducing heap allocations #25

Open
jgaskins wants to merge 1 commit into comandeo:master from jgaskins:optimize-heap-allocations

Conversation

@jgaskins

This PR reduces the number of heap allocations performed in each request. For example, Array(UInt8).new(0) allocates an array on the heap whereas Bytes.empty involves no heap allocations — the Slice is allocated on the stack with no heap buffer because there's no need to allocate a buffer for zero bytes.
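
A minimal sketch of the pattern, with made-up method names rather than the actual diff:

# The old form heap-allocates an Array header even for zero elements;
# the new form returns a zero-length Slice with no heap buffer at all.
def empty_extras_old : Array(UInt8)
  Array(UInt8).new(0)
end

def empty_extras_new : Bytes
  Bytes.empty
end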

I wrote up a benchmark to show how much of a difference this made. It's difficult to benchmark because each query spends only a small fraction of its time actively using the CPU compared to the time it spends waiting on I/O. At scale, though, with many requests hitting the cache at once, this can matter a lot, so I thought I'd optimize where I could.

Benchmark code
require "benchmark"

require "../src/memcached"

mc = Memcached::Client.new

puts
puts "Getting a million values"
keys = Array.new(1_000_000, &.to_s)
pp Benchmark.measure { mc.get_multi keys }
pp Benchmark.memory { mc.get_multi keys }

Benchmark.ips do |x|
  x.report "get hit" { mc.get "hit" }
  x.report "get miss" { mc.get "miss" }
  x.report "set" { mc.set "hit", "value" }
  x.report "increment" { mc.increment "counter", 1, 0 }
end
Raw output

Before

Getting a million values
#<Benchmark::BM::Tms:0x10456d200
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=0.183867041,
 @stime=0.015734,
 @utime=0.167988>
242849552
  get hit  54.47k ( 18.36µs) (± 2.84%)  352B/op   1.01× slower
 get miss  55.14k ( 18.14µs) (± 2.37%)  320B/op        fastest
      set  53.84k ( 18.57µs) (± 2.39%)  336B/op   1.02× slower
increment  53.94k ( 18.54µs) (± 4.50%)  352B/op   1.02× slower

After

Getting a million values
#<Benchmark::BM::Tms:0x102af1280
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=0.1175465,
 @stime=0.011751000000000001,
 @utime=0.076792>
98847824
  get hit  54.35k ( 18.40µs) (± 2.21%)  192B/op        fastest
 get miss  54.12k ( 18.48µs) (± 2.89%)  160B/op   1.00× slower
      set  54.23k ( 18.44µs) (± 2.37%)  160B/op   1.00× slower
increment  52.97k ( 18.88µs) (± 2.64%)  160B/op   1.03× slower

The summary is that we save 50-55% on heap allocations per query: 50% per get miss (320 bytes->160) and 55% per increment (352 bytes->160), with set right between the two (336 bytes->160). When fetching 1M keys at once with get_multi, that translates to a 52% saving in CPU time (utime + stime in the Tms output above: 183ms->88ms) and a 60% reduction in the amount of memory allocated (232MB->94MB).

Another (micro-)optimization used here is to replace the hash of opcodes with a NamedTuple. Indexing into a NamedTuple happens at compile time so there is no runtime performance cost at all. I considered using an enum, which offers the same benefit but doesn't require digging into a data structure at all (we would be able to pass :get instead of OPCODES[:get]), but maybe that could come in a followup PR.
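
Roughly, the change looks like the sketch below; the opcode values are the standard memcached binary-protocol ones, not copied from this diff:

# Before: a Hash lives on the heap and every lookup is a runtime hash lookup.
OPCODES_HASH = {:get => 0x00_u8, :set => 0x01_u8, :increment => 0x05_u8}

# After: a NamedTuple is a value type, and indexing it with a symbol
# literal is resolved at compile time.
OPCODES = {get: 0x00_u8, set: 0x01_u8, increment: 0x05_u8}

OPCODES[:get] # => 0x00_u8, no heap allocation and no runtime lookup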


Also, in the Crystal ecosystem, it seems to be a common convention to log routine queries and messages at the debug level and let app operators opt into seeing those logs by setting the LOG_LEVEL env var.
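
For reference, with Crystal's standard Log module (assuming that's what the shard uses) the convention looks roughly like:

require "log"

# Honors the LOG_LEVEL env var (e.g. LOG_LEVEL=debug); defaults to info,
# so routine per-query logs stay hidden unless the operator opts in.
Log.setup_from_env

log = Log.for("memcached")
log.debug { "get key=hit" } # the block only runs when debug logging is enabled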

In hindsight, that could've (and probably should've) been a separate PR, but I needed it in order to run the benchmark, so it's not entirely unrelated.
