Stream PDFs in chunks instead of copying them whole into memory

nenoganchev · route · commit 3b3f648f858d · 2021-02-02T11:06:20.000+03:00
The Chrome DevTools Protocol (CDP) supports two methods for retrieving
the result of a `Page.printToPDF` command:
1. Transfer the whole PDF file in one go over the WebSocket as a
   response from the `Page.printToPDF` command.
2. Return a "stream handle" from the `Page.printToPDF` command and let
   the caller transfer the PDF in chunks via subsequent `IO.read`
   commands.
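
Method 2 boils down to a read loop. The following is a minimal sketch of
that loop, not the code from this commit: `FakeCdpClient`,
`drain_stream` and `CHUNK_SIZE` are hypothetical names, and the fake
client only imitates the shape of `IO.read` responses (`data`,
`base64Encoded`, `eof`) so the example runs without Chrome. Releasing
the handle with the CDP `IO.close` command is an addition of this
sketch.

```ruby
require "base64"
require "stringio"

# Hypothetical stand-in for a CDP connection. It serves a payload through
# IO.read-style responses so the loop below can run without a browser.
class FakeCdpClient
  attr_reader :closed_handles

  def initialize(payload)
    @source = StringIO.new(payload)
    @closed_handles = []
  end

  def command(method, handle:, size: nil)
    case method
    when "IO.read"
      chunk = @source.read(size) || "".b
      { "data" => Base64.strict_encode64(chunk),
        "base64Encoded" => true,
        "eof" => @source.eof? }
    when "IO.close"
      @closed_handles << handle
      {}
    else
      raise ArgumentError, "unsupported method: #{method}"
    end
  end
end

CHUNK_SIZE = 128 * 1024 # 131072 bytes

# Reads the stream chunk by chunk and appends each decoded chunk to
# `output`, which can be anything that responds to `<<` (File, StringIO).
def drain_stream(client, handle, output)
  loop do
    result = client.command("IO.read", handle: handle, size: CHUNK_SIZE)
    chunk = result["data"]
    chunk = Base64.decode64(chunk) if result["base64Encoded"]
    output << chunk
    break if result["eof"]
  end
ensure
  # Assumption of this sketch: release the stream handle when done.
  client.command("IO.close", handle: handle)
end

payload = Random.new(42).bytes(300_000) # stand-in for PDF bytes
client = FakeCdpClient.new(payload)
output = StringIO.new("".b)
drain_stream(client, "1", output)
```

Because only one chunk is held at a time, peak memory in the caller
depends on the chunk size rather than on the size of the PDF.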

Method 1, which has been used until now, has several drawbacks when
printing large PDF files:
- The file may be larger than the WebSocket's max receive size. This can
  be worked around by setting a larger max receive size, but that's an
  imperfect solution as there's always the chance that we will encounter
  a larger file.
- Loading the whole file into memory before writing it to disk causes a
  memory spike in the Ruby process and its child Chrome processes.
  Furthermore, while any memory used by the Chrome processes is released
  when the browser is closed, the memory used by the parent Ruby process
  is never reclaimed by the garbage collector, leading to permanent
  memory bloat.
- The points above are exacerbated by the fact that Chrome uses Base64
  encoding to transfer the PDF over the WebSocket. This means that the
  transferred data is inflated by about 33%, and that the PDF contents
  are copied into memory a _second_ time when they are decoded back to
  binary.
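
The Base64 overhead can be checked with a short script. This is an
illustrative sketch using random bytes as a stand-in for PDF data; the
variable names are hypothetical and nothing here comes from the commit
itself.

```ruby
require "base64"

raw = Random.new(1).bytes(3_000_000)   # stand-in for binary PDF data
encoded = Base64.strict_encode64(raw)  # what travels over the WebSocket
decoded = Base64.decode64(encoded)     # a second full-size copy in memory

overhead = encoded.bytesize.to_f / raw.bytesize - 1
puts format("raw: %d bytes, encoded: %d bytes, overhead: %.1f%%",
            raw.bytesize, encoded.bytesize, overhead * 100)
# prints "raw: 3000000 bytes, encoded: 4000000 bytes, overhead: 33.3%"
```

Base64 encodes every 3 input bytes as 4 output characters, which is
where the roughly one-third inflation comes from.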

This commit reimplements `Page#pdf` to use method 2 in order to avoid
that memory bloat. Here are some test results measuring the memory usage
of a Rails app that converts a sample HTML page to an 85M PDF file:

- Baseline: after loading the HTML, before calling `Page#pdf`

  Rails process:      242M
    Chromium parent:   61M
      Chromium child:  94M
      Chromium child:  55M
      Chromium child:  64M

- Method 1: loading the whole PDF into memory

  Rails process:      803M
    Chromium parent:  713M
      Chromium child: 404M
      Chromium child:  55M
      Chromium child:  64M

- Method 2: streaming the PDF in 128K chunks

  Rails process:      265M
    Chromium parent:  316M
      Chromium child: 405M
      Chromium child:  55M
      Chromium child:  64M

The chunk size of 128K was chosen because it was the sweet spot between
a large enough chunk and minimal memory usage in tests of various chunk
sizes. The results (Rails process memory usage for a given chunk size)
were:
-   1M: 292M
- 512K: 278M
- 256K: 278M
- 128K: 265M
-  64K: 269M
-  32K: 274M
-  16K: 285M
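
The other side of the trade-off is the number of `IO.read` round trips.
The sketch below estimates that count for each tested chunk size,
assuming that "85M" means 85 MiB and that every read returns a full
chunk (Chrome is not guaranteed to do so). The variable names are
hypothetical.

```ruby
KIB = 1024
pdf_size = 85 * KIB * KIB # assumption: "85M" means 85 MiB

# Chunk sizes from the list above, in bytes.
chunk_sizes = [1024, 512, 256, 128, 64, 32, 16].map { |k| k * KIB }

round_trips = chunk_sizes.to_h do |size|
  [size, (pdf_size + size - 1) / size] # ceiling division
end

round_trips.each do |size, reads|
  puts format("%5dK chunks: %5d IO.read calls", size / KIB, reads)
end
```

Under these assumptions a 128K chunk needs 680 reads, compared with 85
reads at 1M and 5440 reads at 16K.
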
diff --git a/lib/ferrum/page/screenshot.rb b/lib/ferrum/page/screenshot.rb
@@ -36,12 +36,13 @@ def screenshot(**opts)
 
       def pdf(**opts)
         path, encoding = common_options(**opts)
-        options = pdf_options(**opts)
-        data = command("Page.printToPDF", **options).fetch("data")
-        return data if encoding == :base64
-
-        bin = Base64.decode64(data)
-        save_file(path, bin)
+        options = pdf_options(**opts).merge(transferMode: "ReturnAsStream")
+        stream_handle = command("Page.printToPDF", **options).fetch("stream")
+        if path
+          stream_to_file(stream_handle, path)
+        else
+          stream_to_memory(stream_handle)
+        end
       end
 
       def mhtml(path: nil)
@@ -70,6 +71,28 @@ def save_file(path, data)
         File.open(path.to_s, "wb") { |f| f.write(data) }
       end
 
+      def stream_to_file(stream_handle, path)
+        File.open(path, 'wb') do |output_file|
+          stream_to stream_handle, output_file
+        end
+      end
+
+      def stream_to_memory(stream_handle)
+        in_memory_data = ''
+        stream_to stream_handle, in_memory_data
+        in_memory_data
+      end
+
+      def stream_to(stream_handle, output)
+        loop do
+          read_result = command("IO.read", handle: stream_handle, size: 131072)
+          data_chunk = read_result['data']
+          data_chunk = Base64.decode64(data_chunk) if read_result['base64Encoded']
+          output << data_chunk
+          break if read_result['eof']
+        end
+      end
+
       def common_options(encoding: :base64, path: nil, **_)
         encoding = encoding.to_sym
         encoding = :binary if path
diff --git a/spec/screenshot_spec.rb b/spec/screenshot_spec.rb
@@ -295,7 +295,7 @@ def create_screenshot(path: file, **options)
           it "convert case correct" do
             browser.go_to("/ferrum/long_page")
 
-            allow(browser.page).to receive(:command).with("Page.printToPDF", {
+            allow(browser.page).to receive(:command).with("Page.printToPDF", hash_including(
                displayHeaderFooter: false,
                ignoreInvalidPageRanges: false,
                landscape: false,
@@ -310,8 +310,11 @@ def create_screenshot(path: file, **options)
                preferCSSPageSize: false,
                printBackground: false,
                scale: 1,
-               transferMode: "ReturnAsBase64"
-            }) { { "data" => "" } }
+            )) { { "stream" => "1" } }
+
+            allow(browser.page).to receive(:command).with("IO.read", hash_including(
+              handle: "1"
+            )) { { "data" => "", "base64Encoded" => false, "eof" => true } }
 
             browser.pdf(path: file, landscape: false,
                                     display_header_footer: false,
@@ -325,8 +328,7 @@ def create_screenshot(path: file, **options)
                                     margin_right: 0.4,
                                     page_ranges: "",
                                     ignore_invalid_page_ranges: false,
-                                    prefer_css_page_size: false,
-                                    transfer_mode: "ReturnAsBase64")
+                                    prefer_css_page_size: false)
           end
         end
       end