
Commit 3b3f648

nenoganchev authored and route committed
Stream PDFs in chunks instead of copying them whole into memory
The CDP protocol supports two methods for retrieving the results of `Page.printToPDF` commands:

1. Transfer the whole PDF file in one go over the WebSocket as a response from the `Page.printToPDF` command.
2. Return a "stream handle" from the `Page.printToPDF` command and let the caller transfer the PDF in chunks via subsequent `IO.read` commands.

Method 1, which was being used until now, has several drawbacks when it comes to printing large PDF files:

- The file may be larger than the WebSocket's max receive size. This can be worked around by setting a larger max receive size, but that's an imperfect solution as there's always the chance that we will encounter a larger file.
- Transferring the whole file into memory before storing it into a file causes a memory spike in the Ruby process and its child Chrome processes. Furthermore, while any memory used by the Chrome processes is released after closing the browser, the memory used by the parent Ruby process is never reclaimed by the garbage collector leading to a permanent memory bloat.
- The points above are exacerbated by the fact that Chrome uses Base64 encoding to transfer the PDF over the WebSocket. This means that the transferred data size is further inflated by 33%, and that the PDF contents have to be copied a _second_ time into memory when converting them back to binary.

This commit reimplements `Page#pdf` to use method 2 in order to prevent that memory bloat.
Here are some test results measuring memory usage of a Rails app that converts a sample HTML to an 85M PDF file:

- Baseline: after loading the HTML, before calling `Page#pdf`
  Rails process: 242M
  Chromium parent: 61M
  Chromium child: 94M
  Chromium child: 55M
  Chromium child: 64M
- Method 1: returning the whole PDF into memory
  Rails process: 803M
  Chromium parent: 713M
  Chromium child: 404M
  Chromium child: 55M
  Chromium child: 64M
- Method 2: streaming the PDF in 128K chunks
  Rails process: 265M
  Chromium parent: 316M
  Chromium child: 405M
  Chromium child: 55M
  Chromium child: 64M

The chunk size of 128K was chosen because it represented the sweet spot of a large enough chunk and minimal memory usage in tests of various chunk sizes. The results (Rails process memory usage corresponding to a given chunk size) were:

- 1M: 292M
- 512K: 278M
- 256K: 278M
- 128K: 265M
- 64K: 269M
- 32K: 274M
- 16K: 285M
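The Base64 overhead described in the commit message can be illustrated with a small, self-contained sketch. The payload below is an arbitrary stand-in for PDF bytes (an assumption for illustration, not data from the tests above); it only shows the size inflation and the fact that decoding produces a second copy of the data.

```ruby
require "base64"

# Arbitrary stand-in for PDF bytes; the size is chosen only for illustration.
binary = "\x00".b * 3_000_000

# First copy: the Base64 text, as it would arrive over the WebSocket.
encoded = Base64.strict_encode64(binary)

# Second copy: the binary data after decoding.
decoded = Base64.decode64(encoded)

ratio = encoded.bytesize.to_f / binary.bytesize
puts format("binary: %d bytes, base64: %d bytes (x%.2f)",
            binary.bytesize, encoded.bytesize, ratio)
```

While both `encoded` and `decoded` are alive, the process holds roughly 2.3 times the PDF size in strings alone, which is consistent with the kind of spike the commit message reports for method 1.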
1 parent d4c744c commit 3b3f648

2 files changed: +36 -11 lines changed

lib/ferrum/page/screenshot.rb

Lines changed: 29 additions & 6 deletions
@@ -36,12 +36,13 @@ def screenshot(**opts)
 
       def pdf(**opts)
         path, encoding = common_options(**opts)
-        options = pdf_options(**opts)
-        data = command("Page.printToPDF", **options).fetch("data")
-        return data if encoding == :base64
-
-        bin = Base64.decode64(data)
-        save_file(path, bin)
+        options = pdf_options(**opts).merge(transferMode: "ReturnAsStream")
+        stream_handle = command("Page.printToPDF", **options).fetch("stream")
+        if path
+          stream_to_file(stream_handle, path)
+        else
+          stream_to_memory(stream_handle)
+        end
       end
 
       def mhtml(path: nil)
@@ -70,6 +71,28 @@ def save_file(path, data)
         File.open(path.to_s, "wb") { |f| f.write(data) }
       end
 
+      def stream_to_file(stream_handle, path)
+        File.open(path, 'wb') do |output_file|
+          stream_to stream_handle, output_file
+        end
+      end
+
+      def stream_to_memory(stream_handle)
+        in_memory_data = ''
+        stream_to stream_handle, in_memory_data
+        in_memory_data
+      end
+
+      def stream_to(stream_handle, output)
+        loop do
+          read_result = command("IO.read", handle: stream_handle, size: 131072)
+          data_chunk = read_result['data']
+          data_chunk = Base64.decode64(data_chunk) if read_result['base64Encoded']
+          output << data_chunk
+          break if read_result['eof']
+        end
+      end
+
       def common_options(encoding: :base64, path: nil, **_)
         encoding = encoding.to_sym
         encoding = :binary if path
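
The read loop in `stream_to` can be tried without a browser. The following is a minimal sketch, not Ferrum code: `FakeStream` is a hypothetical stand-in that returns hashes shaped like `IO.read` responses (`data`, `base64Encoded`, `eof`), and the loop mirrors the one in the diff. It uses `String.new` for the in-memory buffer on the assumption that a `''` literal could be frozen if the file enables `frozen_string_literal`.

```ruby
require "base64"
require "stringio"

# Hypothetical stand-in for a CDP stream handle; not part of Ferrum.
class FakeStream
  def initialize(payload)
    @io = StringIO.new(payload)
  end

  # Returns a hash shaped like an IO.read response.
  def read(size)
    chunk = @io.read(size) || "".b
    {
      "data" => Base64.strict_encode64(chunk),
      "base64Encoded" => true,
      "eof" => @io.eof?
    }
  end
end

CHUNK_SIZE = 131_072 # 128K, the chunk size chosen in the commit

# Same loop shape as stream_to in the diff: decode each chunk, append it,
# and stop once the response reports end of stream.
def stream_to(stream, output)
  loop do
    result = stream.read(CHUNK_SIZE)
    chunk = result["data"]
    chunk = Base64.decode64(chunk) if result["base64Encoded"]
    output << chunk
    break if result["eof"]
  end
  output
end

payload = Random.new(42).bytes(300_000)

# A mutable binary buffer; it accepts << just like a File does.
in_memory = stream_to(FakeStream.new(payload), String.new(encoding: Encoding::BINARY))
puts in_memory.bytesize
```

Because both `String` and `File` respond to `<<`, the same loop serves the in-memory and on-disk cases; only one chunk (plus its Base64 form) is held at a time when writing to a file.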

spec/screenshot_spec.rb

Lines changed: 7 additions & 5 deletions
@@ -295,7 +295,7 @@ def create_screenshot(path: file, **options)
       it "convert case correct" do
         browser.go_to("/ferrum/long_page")
 
-        allow(browser.page).to receive(:command).with("Page.printToPDF", {
+        allow(browser.page).to receive(:command).with("Page.printToPDF", hash_including(
           displayHeaderFooter: false,
           ignoreInvalidPageRanges: false,
           landscape: false,
@@ -310,8 +310,11 @@ def create_screenshot(path: file, **options)
           preferCSSPageSize: false,
           printBackground: false,
           scale: 1,
-          transferMode: "ReturnAsBase64"
-        }) { { "data" => "" } }
+        )) { { "stream" => "1" } }
+
+        allow(browser.page).to receive(:command).with("IO.read", hash_including(
+          handle: "1"
+        )) { { "data" => "", "base64Encoded" => false, "eof" => true } }
 
         browser.pdf(path: file, landscape: false,
                     display_header_footer: false,
@@ -325,8 +328,7 @@ def create_screenshot(path: file, **options)
                     margin_right: 0.4,
                     page_ranges: "",
                     ignore_invalid_page_ranges: false,
-                    prefer_css_page_size: false,
-                    transfer_mode: "ReturnAsBase64")
+                    prefer_css_page_size: false)
       end
     end
   end
