Desktop: Importing from OneNote: Support large attachments #15195
…mport/improve-large-attachment-support
```rust
let png_data = fs::read(output_dir.join("Printout").join("test4_1.pdf.png"))
    .expect("should read the PNG");
assert_eq!(png_data[0..4], [0x89, 0x50, 0x4E, 0x47], "PNG should have the correct initial bytes");
assert_eq!(png_data.len(), 14_005, "PNG should have the correct byte length");
```
I've verified that these values are consistent with the output before this pull request.
🧹 Nitpick comments (4)
packages/onenote-converter/renderer/src/page/image.rs (1)
95-101: Optional: accept `&mut dyn Read` instead of `&mut Box<dyn Read>`.

Minor nit: `read_file_start` doesn't need the ownership shape of the `Box`; taking `&mut dyn Read` would make it trivially reusable with other readers (e.g. tests) without boxing. No behaviour change.

♻️ Optional refactor

```diff
-fn read_file_start(reader: &mut Box<dyn Read>) -> Result<Vec<u8>> {
+fn read_file_start(reader: &mut dyn Read) -> Result<Vec<u8>> {
     let size: usize = 1024;
     let mut sub_reader = reader.by_ref().take(size as u64);
     let mut bytes = Vec::with_capacity(size);
     sub_reader.read_to_end(&mut bytes)?;
     Ok(bytes)
 }
```

Call site would become `read_file_start(&mut *reader)`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/renderer/src/page/image.rs` around lines 95 - 101: change the function signature of `read_file_start` from taking `&mut Box<dyn Read>` to `&mut dyn Read` and update any call sites to pass the inner reader (e.g. `read_file_start(&mut *reader)` when you have a `Box`). Inside `read_file_start` keep the existing logic (use `reader.by_ref().take(...)`, `read_to_end`, etc.) but operate on the `&mut dyn Read` parameter; this removes unnecessary boxing and makes the function usable with any mutable reader type without ownership changes.

packages/onenote-converter/parser-utils/src/reader.rs (2)
338-355: Nice regression test.

Good coverage for the two invariants that matter here: the data ref reads from the offset captured at `as_data_ref` time, and consuming it doesn't perturb the parent `Reader`'s position. Worth adding a sibling case for the `ReaderData::File` branch later (e.g. via `Reader::try_from(Box<dyn FileHandle>)`) since the `BufferRef` and `File` paths are implemented quite differently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/reader.rs` around lines 338 - 355, Add a sibling test that exercises the ReaderData::File branch to assert the same two invariants (as_data_ref captures offset and reading the DataRef does not advance the parent Reader): construct a Reader from a boxed FileHandle via Reader::try_from(Box<dyn FileHandle>) (or a test FileHandle impl), call reader.advance(...) then let data_ref = reader.as_data_ref(n).unwrap().read(), consume data_ref into a buffer and assert its contents match the captured slice, then assert reader.get_u8() returns the next byte; target symbols: Reader, Reader::try_from, Reader::as_data_ref, ReaderData::File and BufferRef.
280-309: Save/seek/restore pattern preserves the reader invariant; one tiny nit.

Wrapping the read in a closure so the restore-seek always runs is the right shape for keeping the `ReaderData::File` invariant (file cursor == `Reader.data_offset`) intact.

Nit: the early-out `if offset > self.end_offset` uses strict `>`, so the exact-EOF case (`offset == end_offset`) falls through into a no-op read that still performs two seeks on the underlying file handle. Using `>=` would short-circuit cleanly at the boundary without changing any observable behaviour.

Suggested micro-tweak

```diff
-        let offset = self.start_offset;
-        if offset > self.end_offset {
-            return Ok(0);
-        }
+        let offset = self.start_offset;
+        if offset >= self.end_offset {
+            return Ok(0);
+        }
```
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/reader.rs` around lines 280 - 309, In ReaderFilePointer's Read impl (the read method), change the early-out comparison from offset > self.end_offset to offset >= self.end_offset so the exact-EOF case (start_offset == end_offset) returns immediately and avoids the extra seek/restore; update the check that computes remaining accordingly so the rest of the logic (the closure that seeks, reads, and restores file position and the update of self.start_offset) remains unchanged.packages/onenote-converter/parser-utils/src/file_api/wasm_driver.rs (1)
151-176: Streaming writer looks correct; a couple of optional robustness notes.

The truncate-then-append loop, the EOF termination, and the gradual buffer growth (capped at 50 MiB) all look right and correctly address the multi-GiB attachment case.

Two small, optional refinements to consider (not blocking):

- `data.read(...)?` propagates `ErrorKind::Interrupted` as a hard error. For long streams this is worth retrying, and you can get the same effect for free by delegating the read side to `std::io::copy` into a small adapter `Write` that forwards to `append_file`, which handles `Interrupted` internally.
- The destination file is truncated up front and then grown in place via appends, so a mid-stream failure leaves a partially-written attachment on disk. If that matters for imports, consider streaming into a sibling temp path and renaming on success. This mirrors the pre-existing `write_file` behaviour, so it's not a regression introduced here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/file_api/wasm_driver.rs` around lines 151 - 176, stream_to_file currently truncates the target and appends chunks directly; implement two optional robustness improvements: 1) stream into a sibling temporary path (e.g., path + ".tmp") instead of truncating the final file, using write_file(temp_path, &[]) to create/clear it, then rename the temp to the final path on success and remove the temp on error; 2) replace the manual read loop with std::io::copy by implementing a small Write adapter that forwards writes to append_file(temp_path, buf) and maps errors through handle_error so Interrupted is handled consistently; keep ApiResult return semantics and reference stream_to_file, write_file, append_file and handle_error when making changes.
📒 Files selected for processing (11)
- packages/onenote-converter/parser-utils/node_functions.js
- packages/onenote-converter/parser-utils/src/file_api/api.rs
- packages/onenote-converter/parser-utils/src/file_api/native_driver.rs
- packages/onenote-converter/parser-utils/src/file_api/wasm_driver.rs
- packages/onenote-converter/parser-utils/src/reader.rs
- packages/onenote-converter/parser/src/onenote/embedded_file.rs
- packages/onenote-converter/parser/src/onenote/image.rs
- packages/onenote-converter/parser/src/shared/file_data_ref.rs
- packages/onenote-converter/renderer/src/page/embedded_file.rs
- packages/onenote-converter/renderer/src/page/image.rs
- packages/onenote-converter/renderer/tests/convert.rs
According to the documentation, `ErrorKind::Interrupted` can "typically be retried". (https://doc.rust-lang.org/std/io/enum.ErrorKind.html#variant.Interrupted)
🧹 Nitpick comments (1)
packages/onenote-converter/parser-utils/src/file_api/api.rs (1)
19-21: Optional: document truncate/create semantics in the trait doc.

The doc comment captures the partial-write caveat nicely, but it leaves the create/truncate contract implicit. Since both implementations (native and WASM) truncate the destination before streaming, it would help future implementers to state that explicitly on the trait so alternative backends don't diverge (e.g. appending instead of overwriting).
📝 Proposed doc tweak
```diff
-    /// Writes data from `stream` to the file at `path`.
-    /// Note: If `stream.read` fails, the file may be left in a partially-written state.
+    /// Writes data from `stream` to the file at `path`, creating the file if it
+    /// does not exist and truncating it if it does.
+    /// Note: If `stream.read` fails, the file may be left in a partially-written state.
     fn stream_to_file(&self, path: &str, stream: &mut dyn Read) -> ApiResult<()>;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/onenote-converter/parser-utils/src/file_api/api.rs` around lines 19 - 21, Update the trait documentation for stream_to_file in the Api trait to explicitly state its create/truncate semantics: document that implementations must create the destination file if it does not exist and truncate/overwrite any existing file before writing the incoming stream (so partial-write caveat still applies), and mention that implementations should not append. Reference the stream_to_file method signature and ensure this contract is clear so native and WASM backends (and future backends) follow the same overwrite behavior.
📒 Files selected for processing (1)
packages/onenote-converter/parser-utils/src/file_api/api.rs
Problem
`.one` files with large (e.g. multi-gigabyte) attachments failed to import. Previously, the `onenote-converter` package loaded full attachments into memory before writing them to disk.

Solution
Read attachments in chunks.
Follow-up to #15117.
Testing
- `.one` file.
- `.one` file from Joplin.

Screencast.from.2026-04-24.12-25-29.webm
Screencast.from.2026-04-24.12-37-22.webm