Avoid reading ahead and then seeking back#1189
Conversation
7c9019b to
da98643
Compare
|
@rhuijben thanks a lot for both PR. Ill review them shortly |
|
These are some basic cleanups for further parser improvements. I'm wondering how stable you want to keep the tokenizer api. We can remove that initial byte and the 'readnextbyte' flag we pass everywhere to simplify things... The peek byte can handle that case. If we extend the peek to be an optionally larger buffer (can be done without changing the tokenizer api) there are many other optimizations possible.. Including using optimized searches instead of per byte walks. That can reduce at least ten to twenty percent of parsing time in the test suite.. I have some code locally that optimized the number tokenizer this way (the highest cpu user in many tests) |
|
@rhuijben I'm always happy to add performance improvements (I think ill add some benchmarking soon). @EliotJones we might want to release 0.12 soon? (Before refactoring the tokenizer api) |
|
If you do performance tests, be careful around which source you use. (Read: test both) Most tests load the whole file in ram before processing, while you want more like stream processing if doing server batches or really huge files. Currently this is still necessary for good read throughput, but that is because the stream bytes don't do efficient proper buffering. And we seek back far too often. I think I should be able to improve this quite a bit without touching any of the higher layers. Just not sure how many third party tokenizers use the current extension api... And want to remain stable. We can also wrap the old api with a newer api if necessary, to keep the public api stable... Just not sure about the current guarantees. |
|
|
||
| bytes.Seek(fetchFrom); | ||
|
|
||
| Span<byte> byteBuffer = new byte[bytes.Length - fetchFrom]; // TODO: Maybe use PoolArray? |
There was a problem hiding this comment.
byteBuffer max size is 1024 bytes, right? I believe this would be a good use of stackalloc - unless not possible
There was a problem hiding this comment.
Could be. Not sure about proper limits for stackalloc. (Usually I would look at that for a few hundred bytes max. But pdf processing is not a generic library design, that should be able to run on very limited machines)
With the next patch set this would be in the look ahead buffer of the input bytes and there would be no copying at all.
There was a problem hiding this comment.
see my comment in the PR and https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/stackalloc
|
@rhuijben I've submitted a first comment, have a look when you can. Also, do you mind rebasing this branch onto |
…by using seek. Optimize the FirstPassParser to just fetch a final chunk before doing things char-by-char backwards.
da98643 to
dd50546
Compare
|
[Just rebased. No changes yet] |
|
@rhuijben thansk a lot for that. I've started the review, but can you explain why many |
|
[I worked on quite a few parsers over the years] The current parser model requires each parser and more importantly parser users to handle the first and last byte of what is parser different than all others. E.g. before using the position/offset you have to know what the last parser did. The input already hands you the next byte via the peek function so there is zero reason to make things hard on the caller, and we can just fix the parsers to do the right thing. So before looking at improving the parsers individually, and the input bytes to provide a bit more info I fixed this assumption to at least work the same for all parsers in this PR. If the parser api can be changed I would even recommend removing this flag completely. It makes it much too easy to introduce subtle offset and off-by-one bugs in corner cases. This is exactly what this PR is all about. The parser changes in this PR can be reviewed one by one. |
|
@rhuijben thanks a lot for the clarity here. I think it's fine to change the parser api, I'd just like to first release an official version (this does not mean you need to wait to do PR). @EliotJones if no objection, I'll do that early next month. Regarding I do not believe an official limit can exist, but this is what MS does: int length = 1000;
Span<byte> buffer = length <= 1024 ? stackalloc byte[length] : new byte[length];If you are sure the array will be 1024 bytes at max, then Let me know if you want to push this change, I'm happy to merge the PR as it is |
|
@rhuijben I finally managed to run some benchmarks: Array
Code available https://github.com/BobLd/UglyToad.PdfPig.Benchmarks/tree/array Memory stream
Code available https://github.com/BobLd/UglyToad.PdfPig.Benchmarks/tree/memory-stream Given these numbers, I'm unsure if this is worth the added complexity. Do you have an opinion? |
|
The real benefit is not in the current pr, but in further improvements later on. Moving to the other model allows extending the reader with a bigger peek buffer without overhead and that part will make the real difference. Not seeking back will also allow wrapping smarter readers over other readers, so we don't have to read some parts (such as image streams) multiple times. This will make things more performant and use a lot less memory (copying). I have some of this in a branch in my GitHub fork if you want to look at the direction I'm thinking. (This branch is not PR ready. It changes most tests to do things from a stream instead of an in memory copy, to allow seeing the differences earlier on during my tests. But it shows some progress) |
|
@rhuijben Thanks again for the clarity - much appreciated. I'll merge as-is and look into the work in your fork. Thanks again |
|
@BobLd I will see what parts of that branch are ready for new PRs, and I'll try if I can use your new tests to see what to 'atrack' first. The VS2026 test versions expose more profiling features without needing the very expensive versions of VS, to allow finding the hot code paths in an easy way. |
|
@rhuijben thanks for that. Side note: the full tests that run upon merging now fail, if you can have a look. Thanks! |
|
I will see where that happens. From he GitHub test output it looks like type1 font parse issue that isn't caught by the .net testrun. |
|
I added a But it is probably interesting to see where the PDF processing fails. |
|
@rhuijben yes it more complicated to download the documents than I expected. The website is there https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/ Reading at the documentation here, the document you are looking should be in I'll have a 2nd look tonight |
|
Ok, can download the file now. Thanks for mangling the url to something readable... My local tests got different urls and all ending in an Amazon 404. |
index aac323f3..6d42ca5e 100644
--- a/src/UglyToad.PdfPig.Fonts/Type1/Parser/Type1Tokenizer.cs
+++ b/src/UglyToad.PdfPig.Fonts/Type1/Parser/Type1Tokenizer.cs
@@ -72,8 +72,10 @@
case '/':
{
bytes.MoveNext();
- TryReadLiteral(out var name);
- Debug.Assert(name != null);
+
+ if (!TryReadLiteral(out var name))
+ name = ""; // Should not happen, but does
+
return new Type1Token(name, Type1Token.TokenType.Literal);
}
case '<':Restores the old behavior of just creating an empty name for no data and with that fixes the test. (Not sure if it is really the behavior we want) |
Updated [PdfPig](https://github.com/UglyToad/PdfPig) from 0.1.9 to 0.1.13. <details> <summary>Release notes</summary> _Sourced from [PdfPig's releases](https://github.com/UglyToad/PdfPig/releases)._ ## 0.1.13 ## What's Changed * Increment version to 0.1.13 by @BobLd in UglyToad/PdfPig#1207 * Simply order by offset also when not doing brute force to fix #1208 by @BobLd in UglyToad/PdfPig#1210 * Ensure no key end up missing in ResolveInternal and fix #1209 by @BobLd in UglyToad/PdfPig#1211 * update release logic to check out master before commit by @EliotJones in UglyToad/PdfPig#1212 * Return empty glyph in ReadCompositeGlyph when glyphIndex is out of range and fix #1213 by @BobLd in UglyToad/PdfPig#1215 * Handling of optional content group names without proper name by @carlokok in UglyToad/PdfPig#1216 * Minor Type1FontParser optimisations by @BobLd in UglyToad/PdfPig#1221 * Use file header offset when doing brute force find and fix #1223 by @BobLd in UglyToad/PdfPig#1224 * Do not return glyph bbox and path in Type1Font if character name is '.notdef' by @BobLd in UglyToad/PdfPig#1229 ## New Contributors * @carlokok made their first contribution in UglyToad/PdfPig#1216 **Full Changelog**: UglyToad/PdfPig@v0.1.12...unreleased ## 0.1.12 ## What's Changed * add nullability to core project by @EliotJones in UglyToad/PdfPig#1111 * Fix usage of List.Contains by @theolivenbaum in UglyToad/PdfPig#1112 * allow missing catalog type definition for catalog dictionary by @EliotJones in UglyToad/PdfPig#1113 * Performance improvements and .Net 9 support by @chuckbeasley in UglyToad/PdfPig#1116 * Update run_integration_tests.yml by @BobLd in UglyToad/PdfPig#1117 * Add global.json in tools by @BobLd in UglyToad/PdfPig#1118 * Update run_integration_tests.yml by @BobLd in UglyToad/PdfPig#1119 * Update run_integration_tests.yml by @BobLd in UglyToad/PdfPig#1120 * Update run_common_crawl_tests.yml by @BobLd in UglyToad/PdfPig#1121 * Update nightly_release.yml by @BobLd in UglyToad/PdfPig#1123 * Increase FlateFilter multiplier when preventing malicious OOM and fix #1125 by @BobLd in UglyToad/PdfPig#1126 * Update build_and_test_macos.yml by @BobLd in UglyToad/PdfPig#1129 * Update build_and_test_macos.yml by @BobLd in UglyToad/PdfPig#1130 * Prevent StackOverflow in ParseTrailer and fix #1122 by @BobLd in UglyToad/PdfPig#1127 * Lower max search depth in preventing StackOverflow in ParseTrailer by @BobLd in UglyToad/PdfPig#1131 * add container node support for BookmarksProvider.cs by @migeyusu in UglyToad/PdfPig#1133 * move file parsing to single-pass static methods by @EliotJones in UglyToad/PdfPig#1102 * Add early version of IOSSystemFontLister by @BobLd in UglyToad/PdfPig#1143 * File buffering read stream investigation by @EliotJones in UglyToad/PdfPig#1140 * Draft release on master build by @EliotJones in UglyToad/PdfPig#1145 * First create the StreamInputBytes in PdfDocument.Open() to check the stream CanRead and CanSeek by @BobLd in UglyToad/PdfPig#1147 * Fix font matrix issues by @BobLd in UglyToad/PdfPig#1150 * Properly fix #1148 by always parsing optional tables in TrueTypeFontParser and remove Type 0 font hack by @BobLd in UglyToad/PdfPig#1151 * copy other parser behavior by treating end of stream as valid end inline image by @EliotJones in UglyToad/PdfPig#1152 * add test jobs for common crawl 0000 to 0007 by @EliotJones in UglyToad/PdfPig#1153 * handle case where xobjects use same key as fonts by @EliotJones in UglyToad/PdfPig#1154 * read last line of ignore file by @EliotJones in UglyToad/PdfPig#1155 * Use correct font matrix when transforming the width in Type 0 font and fix #1156 by @BobLd in UglyToad/PdfPig#1157 * Add initial support to process CFF fonts contained inside a TrueType font by @BobLd in UglyToad/PdfPig#1159 * Handle non seekable stream by copying it into a memory stream and fix #1146 by @BobLd in UglyToad/PdfPig#1158 * handle case where offsets are out of range by @EliotJones in UglyToad/PdfPig#1160 * Use record struct in FileHeaderOffset by @BobLd in UglyToad/PdfPig#1161 * Expose letter's font via GetFont(), make Font property as obsolete and use FontDetails instead by @BobLd in UglyToad/PdfPig#1166 * Add GetDescent() and GetAscent() to IFont and loose bounding box to letter by @BobLd in UglyToad/PdfPig#1167 * Use pageFactoryCache.Clear() in Pages dispose and fix #1170 by @BobLd in UglyToad/PdfPig#1174 * Bugfix: xref-streams were not added by @ricflams in UglyToad/PdfPig#1173 * Guard against circular references in XRef tables/streams by @ricflams in UglyToad/PdfPig#1175 * Add more tests to NearestNeighbourWordExtractorTests by @BobLd in UglyToad/PdfPig#1180 * Feature/improve group indexes by @BobLd in UglyToad/PdfPig#1181 * Trim excess in long lived font collections by @BobLd in UglyToad/PdfPig#1184 * Set Type 3 font ascent to Top instead of Height, see #1164 by @BobLd in UglyToad/PdfPig#1185 * Only apply RemoveStridePadding() when bytes per pixel is one and fix #1183 by @BobLd in UglyToad/PdfPig#1187 * Use zlib decode to properly use window size and checksum in flate filter by @rhuijben in UglyToad/PdfPig#1186 * Avoid doing a true file seek for simple peeking in the token parser by @rhuijben in UglyToad/PdfPig#1188 * Fix regression introduced in 3592fc8 where slicing the stream to the length breaks decoding by @BobLd in UglyToad/PdfPig#1192 * Update NameToUnicodeConvertAglSpecification to test what was intended by @rhuijben in UglyToad/PdfPig#1191 * Add CMap caching at document level and add MurmurHash3 hashing function by @BobLd in UglyToad/PdfPig#1193 * Avoid reading ahead and then seeking back by @rhuijben in UglyToad/PdfPig#1189 * Do not slice the stream to the length breaks decoding in FlateDecode by @BobLd in UglyToad/PdfPig#1194 ... (truncated) ## 0.1.11 Welcome to version 0.1.11. The changes in this version have mainly focused on stability. There is a breaking API change. We have also started to run tests against a larger corpus of documents from Common Crawl allowing us to find bugs and malformed files proactively. This release is screened against 6000 additional files. - Improvements to content and font parsing detected by fuzzing inputs. - Improvements and resiliency for finding the `startxref` location when parsing a file.. - Adds build and tests for Mac OS as well as retrieving system fonts on iPad (Mac Catalyst). - Support clipping when rendering XObjects. - Prevent malformed files leading to an out-of-memory when decompressing streams. - Make `IGraphicsStateOperationFactory` and `ReflectionGraphicsStateOperationFactory` public. - Softmask support for images. - Performance improvements using `Span` and `ReadOnlyMemory` where available. - Handle corrupt files where the stream contains comment tokens. - Improvements to copying from existing files when using `PdfDocumentBuilder`, fixes some bugs with copying fonts and dictionary tokens referenced indirectly. - Handle corrupt files with double `endstream` definitions. - More tolerant parsing for a number of invalid PDFs, including invalid USC2 input, CMAP formats, CFF fonts, missing font subtypes, invalid `xref` table positions, missing `/FirstChar` entry for font dictionaries and corrupt ASCII 85 encoded data. - Fix an issue where adding content to an existing PDF using `PdfDocumentBuilder` could result in upside-down or wrongly positioned text due to global transforms in the source PDF. - New option to completely skip annotations when building a document. - Prevent infinite loops in certain documents #1096. - Improved performance when tokenizing numbers, this should provide a minor speed improvement. - When adding a page from an existing PDF to a `PdfDocumentBuilder` any external link annotations should be preserved. ### Breaking changes The method on `PdfDocumentBuilder`: ``` public PdfPageBuilder AddPage(PdfDocument document, int pageNumber, Func<PdfAction, PdfAction?>? copyLink) ``` Has been changed to wrap the `copyLink` parameter in an options object to support the `KeepAnnotations` option: ``` public PdfPageBuilder AddPage(PdfDocument document, int pageNumber, AddPageOptions options) ``` You can just set the `CopyLinkFunc` property in the options object if you need to access this functionality. ## Auto generated change log * Bump version to 0.1.11-alpha001 by @BobLd in UglyToad/PdfPig#1009 * Improve Jpeg2000Helper to support J2K codec and add test by @BobLd in UglyToad/PdfPig#1010 * Add SetStrokeDetails() and SetFillDetails() to PdfPath and tidy up ContentStreamProcessor by @BobLd in UglyToad/PdfPig#1014 * Implement clipping in ProcessFormXObject() by @BobLd in UglyToad/PdfPig#1015 * Fix #1017 by @lofcz in UglyToad/PdfPig#1018 * Fix PatternColor Equals() method and fix #1016 by @BobLd in UglyToad/PdfPig#1019 * Feature/image mask by @BobLd in UglyToad/PdfPig#1012 * Update README.md by @BobLd in UglyToad/PdfPig#1020 * Fix bug where FormXObject bbox needs to be normalised by @BobLd in UglyToad/PdfPig#1021 * Add MacOS test pipeline and fix failing tests by @BobLd in UglyToad/PdfPig#1025 ... (truncated) ## 0.1.10 ## What's Changed * Fix GetTextOrientation by cleanly checking if rotation is divisible by 90 and fix #913 by @BobLd in UglyToad/PdfPig#914 * Add early version of BrowserSystemFontLister by @BobLd in UglyToad/PdfPig#920 * Remove list from FileTrailerParser.GetStartXrefPosition() by @BobLd in UglyToad/PdfPig#922 * Reorganise Filters and make them public by @BobLd in UglyToad/PdfPig#925 * Support decrypting V4/R4 files with AESV2 and no Length property by @Greybird in UglyToad/PdfPig#924 * Use pdfScanner in ReadVerticalDisplacements and fix #693 and return 0… by @BobLd in UglyToad/PdfPig#928 * Default page number to 0 in ExplicitDestination when the Dest has no page number and fix #736 by @BobLd in UglyToad/PdfPig#930 * Move Paths, GetAnnotations() and GetOptionalContents() outside of ExperimentalAccess and mark Experimental class and reference as obsolete by @BobLd in UglyToad/PdfPig#931 * Upgrade tests project NuGet packages by @BobLd in UglyToad/PdfPig#932 * Optimize cross reference object offset validation by avoiding nested loop by @madelson in UglyToad/PdfPig#935 * Revive trimming/AOT analysis by @madelson in UglyToad/PdfPig#939 * Stop treating Warnings as Errors by @BobLd in UglyToad/PdfPig#941 * Handle alternate Unicode name representation cXXX and fix #943 by @BobLd in UglyToad/PdfPig#944 * Handle odd ligatures names and fix #945 by @BobLd in UglyToad/PdfPig#946 * Update additional glyph list to latest from PDFBox by @BobLd in UglyToad/PdfPig#948 * New GetText() option: NegativeGapAsWhitespace by @Kizaemon in UglyToad/PdfPig#952 * Fix for IndexOutOfRangeException exception by @GrabzIt in UglyToad/PdfPig#955 * Fix "Nightly Release" pipeline following csproj changes by @BobLd in UglyToad/PdfPig#957 * Do not throw exception when lenient parsing in ON in CrossReferenceParser and fix #959 by @BobLd in UglyToad/PdfPig#961 * Improve UnwrapIndexedColorSpaceBytes by @BobLd in UglyToad/PdfPig#962 * Fix out of range exception in AnnotationProvider by @BobLd in UglyToad/PdfPig#963 * Return a copy of the ArrayPoolBufferWriter buffer in Ascii85, AsciiHex and RunLength filters and fix #964 by @BobLd in UglyToad/PdfPig#965 * Make ColorSpaceDetails.BaseNumberOfColorComponents public to allow for external image factories by @BobLd in UglyToad/PdfPig#966 * Improve GlyphList by @BobLd in UglyToad/PdfPig#967 * Properly handle ZapfDingbats font for TrueTypeSimpleFont and add tests by @BobLd in UglyToad/PdfPig#969 * Execute RemoveStridePadding in place when possible by @BobLd in UglyToad/PdfPig#968 * Add HexToken case in OptionalContent parsing by @simonedd in UglyToad/PdfPig#971 * Update UglyToad.PdfPig.ConsoleRunner target framework to net8 by @BobLd in UglyToad/PdfPig#972 * Do not throw error on Pop when stack size is 1 in lenient mode and fix #973 by @BobLd in UglyToad/PdfPig#974 * Fix warnings about "type 'K' cannot be used as type parameter 'TKey' in the generic type or method 'Dictionary<TKey, TValue>'" by @BobLd in UglyToad/PdfPig#976 * Refactor XObjectFactory by @BobLd in UglyToad/PdfPig#977 * Update UnpackComponents() to account for 1bpc + DeviceGray (hack for Jbig2) by @BobLd in UglyToad/PdfPig#978 * CcittFaxDecodeFilter: do not check for input length, invert bitmap with ref byte and fix #982 by @BobLd in UglyToad/PdfPig#983 * Add JPX bits per component decoding by @BobLd in UglyToad/PdfPig#986 * Issues/987 by @BobLd in UglyToad/PdfPig#990 * Make DecodeParameterResolver class public by @BobLd in UglyToad/PdfPig#993 * Update Microsoft and SkiaSharp NuGet packages by @BobLd in UglyToad/PdfPig#994 * Update Microsoft NuGet packages for UglyToad.PdfPig.Package by @BobLd in UglyToad/PdfPig#996 * Resolve image data (implementation from @kasperdaff) by @BobLd in UglyToad/PdfPig#998 * Pass IFilterProvider to IFilter.Decode() and handle null in PdfExtensions.Resolve() by @BobLd in UglyToad/PdfPig#999 * Improve GetExtendedGraphicsStateDictionary() and StackDictionary.TryGetValue() by @BobLd in UglyToad/PdfPig#1004 * Better handle integer overflow in DocstrumBoundingBoxes by @BobLd in UglyToad/PdfPig#1005 * version 0.1.10 by @BobLd in UglyToad/PdfPig#1006 * Update run_integration_tests.yml by @BobLd in UglyToad/PdfPig#1007 ## New Contributors * @madelson made their first contribution in UglyToad/PdfPig#935 * @Kizaemon made their first contribution in UglyToad/PdfPig#952 * @GrabzIt made their first contribution in UglyToad/PdfPig#955 ... (truncated) Commits viewable in [compare view](UglyToad/PdfPig@v0.1.9...0.1.13). </details> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Laurent Ellerbach <laurelle@microsoft.com>


This builds upon PR #1188 (=first commit of this PR).
Cleans up all builtin tokenizers to no longer read one byte too much but instead use the Peek() on the input bytes.
This allows simplifying logic in quite a few places where other parsers had to compensate for reading a byte early on. Or where parsers/tokenizers had to seek back to make things work.
Also make ReadsNextByte a constant virtual method instead of a hidden readonly variable in each parser.