diff --git a/doc/pdf.md b/doc/pdf.md new file mode 100644 index 0000000000..cc53fcdf0b --- /dev/null +++ b/doc/pdf.md @@ -0,0 +1,127 @@ +Design notes from Ken Sharp, with light editing. + +We think one solution is a font with a single glyph (.notdef) and a +CIDToGIDMap which maps all the CIDs to 0. That map would then be +stored as a stream in the PDF file, and when flat compressed should +be pretty small. The font, of course, will be approximately the same +size as the one you currently use. + +I'm working on such a font now, the CIDToGIDMap is trivial, you just +create a stream object which contains 128k bytes (2 bytes per possible +CID and your CIDs range from 0 to 65535) and where you currently have +`"/CIDToGIDMap /Identity"` you would have `"/CIDToGIDMap 0 R"`. + +Note that if, in future, you were to use a different (ie not 2 byte) +CMap for character codes you could trivially extend the CIDToGIDMap. + +The following is an explanation of how some of the font stuff works, +this may be too simple for you in which case please accept my +apologies, its hard to know how much knowledge someone has. You can +skip all this anyway, its just for information. + +The font embedded in a PDF file is usually intended just to be +rendered, but extensions allow for at least some ability to locate (or +copy) text from a document. This isn't something which was an original +goal of the PDF format, but its been retro-fitted, presumably due to +popular demand. + +To do this reliably the PDF file must contain a ToUnicode CMap, a +device for mapping character codes to Unicode code points. If one of +these is present, then this will be used to convert the character +codes into Unicode values. If its not present then the reader will +fall back through a series of heuristics to try and guess the +result. This is, as you would expect, prone to failure. + +This doesn't concern you of course, since you always write a ToUnicode +CMap, so because you are writing the text in text rendering mode 3 it +would seem that you don't really need to worry about this, but in the +PDF spec you cannot have an isolated ToUnicode CMap, it has to be +attached to a font, so in order to get even copy/paste to work you +need to define a font. + +This is what leads to problems, tools like pdfwrite assume that they +are going to be able to (or even have to) modify the font entries, so +they require that the font being embedded be valid, and to be honest +the font Tesseract embeds isn't valid (for this purpose). + +To see why lets look at how text is specified in a PDF file: + +`(Test) Tj` + +Now that looks like text but actually it isn't. Each of those bytes is +a 'character code'. When it comes to rendering the text a complex +sequence of events takes place, which converts the character code into +'something' which the font understands. Its entirely possible via +character mappings to have that text render as 'Sftu' + +For simple fonts (PostScript type 1), we use the character code as the +index into an Encoding array (256 elements), each element of which is +a glyph name, so this gives us a glyph name. We then consult the +CharStrings dictionary in the font, that's a complex object which +contains pairs of keys and values, you can use the key to retrieve a +given value. So we have a glyph name, we then use that as the key to +the dictionary and retrieve the associated value. For a type 1 font, +the value is a glyph program that describes how to draw the glyph. + +For CIDFonts, its a little more complicated. Because CIDFonts can be +large, using a glyph name as the key is unreasonable (it would also +lead to unfeasibly large Encoding arrays), so instead we use a 'CID' +as the key. CIDs are just numbers. + +But.... We don't use the character code as the CID. What we do is use +a CMap to convert the character code into a CID. We then use the CID +to key the CharStrings dictionary and proceed as before. So the 'CMap' +is the equivalent of the Encoding array, but its a more compact and +flexible representation. + +Note that you have to use the CMap just to find out how many bytes +constitute a character code, and it can be variable. For example you +can say if the first byte is 0x00->0x7f then its just one byte, if its +0x80->0xf0 then its 2 bytes and if its 0xf0->0xff then its 3 bytes. I +have seen CMaps defining character codes up to 5 bytes wide. + +Now that's fine for 'PostScript' CIDFonts, but its not sufficient for +TrueType CIDFonts. The thing is that TrueType fonts are accessed using +a Glyph ID (GID) (and the LOCA table) which may well not be anything +like the CID. So for this case PDF includes a CIDToGIDMap. That maps +the CIDs to GIDs, and we can then use the GID to get the glyph +description from the GLYF table of the font. + +So for a TrueType CIDFont, character-code->CID->GID->glyf-program. + +Looking at the PDF file I was supplied with we see that it contains +text like : + +`<0x0075> Tj` + +So we start by taking the character code (117) and look it up in the +CMap. Well you don't supply a CMap, you just use the Identity-H one +which is predefined. So character code 117 maps to CID 117. Then we +use the CIDToGIDMap, again you don't supply one, you just use the +predefined 'Identity' map. So CID 117 maps to GID 117. But the font we +were supplied with only contains 116 glyphs. + +Now for Latin that's not a huge problem, you can just supply a bigger +font. But for more complex languages that *is* going to be more of a +problem. Either you need to supply a font which contains glyphs for +all the possible CID->GID mappings, or we need to think laterally. + +Our solution using a TrueType CIDFont is to intervene at the +CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a +font with just one glyph, the .notdef glyph at GID 0. This is what I'm +looking into now. + +It would also be possible to have a 'PostScript' (ie type 1 outlines) +CIDFont which contained 1 glyph, and a CMap which mapped all character +codes to CID 0. The effect would be the same. + +Its possible (I haven't checked) that the PostScript CIDFont and +associated CMap would be smaller than the TrueType font and associated +CIDToGIDMap. + +--- in a followup --- + +OK there is a small problem there, if I use GID 0 then Acrobat gets +upset about it and complains it cannot extract the font. If I set the +CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally +mad...... diff --git a/src/api/pdfrenderer.cpp b/src/api/pdfrenderer.cpp index 88383bb7bf..d537ec6e80 100644 --- a/src/api/pdfrenderer.cpp +++ b/src/api/pdfrenderer.cpp @@ -45,139 +45,6 @@ using namespace std::literals; #define NO_PDF_COMPRESSION #endif -/* - -Design notes from Ken Sharp, with light editing. - -We think one solution is a font with a single glyph (.notdef) and a -CIDToGIDMap which maps all the CIDs to 0. That map would then be -stored as a stream in the PDF file, and when flat compressed should -be pretty small. The font, of course, will be approximately the same -size as the one you currently use. - -I'm working on such a font now, the CIDToGIDMap is trivial, you just -create a stream object which contains 128k bytes (2 bytes per possible -CID and your CIDs range from 0 to 65535) and where you currently have -"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap 0 R". - -Note that if, in future, you were to use a different (ie not 2 byte) -CMap for character codes you could trivially extend the CIDToGIDMap. - -The following is an explanation of how some of the font stuff works, -this may be too simple for you in which case please accept my -apologies, its hard to know how much knowledge someone has. You can -skip all this anyway, its just for information. - -The font embedded in a PDF file is usually intended just to be -rendered, but extensions allow for at least some ability to locate (or -copy) text from a document. This isn't something which was an original -goal of the PDF format, but its been retro-fitted, presumably due to -popular demand. - -To do this reliably the PDF file must contain a ToUnicode CMap, a -device for mapping character codes to Unicode code points. If one of -these is present, then this will be used to convert the character -codes into Unicode values. If its not present then the reader will -fall back through a series of heuristics to try and guess the -result. This is, as you would expect, prone to failure. - -This doesn't concern you of course, since you always write a ToUnicode -CMap, so because you are writing the text in text rendering mode 3 it -would seem that you don't really need to worry about this, but in the -PDF spec you cannot have an isolated ToUnicode CMap, it has to be -attached to a font, so in order to get even copy/paste to work you -need to define a font. - -This is what leads to problems, tools like pdfwrite assume that they -are going to be able to (or even have to) modify the font entries, so -they require that the font being embedded be valid, and to be honest -the font Tesseract embeds isn't valid (for this purpose). - - -To see why lets look at how text is specified in a PDF file: - -(Test) Tj - -Now that looks like text but actually it isn't. Each of those bytes is -a 'character code'. When it comes to rendering the text a complex -sequence of events takes place, which converts the character code into -'something' which the font understands. Its entirely possible via -character mappings to have that text render as 'Sftu' - -For simple fonts (PostScript type 1), we use the character code as the -index into an Encoding array (256 elements), each element of which is -a glyph name, so this gives us a glyph name. We then consult the -CharStrings dictionary in the font, that's a complex object which -contains pairs of keys and values, you can use the key to retrieve a -given value. So we have a glyph name, we then use that as the key to -the dictionary and retrieve the associated value. For a type 1 font, -the value is a glyph program that describes how to draw the glyph. - -For CIDFonts, its a little more complicated. Because CIDFonts can be -large, using a glyph name as the key is unreasonable (it would also -lead to unfeasibly large Encoding arrays), so instead we use a 'CID' -as the key. CIDs are just numbers. - -But.... We don't use the character code as the CID. What we do is use -a CMap to convert the character code into a CID. We then use the CID -to key the CharStrings dictionary and proceed as before. So the 'CMap' -is the equivalent of the Encoding array, but its a more compact and -flexible representation. - -Note that you have to use the CMap just to find out how many bytes -constitute a character code, and it can be variable. For example you -can say if the first byte is 0x00->0x7f then its just one byte, if its -0x80->0xf0 then its 2 bytes and if its 0xf0->0xff then its 3 bytes. I -have seen CMaps defining character codes up to 5 bytes wide. - -Now that's fine for 'PostScript' CIDFonts, but its not sufficient for -TrueType CIDFonts. The thing is that TrueType fonts are accessed using -a Glyph ID (GID) (and the LOCA table) which may well not be anything -like the CID. So for this case PDF includes a CIDToGIDMap. That maps -the CIDs to GIDs, and we can then use the GID to get the glyph -description from the GLYF table of the font. - -So for a TrueType CIDFont, character-code->CID->GID->glyf-program. - -Looking at the PDF file I was supplied with we see that it contains -text like : - -<0x0075> Tj - -So we start by taking the character code (117) and look it up in the -CMap. Well you don't supply a CMap, you just use the Identity-H one -which is predefined. So character code 117 maps to CID 117. Then we -use the CIDToGIDMap, again you don't supply one, you just use the -predefined 'Identity' map. So CID 117 maps to GID 117. But the font we -were supplied with only contains 116 glyphs. - -Now for Latin that's not a huge problem, you can just supply a bigger -font. But for more complex languages that *is* going to be more of a -problem. Either you need to supply a font which contains glyphs for -all the possible CID->GID mappings, or we need to think laterally. - -Our solution using a TrueType CIDFont is to intervene at the -CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a -font with just one glyph, the .notdef glyph at GID 0. This is what I'm -looking into now. - -It would also be possible to have a 'PostScript' (ie type 1 outlines) -CIDFont which contained 1 glyph, and a CMap which mapped all character -codes to CID 0. The effect would be the same. - -Its possible (I haven't checked) that the PostScript CIDFont and -associated CMap would be smaller than the TrueType font and associated -CIDToGIDMap. - ---- in a followup --- - -OK there is a small problem there, if I use GID 0 then Acrobat gets -upset about it and complains it cannot extract the font. If I set the -CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally -mad...... - -*/ - namespace tesseract { // If the font is 10 pts, nominal character width is 5 pts