Commit a5b5aaf

Partitioning overview: handwriting and multilanguage character treatments in PDF files (#659)
1 parent 184dbec commit a5b5aaf

6 files changed: 27 additions, 0 deletions


img/partitioning/Hiragana-Fast.png


img/partitioning/Hiragana-VLM.png


ui/partitioning.mdx

Lines changed: 27 additions & 0 deletions
@@ -61,6 +61,33 @@ The following example shows GPT-4o by OpenAI being used. If the **Auto** strateg
![The VLM strategy processes tables in PDF files with table summaries and text as HTML](/img/partitioning/VLM-Auto-Table-GPT-4o-Example.png)

## Handwriting and multilanguage characters in PDF files

The differences between the partitioning strategies show most clearly in how each one handles handwriting and multilanguage characters in PDF files.

The **Fast** partitioning strategy skips handwriting in PDF files altogether.

The **Fast** strategy processes multilanguage characters in PDF files with limited output, depending on the language. In the following example, Japanese hiragana characters are processed as text, but they come out as cryptic CID codes that are very difficult to work with:

![The Fast strategy produces cryptic CID codes for hiragana characters](/img/partitioning/Hiragana-Fast.png)

For handwriting, the **High Res** strategy typically produces unusable results, for example:

![The High Res strategy typically produces unusable results for handwriting](/img/partitioning/Handwriting-Hi-Res.png)

For multilanguage characters, the **High Res** strategy also typically produces unusable results. In the following example, it fails to recognize Japanese hiragana characters:

![The High Res strategy typically produces unusable results for multilanguage characters](/img/partitioning/Hiragana-Hi-Res.png)

The **VLM** strategy can produce great results for handwriting, as in this example, which uses GPT-4o by OpenAI:

![The VLM strategy can process handwriting well](/img/partitioning/Handwriting-VLM-GPT-4o.png)

The **VLM** strategy also recognizes multilanguage characters well, as in this example, which uses GPT-4o by OpenAI to recognize Japanese hiragana characters:

![The VLM strategy can process Japanese hiragana well](/img/partitioning/Hiragana-VLM.png)
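
The **Fast** hiragana example earlier in this section yields CID codes instead of readable text. One way to catch that kind of output before using it downstream is to measure how much of the extracted text consists of such codes. The following is a minimal sketch, not part of the product: it assumes the codes appear in the `(cid:N)` form that PDF text extractors commonly emit for glyphs without a Unicode mapping, and the function names `cid_ratio` and `looks_unusable` and the `0.5` threshold are hypothetical choices made for illustration.

```python
import re

# Assumption: unmapped glyphs show up as placeholders such as "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that belong to CID placeholders."""
    # Drop all whitespace so spacing between placeholders does not skew the ratio.
    stripped = "".join(text.split())
    if not stripped:
        return 0.0
    cid_chars = sum(len(match.group(0)) for match in CID_PATTERN.finditer(stripped))
    return cid_chars / len(stripped)


def looks_unusable(text: str, threshold: float = 0.5) -> bool:
    """Flag text whose CID placeholder share meets or exceeds the (hypothetical) threshold."""
    return cid_ratio(text) >= threshold


if __name__ == "__main__":
    sample = "(cid:123)(cid:45) (cid:678)"
    print(looks_unusable(sample))
```

If a check like this flags a document, the comparison in this section suggests that reprocessing it with the **VLM** strategy may be worth trying.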
## Supported languages

**Fast** partitioning accepts any text input, though automatic language detection of that input is limited to what [langdetect](https://pypi.org/project/langdetect/) supports.
