The problem is that nothing in the PDF identifies text by category the way HTML does (headings, paragraphs, and so on). Your HTML extraction simply contains everything on that page as text, albeit with various properties such as position, font, and font size.

<div id="page0" style="width:612.0pt;height:792.0pt">
<p style="top:74.1pt;left:72.1pt;line-height:14.4pt"><b><span style="font-family:Calibri,sans-serif;font-size:14.4pt">At Glance </span></b></p>
<p style="top:99.9pt;left:72.1pt;line-height:11.2pt"><span style="font-family:Calibri,sans-serif;font-size:11.2pt">Football, also called association football or soccer, is a game involving two teams of 11 players who try </span></p>
<p style="top:114.4pt;left:72.1pt;line-height:11.2pt"><span …
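
Since the PDF carries no categories, any heading/body distinction has to be inferred from those properties. The following is a minimal sketch of one possible heuristic, not something the library does for you: it parses the extracted HTML with Python's standard `html.parser`, treats the font size that covers the most characters as body text, and labels anything larger as a heading. The names `SpanCollector` and `classify` are hypothetical, and `SAMPLE` is just the two complete lines from the output above with a closing `</div>` added so the fragment is well-formed.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# The two complete lines from the extraction above (closing </div> added).
SAMPLE = """<div id="page0" style="width:612.0pt;height:792.0pt">
<p style="top:74.1pt;left:72.1pt;line-height:14.4pt"><b><span style="font-family:Calibri,sans-serif;font-size:14.4pt">At Glance </span></b></p>
<p style="top:99.9pt;left:72.1pt;line-height:11.2pt"><span style="font-family:Calibri,sans-serif;font-size:11.2pt">Football, also called association football or soccer, is a game involving two teams of 11 players who try </span></p>
</div>"""

FONT_SIZE = re.compile(r"font-size:\s*([\d.]+)pt")


class SpanCollector(HTMLParser):
    """Collect (text, font_size) pairs from <span style="...font-size:Npt"> elements."""

    def __init__(self):
        super().__init__()
        self.size = None   # font size of the span currently open, if any
        self.spans = []    # list of (text, size)

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            match = FONT_SIZE.search(dict(attrs).get("style") or "")
            self.size = float(match.group(1)) if match else None

    def handle_endtag(self, tag):
        if tag == "span":
            self.size = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.size is not None:
            self.spans.append((text, self.size))


def classify(html):
    """Label each span 'heading' or 'body' by comparing its size to the dominant size."""
    parser = SpanCollector()
    parser.feed(html)
    parser.close()
    if not parser.spans:
        return []
    # Assumption: the size that covers the most characters is the body text size.
    chars_per_size = Counter()
    for text, size in parser.spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]
    return [("heading" if size > body_size else "body", text)
            for text, size in parser.spans]


if __name__ == "__main__":
    for label, text in classify(SAMPLE):
        print(f"{label:8} {text}")
```

This is only a guess based on layout. A document that sets headings in bold at body size, or uses several body sizes, would need additional rules (for example checking for `<b>` or the `top`/`left` positions).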

Answer selected by JorjMcKie