[Question] Differences between PyMuPDF blocks extraction in version 1.17 and 1.18 #1362
-
I am sure there is. I am going to convert this to a discussion ...
-
Looks fine, everything is as you said you expect:
>>> import fitz
>>> fitz.version
('1.18.14', '1.18.0', '20210601081138')
>>> doc=fitz.open("DODF.003.21-01-2020.EDICAO.EXTRA.pdf")
>>> page=doc[0]
>>> blocks=page.get_text("blocks")
>>> from pprint import pprint
>>> pprint(blocks[:3])
[(56.691890716552734,
55.67547607421875,
765.3434448242188,
214.10137939453125,
'<image: DeviceGray, width: 1492, height: 330, bpc: 8>',
0,
1),
(56.558387756347656,
881.2556762695312,
712.5414428710938,
898.0530395507812,
'Este documento pode ser verificado\n'
'no endereço eletrônico http://www.in.g o v. b r / a u t e n t i c i d a d e '
'. h t m l ,\n'
'pelo código 50102020012100001\n'
'Documento assinado digitalmente conforme MP nº 2.200-2 de 24/08/2001, que '
'institui a\n'
'Infraestrutura de Chaves Públicas Brasileira - ICP-Brasil.\n',
1,
0),
(119.05406188964844,
189.93833923339844,
726.0125732421875,
207.86404418945312,
'ANO XLVIX EDIÇÃO EXTRA No- 3\n'
'BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE 2020\n',
2,
0)]
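The last item of each tuple in the output above appears to be the block type: 1 for the image block at the front, 0 for the text blocks. That leading image block is what shifts the block numbers by one compared to an extraction without images. A minimal sketch of filtering it out, assuming the tuple layout `(x0, y0, x1, y1, text, block_no, block_type)`; `text_blocks` is a hypothetical helper name, not part of PyMuPDF, and the coordinates in `sample` are rounded from the output above:

```python
def text_blocks(blocks):
    """Keep only text blocks, assuming the last tuple item is the block type
    (0 = text, 1 = image)."""
    return [b for b in blocks if b[6] == 0]


# Two blocks abbreviated from the output above (coordinates rounded).
sample = [
    (56.69, 55.68, 765.34, 214.10,
     "<image: DeviceGray, width: 1492, height: 330, bpc: 8>", 0, 1),
    (119.05, 189.94, 726.01, 207.86,
     "ANO XLVIX EDIÇÃO EXTRA No- 3\n"
     "BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE 2020\n", 2, 0),
]

only_text = text_blocks(sample)
print(len(only_text))
```

Note that the surviving blocks keep their original block numbers, so they do not start at 0 after filtering.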
-
Well, that is strange.
>>> fitz.version
('1.18.0', '1.18.0', '20201006071559')
>>> from pprint import pprint
>>> pprint(file[0].getText("blocks")[:3])
[(56.558387756347656,
881.2556762695312,
712.5414428710938,
898.0530395507812,
'Este documento pode ser verificado no endereço eletrônico http://www.in.g o '
'v. b r / a u t e n t i c i d a d e . h t m l , pelo código '
'50102020012100001 Documento assinado digitalmente conforme MP nº 2.200-2 de '
'24/08/2001, que institui a Infraestrutura de Chaves Públicas Brasileira - '
'ICP-Brasil.',
0,
0),
(119.05406188964844,
189.93833923339844,
726.0125732421875,
207.86404418945312,
'ANO XLVIX EDIÇÃO EXTRA No- 3 BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE '
'2020',
1,
0),
(201.19203186035156,
295.4358215332031,
259.4459228515625,
309.9049987792969,
'SEÇÃO II',
2,
0)]
-
Hm, then for some obscure reason, a different setting of the flags parameter might be in use. Please try this:
>>> import fitz
>>> doc=fitz.open("DODF.003.21-01-2020.EDICAO.EXTRA.pdf")
>>> page=doc[0]
>>> blocks=page.get_text("blocks", flags=0) # no images, no whitespace conversion
>>> from pprint import pprint
>>> pprint(blocks[:3])
[(56.558387756347656,
881.2556762695312,
712.5414428710938,
898.0530395507812,
'Este documento pode ser verificado\n'
'no endereço eletrônico http://www.in.g o v. b r / a u t e n t i c i d a d e '
'. h t m l ,\n'
'pelo código 50102020012100001\n'
'Documento assinado digitalmente conforme MP nº 2.200-2 de 24/08/2001, que '
'institui a\n'
'Infraestrutura de Chaves Públicas Brasileira - ICP-Brasil.\n',
0,
0),
(119.05406188964844,
189.93833923339844,
726.0125732421875,
207.86404418945312,
'ANO XLVIX EDIÇÃO EXTRA No- 3\n'
'BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE 2020\n',
1,
0),
(201.19203186035156,
295.4358215332031,
259.4459228515625,
309.9049987792969,
'SEÇÃO II\n',
2,
0)]
-
I got the same result even when setting flags=0. But I also tried the version that you are using for testing, 1.18.14, and I got the blocks with linebreaks:
>>> fitz.version
('1.18.14', '1.18.0', '20210601081138')
>>> from pprint import pprint
>>> pprint(file[0].getText('blocks')[:3])
[(56.691890716552734,
55.67547607421875,
765.3434448242188,
214.10137939453125,
'<image: DeviceGray, width: 1492, height: 330, bpc: 8>',
0,
1),
(56.558387756347656,
881.2556762695312,
712.5414428710938,
898.0530395507812,
'Este documento pode ser verificado\n'
'no endereço eletrônico http://www.in.g o v. b r / a u t e n t i c i d a d e '
'. h t m l ,\n'
'pelo código 50102020012100001\n'
'Documento assinado digitalmente conforme MP nº 2.200-2 de 24/08/2001, que '
'institui a\n'
'Infraestrutura de Chaves Públicas Brasileira - ICP-Brasil.\n',
1,
0),
(119.05406188964844,
189.93833923339844,
726.0125732421875,
207.86404418945312,
'ANO XLVIX EDIÇÃO EXTRA No- 3\n'
'BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE 2020\n',
2,
0)]
After this, I tried 1.18.1 and it also gave blocks with linebreaks. So it seems to be a change isolated to versions 1.17.7 to 1.18.0.
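For code that has to behave the same regardless of which version produced the blocks, one option is to normalize the block text after extraction. A minimal sketch, assuming the only difference is whether line breaks inside a block are kept or replaced by spaces, as in the outputs in this thread; `normalize_block_text` is a hypothetical helper name, not part of PyMuPDF:

```python
def normalize_block_text(text):
    """Collapse every run of whitespace (including line breaks) into a single
    space and strip leading and trailing whitespace."""
    return " ".join(text.split())


# Title block as returned with linebreaks (1.18.1 and 1.18.14 in this thread).
with_breaks = (
    "ANO XLVIX EDIÇÃO EXTRA No- 3\n"
    "BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE 2020\n"
)
# The same block as returned without linebreaks (1.18.0 in this thread).
without_breaks = (
    "ANO XLVIX EDIÇÃO EXTRA No- 3 BRASÍLIA - DF, TERÇA-FEIRA, 21 DE JANEIRO DE 2020"
)

print(normalize_block_text(with_breaks) == without_breaks)
```

This discards the line structure inside a block, so it only fits cases where the joined text is what is wanted.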