Docx -> html conversion doesn't convert all Captions to Figures. #10966
-
Probably there's some difference in how the images are represented in the docx. If you create a minimal example, with one that works as a figure and one that doesn't, I could have a look.
-
Here's a diff of the relevant bits of the XML:
--- /Users/jgm/Downloads/working.xml 2025-07-15 09:16:32
+++ /Users/jgm/Downloads/broken.xml 2025-07-15 09:16:47
@@ -1,18 +1,16 @@
-<w:p w14:paraId="55020D4E" w14:textId="4A02BABC" w:rsidR="00556991" w:rsidRDefault="003C066E" w:rsidP="00D9237F">
+<w:p w14:paraId="22CF94A0" w14:textId="6C7EECA1" w:rsidR="00163404" w:rsidRPr="00446636" w:rsidRDefault="00163404" w:rsidP="003E0058">
<w:pPr>
<w:pStyle w:val="Caption"/>
-<w:keepNext/>
</w:pPr>
<w:r>
<w:rPr>
-<w:rFonts w:cs="Arial"/>
<w:noProof/>
</w:rPr>
<w:drawing>
-<wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="5CBF5749" wp14:editId="23DFC6BB">
-<wp:extent cx="3852850" cy="2567940"/>
-<wp:effectExtent l="0" t="0" r="0" b="3810"/>
-<wp:docPr id="26" name="Picture 26" descr="Notebook"/>
+<wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="5C48C191" wp14:editId="150CB746">
+<wp:extent cx="4563857" cy="3041829"/>
+<wp:effectExtent l="0" t="0" r="8255" b="6350"/>
+<wp:docPr id="2" name="Picture 2" descr="Notebook"/>
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
</wp:cNvGraphicFramePr>
@@ -20,13 +18,11 @@
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
-<pic:cNvPr id="26" name="Picture 26" descr="Notebook"/>
-<pic:cNvPicPr>
-<a:picLocks noChangeAspect="1" noChangeArrowheads="1"/>
-</pic:cNvPicPr>
+<pic:cNvPr id="2" name="Picture 2" descr="Notebook"/>
+<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
-<a:blip r:embed="rId8" cstate="print">
+<a:blip r:embed="rId9" cstate="print">
<a:extLst>
<a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
<a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/>
@@ -37,18 +33,14 @@
<a:fillRect/>
</a:stretch>
</pic:blipFill>
-<pic:spPr bwMode="auto">
+<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
-<a:ext cx="3852850" cy="2567940"/>
+<a:ext cx="4563857" cy="3041829"/>
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst/>
</a:prstGeom>
-<a:noFill/>
-<a:ln>
-<a:noFill/>
-</a:ln>
</pic:spPr>
</pic:pic>
</a:graphicData>
@@ -57,11 +49,12 @@
</w:drawing>
</w:r>
</w:p>
-<w:p w14:paraId="4AC54DE4" w14:textId="3D68959F" w:rsidR="00556991" w:rsidRDefault="00556991" w:rsidP="00556991">
+<w:p w14:paraId="021D33AD" w14:textId="6872D59B" w:rsidR="00D640F1" w:rsidRDefault="00D640F1" w:rsidP="000270EC">
<w:pPr>
<w:pStyle w:val="Caption"/>
+<w:keepNext/>
</w:pPr>
-<w:bookmarkStart w:id="0" w:name="_Toc170477850"/>
+<w:bookmarkStart w:id="1" w:name="_Toc170477851"/>
<w:r>
<w:t xml:space="preserve">Figure
</w:t>
@@ -124,7 +117,7 @@
<w:rPr>
<w:noProof/>
</w:rPr>
-<w:t>1
+<w:t>2
</w:t>
</w:r>
<w:r w:rsidR="00F60AE2">
@@ -137,7 +130,7 @@
<w:t xml:space="preserve">
</w:t>
</w:r>
-<w:r w:rsidR="00581BA1">
+<w:r w:rsidRPr="00EB5FDA">
<w:t xml:space="preserve">Example
</w:t>
</w:r>
@@ -145,13 +138,13 @@
<w:t>o
</w:t>
</w:r>
-<w:r w:rsidR="00BA6CE3">
+<w:r w:rsidR="00466EFC" w:rsidRPr="00EB5FDA">
<w:t xml:space="preserve">f
</w:t>
</w:r>
-<w:bookmarkEnd w:id="0"/>
+<w:bookmarkEnd w:id="1"/>
<w:r w:rsidR="00A70A33">
-<w:t>working conversion
+<w:t>broken conversion
</w:t>
</w:r>
</w:p>
-
OK, I see the issue, I think. You have "keep next" set for the paragraph containing the caption, which follows the image. Pandoc takes this to indicate that the paragraph containing the caption goes with what follows it, rather than with the image, and so it doesn't produce a figure in this case.
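To find the affected captions without diffing the XML by hand, something like the following could help. It is only a rough sketch using the Python standard library, not anything from pandoc itself: the function names are made up for illustration, it treats any w:keepNext element as "on" (ignoring a possible w:val="0"), and it only compares each paragraph with the one directly before it.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_caption(par):
    """True if the paragraph uses the 'Caption' paragraph style."""
    style = par.find("w:pPr/w:pStyle", NS)
    return style is not None and style.get(f"{{{W}}}val") == "Caption"


def captions_with_keep_next(xml_text):
    """Return the text of each caption that follows an image paragraph
    and has keepNext set (the pattern described above)."""
    root = ET.fromstring(xml_text)
    paragraphs = list(root.iter(f"{{{W}}}p"))
    hits = []
    for prev, par in zip(paragraphs, paragraphs[1:]):
        has_image = prev.find(".//w:drawing", NS) is not None
        # Simplification: any <w:keepNext/> counts, whatever its w:val.
        keep_next = par.find("w:pPr/w:keepNext", NS) is not None
        if has_image and is_caption(par) and keep_next:
            hits.append("".join(t.text or "" for t in par.iter(f"{{{W}}}t")))
    return hits


def scan_docx(path):
    """Hypothetical helper: run the check on word/document.xml of a .docx."""
    with zipfile.ZipFile(path) as archive:
        return captions_with_keep_next(archive.read("word/document.xml"))
```

Calling scan_docx on the source document should list the caption texts to fix; turning off "Keep with next" for those paragraphs in Word would then be the change to try.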
-
I've hit an odd issue, and I can't tell whether it's a bug in the Docx reader or something in the source document.
All the images in the document use the Caption paragraph style, with an associated text caption placed below each image. When I convert the document to HTML, 3 of the 252 images come out as para { image }, para { caption text } rather than figure { caption, image }.
The other 249 figures display correctly with my CSS styling, but the three misconverted images don't pick up the styling. I've written a Blocks-level Lua filter to find the offending paragraphs, but I can't find a good example of manually creating a Figure element using data already in the AST.
The line
new_figure = pandoc.Figure()
in my filter doesn't work and throws Lua errors at runtime.
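For illustration, here is a rough sketch of building a Figure from blocks already in the AST, written against pandoc's JSON AST in plain Python rather than Lua. It assumes every image-only paragraph at the top level is followed directly by its caption paragraph, which may not hold for other documents, and the function names are invented for this example. In a Lua filter the analogous call would presumably be pandoc.Figure(content, caption), since the constructor expects the figure's content blocks as its first argument; that mapping is my reading of the Lua API and is not confirmed in this thread.

```python
import json


def is_image_para(block):
    """True for a Para whose only inline is an Image."""
    return (
        block.get("t") == "Para"
        and len(block["c"]) == 1
        and block["c"][0].get("t") == "Image"
    )


def merge_figures(blocks):
    """Fold each image-only Para plus the Para after it into one Figure."""
    out = []
    i = 0
    while i < len(blocks):
        block = blocks[i]
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        if is_image_para(block) and nxt is not None and nxt.get("t") == "Para":
            attr = ["", [], []]                                # id, classes, key-values
            caption = [None, [{"t": "Plain", "c": nxt["c"]}]]  # no short caption
            content = [{"t": "Plain", "c": block["c"]}]        # the image itself
            out.append({"t": "Figure", "c": [attr, caption, content]})
            i += 2  # skip the caption paragraph that was just consumed
        else:
            out.append(block)
            i += 1
    return out


def filter_stream(src, dst):
    """Read a pandoc JSON document from src, write the rewritten one to dst."""
    doc = json.load(src)
    doc["blocks"] = merge_figures(doc["blocks"])
    json.dump(doc, dst)
```

To use it as a JSON filter, one option is to save it as a script that ends by calling filter_stream(sys.stdin, sys.stdout) (after import sys) and pass that script to pandoc with --filter. Only top-level blocks are rewritten; images nested in lists or tables are left alone.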