Docx -> html conversion doesn't convert all Captions to Figures. #10966

andrew-danieli-fj · 2025-07-14T15:53:08Z

andrew-danieli-fj
Jul 14, 2025

I've hit an odd issue, and I can't tell whether it's a bug in the Docx reader or something in the source document.

All the images in the document use the Caption style, with an associated text caption placed below each image. When I convert the document to HTML, 3 of the 252 images are converted to para { image }, para { caption text } rather than figure { caption, image }.

The 249 working figures are displayed correctly with my CSS styling, but the three misconverted images don't pick up the styling. I've written a Blocks-level Lua filter to find the offending paragraphs, but I can't find a good example of manually creating a Figure element from data already in the AST.

local new_figure = pandoc.Figure(pandoc.Plain{blocks[i].content, pandoc.Plain{blocks[i+1].content)
blocks[i] = new_figure
blocks:remove(i+1)

The new_figure = pandoc.Figure() line above doesn't work and throws Lua errors at runtime.

jgm · 2025-07-14T18:23:40Z

jgm
Jul 14, 2025
Maintainer

Probably there's some difference in how the images are represented in the docx. If you create a minimal example, with one that works as a figure and one that doesn't, I could have a look.

2 replies

andrew-danieli-fj Jul 15, 2025
Author

Hi, I've attached a minimal example.

Broken Captions.docx

I used pandoc -s --wrap=none -N --extract-media=. -f docx -t native ".\Broken Captions.docx" > brokenCaptions.native which generates the following AST.

Pandoc
  Meta { unMeta = fromList [] }
  [ Header
      1
      ( "example-captions" , [] , [] )
      [ Str "Example" , Space , Str "captions" ]
  , Header
      2
      ( "working-conversion" , [] , [] )
      [ Str "Working" , Space , Str "conversion" ]
  , Header
      4
      ( "the-captionimage-below-gets-converted-into-a-figure-block-correctly."
      , []
      , []
      )
      [ Str "The"
      , Space
      , Str "caption/image"
      , Space
      , Str "below"
      , Space
      , Str "gets"
      , Space
      , Str "converted"
      , Space
      , Str "into"
      , Space
      , Str "a"
      , Space
      , Str "Figure"
      , Space
      , Str "block"
      , Space
      , Str "correctly."
      ]
  , Figure
      ( "" , [] , [] )
      (Caption
         Nothing
         [ Para
             [ Str "Figure"
             , Space
             , Str "1\8209\&1"
             , Space
             , Str "Example"
             , Space
             , Str "of"
             , Space
             , Str "working"
             , Space
             , Str "conversion"
             ]
         ])
      [ Plain
          [ Image
              ( ""
              , []
              , [ ( "width" , "4.213527996500438in" )
                , ( "height" , "2.808333333333333in" )
                ]
              )
              [ Str "Notebook" ]
              ( "./media/image1.jpeg" , "" )
          ]
      ]
  , Header
      2
      ( "broken-conversion" , [] , [] )
      [ Str "Broken" , Space , Str "conversion" ]
  , Header
      4
      ( "the-caption-below-doesnt-get-converted-to-a-regular-figure-block."
      , []
      , []
      )
      [ Str "The"
      , Space
      , Str "caption"
      , Space
      , Str "below"
      , Space
      , Str "doesn\8217t"
      , Space
      , Str "get"
      , Space
      , Str "converted"
      , Space
      , Str "to"
      , Space
      , Str "a"
      , Space
      , Str "regular"
      , Space
      , Str "Figure"
      , Space
      , Str "block."
      ]
  , Para
      [ Image
          ( ""
          , []
          , [ ( "width" , "4.991094706911636in" )
            , ( "height" , "3.3265846456692914in" )
            ]
          )
          [ Str "Notebook" ]
          ( "./media/image2.jpeg" , "" )
      ]
  , Para
      [ Str "Figure"
      , Space
      , Str "1\8209\&2"
      , Space
      , Str "Example"
      , Space
      , Str "of"
      , Space
      , Str "broken"
      , Space
      , Str "conversion"
      ]
  ]

Setting aside the investigation into why it's happening, how would I replace the two incorrect Para objects with a properly structured Figure in the AST using a Lua filter? I've worked out how to find the offending Paras using a Blocks filter. I just haven't worked out how to define a new Figure object based on the contents of the Paras (para 1 is the image, para 2 is the caption).

I've also noticed that Pandoc has converted the hyphen in the caption text to an odd value, "\8209\&", which it struggles to convert back into a "-" when using the HTML writer. I haven't seen this in the original document I'm working on, so it could be a side effect of cleaning up the docx to create this minimal version (I may have stripped a property or XML attribute that Pandoc relies on).
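
For reference, `\8209` in the native output is the decimal escape for U+2011 (NON-BREAKING HYPHEN), and `\&` is only a separator that stops the following digit from being read as part of the escape. The character itself is preserved, not corrupted. If a plain ASCII hyphen is preferred in the output, a small Lua filter along these lines could normalise it. This is a sketch rather than anything from the original document, and the constant name `NB_HYPHEN` is made up for illustration.

```lua
-- UTF-8 encoding of U+2011 (NON-BREAKING HYPHEN), written as decimal byte escapes.
-- NB_HYPHEN is a hypothetical name chosen for this example.
local NB_HYPHEN = "\226\128\145"

--- Replace non-breaking hyphens in a Str element with ASCII hyphens.
function Str(el)
  -- gsub returns (string, count); the parentheses keep only the string
  el.text = (el.text:gsub(NB_HYPHEN, "-"))
  return el
end
```

Saved under a name such as `nbhyphen.lua` (hypothetical), it would be passed to pandoc with `--lua-filter=nbhyphen.lua`.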

andrew-danieli-fj Jul 15, 2025
Author

Right ☺️ I've managed to create a filter that does what I want, but since my knowledge of the Lua/pandoc syntax is limited, it could probably be improved. The code is shown below:

---Blocks filter: replace a Para holding an image, plus the Para that follows it,
---with a Figure whose caption is that following Para.
---@param blocks (Blocks | Block [])
---@return Blocks
function Blocks(blocks)
   local i = 1
   while i <= #blocks do
      local image_para, caption_para = blocks[i], blocks[i+1]
      -- `first` is false for non-Para blocks and nil for an empty Para.
      local first = image_para.tag == 'Para' and image_para.content[1]
      -- Match a Para whose first inline is an Image and which is followed by another Para.
      if first and first.tag == 'Image' and caption_para and caption_para.tag == 'Para' then

         -- I'm sure these next 3 lines can be combined into one.
         local new_figure = pandoc.Figure( pandoc.Blocks({}), pandoc.Blocks({}) ) -- create an empty figure object
         new_figure.caption.long = pandoc.Blocks({ caption_para }) -- set the contents of the caption
         new_figure.content = pandoc.Blocks({ pandoc.Plain(image_para.content) }) -- wrap the image in a Plain block

         blocks[i] = new_figure -- insert the figure over the original image para block
         blocks:remove(i+1) -- then remove the following para block containing the caption
      end

      i = i + 1
   end
   return blocks
end

So this provides a technical solution to the problem, but not the academic one of why pandoc sees two caption blocks in the Word document and converts them differently. It's probably going to be a Microsoft issue 😆.
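
The filter above accepts any Para that starts with an image, so an inline image at the start of an ordinary paragraph would also be converted. A stricter check might look like the sketch below. It assumes that a misconverted figure is always a Para containing only an Image, followed by a Para whose first word is "Figure" (as in the sample document). The function names are invented for illustration.

```lua
--- True if `block` is a Para whose only inline is an Image.
local function is_image_only_para(block)
  return block ~= nil
    and block.tag == 'Para'
    and #block.content == 1
    and block.content[1].tag == 'Image'
end

--- True if `block` is a Para whose first inline is the word "Figure".
--- Assumption: every caption starts with the label "Figure".
local function looks_like_figure_caption(block)
  if block == nil or block.tag ~= 'Para' then return false end
  local first = block.content[1]
  return first ~= nil and first.tag == 'Str' and first.text == 'Figure'
end

--- True if blocks[i] and blocks[i + 1] look like an image/caption pair
--- that was not turned into a Figure.
function is_unconverted_figure(blocks, i)
  return is_image_only_para(blocks[i]) and looks_like_figure_caption(blocks[i + 1])
end
```

Inside the `Blocks` function, the condition would then become `if is_unconverted_figure(blocks, i) then`.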

jgm · 2025-07-15T16:18:11Z

jgm
Jul 15, 2025
Maintainer

Here's a diff of the relevant bits of the XML:

--- /Users/jgm/Downloads/working.xml	2025-07-15 09:16:32
+++ /Users/jgm/Downloads/broken.xml	2025-07-15 09:16:47
@@ -1,18 +1,16 @@
-<w:p w14:paraId="55020D4E" w14:textId="4A02BABC" w:rsidR="00556991" w:rsidRDefault="003C066E" w:rsidP="00D9237F">
+<w:p w14:paraId="22CF94A0" w14:textId="6C7EECA1" w:rsidR="00163404" w:rsidRPr="00446636" w:rsidRDefault="00163404" w:rsidP="003E0058">
 <w:pPr>
 <w:pStyle w:val="Caption"/>
-<w:keepNext/>
 </w:pPr>
 <w:r>
 <w:rPr>
-<w:rFonts w:cs="Arial"/>
 <w:noProof/>
 </w:rPr>
 <w:drawing>
-<wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="5CBF5749" wp14:editId="23DFC6BB">
-<wp:extent cx="3852850" cy="2567940"/>
-<wp:effectExtent l="0" t="0" r="0" b="3810"/>
-<wp:docPr id="26" name="Picture 26" descr="Notebook"/>
+<wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="5C48C191" wp14:editId="150CB746">
+<wp:extent cx="4563857" cy="3041829"/>
+<wp:effectExtent l="0" t="0" r="8255" b="6350"/>
+<wp:docPr id="2" name="Picture 2" descr="Notebook"/>
 <wp:cNvGraphicFramePr>
 <a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
 </wp:cNvGraphicFramePr>
@@ -20,13 +18,11 @@
 <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
 <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
 <pic:nvPicPr>
-<pic:cNvPr id="26" name="Picture 26" descr="Notebook"/>
-<pic:cNvPicPr>
-<a:picLocks noChangeAspect="1" noChangeArrowheads="1"/>
-</pic:cNvPicPr>
+<pic:cNvPr id="2" name="Picture 2" descr="Notebook"/>
+<pic:cNvPicPr/>
 </pic:nvPicPr>
 <pic:blipFill>
-<a:blip r:embed="rId8" cstate="print">
+<a:blip r:embed="rId9" cstate="print">
 <a:extLst>
 <a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
 <a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/>
@@ -37,18 +33,14 @@
 <a:fillRect/>
 </a:stretch>
 </pic:blipFill>
-<pic:spPr bwMode="auto">
+<pic:spPr>
 <a:xfrm>
 <a:off x="0" y="0"/>
-<a:ext cx="3852850" cy="2567940"/>
+<a:ext cx="4563857" cy="3041829"/>
 </a:xfrm>
 <a:prstGeom prst="rect">
 <a:avLst/>
 </a:prstGeom>
-<a:noFill/>
-<a:ln>
-<a:noFill/>
-</a:ln>
 </pic:spPr>
 </pic:pic>
 </a:graphicData>
@@ -57,11 +49,12 @@
 </w:drawing>
 </w:r>
 </w:p>
-<w:p w14:paraId="4AC54DE4" w14:textId="3D68959F" w:rsidR="00556991" w:rsidRDefault="00556991" w:rsidP="00556991">
+<w:p w14:paraId="021D33AD" w14:textId="6872D59B" w:rsidR="00D640F1" w:rsidRDefault="00D640F1" w:rsidP="000270EC">
 <w:pPr>
 <w:pStyle w:val="Caption"/>
+<w:keepNext/>
 </w:pPr>
-<w:bookmarkStart w:id="0" w:name="_Toc170477850"/>
+<w:bookmarkStart w:id="1" w:name="_Toc170477851"/>
 <w:r>
 <w:t xml:space="preserve">Figure
 </w:t>
@@ -124,7 +117,7 @@
 <w:rPr>
 <w:noProof/>
 </w:rPr>
-<w:t>1
+<w:t>2
 </w:t>
 </w:r>
 <w:r w:rsidR="00F60AE2">
@@ -137,7 +130,7 @@
 <w:t xml:space="preserve">
 </w:t>
 </w:r>
-<w:r w:rsidR="00581BA1">
+<w:r w:rsidRPr="00EB5FDA">
 <w:t xml:space="preserve">Example
 </w:t>
 </w:r>
@@ -145,13 +138,13 @@
 <w:t>o
 </w:t>
 </w:r>
-<w:r w:rsidR="00BA6CE3">
+<w:r w:rsidR="00466EFC" w:rsidRPr="00EB5FDA">
 <w:t xml:space="preserve">f
 </w:t>
 </w:r>
-<w:bookmarkEnd w:id="0"/>
+<w:bookmarkEnd w:id="1"/>
 <w:r w:rsidR="00A70A33">
-<w:t>working conversion
+<w:t>broken conversion
 </w:t>
 </w:r>
 </w:p>

3 replies

jgm Jul 15, 2025
Maintainer

I thought adding the "keepNext" to the second one might fix the problem, but oddly that broke the first one too!

jgm Jul 15, 2025
Maintainer

Here is the intermediate structure produced by the Docx parser:

[ Heading
    1
    (ParaStyleName (CIString "heading 1"))
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Just ( ParaStyleName (CIString "heading 1") , 1 )
              , indent = Nothing
              , numInfo = Just ( "8" , "0" )
              , psParentStyle = Nothing
              , pStyleName = ParaStyleName (CIString "heading 1")
              , pStyleId = ParaStyleId "Heading1"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = True
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    "8"
    "0"
    (Just (Level "0" "decimal" "%1" (Just 1)))
    [ PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Example captions" ])
    ]
, Heading
    2
    (ParaStyleName (CIString "heading 2"))
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Just ( ParaStyleName (CIString "heading 2") , 2 )
              , indent = Nothing
              , numInfo = Just ( "8" , "1" )
              , psParentStyle = Nothing
              , pStyleName = ParaStyleName (CIString "heading 2")
              , pStyleId = ParaStyleId "Heading2"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = True
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    "8"
    "1"
    (Just (Level "1" "decimal" "%1.%2" (Just 1)))
    [ PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Working conversion" ])
    ]
, Heading
    4
    (ParaStyleName (CIString "heading 4"))
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Just ( ParaStyleName (CIString "heading 4") , 4 )
              , indent = Nothing
              , numInfo = Just ( "8" , "3" )
              , psParentStyle = Nothing
              , pStyleName = ParaStyleName (CIString "heading 4")
              , pStyleId = ParaStyleId "Heading4"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = True
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    "8"
    "3"
    (Just (Level "3" "decimal" "%1.%2.%3.%4" (Just 1)))
    [ PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun
               "The caption/image below gets converted into a Figure block correctly."
           ])
    ]
, Captioned
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Nothing
              , indent = Nothing
              , numInfo = Nothing
              , psParentStyle =
                  Just
                    ParStyle
                      { headingLev = Nothing
                      , indent = Nothing
                      , numInfo = Nothing
                      , psParentStyle = Nothing
                      , pStyleName = ParaStyleName (CIString "Normal")
                      , pStyleId = ParaStyleId "Normal"
                      }
              , pStyleName = ParaStyleName (CIString "caption")
              , pStyleId = ParaStyleId "Caption"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = False
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    [ BookMark "0" "_Toc170477850"
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Figure " ])
    , Field
        UnknownField
        [ PlainRun
            (Run
               RunStyle
                 { isBold = Nothing
                 , isBoldCTL = Nothing
                 , isItalic = Nothing
                 , isItalicCTL = Nothing
                 , isSmallCaps = Nothing
                 , isStrike = Nothing
                 , isRTL = Nothing
                 , isForceCTL = Nothing
                 , rHighlight = Nothing
                 , rVertAlign = Nothing
                 , rUnderline = Nothing
                 , rParentStyle = Nothing
                 }
               [ TextRun "1" ])
        ]
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ NoBreakHyphen ])
    , Field
        UnknownField
        [ PlainRun
            (Run
               RunStyle
                 { isBold = Nothing
                 , isBoldCTL = Nothing
                 , isItalic = Nothing
                 , isItalicCTL = Nothing
                 , isSmallCaps = Nothing
                 , isStrike = Nothing
                 , isRTL = Nothing
                 , isForceCTL = Nothing
                 , rHighlight = Nothing
                 , rVertAlign = Nothing
                 , rUnderline = Nothing
                 , rParentStyle = Nothing
                 }
               [ TextRun "1" ])
        ]
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun " " ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Example " ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "o" ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "f " ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "working conversion" ])
    ]
    (Paragraph
       ParagraphStyle
         { pStyle =
             [ ParStyle
                 { headingLev = Nothing
                 , indent = Nothing
                 , numInfo = Nothing
                 , psParentStyle =
                     Just
                       ParStyle
                         { headingLev = Nothing
                         , indent = Nothing
                         , numInfo = Nothing
                         , psParentStyle = Nothing
                         , pStyleName = ParaStyleName (CIString "Normal")
                         , pStyleId = ParaStyleId "Normal"
                         }
                 , pStyleName = ParaStyleName (CIString "caption")
                 , pStyleId = ParaStyleId "Caption"
                 }
             ]
         , indentation = Nothing
         , justification = Nothing
         , numbered = False
         , dropCap = False
         , pChange = Nothing
         , pBidi = Nothing
         , pKeepNext = True
         }
       [ Drawing
           "media/image1.jpeg"
           ""
           "Notebook"
           "BINARY CONTENTS OMITTED"
           (Just ( 3852850.0 , 2567940.0 ))
       ])
, Heading
    2
    (ParaStyleName (CIString "heading 2"))
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Just ( ParaStyleName (CIString "heading 2") , 2 )
              , indent = Nothing
              , numInfo = Just ( "8" , "1" )
              , psParentStyle = Nothing
              , pStyleName = ParaStyleName (CIString "heading 2")
              , pStyleId = ParaStyleId "Heading2"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = True
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    "8"
    "1"
    (Just (Level "1" "decimal" "%1.%2" (Just 1)))
    [ PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Broken conversion" ])
    ]
, Heading
    4
    (ParaStyleName (CIString "heading 4"))
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Just ( ParaStyleName (CIString "heading 4") , 4 )
              , indent = Nothing
              , numInfo = Just ( "8" , "3" )
              , psParentStyle = Nothing
              , pStyleName = ParaStyleName (CIString "heading 4")
              , pStyleId = ParaStyleId "Heading4"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = True
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    "8"
    "3"
    (Just (Level "3" "decimal" "%1.%2.%3.%4" (Just 1)))
    [ PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun
               "The caption below doesn\8217t get converted to a regular Figure block."
           ])
    ]
, Paragraph
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Nothing
              , indent = Nothing
              , numInfo = Nothing
              , psParentStyle =
                  Just
                    ParStyle
                      { headingLev = Nothing
                      , indent = Nothing
                      , numInfo = Nothing
                      , psParentStyle = Nothing
                      , pStyleName = ParaStyleName (CIString "Normal")
                      , pStyleId = ParaStyleId "Normal"
                      }
              , pStyleName = ParaStyleName (CIString "caption")
              , pStyleId = ParaStyleId "Caption"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = False
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = False
      }
    [ Drawing
        "media/image2.jpeg"
        ""
        "Notebook"
        "BINARY CONTENTS OMITTED"
        (Just ( 4563857.0 , 3041829.0 ))
    ]
, Paragraph
    ParagraphStyle
      { pStyle =
          [ ParStyle
              { headingLev = Nothing
              , indent = Nothing
              , numInfo = Nothing
              , psParentStyle =
                  Just
                    ParStyle
                      { headingLev = Nothing
                      , indent = Nothing
                      , numInfo = Nothing
                      , psParentStyle = Nothing
                      , pStyleName = ParaStyleName (CIString "Normal")
                      , pStyleId = ParaStyleId "Normal"
                      }
              , pStyleName = ParaStyleName (CIString "caption")
              , pStyleId = ParaStyleId "Caption"
              }
          ]
      , indentation = Nothing
      , justification = Nothing
      , numbered = False
      , dropCap = False
      , pChange = Nothing
      , pBidi = Nothing
      , pKeepNext = True
      }
    [ BookMark "1" "_Toc170477851"
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Figure " ])
    , Field
        UnknownField
        [ PlainRun
            (Run
               RunStyle
                 { isBold = Nothing
                 , isBoldCTL = Nothing
                 , isItalic = Nothing
                 , isItalicCTL = Nothing
                 , isSmallCaps = Nothing
                 , isStrike = Nothing
                 , isRTL = Nothing
                 , isForceCTL = Nothing
                 , rHighlight = Nothing
                 , rVertAlign = Nothing
                 , rUnderline = Nothing
                 , rParentStyle = Nothing
                 }
               [ TextRun "1" ])
        ]
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ NoBreakHyphen ])
    , Field
        UnknownField
        [ PlainRun
            (Run
               RunStyle
                 { isBold = Nothing
                 , isBoldCTL = Nothing
                 , isItalic = Nothing
                 , isItalicCTL = Nothing
                 , isSmallCaps = Nothing
                 , isStrike = Nothing
                 , isRTL = Nothing
                 , isForceCTL = Nothing
                 , rHighlight = Nothing
                 , rVertAlign = Nothing
                 , rUnderline = Nothing
                 , rParentStyle = Nothing
                 }
               [ TextRun "2" ])
        ]
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun " " ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "Example " ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "o" ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "f " ])
    , PlainRun
        (Run
           RunStyle
             { isBold = Nothing
             , isBoldCTL = Nothing
             , isItalic = Nothing
             , isItalicCTL = Nothing
             , isSmallCaps = Nothing
             , isStrike = Nothing
             , isRTL = Nothing
             , isForceCTL = Nothing
             , rHighlight = Nothing
             , rVertAlign = Nothing
             , rUnderline = Nothing
             , rParentStyle = Nothing
             }
           [ TextRun "broken conversion" ])
    ]
]

jgm Jul 15, 2025
Maintainer

You can see that there is a Captioned in the first one, but not in the second.
So we need to look in Text.Pandoc.Readers.Docx.Parse and probably specifically the function addCaptioned.
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx/Parse.hs#L770-L782
as well as isCaptionable:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx/Parse.hs#L783-L787

jgm · 2025-07-15T16:54:21Z

jgm
Jul 15, 2025
Maintainer

OK, I see the issue, I think. You have "keep next" set for the paragraph containing the caption, which follows the image. Pandoc takes this to indicate that the paragraph containing the caption goes with what follows it, rather than with the image, and so it doesn't produce a figure in this case.

1 reply

andrew-danieli-fj Jul 16, 2025
Author

Thanks. I tried adjusting the "keep with next" setting in Word, and with the minimal test document it cured the broken image conversion. However, the same process only cured 1 of the 3 broken images in the main document, even though all had "keep with next" removed from the caption para and left in place on the image itself. It looks like I need to keep using the filter for the time being, until I can work out what other setting might be affecting the conversion.
