Missing numbering after converting to plain text #63

@nicholas-gs

Hi @joshy,

I have been using your library to convert a batch of RTF documents to plain text before further processing that segments the content. The documents contain numbered headers, which we use to determine where one section ends and the next starts. However, I noticed that the numbering is sometimes missing from the output of the rtf_to_text function. For example, given this snippet:

{\\listtext\\pard\\plain\\rtlch\\af3\\afs20\\alang18441\\ab\\ltrch\\f3\\fs20\\lang18441\\langnp18441\\langfe18441\\langfenp18441\\b 10.\\tab}\\pard\\ltrpar\\s19\\itap0\\widctlpar\\qj\\fi-720\\li720\\ri43\\lin720\\rin43\\tx720\\tx1440\\tx2160\\tx2880\\tx3600\\tx4320\\tx5040\\tx5760\\tx6480\\tx7200\\tx7920\\tx8640\\tx9360\\tx10080\\ls23\\ilvl0\\plain\\rtlch\\af3\\afs20\\alang18441\\ab\\ltrch\\f3\\fs20\\lang18441\\langnp18441\\langfe18441\\langfenp18441\\b Trade and other receivables\\tab\\tab\\par\\trowd\\irow0\\irowband0\\trgaph108

it returns just "Trade and other receivables", but the expected output is "10. Trade and other receivables".

I'm not sure of the root cause, but I suspect it is the \listtext control word. If I simply remove \listtext from the RTF string, the numbering is retained in the converted plain text.
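For reference, the workaround I described can be sketched roughly as follows. This is only a stopgap on my side, not a proposed fix for the library: drop_listtext is a hypothetical helper name, the shortened snippet is a trimmed-down version of the one above, and I am assuming real RTF input with single backslashes.

```python
import re

# Matches the \listtext control word plus its optional single-space delimiter,
# but not a longer control word that merely starts with "listtext".
_LISTTEXT = re.compile(r"\\listtext(?![A-Za-z]) ?")


def drop_listtext(rtf: str) -> str:
    """Remove the \\listtext control word so its group is no longer marked as list text.

    Hypothetical helper: it only strips the control word itself and keeps
    the rest of the group (for example the "10." and the tab) untouched.
    """
    return _LISTTEXT.sub("", rtf)


# Trimmed-down version of the snippet from this issue (assumed single backslashes).
snippet = r"{\listtext\pard\plain\b 10.\tab}\pard\plain\b Trade and other receivables\par"
cleaned = drop_listtext(snippet)

# The cleaned string can then be converted as usual (requires striprtf):
# from striprtf.striprtf import rtf_to_text
# text = rtf_to_text(cleaned)
```

One caveat with this approach: in documents where the list number is also generated elsewhere, keeping the list text could produce duplicate numbering, so I would treat it as a workaround rather than the correct behaviour.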

Could you advise what the correct behaviour should be?
