Chinese text returned by GetToc or GetTextWords contains unknown characters #170

@DHclly

Test file:
20250309第三章.pdf

MuPDF.NET version:

<PackageReference Include="MuPDF.NET" Version="3.2.5" />

I use this code:

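public static void T1()
{
    // Open the test PDF.
    Document doc = new Document(@"D:\learn\python-pdfplumber-learn\pdf-docs\20250309第三章.pdf");

    // Set the document language and add a FontInfo entry for 微软雅黑 (Microsoft YaHei).
    doc.SetLanguage("zh-CN");
    doc.FontInfos.Add(new FontInfo()
    {
        Name = "微软雅黑",
    });

    // Print the title of the first table-of-contents entry.
    var toc = doc.GetToc();
    var t0 = toc[0];
    var title = t0.Title;
    Console.WriteLine(title);

    // Print every word extracted from the first page.
    var p1 = doc[0];
    var list = p1.GetTextWords(sort: true);
    foreach (var wb in list)
    {
        Console.WriteLine(wb.Text);
    }
}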

Result:

Image

The same file opened in WPS or Google Chrome:

Image
