GetPlainText does not support encoding "UniGB-UCS2-H" #55

@kvii

For security reasons, I can't publish the PDF file I was reading, but I can show the key parts.

Font object (F2):

10 0 obj
<< /BaseFont /STSong-Light-UniGB-UCS2-H /DescendantFonts [ 15 0 R ] /Encoding /UniGB-UCS2-H /Subtype /Type0 /Type /Font >>
endobj

Text command:

BT
1 0 0 1 224 694.89 Tm
/F2 14 Tf
0 0 0 rg
(l_�ϔ��L�\(N�f�bck>V�SU�\))Tj
0 g
ET

Bytes of the Tj string operand (hex):

6C 5F 82 CF 94 F6 88 4C 00 5C 28 4E A4 66 13 62 63 6B 3E 56 DE 53 55 00 5C 29

Code:

bs, _ := os.ReadFile("a.pdf")
br := bytes.NewReader(bs)

// extract text
r, _ := pdf.NewReader(br, br.Size())
b, _ := r.GetPlainText()
data, _ := io.ReadAll(b)

fmt.Println(string(data))

Want:

江苏银行(交易扣款回单)...

Got:

l_��n/���9�6]�z�e/�L5033...
