Trying to avoid a multi-byte character breaking. by Oliverity · Pull Request #55 · morungos/node-word-extractor

Oliverity · 2024-01-12T13:15:33Z

Regarding issue #54. Unfortunately, it seems we can only get the encoding from the XML declaration once we have read the stream, so we had better assume an encoding before that to avoid the breakage. Here I assume UTF-8, but this has not been tested on *.docx files that contain UTF-16 XML parts.
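One possible way to narrow the guess, sketched here under assumptions: the XML specification requires UTF-16 entities to start with a byte order mark, so the first bytes of a part can hint at its encoding. `guessXmlEncoding` is a hypothetical helper, not part of word-extractor or of this pull request, and it has not been tried against real *.docx files.

```javascript
// Hypothetical helper (an assumption, not part of word-extractor):
// guess the encoding of an XML part from its first few bytes.
function guessXmlEncoding(head) {
  // UTF-8 BOM: EF BB BF
  if (head.length >= 3 && head[0] === 0xef && head[1] === 0xbb && head[2] === 0xbf) {
    return 'utf8';
  }
  // UTF-16 little-endian BOM: FF FE
  if (head.length >= 2 && head[0] === 0xff && head[1] === 0xfe) {
    return 'utf16le';
  }
  // UTF-16 big-endian BOM: FE FF
  if (head.length >= 2 && head[0] === 0xfe && head[1] === 0xff) {
    return 'utf16be';
  }
  // No BOM: fall back to the same assumption this PR makes.
  return 'utf8';
}

const sample = Buffer.from([0xff, 0xfe, 0x3c, 0x00, 0x3f, 0x00]); // BOM + "<?" in UTF-16LE
const guessed = guessXmlEncoding(sample);
console.log(guessed);
```

Peeking at the bytes means the first chunk has to be read as a Buffer, so this would only fit if decoding were done manually instead of through setEncoding(). Node's built-in Buffer encodings include `utf16le` but no big-endian variant, so a `utf16be` result would need separate handling.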

Tries to prevent multi-byte characters from breaking. Unfortunately, we need to call setEncoding() before actually reading the contents to avoid such breakage, which means we won't know the encoding yet. Right now UTF-8 is assumed, but OOXML files might be UTF-16 too. Haven't tested those yet.
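To illustrate the kind of breakage described above, here is a minimal, self-contained sketch. It is not the code from this pull request: it uses Node's `StringDecoder` directly, the boundary-aware kind of decoding that `setEncoding()` is documented to provide for readable streams, and the two hand-made chunks stand in for the 4096-byte chunks mentioned in issue #54.

```javascript
const { StringDecoder } = require('node:string_decoder');

// "é" takes two bytes in UTF-8 (0xC3 0xA9). Split the buffer between them,
// as can happen when a chunk boundary falls inside a character.
const whole = Buffer.from('café', 'utf8');
const chunks = [whole.subarray(0, 4), whole.subarray(4)];

// Naive approach: decode every chunk on its own.
// Each half of the split character becomes U+FFFD.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('');

// Boundary-aware approach: the decoder keeps the incomplete trailing bytes
// of one chunk and joins them with the start of the next one.
const decoder = new StringDecoder('utf8');
const safe = chunks.map((chunk) => decoder.write(chunk)).join('') + decoder.end();

console.log(JSON.stringify(naive));
console.log(JSON.stringify(safe));
```

The decoder has to be created with an encoding before the first chunk arrives, which is the same constraint the description mentions: the encoding must be chosen before any content has been seen.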

fritx · 2025-03-26T08:52:30Z

Works for me. Thanks.

pnpm add 'github:Oliverity/node-word-extractor#develop'

// package.json
{
  "dependencies": {
    // ...
    "word-extractor": "github:Oliverity/node-word-extractor#develop"
  }
}

Oliverity · 2025-12-16T14:53:30Z

> Works for me. Thanks.

Glad to hear that! That's especially interesting since you probably deal a lot with Chinese documents. Could you please tell me whether UTF-16 encoding is in widespread use for those?

Иванов Олег added 2 commits January 12, 2024 16:05

OOXML stands for "Office Open XML", not "Open Office XML".

66d711f

Oliverity mentioned this pull request Jan 12, 2024

Broken multi-byte letters at the borders of 4096-byte chunks #54

Open

Oliverity wants to merge 2 commits into morungos:develop from Oliverity:develop
