Trying to avoid a multi-byte character breaking. #55

Open
Oliverity wants to merge 2 commits into morungos:develop from Oliverity:develop

Oliverity commented Jan 12, 2024

Regarding issue #54. Unfortunately, it seems we can only learn the encoding from the XML declaration by reading the stream, yet we have to set an encoding before reading to keep multi-byte characters from being split across chunks. Here I assume UTF-8, but this has not been tested on *.docx files that contain UTF-16 XML.
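
To illustrate the kind of breakage described above, here is a minimal sketch; it is not the code from this PR. It uses Node's `StringDecoder`, which buffers incomplete byte sequences between chunks. According to the Node.js documentation, calling `setEncoding()` on a readable stream gives the same boundary-safe decoding. The function names `decodeNaively` and `decodeWithDecoder` and the sample text are made up for the example.

```javascript
const { StringDecoder } = require('string_decoder');

// Each of these characters takes three bytes in UTF-8.
const text = '中文';
const bytes = Buffer.from(text, 'utf8'); // 6 bytes in total

// Simulate a stream that happens to split inside the first character.
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Decoding every chunk on its own corrupts the split character.
function decodeNaively(parts) {
  return parts.map((part) => part.toString('utf8')).join('');
}

// A StringDecoder holds back incomplete byte sequences until the
// remaining bytes arrive with a later chunk.
function decodeWithDecoder(parts, encoding = 'utf8') {
  const decoder = new StringDecoder(encoding);
  let result = '';
  for (const part of parts) {
    result += decoder.write(part);
  }
  return result + decoder.end();
}

console.log(decodeNaively(chunks) === text);     // false
console.log(decodeWithDecoder(chunks) === text); // true
```

The catch mentioned in the comment still applies: the decoder has to be told an encoding up front, so `'utf8'` here is an assumption in the same way it is in the PR.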

Иванов Олег added 2 commits January 12, 2024 16:05
Tries to prevent multi-byte characters from breaking. Unfortunately, we need to call setEncoding() before actually reading the contents to avoid such breaking, which means we won't know the encoding yet. Right now UTF-8 is assumed, but OOXML files might be UTF-16 too. Haven't tested those yet.
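
One possible way to handle the UTF-16 case, sketched here purely as an assumption and not something this PR implements, is to inspect the first bytes for a byte order mark before choosing an encoding. The XML specification requires UTF-16 entities to begin with a BOM. The helper name `sniffEncodingFromBom` and the returned labels are hypothetical. Note that Node decodes `'utf16le'` natively but has no built-in `'utf16be'` decoder.

```javascript
// Hypothetical helper, not part of word-extractor.
// Returns an encoding label based on a leading byte order mark (BOM).
function sniffEncodingFromBom(head) {
  if (head.length >= 3 && head[0] === 0xef && head[1] === 0xbb && head[2] === 0xbf) {
    return 'utf8';
  }
  if (head.length >= 2 && head[0] === 0xff && head[1] === 0xfe) {
    return 'utf16le';
  }
  if (head.length >= 2 && head[0] === 0xfe && head[1] === 0xff) {
    // Node has no native decoder for this; bytes would need swapping first.
    return 'utf16be';
  }
  // No BOM found: fall back to the same assumption the PR makes.
  return 'utf8';
}

const xmlDeclaration = '<?xml version="1.0"?>';
const utf16Sample = Buffer.concat([
  Buffer.from([0xff, 0xfe]),
  Buffer.from(xmlDeclaration, 'utf16le'),
]);
const utf8Sample = Buffer.from(xmlDeclaration, 'utf8');

console.log(sniffEncodingFromBom(utf16Sample)); // utf16le
console.log(sniffEncodingFromBom(utf8Sample));  // utf8
```

Using this would mean peeking at the first chunk as raw bytes before any decoding starts, which is exactly the ordering problem the commit message describes.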

fritx commented Mar 26, 2025

Works for me. Thanks.

pnpm add 'github:Oliverity/node-word-extractor#develop'
// package.json
{
  "dependencies": {
    // ...
    "word-extractor": "github:Oliverity/node-word-extractor#develop"
  }
}


Oliverity commented Dec 16, 2025

> Works for me. Thanks.

Glad to hear that! That's especially interesting since you probably deal with Chinese documents a lot. Could you please tell me whether UTF-16 encoding is in widespread use for those?
