Almost all non-utf8 encodings are incorrect

Even the most basic one, `windows-1252` (which `latin1` and `ascii` alias to), is decoded incorrectly:
```js
const express = require('express')
const bodyParser = require('body-parser')

const app = express()
app.use(bodyParser.urlencoded())
app.use(bodyParser.json())
app.use(bodyParser.text())

app.use(function (req, res) {
  res.setHeader('Content-Type', 'text/plain')
  res.write('you posted:\n')
  res.write(`${escape(req.body)}\n`)
  res.end(String(req.body))
})

app.listen(8080, async () => {
  const res = await fetch('http://localhost:8080/', {
    method: 'POST',
    headers: { 'content-type': 'text/plain; charset=windows-1252' },
    body: Uint8Array.of(0x80, 0x81, 0x82, 0x83, 0x8d, 0x9e, 0x9f)
  })

  console.log(await res.text())
})
```

Results in:
```
you posted:
%u20AC%uFFFD%u201A%u0192%uFFFD%u017E%u0178
€�‚ƒ�žŸ
```

But it should be `€\x81‚ƒ\x8DžŸ` instead, with no replacement characters,
i.e. `%u20AC%81%u201A%u0192%8D%u017E%u0178` when escaped.

See Encoding Standard: https://encoding.spec.whatwg.org/
All characters are mapped in https://encoding.spec.whatwg.org/index-windows-1252.txt, including `0x81` and `0x8D`.
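
For reference, here is a minimal sketch of the mapping that index describes for the bytes in the repro. `decodeWindows1252` and `WINDOWS_1252_HIGH` are hypothetical names used only for illustration; they are not part of body-parser. The table covers bytes `0x80`..`0x9F`; every other byte maps to the code point with the same value.

```js
// Code points for bytes 0x80..0x9F, as listed in index-windows-1252.txt.
// Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the C1 controls of the same value.
const WINDOWS_1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178
]

// Hypothetical helper for illustration only.
function decodeWindows1252 (bytes) {
  let out = ''
  for (const byte of bytes) {
    // 0x00..0x7F and 0xA0..0xFF map to the code point of the same value
    const codePoint = byte >= 0x80 && byte <= 0x9f
      ? WINDOWS_1252_HIGH[byte - 0x80]
      : byte
    out += String.fromCharCode(codePoint)
  }
  return out
}

const decoded1252 = decodeWindows1252(
  Uint8Array.of(0x80, 0x81, 0x82, 0x83, 0x8d, 0x9e, 0x9f)
)
console.log(escape(decoded1252)) // %u20AC%81%u201A%u0192%8D%u017E%u0178
```

Every byte has a mapping, so this decoder never needs to emit `U+FFFD`.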

The same goes for other encodings. About half of the single-byte encodings are mapped incorrectly and contradict the spec: the whole `windows-*` family except `windows-1256`, plus `koi8-u` and `macintosh`.

---

All supported legacy multi-byte encodings also behave incorrectly.

---

UTF-16 also behaves incorrectly:

```js
const express = require('express')
const bodyParser = require('body-parser')

const app = express()
app.use(bodyParser.urlencoded())
app.use(bodyParser.json())
app.use(bodyParser.text())

app.use(function (req, res) {
  res.setHeader('Content-Type', 'text/plain')
  res.write('you posted:\n')
  res.write(`Is well formed: ${req.body.isWellFormed()}\n`)
  res.write(`${escape(req.body)}\n`)
  res.end(String(req.body))
})

app.listen(8080, async () => {
  const res = await fetch('http://localhost:8080/', {
    method: 'POST',
    headers: { 'content-type': 'text/plain; charset=utf-16le' },
    body: Uint8Array.of(0, 0xd8, 0, 0xd8)
  })

  console.log(await res.text())
})
```

Results in:
```
you posted:
Is well formed: false
%uD800%uD800
��
```

But per spec the decoder should never produce non-well-formed strings; it should have produced replacement characters instead, i.e. `%uFFFD%uFFFD` when escaped.

See spec: https://encoding.spec.whatwg.org/#shared-utf-16-decoder

This could have a security impact.

These decoders are enabled in the default configuration. The default utf-8 decoder never produces non-well-formed strings, but a client can force them by specifying a utf-16 charset, which per spec shouldn't be possible (decoded strings should always be well-formed).
