Almost all non-utf8 encodings are incorrect #680

@ChALkeR

Even the most basic encoding, windows-1252 (which latin1 and ascii are aliases of), is decoded incorrectly:

const express = require('express')
const bodyParser = require('body-parser')

const app = express()
app.use(bodyParser.urlencoded())
app.use(bodyParser.json())
app.use(bodyParser.text())

app.use(function (req, res) {
  res.setHeader('Content-Type', 'text/plain')
  res.write('you posted:\n')
  res.write(`${escape(req.body)}\n`)
  res.end(String(req.body))
})

app.listen(8080, async () => {
  const res = await fetch('http://localhost:8080/', {
    method: 'POST',
    headers: { 'content-type': 'text/plain; charset=windows-1252' },
    body: Uint8Array.of(0x80, 0x81, 0x82, 0x83, 0x8d, 0x9e, 0x9f)
  })

  console.log(await res.text())
})

Results in:

you posted:
%u20AC%uFFFD%u201A%u0192%uFFFD%u017E%u0178
€�‚ƒ�žŸ

But it should be €\x81‚ƒ\x8DžŸ instead, with no replacement characters,
i.e. %u20AC%81%u201A%u0192%8D%u017E%u0178 when escaped.

See Encoding Standard: https://encoding.spec.whatwg.org/
All characters are mapped in https://encoding.spec.whatwg.org/index-windows-1252.txt, including 0x81 and 0x8D.
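
For reference, the expected mapping can be sketched without any library. This is a minimal illustration, not body-parser's actual decoder: the helper name `decodeWindows1252` is hypothetical, and the table for bytes 0x80 to 0x9F is transcribed by hand from the index linked above, so treat the index file as authoritative.

```javascript
// Code points for bytes 0x80..0x9F, transcribed from index-windows-1252.txt.
// Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the C1 control of the same value.
const C1_RANGE = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178
]

// Hypothetical helper: decode windows-1252 bytes into a string.
function decodeWindows1252 (bytes) {
  let out = ''
  for (const byte of bytes) {
    // Only 0x80..0x9F differ from the byte value; every other byte maps to itself.
    const codePoint = byte >= 0x80 && byte <= 0x9f ? C1_RANGE[byte - 0x80] : byte
    out += String.fromCharCode(codePoint)
  }
  return out
}

const decoded = decodeWindows1252(Uint8Array.of(0x80, 0x81, 0x82, 0x83, 0x8d, 0x9e, 0x9f))
console.log(escape(decoded)) // %u20AC%81%u201A%u0192%8D%u017E%u0178
console.log(decoded.includes('\uFFFD')) // false
```

Every byte has a mapping, so a conforming windows-1252 decoder never needs to emit U+FFFD.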

The same goes for other encodings: about half of the single-byte encodings are mapped incorrectly and contradict the spec, namely all of the windows-* family except windows-1256, koi8-u, macintosh.


All supported legacy multi-byte encodings also behave incorrectly.


UTF-16 also behaves incorrectly:

const express = require('express')
const bodyParser = require('body-parser')

const app = express()
app.use(bodyParser.urlencoded())
app.use(bodyParser.json())
app.use(bodyParser.text())

app.use(function (req, res) {
  res.setHeader('Content-Type', 'text/plain')
  res.write('you posted:\n')
  res.write(`Is well formed: ${req.body.isWellFormed()}\n`)
  res.write(`${escape(req.body)}\n`)
  res.end(String(req.body))
})

app.listen(8080, async () => {
  const res = await fetch('http://localhost:8080/', {
    method: 'POST',
    headers: { 'content-type': 'text/plain; charset=utf-16le' },
    body: Uint8Array.of(0, 0xd8, 0, 0xd8)
  })

  console.log(await res.text())
})

Results in:

you posted:
Is well formed: false
%uD800%uD800
��

But per the spec, the decoder should never produce non-well-formed strings. It should have produced replacement characters instead, i.e. %uFFFD%uFFFD when escaped.

See spec: https://encoding.spec.whatwg.org/#shared-utf-16-decoder

This could have a security impact:

These decoders are enabled in the default configuration.
The default utf-8 decoder never produces non-well-formed strings, but a client can force such output by specifying the utf-16 encoding. Per the spec that should not be possible: decoded strings should always be well-formed.
