normalize_spaces() is used in text validation. Currently foliapy (v2.5.11) and libfolia behave differently here with regard to control characters:
- foliapy (>= v2.5.11) strips all control characters.
- foliapy (< v2.5.11) left them as-is, which was wrong.
- libfolia treats a control character the same as a space character. I think this is not correct, because control characters don't imply whitespace (in fact, some of them are explicitly zero-width).
This issue arose from @martinreynaert's data, where we see for example:
Expected: Vierstellen-Prädikate bildende Operator „ “ mit dem Zweistellen-Prädikat
Found: Vierstellen-Prädikate bildende Operator „“ mit dem Zweistellen-Prädikat
******* DEVIATION POINT: Operator „<*HERE*>“ mit dem
The character in question is 0x7f (DELETE).
It also happens in an instance of Hebrew text (I transliterate the Hebrew here because browsers are too smart about RTL rendering and obscure the point): <0x202d>Tun-<0x202d>Idash, which libfolia turns into Tun- Idash (inserting an unwanted space). 0x202d is the left-to-right override, a directional formatting control.
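
To make the difference concrete, here is a minimal Python sketch of the two behaviours. This is not the actual foliapy or libfolia code: the function names are hypothetical, and it assumes that "control character" covers both Unicode category Cc (such as 0x7f) and Cf (such as 0x202d), excluding anything Python already counts as whitespace.

```python
import unicodedata


def is_control_like(ch):
    # Assumption: "control character" means Unicode category Cc (control)
    # or Cf (format), minus characters Python itself treats as whitespace
    # (tab, newline, and so on).
    return unicodedata.category(ch) in ("Cc", "Cf") and not ch.isspace()


def normalize_strip(text):
    # Hypothetical "strip" behaviour: drop control characters entirely,
    # then collapse runs of whitespace into a single space.
    cleaned = "".join(ch for ch in text if not is_control_like(ch))
    return " ".join(cleaned.split())


def normalize_as_space(text):
    # Hypothetical "treat as space" behaviour: replace every control
    # character with a space, then collapse runs of whitespace.
    cleaned = "".join(" " if is_control_like(ch) else ch for ch in text)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    for sample in ("Operator „\x7f“ mit dem", "\u202dTun-\u202dIdash"):
        print(repr(normalize_strip(sample)), repr(normalize_as_space(sample)))
```

On the two samples, the first function gives „“ and Tun-Idash, while the second gives „ “ and Tun- Idash, which is exactly the kind of mismatch shown in the validation output above.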