Discrepancy between foliapy and libfolia in stripping control characters in normalize_spaces() #55

@proycon

normalize_spaces() is used in text validation. Currently foliapy (v2.5.11) and libfolia handle control characters differently here:

  • foliapy (>= v2.5.11) strips all control characters.
  • foliapy (< v2.5.11) left them as-is, which was wrong.
  • libfolia treats a control character the same as a space character. I think this is not correct, because control characters don't imply whitespace (in fact, some are explicitly zero-width).
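
To make the three behaviours concrete, here is a minimal Python sketch. It is not the actual foliapy or libfolia code: the function names are hypothetical, and it assumes that "control character" means Unicode general category Cc or Cf, excluding anything Python already treats as whitespace.

```python
import unicodedata


def _is_control(ch):
    # Assumption: a "control character" is category Cc or Cf that is not
    # itself whitespace (so tab and newline still count as spaces).
    return not ch.isspace() and unicodedata.category(ch) in ("Cc", "Cf")


def strip_controls(text):
    """Remove control characters, then collapse whitespace runs."""
    kept = "".join(ch for ch in text if not _is_control(ch))
    return " ".join(kept.split())


def keep_controls(text):
    """Collapse whitespace runs only; control characters stay in place."""
    return " ".join(text.split())


def controls_as_space(text):
    """Turn each control character into a space, then collapse whitespace."""
    replaced = "".join(" " if _is_control(ch) else ch for ch in text)
    return " ".join(replaced.split())
```

For an input such as `"a\x7fb"` the three functions yield `"ab"`, `"a\x7fb"` and `"a b"` respectively, so the same text can never validate identically under all three.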

This issue arose from @martinreynaert 's data, where we see for example:

Expected: Vierstellen-Prädikate bildende Operator „ “ mit dem Zweistellen-Prädikat
Found: Vierstellen-Prädikate bildende Operator „“ mit dem Zweistellen-Prädikat     
******* DEVIATION POINT: Operator „<*HERE*>“ mit dem       

The character in question is 0x7f (DELETE).

It also happens in an instance of Hebrew text (I transliterate the Hebrew because browsers are too smart in RTL rendering and mess up the point): <0x202d>Tun-<0x202d>Idash, which libfolia turns into Tun- Idash (it inserts an unwanted space). 0x202d is the left-to-right override control character.
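
One detail that may matter for any fix: the two characters mentioned here fall in different Unicode general categories, so a filter keyed only on category Cc would catch DELETE but not the left-to-right override. A quick check with Python's standard unicodedata module (the helper name `describe` is just for illustration):

```python
import unicodedata


def describe(ch):
    # Return (code point, Unicode general category, is it whitespace?)
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())


DEL = describe("\x7f")    # DELETE
LRO = describe("\u202d")  # LEFT-TO-RIGHT OVERRIDE
```

Here `DEL` is `("U+007F", "Cc", False)` and `LRO` is `("U+202D", "Cf", False)`: neither is whitespace, and only the first is a control character in the strict Cc sense.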
