
Wrong (or at least non-standard) handling of umlaut/dieresis characters in editor (üöä) #794

@biasedlogic


//System info at the bottom

Issue:
German umlauts ÄÖÜ/äöü (and, I suspect, accented or combined letters from other languages) are not stored or passed to objects in the expected encoding when they are entered as literals in the editor. The expected encoding is NFC form, in which each of these letters occupies exactly one position in a string.
Instead, the resulting 'str' object holds the base letter and the combining diaeresis as separate elements. Such a sequence of codes would be roughly right for a bytearray, but not for a string.
The editor's behaviour also breaks: it takes two right-arrow or left-arrow keystrokes to traverse a single umlaut letter, and deleting these characters sometimes behaves inconsistently.

Background of the problem
Some characters have more than one possible Unicode representation. The German umlauts (äöü) are an example: each can be stored either as a pair of code points, a base letter (e.g. "a") followed by a combining diaeresis (U+0308, rendered as "¨"), or as a single precomposed code point (e.g. "ä").
The most useful way is to store them in the composed form (i.e. NFC form), because this means that
a) in the editor, backspacing over an ä deletes the whole letter, which is how virtually every editor or GUI treats it: it is just a single letter
b) more importantly, strings typed into the editor as literals are processed consistently with other environments.
However, unlike every other implementation I have tested, Pythonista for some reason breaks each German umlaut apart into two separate characters, and so claims that the four-letter word "März" is five characters long.
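To make the two representations concrete, here is a minimal sketch using only the standard-library `unicodedata` module. Both strings are written with explicit escapes so that no editor can change their form; the variable names `composed` and `decomposed` are just illustrative.

```python
import unicodedata

# Written with escapes so the code points are exactly what we intend.
composed = "M\u00e4rz"     # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "Ma\u0308rz"  # 'a' followed by U+0308 COMBINING DIAERESIS

# Both render as the same word, but they differ as Python strings.
print(len(composed), len(decomposed))  # 4 5
print(composed == decomposed)          # False

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

The five-character result reported for Pythonista matches what the decomposed (NFD-style) string gives here.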

Show and tell:

Try the following code in Pythonista:

s = "März"

print(f"The Length of string '{s}' is {len(s)}")
for c in s:
	print(f"Character '{c}' is alphanumeric?: {c.isalpha()}")

In Pythonista this results in:

The Length of string 'März' is 5
Character 'M' is alphanumeric?: True
Character 'a' is alphanumeric?: True
Character '̈' is alphanumeric?: False
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True
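To confirm which code points a string really contains, something like the following sketch can help. It assumes the string is decomposed as the output above suggests, and simulates that with an escape. `describe` is a hypothetical helper, not part of Pythonista or the standard library; `unicodedata.is_normalized` needs Python 3.8 or later.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Hypothetical helper: (code point, Unicode name) for each element of text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Stand-in for the literal as Pythonista appears to store it (assumption).
sample = "Ma\u0308rz"

for code, name in describe(sample):
    print(code, name)

# False here, because 'a' + U+0308 has a precomposed NFC equivalent.
print(unicodedata.is_normalized("NFC", sample))
```

For this sample the third entry is `U+0308 COMBINING DIAERESIS`, which is the element that `isalpha()` rejects.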

In other environments, e.g. Colab (see https://colab.research.google.com/drive/1NPChlenbDdGk2atTRIiu89LmPeiKY-Qz?usp=sharing), the result is as expected:

The Length of string 'März' is 4
Character 'M' is alphanumeric?: True
Character 'ä' is alphanumeric?: True
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True

The code can be copy-pasted between Colab and Pythonista. Pythonista breaks the single Unicode letter apart, whereas Colab (or Python on a Windows PC, in Jupyter notebooks, on a Linux machine, or on my Android phone…) treats it as it should be treated: as a single letter, so that the example word "März" is four characters long.
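Until the editor is fixed, a possible workaround is to normalize strings at runtime. This is only a sketch: `to_nfc` is a hypothetical helper name, and the decomposed input is written with an escape to stand in for a literal typed in the affected editor. It does not address the cursor and deletion behaviour in the editor itself.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Hypothetical helper: return text in composed (NFC) form."""
    return unicodedata.normalize("NFC", text)

# Decomposed input, standing in for a literal typed in the affected editor.
s = to_nfc("Ma\u0308rz")

print(f"The Length of string '{s}' is {len(s)}")
for c in s:
    print(f"Character '{c}' is alphanumeric?: {c.isalpha()}")
```

After normalization the string has four elements and every one of them passes `isalpha()`, matching the output from the other environments.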

Pythonista 3.4 (340012)
--- SYSTEM INFORMATION ---
System Information

  • Pythonista N/A (N/A), Default interpreter 3.10.4
  • iOS 18.6.2, model iPad14,10, resolution (portrait) 2048.0 x 2732.0 @ 2.0
--- SYSTEM INFORMATION ---
