Wrong (or at least non-standard) handling of umlaut/dieresis characters in editor (üöä)

//System info at the bottom

**Issue:**
German umlauts ÄÖÜ/äöü (and, I suspect, other accented or combined letters in other languages) that are entered as literals in the editor are not stored or passed to objects in composed form (i.e. "NFC" form, where each of these letters occupies exactly one position in a string).
This creates a `str` object containing a base letter followed by a separate combining character, a chain of codes that would be roughly right for a bytearray, but not for a string.
The editor's behaviour also breaks: it takes two right-arrow or left-arrow presses to traverse a single umlaut letter, and deleting these characters sometimes behaves inconsistently.

**Background of the problem**
Some characters have more than one possible Unicode representation. The German umlauts (äöü) are an example: each can be stored either as a pair of separate code points, a base letter (e.g. "a") followed by a combining dieresis ("¨"), or as a single precomposed code point (e.g. "ä").
The most useful way is to store them in the composed form (i.e. "NFC" form), because this means that
a) in the editor, backspacing over an ä deletes the whole letter, which is how any other editor/GUI treats it; it is just a single letter
b) more importantly, strings typed into the editor as literals are processed consistently with other environments.
However, unlike all other implementations that I have tested, Pythonista for some reason breaks each German umlaut apart into two separate characters (the decomposed form), claiming that the four-letter word "März" is, indeed, five characters long.
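To make the two representations concrete, here is a minimal sketch using the standard `unicodedata` module. The strings are written with explicit escape sequences so the result does not depend on how any editor stores the literal; the variable names are only illustrative.

```python
import unicodedata

# Composed form: 'ä' is the single code point U+00E4
composed = "M\u00e4rz"
# Decomposed form: 'a' followed by U+0308 (combining diaeresis)
decomposed = "Ma\u0308rz"

print(len(composed), len(decomposed))   # 4 5
print(composed == decomposed)           # False: different code point sequences

# Normalizing converts between the two forms
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Both strings render identically, which is why the difference only shows up in `len()`, comparisons, and per-character processing.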

**Show and tell:**

Try the following code in Pythonista:
```
s = "März"

print(f"The Length of string '{s}' is {len(s)}")
for c in s:
	print(f"Character '{c}' is alphanumeric?: {c.isalpha()}")
```

In Pythonista this results in:

```
The Length of string 'März' is 5
Character 'M' is alphanumeric?: True
Character 'a' is alphanumeric?: True
Character '̈' is alphanumeric?: False
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True
```
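To confirm which code points a string actually contains, a small diagnostic like the following may help. It is only a sketch: `describe` is a hypothetical helper name, and the decomposed string is built from escapes here to mimic what the output above suggests Pythonista stores.

```python
import unicodedata

def describe(s):
    """Return (character, code point, Unicode name) for each position in s."""
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in s]

# Assumed equivalent of the literal as stored by the editor: 'a' + U+0308
decomposed = "Ma\u0308rz"

for char, code, name in describe(decomposed):
    print(code, name)
```

Running the same helper on the literal typed into the editor should show whether the third position is a separate combining mark.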

In other environments, e.g. Colab (see https://colab.research.google.com/drive/1NPChlenbDdGk2atTRIiu89LmPeiKY-Qz?usp=sharing), the result is as expected:

```
The Length of string 'März' is 4
Character 'M' is alphanumeric?: True
Character 'ä' is alphanumeric?: True
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True
```

The code can be copy-pasted between Colab and Pythonista. Pythonista will break the single Unicode letter apart, whereas Colab (or Python on a Windows PC, in Jupyter Notebooks, on a Linux machine, or on my Android phone…) treats it as it should be treated: as a single letter, so that the example word "März" is four characters long.
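Until the editor is fixed, a possible workaround is to normalize string literals at runtime. This is a sketch, not a fix for the editor itself: `to_nfc` is a hypothetical helper, and `unicodedata.is_normalized` requires Python 3.8 or later. It does not address the cursor and deletion behaviour described above.

```python
import unicodedata

def to_nfc(s):
    """Return s in composed (NFC) form, leaving already-normalized strings untouched."""
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

# Decomposed input, written with escapes to mimic the reported editor behaviour
s = to_nfc("Ma\u0308rz")

print(len(s))                      # 4
print([c.isalpha() for c in s])    # [True, True, True, True]
```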


Pythonista 3.4 (340012)
--- SYSTEM INFORMATION ---
**System Information**

* Pythonista N/A (N/A), Default interpreter 3.10.4
* iOS 18.6.2, model iPad14,10, resolution (portrait) 2048.0 x 2732.0 @ 2.0
--- SYSTEM INFORMATION ---

Wrong (or at least non-standard) handling of umlaut/dieresis characters in editor (üöä) #794
