Multibyte UTF-8 characters break the line editor #18

@jeremy-pereira

Description

I am writing a REPL for the Lambda Calculus and I incorporated linenoise-swift to provide line editing functionality. Unfortunately, the Greek letter lambda (λ) is encoded in UTF-8 as two bytes: CE BB. linenoise-swift handles input one byte at a time and so tries to split the λ. The same problem occurs for any Unicode code point that takes more than one byte to encode in UTF-8, i.e. everything except 7-bit US ASCII.
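
To make the byte-level problem concrete, here is a minimal sketch of how a byte-at-a-time reader could work out how many bytes belong together before decoding them. The helper names `utf8SequenceLength` and `decodeSequences` are hypothetical and are not part of linenoise-swift.

```swift
/// Number of bytes in the UTF-8 sequence that starts with `lead`,
/// or nil if `lead` is a continuation byte or not a valid lead byte.
func utf8SequenceLength(leadByte lead: UInt8) -> Int? {
    switch lead {
    case 0x00...0x7F: return 1   // 0xxxxxxx: 7-bit US ASCII
    case 0xC2...0xDF: return 2   // 110xxxxx (0xC0 and 0xC1 would be overlong)
    case 0xE0...0xEF: return 3   // 1110xxxx
    case 0xF0...0xF4: return 4   // 11110xxx, up to U+10FFFF
    default: return nil          // continuation byte (10xxxxxx) or invalid
    }
}

/// Groups a byte stream into complete UTF-8 sequences and decodes each one.
/// Invalid lead bytes and truncated sequences are skipped one byte at a time.
func decodeSequences(_ bytes: [UInt8]) -> [String] {
    var result: [String] = []
    var i = 0
    while i < bytes.count {
        guard let length = utf8SequenceLength(leadByte: bytes[i]),
              i + length <= bytes.count else {
            i += 1
            continue
        }
        result.append(String(decoding: bytes[i..<(i + length)], as: UTF8.self))
        i += length
    }
    return result
}

let lambdaBytes = Array("λ".utf8)   // the two bytes 0xCE, 0xBB
```

Note that this groups bytes per code point, not per user-perceived character, so a base letter followed by a combining mark still comes out as two pieces.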

How to Reproduce

Run the linenoiseDemo command line app. Type in a few characters and then a λ. The cursor will be repositioned at the start of the line and garbage appended to the end of the line. Here is an example:

Type 'exit' to quit
 gdggfdsgdsλ
utput: gdggfdsgdsλ
? 

If you are having trouble producing a λ from your keyboard, the problem still manifests if you copy-paste it from the text of this issue.

Further Information

I made an attempt to fix the issue myself. You can see my attempt here. The patch is a lot bigger than you might expect because adding support for multibyte UTF-8 exposes another, more subtle bug.

Consider the following code in class EditLine:

    func insertCharacter(_ char: Character) {
        let origLoc = location
        let origEnd = buffer.endIndex
        buffer.insert(char, at: location)
        location = buffer.index(after: location)
        
        if origLoc == origEnd {
            location = buffer.endIndex
        }
    }

The Apple documentation for insert(_:at:) says:

Calling this method invalidates any existing indices for use with this string

This means that location, origLoc and origEnd are all invalid after the insert. If the inserted character is a single byte, we get away with it. If not, location ends up as a garbage value and causes a process abort when it is next used. As the easy way out, I ended up changing the type of buffer to [Character] and the type of location to Int.
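
The [Character] and Int change described above might look roughly like the following. This is only a sketch: `CharacterBuffer` and `moveLeft` are hypothetical names for illustration, not the actual EditLine class or the contents of the patch.

```swift
/// Hypothetical stand-in for the editing state.
struct CharacterBuffer {
    private(set) var characters: [Character] = []
    /// Cursor position as an offset into `characters`, in 0...characters.count.
    private(set) var location = 0

    /// The buffer contents as a String, e.g. for redrawing the line.
    var text: String { String(characters) }

    /// Inserts at the cursor and advances it. The cursor is a plain Int
    /// offset, so there is no String.Index to be invalidated by the insert.
    mutating func insertCharacter(_ char: Character) {
        characters.insert(char, at: location)
        location += 1
    }

    /// Moves the cursor one character left; returns false if already at the start.
    mutating func moveLeft() -> Bool {
        guard location > 0 else { return false }
        location -= 1
        return true
    }
}

var line = CharacterBuffer()
for c in "ab" { line.insertCharacter(c) }
_ = line.moveLeft()
line.insertCharacter("λ")   // inserted between "a" and "b"
```

The trade-off is that every redraw has to rebuild a String from the array, which should be negligible for a single line of interactive input.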

NB: I can give you a pull request or a patch if it helps, but it hasn't been extensively tested and probably still breaks with composed characters, e.g. emoji.
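
To illustrate why composed characters are a separate concern, the sketch below shows that one user-perceived character can span several code points and many bytes. The variable names are illustrative only.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT: renders as a single "é".
let accented = "e\u{301}"
let accentedCharacters = accented.count                // 1 Character
let accentedScalars = accented.unicodeScalars.count    // 2 code points
let accentedBytes = accented.utf8.count                // 3 bytes: 1 + 2

// A family emoji built from three emoji joined by U+200D ZERO WIDTH JOINER.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"
let familyScalars = family.unicodeScalars.count        // 5 code points
let familyBytes = family.utf8.count                    // 18 bytes: 4 + 3 + 4 + 3 + 4
```

Decoding one code point at a time would therefore still split these sequences, and how many terminal columns each one occupies is a further question that this sketch does not address.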
