
Micro-blog instructions should explain graphemes and possibly test extended graphemes #2483

@ageron

The instructions of the micro-blog exercise say:

> The trick to this exercise is to use APIs designed around Unicode characters (codepoints) instead of Unicode codeunits.

I understand that we want to keep things simple, but I think this is misleading. For example, in the Roc track the instructions led some people to split the string into codepoints, even though there is a very simple function that splits the string into graphemes instead. The tests pass in both cases because they only include graphemes composed of a single codepoint. Codepoint-based solutions would fail if the tests included flags, characters with multiple diacritics, complex emojis, or any other grapheme composed of multiple codepoints (i.e., extended grapheme clusters).
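To make the failure mode concrete, here is a minimal Python sketch (not Roc, and not taken from any track's solution). It assumes the exercise truncates to 5 characters; `truncate_codepoints` is a hypothetical helper name. Python string slicing works per codepoint, so it behaves like the approach the current instructions encourage.

```python
def truncate_codepoints(text: str, limit: int = 5) -> str:
    """Keep the first `limit` codepoints (assumed limit; str slicing is per codepoint)."""
    return text[:limit]


# Graphemes made of a single codepoint: truncation looks correct.
simple_cut = truncate_codepoints("Hello, world")

# "e" followed by U+0301 COMBINING ACUTE ACCENT is one grapheme but two codepoints.
accented = "e\u0301" * 5  # five visible characters, ten codepoints
accent_cut = truncate_codepoints(accented)

# A flag is a pair of regional indicator codepoints (here U+1F1EB U+1F1F7).
flag = "\U0001F1EB\U0001F1F7"
flag_cut = truncate_codepoints(flag * 3)

print(repr(simple_cut))  # 5 graphemes, as intended
print(repr(accent_cut))  # 2 accented letters plus a bare "e": the last accent is cut off
print(repr(flag_cut))    # 2 whole flags plus a dangling half flag
```

A grapheme-aware version needs a segmentation API. Python's standard library does not provide one, but the third-party `regex` module, for example, matches extended grapheme clusters with `\X`.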

In short: we shouldn't encourage people to work with codepoints when they can just as easily work with graphemes.

I suggest at least updating the instructions to cover graphemes, and also adding some tests with extended grapheme clusters. If we're going to handle Unicode, we should try to handle all possible characters. Handling graphemes might be harder in some languages, but in that case those tracks can just disable the extended grapheme tests.
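As a rough idea of what such test inputs could look like, the sketch below lists a few candidates and counts their codepoints in Python. The specific strings are my own picks for illustration, not existing test data. Each one renders as a single grapheme but spans several codepoints, so a codepoint-based solution could cut it in the middle.

```python
# Hypothetical candidate inputs; each renders as ONE grapheme.
candidates = {
    "flag (regional indicator pair)": "\U0001F1EB\U0001F1F7",
    "letter with two diacritics": "a\u0301\u0323",
    "family emoji (ZWJ sequence)": (
        "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
    ),
}

# len() on a Python str counts codepoints, not graphemes.
codepoint_counts = {name: len(text) for name, text in candidates.items()}

for name, count in codepoint_counts.items():
    print(f"{name}: {count} codepoints")
```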

Edit: I'm happy to submit a PR if there's agreement on this issue.
