Let's consider the emoji 4️⃣: it consists of 3 separate code points (U+0034, U+FE0F, U+20E3). How should we treat it when building an n-gram index, for example?
Example from @snaury: country flags like 🇦🇨 (two code points, U+1F1E6 U+1F1E8)
One more interesting instance: 👨‍👩‍👧‍👦 (7 Unicode code points, 25 bytes in UTF-8)
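
To make the problem concrete, here is a rough Python sketch that counts code points and UTF-8 bytes for the three emoji above and shows what a naive code-point-based n-gram split does to them. The helper names `describe` and `codepoint_ngrams` are made up for illustration and are not taken from any existing index implementation; the emoji are written as escapes so the snippet does not depend on the file encoding.

```python
# Each emoji is one user-visible symbol but several code points.
KEYCAP_4 = "\u0034\uFE0F\u20E3"   # digit 4 + variation selector-16 + enclosing keycap
FLAG = "\U0001F1E6\U0001F1E8"     # two regional indicator symbols (A, C)
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 4 people joined by ZWJ


def describe(text):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


def codepoint_ngrams(text, n):
    """Hypothetical naive tokenizer: slide a window of n code points."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    for name, s in [("keycap", KEYCAP_4), ("flag", FLAG), ("family", FAMILY)]:
        points = " ".join(f"U+{ord(c):04X}" for c in s)
        print(name, describe(s), points)

    # Bigrams over code points cut the keycap into two fragments,
    # neither of which is the symbol the user typed.
    print([g.encode("unicode_escape") for g in codepoint_ngrams(KEYCAP_4, 2)])
```

With code-point windows, a trigram index would store the keycap as a single gram, but the family emoji would be spread over five overlapping grams that each contain a dangling joiner or half of the sequence. Whether that is acceptable, or whether grams should be built over grapheme clusters instead, is exactly the open question here.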