Unexpected behaviour when slicing unicode strings

If a string contains emoji or other extended Unicode characters, I would expect to be able to slice it in the same way as a string of ordinary characters. But the following does not slice out the first emoji character:

   "🐱1234" |> String.slice 0 1 

String.length reports the length as 6 not 5, so I have to slice two characters of the string to extract the first emoji character.

Yet if I convert the string to a list of characters via String.toList I get the expected 5 element list allowing me to treat normal and extended characters in the same way.
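As a workaround, the String.toList behaviour described above can be wrapped in a small helper. This is only a sketch: sliceChars is a hypothetical name, not part of elm/core, and unlike String.slice it makes no attempt to handle negative indices. It also walks the whole string, so it is O(n).

```elm
module UnicodeSlice exposing (sliceChars)

{-| Hypothetical helper (not in elm/core): slice by Char instead of by
UTF-16 code unit. Assumes 0 <= start <= end; negative indices are not
supported the way they are in String.slice.
-}
sliceChars : Int -> Int -> String -> String
sliceChars start end str =
    str
        -- String.toList yields one Char per character, emoji included
        |> String.toList
        |> List.drop start
        |> List.take (end - start)
        |> String.fromList
```

With this, "🐱1234" |> sliceChars 0 1 should give back just the emoji, because the indices now count list elements rather than 16-bit units.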

See this ellie for an example.

Is this expected behaviour or a bug? It seems rather counter-intuitive to me.

Welcome to the world of UTF-8, UTF-16 and UTF-32. See: https://package.elm-lang.org/packages/zwilias/elm-utf-tools/latest/String-UTF32

I consider it a bug, and I posted an issue with a possible fix: https://github.com/elm/core/issues/977

Yeah, string encoding is complicated and carries a lot of baggage from decisions made when things were less globalized and memory was more expensive.

There are characters (the units that a human would see) and code units (the 16-bit units that are stored in memory in JavaScript). Some characters take one code unit and some take two.

The length function delegates directly to JavaScript, which defines length in terms of code units, not characters. It’s very quick, O(1), but it’s not always the number you want. It tells you about the memory size rather than the number of readable characters. Slice delegates to JavaScript too, so its indices count code units as well.

However, the fold and map functions in the String module actually iterate over characters, not code units. They have to, because they give you a Char as the argument to your iterator function. There’s no type in Elm for an individual code unit (except for Int, I suppose).

You could make a function to calculate the number of characters, but you have to fold over the string, so it’s O(n).
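A minimal sketch of that counting function, assuming String.foldl visits one Char per character as described above. charLength is a made-up name for illustration, not something in elm/core.

```elm
module CharLength exposing (charLength)

{-| Hypothetical helper (not in elm/core): count characters rather than
UTF-16 code units by folding over the string. O(n) in the length of the
string, unlike String.length, which is O(1).
-}
charLength : String -> Int
charLength str =
    -- The Char itself is ignored; we only count how many times we are called.
    String.foldl (\_ count -> count + 1) 0 str
```

For the string from the original question, charLength "🐱1234" should agree with the 5 elements that String.toList returns, while String.length keeps reporting 6.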
