Unexpected behaviour when slicing unicode strings

jwoLondon · September 29, 2018, 7:36am

If a string contains emoji Unicode characters, I would expect to be able to slice substrings in the same way as with any other characters. But the following does not slice out the first emoji character:

   "🐱1234" |> String.slice 0 1

String.length reports the length as 6 not 5, so I have to slice two characters of the string to extract the first emoji character.

Yet if I convert the string to a list of characters via String.toList, I get the expected 5-element list, which lets me treat normal and extended characters in the same way.
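The three observations above can be put side by side in a small sketch. The module name and the value names (`example`, `codeUnitLength`, `firstEmoji`, `charCount`) are made up for illustration; the results in the comments are the ones reported in this post, not independently measured.

```elm
module SliceDemo exposing (..)


example : String
example =
    "🐱1234"


-- Reported in this post as 6, not 5.
codeUnitLength : Int
codeUnitLength =
    String.length example


-- Slicing two units is what extracts the whole emoji, per this post.
firstEmoji : String
firstEmoji =
    String.slice 0 2 example


-- String.toList yields one Char per character, so this is 5.
charCount : Int
charCount =
    List.length (String.toList example)
```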

See this ellie for an example.

Is this expected behaviour or a bug? It seems rather counter-intuitive to me.

berryg · September 29, 2018, 10:28am

Welcome to the world of UTF-8, UTF-16 and UTF-32. See: https://package.elm-lang.org/packages/zwilias/elm-utf-tools/latest/String-UTF32

billstclair · September 29, 2018, 2:26pm

I consider it a bug, and I posted an issue with a possible fix: https://github.com/elm/core/issues/977

Brian_Carroll · September 29, 2018, 3:01pm

Yeah, string encoding is complicated and carries a lot of baggage from decisions made when things were less globalized and memory was more expensive.

There are characters (the units that a human would see) and code units (the 16-bit units that are stored in memory in JavaScript). Some characters take one code unit and some take two.

The length function delegates directly to JavaScript, which defines length in terms of code units, not characters. It’s very quick, O(1), but it’s not always the number you want. It tells you about the memory size rather than the number of readable characters. Slice delegates to JavaScript too.

However, the fold and map functions in the String package actually iterate over characters, not code units. They have to give you back a Char as an argument to your iterator function. There’s no type in Elm for an individual code unit (except for Int, I suppose).

You could make a function to calculate the number of characters, but you have to fold over the string, so it’s O(n).
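A minimal sketch of such a function, using String.foldl from elm/core. The names `charLength` and `charSlice` are hypothetical, not part of any package, and `charSlice` assumes non-negative indices (unlike String.slice, it does not handle negative offsets).

```elm
module CharCount exposing (charLength, charSlice)


-- Count characters by folding over the string: O(n).
-- The fold hands over one Char at a time, so an emoji counts once.
charLength : String -> Int
charLength str =
    String.foldl (\_ count -> count + 1) 0 str


-- Slice by character position rather than by code unit.
-- Goes through a List Char, so it is also O(n).
charSlice : Int -> Int -> String -> String
charSlice start end str =
    str
        |> String.toList
        |> List.drop start
        |> List.take (end - start)
        |> String.fromList
```

With these, `charLength "🐱1234"` gives 5 and `charSlice 0 1 "🐱1234"` gives "🐱". The cost is that both walk the whole string, whereas the built-in length is O(1).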

system · October 9, 2018, 3:01pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.
