The encoding of the Elm string type

Hi,

I have been thinking about the Char and String types in Elm, specifically how they are encoded.

Currently Elm Strings and Chars are thin wrappers around a JavaScript string and therefore borrow their string encoding from JavaScript. That encoding is UCS-2 (1). UCS-2 is an encoding very similar to UTF-16 but unpaired surrogate code points are allowed. It is a bad (2) encoding that mainly exists for historical reasons (3).
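As a minimal illustration of what "unpaired surrogates are allowed" means in practice, the JavaScript sketch below builds a string containing a lone surrogate. The variable names are made up for the example; only standard `String` methods are used.

```javascript
// A lone high surrogate (U+D800) with no low surrogate after it.
const lone = "a\uD800b";

// JavaScript accepts it as an ordinary string of three code units.
const loneLength = lone.length; // 3

// A well-formed pair: U+1F600 is stored as the two code units D83D DE00.
const paired = "\uD83D\uDE00";
const pairedCodePoint = paired.codePointAt(0); // 0x1F600

// Reading the lone surrogate back yields the surrogate itself,
// which is not a Unicode scalar value.
const loneCodePoint = lone.codePointAt(1); // 0xD800
```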

I do not think Elm wants to use the UCS-2 encoding, so in the medium to long term (before Elm 1.0 commits to backwards compatibility) we should look to change it.

Here are some references on what other languages do:

| Language | char type | string type | notes |
| --- | --- | --- | --- |
| Rust | Unicode scalar value | Array of bytes that must be valid UTF-8 | There are extra string types that provide compatibility with c and compatibility with the OS |
| c | You get alphanumeric characters and some symbols | Sequence of bytes must end with '\0' | I think strings are normally UTF-8 encoded, c just happens to be older than unicode |
| haskell | Unicode code point | type String = [Char] | |
| JavaScript | Does not exist | UCS-2 encoded | |
| go | Either a byte or a rune which is a Unicode code point | An array of bytes, can be converted to an array of runes (i.e. code points) | See https://blog.golang.org/strings and https://golangbot.com/strings/ |

  1. https://mathiasbynens.be/notes/javascript-encoding
  2. subjective
  3. You have to use UCS-2 if you want to interact with the filesystem on windows. See https://simonsapin.github.io/wtf-8/#motivation

Have you thought about how that would work practically? Are there alternative string encodings in JavaScript? How fast are they? Ultimately Elm is bound to the web platform, so an alternative string would need to be built on top of that, and wouldn’t hook directly into the browser internals. I think anything we build in JavaScript will be significantly slower than the built-in string type for common string operations.

Then, when putting anything in the DOM, we need to convert from the custom encoding to the browser’s one, which unless I’m missing something would mean it would use twice the memory.

As with numbers (32-bit bitshifts, NaN/infinity weirdness), we would all like Elm to do better than other languages, but it’s just not really possible on the web platform right now without sacrificing efficiency.


I’m not sure why the encoding of strings should be user-visible at all. String functions should work with strings and Unicode code points. And there should be functions to get, say, an array of UTF-8 bytes etc.

> Have you thought about how that would work practically?

What I am proposing here is that Elm explicitly specifies the encoding of Char and String values. If Elm specified the UCS-2 encoding, for example, then a practical implementation would be trivial. It would be harder to efficiently implement other encodings, but I am confident it would be possible.

For example, if Elm forbids unpaired surrogate code points then it can continue to use JavaScript strings under the hood. The Elm string functions would ensure that it is impossible to insert unpaired surrogate code points into the string. This is an example of one of my favourite Elm principles: making invalid states impossible. The only performance overhead would be seen when passing strings in from JavaScript (or from HTTP, etc.), as Elm would need to validate the string. I think this overhead would be well worth it.
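A rough sketch of that validation step, written in JavaScript since that is what Elm strings wrap. `hasNoUnpairedSurrogates` and `fromJavaScript` are hypothetical names, not part of Elm's kernel code, and a real implementation might choose to repair bad input rather than reject it.

```javascript
// Hypothetical helper: true when `s` contains no unpaired surrogates.
function hasNoUnpairedSurrogates(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low half of the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      return false;
    }
  }
  return true;
}

// Hypothetical boundary check for strings arriving from JavaScript.
// Assumption: bad input is rejected instead of being silently changed.
function fromJavaScript(s) {
  return hasNoUnpairedSurrogates(s) ? { ok: true, value: s } : { ok: false };
}
```

The scan is a single linear pass over the code units, so the cost is paid once per string crossing the boundary and not on every string operation.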

> I’m not sure why the encoding of strings should be user-visible at all. String functions should work with strings and Unicode code points. And there should be functions to get, say, an array of UTF-8 bytes etc.

If all string functions work exclusively with Unicode code points then the encoding of strings is user-visible: a string is a sequence of Unicode code points. I think this would be a good encoding to choose (one of a few good encodings).

However, currently the Elm String module does not work exclusively with Unicode code points. String.length uses the JavaScript string’s length property and so returns the number of UTF-16 code units, not code points.
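The difference is easy to see from plain JavaScript. In this small sketch `codePointLength` is a hypothetical helper, not an existing Elm or JavaScript function.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so it needs two code units.
const face = "\u{1F600}";

// The `length` property counts UTF-16 code units.
const codeUnits = face.length; // 2

// Hypothetical helper: the string iterator walks code points, so count those.
function codePointLength(s) {
  let n = 0;
  for (const _ of s) n++;
  return n;
}

const codePoints = codePointLength(face); // 1
```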

Moreover, it is possible to have Elm strings containing unpaired surrogates. In other words, Elm currently does have a user-visible string encoding and it is “sometimes Unicode code points, most of the time UCS-2”. However, it is neither documented nor consistent.

Well, maybe it’s just a semantic difference, but I can easily see how you can encode strings using UTF-8 internally but still return Unicode code points in your functions.
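One way to picture this, as a sketch only: a wrapper that keeps UTF-8 bytes internally but answers in code points. `Utf8String` is a hypothetical name, and `TextEncoder`/`TextDecoder` are assumed to be available (they are in current browsers and Node.js).

```javascript
// Hypothetical string type: UTF-8 bytes inside, code points outside.
class Utf8String {
  constructor(jsString) {
    // TextEncoder always produces UTF-8.
    this.bytes = new TextEncoder().encode(jsString);
  }

  // Number of code points: count every byte that is not a
  // UTF-8 continuation byte (bit pattern 10xxxxxx).
  get length() {
    let n = 0;
    for (const b of this.bytes) {
      if ((b & 0xC0) !== 0x80) n++;
    }
    return n;
  }

  // Code points as numbers. For brevity this decodes through a
  // JavaScript string; ignoreBOM keeps a leading U+FEFF in the output.
  codePoints() {
    const text = new TextDecoder("utf-8", { ignoreBOM: true }).decode(this.bytes);
    return Array.from(text, (ch) => ch.codePointAt(0));
  }
}

const s = new Utf8String("h\u00E9\u{1F600}");
const byteCount = s.bytes.length; // 1 + 2 + 4 bytes
const points = s.codePoints();
```

Callers only ever see `length` and `codePoints()`, so the byte-level representation stays an implementation detail.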

I think the best thing to do is to have ways to work with a number of access methods.

It’s not just String.length, see https://github.com/elm/core/issues/1061 “Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units)”.


It could be interesting to look at how wasm-bindgen, which bridges Rust wasm and JavaScript, works for strings: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html


> It’s not just String.length, see https://github.com/elm/core/issues/1061 “Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units)”.

There is, though, an important difference between bugs in the Elm string functions (I think these are reasonable; Elm is a young language and “it is better to do it right than to do it now”) and the fuzziness regarding the encoding of Elm strings. The former can (and will) be fixed, but if the fuzzy string encoding sneaks into Elm 1.0 then that fuzziness will remain forever (as demonstrated by Windows and JavaScript).


Let me quote the important bit of @mattpiz’s very good link:

> When passing a string from JavaScript to Rust, it uses the TextEncoder API to convert from UTF-16 to UTF-8. This is normally perfectly fine… unless there are unpaired surrogates. In that case it will replace the unpaired surrogates with U+FFFD (�, the replacement character). That means the string in Rust is now different from the string in JavaScript!
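The behaviour in that quote can be reproduced without Rust, assuming an environment with `TextEncoder` and `TextDecoder`. This only demonstrates the lossy round trip; it is not wasm-bindgen's actual code.

```javascript
// A string containing an unpaired high surrogate.
const original = "a\uD800b";

// UTF-16 (JavaScript) -> UTF-8 bytes -> back to a JavaScript string.
const utf8 = new TextEncoder().encode(original);
const roundTripped = new TextDecoder().decode(utf8);

// The lone surrogate was replaced by U+FFFD during encoding,
// so the round trip does not give back the original string.
const changed = roundTripped !== original;
const replaced = roundTripped === "a\uFFFDb";
```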

