The encoding of the elm string type

harrysarson · February 14, 2020, 8:35pm

Hi,

I have been thinking about the Char and String type in elm. Specifically how they are encoded.

Currently elm `String`s and `Char`s are thin wrappers around a JavaScript string and therefore borrow their encoding from JavaScript. That encoding is UCS-2 (1). UCS-2 is very similar to UTF-16, except that unpaired surrogate code points are allowed. It is a bad (2) encoding that mainly exists for historical reasons (3).
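
To make the point about unpaired surrogates concrete, here is a minimal TypeScript sketch of what a plain JavaScript string permits. The names `smiley`, `firstHalf` and `isSurrogate` are made up for illustration and are not part of elm or any library.

```typescript
// U+1F600 is stored as two 16-bit code units: a high and a low surrogate.
const smiley: string = "\uD83D\uDE00";

// Slicing by code unit index is allowed to cut the pair in half.
const firstHalf: string = smiley.slice(0, 1);

// Hypothetical helper: is this 16-bit code unit in the surrogate range?
function isSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdfff;
}

// firstHalf is a perfectly legal JavaScript string that holds one
// unpaired surrogate, which is not valid UTF-16.
const firstHalfIsLoneSurrogate: boolean =
  firstHalf.length === 1 && isSurrogate(firstHalf.charCodeAt(0));
```

Nothing in the language rejects `firstHalf`, so any elm string backed directly by a JavaScript string can end up in the same state.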

I do not think elm wants to use the UCS-2 encoding and so we should in the medium to long term (before elm 1.0 commits to backwards compatibility) look to change.

Here are some references on what other languages do:

| Language | char type | string type | notes |
| --- | --- | --- | --- |
| Rust | Unicode scalar value | Array of bytes that must be valid UTF-8 | There are extra string types that provide compatibility with c and compatibility with the OS |
| c | You get alphanumeric characters and some symbols | Sequence of bytes must end with `'\0'` | I think strings are normally UTF-8 encoded, c just happens to be older than unicode |
| haskell | Unicode code point | `type String = [Char]` | |
| JavaScript | Does not exist | UCS-2 encoded | |
| go | Either a byte or a rune which is a Unicode code point | An array of bytes, can be converted to an array of runes (i.e. code points) | See https://blog.golang.org/strings and https://golangbot.com/strings/ |

(1) https://mathiasbynens.be/notes/javascript-encoding
(2) subjective
(3) You have to use UCS-2 if you want to interact with the filesystem on windows. See https://simonsapin.github.io/wtf-8/#motivation

folkertdev · February 14, 2020, 11:46pm

Have you thought about how that would work practically? Are there alternative string encodings in javascript? How fast are they? Ultimately elm is bound to the web platform, so an alternative string would need to be built on top of that, and wouldn’t hook directly into the browser internals. I think anything we build in javascript will be significantly slower than the built-in string type for common string operations.

Then, when putting anything in the DOM, we need to convert from the custom encoding to the browser’s one, which unless I’m missing something would mean it would use twice the memory.

Like with numbers (32-bit bitshifts, NaN/infinity weirdness) we all would like elm to do better than other languages, but it’s just not really possible on the web platform right now without sacrificing efficiency.

norpan · February 15, 2020, 8:22pm

I’m not sure why the encoding of strings should be user-visible at all. String functions should work with strings and unicode code points. And there should be functions to get, say, an array of UTF-8 bytes etc.

harrysarson · February 15, 2020, 11:21pm

> Have you thought about how that would work practically?

What I am proposing here is that elm explicitly specifies the encoding of Char and String values. If elm specified the UCS-2 encoding, for example, then a practical implementation is trivial. It would be harder to efficiently implement other encodings but I am confident it would be possible.

For example, if elm forbids unpaired surrogate codepoints then it can continue to use javascript strings under the hood. The elm string functions would ensure that it is impossible to insert unpaired surrogate codepoints into the string. This is an example of one of my favourite elm principles: making invalid states impossible. The only performance overhead would be seen when passing strings in from javascript (or from http, etc.), as elm would need to validate the string. I think this overhead would be well worth it.
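
As a rough idea of what that boundary validation could look like, here is a minimal TypeScript sketch. The function name `isWellFormedUtf16` is hypothetical and this is not how elm's runtime is actually written; it only shows that the check is a single linear pass over the code units.

```typescript
// Hypothetical boundary check: returns true only if every surrogate in the
// string is part of a correctly ordered high + low pair.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be immediately followed by a low surrogate.
      if (i + 1 >= s.length) return false;
      const next = s.charCodeAt(i + 1);
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate we just checked
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // A low surrogate with no high surrogate before it.
      return false;
    }
  }
  return true;
}
```

A string arriving from a port or an http response would be run through a check like this once. What to do when it fails (reject, or substitute U+FFFD) is a separate design decision.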

> I’m not sure why the encoding of strings should be user-visible at all. String functions should work with strings and unicode code points. And there should be functions to get, say, an array of UTF-8 bytes etc.

If all string functions work exclusively with unicode code points then the encoding of strings is user visible; a string is a sequence of unicode code points. I think this would be a good encoding to choose (one of a few good encodings).

However, currently the elm String module does not work exclusively with unicode code points. String.length uses the JavaScript string’s length property and so returns the number of UTF-16 code units, not code points.
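
The gap between the two counts is easy to see in a small TypeScript sketch. `countCodePoints` is a hypothetical helper written for this post, and it counts an unpaired surrogate as one code point.

```typescript
// Hypothetical helper: count code points by treating each valid
// high + low surrogate pair as a single unit.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // consume the low half of the pair
      }
    }
    count++;
  }
  return count;
}

// "a" followed by U+1F600 (written as its surrogate pair).
const sample: string = "a\uD83D\uDE00";
const codeUnitCount: number = sample.length;            // what the length property reports
const codePointCount: number = countCodePoints(sample); // what a reader would expect
```

Here `codeUnitCount` is 3 while `codePointCount` is 2, which is the discrepancy a length function built on the JavaScript `length` property inherits.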

Moreover, it is possible to have elm strings containing unpaired surrogates. In other words, currently elm does have a user-visible string encoding and it is “sometimes unicode code points, most of the time UCS-2”. However, it is not documented or consistent.

norpan · February 16, 2020, 12:13am

Well, maybe it’s just a semantic difference, but I can easily see how you can encode strings using UTF-8 internally but still return unicode code points in your functions.
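
A minimal TypeScript sketch of that idea, assuming the bytes are already well-formed UTF-8: the storage is a byte array, but the function handed to users only ever yields code points. `codePointsOfUtf8` is a hypothetical name and the decoder skips all error handling.

```typescript
// Hypothetical accessor over an internal UTF-8 buffer.
// Assumes the input is well-formed UTF-8; no validation is done here.
function codePointsOfUtf8(bytes: Uint8Array): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < bytes.length) {
    const first = bytes[i];
    let extra: number; // number of continuation bytes that follow
    let cp: number;    // code point being assembled
    if (first < 0x80) {
      extra = 0;
      cp = first;
    } else if (first < 0xe0) {
      extra = 1;
      cp = first & 0x1f;
    } else if (first < 0xf0) {
      extra = 2;
      cp = first & 0x0f;
    } else {
      extra = 3;
      cp = first & 0x07;
    }
    // Each continuation byte contributes its low 6 bits.
    for (let k = 1; k <= extra; k++) {
      cp = (cp << 6) | (bytes[i + k] & 0x3f);
    }
    out.push(cp);
    i += 1 + extra;
  }
  return out;
}

// "a", U+00E9 and U+1F600 encoded as UTF-8.
const stored: Uint8Array = new Uint8Array([0x61, 0xc3, 0xa9, 0xf0, 0x9f, 0x98, 0x80]);
const visible: number[] = codePointsOfUtf8(stored); // 0x61, 0xE9, 0x1F600
```

The caller sees three code points and never learns that seven bytes sit underneath, which is the separation between internal encoding and user-visible behaviour being described.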

I think the best thing to do is to have ways to work with a number of access methods.

malaire · February 16, 2020, 7:15am

It’s not just String.length, see https://github.com/elm/core/issues/1061 “Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units)”.

mattpiz · February 16, 2020, 6:16pm

It could be interesting to look at how wasm-bindgen, which bridges rust wasm and javascript, works for strings: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html

harrysarson · February 16, 2020, 7:04pm

> It’s not just String.length, see https://github.com/elm/core/issues/1061 “Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units)”.

There is, though, an important difference between bugs in the elm string functions (I think this is reasonable, elm is a young language and “it is better to do it right than to do it now”) and the fuzziness regarding the encoding of elm strings. The former can (and will) be fixed, but if the fuzzy string encoding sneaks into elm 1.0 then that fuzziness will remain forever (as demonstrated by windows and javascript).

Let me quote the important bit of @mattpiz’s very good link:

> When passing a string from JavaScript to Rust, it uses the TextEncoder API to convert from UTF-16 to UTF-8. This is normally perfectly fine… unless there are unpaired surrogates. In that case it will replace the unpaired surrogates with U+FFFD (�, the replacement character). That means the string in Rust is now different from the string in JavaScript!
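
The replacement behaviour in that quote can be reproduced with the standard `TextEncoder` and `TextDecoder` APIs alone. This is a minimal TypeScript sketch, assuming a runtime that exposes both as globals (browsers and recent Node versions do); the variable names are made up.

```typescript
// A string with an unpaired high surrogate between "a" and "b".
const loneInput: string = "a\uD83Db";

// TextEncoder always emits valid UTF-8, so the unpaired surrogate is
// replaced with U+FFFD, whose UTF-8 encoding is EF BF BD.
const encodedBytes: Uint8Array = new TextEncoder().encode(loneInput);

// Decoding gives back a string that differs from the original.
const roundTripped: string = new TextDecoder().decode(encodedBytes);
const survivedRoundTrip: boolean = roundTripped === loneInput;
```

After the round trip `roundTripped` is `"a\uFFFDb"` and `survivedRoundTrip` is false, so the lossy step happens silently at the boundary.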

system · February 26, 2020, 7:04pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.
