Elm core libs in WebAssembly

Thanks Evan! :smiley:
Hadn’t heard of StringView before but I’ve played with TextDecoder as part of this project.

you may get some performance benefits… if the underlying implementation wants UTF-8 anyway

This is what I’m concerned about! My impression is that hardly any of the underlying implementations do want UTF-8! So the performance argument goes the other way.

My understanding is that all of the string-related C++ code in all the browsers works on UTF-16. I remember reading a comment from a Servo developer on Hacker News, who said that since it was a new ground-up browser development, they wanted to go with UTF-8 for the internal string stuff. But all the W3C specs are based on UTF-16 and they kept running into problems. In the end they had to abandon that idea and switch to UTF-16. They just couldn’t take the performance hit with all the conversions. Unfortunately I haven’t been able to dig up that link again!

If that’s right and Elm uses UTF-8 as the underlying representation in the String library, then it will pay some performance penalty for it, because there will be lots of conversions to UTF-16 somewhere in order to interface with browser C++ code. How big is that penalty? I don’t know. But why pay it? I guess UTF-8 saves you memory though. Which matters most? Dunno, needs benchmarking!
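To make the trade-off concrete, here is a rough sketch of the conversion step I mean. It assumes the Elm string sits in WebAssembly memory as UTF-8 bytes (the `utf8Bytes` buffer is a stand-in I made up, not real Elm runtime code) and has to be decoded into a JavaScript string before any DOM call can use it. It also compares the two sizes. Note that `length * 2` is the size as 16-bit code units, not necessarily what a given engine allocates internally.

```javascript
// Hypothetical stand-in for bytes living in WebAssembly linear memory,
// assuming the String library stored text as UTF-8.
const utf8Bytes = new TextEncoder().encode("h\u00e9llo"); // "héllo"

// The conversion that has to happen before the string can be passed
// to a DOM API: UTF-8 bytes -> JavaScript string.
const jsString = new TextDecoder("utf-8").decode(utf8Bytes);

// Size comparison: UTF-8 uses 1 byte for ASCII and 2 for "é",
// while 16-bit code units use 2 bytes for each of these characters.
const utf8Size = utf8Bytes.length;      // 6
const utf16Size = jsString.length * 2;  // 10

console.log({ jsString, utf8Size, utf16Size });
```

So the memory saving is real for mostly-ASCII text, but every trip across the boundary pays for a decode like the one above. How much that costs in practice is exactly what needs benchmarking.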

I’ll provide some of the evidence that led me to this conclusion. If you look in the W3C specs you’ll find lots of references to DOMString. For example, when you do document.createElement('div'), the bytes representing that string 'div' must be a DOMString.

DOMString is defined to be UTF-16 here
The spec for the Document interface is here. It specifies that a Document has a method called createElement that has an argument tagName whose type is DOMString.
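As a small illustration of what "a sequence of 16-bit code units" looks like from the JavaScript side, here is a sketch that dumps the code units of a string and compares them with the UTF-8 bytes for the same text. The `codeUnits` helper is just something I wrote for this post, not part of any spec or library.

```javascript
// Return the 16-bit code units of a JavaScript string as numbers.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// The tag name from the createElement example: one code unit per letter.
const tagUnits = codeUnits("div"); // [100, 105, 118]

// A character outside the Basic Multilingual Plane (U+1F600) needs
// two code units (a surrogate pair) but four bytes in UTF-8.
const emojiUnits = codeUnits("\u{1F600}"); // [0xD83D, 0xDE00]
const emojiUtf8 = Array.from(new TextEncoder().encode("\u{1F600}")); // [0xF0, 0x9F, 0x98, 0x80]

console.log(tagUnits, emojiUnits, emojiUtf8);
```

For ASCII-only strings like tag names the two encodings carry the same numbers, just at different widths, so the conversion is cheap but still a copy.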

If you browse around those specs, hit ctrl-F, and search for DOMString, your screen will light up. It seems crazy that they don’t just leave this up to browser implementations.

All of this was a huge shock to me by the way. I thought that, since HTML documents are usually transmitted over the wire as UTF-8, surely the DOM would also be based on UTF-8? Nope! That’s not how it works! I guess it gets converted during HTML parsing? I’m not sure. I’d love to know, but this kind of detail is pretty hard to find.

I would love to be wrong about this so if someone knows I am, please tell me!