I did some investigation and it looks like the issue might relate to large single files. The culprit seems to be the Snippet operations.
I have updated ./bin/index.js so that its first line is #!/usr/bin/env node --heap-snapshot-on-oom --max-old-space-size=1072
This allowed me to produce a .heapsnapshot file that lists a very large number of “concatenated string” entries. With some more digging, it looks like this relates to the top-level Css module in the elm-css package, which is in fact very large.
Does anyone know how to improve the operations on the Snippet structure regarding the concatenation of strings? Thanks
I noticed that too - that a lot of heap space was taken up with concatenated strings. And they all seemed to start with the same prefix and contain the contents of a .elm file.
Not sure, but perhaps as Strings are being read in and appended together they are not “structure sharing”? What I mean is in an immutable language if I have:
a = "a"
b = "b"
c = a ++ b
Then I might expect c to consist of 2 strings, a and b, but each of those to be a reference to the original a and b, and not a deep copy. But if concatenated strings are causing lots of deep copies of the same Strings, and the garbage collector is not cleaning up the ones no longer used, then that would take up a lot of memory.
Just a guess though.
I don’t find the Chrome dev tools profiler easy to understand, is there a better NodeJS profiler we could try? I used to be a whizz with JProfiler in my Java days, what Chrome offers is nowhere near as good unfortunately.
type Snippet
= Snippet
{ fptr : String
, offset : Int
, length : Int
, offRow : Row
, offCol : Col
}
Is it putting the entire .elm file in the fptr for each Snippet? But then only referencing a small section of it with the offsets?
Perhaps in the Haskell code all these Strings are pointers to a single String, but in the Elm they are not and that is why the heap contains the same String many many times?
In that case the answer might be to move the fptr out of the Snippet, and just have one at a higher level somewhere that the offsets refer to?
I was able to build rtfeldman/elm-css with it, and a few of my projects that use it as well. The 150K LOC project I have, which also has 64 package dependencies.. no, but that one is a bit extreme in size. Overall though, I think the change increases the size of projects that it can build.
Do you know why the String -> List Char -> String trick works? Is this exposing some inefficiency in Elm's String handling that could be improved with a patch to elm/core?
I notice you defined your own Compiler.Parse.Primitives with:
type Parser x a
= Parser (State -> PStep x a)
Rather than use elm/parser. Possibly this is a source of inefficiency in String handling, as elm/parser has some kernel code to help with String manipulation?
Hi Décio, just to add one more data point: my port of the compiler has no problems with rtfeldman/elm-css, but I got a stack overflow for pablohirafuji/elm-markdown. I was able to fix the problem by slightly changing Evan’s original code. I wanted to tell you about it, but I’ve seen that you have changed the implementation already (function eatSpaces in the parser). Good luck!
Strings in JS are immutable, so every time you add a character, it will create a new string that is 1 character longer than the previous one. GC will not run until everything is done, so you’re stuck with tons of strings of various lengths.
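A hedged sketch of that point in plain JavaScript (illustrative names, not the actual elm/core kernel code): appending one character at a time allocates a new string on every step, while collecting the pieces in an array and joining once produces a single flat string.

```javascript
// Illustrative sketch (not the actual kernel code): JS strings are
// immutable, so each `+=` allocates a new string value. Collecting
// the pieces in an array and joining once avoids the intermediates.

function buildByConcat(codes) {
  let out = "";
  for (const code of codes) {
    out += String.fromCharCode(code); // new string each iteration
  }
  return out;
}

function buildByJoin(codes) {
  const parts = [];
  for (const code of codes) {
    parts.push(String.fromCharCode(code)); // cheap array push
  }
  return parts.join(""); // one flat allocation at the end
}

const codes = [72, 105, 33];
console.log(buildByConcat(codes)); // "Hi!"
console.log(buildByJoin(codes));   // "Hi!"
```

Both produce the same string, but the concat version leaves a trail of intermediate strings behind for the GC, which matches the heap snapshot symptom described above.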
My current fix was just to convert the ConsStrings into a simple String by going through List Char. Given you have confirmed this fixes some issues, I will update the branch with some comments and most likely merge it.
I do recall giving it a try early in the project, but I was not successful at the time. Most of the project has been done by mapping from the original Haskell code, so that if there was an issue, we could more easily compare with the original implementation.
Having said this, I do think the project is now in a good state to start focusing more on performance, and deviating from the original where needed. Would you be willing to give updating Compiler.Parse.Primitives a try, so that it internally uses elm/parser (I assume we would need Parser.Advanced), ideally without changing its API? Or to create an issue on the project to track this idea?
I got an email saying Google Jules coding agent is emerging from Beta, so I asked it to do just that. Only got about 2/3rds of the way there before it gave up on a syntax error. But I got it to check in its WIP anyway and will see if any of it is useful.
I think this might not fix anything to do with the concatenated strings issue - but I do have an idea about that:
===
Instead of slicing the String up-front and putting it in the Snippet, just put in the whole unsliced String, which is a reference and so only takes up one pointer of memory. Then when code needs just the slice, apply the slice operation on demand; I have found String.slice in Elm to perform pretty well.
===
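A minimal JS-level sketch of the idea above (hypothetical names, not Guida's actual representation): every snippet keeps one shared reference to the full source string plus offsets, and the substring is only materialized on demand.

```javascript
// Hypothetical sketch: store the whole source string by reference in
// each snippet (a reference costs one pointer, not a copy) together
// with offset/length, and only materialize the substring when needed.

function makeSnippet(source, offset, length) {
  return { source, offset, length }; // `source` is shared, not copied
}

function snippetText(snippet) {
  // Slicing is applied on demand; engines typically keep this cheap.
  return snippet.source.slice(snippet.offset, snippet.offset + snippet.length);
}

const file = "module Main exposing (main)\n\nmain = text \"hi\"\n";
const snip = makeSnippet(file, 0, 6);
console.log(snippetText(snip)); // "module"
```

Whether the slice itself copies eagerly or lazily is an engine detail, but either way the per-snippet cost at rest drops to a pointer plus two ints.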
I will paste the above into an Issue and have a go with it.
Worth noting that in the Haskell, 16-bit rows and cols are used to save memory, but in the Elm they are going to be 64-bit floats. Possibly we can pack the range pointers and the row and col somehow?
I am guessing from _fptr being a pointer that elm/compiler is slicing the snippet on-demand.
type Row = Word16
type Col = Word16
data Snippet =
Snippet
{ _fptr :: ForeignPtr Word8
, _offset :: Int
, _length :: Int
, _offRow :: Row
, _offCol :: Col
}
Other: I am interested in doing some compiler hacking to add new backend(s). I started out learning about the compiler through the Lamdera codebase and that would have been where I did this work. But intrigued by this drop-in re-implementation in Elm, as its preferable for me to work in Elm than Haskell.
So I think the “feature” that interests me is that we can have a compiler that the Elm community can contribute to without needing to use something other than Elm to do so.
I think the out of memory problem comes down to this code here:
This happens when the binary files under elm-stuff (or guida-stuff) are read back in, and Strings are decoded from Bytes. The way the String is built by appending one char at a time results in lots of ‘concatenated strings’ in memory as described above. Applying a (String.toList >> String.fromList) to the String allows all those intermediate concatenated strings to be freed for garbage collection, which helps a lot.
That got me wondering, what does String.fromList do that is better?
It builds an array using .push(), and then .join('') on the array. Might it be even better to allocate the entire array as Uint16s, fill it in from the bytes and then .join?
So the _Bytes_read_string operation could be improved by pushing the char codes onto an array, and then joining to form a String.
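A simplified sketch of that push-then-join shape. The real _Bytes_read_string has to handle multi-byte UTF-8 sequences and surrogate pairs; this version assumes single-byte (ASCII) input just to illustrate the allocation pattern:

```javascript
// Simplified sketch of decoding bytes into a string via push + join.
// Assumes single-byte (ASCII) input; the real kernel code must also
// handle multi-byte UTF-8 sequences and surrogate pairs.

function readAsciiString(bytes) {
  const parts = [];
  for (let i = 0; i < bytes.length; i++) {
    parts.push(String.fromCharCode(bytes[i])); // no intermediate strings
  }
  return parts.join(""); // one flat string, nothing to flatten later
}

const bytes = new Uint8Array([109, 111, 100, 117, 108, 101]);
console.log(readAsciiString(bytes)); // "module"
```

The key difference from the current char-by-char append is that nothing here produces a chain of ever-longer intermediate strings, so there is nothing left over for the GC to chew through.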
Seems like a good one for a PR to add to elm-janitor.
I’m a bit slow here to figure out what you probably already knew @deciojf. But just wanted to clarify that it does not seem to be the Parser code at all, which never works with Bytes. It’s only at the step of reading back in from the binary files that this problem occurs. Some confusion arose in my mind because some of the binary → Snippet code resides in the Parser modules.
I would quite like to try something like this:
type SnippetRef
= SnippetRef
{ filename : String
, offset : Int
, length : Int
, row : Row
, col : Col
}
So that instead of reading the snippets into memory, just create pointers to them. This would only be done during the binary file read phase, not during parsing, which would use the normal Snippet. Then if a SnippetRef ever needs to be dereferenced (which I think is just for error reporting, right?), run a Task to extract it from the file. I think this will dramatically reduce the memory needed to run a compile.
Interesting that there is already an issue for it. I think we could do better than forcing the assembly of the string by trying an index operation on it, as there would still be a need to build up all the concatenated strings on the heap and then GC them. Better to avoid that altogether with the array.join operation.