Background
I am getting quite far along in my conversion of the current Elm compiler, written in Haskell, to a new “ElmPlus” compiler (currently called “Ficus”/“ficus-lang”, since the names fir-lang and fig-lang weren’t available as organization names on GitHub). It is still written in Haskell, but with the intention of eventually translating the compiler to be written in itself, since @pit and @deciob have demonstrated that one can translate the Elm compiler even into Elm transpiling to JavaScript, although at a fairly high cost in performance. Since my compiler will use Elm syntax as a more general-purpose language outputting to at least C/WebAssembly, with linearly indexed arrays and update-in-place-when-last-used, it will be at least as fast as Haskell, with the possibility of outputting to a number of other languages such as PHP, TypeScript, Python, and R (all easy, as they are dynamically typed languages with garbage collection, and possibly worth doing in order to enjoy the advantages of the pure functional paradigm as well as the huge ecosystems of packages available for those languages), as well as Go, Nim, Rust, etc. There was a recent thread on translating “elm/core” into the Go language that culminated with @rupert suggesting that, once one had an “elm/core” written in Go, one could start to look at writing a back-end for the Elm compiler to output Go code. This isn’t quite as easy as one might think, even though Go has its own garbage collection, because Elm erases the type information available to the code generator other than for what is exported from each module; however, my compiler conversion makes each module’s internal type information available to code generators, so that problem can be solved.
Desire to Avoid Fragmentation of the Elm Community
One of my concerns is not to fragment the Elm user space: a large group of the former Elm community are now frequent Roc lang contributors, another group has flocked to Gren, there now seems to be quite a bit of interest in Lamdera as being able to handle and connect a “full stack” of back-end servers and front-end clients, and there is also the possibility of Evan eventually offering a new Elm follow-on that may do what either or both of Lamdera and my efforts do. However, of those, only Lamdera and my project (and likely a new Elm, of course) are committed to being fully backward compatible with Elm, able to use any of the packages and code that don’t contain “Kernel” code, even when compiling to JavaScript. All changes and added features are fully forward compatible with Elm source code as to syntax, and any “.elm” source file will compile as long as it doesn’t try to define new custom “Abilities”, which will require a new keyword, `ability`, that might conflict with the name of a binding (value or function). I propose that if the feature of custom “Abilities” is offered, it will be limited to source files with a new extension such as “.fcs”. Thus, use of my compiler will be very similar to use of Lamdera, apart from the very few extensions and the capability of generating efficient code in languages other than JavaScript.
Some Feedback Please:
At first I was going to mostly just replace the JavaScript code generator with a C code generator, because the C code can then be passed through further compilation (such as through Emscripten) to produce JavaScript or WebAssembly, and the C code can be used directly to produce native code for all the major platforms: Windows, macOS, Linux, and mobile apps for Android and iOS. However, I now see there may be an advantage in being able to produce JavaScript directly, namely being able to make an online IDE and/or REPL without having to use a server to do the compilation and possibly run the result. I also see interest in producing code for other back-end languages such as Go and others, as mentioned in the Background section.
Currently, Elm embeds the native “Kernel” JavaScript into the AST “objects” files where it applies, but that becomes awkward if one were to support, say, ten back-end languages, in that there would be ten different embedded “Kernel” packages in each of the modules that use them. Other languages such as Fable and Roc instead have some “platform” code, with each “platform” supporting one type of back-end, but that means there is quite a bit of redundant code in the parts of those packages that aren’t “kernel” code. I am thinking of separating the “Kernel” code for each back-end out of the `Artifacts.dat` files of each package that uses it, so that alongside the `Artifacts.dat` file there would be separate files for each “platform” back-end, compiled to adjacent locations to be used by the different code generators; these might well be just plug-ins that take the output of the more generalized AST files and combine them with optimizations to produce the target code. Does anyone have any suggestions on this?
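For concreteness, one possible on-disk layout for such a separation might look like the sketch below. All of the file and directory names here are purely hypothetical illustrations of the idea, not a settled design:

```
some-package/
├── Artifacts.dat          -- back-end-independent AST, including internal type info
└── kernel/
    ├── js/Kernel.dat      -- "Kernel" code for the JavaScript back-end
    ├── c/Kernel.dat       -- "Kernel" code for the C back-end
    └── go/Kernel.dat      -- "Kernel" code for the Go back-end
```

A code-generator plug-in for a given back-end would then read `Artifacts.dat` plus only its own `kernel/<backend>/` directory, so adding an eleventh back-end would mean adding one directory per package rather than touching every embedded artifact.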
I know some are dying to ask, so I’ll answer the question: “Will the new language support FFI to the new back-end languages?” Yes - there is no reason not to support this, and I can think of at least three ways to implement it: 1) create package wrappers around any foreign code one wants to include, and remove the restriction on publishing packages with “Kernel” code; 2) modify the `port` mechanism so that it can call and be called from the back-end language code; and 3) provide a full FFI package something like what Haskell or PureScript have. Some of these will definitely be implemented, and very likely eventually all three.
The Biggest Problems with Backward Compatibility
This is the biggest problem with backward compatibility with current Elm code, and it isn’t related to Elm syntax at all but rather to the “standard library” packages: many of these packages are wrong or inconsistent! I’ll give examples of the main ones, as follows:
- Currently, the `Int` type has inconsistent bit depth: all `Bitwise` operations produce 32-bit results, but so also does integer divide. Other languages, even ones producing JavaScript, are more consistent about this, with PureScript consistently treating its `Int` type as 32 bits, and Fable mostly retaining bit depth while allowing overflow, at least in the case of addition, subtraction, and multiplication, into “number/float” type ranges. As most applications will use `Int`s within a signed 32-bit range, I would like to make this the default behaviour, but that risks breaking any code that depends on `Int`s (sometimes) having the extended range. The only way I can see to stay consistent with current versions is to have two “Kernel” versions for each back-end language: one used when compiling “elm.json” projects with “.elm” source files and the other used with “ficus.json” projects with “.fcs” source files. Is this important and frequent enough to justify all the extra work?
- Currently, integer division and the `modBy` and `remainderBy` functions are an inconsistent mess as to what happens when dividing by zero: integer division by zero produces zero, which, while not mathematically correct, is at least consistent, and other languages take the same shortcut. Currently `modBy 0 0` produces a panic/exception, which Elm should not allow, and `remainderBy 0 0` produces `NaN` (Not a Number) as an `Int`, which it is not, as it is a `Float` representation. To be consistent, I would like these functions to produce a zero `Int` for a zero divisor, consistent with integer divide. This should be safe to do, as correct current Elm code will already have zero tests for the first argument in place.
- Currently, Elm’s power operator works correctly only for `Float`s; for `Int`s with powers of less than one it produces a `Float` result which it calls an `Int`. I would like to provide special-case code that produces a zero `Int` for any negative `Int` exponent, which should be fine with any workarounds currently used to make this usable.
- Currently, the `String` functions that slice by index values can be wrong. One of the biggest problems with the use of index values on a variable-length string encoding, such as the UTF-16 used by Elm/JavaScript or UTF-8, is that programmers fail to consider that one character is not necessarily one index value. Thus, the `String.length` function produces the number of 16-bit words in the `String`, not the number of Unicode characters; use of the `String.slice` function can be wrong when it assumes one index count per character; and therefore the `String.left`, `String.right`, `String.dropLeft`, and `String.dropRight` functions will be wrong when they assume, as they do, that the number of index positions is the number of characters. I would like to fix these functions so that the relative left and right offsets are corrected to be the number of characters to be retained/dropped. This shouldn’t affect current code, in that workarounds will already have to be in place to make their use correct whenever characters may require two 16-bit words.
- Also related to string representation: C strings are UTF-8, meaning that they are a variable-length character representation, and Emscripten does automatic conversion between UTF-8 and UTF-16 for normal C strings when they cross the interface between WebAssembly and JavaScript. However, the index values obtained from `String.indexes`/`String.indices` and the count from `String.length` would then reflect 8-bit index values rather than 16-bit ones. In order to be exactly compatible with current Elm results, one would have to use a UTF-16 string format by default, which would be easy to pass across the WebAssembly/JavaScript interface but would require a conversion to UTF-8 when being passed to C library functions, and would be somewhat less efficient for encoding mostly-ASCII strings. I’m afraid the new `String` native modules will have to put up with this inefficiency in the interests of preserving strict compatibility with current Elm.
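The numeric inconsistencies above are easy to check directly in JavaScript, since current Elm compiles integer divide to a `| 0` truncation, `remainderBy` to `%`, and the power operator to `Math.pow`. This is a sketch of the underlying JavaScript semantics; the helper name `intDiv` is mine, not an Elm kernel identifier:

```javascript
// Elm's integer divide truncates its result to 32 bits via `| 0`:
const intDiv = (a, b) => (a / b) | 0;

console.log(intDiv(2 ** 40, 1)); // 0 -- 32-bit truncation silently loses the value
console.log(intDiv(7, 0));       // 0 -- divide by zero yields zero (Infinity | 0)

// `remainderBy 0 0` maps to `0 % 0`, which is the Float value NaN,
// even though Elm types the result as Int:
console.log(0 % 0);              // NaN

// The power operator maps to Math.pow, so a negative Int exponent
// yields a fractional Float that Elm nonetheless calls an Int:
console.log(Math.pow(2, -1));    // 0.5
```

Note how the same `Int` value can thus be a truncated 32-bit integer, `NaN`, or a fraction depending on which operation produced it, which is the inconsistency the proposed zero-result conventions would remove.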
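Similarly, the UTF-16 indexing problem can be demonstrated in plain JavaScript, whose strings (like Elm’s) are sequences of 16-bit code units:

```javascript
// "𝕏" (U+1D54F) lies outside the Basic Multilingual Plane, so UTF-16
// encodes it as a surrogate pair: two 16-bit code units.
const s = "a𝕏b";

// Code-unit length (what String.length reports) vs. actual character count:
console.log(s.length);      // 4 -- 16-bit code units
console.log([...s].length); // 3 -- Unicode characters

// Slicing by code-unit index can split the surrogate pair, so asking for
// "the first two characters" returns "a" plus a lone, unpaired surrogate:
console.log(s.slice(0, 2)); // "a\uD835"
```

Any correction that makes `String.left`, `String.slice`, etc. count characters rather than code units has to walk the string and widen offsets wherever a surrogate pair is encountered.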
Any Further Cases?
I welcome input on these and any further problem edge cases I may not have considered.