Rewriting elm-syntax and future plans

Recently, with the help of Jeremie Gillet (Jie) and @lue-bird, we released new versions of stil4m/elm-syntax (used in elm-review among others), completely rewritten to use what is known as a Pratt parser.

I describe the change in this article, and also explain what the plans are for the next major version, and how you can help pitch in.

Hope you enjoy it!

19 Likes

Hey Jeroen, thanks for the article!

I’m curious, how much does the elm-syntax performance improvement translate to elm-review performance? Do you have some benchmarks across the two versions?

I have not tested it at all. For elm-review particularly, the time spent on parsing is eclipsed by the time analyzing, given enough rules in your configuration.

If some people would like to try computing the information and report it back here, you can run elm-review --benchmark-info to get the information, you should then look at the “parse/fetch parsed files” line near the top. Though I don’t remember how reliable that number is. And don’t forget to clean up the file cache between every run (otherwise you’re benchmarking):

rm -r elm-stuff/generated-code/jfmengels/elm-review/cli/<your-elm-review-cli-version>/file-cache
# or just rm -r elm-stuff if you're lazy, but that may impact measurements?
elm-review --benchmark-info

This should have the most impact on startup time for elm-review (because we parse files before analyzing anything) and during fixes because we parse the files again after every fix (maybe even multiple times actually…).

The data structure for the propoesd v8 CST is:

type Node a
    = Node Range a

type alias Location =
    { row : Int
    , column : Int
    }

type alias Range =
    { start : Location
    , end : Location
    }

Which is perfect for describing the source code location of parts of the syntax, for purposes such as reporting errors and so on.

The one idea I had was to generalize this to:

type Node r a
    = Node r a

Two reasons…

The first reason is that when doing codegen, you do not have a source file, so you do not know the locations to fill in. So instead you have to use a dummy value, Range.empty instead. In this situation it might be preferable to codegen stuff with the type Node () a. Or maybe make up your own empty type to signify that location information has not been added yet, such as:

type Generated = Generated -- placeholder type

doCodeGen : Model -> Node Generated FunctionOrValue

The second reason is that you might like to add more information, or alternative information than the standard Range record provides. For example, I do codegen from JSON files that describe AWS services, to create the Elm stubs for calling them. In this case, I might like to provide a path into the JSON structure as the location, so that any error reporting can be tied back to the model that the codegen was run on:

type JsonPath = ...

doCodeGen : Model -> Node JsonPath FunctionOrValue

I think this would give the library less of a bias towards asuming that the CST always originates from a source file, which supports source → source type transformations well, to something that is unbiased as to how the AST/CST source model is created.

Another example use case… Suppose you wanted to calculate the Kolmogorov complexity for every function in a file. This would be a recursive algorithm that works from the leaf nodes up to the functions, calculating intermediate values along the way, then combining them as the larger expressions and functions are calculated. To support this, use an r with space for this r == type alias LocationAndKolmogorov = { start : Range, end : Range, k : Int }. So could also be useful for analytics.

What do you think?

5 Likes

This is definitely something I’m interested in having, and I forgot to mention that in the article. @Janiczek mentioned several times that this is useful for computing a typed CST for instance (so obviously that gets my attention).

However, I still have many question marks on this.

Should all nodes use the type variable? For instance, type inference (or Kolmogorov) information makes sense on expressions, but it probably doesn’t on a type or module declaration, and a lot of other nodes. Then, what value do we put on them? Should we have something like this?

type alias File expressionType otherType =
    { moduleDefinition : Node otherType Module
    , imports : List (Node otherType Import)
    , declarations : List (Node expressionType (Declaration expressionType otherType))
    , comments : List (Node otherType Comment)
    }

It feels a bit clunky, but especially, it feels like this could be very much be dependent on the use-case, making me think that maybe the type should be duplicated per usage.

Also, when parsed by elm-syntax, should expressionType and otherType still be Range initially? If so, then adding any information means mapping the entire AST, which means that it could be mapped to another type anyway.

It’s all still very fuzzy in my head. Maybe I’m overthinking this :man_shrugging:
I think I need more use-cases or experience from people that have needed to fork the AST :pray:
I’ll let someone create an issue so that we can continue the discussion on GitHub, I feel like that would be a more appropriate location for the conversation.

I think one type variable is enough. Different use cases can use a custom type to sub-divide as and when they need it.

I would expect that after parsing the result would be a File Range or Node Range, yes. After that transformation to something else might very well be done, but at least you could re-use the existing AST from elm-syntax for this purpose, if there was a type variable.

1 Like

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.