Elm-syntax - The rough edges

I have been trying again to get the pretty printer output from elm-syntax-dsl to match the output of elm-format, and have fixed quite a few bugs towards this goal recently.

There are a couple of problems with elm-syntax, that cannot be solved in the pretty printer alone. Specifically, triple quoted multi-line strings “”", and single line comments --. Not that elm-syntax-dsl consumes elm-syntax as a dependency.

For Strings, In Elm.Syntax.Expression, line 125:

type Expression 
     = ... 
     | Literal String      -- Just one constructor for ALL strings 
     | ...

In Elm/Parser/Expression.elm, lines 448-454:

 literalExpression : Parser (WithComments (Node Expression)) 
 literalExpression = 
     Tokens.singleOrTripleQuotedStringLiteralMapWithRange 
         (\\range string -> 
             { comments = Rope.empty 
             , syntax = Node range (Literal string)  -- Same constructor for both! 
             } 
         )

For single line comments, there is a representation:

type alias File =
    { moduleDefinition : Node Module
    , imports : List (Node Import)
    , declarations : List (Node Declaration)
    , comments : List (Node Comment)
    }

The difficulty here is that these comments are gathered at the file level, with their Node giving their position in the code. If I re-format the code, this position may change, so how do I know where to put the comments back again?

So lets add comments directly into Expressions? Except there a lot of places that you can put a comment in Elm code, and that would mean putting a lot of Maybe Comments in the Expression AST, > 99.9% of which would not be used.

The String one seems solvable by adding a Normal|Triple label to the Literal String representation.

The comments one… perhaps take a look at how elm-syntax does it and borrow from there.

Just curious if anyone has any thoughts on this?

1 Like

Comments. It’s always the comments. They take the joy out of modifying files at the AST level.

4 Likes

Yeah, the doc comments are better - but even dealing with them is pretty complex, since they contain markdown, and then there is the interaction between the @docs tags and how the exposing list gets organized.

But I will investigate how elm-format deals with it all.

From what I can figure out elm-format seems to use the fact that a – comment can only appear once per line, and at the end of a line, or on a line by itself.

So if I have some code like this:

let
    x = a -- Comment here is end line attached to Node of a
          -- Comment here is after line break but attaced to Node of a
         + b -- Comment here is end line attached to Node of whole assignment
    -- Comment here can have its own Comment entry in the AST
in

So it looks like it would be enought to let each Node have 1 or more comments, attached either at the end of the line, or after a line break. Then its a question of figuring out the correct Node to attach each comment to.

Comments on their own line, not really attached to anything can have an AST entry.

let
    -- Comment here on line before, attached to assignment.
    x = a -- Comment here is end line attached to Node of a
          -- Comment here is after line break but attaced to Node of a
         + b -- Comment here is end line attached to Node of whole assignment
    -- Comment here can have its own Comment entry in the AST
in

Also attached to a Node, but on line before.

you can fairly easily go through all comments and collect those in a range: elm-syntax-format/src/ElmSyntaxPrintDefunctionalized.elm at main · lue-bird/elm-syntax-format · GitHub , then there are various ways elm-format prints them (some collapse into a single line if possible, some are all printed with linebreaks in between etc depending on where in the source code you are): elm-syntax-format/src/ElmSyntaxPrintDefunctionalized.elm at main · lue-bird/elm-syntax-format · GitHub

As you said, and also in my opinion, storing the parsed comments at each possible place makes it both very annoying to collect and is also a waste of space, so I think the current solution is good enough.

The only real issue currently is that many tokens do not have ranges in the syntax tree so when printing, we cannot know if the comments are before or after it: GitHub - lue-bird/elm-syntax-format: pretty print elm-syntax like elm-format

As for single vs triple quoted: yes, this will need to be fixed in elm-syntax v8, currently I check which one was used by seeing if the Node range is multi-line or the single-line range length is greater than the string unicode length + 2 (which sometimes fails when doing a bunch of escapes like escaping single quotes in a string but it’s generally a decent guess): elm-syntax-format/src/ElmSyntaxPrintDefunctionalized.elm at main · lue-bird/elm-syntax-format · GitHub

3 Likes

My pretty printer does not just print the code as elm-format does. Instead it lets you print the code to a particular page-width (and now with an optional ribbon-width too). It re-flows the code to try and fit it to a particular page with. If there are broken lines which are short, it may reassemble them into single lines. If there are lines which are too long or lead to too deeply indented code further on, it may break those lines to get everything to fit.

So I could find all comments in a range. But the ranges are changing during this code reformat. So I will not know what range to find all comments for, if you see what I mean?

===

The idea is to get this reformated output as close as I can to being idempotent under elm-format - that is, running elm-format on it will not make any changes. In practice this will be impossible to do when fitting, because elm-format sometimes puts things on a line regardless of how long it is. But hopefully I can get very close, or find ways to violate line length restrictions where absolutely needed to match elm-format.

Most likely I will just make a tool that runs a re-format and then after runs elm-format on it anyway to ensure an exact match.

1 Like

Really curious timing, just ran into these issues this week, and am currently working on an expanded version of the syntax model.

Adding to what @lue-bird already mentioned, here is the link to the PR adding the triple quote distinction in breaking-changes-v8: https://github.com/stil4m/elm-syntax/pull/189

The difficulty here is that these comments are gathered at the file level, with their Node giving their position in the code. If I re-format the code, this position may change, so how do I know where to put the comments back again?

When I am processing a node for formatting, I still have its original location/range. I use that to determine which of the comments (if any) belong here. In other words, I compare original node location with original comment location to associate them.

Some libraries for other languages do it this way. For example, in Roslyn, each node has both LeadingTrivia and TrailingTrivia properties. (Trivia not only contain comments, since it supports different kinds of whitespaces and line-breaks, all to be preserved exactly)
Fun exercises to figure out whether to put such a comment in the trailing trivia of the last, or the leading trivia of the next expression node…

But, so far the model keeping the comments in the list at the root seems to work well. I did not run into a case which made me want to change that aspect of the syntax model.

Absolutely, I am experimenting with a modified version to add locations of these tokens.

Related:

1 Like

To work around the problem of figuring out placement between these location-less tokens on one side and comments on the other, we used an approach that places the tokens where the gaps between comments are.

The following is an example from our test suite that we succesfully round-trip using the original stil4m/elm-syntax (version 7) model:

module Test exposing (..)


decl a b =
    let
        alfa =
            42
    in
    if a == 71 then
        [ 13 ]

    else if
        -- A comment before condition
        (a == 73)
            || (a == 77)
        -- Comment between conditions
    then
        [ 17 ]

    else if
        a == 79
        -- Another simple comment
    then
        -- And another simple comment
        [ 21 ]

    else
        -- Yet another simple comment
        [ 23 ]

While this might be sufficient to achieve proper round-tripping aligned with avh4/elm-format, the implementation is somewhat hacky here: Since our architecture has a formatter separate from the renderer, the renderer has duplicate implementations of this placement logic.

I am in agreement with you all, there’s a lot of pain points with elm-syntax, and the v8 branch/project is work in that direction. I described the future of elm-syntax v8 during its last major development. There’s an branch/PR with all the ongoing changes: Ongoing branch for v8 by jfmengels · Pull Request #188 · stil4m/elm-syntax · GitHub

Unfortunately, I have not had the time/energy to work on it recently. But if people want to discuss, propose or even make PRs, I’ll do my best to help out so we can get this to a close. In my OSS experience, there’s nothing like collaboration to re-ignite the fire. And it would feel so good to get over this milestone!

The simple/triple quotes is one of the issues fixed on the v8 branch.

As for comments, the discussion has been brought up many times, and I don’t think I have been convinced by anything yet (but it’s been a while). I’m absolutely open to changing my mind, but right now I’m going with not changing anything and leaving the comments be a separate list next to comments.

1 Like

Thanks - I had not realized there was a v8 branch till lue mentioned it, I will check that out.

I can try re-inserting the comments from the list, using original position ranges from the nodes being re-formated. Perhaps it will not be as hard as I imagine, but knowing how hard it was to write the pretty printing code, I do imagine it will be extremely fiddly. Its worth exploring this though as keeping the comments in a list is definitely a much simpler model than working them into the AST directly.

1 Like

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.