Parsers with Error Recovery

I’ve seen this topic mentioned a few times on Slack, but though it worth asking here so the answers are better preserved.

I want to write a parser with error recovery (for this). So as a user is inputing the source code, the parser should run and attempt to parse the input. If it fails to do so, it should recover and continue on the next part of the source that is valid.

Has anyone any links to blogs, source code, etc on doing this in Elm? Or any thoughts or experience they might care to share?

2 Likes

The easiest way is probably to define a simple notion of “chunks” that any file can be broken into.

So with Elm code, maybe that means a chunk begins whenever you have a “fresh line” where a newline is directly followed by a letter. That indicates that we are seeing a type, type alias, type annotation, declaration, import, or port. Then you parse the chunks individually.

That approach could get confused by weirdly indented expressions I guess, but it seems like it’s a reasonable start at least! I also wouldn’t be to tied to trying to do it all in a single pass over the file at first. Just start by figuring out if it can get the error messages you’d like to see!

3 Likes

The approach taken by IntelliJ is two-fold:

  1. pinning: when a certain token is reached in a parse rule, commit to that rule even if subsequent tokens do not match the remaining rule
  2. recovery: when committed to a failed rule*, discard tokens until you reach some sentinel token (or arbitrary logic) which indicates a point from which you can recover.

The error recovery can be nested so you can recover from top-level declarations, as Evan described, or you can recover inside expressions, i.e. the incomplete List literal [ 1, 2,

Pinning happens when you see something that unequivocally indicates what the user intends. So if I’m writing an expression in Elm and I type [, we can be pretty sure that the user is trying to write a List literal and we commit to parsing it as a list literal at this point even though there is no matching ].

Recovery happens when you need to drop garbage tokens so that the parser can resume. If I write [1, 2, 3xyz ], the parser will fail when we reach 3xyz (because it’s not a valid expression). Assuming that we pinned (committed to parsing the list literal) when we saw the [, we can recover by dropping tokens until we consume the closing ].

Footnote *: this is a bit of a simplification. I think recovery in IntelliJ/GrammarKit used to work this way (where it would only happen when a rule failed but was pinned), but now I think the recovery is always done, even when a rule succeeds normally.

4 Likes

Elm Markup does parsing with recovery and it sounds like it takes roughly the same approach as @klazuka describes.

This module is likely relevant: Tolerant Parser though not well documented because it’s internal :sweat_smile:

3 Likes

Thanks for all of these most helpful replies.

This will work nicely for what I am trying to do. The file is a series of ‘declarations’. I can either chunk on newline followed by a letter, as you suggest. Or I could chunk on newline followed by ‘name =’, as all the declarations start that way - that would allow subsequent lines of the declaration to also be at the start of the line.

I am aiming to do the parsing on the file as it is being edited. Chunking will mean that the parsing does not have to be re-done on parts of the file already parsed. I can track where the edits are, figure out which chunks that overlaps with, and then re-chunk and parse just those.

I am trying to comprehend whether this scheme can be supported by elm/parser. I guess its going to be supported by whatever goes in the ‘context’ of a Parser.Advanced.

I think studying this will probably answer what I am wondering about above.

Beyond a first pass implementation with chunking of ‘declarations’, it would nice if I could have tolerant parsing that goes deeper into the syntax for the purpose of offering context sensitive help. For example, my ‘declarations’ can take a list of properties (name/value pairs), and the possible properties depend on which code generator the model is going to be passed to. Some properties are optional, some are required. If I can parse the property list as it is being edited but skip over any malformed parts of it, then I should be able to see what properties are already defined, and offer up suggestions for those that are not yet defined.

On the chunking - I looked into doing it with elm/regex which is easy to do:

Regex.fromString "\\n(?=\\w)"
    |> Maybe.withDefault Regex.never
    |> (\r -> Regex.split r source)

But for interactive work, I will also need to know the start and end positions of each chunk. For that reason, I think it may make sense to write the initial chunker using elm/parser.

Or perhaps I should start by splitting into individual lines, and then grouping them into chunks. That might work better, as the editor could be backed by an array-of-lines model.

I’ve been trying to collect my thoughts on this, feel free to chip in if anything chimes for you!

I think this is good advice, as recovery may need to understand the context of an error in order to be able to succesfully recover. If error reporting is already precise and giving good context, the code will likely be in a good place to implement recoveries. As I figure out the recovery strategies, I may need to re-visit context and adjust it better to that purpose.

High Level Approch

Parsing is quite hard, so is doing good errors, so is doing error recovery. It seems prudent to not try and tackle all this complexity in a single pass. Fortunately, the way elm/parser is structured supports this well, as does Matt’s TolerantParser, since Parser is simpler than Parser.Advanced is simpler that TolerantParser. So…

  1. Write a Parser that accepts the input, does not matter that error reporting is not so great. Just get the shape of the parser right.

  2. Switch up to Parser.Advanced using type alias Parser = Parser.Advanced.Parser Never Parser.Problem a to get started. Then start designing the Context paying particular attention to areas where a recovery could be possible.

  3. Look at error recovery.

Chunking, Error positions and Global Context

Since I chunked ahead of using a Parser, rather than trying to parse all in a single pass and recover to the start of chunks, each Parser will not be starting from true line 1. So I at least need to pass the start line around to add to the row.

I could do this at the end, if I am only marking positions in error, by post-processing the list of DeadEnds.** In my case, I want to record source positions all through the AST on succesful parsings too, so that errors in later stages of the compiler pipe-line can also tie back to the source accurately. (I could also post process the AST to add in the start line, but why make an extra pass if its not really needed).

At first I was trying to is Parser.Advanced.inContext to carry this global context around. Then I realised there is no getContext function, so how do I get the info back deeper in the parser? So I now think of the Parser context as a local context, and a seperate global context can just be passed around as function parameters:

type alias TopLevelContext = 
    { startLine : Int
    , ...
    }

someParser : TopLevelContext -> Parser AST
someParser global = 
    ...
    |= someChildParser global

someChildParser : TopLevelContext -> Parser AST
...

** Actually, just realized I have to do it at the end for errors, since there is no mapDeadEnds function that I could use to modify the line numbers on-the-fly. No big deal, at least I can insert the correct line numbers into the succesful AST during parsing.

I can see recovery is often likely to consist of chomping until one of some list of characters is reached. Using Elm as an example, if we have:

val = [ 1, 2, 3, ]
val = [ 1, 2, 3xy, 4]

Then chomping until ',' or ']' is hit is going to work.

Indeed, I can see in the TolerantParser, this recovery strategy is already coded:

https://github.com/mdgriffith/elm-markup/blob/3.0.1/src/Mark/Internal/TolerantParser.elm#L129

type OnError error
    = FastForwardTo (List Char)
    | Skip
    | StopWith error

Carrying the closure.

But there is another common case, that should come up often during interactive editing. That is when the list has not been closed yet. Suppose I am typing:

val = [ 1, 2, 3, 
                 ^ cursor is here

I didn’t close the list yet, and chomping for the comma or end-of-list is not going to work. An editor could theoretically want to infer type on the list by recovering to get the list [1, 2, 3] and then offer help relevant to a List Int context.

I am thinking this could be done by always pushing the closure of the current AST structure onto the context. By which I mean, if say [ is encountered, the expected closure is ]. Or ``(and)orletandin` and so on.

When an error is encountered, in addition to chomping for certain characters, could also try popping the next closure off the context stack and acting as if it was encountered, and then trying to continue from there. If that fails, could rewind and try popping another one, and so on, until none are left.


val =
    [ (1, "one"
    , (2, "two")
    ]

This doesn’t seem like it will always produce good result. I forgot the ) on the first list entry, the parser won’t notice until it hits ], at which point it will pop the ) and parse the expression as [ (1, "one", (2, "two")) ].

I don’t see a function to pop the context stack, only to push onto it with inContext, so I am not yet sure how this will work? I guess I can always push a new context onto it that is always interpreted as overriding the next one, so can modify the context in that way even without a pop operation. :thinking:

If your project also provides a source editor, then one thing you can do to mitigate the missing, closing token (e.g. ], ), }) would be to always insert a matching, closing token when the user types an opening token. This doesn’t guarantee that you’ll always have a closing token, but it will probably work most of the time.

1 Like

Yes, having the editor insert the closing ], ), } will help.

My project does include a source editor, if it did not I would not bother trying to do error recovery on the parsing. Top-level chunking is enough to restart the parser and allows multiple errors to be reported per file. Imagine for example, if Elm did not do this and only ever gave max 1 error per file, you would not get such a good insight into how many errors you have left to fix, or to choose the order in which you fix them. But for a compiler its probably enough if it can generate one error per chunk at the syntax checking stage.

My source language Salix is a data modelling language, and the compiler can feed it to a configurable set of output code generators. Each code generator can accept or require additional ‘properties’ be added to the source. For example, if I model a 1:1 relationship between 2 records and am generating the DDL for a relational database, I need to decide which table will hold the foreign key. In that case, there might be a sql.fk property to tag one of the record fields with. So beyond parsing, I want feedback from later stages of the compiler to the editor in order to populate auto completion lists with the available or required properties, and so on.

That is my motivation for exploring error recovery deeper within the AST, and not just top-level chunking. I want to succesfully parse the property list, even when it is not syntactically correct.

This illustrates the feedback loop:

Editor ---  (Source) ---> Parser --- (AST) --> Later Compiler Stages 
  ^--------------------Context Sensitive Help ------------|

This also leads me to thinking, do I need some kind of additional ‘hole’ in my AST? So where there is an error in the source, I could mark a hole that later stages of the compiler could provide auto completion suggestions for? Perhaps I don’t need to do this, as I am only interested in errors that occur where the cursor currently is, so I can figure out where in the AST the cursor is, and build the auto completion list based on that.

I think where an error occurs, and the parser chomps to recover, I need to capture the chomped string in the error (using getChompedString). I possibly need to trim off the whitespace around that chomped string too. This string and its location will then allow the editor to know what to replace with suggestions from the auto completion list.

Here’s also some research, if your keen to get into the details https://tree-sitter.github.io/tree-sitter/#underlying-research

That’s some hard-core reading material right there… thanks, I’ll give it a go.

I am wondering how much of this stuff is implementable on top of elm/parser. If that does not expose the features needed, I’ll most likely just stick with what I can do with extending elm/parser as I don’t really want to re-design the parser from scratch.

Also noting that elm/parser is a kernel package - not sure if it really needs to be, or that is just for some speed ups? But it means I can’t just clone elm/parser and make small modifications.

1 Like

So I feel like I am now making some good progress with this. I have taken Matt’s TolerantParser as a starting point, the current WIP is here:

https://github.com/the-sett/parser-recoverable

Example

Don’t get too excited, but I created a little example to start trying it out:

cd examples
elm-live src/Main.elm

Parsing Outcomes

I didn’t like that the result of run is going to yield a double wrapped Result, the outer one with DeadEnds but the inner one with just problem.

type alias Parser context problem data =
    Parser.Parser context problem (Result (List problem) data)

outcome : Result (List (DeadEnd c x)) (Result (List x) data)
outcome = Parser.Advaned.run someParser someInput

So I changed the inner error representation to be (DeadEnd c x) too. Only problem with that is that Parser.Advanced has no getContext function, so I could not give a context stack. Errors resulting from inside the extended parser will therefore have not have the full context - perhaps there is a way to do it by carrying around a second copy of the context stack with the parser? I guess I’ll try and fix that when it actually looks like it is needed.

The other thing is that I didn’t think Result was the right output for this parser to give. If the parser does recover succesfully it will still yield some AST, but it should also give errors for the bits it had to gloss over to get it. So the outcome should be more like a tuple.

I decided to use this rather than (List (DeadEnd c x), Maybe data), to avoid the case where there are no errors and no data!

{-| Describes the possible outcomes from running a parser.

    - `Success` means that the parsing completed with no syntax errors at all.
    - `Partial` means that the parsing was able to complete by recovering from
    syntax errors. The syntax errors are listed along with the parsed result.
    - `Failure` means that the parsing could not complete, so there is no parsed
    result, only a list of errors.

-}
type Outcome context problem value
    = Success value
    | Partial (List (DeadEnd context problem)) value
    | Failure (List (DeadEnd context problem))

The run function is then:

run : Parser c x a -> String -> Outcome c x a

Recovery Tactics

Similar to the TolerantParser, I defined a set of actions to guide the parser when it fails.

{-| Describes the possible ways the parser should act when it encounters
something that it cannot parse.

    - `Fail` stop parsing and return a `Failure` outcome.
    - `Warn` ignore the error, but add a problem and use a `Partial` outcome.
    - `Ignore` ignore the error and continue with a `Success` outcome.
    - `ChompForMatch` try chomping to find a matching character. If succesfull
    add a problem but continue with a `Partial` outcome. If this does not work
    then `Fail`.

-}
type RecoveryTactic x
    = Fail
    | Warn x
    | Ignore
    | ChompForMatch (List Char) x

The default behaviour is to Fail.

The current recovery tactic can be attached to a parser with this function:

withRecovery : RecoveryTactic x -> Parser c x a -> Parser c x a

The idea is that this will be passed down to all subsequent parsers (chained with ignore, keep, map, andThen, and so on), and not just for one particular token. So if parsing a List Int, but some of the Ints are expressions, the parser could keep its strategy of chomping to the end of the list or next comma, in the event that it sees a malformed expression. Not totally sure this is the right thing, but it feels right for now.

Feedback to the Editor

This is what I am going to work on next, by evolving the problem type.

When a string gets chomped to recover, I will add the start position of the string, and the string itself to the problem. The idea is that an editor can check if a problem overlaps the current cursor position, and if so, it knows what String to cut out of the source and offer context sensitive suggestions to replace it with.