Parsers with Error Recovery

I’ve seen this topic mentioned a few times on Slack, but thought it worth asking here so the answers are better preserved.

I want to write a parser with error recovery (for this). So as a user is inputting the source code, the parser should run and attempt to parse the input. If it fails, it should recover and continue from the next part of the source that is valid.

Has anyone any links to blogs, source code, etc on doing this in Elm? Or any thoughts or experience they might care to share?

2 Likes

The easiest way is probably to define a simple notion of “chunks” that any file can be broken into.

So with Elm code, maybe that means a chunk begins whenever you have a “fresh line” where a newline is directly followed by a letter. That indicates that we are seeing a type, type alias, type annotation, declaration, import, or port. Then you parse the chunks individually.

That approach could get confused by weirdly indented expressions I guess, but it seems like it’s a reasonable start at least! I also wouldn’t be too tied to trying to do it all in a single pass over the file at first. Just start by figuring out if it can get the error messages you’d like to see!

4 Likes

The approach taken by IntelliJ is two-fold:

  1. pinning: when a certain token is reached in a parse rule, commit to that rule even if subsequent tokens do not match the remaining rule
  2. recovery: when committed to a failed rule*, discard tokens until you reach some sentinel token (or arbitrary logic) which indicates a point from which you can recover.

The error recovery can be nested, so you can recover at top-level declarations, as Evan described, or you can recover inside expressions, e.g. the incomplete List literal [ 1, 2,

Pinning happens when you see something that unequivocally indicates what the user intends. So if I’m writing an expression in Elm and I type [, we can be pretty sure that the user is trying to write a List literal and we commit to parsing it as a list literal at this point even though there is no matching ].

Recovery happens when you need to drop garbage tokens so that the parser can resume. If I write [1, 2, 3xyz ], the parser will fail when we reach 3xyz (because it’s not a valid expression). Assuming that we pinned (committed to parsing the list literal) when we saw the [, we can recover by dropping tokens until we consume the closing ].
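In elm/parser terms, a minimal sketch of that pin-then-recover step might look like the following (recoverTo and the fallback value are my own hypothetical names, not part of elm/parser):

import Parser exposing ((|.), Parser)

-- If `parser` fails, skip ahead past the next sentinel character and
-- succeed with a fallback value instead, so parsing can continue.
recoverTo : List Char -> a -> Parser a -> Parser a
recoverTo sentinels fallback parser =
    Parser.oneOf
        [ Parser.backtrackable parser
        , Parser.succeed fallback
            |. Parser.chompWhile (\c -> not (List.member c sentinels))
            |. Parser.chompIf (\c -> List.member c sentinels)

        -- chompIf consumes the sentinel itself; this branch fails if
        -- end of input is reached before a sentinel is found.
        ]

Discarding the skipped characters loses information, so a real implementation would also record what was dropped as a problem.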

Footnote *: this is a bit of a simplification. I think recovery in IntelliJ/GrammarKit used to work this way (where it would only happen when a rule failed but was pinned), but now I think the recovery is always done, even when a rule succeeds normally.

4 Likes

Elm Markup does parsing with recovery and it sounds like it takes roughly the same approach as @klazuka describes.

This module is likely relevant: Tolerant Parser, though it is not well documented because it’s internal :sweat_smile:

3 Likes

Thanks for all of these most helpful replies.

This will work nicely for what I am trying to do. The file is a series of ‘declarations’. I can either chunk on a newline followed by a letter, as you suggest, or chunk on a newline followed by ‘name =’, as all the declarations start that way - that would allow subsequent lines of a declaration to also sit at the start of a line.

I am aiming to do the parsing on the file as it is being edited. Chunking will mean that the parsing does not have to be re-done on parts of the file already parsed. I can track where the edits are, figure out which chunks they overlap with, and then re-chunk and parse just those.

I am trying to work out whether this scheme can be supported by elm/parser. I guess it’s going to be supported by whatever goes in the ‘context’ of a Parser.Advanced.

I think studying this will probably answer what I am wondering about above.

Beyond a first pass implementation with chunking of ‘declarations’, it would be nice if I could have tolerant parsing that goes deeper into the syntax, for the purpose of offering context sensitive help. For example, my ‘declarations’ can take a list of properties (name/value pairs), and the possible properties depend on which code generator the model is going to be passed to. Some properties are optional, some are required. If I can parse the property list as it is being edited, but skip over any malformed parts of it, then I should be able to see what properties are already defined, and offer up suggestions for those that are not yet defined.

On the chunking - I looked into doing it with elm/regex which is easy to do:

Regex.fromString "\\n(?=\\w)"
    |> Maybe.withDefault Regex.never
    |> (\r -> Regex.split r source)

But for interactive work, I will also need to know the start and end positions of each chunk. For that reason, I think it may make sense to write the initial chunker using elm/parser.

Or perhaps I should start by splitting into individual lines, and then grouping them into chunks. That might work better, as the editor could be backed by an array-of-lines model.
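As a rough sketch of that line-grouping idea (the Chunk type is an assumption of mine, and the “fresh line” test follows Evan’s suggestion above):

type alias Chunk =
    { startLine : Int
    , lines : List String
    }

-- Group source lines into chunks, starting a new chunk whenever a line
-- begins with a letter (a "fresh line"). Line numbers are 0-based here.
chunk : List String -> List Chunk
chunk sourceLines =
    let
        step ( lineNo, line ) chunks =
            case ( String.uncons line, chunks ) of
                ( Just ( c, _ ), _ ) ->
                    if Char.isAlpha c then
                        { startLine = lineNo, lines = [ line ] } :: chunks

                    else
                        addToCurrent lineNo line chunks

                ( Nothing, _ ) ->
                    addToCurrent lineNo line chunks

        addToCurrent lineNo line chunks =
            case chunks of
                current :: rest ->
                    { current | lines = line :: current.lines } :: rest

                [] ->
                    [ { startLine = lineNo, lines = [ line ] } ]
    in
    List.indexedMap Tuple.pair sourceLines
        |> List.foldl step []
        |> List.map (\ch -> { ch | lines = List.reverse ch.lines })
        |> List.reverse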

I’ve been trying to collect my thoughts on this, feel free to chip in if anything chimes for you!

I think this is good advice, as recovery may need to understand the context of an error in order to be able to successfully recover. If error reporting is already precise and giving good context, the code will likely be in a good place to implement recoveries. As I figure out the recovery strategies, I may need to re-visit context and adjust it better to that purpose.

High Level Approach

Parsing is quite hard; so is producing good errors; so is error recovery. It seems prudent not to try to tackle all this complexity in a single pass. Fortunately, the way elm/parser is structured supports this well, as does Matt’s TolerantParser, since Parser is simpler than Parser.Advanced, which is simpler than TolerantParser. So…

  1. Write a Parser that accepts the input; it does not matter that the error reporting is not so great. Just get the shape of the parser right.

  2. Switch up to Parser.Advanced, using type alias Parser a = Parser.Advanced.Parser Never Parser.Problem a to get started. Then start designing the Context, paying particular attention to areas where a recovery could be possible.

  3. Look at error recovery.

Chunking, Error Positions and Global Context

Since I chunk ahead of running a Parser, rather than trying to parse everything in a single pass and recover to the start of chunks, each Parser will not be starting from true line 1. So I at least need to pass the chunk’s start line around, to add to the row.

I could do this at the end, if I am only marking positions in errors, by post-processing the list of DeadEnds.** In my case, I want to record source positions all through the AST on successful parses too, so that errors in later stages of the compiler pipeline can also tie back to the source accurately. (I could also post-process the AST to add in the start line, but why make an extra pass if it’s not really needed.)

At first I was trying to use Parser.Advanced.inContext to carry this global context around. Then I realised there is no getContext function, so how do I get the info back deeper in the parser? So I now think of the Parser context as a local context, and a separate global context can just be passed around as function parameters:

type alias TopLevelContext = 
    { startLine : Int
    , ...
    }

someParser : TopLevelContext -> Parser AST
someParser global = 
    ...
    |= someChildParser global

someChildParser : TopLevelContext -> Parser AST
...

** Actually, I just realized I have to do it at the end for errors, since there is no mapDeadEnds function that I could use to modify the line numbers on-the-fly. No big deal; at least I can insert the correct line numbers into the successful AST during parsing.
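The post-processing for errors could then be a small helper over Parser.Advanced.DeadEnd, something like this sketch (startLine being the 1-based line the chunk begins on):

import Parser.Advanced exposing (DeadEnd)

-- Shift chunk-relative rows to whole-file rows, both on the errors and
-- on the rows recorded in each error's context stack.
offsetDeadEnds : Int -> List (DeadEnd c x) -> List (DeadEnd c x)
offsetDeadEnds startLine deadEnds =
    List.map
        (\deadEnd ->
            { deadEnd
                | row = deadEnd.row + startLine - 1
                , contextStack =
                    List.map
                        (\frame -> { frame | row = frame.row + startLine - 1 })
                        deadEnd.contextStack
            }
        )
        deadEnds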

I can see recovery is often likely to consist of chomping until one of some list of characters is reached. Using Elm as an example, if we have:

val = [ 1, 2, 3, ]
val = [ 1, 2, 3xy, 4]

Then chomping until ',' or ']' is hit is going to work.

Indeed, I can see in the TolerantParser, this recovery strategy is already coded:

https://github.com/mdgriffith/elm-markup/blob/3.0.1/src/Mark/Internal/TolerantParser.elm#L129

type OnError error
    = FastForwardTo (List Char)
    | Skip
    | StopWith error

Carrying the Closure

But there is another common case that should come up often during interactive editing: when the list has not been closed yet. Suppose I am typing:

val = [ 1, 2, 3, 
                 ^ cursor is here

I didn’t close the list yet, and chomping for the comma or end-of-list is not going to work. An editor could theoretically want to infer the type of the list by recovering to get the list [1, 2, 3], and then offer help relevant to a List Int context.

I am thinking this could be done by always pushing the closure of the current AST structure onto the context. By which I mean: if, say, [ is encountered, the expected closure is ]. Or ( and ), or let and in, and so on.

When an error is encountered, in addition to chomping for certain characters, the parser could also try popping the next closure off the context stack and acting as if it had been encountered, then trying to continue from there. If that fails, it could rewind and try popping another one, and so on, until none are left.
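A sketch of what carrying and popping that closure stack might look like, with the stack threaded as an ordinary function argument, since inContext cannot be read back (all names here are hypothetical):

import Parser exposing (Parser)

type Closure
    = CloseSquare -- a pending ']'
    | CloseParen -- a pending ')'
    | LetIn -- a pending `in`

closingString : Closure -> String
closingString closure =
    case closure of
        CloseSquare ->
            "]"

        CloseParen ->
            ")"

        LetIn ->
            "in"

-- On an error, act as if the innermost expected closing token had been
-- seen: consume nothing, succeed, and hand back the popped stack.
popClosure : List Closure -> a -> Parser ( a, List Closure )
popClosure closures fallback =
    case closures of
        [] ->
            Parser.problem "no closure left to pop"

        _ :: rest ->
            Parser.succeed ( fallback, rest )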


val =
    [ (1, "one"
    , (2, "two")
    ]

This doesn’t seem like it will always produce good results. I forgot the ) on the first list entry; the parser won’t notice until it hits ], at which point it will pop the ) and parse the expression as [ (1, "one", (2, "two")) ].

I don’t see a function to pop the context stack, only to push onto it with inContext, so I am not yet sure how this will work. I guess I can always push a new context that is interpreted as overriding the next one, so I can modify the context in that way even without a pop operation. :thinking:

If your project also provides a source editor, then one thing you can do to mitigate the missing, closing token (e.g. ], ), }) would be to always insert a matching, closing token when the user types an opening token. This doesn’t guarantee that you’ll always have a closing token, but it will probably work most of the time.

1 Like

Yes, having the editor insert the closing ], ), } will help.

My project does include a source editor; if it did not, I would not bother trying to do error recovery on the parsing. Top-level chunking is enough to restart the parser and allows multiple errors to be reported per file. Imagine, for example, if Elm did not do this and only ever gave at most one error per file: you would not get such good insight into how many errors remain to fix, or be able to choose the order in which to fix them. But for a compiler, it’s probably enough if it can generate one error per chunk at the syntax checking stage.

My source language Salix is a data modelling language, and the compiler can feed it to a configurable set of output code generators. Each code generator can accept or require additional ‘properties’ be added to the source. For example, if I model a 1:1 relationship between 2 records and am generating the DDL for a relational database, I need to decide which table will hold the foreign key. In that case, there might be a sql.fk property to tag one of the record fields with. So beyond parsing, I want feedback from later stages of the compiler to the editor in order to populate auto completion lists with the available or required properties, and so on.

That is my motivation for exploring error recovery deeper within the AST, and not just top-level chunking. I want to successfully parse the property list, even when it is not syntactically correct.

This illustrates the feedback loop:

Editor ---  (Source) ---> Parser --- (AST) --> Later Compiler Stages 
  ^--------------------Context Sensitive Help ------------|

This also leads me to thinking, do I need some kind of additional ‘hole’ in my AST? So where there is an error in the source, I could mark a hole that later stages of the compiler could provide auto completion suggestions for? Perhaps I don’t need to do this, as I am only interested in errors that occur where the cursor currently is, so I can figure out where in the AST the cursor is, and build the auto completion list based on that.

I think where an error occurs, and the parser chomps to recover, I need to capture the chomped string in the error (using getChompedString). I possibly need to trim off the whitespace around that chomped string too. This string and its location will then allow the editor to know what to replace with suggestions from the auto completion list.
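A sketch of capturing that information (Recovered is a hypothetical problem constructor):

import Parser exposing ((|=), Parser)

type Problem
    = Recovered ( Int, Int ) String -- (row, col) where skipping began, plus the skipped text

-- Chomp up to (but not past) any sentinel character, recording where
-- the skipped text started, so the editor knows what to replace.
skipToSentinel : List Char -> Parser Problem
skipToSentinel sentinels =
    Parser.succeed (\start skipped -> Recovered start (String.trim skipped))
        |= Parser.getPosition
        |= Parser.getChompedString
            (Parser.chompWhile (\c -> not (List.member c sentinels)))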

Here’s also some research, if you’re keen to get into the details: https://tree-sitter.github.io/tree-sitter/#underlying-research

That’s some hard-core reading material right there… thanks, I’ll give it a go.

I am wondering how much of this stuff is implementable on top of elm/parser. If it does not expose the features needed, I’ll most likely just stick with what I can do by extending elm/parser, as I don’t really want to re-design the parser from scratch.

Also noting that elm/parser is a kernel package - not sure if it really needs to be, or whether that is just for some speed-ups. But it means I can’t just clone elm/parser and make small modifications.

1 Like

So I feel like I am now making some good progress with this. I have taken Matt’s TolerantParser as a starting point, the current WIP is here:

https://github.com/the-sett/parser-recoverable

Example

Don’t get too excited, but I created a little example to start trying it out:

cd examples
elm-live src/Main.elm

Parsing Outcomes

I didn’t like that the result of run is going to yield a double-wrapped Result, the outer one with DeadEnds but the inner one with just a list of problems.

type alias Parser context problem data =
    Parser.Parser context problem (Result (List problem) data)

outcome : Result (List (DeadEnd c x)) (Result (List x) data)
outcome = Parser.Advanced.run someParser someInput

So I changed the inner error representation to be (DeadEnd c x) too. The only problem with that is that Parser.Advanced has no getContext function, so I could not give a context stack. Errors raised inside the extended parser will therefore not have the full context - perhaps there is a way to do it by carrying around a second copy of the context stack with the parser? I guess I’ll try to fix that when it actually looks like it is needed.

The other thing is that I didn’t think Result was the right output for this parser to give. If the parser does recover successfully, it will still yield some AST, but it should also give errors for the bits it had to gloss over to get it. So the outcome should be more like a tuple.

I decided to use this rather than (List (DeadEnd c x), Maybe data), to avoid the case where there are no errors and no data!

{-| Describes the possible outcomes from running a parser.

    - `Success` means that the parsing completed with no syntax errors at all.
    - `Partial` means that the parsing was able to complete by recovering from
    syntax errors. The syntax errors are listed along with the parsed result.
    - `Failure` means that the parsing could not complete, so there is no parsed
    result, only a list of errors.

-}
type Outcome context problem value
    = Success value
    | Partial (List (DeadEnd context problem)) value
    | Failure (List (DeadEnd context problem))

The run function is then:

run : Parser c x a -> String -> Outcome c x a
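A caller can then pattern match on the three cases; for example, a (hypothetical) editor integration might summarise an outcome like this:

describe : Outcome c x a -> String
describe outcome =
    case outcome of
        Success _ ->
            "parsed cleanly"

        Partial problems _ ->
            "recovered from " ++ String.fromInt (List.length problems) ++ " errors"

        Failure _ ->
            "could not parse"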

Recovery Tactics

Similar to the TolerantParser, I defined a set of actions to guide the parser when it fails.

{-| Describes the possible ways the parser should act when it encounters
something that it cannot parse.

    - `Fail` stop parsing and return a `Failure` outcome.
    - `Warn` ignore the error, but add a problem and use a `Partial` outcome.
    - `Ignore` ignore the error and continue with a `Success` outcome.
    - `ChompForMatch` try chomping to find a matching character. If successful,
    add a problem but continue with a `Partial` outcome. If this does not work,
    then `Fail`.

-}
type RecoveryTactic x
    = Fail
    | Warn x
    | Ignore
    | ChompForMatch (List Char) x

The default behaviour is to Fail.

The current recovery tactic can be attached to a parser with this function:

withRecovery : RecoveryTactic x -> Parser c x a -> Parser c x a

The idea is that this will be passed down to all subsequent parsers (chained with ignore, keep, map, andThen, and so on), and not just apply to one particular token. So if parsing a List Int where some of the Ints are expressions, the parser could keep its strategy of chomping to the end of the list or to the next comma, in the event that it sees a malformed expression. Not totally sure this is the right thing, but it feels right for now.

Feedback to the Editor

This is what I am going to work on next, by evolving the problem type.

When a string gets chomped to recover, I will add the start position of the string, and the string itself to the problem. The idea is that an editor can check if a problem overlaps the current cursor position, and if so, it knows what String to cut out of the source and offer context sensitive suggestions to replace it with.

I found some time at the weekend to look through these. The parsing papers are not really much use when it comes to Elm, as they are all about LR parsers - tree-sitter is presumably LR-parser based. The Elm parser is a recursive descent parser, and so is for LL grammars.

This is why it has backtrackable, for situations where you need to parse ahead, realize you are on the wrong branch, and then back up and try a different one. For example, in C you might have a function signature or implementation:

// Signature
void f(int, const char*);

// Implementation
void f(int, const char*)
{
}

You don’t know which you are getting until you hit the ; or {. With an LL parser you either re-design the grammar to be more LL friendly (that is, make the void f(int, const char*) bit a production of the grammar that forms part of either a signature or an implementation), or look ahead.
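With elm/parser, the look-ahead route means marking the common prefix backtrackable. A sketch, with hypothetical CDecl and prototype (the body is just skipped here):

import Parser exposing ((|.), (|=), Parser)

type CDecl
    = Signature String
    | Implementation String

-- Both branches start with the same prototype, so the first is marked
-- backtrackable: if the `;` never appears, rewind and try the body form.
cDeclaration : Parser CDecl
cDeclaration =
    Parser.oneOf
        [ Parser.backtrackable
            (Parser.succeed Signature
                |= prototype
                |. Parser.symbol ";"
            )
        , Parser.succeed Implementation
            |= prototype
            |. Parser.symbol "{"
            |. Parser.chompUntil "}" -- skip the body for this sketch
        ]

-- Capture the `void f(int, const char*)` prefix as a raw string.
prototype : Parser String
prototype =
    Parser.getChompedString
        (Parser.chompWhile (\c -> c /= ';' && c /= '{'))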

Googling for “error recovery in LL parser” or “error recovery in recursive descent parser” does turn up some relevant stuff.

From the tree-sitter stuff though, this paper is quite fascinating on the topic of incremental tooling: https://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/CSD-97-946.pdf. One thing to note is that it was written over 20 years ago, when computers were a lot slower; I think a simple chunking strategy might keep the problem small enough with the amount of grunt we have at our disposal these days. Also, there seems to be an obsession with reaching the milestone of being fast enough to re-parse on every keystroke. A significant achievement, but I think it’s not really necessary, as most editors implement a delay before offering auto-suggestions anyway, so it is fine to do it a few hundred ms after the last keystroke.

1 Like

Anyway, I think I have now got this working how I want it to. I have played around with various ideas, and the main insight I have had is that the recovery tactic should not be associated with tokens. The aim is not to error correct individual tokens, but simply to get the parser back to a place where it can continue, whilst taking note of the problem that did occur.

Originally, I had the idea of passing down the error handler on each Parser building block, in a similar way to how inContext works in Parser.Advanced. So I had:

type Parser context problem value
    = Parser
        (RecoveryTactic problem
         ->
            { pa : PA.Parser context problem (Outcome context problem value)
            , onError : RecoveryTactic problem
            }
        )

The problem with this is that the recovery tactic would often be used in the wrong situation. If parsing a list of integers, with a contribution from :cat: say:

[1sfd, 2, 3, 4]

That would fail on parsing 1sfd as an Int, but if recovering by skipping ahead to , and then re-trying the int parser, that is also going to fail, because there is whitespace after the comma, not an int. The recovery tactic needs to be put across a larger piece of the parser, which will be re-tried in its entirety:

(PR.succeed identity
    |> PR.ignore PR.spaces
    |> PR.keep (PR.int ExpectingInt InvalidNumber)
    |> PR.ignore PR.spaces
    |> PR.ignore (PR.symbol "," ExpectingComma)
)
    |> PR.forwardThenRetry [ ',' ] ExpectingComma Recovered

So I was able to get rid of the complicated context-passing error-handling mechanism, and just have the Parser type like this:

type alias Parser context problem value =
    PA.Parser context problem (Outcome context problem value)

The recovery tactic described in various papers is to first back up to a known start symbol, then scan ahead to a sentinel symbol, and try to continue after that. This is implemented as:

https://github.com/the-sett/parser-recoverable/blob/master/src/Parser/Recoverable.elm#L608

And can be summarised by this pseudo-code:

forwardThenRetry parser =
    loop, starting with an empty warning list []
        oneOf
            [ try backtrackable parser
                |> ensure any warnings from previous loop iterations are kept
            , scan ahead for a sentinel token
                |> if no characters were consumed then
                       fail
                   else
                       try again
                           |> adding a warning about what was skipped over
            ]
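Translated into plain elm/parser (a simplified sketch; the real Parser.Recoverable implementation also threads Outcomes and proper problems rather than plain strings):

import Parser exposing ((|.), (|=), Parser)

-- Try the parser; on failure, skip past the next sentinel and retry,
-- keeping the skipped text as warnings. Fails if nothing was consumed.
forwardThenRetry : List Char -> Parser a -> Parser ( a, List String )
forwardThenRetry sentinels parser =
    Parser.loop [] (retryStep sentinels parser)

retryStep :
    List Char
    -> Parser a
    -> List String
    -> Parser (Parser.Step (List String) ( a, List String ))
retryStep sentinels parser warnings =
    Parser.oneOf
        [ Parser.succeed (\val -> Parser.Done ( val, List.reverse warnings ))
            |= Parser.backtrackable parser
        , Parser.getChompedString
            (Parser.chompWhile (\c -> not (List.member c sentinels))
                |. Parser.oneOf
                    [ Parser.chompIf (\c -> List.member c sentinels)
                    , Parser.succeed ()
                    ]
            )
            |> Parser.andThen
                (\skipped ->
                    if String.isEmpty skipped then
                        Parser.problem "cannot recover"

                    else
                        Parser.succeed (Parser.Loop (skipped :: warnings))
                )
        ]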

Some tidying up and documentation and I will put it out as a new package.

2 Likes

Very glad to have found this discussion. Super interesting and helpful.

I use a very simple chunking approach, “logical paragraphs” in MiniLaTeX, so that the parser can keep going and give the user as much rendered output as possible.

A logical paragraph is either an ordinary paragraph (blank lines above and below) or an outer \begin ... \end block in LaTeX. This approach confines parse failures to logical paragraphs, which helps a lot, but is not sufficiently fine-grained. With one exception (below), this means that in most cases at most one paragraph is not properly rendered.

The biggest problem with the logical paragraph approach as it stands is that if the user types \begin{whatever}, the text will generate an unpleasant error until the matching \end{whatever} is typed: all the text after the \begin{whatever} is seriously messed up/missing in the rendered output. I had implemented a solution like the one @klazuka suggested above … have the editing system supply the \end{whatever}. However, I had to put that aside for the time being because of the jumping-cursor problem with text edit fields. Bummer! As soon as I get elm-editor in good enough shape to integrate with the app, it will be possible to use this fix.

I hope to understand the above discussion so as to do more fine-grained recovery.

@rupert, I just cloned your parser-recoverable repo and tried the Array example. Beautiful! Am going to study your code and see if I can use it to improve the MiniLaTeX parser. Should give better results than what I have now (chunking as above + error messages using Parser.Advanced).

I should be able to introduce parser-recoverable gradually, so as not to have to rewrite the whole parser before seeing benefits from it.

I tried to keep the API identical to Parser.Advanced, so switching over to it should be relatively easy. All the functions should behave the same way as the corresponding functions in Parser.Advanced. The behaviour should only change when you apply one of the recovery tactics.

I cannot export (|.) and (|=) though, due to kernel code restrictions :-(. So you have to re-write those:

import Parser.Advanced as PA exposing ((|=), (|.))

parser = 
    PA.succeed identity
        |. PA.spaces
        |= PA.int ExpectingInt InvalidNumber
        |. PA.spaces

to

import Parser.Recoverable as PR

parser = 
    PR.succeed identity
        |> PR.ignore PR.spaces
        |> PR.keep (PR.int ExpectingInt InvalidNumber)
        |> PR.ignore PR.spaces

Another difference is that where functions like Parser.Advanced.keyword take a Token argument, I thought just passing in the String and problem without wrapping as a Token was neater:

PA.keyword (PA.Token "let" ExpectingLet)

PR.keyword "let" ExpectingLet

I’ll likely make a release soon - I’m trying to figure out how to make a recoverable version of sequence first. The fast-forward recovery at the moment fast forwards to a sentinel Char, then continues. But sequence uses Strings as separators, not Chars, so I need to implement a slightly different fast-forward mechanism; then it will be good to go. I’ll just push it out with the recovery tactics I have implemented so far. It will be interesting to hear if they work for you, or if you find they need to be modified to cover situations you are encountering.
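For what it’s worth, plain elm/parser’s chompUntil already scans for a String sentinel, though it consumes through the match, so a String-based fast forward could be sketched as:

import Parser exposing (Parser)

-- Fast forward to a String sentinel, returning the skipped text
-- (including the sentinel). chompUntil fails if it is never found.
fastForwardPast : String -> Parser String
fastForwardPast sentinel =
    Parser.getChompedString (Parser.chompUntil sentinel)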

Currently trying to write a recovery tactic for sequences, which is proving quite tricky.

Parser.Advanced.sequence lets you specify how you want trailing separators handled, with its Forbidden, Optional and Mandatory choices. A recoverable parser has to deal with missing values in the sequence, including at the very end, which is going to interact with the trailing separator choice. I suppose I just need to enumerate all the possibilities and figure out what the outcome should be for each.

For example:

For a sequence like [1,], which is supposed to be an Elm-style List of Ints:

Forbidden trailing , - so the error is ‘skipped ahead to the end token’, with the Partial result [1].
… other options

… other error sequences that are possible

The recovery tactics I have implemented so far require you to add them into existing parsing code - which is not simple to do. The idea of having error recovery handled automatically is appealing.

So the other thing I have been looking into is error recovery in LL grammars and parser generators. I don’t think elm/parser is really restricted to LL grammars; it’s just that the way we most commonly use it is to write recursive descent style parsers.

A recursive descent parser using a parser combinator library embeds implicit knowledge about the grammar being parsed into code. An LL parser generator can also generate recursive descent parsers, but it can make use of explicit knowledge about the grammar it has been given as input. When it comes to error recovery, this makes a big difference.

This paper gives details of how to calculate, for a given token and parse state, the set of things that could come next. That means the parser can automatically figure out where it can continue from. There is also an algorithm for repairing the input, by inserting imaginary tokens not in the input, in order to get to the continuation token, and this algorithm always terminates. The result is an LL parser generator that can recover from and repair all errors automatically. Pretty nice:

I did not realize it, because I have never used Antlr (but I have used StringTemplate, one of Terence Parr’s other projects - all my StringTemplate code is now ported to Elm :slight_smile: ), but Antlr also uses this technique. Antlr is a popular LL(k) parser generator with automatic error recovery. The (k) bit means it can look ahead k tokens to resolve ambiguities in the input grammar - that is, when 2 or more things start out the same way, so both branches need to be explored in more depth to figure out which one is correct.

And there is a port of it to Standard ML, which might provide some insights on how it could be written in Elm:

I think a parser generator could be written on top of elm/parser: essentially elm/parser would just do the tokenization, with the generated parser running over a List of Tokens, rather than parsing directly from the input String. That is, in elm/parser we use the combinators to do both the lexing and the parsing in a single step, which is quite convenient for hand-written recursive descent parsers, but for a parser generator it is probably conceptually easier to think at the level of tokens.
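A sketch of that split, with a made-up Token type and elm/parser doing only the lexing:

import Parser exposing ((|.), (|=), Parser)
import Set

type Token
    = TInt Int
    | TIdent String
    | TSymbol String

-- Stage 1: lex the input into a token list with elm/parser. A generated
-- parser would then run over `List Token` rather than the raw String.
tokenize : Parser (List Token)
tokenize =
    Parser.succeed identity
        |. Parser.spaces
        |= Parser.loop [] tokenizeHelp

tokenizeHelp : List Token -> Parser (Parser.Step (List Token) (List Token))
tokenizeHelp tokens =
    Parser.oneOf
        [ Parser.succeed (Parser.Done (List.reverse tokens))
            |. Parser.end
        , Parser.succeed (\tok -> Parser.Loop (tok :: tokens))
            |= Parser.oneOf
                [ Parser.map TInt Parser.int
                , Parser.map TIdent
                    (Parser.variable
                        { start = Char.isAlpha
                        , inner = Char.isAlphaNum
                        , reserved = Set.empty
                        }
                    )
                , Parser.map TSymbol
                    (Parser.getChompedString
                        (Parser.chompIf (\c -> List.member c [ '[', ']', ',' ]))
                    )
                ]
            |. Parser.spaces
        ]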

This also proves my point above that elm/parser is not restricted to LL grammars - you could do an LR parser on this token list too. It is clear though that error recovery in LR parsers is more difficult than for LL.

If anyone knows of any LL parser generator projects implemented in functional languages (seeing something in Haskell or OCaml might be nice), do post a link.