Parsers with Error Recovery

The recovery tactics I implemented so far require you to add them in to exising parsing code - which is not simple to do. The idea of having error recovery handled automatically, is appealing.

So the other thing I have been looking into is error recovery in LL grammars and parser generators. I don’t think elm/parser is really restricted to LL grammars, its just that the way we most commonly use it is to write recursive descent style parsers.

A recursive descent parser using a parsing combinator library, embeds implicit knowledge about the grammar being parsed into code. An LL parser generator can generate recursive descent parsers, but can also make use of explicit knowledge about the grammar it is parsing and has been giving as input to generate a parser for. When it comes to error recovery, this makes a big difference.

This paper gives details of how to calculate for a given token and parse state, how to figure out the set of things that could come next. That means that the parser can automatically figure out where it can continue from. There is also an algorithm for repairing the input, by inserting imaginary tokens not in the input, in order to get to the continuation token, and this algorithm also always terminates. The result is an LL parser generator that can recover and repair all errors automatically. Pretty nice:

I did not realize it because I have never used Antlr (but have used StringTemplate, one of Terrence Parr’s other projects - all my StringTemplate code is now ported to Elm :slight_smile: ), but Antlr also uses this technique. Antlr is a popular LL(k) parser generator with auto error recovery. The (k) bit means it can look-ahead k tokens to figure out ambiguities in the input grammar - that is when 2 or more things start out the same way so both branches need to be explored in more depth, to figure out which one is correct.

And there is a port of it to Standard ML, which might provide some insights on how it could be written in Elm:

I think a parser generator could be written on top of elm/parser, essentially elm/parser would just do the tokenization, with the parser code generated on top of a List of Tokens, rather than parsing direct from the input String. That is, in elm/parser we use the combinators to do both the lexing and parsing, in a single step, which is quite convenient for hand-written recursive descent parsers but for a parser generator, probably conceptually easier to be thinking at the level of tokens.

This also proves my point above that elm/parser is not restricted to LL grammars - you could do an LR parser on this token list too. It is clear though that error recovery in LR parsers is more difficult than for LL.

If anyone knows of any LL parser generator proejcts implemented in functional languages (seeing something in Haskell or OCaml might be nice), do post a link.