Parsing puzzle, avoid look-ahead?

Here’s a little parsing problem I ran into while writing a parser for EDN (planning to announce that package soon). I used elm-tools/parser, but failed to find a good solution without adding a look-ahead primitive. I’m curious whether there’s a nice way to solve this with delayedCommit, or in some other way?

Here’s a reduced version: We want to parse a nested datastructure of Lisp-like lists and integers, where the list parentheses are “self-delimited” for lack of a better term:

data Thing = Number Int | Things (List Thing)

(1 2 3) == ( 1  2 3 )  --> Things [ Number 1, Number 2, Number 3 ]
((1) 2) == ( (1)2 )    --> Things [ Things [ Number 1 ], Number 2 ]
(()1()) == ( () 1 () ) --> Things [ Things [], 1, Things [] ]

With look-ahead, we can make a parser for numbers that ensures the number is delimited:

import Parser as P exposing ((|.), (|=), Parser)

-- run a parser, then rewind input
lookAhead : Parser a -> Parser a

sep : Parser ()
sep = P.oneOf
    [ P.ignore P.oneOrMore (\c -> c == ' ')
    , lookAhead (P.oneOf [P.symbol "(", P.symbol ")"])
    ]

number : Parser Thing
number = P.succeed Number |= P.int |. sep

and put the whole thing together with a list parser:

thing : Parser Thing
thing = P.oneOf [number, things]

things : Parser Thing
things = P.succeed Things
    |. P.symbol "("
    |= P.repeat P.zeroOrMore thing
    |. P.symbol ")"

(This doesn’t quite work, since it doesn’t eat all the optional whitespace this way, and lacks some lazy. Here’s a complete version.)

I think the core issue I’m running into is that I need the closing parenthesis both to terminate the number, and to terminate the list. So trying to do this without look-ahead I found my number parser had to return both the number and the closing token, which made things … messy.

I hope I haven’t broken the problem down too far to illustrate the issue! And am very curious if you have some suggestions for how to tackle this.

I don’t know EDN so I’m most likely missing something. It’s not clear to me - from the example inputs - why the look-ahead is needed. Is something like this unable to parse certain “good” inputs (or does it give false positives, perhaps?)

1 Like

Thank you Ilias, this certainly helps me narrow it down! And yes, that’s a correct parser, and I was kind of afraid this would happen. :slight_smile: I do think I broke it down too far, but this might turn out into a moving goal-posts kind of thing.

I’ll give it another attempt for the full language and report back.

1 Like

I’ve made some progress.

  1. Parser.int is sneaky! It actually does look-ahead under the hood, and will fail on e.g. “1a” but not “1(”. Parsing numbers using P.keep P.oneOrMore Char.isDigit gets us closer. (But that still works fine on my proposed simple language.)
  2. To show the issue, I need two different token types that we require to be separated by whitespace, but that would parse separately just fine without. So I’ll add a token consisting of lower-case letters. So things like (hello 1(2 world) bye ) are in, but (hello1 2bye) are not.

I’ve updated the Ellie to illustrate this.

Right, yeah. The way I usually see this done is to write something custom for repeat (or use the list thing in Parser.LanguageKit. Basically what you’d do is parse the first thing, then require whitespace for subsequent things.

Now, the annoying thing in this case is that whitespace requirement depends on what the previous and next thing are. With some minor cleverness, we can encode that: check what the last thing parsed was, and only require whitespace if it was not a list and the next thing is not a list.

I threw together an example though there is probably some room for improvement, there. Hope that helps!

1 Like

This is a good tutorial on building parsers with combinators.

It’s in F# as opposed to Elm but the ideas all flow from the same source. And here is a post about building a Scheme parser in F#:

Mark

1 Like

Mark, thanks, but I’m not sure how this helps with the concrete issue here? The scheme parsing example in particular requires space separators everywhere.

I’m struggling with elm-tools/parser's lack of an explicit look-ahead combinator. Compare e.g. the remarks on look-ahead in this article on Parsec.

Yes I think this is getting to the core of it, thank you! Your solution keeps things nicely to the list parser and doesn’t touch the other token parsers, which I like a lot. I wasn’t sure how to do that.

1 Like