Parsing puzzle, avoid look-ahead?

rob · February 11, 2018, 1:44pm

Here’s a little parsing problem I ran into while writing a parser for EDN (planning to announce that package soon). I used elm-tools/parser, but failed to find a good solution without adding a look-ahead primitive. I’m curious whether there’s a nice way to solve this with delayedCommit, or in some other way?

Here’s a reduced version: We want to parse a nested datastructure of Lisp-like lists and integers, where the list parentheses are “self-delimited” for lack of a better term:

data Thing = Number Int | Things (List Thing)

(1 2 3) == ( 1  2 3 )  --> Things [ Number 1, Number 2, Number 3 ]
((1) 2) == ( (1)2 )    --> Things [ Things [ Number 1 ], Number 2 ]
(()1()) == ( () 1 () ) --> Things [ Things [], 1, Things [] ]

With look-ahead, we can make a parser for numbers that ensures the number is delimited:

import Parser as P exposing ((|.), (|=), Parser)

-- run a parser, then rewind input
lookAhead : Parser a -> Parser a

sep : Parser ()
sep = P.oneOf
    [ P.ignore P.oneOrMore (\c -> c == ' ')
    , lookAhead (P.oneOf [P.symbol "(", P.symbol ")"])
    ]

number : Parser Thing
number = P.succeed Number |= P.int |. sep

and put the whole thing together with a list parser:

thing : Parser Thing
thing = P.oneOf [number, things]

things : Parser Thing
things = P.succeed Things
    |. P.symbol "("
    |= P.repeat P.zeroOrMore thing
    |. P.symbol ")"

(This doesn’t quite work, since it doesn’t eat all the optional whitespace this way, and lacks some lazy. Here’s a complete version.)

I think the core issue I’m running into is that I need the closing parenthesis both to terminate the number, and to terminate the list. So trying to do this without look-ahead I found my number parser had to return both the number and the closing token, which made things … messy.

I hope I haven’t broken the problem down too far to illustrate the issue! And am very curious if you have some suggestions for how to tackle this.

ilias · February 11, 2018, 7:31pm

I don’t know EDN so I’m most likely missing something. It’s not clear to me - from the example inputs - why the look-ahead is needed. Is something like this unable to parse certain “good” inputs (or does it give false positives, perhaps?)

rob · February 12, 2018, 8:48am

Thank you Ilias, this certainly helps me narrow it down! And yes, that’s a correct parser, and I was kind of afraid this would happen. I do think I broke it down too far, but this might turn out into a moving goal-posts kind of thing.

I’ll give it another attempt for the full language and report back.

rob · February 12, 2018, 10:05am

I’ve made some progress.

Parser.int is sneaky! It actually does look-ahead under the hood, and will fail on e.g. “1a” but not “1(”. Parsing numbers using P.keep P.oneOrMore Char.isDigit gets us closer. (But that still works fine on my proposed simple language.)
To show the issue, I need two different token types that we require to be separated by whitespace, but that would parse separately just fine without. So I’ll add a token consisting of lower-case letters. So things like (hello 1(2 world) bye ) are in, but (hello1 2bye) are not.

I’ve updated the Ellie to illustrate this.

ilias · February 12, 2018, 10:54am

Right, yeah. The way I usually see this done is to write something custom for repeat (or use the list thing in Parser.LanguageKit. Basically what you’d do is parse the first thing, then require whitespace for subsequent things.

Now, the annoying thing in this case is that whitespace requirement depends on what the previous and next thing are. With some minor cleverness, we can encode that: check what the last thing parsed was, and only require whitespace if it was not a list and the next thing is not a list.

I threw together an example though there is probably some room for improvement, there. Hope that helps!

MarkHamburg · February 12, 2018, 12:44pm

This is a good tutorial on building parsers with combinators.

It’s in F# as opposed to Elm but the ideas all flow from the same source. And here is a post about building a Scheme parser in F#:

Mark

rob · February 12, 2018, 1:18pm

Mark, thanks, but I’m not sure how this helps with the concrete issue here? The scheme parsing example in particular requires space separators everywhere.

I’m struggling with elm-tools/parser's lack of an explicit look-ahead combinator. Compare e.g. the remarks on look-ahead in this article on Parsec.

rob · February 12, 2018, 1:20pm

Yes I think this is getting to the core of it, thank you! Your solution keeps things nicely to the list parser and doesn’t touch the other token parsers, which I like a lot. I wasn’t sure how to do that.

Topic		Replies	Views
Demystifying an Obscure LookAhead Parser Show and Tell	1	340	September 9, 2023
How to build interesting parsers Show and Tell	3	984	December 9, 2022
Elm/parser look ahead without backtrackable Learn	4	1194	February 16, 2019
Parsers with Error Recovery Learn	22	4759	October 11, 2020
Problem writing a parser Learn	4	922	January 29, 2018

Parsing puzzle, avoid look-ahead?

Related topics