Parsing column-based data with elm/parser

ChrisWellsWood · October 17, 2018, 8:36am

I’m trying to write a parser for fixed-width, column-based string data. It looks like this:

111112222222233333334444444566 -- For clarity
HELLO  19.000200.000 -1.000A C -- Real data format

The numbers show where each column starts and ends. The layout is always the same and there is no delimiter. At the moment I’ve been using String.slice to pull out each column range, then stripping the whitespace and rejoining everything with a delimiter that I can parse. I’m sure there is a better way, but it’s beyond me! Anyone know how to do this? Thanks!

Libbum · October 17, 2018, 8:48am

What are your expected output types here? If I understand correctly, one row would be
String Float Float Float String String
Is that right? Can you confirm that 19.000200.000 is two individual floating point numbers and there is no delimiter between them?

ChrisWellsWood · October 17, 2018, 9:01am

Yup, that’s exactly right. Separated into columns it is:

Group	Content	Type
`1`	`HELLO`	`type Record = Hello | Goodbye`
`2`	`__19.000`	`Float`
`3`	`200.000`	`Float`
`4`	`_-1.000`	`Float`
`5`	`A`	`Maybe String`
`6`	`_C`	`String`

I’ve added underscores to explicitly show where spaces are.

Libbum · October 17, 2018, 11:09am

So I’d use elm/parser for this.

That 19.000200.000 is a bit of a killer at the moment - has me stumped. Let’s assume there is a delimiting space between those two floats for now:

import Parser exposing ((|.), (|=), Parser, chompIf, end, float, getChompedString, keyword, lineComment, oneOf, spaces, succeed, symbol)

type Record
    = Hello
    | Goodbye


{-| Parsed information for each row.
Not sure what your floats are for, so these are of course just dummy names.
-}
type alias Data =
    { header : Record
    , first : Float
    , second : Float
    , third : Float
    , trailChar : Maybe String
    , finalChar : String
    }


{-| This guy does the actual work.
-}
scanner : Parser Data
scanner =
    succeed Data
        |= header
        |. spaces
        |= setWidthFloat
        |. spaces
        |= setWidthFloat
        |. spaces
        |= setWidthFloat
        |= oneOf
            [ succeed Just
                |= charString
                |. spaces
            , succeed Nothing
                |. spaces
            ]
        |= charString
        |. spaces
        |. endRow


{-| Convert the initial string into a record type.
-}
header : Parser Record
header =
    oneOf
        [ succeed Hello
            |. keyword "HELLO"
        , succeed Goodbye
            |. keyword "GOODBYE"
        ]


{-| Not sure if you have comments in your data like you've
shown here, but if so, you can ignore them like this.
Otherwise you can just use `|. end` in the `scanner` function.
-}
endRow : Parser ()
endRow =
    oneOf
        [ end
        , lineComment "--"
        ]


{-| Works on the single character portion at the end.
Assumes that these will always be uppercase ASCII values.
-}
charString : Parser String
charString =
    getChompedString <| chompIf Char.isUpper


{-| A custom float parser since we need to separate those two
values without delimiters. (Not implemented here, this just captures the negative symbol)
-}
setWidthFloat : Parser Float
setWidthFloat =
    oneOf
        [ succeed negate
            |. symbol "-"
            |= float
        , float
        ]

Running the parser here will get you a Data record for the row:

> Parser.run scanner "HELLO  19.000 200.000 -1.000A C -- Real data format"
Ok { finalChar = "C", first = 19, header = Hello, second = 200, third = -1, trailChar = Just "A" }
    : Result (List Parser.DeadEnd) Data

To get around the 19.000200.000 issue, I’ve got to this point:

floatString : Parser String
floatString =
    getChompedString <|
        succeed ()
            |. Parser.chompWhile (\c -> Char.isDigit c)
            |. symbol "."
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit

Which captures the correct information, but has a String type. I haven’t been able to figure out how to do this AND convert the string to a Float at the same time. Perhaps someone else can see a way to do that?

If so, then setWidthFloat could be altered to use floatString instead of float, and the |. spaces between the first and second float captures in scanner can be removed. This should be everything.

ChrisWellsWood · October 17, 2018, 12:39pm

Thanks for this but I should clarify. The spaces in this example are just incidental to the values, I tried to show that with the numbers indicating the associated columns. For example, this is also a valid string:

111112222222233333334444444566 -- For clarity
GDBYE-119.000200.000-11.000A+C -- Real data format

So there are never delimiters, it’s only based on the column number. The file format is a pain, it’s been in use in pretty much this format since the 70s. My example here is simplified over the real thing, but it’s the columns that I couldn’t think how to handle in elm/parser. I did this in the end:

convertToSeparated : String -> String
convertToSeparated string =
    let
        recordType =
            String.slice 0 5 string

        float1 =
            String.slice 5 13 string

        float2 =
            String.slice 13 20 string

        float3 =
            String.slice 20 27 string

        string1 =
            String.slice 27 28 string

        string2 =
            String.slice 28 30 string
    in
    [ recordType
    , float1
    , float2
    , float3
    , string1
    , string2
    ]
        |> List.map String.trim
        |> String.join ";"

Then the string is trivial to parse with elm/parser! But I kind of felt defeated as I’m sure that I could do this more robustly with the parser module.
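
If the slicing route is kept, the delimiter round trip can be skipped by converting each slice straight into a typed value. A minimal sketch, assuming the column offsets from convertToSeparated above; the module name, the Row alias, the column and optional helpers, and the field names are all hypothetical and not part of the real format:

```elm
module FixedWidth exposing (Row, parseRow)

-- Hypothetical record for one row; the field names are placeholders.


type alias Row =
    { record : String
    , first : Float
    , second : Float
    , third : Float
    , flag : Maybe String
    , final : String
    }


-- Slice from `start` up to (not including) `stop`, then trim the padding.
column : Int -> Int -> String -> String
column start stop line =
    String.trim (String.slice start stop line)


-- Treat an all-blank column as a missing value.
optional : String -> Maybe String
optional str =
    if String.isEmpty str then
        Nothing

    else
        Just str


-- Nothing if any of the three float columns fails String.toFloat.
parseRow : String -> Maybe Row
parseRow line =
    Maybe.map3
        (\first second third ->
            { record = column 0 5 line
            , first = first
            , second = second
            , third = third
            , flag = optional (column 27 28 line)
            , final = column 28 30 line
            }
        )
        (String.toFloat (column 5 13 line))
        (String.toFloat (column 13 20 line))
        (String.toFloat (column 20 27 line))
```

With the first sample row, parseRow "HELLO  19.000200.000 -1.000A C" should give Just a record with first = 19, second = 200 and third = -1. This loses the position information a parser would report, which is the trade-off being discussed here.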

Libbum · October 17, 2018, 12:59pm

I see. That is a bit of a terrible format to parse!

I’m still hacking at a solution to the float parser, which should still solve most of this: we can just make it skip any whitespace as well.

The + between A and C: is that significant?

Since the data length is constant though, your split solution isn’t a bad one IMO.

ChrisWellsWood · October 17, 2018, 1:16pm

It’s an awful format, and the worst part is that lots of people don’t even stick to the specification! Let’s say the + is not significant; it could just be any character. I’m sure that chomping a defined number of characters must be possible. You could then parse the result of getChompedString, which would solve this, but I couldn’t figure out how to do that.

folkertdev · October 17, 2018, 2:09pm

This should work for parsing n arbitrary characters

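import Parser exposing ((|.), Parser)

parseNCharacters : Int -> Parser ()
parseNCharacters n =
    -- `<=` rather than `==` so a negative count cannot recurse forever
    if n <= 0 then
        Parser.succeed ()

    else
        Parser.chompIf (\_ -> True) |. parseNCharacters (n - 1)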
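
To connect this to the column layout, the chomped characters can be captured with getChompedString and converted with andThen. This is only a sketch under assumptions: the module name and the helpers chompN, column, floatColumn and row are hypothetical, and the widths 5, 8, 7 and 7 are read off the column ruler in the sample rows.

```elm
module Columns exposing (floatColumn, row)

import Parser exposing ((|.), (|=), Parser)


-- Chomp exactly `n` characters, whatever they are (same idea as above).
chompN : Int -> Parser ()
chompN n =
    if n <= 0 then
        Parser.succeed ()

    else
        Parser.chompIf (\_ -> True) |. chompN (n - 1)


-- Grab a column of fixed `width` as a trimmed string.
column : Int -> Parser String
column width =
    Parser.getChompedString (chompN width)
        |> Parser.map String.trim


-- Grab a fixed-width column and convert it to a Float.
floatColumn : Int -> Parser Float
floatColumn width =
    column width
        |> Parser.andThen
            (\str ->
                case String.toFloat str of
                    Just value ->
                        Parser.succeed value

                    Nothing ->
                        Parser.problem ("Not a float: " ++ str)
            )


-- Skip the 5-character record type, then read the three float columns.
row : Parser ( Float, Float, Float )
row =
    Parser.succeed (\a b c -> ( a, b, c ))
        |. chompN 5
        |= floatColumn 8
        |= floatColumn 7
        |= floatColumn 7
```

Parser.run row "GDBYE-119.000200.000-11.000A+C" should give Ok ( -119, 200, -11 ). Because each column is cut by width first, the missing delimiter between 19.000 and 200.000 stops being a problem.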

Libbum · October 17, 2018, 2:26pm

OK, got it done I’m pretty sure. It’s not amazing, but hopefully something you’re more comfortable with than the split solution you have already.

import Parser exposing ((|.), (|=), Parser, andThen, chompIf, chompWhile, end, float, getChompedString, keyword, lineComment, oneOf, spaces, succeed, symbol)


type Record
    = Hello
    | Goodbye


{-| Parsed information for each row.
-}
type alias Data =
    { header : Record
    , first : Float
    , second : Float
    , third : Float
    , trailChar : Maybe String
    , finalChar : String
    }


{-| This guy does the actual work.
-}
scanner : Parser Data
scanner =
    succeed Data
        |= header
        |. whitespace
        |= setWidthFloat
        |. whitespace
        |= setWidthFloat
        |. whitespace
        |= setWidthFloat
        |= oneOf
            [ succeed Just
                |= charString
                |. charspace
            , succeed Nothing
                |. charspace
            ]
        |= charString
        |. whitespace
        |. endRow


{-| Convert the initial string into a record type.
-}
header : Parser Record
header =
    oneOf
        [ succeed Hello
            |. keyword "HELLO"
        , succeed Goodbye
            |. keyword "GDBYE"
        ]


{-| Not sure if you have comments in your data like you've
shown here, but if so, you can ignore them like this.
Otherwise you can just use `|. end` in the `scanner` function.
-}
endRow : Parser ()
endRow =
    oneOf
        [ end
        , lineComment "--"
        ]


{-| Works on the single character portion at the end.
Assumes that these will always be uppercase ASCII values.
-}
charString : Parser String
charString =
    getChompedString <| chompIf Char.isUpper


{-| A custom float parser since we need to separate those two
values without delimiters.
-}
setWidthFloat : Parser Float
setWidthFloat =
    oneOf
        [ succeed negate
            |. symbol "-"
            |= floatString
        , floatString
        ]


{-| First we match the expected string that should contain our float.
Initially, this will always succeed, but then we convert the captured
string to a float. Error handling happens there.
-}
floatString : Parser Float
floatString =
    getChompedString
        (succeed ()
            |. chompWhile (\c -> Char.isDigit c)
            |. symbol "."
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit
        )
        |> andThen matchToFloat


{-| Convert the string to a float and make sure it's parsed correctly.
-}
matchToFloat : String -> Parser Float
matchToFloat str =
    case String.toFloat str of
        Just good ->
            succeed good

        Nothing ->
            Parser.problem "Failed to convert Fixed Width Float"


{-| Using this instead of `spaces` to kill off whitespace only if it exists.
-}
whitespace : Parser ()
whitespace =
    chompWhile (\c -> c == ' ' || c == '\t' || c == '\n' || c == '\u{000D}')


{-| Same as the `whitespace` method but also includes additional symbols that may
appear in the second to last column.
-}
charspace : Parser ()
charspace =
    chompWhile (\c -> c == '+' || c == '-' || c == ' ' || c == '\t' || c == '\n' || c == '\u{000D}')

A few changes since the last iteration: I swapped the internal spaces calls, which I thought required whitespace, for a custom chomper that consumes whitespace only if it exists (Edit: I read this wrong, spaces also accepts zero characters, so it is fine to use here too). Additionally, charspace does the same but also discards the + between A and C (I put ‘-’ there too, but change this as you want).

Then we take the captured String, which holds our fixed-width float, and use andThen to turn the result of String.toFloat into a parser that either succeeds with the Float or fails.

> Parser.run scanner "HELLO  19.000200.000 -1.000A C -- Real data format"
Ok { finalChar = "C", first = 19, header = Hello, second = 200, third = -1, trailChar = Just "A" }
    : Result (List Parser.DeadEnd) Data

> Parser.run scanner "GDBYE-119.000200.000-11.000A+C -- Real data format"
Ok { finalChar = "C", first = -119, header = Goodbye, second = 200, third = -11, trailChar = Just "A" }
    : Result (List Parser.DeadEnd) Data

ChrisWellsWood · October 17, 2018, 3:43pm

The matchToFloat bit seems weird to me, can’t you just pass the chomped string to another parser (like float or w/e)? If it’s possible to do that I could use @folkertdev’s recursive method to pull out the columns then parse them into data.

Thanks to both of you for the help btw!

Libbum · October 17, 2018, 3:54pm

Yeah, I’m really not sure. This was the only way I could get it to happen.

mapChompedString : (String -> a -> b) -> Parser a -> Parser b

Is the only type signature in the API that I can see that may work, but using that in this instance is over my head.
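
For what it is worth, getChompedString is itself defined in elm/parser as mapChompedString always, so mapChompedString can take its place. The Maybe from String.toFloat still has to be turned into a success or a failure with andThen, though. A rough sketch, where the module name and the names digitsDotDigits, maybeFloat and fixedFloat are made up for illustration:

```elm
module Chomped exposing (fixedFloat)

import Parser exposing ((|.), Parser)


-- Chomp digits, a dot, then exactly three digits, e.g. "200.000".
digitsDotDigits : Parser ()
digitsDotDigits =
    Parser.chompWhile Char.isDigit
        |. Parser.symbol "."
        |. Parser.chompIf Char.isDigit
        |. Parser.chompIf Char.isDigit
        |. Parser.chompIf Char.isDigit


-- mapChompedString hands the chomped text to the function. The second
-- argument is the () produced by the chomper, which is ignored here.
maybeFloat : Parser (Maybe Float)
maybeFloat =
    Parser.mapChompedString (\chunk _ -> String.toFloat chunk) digitsDotDigits


-- The Maybe still has to become a success or a failure, hence andThen.
fixedFloat : Parser Float
fixedFloat =
    maybeFloat
        |> Parser.andThen
            (\result ->
                case result of
                    Just value ->
                        Parser.succeed value

                    Nothing ->
                        Parser.problem "Expected a fixed-width float"
            )
```

So this variant ends up equivalent to the getChompedString plus andThen version rather than simpler.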

folkertdev · October 17, 2018, 4:10pm

Using andThen to use String.toFloat in that way is totally fine. You could use Parser.float, but that would really be more complicated (and probably slower too).
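
To illustrate why the Parser.float route is more involved, here is a sketch of running a second parser on the chomped column. It assumes a fixed column width; the module name and the names chompN, withParser and unsignedFloatColumn are hypothetical. Note that Parser.float does not accept a leading minus sign, so signed columns would need extra handling, and row/column information from the inner run is relative to the chunk, so it is dropped here.

```elm
module SubParse exposing (unsignedFloatColumn, withParser)

import Parser exposing ((|.), Parser)


-- Chomp exactly `n` characters, whatever they are.
chompN : Int -> Parser ()
chompN n =
    if n <= 0 then
        Parser.succeed ()

    else
        Parser.chompIf (\_ -> True) |. chompN (n - 1)


-- Run `inner` on an already-chomped chunk and lift the Result back
-- into a Parser. `Parser.end` makes sure the whole chunk was consumed.
withParser : Parser a -> String -> Parser a
withParser inner chunk =
    case Parser.run (inner |. Parser.end) chunk of
        Ok value ->
            Parser.succeed value

        Err _ ->
            Parser.problem ("Could not parse column: " ++ chunk)


-- Hypothetical usage: a 7-character column holding an unsigned float.
unsignedFloatColumn : Parser Float
unsignedFloatColumn =
    Parser.getChompedString (chompN 7)
        |> Parser.andThen (\chunk -> withParser Parser.float (String.trim chunk))
```

Parser.run unsignedFloatColumn "200.000-11.000" should give Ok 200, while a column such as "-11.000" fails here because of the sign.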

Libbum · October 17, 2018, 4:15pm

As an aside, if you take a look at the way Tereza is building the YAML parser:

Here is the union type for the model which includes strings and floats, and the fromString function does effectively the same thing as above.

Then, that function maps incoming data from a getChompedString pipe.

So there’s at least one other use of this pattern in the wild.

ChrisWellsWood · October 17, 2018, 7:01pm

The reason it seems a bit jarring is that as soon as you drop out of Parser.elm you lose the rich error information generated by the parser.

Libbum · October 17, 2018, 7:17pm

Oh, well you can easily just use Parser.Advanced.problem instead of Parser.problem. Then set your own error type - directly employing ExpectingFloat for example.

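import Parser exposing (Parser, succeed)
import Parser.Advanced

matchToFloat : String -> Parser Float
matchToFloat str =
    case String.toFloat str of
        Just good ->
            succeed good

        Nothing ->
            -- Parser.Parser uses Parser.Problem as its error type,
            -- so a Problem constructor can be passed directly.
            Parser.Advanced.problem Parser.ExpectingFloat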

Regardless, you still get complete errors if things go amiss anyhow, since you’re matching only expected values. This is without the Advanced handler:

> Parser.run scanner "GDBYE-119.0s0200.000-11.000A+C -- Real data format"
Err [{ col = 12, problem = UnexpectedChar, row = 1 }]
    : Result (List Parser.DeadEnd) Data

You’ll only hit that string-based error if you can find a way that some digits, then a ‘.’, then three digits are not a floating point number. I don’t think that’s happened in the last 300 years at least.

system · October 27, 2018, 7:17pm

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.
