Parsing column-based data with elm/parser

I’m trying to write a parser for column-based (fixed-width) string data. It looks like this:

111112222222233333334444444566 -- For clarity
HELLO  19.000200.000 -1.000A C -- Real data format

The numbers show where each column starts and ends. These positions are always the same, and there is no delimiter. At the moment I’m using String.slice to pull out each column range, stripping the whitespace, and rejoining everything with a delimiter that I can parse. I’m sure there is a better way, but it’s beyond me! Does anyone know how to do this? Thanks!

What are your expected output types here? If I understand correctly, one row would be
String Float Float Float String String
Is that right? Can you confirm that 19.000200.000 is two individual floating point numbers and there is no delimiter between them?

Yup, that’s exactly right. Separated into columns it is:

Group  Content    Type
1      HELLO      type Record = Hello | Goodbye
2      __19.000   Float
3      200.000    Float
4      _-1.000    Float
5      A          Maybe String
6      _C         String

I’ve added underscores to explicitly show where spaces are.

So I’d use elm/parser for this.

That 19.000200.000 is a bit of a killer at the moment - it has me stumped. Let’s assume there is a delimiting space between those two floats for now:

import Parser exposing ((|.), (|=), Parser, chompIf, end, float, getChompedString, keyword, lineComment, oneOf, spaces, succeed, symbol)

type Record
    = Hello
    | Goodbye


{-| Parsed information for each row.
Not sure what your floats are for, so these are of course just dummy names.
-}
type alias Data =
    { header : Record
    , first : Float
    , second : Float
    , third : Float
    , trailChar : Maybe String
    , finalChar : String
    }


{-| This guy does the actual work.
-}
scanner : Parser Data
scanner =
    succeed Data
        |= header
        |. spaces
        |= setWidthFloat
        |. spaces
        |= setWidthFloat
        |. spaces
        |= setWidthFloat
        |= oneOf
            [ succeed Just
                |= charString
                |. spaces
            , succeed Nothing
                |. spaces
            ]
        |= charString
        |. spaces
        |. endRow


{-| Convert the initial string into a record type.
-}
header : Parser Record
header =
    oneOf
        [ succeed Hello
            |. keyword "HELLO"
        , succeed Goodbye
            |. keyword "GOODBYE"
        ]


{-| Not sure if you have comments in your data like you've
shown here, but if so, you can ignore them like this.
Otherwise you can just use `|. end` in the `scanner` function.
-}
endRow : Parser ()
endRow =
    oneOf
        [ end
        , lineComment "--"
        ]


{-| Works on the single character portion at the end.
Assumes that these will always be uppercase ASCII values.
-}
charString : Parser String
charString =
    getChompedString <| chompIf Char.isUpper


{-| A custom float parser since we need to separate those two
values without delimiters. (Not implemented here, this just captures the negative symbol)
-}
setWidthFloat : Parser Float
setWidthFloat =
    oneOf
        [ succeed negate
            |. symbol "-"
            |= float
        , float
        ]

Running the parser here will get you a Data record for the row:

> Parser.run scanner "HELLO  19.000 200.000 -1.000A C -- Real data format"
Ok { finalChar = "C", first = 19, header = Hello, second = 200, third = -1, trailChar = Just "A" }
    : Result (List Parser.DeadEnd) Data

To get around the 19.000200.000 issue, I’ve got to this point:

floatString : Parser String
floatString =
    getChompedString <|
        succeed ()
            |. Parser.chompWhile (\c -> Char.isDigit c)
            |. symbol "."
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit

Which captures the correct information, but has a String type. I haven’t been able to figure out how to do this AND convert the string to a Float at the same time. Perhaps someone else can see a way to do that?

If so, then setWidthFloat could be altered to use floatString instead of float, and the |. spaces between the first and second float captures in scanner could be removed. That should be everything.

Thanks for this but I should clarify. The spaces in this example are just incidental to the values, I tried to show that with the numbers indicating the associated columns. For example, this is also a valid string:

111112222222233333334444444566 -- For clarity
GDBYE-119.000200.000-11.000A+C -- Real data format

So there are never delimiters, it’s only based on the column number. The file format is a pain, it’s been in use in pretty much this format since the 70s. My example here is simplified over the real thing, but it’s the columns that I couldn’t think how to handle in elm/parser. I did this in the end:

convertToSeparated : String -> String
convertToSeparated string =
    let
        recordType =
            String.slice 0 5 string

        float1 =
            String.slice 5 13 string

        float2 =
            String.slice 13 20 string

        float3 =
            String.slice 20 27 string

        string1 =
            String.slice 27 28 string

        string2 =
            String.slice 28 30 string
    in
    [ recordType
    , float1
    , float2
    , float3
    , string1
    , string2
    ]
        |> List.map String.trim
        |> String.join ";"

Then the string is trivial to parse with elm/parser! But I kind of felt defeated as I’m sure that I could do this more robustly with the parser module.

I see. That is a bit of a terrible format to parse!

I’m still hacking at a solution for the float parser, which should still solve most of this: we can just have the parser skip over any whitespace around the values.

The + between A and C: is that significant?

Since the data length is constant though, your split solution isn’t a bad one IMO.
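
If the slice approach is kept, the re-join and re-parse step can probably be skipped by building the record straight from the slices. This is only a sketch: SlicedRow, fromLine and the field names are made up for illustration, the offsets are the ones from convertToSeparated above, and the record type is kept as a plain String for brevity.

```elm
type alias SlicedRow =
    { recordType : String
    , first : Float
    , second : Float
    , third : Float
    , flag : String
    , code : String
    }


{-| Build a row straight from the column slices. Returns Nothing when any
of the three numeric columns is not a valid float.
-}
fromLine : String -> Maybe SlicedRow
fromLine line =
    let
        -- One trimmed column, using the same offsets as String.slice.
        column start stop =
            String.trim (String.slice start stop line)
    in
    Maybe.map3
        (\f1 f2 f3 ->
            { recordType = column 0 5
            , first = f1
            , second = f2
            , third = f3
            , flag = column 27 28
            , code = column 28 30
            }
        )
        (String.toFloat (column 5 13))
        (String.toFloat (column 13 20))
        (String.toFloat (column 20 27))
```

The trade-off is that a bad row only yields Nothing, with no row or column position, which is the main thing elm/parser would add. Note that String.trim only strips whitespace, so a filler character such as the + in the last column would still need handling.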

It’s an awful format, and the worst part is that lots of people don’t even stick to the specification! Let’s say the + is not significant; it could be any character. I’m sure that chomping a defined number of characters must be possible. You could then parse the result of getChompedString, which would solve this, but I couldn’t figure out how to do that.

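This should work for parsing n arbitrary characters:

import Parser exposing ((|.), Parser)

parseNCharacters : Int -> Parser ()
parseNCharacters n =
    -- `n <= 0` rather than `n == 0`, so a negative count cannot recurse forever.
    if n <= 0 then
        Parser.succeed ()
    else
        Parser.chompIf (\_ -> True) |. parseNCharacters (n - 1)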
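
One way this could be combined with getChompedString to read whole columns by width is sketched below. The names chompN, column, floatColumn, Row and row are hypothetical, and the widths 5, 8, 7, 7, 1, 2 are taken from the ruler line in the question. The record type is left as a String here to keep the sketch short.

```elm
import Parser exposing ((|.), (|=), Parser, andThen, chompIf, getChompedString, problem, succeed)


{-| Chomp exactly `n` characters of any kind. Fails if the input runs out first.
-}
chompN : Int -> Parser ()
chompN n =
    if n <= 0 then
        succeed ()

    else
        chompIf (\_ -> True) |. chompN (n - 1)


{-| Capture a column of exactly `width` characters, with surrounding whitespace trimmed.
-}
column : Int -> Parser String
column width =
    getChompedString (chompN width)
        |> Parser.map String.trim


{-| Capture a column and convert it to a Float, failing the parser if the
conversion fails.
-}
floatColumn : Int -> Parser Float
floatColumn width =
    column width
        |> andThen
            (\str ->
                case String.toFloat str of
                    Just value ->
                        succeed value

                    Nothing ->
                        problem ("Expected a float but got: " ++ str)
            )


type alias Row =
    { recordType : String
    , first : Float
    , second : Float
    , third : Float
    , flag : String
    , code : String
    }


{-| One row, read purely by column width. No delimiters are assumed.
-}
row : Parser Row
row =
    succeed Row
        |= column 5
        |= floatColumn 8
        |= floatColumn 7
        |= floatColumn 7
        |= column 1
        |= column 2
```

Running Parser.run row on either sample line should split it at the same offsets as the String.slice version, with the difference that a short line or a non-numeric column produces a DeadEnd carrying a row and column. As with the slice version, a filler character such as + in the last column is not trimmed and would need its own handling.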

OK, got it done I’m pretty sure. It’s not amazing, but hopefully something you’re more comfortable with than the split solution you have already.

import Parser exposing ((|.), (|=), Parser, andThen, chompIf, chompWhile, end, float, getChompedString, keyword, lineComment, oneOf, spaces, succeed, symbol)


type Record
    = Hello
    | Goodbye


{-| Parsed information for each row.
-}
type alias Data =
    { header : Record
    , first : Float
    , second : Float
    , third : Float
    , trailChar : Maybe String
    , finalChar : String
    }


{-| This guy does the actual work.
-}
scanner : Parser Data
scanner =
    succeed Data
        |= header
        |. whitespace
        |= setWidthFloat
        |. whitespace
        |= setWidthFloat
        |. whitespace
        |= setWidthFloat
        |= oneOf
            [ succeed Just
                |= charString
                |. charspace
            , succeed Nothing
                |. charspace
            ]
        |= charString
        |. whitespace
        |. endRow


{-| Convert the initial string into a record type.
-}
header : Parser Record
header =
    oneOf
        [ succeed Hello
            |. keyword "HELLO"
        , succeed Goodbye
            |. keyword "GDBYE"
        ]


{-| Not sure if you have comments in your data like you've
shown here, but if so, you can ignore them like this.
Otherwise you can just use `|. end` in the `scanner` function.
-}
endRow : Parser ()
endRow =
    oneOf
        [ end
        , lineComment "--"
        ]


{-| Works on the single character portion at the end.
Assumes that these will always be uppercase ASCII values.
-}
charString : Parser String
charString =
    getChompedString <| chompIf Char.isUpper


{-| A custom float parser since we need to separate those two
values without delimiters.
-}
setWidthFloat : Parser Float
setWidthFloat =
    oneOf
        [ succeed negate
            |. symbol "-"
            |= floatString
        , floatString
        ]


{-| First we match the expected string that should contain our float.
Initially, this will always succeed, but then we convert the captured
string to a float. Error handling happens there.
-}
floatString : Parser Float
floatString =
    getChompedString
        (succeed ()
            |. chompWhile (\c -> Char.isDigit c)
            |. symbol "."
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit
            |. chompIf Char.isDigit
        )
        |> andThen matchToFloat


{-| Convert the string to a float and make sure it's parsed correctly.
-}
matchToFloat : String -> Parser Float
matchToFloat str =
    case String.toFloat str of
        Just good ->
            succeed good

        Nothing ->
            Parser.problem "Failed to convert Fixed Width Float"


{-| Using this instead of `spaces` to kill off whitespace only if it exists.
-}
whitespace : Parser ()
whitespace =
    chompWhile (\c -> c == ' ' || c == '\t' || c == '\n' || c == '\u{000D}')


{-| Same as the `whitespace` method but also includes additional symbols that may
appear in the second to last column.
-}
charspace : Parser ()
charspace =
    chompWhile (\c -> c == '+' || c == '-' || c == ' ' || c == '\t' || c == '\n' || c == '\u{000D}')

A few changes since the last iteration: I replaced the internal spaces calls with whitespace, a custom chomper that consumes whitespace only if it exists (Edit: I read the docs wrong, spaces also accepts zero characters, so it is fine to use here too). Additionally, charspace does the same but also discards the + between A and C (I put ‘-’ there too, but change this as you want).

Then we take the captured String, which holds our fixed-width float, convert it with String.toFloat, and turn the result into a parser that either succeeds with the Float or fails with a problem.

> Parser.run scanner "HELLO  19.000200.000 -1.000A C -- Real data format"
Ok { finalChar = "C", first = 19, header = Hello, second = 200, third = -1, trailChar = Just "A" }
    : Result (List Parser.DeadEnd) Data

> Parser.run scanner "GDBYE-119.000200.000-11.000A+C -- Real data format"
Ok { finalChar = "C", first = -119, header = Goodbye, second = 200, third = -11, trailChar = Just "A" }
    : Result (List Parser.DeadEnd) Data

The matchToFloat bit seems weird to me. Can’t you just pass the chomped string to another parser (like float or whatever)? If it’s possible to do that, I could use @folkertdev’s recursive method to pull out the columns and then parse them into data.

Thanks to both of you for the help btw! :smiley:

Yeah, I’m really not sure. This was the only way I could get it to happen.

mapChompedString : (String -> a -> b) -> Parser a -> Parser b

Is the only type signature in the API that I can see that may work, but using that in this instance is over my head.
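
For what it is worth, here is a rough sketch of how mapChompedString might slot in. The name fixedFloat is made up for this example. The function passed to mapChompedString returns a plain value rather than a parser, so it cannot fail on its own, and an andThen step still seems to be needed to turn a failed conversion into a parser error.

```elm
import Parser exposing ((|.), Parser, andThen, chompIf, chompWhile, mapChompedString, problem, succeed, symbol)


{-| Digits, a dot, then exactly three digits, converted to a Float.
-}
fixedFloat : Parser Float
fixedFloat =
    (succeed ()
        |. chompWhile Char.isDigit
        |. symbol "."
        |. chompIf Char.isDigit
        |. chompIf Char.isDigit
        |. chompIf Char.isDigit
    )
        -- The second argument is the () produced by the inner parser; ignore it.
        |> mapChompedString (\str _ -> String.toFloat str)
        |> andThen
            (\maybeFloat ->
                case maybeFloat of
                    Just value ->
                        succeed value

                    Nothing ->
                        problem "Failed to convert Fixed Width Float"
            )
```

So this ends up no shorter than the getChompedString plus andThen version above; it mostly shows that the two routes are equivalent here.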

Using andThen with String.toFloat in that way is totally fine. You could use Parser.float, but that would really be more complicated (and probably slower too).

As an aside, if you take a look at the way Tereza is building the YAML parser:

Here is the union type for the model which includes strings and floats, and the fromString function does effectively the same thing as above.

Then, that function maps incoming data from a getChompedString pipe.

So there’s at least one other use of this pattern in the wild.


The reason it seems a bit jarring is that as soon as you drop out of Parser.elm you lose the rich error information generated by the parser.

Oh, well you can easily just use Parser.Advanced.problem instead of Parser.problem. Then set your own error type - directly employing ExpectingFloat for example.

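import Parser exposing (Parser, succeed)
import Parser.Advanced

matchToFloat : String -> Parser Float
matchToFloat str =
    case String.toFloat str of
        Just good ->
            succeed good

        Nothing ->
            -- Report the built-in ExpectingFloat problem instead of a custom string.
            Parser.Advanced.problem Parser.ExpectingFloat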

Regardless, you still get complete errors if things go amiss anyhow, since you’re matching only expected values. This is without the Advanced handler:

> Parser.run scanner "GDBYE-119.0s0200.000-11.000A+C -- Real data format"
Err [{ col = 12, problem = UnexpectedChar, row = 1 }]
    : Result (List Parser.DeadEnd) Data

You’ll only hit that string based error if you can find a way that some digits then a ‘.’ then three digits are not a floating point number. I don’t think that’s happened in the last 300 years at least :wink:
