Using Xml.Decode

mikevdg · April 1, 2020, 11:25pm

Hi all. I’m trying to parse this rather complicated XML document:

My code is here: https://github.com/mikevdg/chickpea

Now, I’m having difficulty working with Xml.Decode from https://package.elm-lang.org/packages/ymtszw/elm-xml-decode/latest/. Easy stuff is easy. Hard stuff has had me stumped for a few hours.

The XML I’m trying to decode is (simplified for this example):

<Schema>
    <EntityType Name="Person">
        <Property>foo</Property>
        <Property>bar</Property>
    </EntityType>
    <ComplexType Name="Location">
        <Property>foo</Property>
        <Property>bar</Property>
    </ComplexType>
</Schema>

The Schema contains a list of things, and each thing can be an <EntityType> or a <ComplexType> (or a few other things). Ditto with EntityType and ComplexType - they contain different types of things, mostly <Property> but there’s a range of other things they can be.

So far I have (in Schema.elm):

type SchemaEntry = 
    EntityType String EntityTypeEntry
    | ComplexType -- TODO
    | EnumType -- TODO
    | Function
    | Action
    | EntityContainer String (List EntityContainerEntry)

type EntityTypeEntry =
    Key (List PropertyDetails)
    | Property PropertyDetails
    | NavigationProperty PropertyDetails

decodeSchemaEntry : Decoder SchemaEntry
decodeSchemaEntry =
    oneOf [
        decodeEntityType
        , decodeComplexType
    ]

decodeEntityType : Decoder SchemaEntry
decodeEntityType = 
    map2 asEntityType 
        (path ["EntityType"] (single (stringAttr "Name")))
        (path ["EntityType"] (list decodeEntityTypeEntry))

Now, it compiles, but I’m repeating path ["EntityType"] which smells bad to me. I don’t feel I’m doing this correctly.

Given a chunk of XML such as <hello attr1="foo" attr2="bar"> <contents…/>, how do I pull out multiple attributes and contents to assemble into one value?

ymtszw · April 2, 2020, 9:56am

Author here. Thanks for utilizing the package!

I’ve investigated the case a bit, and I do think it is complicated enough.

Let me propose my (kind of) solution at the moment:
(The snippet is written as elm-test code. I will push this to the repo as an example test case later)

        , describe "discourse#5412" <|
            let
                exampleXml =
                    """
<Schema>
    <EntityType Name="Person">
        <Property>foo</Property>
        <Property>bar</Property>
    </EntityType>
    <ComplexType Name="Location">
        <Property>foo</Property>
        <Property>bar</Property>
    </ComplexType>
    <EntityType Name="Animal">
        <Property>ban</Property>
    </EntityType>
</Schema>
"""
            in
            [ test "proposedDecoder" <|
                \_ ->
                    let
                        proposedDecoder =
                            path [] (leakyList decodeSchemaEntry)

                        decodeSchemaEntry =
                            with node <|
                                \n ->
                                    case n of
                                        Element "EntityType" _ _ ->
                                            map2 EntityType (stringAttr "Name") decodeEntityTypeEntry

                                        Element "ComplexType" _ _ ->
                                            succeed ComplexType

                                        _ ->
                                            fail "TODO"

                        decodeEntityTypeEntry =
                            oneOf
                                [ path [ "Property" ] <| leakyList <| map Property string

                                -- More to come here
                                ]
                    in
                    exampleXml
                        |> run proposedDecoder
                        |> Expect.equal
                            (Ok
                                [ EntityType "Person" [ Property "foo", Property "bar" ]
                                , ComplexType
                                , EntityType "Animal" [ Property "ban" ]
                                ]
                            )
            ]
        ]


type SchemaEntry
    -- More to come
    = EntityType String (List EntityTypeEntry)
    | ComplexType


type EntityTypeEntry
    -- More to come
    = Property PropertyDetails


type alias PropertyDetails =
    String -- Temporary

Points:

- I tweaked the EntityType variant so that it takes a List EntityTypeEntry, since from the look of the document an <EntityType> may contain multiple <Property>s, right?
  - (Even if not, it does not make much difference, since the bigger problem lies below.)
- I think we want to enumerate the child data of the <Schema> Node, right? In this example <Schema> is the “root” layer so it is a bit tricky, but path [] << list (or leakyList) comes to the rescue.
  - path [] allows you to iterate over the children of the “current” Node without digging any deeper.
  - This might better be exposed as children or a similarly named API so that it is easier to find.
- Duplicated path [ "EntityType" ] ... is definitely a code smell, I agree. But since your target data structure is already rather complex, you need a “custom decoder” which peeks directly into the structure of a Node. You can do so with the with node <| \n -> ... pattern.
- As an aside, be careful about the behavior of path, since it performs a breadth-first search. If you want to iterate over Nodes directly under an Element WITHOUT skipping Nodes that do not match the predicate, the behavior of path might not be suitable for you.
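
For the narrower question of pulling an attribute and child contents out of one element, a minimal sketch (not taken from the repo) could look like the following. It assumes the simplified XML from the question and flattens each <Property> to a plain String; EntityTypeSketch, entityType, and entityTypes are hypothetical names introduced only for illustration. Since a decoder runs against the “current” node, path [ "EntityType" ] appears only once, at the level that selects the elements.

```elm
module SchemaSketch exposing (EntityTypeSketch, entityType, entityTypes)

import Xml.Decode exposing (Decoder, list, map2, path, string, stringAttr)


-- Hypothetical, simplified target type: a name plus its property texts.
type alias EntityTypeSketch =
    { name : String
    , properties : List String
    }


-- Runs against a single <EntityType> node: reads its Name attribute
-- and the text of each <Property> child, then combines both with map2.
entityType : Decoder EntityTypeSketch
entityType =
    map2 EntityTypeSketch
        (stringAttr "Name")
        (path [ "Property" ] (list string))


-- Runs against the root <Schema> node: selects its <EntityType> children
-- and applies the decoder above to each of them.
entityTypes : Decoder (List EntityTypeSketch)
entityTypes =
    path [ "EntityType" ] (list entityType)
```

Passing entityTypes to Xml.Decode.run should give one record per <EntityType>. Other children such as <ComplexType> are simply not selected by this path, which is why the mixed case still needs the node-based approach.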

Anyway, against such a convoluted structure, be it XML or not, you will need to construct a large “decoder tree” nonetheless. You may find directly walking through the Node tree without using Xml.Decode more intuitive, but who knows.

Hope this gives you an insight!

DullBananas · April 2, 2020, 9:12pm

I recently had trouble implementing a parser for a deeply nested XML file. I decided to use the XmlParser module directly, without using a decoder. Source code is here; maybe it will help: https://github.com/dullbananas/editsc/tree/master/editsc/js/src/ProjectFile

In my code, xmlitem is used to represent most of the XML elements.

mikevdg · April 2, 2020, 10:51pm

Hi @ymtszw. Thanks for the reply, I’m slowly digesting it.

mikevdg · April 2, 2020, 11:26pm

In your code, where does node come from?

ymtszw · April 3, 2020, 4:06am

node is provided in the latest version of elm-xml-decode. It is a decoder that directly returns an XmlParser.Node for custom decoding.
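
As a small sketch of that pattern, the following decoder inspects the current node and returns its tag name. tagName is a hypothetical helper made up for illustration, and the snippet assumes jinjor/elm-xml-parser is installed as a direct dependency so that the XmlParser module can be imported.

```elm
module NodeSketch exposing (tagName)

import Xml.Decode exposing (Decoder, fail, node, succeed, with)
import XmlParser exposing (Node(..))


-- Hypothetical helper: succeed with the tag name when the current
-- node is an element, fail when it is a text node.
tagName : Decoder String
tagName =
    with node <|
        \n ->
            case n of
                Element name _ _ ->
                    succeed name

                Text _ ->
                    fail "Expected an element, found a text node"
```

The same case expression is where you would branch to a different decoder per tag name.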

Atlewee · April 3, 2020, 6:22am

I had to parse some pretty large XML files lately (Visio documents), where each page typically contains 100k lines of XML. I tried different libraries, but it was hard to get something that was both performant and able to build the custom types I wanted from the data. (The libraries I tried all build a complete representation of the document before decoding.) In the end I just skipped XML parsing/decoding altogether and used elm/parser to extract the data I needed directly from text, without going to an XML representation first… This increased performance 100x and was a lot more flexible for my use case.
It also made me learn the great parsing library for use with other stuff later on.

ymtszw · April 3, 2020, 6:56am

I would say XML folks (do we even call them with this much of generalization? lol) are really “creative” in utilizing XML’s capabilities to the maximum, although they have tons of historical knowledge and supporting libraries at their disposal.

OTOH, Elm is more inclined towards recent web app development and mostly follows its trends, so it often comes up short at other things. XML handling is one of them. However, recently I see many people trying to extend Elm’s boundary to broader applications, which is very stimulating to see!

Good find! A plaintext-based approach is a nice escape hatch in this context, I think.

Until somebody found a way to automatically generate decoders from XSD (XML Schema Definition), constructing the necessary XML decoders by hand is always a PITA. Even if it were a thing, decoding large XML documents in a “functional” way is inherently slow (this is also true for complicated JSON docs).

Atlewee · April 3, 2020, 8:03am

Maybe a library in between elm-parser and xml-parser would be a good idea?
something like Elm.Parser.Extra.Xml or something…
Where you actually use elm/parser all the way, but the library knows how to read XML and has really nice elm/parser helper functions for extracting what you need. Instead of decoding the whole structure first and then extract data. This way you could get some performance benefit if you do not need all the data, you get all the flexibility from elm/parser and can also create nice error messages based on the content you expect…

ymtszw · April 3, 2020, 10:55am

Possibly. If I were to do something along those lines, I would introduce a “tag parser” which accepts the shape of the start and end tags and is composable with other parsers (including itself, to handle nested tags of the same name) to parse the inner contents.
We already have Punie/elm-parser-extras by @Punie, whose between serves a similar purpose and can be used here.
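
A rough sketch of that idea using only elm/parser might look like this. tag, text, and property are hypothetical names, and the sketch deliberately ignores attributes, whitespace, entities, and self-closing tags, so it is nowhere near a real XML parser.

```elm
module TagSketch exposing (property, tag, text)

import Parser exposing ((|.), (|=), Parser, chompUntil, getChompedString, succeed, symbol)


-- Hypothetical helper: parse <name>INNER</name> and keep only the
-- result of the inner parser. Attributes are not supported.
tag : String -> Parser a -> Parser a
tag name inner =
    succeed identity
        |. symbol ("<" ++ name ++ ">")
        |= inner
        |. symbol ("</" ++ name ++ ">")


-- Text content up to, but not including, the next '<'.
text : Parser String
text =
    getChompedString (chompUntil "<")


-- Example composition: the text inside a <Property> element.
property : Parser String
property =
    tag "Property" text
```

Since tag takes any inner parser, it can be nested inside itself, for example tag "EntityType" (tag "Property" text) for a single wrapped child.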

mikevdg · April 4, 2020, 1:01am

Until somebody found a way to automatically generate decoders from XSD

I’m going to need this in order to finish my project. I don’t particularly want to write a full decoder for the OData metadata files by hand.

If I get that far and if nobody else implements it first, I’ll try doing it myself, but I’m still learning Elm and my first attempt will probably be a dog’s breakfast.

For now I’ll work on other parts of my application and come back to XML parsing later.

system · April 14, 2020, 1:01am

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.
