Using Xml.Decode

Hi all. I’m trying to parse this rather complicated XML document:

The OData reference service metadata

My code is here: https://github.com/mikevdg/chickpea

Now, I’m having difficulty working with Xml.Decode from https://package.elm-lang.org/packages/ymtszw/elm-xml-decode/latest/. Easy stuff is easy. Hard stuff has had me stumped for a few hours.

The XML I’m trying to decode is (simplified for this example):

<Schema>
    <EntityType Name="Person">
        <Property>foo</Property>
        <Property>bar</Property>
    </EntityType>
    <ComplexType Name="Location">
        <Property>foo</Property>
        <Property>bar</Property>
    </ComplexType>
</Schema>

The Schema contains a list of things, and each thing can be an <EntityType> or a <ComplexType> (or a few other things). Ditto with EntityType and ComplexType - they contain different types of things, mostly <Property> but there’s a range of other things they can be.

So far I have (in Schema.elm):

type SchemaEntry = 
    EntityType String EntityTypeEntry
    | ComplexType -- TODO
    | EnumType -- TODO
    | Function
    | Action
    | EntityContainer String (List EntityContainerEntry)

type EntityTypeEntry =
    Key (List PropertyDetails)
    | Property PropertyDetails
    | NavigationProperty PropertyDetails

decodeSchemaEntry : Decoder SchemaEntry
decodeSchemaEntry =
    oneOf [
        decodeEntityType
        , decodeComplexType
    ]

decodeEntityType : Decoder SchemaEntry
decodeEntityType = 
    map2 asEntityType 
        (path ["EntityType"] (single (stringAttr "Name")))
        (path ["EntityType"] (list decodeEntityTypeEntry))

Now, it compiles, but I’m repeating path ["EntityType"] which smells bad to me. I don’t feel I’m doing this correctly.

Given a chunk of XML such as `<hello attr1="foo" attr2="bar"> <contents…/>`, how do I pull out multiple attributes and contents to assemble into one value?

Author here. Thanks for utilizing the package!

I’ve investigated the case a bit, and I do think it is complicated enough.

Let me propose my (kind of) solution at the moment:
(The snippet is written as elm-test code. I will push this to the repo as an example test case later)

        , describe "discourse#5412" <|
            let
                exampleXml =
                    """
<Schema>
    <EntityType Name="Person">
        <Property>foo</Property>
        <Property>bar</Property>
    </EntityType>
    <ComplexType Name="Location">
        <Property>foo</Property>
        <Property>bar</Property>
    </ComplexType>
    <EntityType Name="Animal">
        <Property>ban</Property>
    </EntityType>
</Schema>
"""
            in
            [ test "proposedDecoder" <|
                \_ ->
                    let
                        proposedDecoder =
                            path [] (leakyList decodeSchemaEntry)

                        decodeSchemaEntry =
                            with node <|
                                \n ->
                                    case n of
                                        Element "EntityType" _ _ ->
                                            map2 EntityType (stringAttr "Name") decodeEntityTypeEntry

                                        Element "ComplexType" _ _ ->
                                            succeed ComplexType

                                        _ ->
                                            fail "TODO"

                        decodeEntityTypeEntry =
                            oneOf
                                [ path [ "Property" ] <| leakyList <| map Property string

                                -- More to come here
                                ]
                    in
                    exampleXml
                        |> run proposedDecoder
                        |> Expect.equal
                            (Ok
                                [ EntityType "Person" [ Property "foo", Property "bar" ]
                                , ComplexType
                                , EntityType "Animal" [ Property "ban" ]
                                ]
                            )
            ]
        ]


type SchemaEntry
    -- More to come
    = EntityType String (List EntityTypeEntry)
    | ComplexType


type EntityTypeEntry
    -- More to come
    = Property PropertyDetails


type alias PropertyDetails =
    String -- Temporary

Points:

  • I tweaked the EntityType variant so that it takes a List EntityTypeEntry, since from the look of the document, <EntityType> may contain multiple <Property>s, right?
    • (Even if not, it actually does not make that much difference since the bigger problem lies below)
  • I think we want to enumerate the child data of the <Schema> Node, right? In this example <Schema> is the “root” layer, so it is a bit tricky, but path [] << list (or leakyList) comes to the rescue
    • path [] allows you to iterate over the children of the “current” Node without digging any deeper
    • This might be better exposed as children or a similarly named API so that it is easier to discover
  • Duplicated path [ "EntityType" ] ... is definitely a code smell, I agree. But since your target data structure is already rather complex, you need a “custom decoder” that peeks directly into the structure of the Node. You can do so with the with node <| \n -> ... pattern
  • As an aside, be careful about the behavior of path, since it performs a breadth-first search. If you want to iterate over the Nodes directly under an Element WITHOUT skipping Nodes that do not match the predicate, path might not be suitable for you.
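
For the simpler cases, here is a rough sketch of how the repeated path [ "EntityType" ] can be avoided: the decoder handed to list runs against each matched element, so the attribute and the children can be combined there with map2. The Entity record and the names entityDecoder, entitiesDecoder and example are made up for this illustration, and properties are decoded as plain strings only.

```elm
module SchemaSketch exposing (Entity, entitiesDecoder, example)

import Xml.Decode exposing (Decoder, list, map2, path, run, string, stringAttr)


-- Hypothetical target type, for illustration only.
type alias Entity =
    { name : String
    , properties : List String
    }


-- Runs against a single <EntityType> element: reads its Name attribute
-- and the text of each <Property> child.
entityDecoder : Decoder Entity
entityDecoder =
    map2 Entity
        (stringAttr "Name")
        (path [ "Property" ] (list string))


-- path [ "EntityType" ] appears only once; paths are relative to the
-- root element (<Schema> here).
entitiesDecoder : Decoder (List Entity)
entitiesDecoder =
    path [ "EntityType" ] (list entityDecoder)


example : Result String (List Entity)
example =
    run entitiesDecoder
        """
<Schema>
    <EntityType Name="Person">
        <Property>foo</Property>
        <Property>bar</Property>
    </EntityType>
</Schema>
"""
```

This only covers one kind of child at a time. When the order of mixed <EntityType> and <ComplexType> children matters, the with node approach above is still needed.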

Anyway, against such a convoluted structure, be it XML or not, you will need to construct a large “decoder tree” nonetheless. You may find walking through the Node tree directly, without using Xml.Decode, more intuitive, but who knows.

Hope this gives you an insight!

I recently had trouble implementing a parser for a deeply nested XML file. I decided to use the XmlParser module directly, without a decoder. Source code is here; maybe it will help: https://github.com/dullbananas/editsc/tree/master/editsc/js/src/ProjectFile

In my code, xmlitem is used to represent most of the XML elements.

Hi @ymnszw. Thanks for the reply, I’m slowly digesting it.

In your code, where does node come from?

node is provided in the latest version of elm-xml-decode. It is a decoder that directly returns an XmlParser.Node for custom decoding.
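
A minimal sketch of what that looks like on its own, assuming jinjor/elm-xmlparser is installed as a direct dependency so that the Node constructors can be imported. The elementName helper is a made-up name, not part of the package:

```elm
module NodeSketch exposing (elementName)

import Xml.Decode exposing (Decoder, fail, node, succeed, with)
import XmlParser exposing (Node(..))


-- Hypothetical helper: yields the tag name of the node currently being
-- decoded, so that later decoders can branch on it.
elementName : Decoder String
elementName =
    with node <|
        \n ->
            case n of
                Element name _ _ ->
                    succeed name

                Text _ ->
                    fail "Expected an element, but found a text node"
```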

I had to parse some pretty large XML files lately (Visio documents), where each page typically contains 100k lines of XML. I tried different libraries, but it was hard to get good performance and also build the custom types I wanted from the data. (The libraries I tried all build a complete representation of the document before decoding.) In the end I skipped XML parsing/decoding altogether and used elm/parser to extract the data I needed directly from the text, without going through an XML representation first. This made it about 100x faster and was a lot more flexible for my use case.
It also made me learn a great parsing library that I can use for other stuff later on :)
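
To give an idea of the approach (this is not my actual code, just a minimal sketch using the <Property> elements from the example earlier in the thread), something like the following scans the raw text for <Property>...</Property> pairs without building any XML tree. The names propertyText, propertyTexts and step are made up, and it assumes the tags carry no attributes and ignores comments, CDATA and entities:

```elm
module PropertyScan exposing (propertyTexts)

import Parser exposing ((|.), (|=), Parser, Step(..))


-- Skips ahead to the next <Property> tag and returns the raw text
-- between it and the matching </Property>.
propertyText : Parser String
propertyText =
    Parser.succeed identity
        |. Parser.chompUntil "<Property>"
        |. Parser.token "<Property>"
        |= Parser.getChompedString (Parser.chompUntil "</Property>")
        |. Parser.token "</Property>"


-- Collects every <Property> text in document order and stops when no
-- further opening tag is found.
propertyTexts : Parser (List String)
propertyTexts =
    Parser.loop [] step


step : List String -> Parser (Step (List String) (List String))
step acc =
    Parser.oneOf
        [ Parser.map (\text -> Loop (text :: acc)) propertyText
        , Parser.succeed (Done (List.reverse acc))
        ]
```

It would be run with Parser.run propertyTexts xmlString. Everything outside the tags you ask for is simply skipped, which is where the speed comes from, but also why it is only safe when you know the shape of the documents.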

I would say XML folks (do we even call them with this much of generalization? lol) are really “creative” in utilizing XML’s capabilities to the maximum, although they have tons of historical knowledge and supporting libraries at their disposal.

OTOH, Elm is geared toward recent web app development and mostly follows its trends, so it often comes up short at other things. XML handling is one of them. Recently, however, I see many people trying to extend Elm’s boundaries to broader applications, which is very stimulating to see!

Good find! A plaintext-based approach is a nice escape hatch in this context, I think.

Until somebody finds a way to automatically generate decoders from XSD (XML Schema Definition), constructing the necessary XML decoders by hand is always a PITA. Even if that existed, decoding large XML documents in a “functional” way is inherently slow (this is also true for complicated JSON documents).

Maybe a library in between elm/parser and xml-parser would be a good idea?
Something like Elm.Parser.Extra.Xml or similar…
You would use elm/parser all the way, but the library would know how to read XML and would have really nice elm/parser helper functions for extracting what you need, instead of decoding the whole structure first and then extracting the data. This way you could get some performance benefit if you do not need all the data, you get all the flexibility of elm/parser, and you can also create nice error messages based on the content you expect…

Possibly. If I were to do something along those lines, I would introduce a “tag parser” which accepts the shape of a start and end tag and can then be composed with other parsers (including itself, to handle nested tags of the same kind) to parse its inner contents.
We already have Punie/elm-parser-extras by @Punie with between for similar purpose which can be used here.
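
A very rough sketch of the idea, using plain elm/parser only. The tag and word helpers are hypothetical names, and the opening tag is assumed to have no attributes:

```elm
module TagSketch exposing (example, tag, word)

import Parser exposing ((|.), (|=), Parser)


-- Hypothetical "tag parser": expects <name>, then the inner parser,
-- then </name>, allowing whitespace around the inner contents.
tag : String -> Parser a -> Parser a
tag name inner =
    Parser.succeed identity
        |. Parser.token ("<" ++ name ++ ">")
        |. Parser.spaces
        |= inner
        |. Parser.spaces
        |. Parser.token ("</" ++ name ++ ">")


-- Simplistic text content: a run of letters and digits.
word : Parser String
word =
    Parser.getChompedString (Parser.chompWhile Char.isAlphaNum)


-- Tag parsers nest by passing one as the inner parser of another.
example : Result (List Parser.DeadEnd) String
example =
    Parser.run
        (tag "EntityType" (tag "Property" word))
        "<EntityType><Property>foo</Property></EntityType>"
```

Attributes, self-closing tags and repeated children would all need more work on top of this.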

“Until somebody found a way to automatically generate decoders from XSD”

I’m going to need this in order to finish my project. I don’t particularly want to write a full decoder for the OData metadata files by hand.

If I get that far and if nobody else implements it first, I’ll try doing it myself, but I’m still learning Elm and my first attempt will probably be a dog’s breakfast.

For now I’ll work on other parts of my application and come back to XML parsing later.
