ProtoBuf decoding/encoding via Bytes.Decode.width?


#1

Similar to Google’s ProtoBuf, the OpenAPI Specification (OAS) defines a standard, programming language-agnostic interface for data interchange. ProtoBuf uses .proto files, OAS uses YAML files. There is an OpenAPI Generator that can be used to generate server stubs, client libraries or documentation for several languages. As the OAS Generator’s template creator for Elm (See example) I’ve always been intrigued by the ability to generate Models & Requests automatically. This both saves me from write encoders/decoders manually and makes it impossible to break the contract between server and client.

As I got inspired by Evan’s vision for data interchange and Elm now supports processing bytes, I started playing around and trying to parse ProtoBuf messages in Elm. My current API is mostly inspired by the existing Elm JSON encoder and decoder packages. I am running into an issue concerning decoding.

In contrast to decoding JSON, my data is not stored in a clear (JSON) tree structure:

  • The order of the fields of the record I am decoding to may be different to the ordering of the field numbers (i.e. the ordering of data in the Bytes) as defined in the .proto;
  • When decoding a repeated int32 I do not know in advance how many int32s I will find in the bytes sequence. Those are not nicely put together like in JSON but are stored sequentially.

Therefore, I start my decoding process by first creating a Dict Int (List Bytes). The Int contains the field number and the list contains all the matching chunks of Bytes I run into while processing the whole range of Bytes. For this I use Bytes.Decode.loop as I keep filling the Dict until I run out of Bytes. And that’s currently my main issue. I do not know when to stop the loop. When I (with some workarounds) have generated this Dict, I use it to run field decoders on the stored chunk(s) based on each field number.

Here a some core functions to summarize the approach a bit:

  • All Bytes are cut into pieces and collected in type alias Chunks = Dict FieldNumber (List Bytes), where type alias FieldNumber = Int;
  • message : a -> Decoder (Chunks -> a) is used the start decoding into type a;
  • field : FieldNumber -> FieldDecoder a -> Decoder (Chunks -> a -> b) -> Decoder (Chunks -> b) is used for decoding additional fields using any of the FieldDecoders:
    • float : FieldDecoder Float
    • string : FieldDecoder String
    • int32 : FieldDecoder Int
    • repeated : FieldDecoder a -> FieldDecoder (List a)
    • etc. [edit: outdated]

These could be used to take a .proto file

syntax = "proto3";

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

and to generate

import Bytes exposing (Decoder)
import ProtoBuf.Decoder exposing (field, int32, message, string, toDecoder)

type alias SearchRequest =
    { query : Maybe String
    , page_number : Maybe Int
    , result_per_page : Maybe Int
    }

searchRequestDecoder : Decoder SearchRequest
searchRequestDecoder =
    message SearchRequest
        |> field 1 string
        |> field 2 int32
        |> field 3 int32
        |> toDecoder

However, to make this work I need the map the last field's Decoder (Chunks -> b) output to a normal Decoder b that can be passed to e.g. Http.expectBytes. That is what toDecoder : Decoder (Chunks -> a) -> Decoder a should do. To be able to generate the Chunks I need to know when to end the Bytes.Decode.loop. This would be easy if I knew the number of Bytes I initially receive. As far as I know there is no way to do this with the current elm/bytes. I could of course rewrite ProtoType.Decoder only to process Bytes -> a (and use e.g. Http.expectBytesResponse) but that doesn’t feel like a clean solution to me. Also it is very inconsistent with other libraries. Am I missing some obvious alternative solution here or should I aim for requesting an additional Bytes.Decode.width : Bytes.Decoder Int? Thanks for sharing your thoughts!

To clarify: I am currently working on getting the encoding/decoding to work. Eventually these encoders/decoders should be generated automatically based on the provided .proto file(s).


#2

I’m not following what you’re asking. Is it about figuring out how many repeated things there are at decode time? If I gave you a byte sequence, could you tell me how many repeated int32 are in there? The solution in the docs for strings is to use a length field, encoding an int before the repeated part. Another common solution is to use a termination character, like null in C, and then hopefully escaping any inline occourances of 0x00, like other formats do with backslashes.


#3

Thanks for your reply. I have no issues with the ProtoBuf encoding itself (it is explained in detail here). Every encoded field starts with a ‘tag’ describing what field number it is and how it has been encoded (which allows me to determine to the length of the chunk).
The issue is that I have no idea how many chunks there are. I do know how many fields there are to be decoded. However,

  • the server may send more fields than I am interested in;
  • if it is a repeated field, I do not know the number of occurrences.

So my issue is that I somehow need to know the width of the Bytes the decoder receives. Or at least I need to know when I am out of bytes.


#4

Well, the byte width is easy, https://package.elm-lang.org/packages/elm/bytes/latest/Bytes width. If you know the number of elements in the list, you can Loop https://package.elm-lang.org/packages/elm/bytes/latest/Bytes-Decode#loop until you’ve seen the number of elements you expected.


#5

Right. Unfortunately, to get the Bytes to feed to Bytes.width in a decoder I’d have to use Bytes.Decode.bytes first which needs the width as an argument… It’s kind of a chicken-egg problem.


#6

Well, do you know the length of that field or not? :slight_smile: We can’t use the size of the byte stream as our delimiter, because we don’t know if we’ll get more bytes if we just wait a little longer.


#7

No, I do not. A Decoder “turns a sequence of bytes into a nice Elm value”. How can I ever create an Elm value if the stream of Bytes never ends? Theoretically the stream could be endless, but I do not think that’s the case here.


#8

Cool, then we need a way to figure out how long a field is. https://package.elm-lang.org/packages/elm/bytes/latest/Bytes-Encode#strings suggests one approach, I’m sure protobuf itself has something specified for this. Maybe the start position of the next field is what delimits the end of the previous field?


#9

The ProtoBuf docs describe exactly how long each field is. For example,

message Message {
  optional string foo = 1;
  repeated int32 bar = 3;
}

may be received as

0a 07 74 65 73 74 69 6e 67 10 90 02 18 96 01 18 96 01

where

  • 0a = field 1 (length-delimited), 07 = length of 7 bytes, 74 65 73 74 69 6e 67 = “testing” (string);
  • 10 = field 2 (varint), 90 02 = 272 (int32);
  • 18 = field 3 (varint), 96 01 = 150 (int32);
  • 18 = field 3 (varint), 96 01 = 150 (int32).

So basically we received { foo = "testing", bar = [ 150, 150 ] } here. The value 272 is ignored as it is not (any more) part of the spec.

Point is, I do not know how many bar fields are sent. Also I do not know whether I should expect unknown fields like the 272 (at the end of the stream). For this I need to know the full width.

The official official documentation states:

The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own."

I can get hold of the width when receiving the Bytes via HTTP by using Http.expectBytesResponse, but I think it makes sense for protocol buffers (and probably other decoders as well) to be able to define a Bytes.Decode.Decoder that can be used directly in Http.request via Http.expectBytes.


#10

Ah, I see. Unfortunately, the underlying Bytes data structure and offset into it are hidden inside the Decoder. Possibly to support streaming data. Unless you ask Evan to expose that information, or ask the user to measure it themselves, it’s not possible to do this.


#11

Yes I noticed that. Just wanted to know for sure I didn’t overlook anything. I guess more people will run into this, depending on the encoding they use.

@evancz Given my last post, how do you feel about exposing the (remaining) width via e.g. Bytes.Decoder.width?


#12

Great to see that you have started this work! I have just started looking at protocol buffers as well, although from a limited scope of hand written encoders/decoders to start off with.

I was planning to take a two pass approach to the decoding where I would first decode the Bytes into a list of key-value pairs, where field id would be the key. Then in the second pass go look for the message fields in the list and translate into the final internal Elm representation. For repeated fields it would be a matter of concatenating all the values found with that key, if any is present. This will also handle the fact that repeated fields may be interspersed between other fields. (For non repeated values I would need to take the last occurrence, as stated in the spec). For embedded messages I would repeat the two-phase parsing on the associated byte list.


#13

That’s exactly the approach I’m following. It works very well, though I am currently running into a bug when building the dictionary for nested messages.
By the way, I plan on releasing it as a package, so that could save you some time :wink: