Decoding Bytes with offsets

This is in particular about decoding OpenType fonts, where data is not always encoded sequentially: byte 100 can contain data that is needed to properly decode byte 20. For instance, byte 100 starts
a uint16 that defines the length of an array starting at byte 20. Currently, it is hard (maybe impossible) to move to byte 100, decode the length, then move back to decode from byte 20.

This might be a quirk of OpenType (1), but I suspect that there are more cases where it would be useful to be able to work with offsets when decoding Bytes.

If this is indeed common, that means that decoding binary data is not really a sequential process. Currently,
binary decoders and json decoders move from the beginning to the end of the input. Decoding bytes with offsets means jumping through the input to do decoding.

That seems much more complex, so I’m trying to be really cautious. Can we find a nice way to decode this kind of structure? Are offsets actually commonly used in binary protocols?

Some more context

I’m not particularly knowledgeable about fonts or binary protocols. My eventual goal (which a bunch of folks in the SVG/visualization space have also been thinking about and working on) is to use font information for smart layout of SVG text, for instance smart label positioning in visualizations and maybe an elm-ui-like layout mechanism. I believe font rendering with WebGL also interests some.

The OpenType spec defines a table at the start of the file that lists the other tables, each with its starting position given as a byte offset from the start of the file.
E.g. table “cmap” starts at byte 6234, table “OS/2” starts at byte 7134, etc.
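
To make that layout concrete, here is a minimal sketch of a decoder for the table directory of a single-font file, written against elm/bytes. It assumes the layout from the OpenType spec: a 12-byte header (uint32 version, uint16 table count, three uint16 search hints) followed by 16-byte records (4-byte tag, then uint32 checksum, offset, and length), all big-endian. The names TableRecord, tableDirectory, and list are made up for this example.

```elm
import Bytes exposing (Endianness(..))
import Bytes.Decode as Decode exposing (Decoder, Step(..))


-- One directory entry: a four-character tag plus where the table
-- lives in the file. Field names are our own choice.
type alias TableRecord =
    { tag : String
    , checksum : Int
    , offset : Int
    , length : Int
    }


tableRecord : Decoder TableRecord
tableRecord =
    Decode.map4 TableRecord
        (Decode.string 4)
        (Decode.unsignedInt32 BE)
        (Decode.unsignedInt32 BE)
        (Decode.unsignedInt32 BE)


-- Read the version (ignored) and the table count, skip the three
-- uint16 search hints (6 bytes), then decode that many records.
tableDirectory : Decoder (List TableRecord)
tableDirectory =
    Decode.map2 (\_ numTables -> numTables)
        (Decode.unsignedInt32 BE)
        (Decode.unsignedInt16 BE)
        |> Decode.andThen
            (\numTables ->
                Decode.bytes 6
                    |> Decode.andThen (\_ -> list numTables tableRecord)
            )


-- Decode exactly `count` items with the same decoder.
list : Int -> Decoder a -> Decoder (List a)
list count decoder =
    Decode.loop ( count, [] )
        (\( remaining, acc ) ->
            if remaining <= 0 then
                Decode.succeed (Done (List.reverse acc))

            else
                Decode.map (\item -> Loop ( remaining - 1, item :: acc )) decoder
        )
```

This part is still sequential: the directory sits at the start of the file and can be read front to back. The offsets only become a problem once the tables themselves have to be visited.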

The most difficult problem is decoding the hhea (horizontal header) and hmtx
(horizontal metrics) tables. Tables can be stored in any order, so hmtx may occur before hhea, yet it is hhea that specifies how many metric records hmtx contains. While a hacky solution might be possible, it would be extremely fragile.
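
To make the dependency concrete, here is a rough sketch that assumes both tables have already been cut out of the file as separate Bytes values (how to get them is a separate question). It relies on the layout in the OpenType spec: hhea is 36 bytes and ends with a uint16 numberOfHMetrics at offset 34, and hmtx starts with that many records of uint16 advance width plus int16 left side bearing. The trailing array of extra left side bearings, whose length depends on the glyph count in the maxp table, is left out. All names below are made up for this example.

```elm
import Bytes exposing (Bytes, Endianness(..))
import Bytes.Decode as Decode exposing (Decoder, Step(..))


type alias HorizontalMetric =
    { advanceWidth : Int
    , leftSideBearing : Int
    }


-- hhea is 36 bytes long; numberOfHMetrics is its last field,
-- a uint16 at offset 34.
numberOfHMetrics : Decoder Int
numberOfHMetrics =
    Decode.bytes 34
        |> Decode.andThen (\_ -> Decode.unsignedInt16 BE)


-- One record: uint16 advance width, then int16 left side bearing.
horizontalMetric : Decoder HorizontalMetric
horizontalMetric =
    Decode.map2 HorizontalMetric
        (Decode.unsignedInt16 BE)
        (Decode.signedInt16 BE)


-- Decode exactly `count` records from the start of hmtx.
horizontalMetrics : Int -> Decoder (List HorizontalMetric)
horizontalMetrics count =
    Decode.loop ( count, [] )
        (\( remaining, acc ) ->
            if remaining <= 0 then
                Decode.succeed (Done (List.reverse acc))

            else
                Decode.map (\metric -> Loop ( remaining - 1, metric :: acc ))
                    horizontalMetric
        )


-- Read the count from hhea first, then decode hmtx with it.
decodeHmtx : Bytes -> Bytes -> Maybe (List HorizontalMetric)
decodeHmtx hhea hmtx =
    Decode.decode numberOfHMetrics hhea
        |> Maybe.andThen
            (\count -> Decode.decode (horizontalMetrics count) hmtx)
```

Once the two tables are separate values, their order in the file no longer matters. The hard part is the step before this one, getting from offsets in the file to those two Bytes values.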

notes

(1): This section at the bottom mentions that the use of offsets allows sharing of data between multiple
fonts in the same file. So maybe this style of encoding is specific to OpenType.

(2): Offsets might be related to slices (for which a use case is described in this thread), but according to MDN a slice will copy (and thus allocate), which is not really required here.


elm/bytes provides Bytes.Decode.bytes which lets you decode into another Bytes instance. So you could do something like…

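import Bytes exposing (Bytes, Endianness(..))
import Bytes.Decode as Decode exposing (Decoder)

type alias InterimData =
    { tableRegion : Bytes
    , tableLength : Int
    }

-- placeholder: replace with whatever type the table decodes to
type alias Table =
    List Int

-- keep the first 99 bytes as an opaque Bytes value,
-- then read the uint16 that follows them
initialDecoder : Decoder InterimData
initialDecoder =
    Decode.map2 InterimData
        (Decode.bytes 99)
        (Decode.unsignedInt16 BE) -- OpenType stores integers big-endian

tableDecoder : Int -> Decoder Table
tableDecoder size =
    -- decode the table using the size retrieved earlier
    -- (this decoder should be run on the tableRegion field of the decoded InterimData)
    Debug.todo "decode `size` entries from the table region"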

I have worked with a couple of byte-encoded formats, and most of them try to keep the process ‘sequential’, in that the data at the beginning tells you how to interpret the later data, exactly because this means that you do not have to jump (back!) through the data. But this still means that the binary data is essentially a flat representation of a (variable-length) tree, so an andThen-style function is probably necessary (elm/bytes already has this).

If OpenType is indeed an odd format where later bytes explain earlier bytes, then we’d need to perform a ‘preliminary’ decoding of some parts (like ‘byte 100’ in your example), so that we have enough information to decode the other parts. I think that this is also possible with andThen, but I have no idea how it would perform or how readable the resulting code would be.
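
For what it’s worth, here is a rough sketch of what such a ‘preliminary’ decoding could look like as one Decoder built with andThen. The layout is the made-up one from the question, read with zero-based offsets: a uint16 element count at offset 100 and an array of uint16 values at offset 20. The element type and the names arrayWithLateLength, skipThen, and uint16s are assumptions for illustration.

```elm
import Bytes exposing (Endianness(..))
import Bytes.Decode as Decode exposing (Decoder, Step(..))


-- Keep the first 100 bytes as an opaque region, read the count that
-- follows them, then go back and decode the array inside the region.
arrayWithLateLength : Decoder (List Int)
arrayWithLateLength =
    Decode.map2 Tuple.pair
        (Decode.bytes 100)
        (Decode.unsignedInt16 BE)
        |> Decode.andThen
            (\( region, count ) ->
                -- Second pass: run another decoder over the bytes we kept.
                case Decode.decode (skipThen 20 (uint16s count)) region of
                    Just values ->
                        Decode.succeed values

                    Nothing ->
                        Decode.fail
            )


-- Ignore `offset` bytes, then continue with `decoder`.
skipThen : Int -> Decoder a -> Decoder a
skipThen offset decoder =
    Decode.bytes offset
        |> Decode.andThen (\_ -> decoder)


-- Decode exactly `count` big-endian uint16 values.
uint16s : Int -> Decoder (List Int)
uint16s count =
    Decode.loop ( count, [] )
        (\( remaining, acc ) ->
            if remaining <= 0 then
                Decode.succeed (Done (List.reverse acc))

            else
                Decode.map (\value -> Loop ( remaining - 1, value :: acc ))
                    (Decode.unsignedInt16 BE)
        )
```

The jump back is expressed as a nested Decode.decode on the region that was kept, so the outer decoder still only moves forward. Whether this stays readable for a format with many such back-references is exactly the open question.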

I was curious if something like seek would be needed, but it seemed like a recipe for major weirdness.

Looking at link (1) from your post I am seeing:

If the font file contains only one font, the Offset Table will begin at byte 0 of the file.

If the font file is an OpenType Font Collection file, the beginning point of the Offset Table for each font is indicated in the TTCHeader.

The way I am reading this, it is not 100% clear that you have to jump to get to the TTCHeader. It’s just not mentioned in that link!

As to sharing information, I think that is something you can handle in the decoder. For example, rather than building the final representation directly, you may need the decoder to produce an intermediate representation, and then have a second phase to use header information to turn it into the full font information.
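
A minimal sketch of that two-phase idea, assuming the directory entries (tag, offset, length) have already been decoded: phase one only cuts the file into raw per-table Bytes, and a second phase would then decode each table with whatever header information it needs. TableRecord, sliceTable, and rawTables are hypothetical names.

```elm
import Bytes exposing (Bytes)
import Bytes.Decode as Decode
import Dict exposing (Dict)


-- Hypothetical directory entry: table tag plus position in the file.
type alias TableRecord =
    { tag : String
    , offset : Int
    , length : Int
    }


-- Cut one table out of the file by skipping `offset` bytes and
-- keeping the next `length` bytes.
sliceTable : Bytes -> TableRecord -> Maybe Bytes
sliceTable file record =
    Decode.decode
        (Decode.bytes record.offset
            |> Decode.andThen (\_ -> Decode.bytes record.length)
        )
        file


-- Phase 1: the intermediate representation, raw bytes keyed by tag.
-- Records whose range falls outside the file are left out.
rawTables : Bytes -> List TableRecord -> Dict String Bytes
rawTables file records =
    records
        |> List.filterMap
            (\record ->
                sliceTable file record
                    |> Maybe.map (\table -> ( record.tag, table ))
            )
        |> Dict.fromList
```

Phase two can then look tables up in any order, for example Dict.get "hhea" before Dict.get "hmtx", regardless of where they sit in the file. Two records pointing at the same offset simply yield the same bytes twice, which would also cover data shared between fonts.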

My path forward here would be to gather a bunch of .ttf files and see if you can figure them out by reading the hex yourself. From there, it’ll be clearer how it relates to bytes decoders. I am personally interested in learning more about this topic, so I encourage you to share your progress and results when looking into these kinds of files!

