Experiences using elm/bytes for decoding font files

Over the past few weeks I’ve been writing an OpenType parser. I have reached my two main technical goals:

  • Get bounding boxes (width and height) of all the characters in a string.
  • “Draw” text using SVG: extract the shapes of all characters in a string and draw them in the correct locations.

Fonts are an unusual file format in an Elm context, but they are designed to be compact and quick to decode (in C-like languages). The tricks used to save space make decoding much more difficult than, for instance, JSON, and much more complex than decoding most protobuf messages. Therefore, I think fonts are a nice case study.

I previously opened a thread, where I pointed out that font files are not encoded in a linear way (and thus cannot be decoded linearly). Since then I’ve learned that this is not that much of a problem. I’ve also collected some notes/thoughts on the elm/bytes API.

Deferred parsing

A quick sketch of the problem: A font file contains a header and a set of tables. The header specifies the offset to the beginning of each table. Linear decoding (like we do with JSON) is impossible because the decoding of tables that occur earlier in the file can depend on data stored later on.
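To make this concrete, here is roughly what decoding one entry of the table directory looks like (a simplified sketch, not the exact code from my parser; the record type is mine):

```elm
import Bytes exposing (Bytes)
import Bytes.Decode as Decode exposing (Decoder)

type alias TableRecord =
    { tag : String, offset : Int, length : Int }

-- One entry of the table directory: a 4-byte tag, a checksum
-- (ignored here), then the offset and length of the table
-- within the file, all big-endian.
tableRecord : Decoder TableRecord
tableRecord =
    Decode.map4 (\tag _ offset length -> TableRecord tag offset length)
        (Decode.string 4)
        (Decode.unsignedInt32 Bytes.BE)
        (Decode.unsignedInt32 Bytes.BE)
        (Decode.unsignedInt32 Bytes.BE)
```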

But there is a more important reason that linear decoding is undesirable (one that I didn’t see clearly back then): most of the time, we’re not interested in the whole font, but only in certain parts.
For instance, a font can store a lot of information about Arabic or Chinese scripts that is not at all relevant for Latin scripts. Decoding the whole file up-front can thus be quite wasteful.

The solution to both problems is to not parse the whole file, but to store chunks of Bytes that are only decoded when needed. How to do the slicing is a bit tricky to figure out. In particular, I discovered after quite a while that Decode.string can consume more bytes than it was told to if the final byte is the start of a multi-byte UTF-8 character. This might make sense in most cases, but it’s good to be aware of (and should maybe be mentioned in the docs). There is also currently a bug with offsets into sliced buffers that I had to patch myself in the local Kernel code.
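The slicing itself can be done with plain decoders: skip to the table’s offset, then capture its raw bytes. A minimal sketch, reusing the TableRecord from above:

```elm
-- Extract the raw Bytes of one table, to be decoded later on demand.
-- `Decode.bytes offset` skips everything before the table,
-- `Decode.bytes length` captures the table itself.
sliceTable : Bytes -> TableRecord -> Maybe Bytes
sliceTable file { offset, length } =
    Decode.decode
        (Decode.map2 (\_ table -> table)
            (Decode.bytes offset)
            (Decode.bytes length)
        )
        file
```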

The non-linear decoding means passing Bytes values into the decoders, which can become confusing. I think we can make this process nicer by borrowing some features from elm/parser, in particular getSource in combination with a Bytes.slice.

Working with offsets

A specific annoyance is decoding a sequence of elements of unequal size (e.g. an array of a custom type, like Array (Maybe Int)). In particular, in Type 2 charstrings (see below), often only the length (in bytes) is known, not the number of elements. To decode such a sequence, you need to keep careful track of how many bytes have been read in order not to step out of bounds. This is messy and error-prone.

A quick introduction to Type 2 charstrings: they are sequences of bytes that encode the drawing operators of characters (moveto, lineto, curveto). Bytes with a value below 32 (as unsignedInt8) encode operators; other values encode the arguments, with a shifting scheme so that the values 0…31 can still occur as arguments.
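To give a flavor of that scheme, here is roughly what a (simplified) operand decoder looks like. The real encoding also has a two-byte integer (code 28) and a fixed-point format (code 255), which I leave out here:

```elm
import Bytes.Decode as Decode exposing (Decoder)

-- Decode one operand, given its first byte b0 (already read).
-- Values 32..246 encode small numbers directly; 247..254 pull in
-- a second byte to reach larger (or negative) values.
operand : Int -> Decoder Int
operand b0 =
    if 32 <= b0 && b0 <= 246 then
        Decode.succeed (b0 - 139)

    else if 247 <= b0 && b0 <= 250 then
        Decode.map (\b1 -> (b0 - 247) * 256 + b1 + 108) Decode.unsignedInt8

    else if 251 <= b0 && b0 <= 254 then
        Decode.map (\b1 -> -(b0 - 251) * 256 - b1 - 108) Decode.unsignedInt8

    else
        Decode.fail
```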

A CFF table contains an array of such charstrings (one for every character, and a bunch more). Their start and end positions are known, so the widths (in bytes) can be calculated. But when decoding, there is no (or at least, not always a) delimiter that marks the end of a charstring. The only way to know when to stop is to meticulously keep track of how many bytes have been read so far. This pattern pops up in a couple of places.
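The bookkeeping can at least be packaged up once with Decode.loop. A sketch of the pattern I mean, assuming every element decoder also reports its own width in bytes (it has to, because there is no way to ask elm/bytes for the current offset):

```elm
import Bytes.Decode as Decode exposing (Decoder, Step(..))

-- Decode elements until exactly `size` bytes have been consumed.
sized : Int -> Decoder ( Int, a ) -> Decoder (List a)
sized size element =
    Decode.loop ( size, [] )
        (\( remaining, acc ) ->
            if remaining <= 0 then
                Decode.succeed (Done (List.reverse acc))

            else
                Decode.map
                    (\( width, value ) -> Loop ( remaining - width, value :: acc ))
                    element
        )
```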

Again, borrowing from elm/parser would be nice: a getOffset (which can be run before and after a decoder) would make it easier to keep track of how many bytes are consumed.
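To be clear, getOffset does not exist in elm/bytes today; the sketch below only shows what the proposal could look like. With it, the width bookkeeping could be derived instead of threaded through by hand:

```elm
import Bytes.Decode as Decode exposing (Decoder)

-- Hypothetical: an elm/parser-style getOffset for elm/bytes.
getOffset : Decoder Int
getOffset =
    Debug.todo "not part of elm/bytes today"

-- Run a decoder and also report how many bytes it consumed.
withWidth : Decoder a -> Decoder ( Int, a )
withWidth decoder =
    Decode.map3 (\before value after -> ( after - before, value ))
        getOffset
        decoder
        getOffset
```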

Errors are hard to figure out

Again I think elm/parser handles this really well. I see two causes for a decoder to fail:

  • Decoding reads past the end of the Bytes
  • The programmer throws an error. For instance, we expect some flag value to be below 3, but it is 42

In both cases I would like to know what I was parsing. A mechanism like elm/parser’s inContext would work well, I think.
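Today, both causes collapse into the same result: Decode.decode returns Nothing, with no trace of where or why the decoder failed. The flag example from above, as it has to be written now:

```elm
import Bytes.Decode as Decode exposing (Decoder)

-- `Decode.fail` carries no message, so at the top level all
-- we ever learn is that the overall decode produced Nothing.
checkFlag : Decoder Int
checkFlag =
    Decode.unsignedInt8
        |> Decode.andThen
            (\flag ->
                if flag < 3 then
                    Decode.succeed flag

                else
                    Decode.fail
            )
```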

Writing tests for binary decoding is hard

I suppose this is partially inherent to binary data, but with the current elm-test, test inputs have to be hand-crafted because elm-test cannot load a file during testing. (I think some mechanism to inline and decode a JSON/HTML/WebGL/binary file at compile time would be really neat and useful, but that’s a different topic.)
If anyone has ideas on how to test such a parser, I’d love to hear them.
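For the record, hand-crafting an input with Bytes.Encode looks something like this (the two-value mini-format here is invented purely for illustration):

```elm
import Bytes exposing (Bytes)
import Bytes.Decode as Decode
import Bytes.Encode as Encode
import Expect
import Test exposing (Test, test)

-- A hand-crafted input: an element count followed by two
-- big-endian uint16 values.
input : Bytes
input =
    Encode.encode
        (Encode.sequence
            [ Encode.unsignedInt16 Bytes.BE 2
            , Encode.unsignedInt16 Bytes.BE 500
            , Encode.unsignedInt16 Bytes.BE 1000
            ]
        )

suite : Test
suite =
    test "reads the element count" <|
        \_ ->
            Decode.decode (Decode.unsignedInt16 Bytes.BE) input
                |> Expect.equal (Just 2)
```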

Injected HTML with elm reactor

When serving a non-{html, elm, js, css} file, elm reactor will inject a piece of HTML. When working with binary files this is very annoying, because we have to strip that piece off again. The injected snippet contains the filename, so its size is dynamic and we cannot simply drop a fixed number of bytes. It would be nice if elm reactor could serve the files unchanged.

Conclusion

The current elm/bytes API works well in general, and can (with some tricks) be used to decode complex binary files. There are a couple of situations where I think the API can borrow from elm/parser to make binary decoders more pleasant to write.

Links

  • elm-cff is the CFF parser. I’m still in the process of polishing it now that it works. API design for this kind of large file format has been hard, so it’ll likely be a while before I can really release this.