Publishing a package with API documentation >512kb

I am trying to publish an update to an elm package (http://package.elm-lang.org/packages/gicentre/elm-vega/latest) but the upload fails with:

Version number 3.0.0 verified (MAJOR change, 2.3.1 => 3.0.0)
Error: The following HTTP request failed.
<http://package.elm-lang.org/register?elm-package-version=0.18&name=gicentre%2Felm-vega&version=3.0.0>

A web handler threw an exception. Details:
File upload exception: Internal error! UploadedFiles: opened new file with pre-existing open handle

Thanks to @ianmackenzie via Slack we have isolated the problem as being caused by the file documentation.json being 682kb which is greater than the (I think undocumented) file size limit of 512kb for any single file in a published package.

I am really at a loss as to what to do. It is quite important that I am able to publish the package, and the documentation is central to its usability. Other packages are dependent on the structure (e.g. https://youtu.be/K-yoLxnm95A) so it would be impractically disruptive to break it into smaller separate packages.

Ideally I’d like to see a higher file size limit, but don’t know how best to lobby for this (I’ve pinged Evan, but he’s a busy man), or if there are alternatives (e.g. compressed, the documentation.json weighs in at 112kb). Can anyone suggest how best to proceed?

For info, the version I am trying to publish (v.3.0.0) is at https://github.com/gicentre/elm-vega

I would have thought that the documentation.json file was generated by the server after it had downloaded your Elm sources. Do you have a pre-generated documentation.json, or is it actually generated somehow by elm before being downloaded by the package server?

Thanks for the suggestion. I had assumed that even though there is a local copy of documentation.json on the repo, the package manager on the server generates a server side copy and that is what is leading to the error. To be sure I removed it locally and from the repo, but I still get the same error. I should add that I have published the package updates several times previously without problem.

There are files on the repo >512kb but these are supplementary files (e.g. datasets for testing etc.). With the exception of documentation.json, none of the core .elm files exceed that limit.

I just had a look at your release.sh. It appears you’re cleaning the docs/ folder anyway and publishing with a detached commit. Just to be sure, I’d add the v... directories to the cleaning list: vExamples vTest-gallery vTests.

Thanks for spotting that. I’ve updated release.sh to clean all files and keep things tidy. However the problem remains unfortunately.

I donā€™t know exactly how the package upload process works, but it definitely does involve uploading a copy of the generated documentation.json, so if that ends up too big then publishing will currently fail.

Are you sure it would be too disruptive to split the package into elm-vega and elm-vega-lite as a workaround for now? If the Vega module was in one package and the VegaLite package was in another, then anyone using the current package would have to update their elm-package.json to additionally include elm-vega-lite, but none of their code would have to change (imports etc. would be exactly the same since the module names would be the same).

Does anyone know how other package managers handle this? What do PyPi, NuGet, NPM, Hackage, RubyGems etc. do? I’ve been meaning to do some research there myself but haven’t managed to yet…

I would guess this is more likely due to a web server configuration. The file upload maximum size is usually quite low by default or it has been set to this limit.

Are you sure it would be too disruptive to split the package into elm-vega and elm-vega-lite as a workaround for now?

I may be forced to if there is no other solution, but it is problematic. I am wary of “for now” solutions with published APIs as this may lead to further breaking changes in the future which does erode users’ confidence in the package(s). In my case, I have already published journal articles and presented at conferences referring to the package as ‘elm-vega’ so a change now, especially where there would remain a package still called ‘elm-vega’ but now with different functionality, isn’t ideal.

I guess what it would be useful to know is whether a solution is as simple as changing some web server configuration and there is a will to do so, or whether by policy, large elm packages are actively discouraged and I should therefore restructure my packages.

The package.elm-lang.org website is public on github:

In the 0.18 branch, the snap file upload policy is configured in src/backend/Routes.hs:

uploadFiles :: FilePath -> Snap ()
uploadFiles directory =
    handleFileUploads "/tmp" defaultUploadPolicy perPartPolicy (handleParts directory)
  where
    perPartPolicy info =
      if Map.member (partFieldName info) filesForUpload then
        allowWithMaximumSize $ 2^(19::Int)
      else
        disallow

This is configured to 2^19, which is incidentally 512 KiB.

This might be the cause, but I am not sure that this is where your upload fails. I also have no idea whether this is intentional or not (I did not find anything in the comments or git log).
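To make the arithmetic concrete, here is a minimal Python sketch of a check an author could run before publishing. It assumes the cap is exactly 2^19 bytes per uploaded file, as the snippet above suggests, and that a file of exactly that size is still accepted (the snippet does not settle this). `UPLOAD_LIMIT_BYTES` and `exceeds_upload_limit` are hypothetical names, not part of any Elm tooling.

```python
import os
import tempfile

# Assumption: the per-file cap is 2^19 bytes, as in the Routes.hs snippet above.
UPLOAD_LIMIT_BYTES = 2 ** 19  # 524288 bytes = 512 KiB


def exceeds_upload_limit(path, limit=UPLOAD_LIMIT_BYTES):
    """Return True if the file at `path` is larger than `limit` bytes."""
    return os.path.getsize(path) > limit


if __name__ == "__main__":
    # Demonstrate with a throwaway file rather than a real documentation.json.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"x" * (UPLOAD_LIMIT_BYTES + 1))
    print(exceeds_upload_limit(handle.name))  # True: one byte over the cap
    os.remove(handle.name)
```

Pointing `exceeds_upload_limit` at a locally generated documentation.json would flag the problem before the upload is attempted, provided the server really does apply this limit to that file.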

@brian and @rtfeldman, I wonder if you have any views on this issue. It would appear this is a matter of policy decision on maximum file sizes rather than a technical one on how to publish packages.

I do think there should be limits on the size of things uploaded to the server, and I think it is worth picking a path that’s better than “oh, someone hit the limit, let’s just bump it and hope it’s fine.” This is the conclusion we came to when @ianmackenzie ran into this. One path I like is storing things gzipped on the server. The limit can stay the same, but there is more leeway.
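How much leeway gzip gives depends on how repetitive the file is (the first post reports 682kb shrinking to 112kb). A rough way to measure it locally, sketched in Python with the standard `gzip` module; the sample payload is a made-up stand-in for a documentation.json, and `raw_and_gzipped_sizes` is a hypothetical helper name:

```python
import gzip
import json


def raw_and_gzipped_sizes(data: bytes):
    """Return (raw_size, gzipped_size) in bytes for `data`."""
    return len(data), len(gzip.compress(data))


if __name__ == "__main__":
    # Hypothetical stand-in: many entries whose comments are nearly identical.
    docs = [
        {"name": f"fn{i}", "comment": "This is equivalent to the GeoJson geometry point type."}
        for i in range(1000)
    ]
    payload = json.dumps(docs).encode("utf-8")
    raw, packed = raw_and_gzipped_sizes(payload)
    print(f"raw: {raw} bytes, gzipped: {packed} bytes")
```

To try it on a real file, read it in binary mode and pass the bytes to `raw_and_gzipped_sizes`. The result only describes what gzip would achieve; it says nothing about how the package server stores files today.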

On the topic of gzip, I noticed that there is quite significant redundancy in the documentation of this particular library. For example:

Specify point geometry for programmatically creating GeoShapes. This is equivalent to the GeoJson geometry point type in the GeoJSON specification.

It looks like there are tons of functions with documentation like this. I cannot really tell if it is written by a person or generated. Point is, I wonder if you can say it in a different way. Perhaps:

Create the equivalent of a point.

I understand that is not ideal for you, but making changes to the server and asset size limits will take some time and consideration, so making doc changes is a path that can unblock you in the meantime. In this particular case, I think it could be nice anyway.

Thanks for considering this. gzipping document files (which are quite verbose even when the doc comments themselves have no redundancy) seems like a good way forward.

A disadvantage though, related to the problem I now have, is that it is not obvious to the developer of a library, what the practical limit is. And more significantly it feels like API design becomes partly driven by what is ultimately an arbitrary package server constraint and one that is currently not obvious to the developer. If I had known this was going to be an issue at the outset, I would have probably kept the two modules in separate packages. To be clear, I am not criticising what you have in place, but reflecting on how I (and possibly others) have ended up here.

As for my documentation style, it was all written by a person (me), not autogenerated [embarrassed smiley]. One of elm-vega’s use cases is to teach students declarative visualization so it was important for the documentation to be self-standing. Making it an index of cross-references I think would hinder the learning process. Most of the (frequent) links in the documentation are to descriptions of the JSON schemas (Vega and Vega-Lite), but if this was the primary source of documentation, users would have to do the mental transformation from JSON to Elm parameterised functions which would lead to confusion.

In terms of the path you choose for possible changes to the server limit, do you have an approximate idea of the time scale you are considering? This will likely influence how I proceed.


Optimizing the storage implementation here may be easy, or it may be hard. We just don’t know yet, and there are a lot of implementation details to consider so that it works reasonably well for everyone and doesn’t add an undue amount of load to the server. Plus, this is not the priority at the moment; it will probably have to take place after the release of 0.19.

Given all that, and the fact that your case is a real outlier (the largest documentation.json in one dump I have is 360kb, and the vast majority are 25kb or smaller) the best way forward for you will be to make your docs smaller. I know this is not what you wanted to hear, but I don’t want you to be blocked either! Keep in mind that the documentation format in the upcoming release is more efficient, so you’ll have a little more headroom when that happens. So, let’s talk about concrete ways to do that without compromising your goals!

I want to start off by saying that I think these are excellent docs in general. Thank you for caring enough to take the time to write them. :heart:

I don’t think educational and succinct are necessarily at odds, but self-standing and cross-referencing might be. Do you want to teach your students declarative visualization in general, or vega-lite specifically? Right now, these docs explain a little but then send the reader to the vega-lite docs for almost everything. It’s useful as a refresher, and very searchable, but as someone unfamiliar with the source material it already reads as a cross-reference to me.

I see a couple optimizations without changing the overall structure:

  • Move most of the links in individual functions, especially those which are duplicated, under their section headers. For example: For details see the [Vega-lite projection documentation](https://vega.github.io/vega-lite/docs/projection.html#properties). du says this is 4k per occurrence. At 16 instances, removing these would save 64k. You’d regain a little bit of that by adding the headers, but you’d probably still see a net loss.
  • Remove duplicate language. For example: “This is equivalent to the GeoJson geometry multi-polygon type in the GeoJSON specification.” You can drop “in the GeoJSON specification” in these instances with no loss of information.
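The two suggestions above both come down to counting how often a phrase repeats and how many bytes each copy costs. A minimal Python sketch of that estimate follows; `bytes_saved` is a hypothetical helper, and it counts only the bytes of the phrase itself in the text it is given, which can differ from block-rounded figures reported by disk-usage tools and from the size of the same text once it is escaped inside JSON.

```python
def bytes_saved(text: str, phrase: str, keep: int = 0) -> int:
    """UTF-8 bytes saved by deleting all but `keep` occurrences of `phrase`."""
    occurrences = text.count(phrase)
    removable = max(occurrences - keep, 0)
    return removable * len(phrase.encode("utf-8"))


if __name__ == "__main__":
    # Made-up sample built from the wording quoted in the thread.
    phrase = " in the GeoJSON specification"
    sample = (
        "This is equivalent to the GeoJson geometry point type" + phrase + ". "
    ) * 3
    print(bytes_saved(sample, phrase))          # drop every occurrence
    print(bytes_saved(sample, phrase, keep=1))  # keep one, e.g. under a section header
```

Running this over the doc comments for each candidate phrase gives a ranked list of where trimming pays off most, before any text is actually rewritten.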

In terms of larger-scale optimizations, we still don’t want to make things worse! So, what would you think about moving some of your educational intent into separate guides? Docs like these must necessarily serve as a reference, and can’t always introduce concepts in the order that would work best for your students.


Thank you Brian for taking the time to find a positive way forward and to clarify the timescales / priorities that will affect progress. This does give me something to work towards, even though it is not what I would have chosen given the choice.

I am fully with you in that education and succinctness are not necessarily at odds. Indeed, briefer text, if crafted well is often more desirable. My intention is to use elm-vega as part of two undergraduate and graduate classes I am teaching in data visualization. In that context I can provide additional and separate support materials. But more generally, I am hoping others from the visualization community might be drawn towards an Elm / elm-vega approach to declarative visualization specification (there is considerable interest from the academic community in this work - see for example https://www.gicentre.net/featuredpapers/#/literate2018/ and https://youtu.be/pMHmQX3TZ8A ). But for that to happen I need to document the Elm-specific stuff. If you like, I have an interest in addressing the ‘Education’ and ‘Scientific Computing’ circles of Evan’s Elm-Europe Keynote (https://youtu.be/uGlzRt-FYto)

I think because I will need to lose 25% or more of documentation weight I am going to have to rethink what is documented rather than trimming words here and there. But perhaps by the end of the process I will have a better set of documentation pages.
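The 25% figure follows from the sizes quoted earlier in the thread (682kb against a 512kb limit). As a small worked check in Python, assuming both figures are in the same unit; `required_reduction` is a hypothetical helper name:

```python
def required_reduction(current_size, limit):
    """Fraction of `current_size` that must be removed to fit within `limit`."""
    if current_size <= limit:
        return 0.0
    return 1 - limit / current_size


print(f"{required_reduction(682, 512):.1%}")  # 24.9%
```

That is the bare minimum to squeeze under the cap; any future additions to the API would need further room on top.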

In broader terms (and no doubt you are considering this), I think the issue does raise some interesting questions about what might be best practice for Elm package design in the future and what level of scalability and modularity you would support and encourage.

Just to report that after (rather a lot of) culling and restructuring of the documentation, I have now managed to update elm-vega and publish it within the 512 kB limit.

Thanks for the help.

I would still recommend updating elm documentation format and/or API design guidelines pages to encourage brevity by authors of potentially large packages. I can put in a PR there but am not sure if this will become a redundant issue with Elm 0.19.


I agree with this. And besides updating the documentation to encourage brevity, I feel it makes sense to:

  • Mention the current maximum limit on (documentation) file sizes somewhere.
  • Make sure that the upload failure error message in case of large files is more descriptive than what @jwoLondon got before (see the first post of this topic).
