But that’s not a satisfactory solution, IMHO… A failover server, plus people with enough permissions to fix the servers in various timezones (so that folks don’t need to wait for the US to wake up), might be a step in the right direction?
Should be back as of about 8 hours ago, around when @Janiczek posted. I shared the following info on a GitHub issue about the 502.
What happened?
This has happened once before, but at a better time of day. That time, I spent the next few days setting up monit so that the server could restart itself.
In this case, it looks like the server went down at the worst possible time for a US-based admin. We had monit set up to restart the server if this ever happened, but it ran into trouble: restarting needed the HOME environment variable to be defined, and it was not defined in whatever context monit was running the restart script.
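The fix amounts to defining HOME explicitly at the point where monit invokes the script. A minimal sketch of that kind of monit entry (the names and paths here are hypothetical, not the actual server config):

```
check process package-site with pidfile /var/run/package-site.pid
  # monit does not inherit a login environment, so set HOME
  # explicitly before running the restart script.
  start program = "/bin/sh -c 'HOME=/home/admin /srv/package-site/restart.sh'"
```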
Anyway, once US work hours started again, we got the server back up and then figured out what had prevented monit from working. So the issue reported here should be resolved as of US morning.
Additionally, I have added a new admin for the site who lives in Europe. Between these two things, I think the window of possible downtime should be a lot smaller. There will be an Elm event in Japan relatively soon, so perhaps we will be able to find someone there to add, further limiting the amount of time that is not covered by somebody at work.
Making CI more resilient
For folks who had trouble with CI: with most systems it is possible to cache the ~/.elm directory so that you do not need to talk to package.elm-lang.org or github.com on normal builds; the network is only needed when a new package dependency is added.
People who had this configured did not experience any downtime on CI. I am asking community members to blog about this and make sure any default configurations out there are doing this.
We are hopeful the monit fix and the European admin will help with downtime, but configuring CI to save ~/.elm/ is a great way to be resilient to downtime elsewhere and to reduce build times in general.
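As one concrete illustration (not an official recipe), this is roughly what the cache step looks like on GitHub Actions; other CI systems have equivalent mechanisms, and the cache key name here is just a convention:

```yaml
# Cache ~/.elm so normal builds never talk to package.elm-lang.org
# or github.com; the cache only changes when elm.json changes.
- name: Cache Elm packages
  uses: actions/cache@v4
  with:
    path: ~/.elm
    key: elm-${{ hashFiles('elm.json') }}
```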
What we are doing at my company is using Docker to build everything; in particular, we run the elm compiler from a Docker container. This means we do not care which specific machine the CI runs on, as long as it has Docker installed. Since we build not only elm projects but also C++, Fortran, etc., this is quite convenient and scales well.
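Concretely, a build step of that shape looks roughly like this (the image name is illustrative):

```sh
# Compile an Elm project with nothing but Docker on the runner.
docker run --rm -v "$PWD":/project -w /project elm-build:0.19.1 \
  elm make src/Main.elm --output=elm.js
```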
We used to simply put the 0.19.1 directory under version control, which is inconvenient because it bloats the repository and dependencies are duplicated for each elm project. Then we tried removing those submodules (i.e. downloading from the internet on each fresh build), and literally four days later the elm package repository broke…
The problems I have with caching the .elm directory for CI are:
Each project has different dependencies so, eventually, your .elm directory becomes a very large subset of the whole elm package registry
We cannot execute the CI jobs on any runner (CI runners are no longer tool-agnostic)
In this thread, Evan says that he does not want to enable private repositories because it would add a lot of complexity to the package system. I think that simply enabling any git repository (not just GitHub) is sufficient: just require GitHub for published packages. This is more or less the idea developed here.
It would bring the following benefits:
Use private packages instead of git submodules
No single point of failure (the elm package repository is currently one)
During the downtime somebody suggested that a CDN could be a good option.
Given Elm’s semantic versioning of packages, it should be possible for the server to serve the assets with the Cache-Control: immutable header. The server would then only handle the initial request for an asset, with the CDN handling all requests thereafter. This would give us high availability with very little maintenance overhead.
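For instance, if the package site were fronted by nginx (a sketch under that assumption; the location path is hypothetical), the header could be added like this:

```nginx
# Versioned package assets never change, so browsers and the CDN
# may cache them forever; only the first request reaches the origin.
location /packages/ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}
```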
That looks like a great idea! I still think being able to specify any Git server, and rejecting pushes to the package registry if any server is not GitHub, would be worthwhile, but your suggestion would certainly help mitigate the downtime problem.
Why not bake the package cache into your Docker image? That way the tooling would still be machine-agnostic, but you won’t have to pay the penalty of downloading the dependencies each build, and you’ll be resilient to network issues.
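Something along these lines might work, sketched here with a hypothetical base image and a standard src/ layout: copy the project’s elm.json into the image and compile a throwaway module once, which pulls every dependency into ~/.elm at image build time.

```dockerfile
# Pre-warm the Elm package cache at image build time so CI builds
# need no network access. The image name is illustrative.
FROM elm-build:0.19.1
WORKDIR /warmup
COPY elm.json .
# elm has no "download dependencies only" command, so compile a
# stub module; this fills ~/.elm with every dependency.
RUN mkdir src \
 && printf 'module Main exposing (main)\nimport Html\nmain = Html.text ""\n' > src/Main.elm \
 && elm make src/Main.elm --output=/dev/null
```

Rebuilding the image whenever elm.json changes keeps the baked-in cache current.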
Yes, that’s an interesting idea. I suppose I would need one Docker image per project; otherwise my Docker image would end up containing all of elm’s packages! I’ll see if I can figure something out. Thanks for the idea!
Python’s pip is an excellent example of something that just works, that nobody complains about, and that isn’t locked to GitHub either.
You can just do `pip install git+ssh://my.github.enterprise/my-project/my-repo@1234567890`
Yes, there are other ways of doing it. In fact, I don’t know of any other language that requires you to use GitHub… I’m not sure why we could not use the same technique as pip: the use case looks fairly similar.