Post-mortem on last week’s problems

The last few days have been terrible, for us and for you. Performance on the site has been really bad, and the site has been going up and down constantly. We’re really sorry about that.

We started receiving notifications that the server was unavailable about a week ago, and since then we’ve been trying to find out what was causing the outages and the bad response times – which got worse after the weekend.

One of the first things we noticed was that our caching server was unable to cache any content at all, because our Rails backend was sending a Set-Cookie HTTP header on every request. Since this taxed the backend badly, we deployed a fix on Tuesday and saw some of the most requested content (single commit diffs) being cached again.
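The kind of fix involved can be sketched as a small Rack-style middleware that drops the Set-Cookie header from anonymous GET responses so a caching proxy is allowed to store them. This is an illustration, not the code we actually deployed; the class name `StripSetCookie` is hypothetical.

```ruby
# Hypothetical Rack-style middleware: strip Set-Cookie from GET responses
# so the caching server in front of Rails can cache them. A response that
# sets a cookie is per-user by definition, so caches must not store it.
class StripSetCookie
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Only strip on plain GETs; POSTs may legitimately set session state.
    headers.delete("Set-Cookie") if env["REQUEST_METHOD"] == "GET"
    [status, headers, body]
  end
end
```

Wrapped around the application, every cacheable page then comes back cookie-free and the proxy can start doing its job again.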

We were hoping that we had found the root cause, and hoping for better performance as our cache warmed up again. Not seeing the desired improvement in response times, we started suspecting there were other issues behind the bad response times. Analyzing the server load, we found a really high number of Atom requests for pages that rendered slowly, and suspected that the combination of polling RSS clients and missing cache support was slowing down the servers. We set up our cache server to force caching of any Atom request for an hour, removing Set-Cookie headers in the process. When this didn’t work, we temporarily disabled Atom requests entirely, which helped a little.
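In our case the forced caching was configured in the cache server itself, but the effect is the same as the following Ruby sketch: for Atom responses, drop Set-Cookie and mark the response as publicly cacheable for an hour. The method name and TTL constant are illustrative.

```ruby
# Illustrative equivalent of the cache-server rule: make Atom feed
# responses cacheable for an hour by any shared cache.
ATOM_TTL = 3600 # one hour, in seconds

def cache_atom_response(headers, path)
  return headers unless path.end_with?(".atom")

  headers = headers.dup
  # A Set-Cookie header would make the response per-user and uncacheable.
  headers.delete("Set-Cookie")
  headers["Cache-Control"] = "public, max-age=#{ATOM_TTL}"
  headers
end
```

With polling feed readers, even a short TTL like this collapses thousands of identical slow renders into one backend request per hour per feed.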

But the problems persisted. Last night we noticed that rendering a project page was really slow – easily over a minute for the most popular projects. Looking through the code, it turned out that we were effectively no longer caching the events listed on that page in Memcache, resulting in a lot of database access. Once you’ve found the problem, the solution usually isn’t far away – which was also the case here. This morning we deployed Christian’s commit from last night, and the servers are handling the load a lot better today.
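The shape of that fix is the classic cache-aside pattern: wrap the expensive per-project events query in a cache fetch so repeated page renders stop hitting the database. This is a minimal sketch, not the actual Gitorious code; `MiniCache` is a tiny in-memory stand-in for Memcache, and `recent_events` is a hypothetical helper.

```ruby
# Tiny in-memory stand-in for Memcache, mimicking Rails.cache.fetch:
# return the cached value, or run the block, store its result, return it.
class MiniCache
  def initialize
    @store = {}
  end

  def fetch(key)
    @store.fetch(key) { @store[key] = yield }
  end
end

CACHE = MiniCache.new

# Cache-aside lookup: the expensive query in the block only runs on a miss.
def recent_events(project_id, &load_from_db)
  CACHE.fetch("events/project/#{project_id}", &load_from_db)
end
```

When the cached path silently stops being taken – as happened here – every render falls through to the block, and the database ends up doing the cache’s work on every request.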

The reason this happened, and the reason it took so long to find the problem, is that a week ago we merged a fairly big feature into master: private repositories. The hosted site will not be offering private repositories, so the code sits behind a “feature switch” which is turned off there. Apparently there were still a place or two where performance was affected even with the feature turned off, which is what hit us over the last few days.
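A minimal sketch of a feature switch, assuming a simple hash of flags (the names `FEATURES` and `visible_repositories` are hypothetical, not our actual code). The lesson from this incident is that the switched-off path must also be the cheap path: guard the expensive work before it runs, not after.

```ruby
# Simple feature-flag table; the hosted site runs with the flag off.
FEATURES = { private_repositories: false }

def feature_enabled?(name)
  FEATURES.fetch(name, false)
end

# With the switch off, no repository can be private, so we take the
# cheap path and skip the per-repository authorization work entirely.
def visible_repositories(repos, user)
  return repos unless feature_enabled?(:private_repositories)
  repos.select { |r| !r[:private] || r[:members].include?(user) }
end
```

A bug like ours creeps in when some code path does the flag-on work (extra queries, skipped caches) regardless of the switch, so the feature degrades performance even where it is nominally disabled.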

We’re not very proud of how we’ve kept you informed about the problems over the last few days. Heads-down, diagnosing the problem, trying different fixes and responding to support email, we neglected to update our status site and accounts.

We did a bad job of keeping you informed, and we will make sure that you are kept fully in the loop if trouble should strike in the future.

[Edit: after publishing this post, we discovered a similar issue in the repository pages on Gitorious; we have just deployed a fix for that.]
