Post mortem on yesterday’s downtime

Gitorious 3 has been in the next branch of Gitorious’ mainline repository since last autumn. We started out building a new code browser for Gitorious which would perform better than our previous solution, with server-side syntax highlighting of any programming language out there. At that time we had fallen behind on versions of Ruby on Rails, the framework Gitorious is built with, and we realized that we would have to build Gitorious 3 for Rails version 3.

The story with Rails 3 is a long one, and we’ll save it for another occasion. Some parts of Gitorious, for example our routing system, took a long time to upgrade, but we were finally able to run Gitorious under Rails 3, using Ruby 1.9. Since Ruby 1.8 reached end of life earlier this year, that alone is an important reason to upgrade.

As we started discussing the acquisition by Powow AS, Christian, Thomas and I wanted to give the new team a clean slate, with a version of Gitorious with less technical debt and a vastly improved UI. As part of the agreement with Powow we agreed that the old team would take care of upgrading the servers to Gitorious v3 around the time of the acquisition. Last week we picked yesterday as the final date of the deployment, and started planning for the upgrade.


When we started upgrading the servers yesterday morning, most things went according to plan. We followed the upgrade guide we had set up, and after half an hour we were able to view a running Gitorious v3. After getting all the components of Gitorious running, we opened to the public around noon. We were expecting some errors to occur, due to the sheer volume of users and repositories on the site. Running an application in test and staging environments is not the same as running in a production environment with over half a million users. One of the first issues we addressed was that rendering of a project page was ridiculously slow; we deployed a fix for that bug within 30 minutes of upgrading.

One surprise we were not expecting, however, was that my profile page rendered the international characters in my name wrong. Digging deeper, we found that although the Rails app was configured to use latin1 encoding when communicating with the database, we were not getting latin1 back. We tried changing the database configuration to use utf8, but discovered that the actual data in the database was wrong. To avoid ending up with data in different encodings in the same database, we decided to take the site offline.

With the help of our friends at Redpill Linpro, we tried dumping a copy of the database to a file, using utf-8 for the dump, and then importing that file into the MySQL database. Looking through the file, the encoding looked really strange. Where my middle name in UTF-8 should be 7 bytes (\x4d\xc3\xa5\x72\x6e\x65\x73), it was actually 9 bytes in the database. Once the import had completed, the data was in the same state as before the import: broken. Seeing 4-byte characters where we should be seeing 2-byte characters made us believe that the data in the database had been encoded twice. Running a double recode of the data proved us right:

SELECT CONVERT(CAST(CONVERT(fullname USING latin1) AS BINARY) USING utf8) FROM users where id=2;
Marius Mårnes Mathiesen
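
The byte counts above can be reproduced outside the database. The following is a minimal Python sketch, not something we ran during the incident: it assumes the double encoding happened by misreading UTF-8 bytes as latin1 and encoding them as UTF-8 once more, and the helper name undo_double_encoding is made up for illustration.

```python
# Illustration only: reproduces the 7-byte vs 9-byte observation.
name = "Mårnes"

# Correct UTF-8: "å" takes 2 bytes, so the name is 7 bytes long.
correct = name.encode("utf-8")

# Assumed failure mode: the UTF-8 bytes are misread as latin1 and
# then encoded as UTF-8 a second time, so "å" grows to 4 bytes.
double_encoded = correct.decode("latin-1").encode("utf-8")


def undo_double_encoding(raw: bytes) -> str:
    """Reverse one extra round of latin1 -> UTF-8 encoding (hypothetical helper)."""
    return raw.decode("utf-8").encode("latin-1").decode("utf-8")


print(len(correct), correct)
print(len(double_encoded), double_encoded)
print(undo_double_encoding(double_encoded))
```

The helper does the same thing as the SQL expression above: go back to latin1, treat the result as raw bytes, and read those bytes as UTF-8.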

Now all we had to do was to find a way to recode it twice in the file produced by mysqldump, which (as always) could be done with a Perl one-liner. We ran this one-liner on a SQL file for a single table (extracted from the entire database dump using sed, of course) and used that to recreate the users table in our database. Naturally that didn’t work at first; only after we also set up the character set related settings in the MySQL server and client did the import come out right. We now had a strategy for how to recreate the database with all UTF-8 data.
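
For readers who want to see the idea in code, here is a rough Python sketch of recoding a dump line by line. It is not the Perl one-liner we used, and the function names recode_line and recode_stream are invented for this example. It assumes every affected line is valid UTF-8 that was encoded exactly one time too many, and it uses Python's latin-1 codec; as far as we know MySQL's latin1 is closer to Windows-1252, so a real dump may need the cp1252 codec for some characters.

```python
import io


def recode_line(line: bytes) -> bytes:
    """Undo one extra round of latin1 -> UTF-8 encoding on a single line.

    Lines that cannot be recoded cleanly are returned unchanged, on the
    assumption that they were not double encoded in the first place.
    """
    try:
        fixed = line.decode("utf-8").encode("latin-1")
        fixed.decode("utf-8")  # check that the result is still valid UTF-8
        return fixed
    except (UnicodeDecodeError, UnicodeEncodeError):
        return line


def recode_stream(src, dst) -> None:
    """Copy a binary stream line by line, repairing double-encoded lines."""
    for line in src:
        dst.write(recode_line(line))


# Small demonstration with an in-memory "dump" (made-up sample row).
sample = "INSERT INTO users VALUES (2,'Marius Mårnes Mathiesen');\n"
broken = sample.encode("utf-8").decode("latin-1").encode("utf-8")

out = io.BytesIO()
recode_stream(io.BytesIO(broken), out)
print(out.getvalue().decode("utf-8"))
```

On a real dump the same function could be fed sys.stdin.buffer and sys.stdout.buffer. Working line by line keeps memory use flat on a dump of several GB, at the cost of leaving any line with binary column data untouched.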

After dropping the database and importing it from the database dump, processed by the Perl one-liner, we finally got Gitorious running again around 7:15pm yesterday. We brought all the services back up, and re-opened access to the site at 7:30pm CET.

Bottom line

We’re all really sorry about this downtime. This is not the kind of service you should expect from us, and we appreciate your patience while we worked to resolve this.

The good news is that we wouldn’t have been able to resolve this situation without help from the new team. Database configuration and character encodings are not among the old team’s strongest skills, while Marcin used tools I personally never knew existed to read the actual byte values from a several GB SQL dump. Further good news: the site is now running the best release of Gitorious ever, the infrastructure is in a much better state, and the new team has a lot of plans to make Gitorious even better.

On behalf of the old team: please don’t blame “the new guys”, we’re the ones who messed up this time.

And again, we’re really sorry about this.

One Comment

  1. Roman
    Posted October 11, 2013 at 11:48 pm | Permalink

    That’s great you managed to find the root cause quickly.
    What was that Perl one-liner?
