UseGalaxy.eu Downtime on 18th March

Mar 18, 2019
UseGalaxy.eu
Freiburg devops downtime

A normal upgrade went awry and we had to restore some data from backup. We had a total downtime of around 20 hours.

Incident Timeline

Date	Time	State
March 18	16:00	We began the upgrade procedure, expecting it to proceed smoothly
March 18	18:45	A database migration fails in a bad way. Attempts to recover the situation were not successful. We decide to restore from backup.
March 18	19:45	Backup restore finishes. It was made minutes too late. We decide to re-restore from the most recent backup, and manually extract the needed table from the previous backup.
March 18	20:30	We updated our website to reflect the current situation.
March 18	21:00	We manage to get the database in a partially functional state and debug a problem with zerglings where they refuse to serve 19.01.
March 18	22:00	Due to stress of the incident, Helena identifies that it is likely her error rate will increase if she continues working, maybe with disastrous consequences and stops for the evening.
March 19	08:00	Helena recovers the broken table and manually applies the database migration.
March 19	10:00	We discover additional issues with the new options, namely some upstream bugs we accidentally triggered. The server is kept closed until these issues are resolved.
March 19	11:40	UseGalaxy.eu is back online

The workflow_step_connection table was partially corrupted as part of the failed migration. It is likely that had we not tried to re-run upgrades, and had not tried to manually apply the table changes, we could have prevented the small loss of any new workflow connections created on Monday. We cannot say this for certain, but it is probably. In the future we can do a lot to safeguard against similar errors.

Resolution

We have planned since some time that we will have a test Galaxy where we can test out these upgrades ahead of time, but unfortunately it has not been possible to allocate time to this project, given the current priority/urgency matrix. This task will be re-prioritised following this incident.

We will adjust our backup procedures before upgrades. Currently given the extremely positive experiences and relative safety of database migrations in the past years of running UseGalaxy.eu, we do not make backups before an upgrade, and rely only on the normal daily backups. We will investigate the possibility to trigger a backup immediately before any database changes, to prevent similar issues in the future.

Conclusion

This was a significant interruption to our services for what should have been a simple upgrade. We apologise for the inconvenience this has caused.