On 20 January 2021 at 16:47 UTC, our REST API suffered a partial outage lasting 38 minutes. Service then resumed gradually over a further 6 minutes, for a total downtime of 44 minutes. The root cause was a database migration that failed to apply correctly. We manually corrected the migration and service resumed.
We’re really sorry for this downtime. We work hard to achieve 100% uptime, and we will use what we learned here to improve the service.
As part of our development work on third-party integrations, we have been building an integration with [Amplitude](https://amplitude.com/). This integration requires a new table in the core Postgres database, so we created a Django database migration to add it.
As part of this work, one of our developers manually edited the migration to change the data schema. This was a mistake: migrations should not be manually edited. The engineer should have created a second migration to modify the data schema.
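One way to enforce the "never edit an existing migration" rule is a small check in CI that compares migration files against a recorded baseline. The sketch below is illustrative only and is not what we run: the function names `snapshot` and `edited_migrations` are hypothetical, and it assumes migration files follow Django's numbered naming scheme (for example `0001_initial.py`).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(migrations_dir: Path) -> Dict[str, str]:
    """Map each numbered migration filename to the SHA-256 of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(migrations_dir.glob("[0-9]*.py"))
    }


def edited_migrations(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return migrations that changed or disappeared since the baseline.

    Files that only exist in `current` are new migrations and are allowed;
    that is the intended way to change the schema.
    """
    return sorted(
        name for name, digest in baseline.items() if current.get(name) != digest
    )
```

A CI job could take the baseline from the main branch and the current snapshot from the pull request, then fail the build if `edited_migrations` returns anything.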
We have also been migrating our code to use the Black Python formatter. This hampered our code review process: the additional formatting changes polluted the diff and made the code harder to read than it should have been.
The code worked in our local, development and staging environments because test data was present in the production environment but not in the development or staging environments. The migration failed to apply in every environment (because the app thought there was no migration to apply), but an exception was only thrown in production, where the table contained data.
Once code review was complete, we merged the code to master and the CI/CD pipelines pushed it into production. This caused the outage. We were also slow to be alerted because the failure did not take down endpoints like /health, which was still returning 200 OK responses.
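A health check that only confirms the web process is up will not notice a broken schema. A minimal sketch of a deeper check is shown below, under some loud assumptions: it uses the standard-library `sqlite3` module as a stand-in for Postgres, and the `integration_config` table, the `PROBES` tuple and the `deep_health_check` function are hypothetical names, not our actual code. The idea is simply to run a cheap query against each table the API depends on and return 503 if any of them fails.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# Hypothetical probe queries: each one should touch a table (and columns)
# that the API needs in order to serve requests.
PROBES = ("SELECT id, name FROM integration_config LIMIT 1",)


def deep_health_check(
    connect: Callable[[], sqlite3.Connection],
    probes: Iterable[str] = PROBES,
) -> Tuple[int, str]:
    """Return (status_code, body) after running every probe query."""
    try:
        conn = connect()
    except sqlite3.Error as exc:
        return 503, f"connect failed: {exc}"
    try:
        for sql in probes:
            # A missing table or column raises an error here.
            conn.execute(sql).fetchone()
    except sqlite3.Error as exc:
        return 503, f"probe failed: {exc}"
    finally:
        conn.close()
    return 200, "ok"
```

With a check along these lines, a monitor polling /health would see a non-200 response as soon as a query against an affected table started failing, rather than relying on errors from other endpoints.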
We identified the issue quickly, wrote and tested a fix and then deployed it into production.