Summary
On September 5th at 09:45 UTC, we initiated a release that included a database migration to add a new constraint to the table that stores flag information. Based on our pre-live tests, we expected this migration to take no more than 50 milliseconds. In production, however, the migration needed to acquire a temporary lock on a particular table with high throughput. While it waited for that lock, and then while it held it, other connections touching the same table were blocked, creating a backlog of connections waiting on the migration to complete. This had a knock-on effect that exhausted the available connections on the database, and a full restart was necessary.
Once the restart was complete, the connections were restored and service was resumed. This happened at 10:20 UTC.
Next Steps
We have investigated the cause of the issue, although further research is needed to understand certain aspects. In the meantime, we plan to implement safeguards described in the following pages of the Postgres documentation, which should help reduce the impact of any similar issue in the future.
https://www.postgresql.org/docs/11/runtime-config-client.html
https://www.postgresql.org/docs/11/runtime-config-logging.html (log_lock_waits)
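
As an illustration of the kind of safeguard the client configuration page describes, here is a minimal sketch of a migration wrapper that sets lock_timeout before running the DDL and retries with a backoff if the lock cannot be acquired in time. This is an assumption about how such a safeguard could look, not a description of what we have deployed: the execute callable, the LockNotAcquired exception, the table and constraint names, and the 2000 ms timeout are all hypothetical placeholders. The idea is that a migration that cannot get its lock quickly gives up and releases its place in the lock queue instead of holding up every other connection behind it.

```python
import time
from typing import Callable, List


class LockNotAcquired(Exception):
    """Hypothetical error the executor raises when Postgres cancels
    a statement because lock_timeout expired."""


def guarded_statements(ddl: str, lock_timeout_ms: int) -> List[str]:
    """Wrap a DDL statement in a transaction with a bounded lock wait."""
    return [
        "BEGIN",
        # SET LOCAL limits the setting to this transaction only.
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms'",
        ddl,
        "COMMIT",
    ]


def run_migration(
    execute: Callable[[str], None],
    ddl: str,
    lock_timeout_ms: int = 2000,   # illustrative value, not a recommendation
    attempts: int = 5,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the DDL, retrying if the lock is not acquired in time.

    Returns the number of attempts used; re-raises LockNotAcquired
    once all attempts have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            for stmt in guarded_statements(ddl, lock_timeout_ms):
                execute(stmt)
            return attempt
        except LockNotAcquired:
            # Abort the transaction so the pending lock request is dropped
            # and queued connections can proceed.
            execute("ROLLBACK")
            if attempt == attempts:
                raise
            # Linear backoff gives normal traffic time to drain.
            sleep(backoff_s * attempt)
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    # Dry run: print the statements instead of sending them to a database.
    # Table and constraint names are hypothetical.
    run_migration(
        print,
        "ALTER TABLE flags_table ADD CONSTRAINT flags_key_not_empty "
        "CHECK (key <> '')",
    )
```

In real use, execute would be backed by a database driver and would translate the driver's lock timeout error into LockNotAcquired. Enabling log_lock_waits alongside this would record any session that waits longer than deadlock_timeout for a lock, which should make a similar backlog visible in the logs sooner.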