Core API outage
Incident Report for Flagsmith
Postmortem

At around 23:35 UTC on July 9th, we received an alert that our Core API was not responding. As a result, our SaaS customers were unable to use the Flagsmith dashboard (app.flagsmith.com). Customer SDKs serving flags through the Edge API were not impacted. Please note that any customers still using our Core API to serve flags were also impacted. This number is limited, as we have advised customers to migrate to the Edge API since June 2022.

Our team resolved the issue at 03:06 UTC on July 10th, at which point the Core API was fully responsive. The root cause was a database running at maximum CPU, caused by requests to an endpoint that triggered an inefficient query. In addition, our load balancer was continually recycling unhealthy API tasks, which strained the system further by opening unnecessary database connections. Together, these two factors left the Core API unresponsive.

We recovered the database by dropping all traffic and terminating all open connections. This allowed the database to recover and process traffic normally again.

We are mitigating future issues like this by doing the following:

  • Optimizing the query that consumed too much CPU capacity (note that this has been completed and deployed to our production SaaS environment)
  • Adding better alerting when inefficient queries are identified in the application
  • Improving our internal tools (e.g. PagerDuty) to speed up issue identification, which was delayed in this incident because some team members were out of office
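
As an illustration of the second item only, the following is a minimal sketch of application-level detection of slow queries, assuming a Python service. It does not describe our actual tooling: the names `timed_query` and `slow_queries` and the 0.5 second threshold are hypothetical, and a real deployment would forward the record to a monitoring or paging system rather than an in-memory list.

```python
import time
from contextlib import contextmanager

# Hypothetical threshold; a real value would be tuned per endpoint.
SLOW_QUERY_THRESHOLD_SECONDS = 0.5

# Hypothetical sink for slow-query records (stand-in for an alerting backend).
slow_queries = []


@contextmanager
def timed_query(label, threshold=SLOW_QUERY_THRESHOLD_SECONDS, clock=time.monotonic):
    """Time the wrapped block and record it if it exceeds the threshold."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed > threshold:
            # Record the label and duration so an alert can be raised.
            slow_queries.append((label, elapsed))
```

A database call would be wrapped as `with timed_query("list_features"): ...`, where the label is whatever identifies the query or endpoint. The `clock` parameter exists only so the timing source can be swapped out in tests.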
Posted Jul 10, 2024 - 13:31 UTC

Resolved
Core API and admin dashboard outage on 10th July 2024.
Posted Jul 09, 2024 - 23:30 UTC