We are currently encountering difficulties with our task processing system.

Incident Report for Flagsmith

Postmortem

Timeline

We were alerted at 23:39 UTC on 18/07/2023 that the queue for our asynchronous task processor was above the acceptable threshold.

Once our team in India came online at 02:59 UTC, the status page was updated. By this time the task processor queue had backed up, and the application was unable to write flag change events to the datastore that powers the Edge API.

We investigated several avenues, but a number of different ‘symptoms’ made the root cause very difficult to determine. One specific issue, which turned out to be a red herring, concerned the functionality that forwards Core API requests to the Edge API. This process seemed to be taking much longer than expected, and much of the investigation was spent restricting the usage of this functionality.

At around 09:30 UTC, the cause was traced to a particular set of tasks in the queue that were causing the processor units to run out of memory. Once it was determined to be safe to do so, these tasks were removed from the queue.

By 10:19 UTC the issue had been resolved and the queue had returned to normal, meaning that flag change events were being written to the Edge API datastore again. Any changes that had not been processed during the incident were re-run to ensure that the state was consistent with the changes that had been made in the database.

Issue Details

The issue was caused by an environment in the Flagsmith platform that included 400 segments and nearly 5000 segment overrides. As a result, the environment document generated to power the Edge API was too large for the task processor instances to load into memory and subsequently write to the Edge API datastore.

To compound the issue, these changes were made via the Flagsmith API, which generated thousands of tasks to update the document in the Edge API datastore in a short space of time. Each of these tasks needed to load the offending environment, causing the task processor instances to fall into a cycle of running out of memory.

These tasks were gradually being blocked from being picked up again by the processors, but there were so many of them that there were always new versions of the same (or very similar) tasks to pick up.

Next Steps

Implement limits on the size of the environment document
- This will primarily consist of implementing limits on the number of segments and features in a given project, as well as limiting the total number of segment overrides in a given project.
Deprecate the functionality to forward requests from the Core API to the Edge API. All projects using the Edge API will need to ensure that all connected SDKs are using the Edge API only.
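
As an illustration of the first item above, the following is a minimal sketch of a per-project limit check that could run before a write is accepted. It is not the Flagsmith implementation: the names ProjectCounts, ProjectLimitExceeded and check_project_limits are hypothetical, and the limit values are placeholders, since this report does not state what the actual limits will be.

```python
from dataclasses import dataclass

# Placeholder limits for illustration only; the real values are not
# given in this report.
MAX_SEGMENTS_PER_PROJECT = 100
MAX_FEATURES_PER_PROJECT = 400
MAX_SEGMENT_OVERRIDES_PER_PROJECT = 1000


class ProjectLimitExceeded(Exception):
    """Raised when a project would exceed one of the configured limits."""


@dataclass
class ProjectCounts:
    """Hypothetical snapshot of a project's totals after a proposed write."""

    segments: int
    features: int
    segment_overrides: int


def check_project_limits(counts: ProjectCounts) -> None:
    """Raise ProjectLimitExceeded if any total is above its limit."""
    limits = {
        "segments": MAX_SEGMENTS_PER_PROJECT,
        "features": MAX_FEATURES_PER_PROJECT,
        "segment_overrides": MAX_SEGMENT_OVERRIDES_PER_PROJECT,
    }
    for name, limit in limits.items():
        current = getattr(counts, name)
        if current > limit:
            raise ProjectLimitExceeded(
                f"{name}: {current} exceeds limit of {limit}"
            )
```

With these placeholder values, a project resembling the one described under Issue Details (400 segments and roughly 5000 segment overrides) would be rejected at write time, before any task was queued to regenerate the environment document.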

Posted Jul 19, 2023 - 16:24 UTC

Resolved

This incident has been resolved. We will publish a full post-mortem imminently.

Posted Jul 19, 2023 - 10:19 UTC

Monitoring

We have deployed an update which has resumed consumption of the task queue. We are now processing the task queue and expect to be caught up in the next hour.

Posted Jul 19, 2023 - 09:12 UTC

Identified

We have identified a database lock that has caused this issue with the task processor. We are working on an interim fix as we identify the root cause.

Posted Jul 19, 2023 - 08:02 UTC

Update

We are continuing to investigate this issue with the utmost priority.

Posted Jul 19, 2023 - 06:35 UTC

Investigating

We are currently investigating this issue. Our findings so far indicate that any flag changes made in approximately the last two hours may not be visible to the client.

Posted Jul 19, 2023 - 02:58 UTC

This incident affected: Edge API.