We were alerted at 23:39 UTC on 18/07/2023 that the queue for our asynchronous task processor was above the acceptable threshold.
Once our team was online in India at 2:59am UTC, the status page was updated. By this time the task processor queue had backed up and the application was not able to write flag change events to the datastore which powers the Edge API.
We investigated multiple avenues to determine the cause of the issues but there were multiple ‘symptoms’ that made determining the root cause very difficult. One specific issue, which turned out to be a red herring, related to the functionality to forward core API requests to the Edge API. This process seemed to be taking much longer than expected. Much of the investigation was spent restricting the usage of this functionality.
At around 9:30am UTC, the cause was attributed to a particular set of tasks in the queue which were causing the processor units to run out of memory. Once it was determined to be safe to do so, these tasks were removed from the queue.
At 10:19 UTC the issue had been resolved and the queue had returned to normal, meaning that flag change events were being written to the Edge API datastore again. Any changes that were not processed at the time were also re-run to ensure that the state was consistent with the expected changes that had been made in the database.
The issue was caused by an environment in the Flagsmith platform that included 400 segments and nearly 5000 segment overrides. This meant that the environment document which is generated to power the Edge API was larger than it was possible for the task processor instances to load into memory, and subsequently write to the Edge API datastore.
To compound the issue, these changes were made via the Flagsmith API which resulted in 1000s of tasks being generated to update the document in the Edge API datastore in a short space of time. Each of these needed to load the offending environment, causing the task processor instances to fall into a cycle of running out of memory.
These tasks were slowly being blocked from being picked up again by the processors but the quantity meant that there were always new versions of the same (or very similar) tasks to pick up.
Implement limits on the size of the environment document
Deprecate the functionality to forward requests from the Core API to the Edge API. All projects using the Edge API will need to ensure that all connected SDKs are using the Edge API only.