Over a period of 5 days, our API intermittently responded slowly or returned 500 error responses. Each of these brown-outs lasted around 1 minute, and they occurred 4 times a day. The cause was a misconfiguration in a client's SDK implementation, which sent us very high bursts of traffic each time a push notification went out to a large user population.
To mitigate these brown-outs, we have upsized our core database to 8x the capacity and our app server cluster to 16x the capacity. This gives us enough headroom to serve these traffic bursts. We have also been in contact with the customer to help improve their SDK implementation and reduce the load on our API.
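
For readers wondering how a client can avoid this kind of burst: one common approach is to add a random delay before calling the API after a push notification, and to back off with jitter when a request fails. The sketch below illustrates that general technique only. It is not the change made to the customer's SDK, and the function names, window sizes, and retry limits are all hypothetical.

```python
import random


def spread_delay(window_seconds: float, rng: random.Random) -> float:
    """Pick a random delay in [0, window_seconds].

    Hypothetical helper: if every device waits a different random time
    after receiving the push notification, the requests arrive spread
    over the window instead of all at once.
    """
    return rng.uniform(0.0, window_seconds)


def backoff_delay(attempt: int, base_seconds: float, cap_seconds: float,
                  rng: random.Random) -> float:
    """Exponential backoff with full jitter for retry number `attempt` (0-based).

    The upper bound doubles with each attempt and is capped at cap_seconds;
    the actual delay is drawn uniformly between 0 and that bound.
    """
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random()
    # Assumed values for illustration: spread initial calls over 60 seconds,
    # then retry failed calls with a 1 second base and a 30 second cap.
    print(f"initial delay: {spread_delay(60.0, rng):.1f}s")
    for attempt in range(4):
        print(f"retry {attempt}: wait {backoff_delay(attempt, 1.0, 30.0, rng):.1f}s")
```

With a 60 second window, a burst that previously arrived within a second or two would instead be spread across roughly a minute. The right window depends on how quickly the app needs fresh data after the notification.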