On June 7th at 10:33 UTC we released a change to our Edge API as part of this issue that filters out server-side only features when a client API key is used. These changes also affected the logic responsible for triggering environment-level integrations, causing them to fail.
Which integrations were affected?
On June 15th at 22:25 UTC we were notified by a customer that they were not seeing data populated from their integration. First thing on June 16th the engineering team began investigating the issue. The team immediately identified that the change described above had changed the signature of a function that was also used by the integrations logic. Unfortunately, this change had not been picked up by our tests or code review, and the subsequent errors were not picked up by our monitoring.
Why didn’t our tests pick this up?
The unit tests covering the integrations logic utilised mocking, meaning that the change to the method signature was not correctly identified as an issue and our end to end test suite did not include the verification of successful integrations.
Why didn’t our monitoring pick this up?
The monitoring in place to track the error rate on the function responsible was using an incorrect aggregation algorithm meaning that the threshold for alert was never breached.
At 18:03 UTC on June 16th a fix was released and service was resumed to all environment-based identity integrations (listed above). Unfortunately, due to the implementation of the asynchronous call to the lambda function that handles the integrations, it is not possible to recover the data that was lost during this period.
spec
correctly (see unittest documentation here for further reading)Alerting & monitoring:
Moving asynchronous invocations of other lambda functions to use persistent queues and/or an event messaging system so that, after issues such as this, tasks can be re-run to ensure no data is lost.