Root cause
A high volume of requests overwhelmed our internal tasking utility, leaving it unable to accept new requests. This disrupted our upstream transmissions, causing several queues to error and pause.
Impact on customers
A subset (less than 10%) of customers experienced message delays. The duration of each delay depended on the time at which the customer's subscription was affected. Messages were queued until the issue was resolved, preserving FIFO order.
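The queue-and-release behavior described above can be illustrated with a minimal sketch. This is not Redox's implementation: the class `PausableQueue`, its methods, and the `deliver` callback are hypothetical names chosen for illustration only. It shows how messages accepted while a queue is paused can be held and then delivered in their original order once the queue resumes.

```python
from collections import deque


class PausableQueue:
    """Hypothetical FIFO queue that holds messages while paused."""

    def __init__(self, deliver):
        self._deliver = deliver   # callable that sends one message downstream
        self._pending = deque()   # waiting messages, oldest on the left
        self.paused = False

    def pause(self):
        # Stop delivering; new messages are still accepted and held.
        self.paused = True

    def resume(self):
        # Resume delivery and flush everything held during the pause.
        self.paused = False
        self._drain()

    def send(self, message):
        # Always enqueue first so ordering is preserved, then deliver if allowed.
        self._pending.append(message)
        if not self.paused:
            self._drain()

    def _drain(self):
        # Deliver oldest-first until the queue is empty or paused again.
        while self._pending and not self.paused:
            self._deliver(self._pending.popleft())
```

In this sketch, a message sent during a pause is delayed rather than dropped, and it is always delivered before any message sent after it.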
What Happened?
At 11:08 AM ET, Redox discovered that several queues had paused due to errors within the engine, halting all traffic to the affected destinations. Redox determined that the root cause was an overwhelming number of requests to the internal tasking utility the previous night. After confirming the cause, Redox began resuming traffic on the affected feeds.
Learnings / Follow-ups
Logging Clarity & Visibility
Redox has identified several takeaways from this incident and has already begun work on two of them: clearer logging and better visibility into our paused queues.
Process
Redox will also look further into how we address issues and process messages when our tasking utility is unavailable.