Outbound Message Delay
Incident Report for Redox Engine
Postmortem

Root cause

A significant number of requests to our internal tasking utility overwhelmed it, leaving it unable to accept new requests. This affected our upstream transmissions, causing several queues to error and pause.

Impact on customers

A subset (less than 10%) of customers experienced message delays, the duration of which depended on the time their subscription was affected. Messages were queued until the issue was resolved, preserving first-in-first-out (FIFO) order.
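
To illustrate the queue-and-resume behavior described above, here is a minimal sketch in Python. It is not Redox's implementation: the class `PausableFifoQueue`, its methods, and the `send` callback are all hypothetical names chosen for this example. It assumes one queue per destination that pauses at the first send error, keeps accepting new messages while paused, and delivers oldest-first once resumed.

```python
from collections import deque
from typing import Callable, Deque, List


class PausableFifoQueue:
    """Hypothetical per-destination queue (illustration only)."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                      # assumed delivery callback
        self._pending: Deque[str] = deque()    # oldest message on the left
        self.paused = False

    def enqueue(self, message: str) -> None:
        # Messages are accepted even while the queue is paused.
        self._pending.append(message)

    def drain(self) -> List[str]:
        """Send pending messages oldest-first; pause at the first failure."""
        delivered: List[str] = []
        while self._pending and not self.paused:
            message = self._pending[0]         # peek without removing
            try:
                self._send(message)
            except Exception:
                # Keep the failed message at the head so order is preserved.
                self.paused = True
                break
            self._pending.popleft()
            delivered.append(message)
        return delivered

    def resume(self) -> List[str]:
        """Clear the pause and continue delivery from the oldest message."""
        self.paused = False
        return self.drain()


def demo() -> List[str]:
    """Simulate an outage, then a resume, and return the delivery order."""
    state = {"up": False}
    sent: List[str] = []

    def send(message: str) -> None:
        if not state["up"]:
            raise ConnectionError("upstream unavailable")
        sent.append(message)

    queue = PausableFifoQueue(send)
    queue.enqueue("m1")
    queue.enqueue("m2")
    queue.drain()           # fails on m1, so the queue pauses
    queue.enqueue("m3")     # arrives during the outage
    state["up"] = True
    queue.resume()
    return sent
```

In this sketch, `demo()` returns the messages in the order they were enqueued, because the failed message stays at the head of the queue rather than being dropped or moved to the back.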

What Happened?

At 11:08 AM ET, Redox discovered that several queues were paused due to errors within the engine, halting all traffic to the affected destinations. We determined that the root cause was an overwhelming number of requests to the internal tasking utility the previous night. Upon confirming the cause, Redox began resuming traffic on the affected feeds.

Learnings / Follow-ups

Logging Clarity & Visibility

Redox has identified several takeaways from this incident and has already begun work on two of them: clearer logging and better visibility into our paused queues.

Process

Redox will also review how we address issues and process messages when our tasking utility is unavailable.

Posted Feb 25, 2022 - 10:54 CST

Resolved
This incident has been resolved.
Posted Jan 20, 2022 - 15:41 CST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 20, 2022 - 13:56 CST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 20, 2022 - 13:38 CST
Update
We are continuing to investigate this issue.
Posted Jan 20, 2022 - 12:40 CST
Investigating
We are currently investigating an issue that is preventing Redox from sending outbound messages to a subset of customers. When a resolution is found, we will resume message processing on a first-in-first-out (FIFO) basis. More updates to come...
Posted Jan 20, 2022 - 11:20 CST
This incident affected: Engine Core.