Traffic delayed and increased errors

Incident Report for Redox Engine

Postmortem

Summary

On May 29, 2025, between 1000CT and 1115CT, some customers experienced elevated error rates and/or delayed processing. The incident affected a portion of customer traffic and lasted less than 1 1/2 hours.

What Happened

  • This issue was caused by a data consistency error with Redox base configs, related to a change, which led to elevated errors and/or delayed processing for customers using a specific version and subset of Redox base configs.
  • During the incident, customers may have experienced an increase in 5XX errors or increased latency, depending on whether their traffic was synchronous or asynchronous.
  • Our team was alerted by monitoring at 1007CT and immediately started investigating. Mitigation efforts included preparing to roll back changes, scaling to increase capacity, and actively monitoring system health to ensure message processing continued. By 1102CT, errors were decreasing and latency was starting to return to normal levels. Full service was restored by 1115CT.

What we are doing about this:

  • Adding automated detection in all environments to prevent the data consistency errors that caused this incident, along with a broader category of data consistency errors.
  • Improving our rollback capabilities for faster mitigation.
  • Auditing and improving our standard operating procedure for working with this type of data.
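
As an illustration only, automated detection of this kind could take the form of a consistency check that runs before a config change is applied. The sketch below is not Redox's implementation: the report does not describe the base config schema or the specific inconsistency, so the `BaseConfig` fields, the `find_consistency_errors` function, and the two rules it checks (duplicate entries and references to missing parents) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseConfig:
    # Hypothetical shape: the incident report does not describe the real schema.
    config_id: str
    version: int
    parent_id: Optional[str] = None


def find_consistency_errors(configs: list[BaseConfig]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set passed."""
    errors: list[str] = []
    known_ids = {c.config_id for c in configs}
    seen: set[tuple[str, int]] = set()

    for c in configs:
        key = (c.config_id, c.version)
        # Assumed rule 1: each (id, version) pair appears at most once.
        if key in seen:
            errors.append(f"duplicate config {c.config_id} v{c.version}")
        seen.add(key)

        # Assumed rule 2: a referenced parent must exist in the same set.
        if c.parent_id is not None and c.parent_id not in known_ids:
            errors.append(
                f"{c.config_id} v{c.version} references missing parent {c.parent_id}"
            )
    return errors
```

A deploy pipeline could run a check like this in every environment and block the change whenever the returned list is non-empty, so an inconsistent set is caught before it reaches customer traffic.
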
Posted Jun 11, 2025 - 18:22 CDT

Resolved

Latency has now resolved and traffic is flowing as expected.
Posted May 29, 2025 - 12:34 CDT

Update

Latency has resolved for most feeds. We are continuing to monitor until latency is fully resolved.
Posted May 29, 2025 - 11:52 CDT

Monitoring

We have implemented a fix and seen that error rates have returned to nominal levels. The message latency is starting to decline. We are continuing to monitor until latency fully resolves.
Posted May 29, 2025 - 11:14 CDT

Identified

We believe we have identified the root cause of the issue and are deploying a fix.
Posted May 29, 2025 - 10:37 CDT

Investigating

We are seeing an increased number of errors with our API and message delays of up to 30 minutes. We are currently investigating the issue.
Posted May 29, 2025 - 10:35 CDT
This incident affected: Engine Core.