MLLP Outbound Routing Impacted Affecting Transmission Routing
Incident Report for Redox Engine
Postmortem

Root cause

The initial outage on 10/22 was caused by a specific connection that was missing a “connection tag” in Redox’s VPN configuration, which led to failures within our System State Manager. This prevented deploys within the VPN gateway, which exacerbated a known bug in the VPN code and caused the software to enter a bad state and restart. Once the configuration was adjusted, deploys resumed successfully.

On 10/24, customers experienced a brief interruption in their ability to send Redox data as a result of the failover performed earlier to resolve the initial outage. The connections attempting to re-establish became saturated with rekey attempts. Traffic was interrupted because the tunnels were overwhelmed by those rekey attempts and could not process it. Restarting the service resolved the saturation.
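As an illustration of how a missing tag like the one described above might be caught before a deploy, here is a minimal sketch of a pre-deploy check. The configuration shape (a list of dictionaries), the `connection_tag` key, and the connection names are all hypothetical; the actual format of Redox's VPN configuration is not described in this report.

```python
from typing import Iterable, List, Mapping


def find_untagged_connections(
    connections: Iterable[Mapping[str, str]],
    tag_key: str = "connection_tag",  # hypothetical key name
) -> List[str]:
    """Return the names of connections whose tag is missing or blank."""
    missing = []
    for index, conn in enumerate(connections):
        # Treat an absent key, None, and whitespace-only values the same way.
        tag = conn.get(tag_key) or ""
        if not str(tag).strip():
            missing.append(conn.get("name", f"<unnamed #{index}>"))
    return missing


# Hypothetical sample data, for illustration only.
connections = [
    {"name": "customer-a", "connection_tag": "vhost-1"},
    {"name": "customer-b"},                          # tag absent
    {"name": "customer-c", "connection_tag": "  "},  # tag blank
]
untagged = find_untagged_connections(connections)
```

A deploy pipeline could refuse to proceed when the returned list is non-empty, so that a single malformed entry is reported by name rather than surfacing as a downstream failure.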

Impact on customers

Redox received TCP timeout errors while attempting to send transmission data to a small subset of customers, specifically those on the affected vhost. The total duration of the impact was about two hours.
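For readers who want to check whether an endpoint is reachable during this kind of event, the following is a minimal sketch of a TCP connection probe with an explicit timeout. It is not Redox tooling; the function name and the default timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout_s."""
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Example call (hypothetical host and port):
#   can_connect("mllp.example.internal", 2575, timeout_s=3.0)
```

A timeout here only shows that the TCP handshake did not complete in time; it does not by itself identify whether the cause is the VPN tunnel, the remote host, or something in between.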

Learnings / Follow-ups

The Redox Infrastructure team is investigating advanced alerting within VPNs to notify teams of an issue before it becomes impactful.

Redox is scaling to include an additional host, which will provide relief by allowing load to be distributed more evenly.
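One possible shape for the alerting follow-up above is a sliding-window counter that flags when rekey attempts exceed a threshold. This is only a sketch under assumed parameters: the class name, the threshold, and the window length are hypothetical, and how rekey events would be collected from the VPN software is not covered here.

```python
from collections import deque
from typing import Deque


class RekeyRateMonitor:
    """Flag when more than max_attempts rekeys occur within window_s seconds."""

    def __init__(self, max_attempts: int, window_s: float) -> None:
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one rekey attempt; return True if the threshold is exceeded."""
        self._timestamps.append(timestamp)
        # Drop attempts that fall outside the window ending at this timestamp.
        cutoff = timestamp - self.window_s
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_attempts


# Hypothetical thresholds: alert on more than 3 rekeys within 10 seconds.
monitor = RekeyRateMonitor(max_attempts=3, window_s=10.0)
alerts = [monitor.record(t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
```

In this example the fourth attempt trips the alert, and the attempt at 20 seconds does not, because the earlier attempts have aged out of the window.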

Posted Nov 04, 2020 - 15:21 CST

Resolved
This incident has been resolved.
Posted Oct 23, 2020 - 06:48 CDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 22, 2020 - 13:39 CDT
Investigating
We are currently investigating this issue.
Posted Oct 22, 2020 - 12:54 CDT
This incident affected: VPN.