The initial outage on 10/22 was caused by a specific connection that triggered failures within our System State Manager because a “connection tag” was missing from Redox’s VPN configuration. This prevented deploys within the VPN gateway and exacerbated a known bug in the VPN code, which caused the software to enter a bad state and restart. Once the configuration was adjusted, deploys resumed successfully. On 10/24, customers experienced a brief interruption in their ability to send Redox data as a result of the failover performed earlier to resolve the initial outage. The rekeying connections that were attempting to establish became saturated. Traffic was interrupted because the tunnels were overwhelmed with rekey attempts and could not process it. Restarting the service resolved the saturation.
Redox received TCP timeout errors while attempting to send transmission data to a small subset of customers on the affected vhost. The total impact lasted about two hours.
The Redox Infrastructure team is investigating advanced alerting for VPNs that would notify teams of an issue before it affects customers.
Redox is scaling to include an additional host, which will provide more capacity and help keep loads well distributed.