Root cause
On Monday 10/4, changes deployed between 2:41PM CT and 7:00PM CT as part of ongoing infrastructure scalability work inadvertently increased the volume of DNS lookups made by our services.
What happened?
Degraded DNS performance in turn degraded the application services that depend on it. We reverted some of the changes on Monday (10/4) evening, which immediately improved performance. As traffic volume increased on Tuesday 10/5, we experienced a more severe degradation. Reverting all changes deployed between 2:41PM CT and 7:00PM CT on Monday 10/4 restored DNS lookup performance and returned message processing to normal speed.
Impact on customers
Customers with high-volume MLLP feeds saw messages queue up on the HCO side: any HTTP request requiring DNS resolution sporadically incurred added latency on the order of 5-10 seconds.
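For illustration, the sketch below times a bare DNS lookup separately from the HTTP request that depends on it; this is the kind of measurement that isolates resolver latency from application latency. The hostname and iteration count are hypothetical, not values from this incident.

    import socket
    import time

    # Illustrative only: time the DNS lookup on its own, apart from the
    # HTTP request that depends on it. The hostname and iteration count
    # below are hypothetical.
    HOSTNAME = "api.example.com"

    def dns_lookup_seconds(hostname: str) -> float:
        start = time.monotonic()
        socket.getaddrinfo(hostname, 443)
        return time.monotonic() - start

    for _ in range(5):
        print(f"{HOSTNAME}: resolved in {dns_lookup_seconds(HOSTNAME):.3f}s")
        time.sleep(1)

During the incident window, a probe like this would have shown the sporadic multi-second resolution times that drove the end-to-end request latency.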
Learnings / Follow-ups
Internal teams are evaluating tooling that will let us better test for, and alert on, network degradation.
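As one possible shape for that alerting, the sketch below periodically probes DNS resolution time and flags sustained slowness. The hostname, threshold, and sample count are assumptions for illustration, not values any team has committed to.

    import socket
    import statistics
    import time

    # Minimal sketch of a DNS-latency probe that could back such alerting.
    # The hostname, threshold, and sample count are assumptions for
    # illustration only.
    HOSTNAME = "internal-service.example.com"
    THRESHOLD_SECONDS = 1.0
    SAMPLES = 10

    def dns_lookup_seconds(hostname: str) -> float:
        start = time.monotonic()
        socket.getaddrinfo(hostname, 443)
        return time.monotonic() - start

    latencies = []
    for _ in range(SAMPLES):
        try:
            latencies.append(dns_lookup_seconds(HOSTNAME))
        except socket.gaierror:
            # Treat outright resolution failures as worst-case latency.
            latencies.append(float("inf"))
        time.sleep(1)

    median = statistics.median(latencies)
    if median > THRESHOLD_SECONDS:
        print(f"ALERT: median DNS lookup {median:.2f}s exceeds {THRESHOLD_SECONDS:.1f}s")
    else:
        print(f"OK: median DNS lookup {median:.2f}s")

Alerting on a median over several samples, rather than a single slow lookup, avoids paging on the one-off resolution delays that are normal in any network.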