Quote Originally Posted by Ahumaya View Post
Everything looks like it's working now, which is good.

While we will likely never know the true root cause, since no one involved has any reason to reveal what caused a 3+ week issue, we can make a few inferences and form a reasonable idea of what happened based on the evidence in this thread.

It looks most likely that NTT had a dying node. They were reluctant to investigate and resolve the issue, since replacing the node would cost a large amount of both time and money. During this slow death, the node would thrash the moment it became saturated, causing excessive latency and, worse, packet loss.

After about 3 weeks, the node most likely died outright, and automatic routing within the NTT infrastructure re-routed traffic around the failed node rather than through it, resolving the immediate issue. Since the device has failed entirely, NTT will still need to replace it, but in the meantime the automatic reroute has things working.

Looking at logs, the specific node that was causing the vast majority of issues within the NTT infrastructure no longer appears in traceroutes, which lends credence to this theory. But alas, we will very likely never know.

Much love, glad we got through this.
You might be right, Ahumaya. I haven't noticed that node come up in my traceroutes since the start of the month.
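If anyone wants to keep an eye out for that hop reappearing, something like the sketch below could automate the check. To be clear, this is just a rough idea, not anything from the thread: the address 203.0.113.7 is a placeholder from the documentation range (the real node's IP isn't in this post), and it assumes the plain `traceroute -n` output format.

```python
import re
import subprocess

# Placeholder address -- swap in the actual hop you saw misbehaving.
SUSPECT_HOP = "203.0.113.7"

def hops(traceroute_text: str) -> list[str]:
    """Extract the per-hop IPv4 addresses from `traceroute -n` output."""
    # Each hop line starts with a hop number, then the responding address.
    return re.findall(
        r"^\s*\d+\s+(\d{1,3}(?:\.\d{1,3}){3})", traceroute_text, re.M
    )

def suspect_in_path(target: str, suspect: str = SUSPECT_HOP) -> bool:
    """Trace the route to `target` and report whether `suspect` shows up."""
    out = subprocess.run(
        ["traceroute", "-n", target], capture_output=True, text=True
    ).stdout
    return suspect in hops(out)
```

Run `suspect_in_path("your.game.server")` from cron every few hours and you'd know right away if NTT ever routes traffic back through the replacement box at that location.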

Hopefully, the node kicking the bucket was the "fix" for the moment.