On April 6, 2017, starting around 5:50 PM CDT and ending around 6:20 PM
CDT, we had failures on *all* of our resolving DNS servers at once.
These servers are spread throughout our network, on different network
segments and running different server versions.
The main problems seen were:
*) Access clients were unable to look up any new DNS names while
connecting to various sites.
*) Our main internal systems couldn't find resources required for
normal operation; e.g., you might have seen a "can't find database
server" message while going to webmail.
*) Some management panels were unable to contact the hosts they
manage in order to operate them.
*) Some managed machines may not have been able to contact the
resources they need for normal operation.
While the set of DNS resolving servers is diverse (i.e., different
software versions, OS versions, networking, etc.), they all share a
common config base that stretches back many years, carrying
workarounds for various issues that have cropped up over time.
My initial guess is that one of those workarounds suddenly conflicted
with something external to us, or that a newly pushed, externally
triggered bug is being exploited and affects the software in a
previously unknown way.
I'm going to review the base config (knowing that many of those
workarounds have since been addressed by the software vendors) and
rework it. These config changes will be invisible to clients and will
be well tested before being fully rolled out to all the DNS resolving
servers.
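For anyone who wants to sanity-check resolution from their own machine, the basic test is just whether a lookup through the system resolver succeeds. A minimal sketch in Python (this is an illustration, not our actual monitoring setup; the hostnames used are placeholders):

```python
import socket

def can_resolve(hostname):
    """Return True if the system's configured resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Resolution failed (NXDOMAIN, timeout, resolver unreachable, etc.)
        return False

# "localhost" should always resolve; names under the reserved
# .invalid TLD (RFC 2606) should never resolve.
print(can_resolve("localhost"))
print(can_resolve("nonexistent.invalid"))
```

A monitoring loop would run a check like this against each resolver on a schedule and alert on failures, which is roughly what our normal monitoring does.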
We'll be monitoring everything as we normally do. If you have any
questions please let us know at [log in to unmask]
Doug McIntyre <[log in to unmask]>
-- ipHouse/Goldengate/Bitstream/ProNS --
Network Engineer/Provisioning/Jack of all Trades