On 4/7/2018, starting after midnight CDT, two of our four main web
clusters failed to restart Apache properly when the maintenance job
rolled the log files for the day. The two affected clusters were the
main web pool and the web-adv pool; both were restored by 1:40 AM CDT.
The old blog and new blog clusters, as well as other customer
clusters, were not affected. The failure caused an outage for web
hosting customers on the two affected pools.
The root cause was that we run two different OS releases in
production, FreeBSD 11.1-RELEASE and FreeBSD 10.3-RELEASE. Neither of
the affected clusters has been upgraded to 11.1 yet; both are still on
the older 10.3-RELEASE, which will shortly reach end-of-life.
The recent Apache security update was tested and rolled out successfully
on 11.1-RELEASE, but testing on 10.3-RELEASE was overlooked, and the
10.3 build of the update failed to deploy and load a critical module.
We have hand-deployed the critical module across the board, and all
systems are now working as expected.
In the upcoming week, we will continue rolling upgrades to bring all
systems onto 11.1-RELEASE, which has already been field tested on
other clusters.
More testing will be done to ensure that all required modules and
code are built and deployed correctly.
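As a rough illustration of the kind of check this could involve, here
is a minimal Python sketch that compares the modules Apache reports as
loaded against a required list. It assumes "apachectl -M" is available
on the host and prints lines such as " ssl_module (shared)". The module
names in REQUIRED_MODULES and all function names are hypothetical, since
this notice does not name the module that failed; this is not our
actual tooling.

```python
import subprocess

# Hypothetical list: the notice does not say which module failed to load.
REQUIRED_MODULES = {"ssl_module", "rewrite_module"}


def parse_loaded_modules(listing):
    """Return the set of module names found in `apachectl -M` style output."""
    modules = set()
    for line in listing.splitlines():
        parts = line.split()
        # Module lines look like " ssl_module (shared)"; this skips the
        # "Loaded Modules:" header and any unrelated warning lines.
        if (
            len(parts) == 2
            and parts[0].endswith("_module")
            and parts[1] in ("(static)", "(shared)")
        ):
            modules.add(parts[0])
    return modules


def missing_modules(listing, required=REQUIRED_MODULES):
    """Return a sorted list of required modules absent from the listing."""
    return sorted(set(required) - parse_loaded_modules(listing))


def check_host(apachectl="apachectl"):
    """Run `apachectl -M` on this host and report missing required modules.

    check=True raises CalledProcessError if Apache cannot parse its
    configuration at all, which should also be treated as a failed check.
    """
    result = subprocess.run(
        [apachectl, "-M"], capture_output=True, text=True, check=True
    )
    return missing_modules(result.stdout)
```

Calling check_host() on each OS release after a package update, before
the nightly log roll restarts Apache, would flag a build in which a
required module is not loaded.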
Doug McIntyre
-- ipHouse/Green Cloud Technologies --
Network Engineer/Provisioning/Jack of all Trades