On 4/12/2016, alerts started coming in around 6:55PM CDT that
something was causing a large load on one of our Zebi storage SAN
units that some of our VM (SVC/VDC) customers are stored on. This
outage only affected VMware hosting customers on our VMForge platform.
A lot of VMs were waiting on storage and stacking up latency on
zebi4, yet zebi4 itself was showing lower than average utilization,
with no alerts, problems or outages shown. All stats looked normal,
just a little lower than the load we expect for this time of night.
On a hunch, we failed over the controller for zebi4 to the standby
around 8:10PM CDT, and the system load on the storage node went up to
a more normal pattern. This cleared up the storage latency issues
across the board for the VMs on zebi4. Our engineers are now
reviewing all VMs stored there to make sure they have recovered, or to
help them along if needed.
Unfortunately, there are no errors in the logs and nothing out of the
ordinary on the (now standby) controller. Everything there appears to
be operating at 100%, but in practice it was not. Since the only way
for the vendor to debug this state would be to put everybody back in
the error condition, we're not going to pursue that at this time.
We are up to date on software revs. There is one newer release
available, though its release notes say it fixes something else; we'll
most likely move to that one after reviewing it more.
If you have any further problems or questions please let us know at
[log in to unmask], or call us up at 612-337-6340.
Doug McIntyre <[log in to unmask]>
-- ipHouse/Goldengate/Bitstream/ProNS --
Network Engineer/Provisioning/Jack of all Trades