Services affected (warnings or errors): Customers with servers that boot off of and utilize our Compellent storage area network (SAN).

Reason for service degradation:

On Friday, our Compellent SAN marked 1 of its 16 disks with Bad Regions. The spare was brought into service and a replacement was ordered from Compellent. On Saturday, the SAN marked 2 more disks with Bad Regions, and Compellent was contacted to get more spare drives on order. On Sunday, the SAN marked another 2 disks with Bad Regions, bringing the total to 5 disks with Bad Regions, and Compellent was contacted again to order more spares. FWIW, we have seen staggered failures of multiple disks in the past, but nothing of this magnitude.

At about 1:30am Monday, one customer started seeing slower I/O and higher latency, enough for our monitoring to start barking about it (a sketch of that kind of latency check follows this timeline). Early Monday morning, 2 of the disks apparently went from Bad Regions to Failed. It is possible that the system suddenly found itself in a state it couldn't safely handle and panicked.

On Monday morning (around 8:30am), the SAN attempted to switch over to the other controller and failed. The system is dual controller, dual active; the cause of this failover attempt is unknown and being investigated. The failed failover caused the initial controller to lock its cache memory, which severely restricts the I/O performance of the SAN, and it left the SAN in a "failed failover" state that it couldn't clear on its own. Around this time, almost all systems connected to the SAN crashed due to bad I/O or latency. The failed failover also marked all the Bad Region disks as "Healthy" again, which forced the system to scan everything all over again, further increasing the I/O demand on the system.

By midday Monday, the disk ordered on Friday had arrived, and we waited for instructions from Compellent before inserting it. Late Monday afternoon, as the system was rebuilding, Compellent determined which volumes the locked cache contained. We were instructed by Compellent to do an OS copy of the data off of these volumes and then destroy the volumes to free up the cache. We tried for about an hour to copy the data off, but it came back so fragmented and full of holes (over 5%) that we aborted the process (the second sketch below shows the general idea of that check). We then couldn't destroy the volumes ourselves through the admin interface, so we asked Compellent to destroy them. Compellent had correctly identified the root cause: those two volumes were what had locked the cache, and once they were destroyed we instantly regained the use of the cache.

Compellent support then went to work getting the other controller back up and running, clearing the "failed failover" event, and getting everything back to the way it was before the failure. This was completed by early Monday evening. In the end, the disks all showed as "Healthy," even though we know some of them still have Bad Regions. Once the system was normalized, we were finally able to remove a completely failed disk and replace it with the spare received earlier in the day. Unfortunately, this started the automated disk scans over again, and the spare disk space was consumed rapidly.

Meanwhile, with the system running almost normally (just slow from the disk rebuilds), Mike and Ben went to work getting customer systems cleaned up and able to boot and mount their data, to assess what data corruption may have happened. The number of rebuild processes has dropped from ~90 Monday morning to fewer than 12 as of this email (Tuesday morning).
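For those curious how the Monday-morning slowdown was caught: below is a minimal sketch, in Python, of the kind of per-disk latency check monitoring like ours performs. It samples /proc/diskstats twice and computes the average wait per completed I/O. The device name, threshold, and interval are illustrative placeholders, not our actual monitoring configuration.

    #!/usr/bin/env python3
    # Minimal sketch of a per-disk I/O latency check (Linux). The device
    # name, threshold, and interval are illustrative placeholders.
    import time

    DEVICE = "sda"        # hypothetical device to watch
    THRESHOLD_MS = 50.0   # hypothetical alert threshold
    INTERVAL_S = 10       # seconds between samples

    def sample(dev):
        # Return (I/Os completed, milliseconds spent on I/O) for one device.
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    reads, writes = int(fields[3]), int(fields[7])
                    read_ms, write_ms = int(fields[6]), int(fields[10])
                    return reads + writes, read_ms + write_ms
        raise ValueError("device %s not found" % dev)

    ios_1, ms_1 = sample(DEVICE)
    time.sleep(INTERVAL_S)
    ios_2, ms_2 = sample(DEVICE)

    completed = ios_2 - ios_1
    if completed > 0:
        avg_wait_ms = (ms_2 - ms_1) / completed
        if avg_wait_ms > THRESHOLD_MS:
            print("ALERT: %s average I/O wait %.1f ms over %ds sample" %
                  (DEVICE, avg_wait_ms, INTERVAL_S))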
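Likewise, for those wondering what "full of holes (over 5%)" means in practice: the sketch below shows the general idea, assuming a raw block-level copy. It copies block by block, counts unreadable blocks as holes, and gives up once the hole fraction passes a threshold. The paths, block size, and cutoff here are assumptions for illustration; this is not the exact procedure Compellent had us run.

    #!/usr/bin/env python3
    # Sketch of a block-level copy that tracks unreadable "holes" and aborts
    # when too much of the source can't be read. Paths and numbers are
    # hypothetical.
    import sys

    SRC = "/dev/hypothetical_volume"   # assumed source device
    DST = "/backup/volume.img"         # assumed destination image
    BLOCK = 1 << 20                    # copy in 1 MiB blocks
    MAX_HOLE_FRACTION = 0.05           # give up past 5% unreadable

    holes = 0
    total = 0
    with open(SRC, "rb", buffering=0) as src, open(DST, "wb") as dst:
        while True:
            pos = src.tell()
            try:
                chunk = src.read(BLOCK)
            except OSError:
                # Unreadable region: count it, skip past it, write zeros.
                holes += 1
                src.seek(pos + BLOCK)
                chunk = b"\0" * BLOCK
            if not chunk:
                break
            dst.write(chunk)
            total += 1
            if holes / total > MAX_HOLE_FRACTION:
                sys.exit("aborting: %d of %d blocks unreadable" % (holes, total))
    print("copied %d blocks with %d holes" % (total, holes))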
Once the rebuilds are done, individual disks will be replaced. More than likely this will start another round of automated scans, but it should be a smaller one, since another bad disk will be out of the system and data integrity will have improved. The replacement disks ordered over the weekend arrived Tuesday midday and are standing by for installation. We also now have 6 additional spares sitting on the shelf in case another drive fails during this high-I/O period.

As we recover more I/O capacity on the system, we will begin to migrate volumes off to another SAN. We expect the migration to take the rest of this week to complete. This work is running in parallel with repairing and rebuilding the servers that are still affected.

Support can be reached Monday through Friday, 8:00am until 6:00pm, via phone at 612-337-6340 or via email at [log in to unmask]

Bil
--
Bil K. MacLeslie - TGWGTD - ipHouse
http://www.ipHouse.com/