OUTAGE Archives - OUTAGE@LISTS.IPHOUSE.NET - October 2012

Subject: Issues with our SAN (specific customers only) (update 2)
From: Bil MacLeslie <[log in to unmask]>
Reply-To: [log in to unmask]
Date: Tue, 23 Oct 2012 15:42:18 -0500
Content-Type: text/plain
Parts/Attachments: text/plain (99 lines)

Services affected (warnings or errors):

     Customers with servers that boot off of and utilize our Compellent
     storage area network (SAN).

Reason for service degradation:

On Friday, our Compellent SAN marked 1 disk of 16 with Bad Regions;
the spare was put into use and a replacement was ordered from
Compellent.

On Saturday the SAN marked 2 more disks with Bad Regions. Compellent
was contacted to put more spare drives on order.

On Sunday the SAN marked 2 more disks with Bad Regions. At this
stage, the SAN had 5 disks with Bad Regions. Compellent was contacted
again to put more spare drives on order.

FWIW, we have seen staggered failures of multiple disks in the past,
but nothing of this magnitude.

At about 1:30 AM Monday morning, one customer started seeing slower
I/O and higher latency, enough for our monitoring to start barking
about it.
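
For those curious, that kind of latency alerting can be done by
sampling the kernel's disk counters. Below is a minimal sketch, not
our actual monitoring configuration; the device name and threshold
are placeholders, and it assumes a Linux host with Python 3 reading
/proc/diskstats:

#!/usr/bin/env python3
"""Minimal sketch of a disk I/O latency check (hypothetical values,
not our actual monitoring). Samples /proc/diskstats twice and
computes the average time per completed I/O over the interval."""

import sys
import time

DEVICE = "sda"           # placeholder device name
INTERVAL = 10            # seconds between samples
THRESHOLD_MS = 50.0      # hypothetical alert threshold

def read_counters(device):
    """Return (ios_completed, ms_spent) for one device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, read_ms = int(fields[3]), int(fields[6])
                writes, write_ms = int(fields[7]), int(fields[10])
                return reads + writes, read_ms + write_ms
    raise SystemExit(f"device {device!r} not found in /proc/diskstats")

def main():
    ios1, ms1 = read_counters(DEVICE)
    time.sleep(INTERVAL)
    ios2, ms2 = read_counters(DEVICE)

    delta_ios = ios2 - ios1
    avg_ms = (ms2 - ms1) / delta_ios if delta_ios else 0.0

    print(f"{DEVICE}: {delta_ios} I/Os, avg {avg_ms:.1f} ms/request")
    if avg_ms > THRESHOLD_MS:
        print("WARNING: I/O latency above threshold", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()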

Early Monday morning 2 of the disks apparently went from Bad Regions
to Failed. It is possible that the system suddenly found itself in a
state that it couldn't safely handle and panicked.

On Monday morning (around 8:30 AM) the SAN attempted to switch over
to the other controller and failed. The system is dual controller,
dual active. The cause of this failover is unknown and is being
investigated. The failed failover caused the initial controller to
lock its cache memory. Having the cache memory locked severely
restricts the I/O performance of the SAN. The SAN was now also in a
"failed failover" state that it couldn't clear on its own.

Around this time, almost all systems connected to the SAN crashed due
to bad I/O or latency.

The failed failover that started at 8:30 AM also marked all the Bad
Region disks as "Healthy" again, which forced the system to scan
everything all over again and further increased the I/O demand on the
system.

By midday Monday, the disk ordered on Friday arrived and we waited for
instructions from Compellent to insert it.

Late Monday afternoon, as the system was rebuilding, Compellent
determined which volumes the locked cache contained. We were
instructed by Compellent to do an OS-level copy of the data off of
these volumes and then destroy the volumes to free up the cache.

We tried for about an hour to get a copy of the data off the volumes
that had locked the cache, but the data was coming back so fragmented
and full of holes (over 5%) that the process was aborted. We then
couldn't destroy the volumes ourselves through the admin interface,
so we asked Compellent to destroy them. Compellent had correctly
identified the root cause: these two volumes were what had locked up
the cache, and once they were destroyed we instantly regained the use
of the cache.
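
For reference, here is a rough sketch of how that kind of
error-tolerant copy and "holes" percentage can be measured. The
device and destination paths are hypothetical, this is not the exact
procedure Compellent had us run, and it assumes a Linux host with
Python 3:

#!/usr/bin/env python3
"""Sketch of an error-tolerant block copy (hypothetical paths,
illustrative only). Unreadable chunks are replaced with zeros and
counted so the share of "holes" in the copy can be reported."""

import os
import sys

SRC = "/dev/sdX"              # placeholder: volume being salvaged
DST = "/mnt/rescue/sdX.img"   # placeholder: destination image file
CHUNK = 1024 * 1024           # read size: 1 MiB

def salvage(src, dst, chunk=CHUNK):
    """Copy src to dst, zero-filling chunks that fail to read.
    Returns (total_bytes, unreadable_bytes)."""
    fd = os.open(src, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)   # device/file size in bytes
    bad = 0
    with open(dst, "wb") as out:
        offset = 0
        while offset < size:
            want = min(chunk, size - offset)
            try:
                data = os.pread(fd, want, offset)
                if len(data) < want:       # short read: pad with zeros
                    bad += want - len(data)
                    data += b"\0" * (want - len(data))
            except OSError:
                data = b"\0" * want        # unreadable: leave a zeroed hole
                bad += want
            out.write(data)
            offset += want
    os.close(fd)
    return size, bad

def main():
    size, bad = salvage(SRC, DST)
    pct = 100.0 * bad / size if size else 0.0
    print(f"copied {size} bytes, {bad} unreadable ({pct:.1f}% holes)")
    if pct > 5.0:
        print("over 5% unreadable -- copy probably not usable",
              file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()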

Compellent support then went to work getting the other controller
back up and running, clearing out the "failed failover" event, and
getting everything back to the way it was before the failed failover.
This was completed by early Monday evening.

In the end, the disks were all "healthy" even though we know some of
them still have Bad Regions on them. Also, once the system was back
to normal, we were finally able to unmount a completely failed disk
and replace it with the spare received earlier in the day.
Unfortunately, this started the automated disk scan tasks all over
again, and the spare disk space was consumed rapidly.

Meanwhile, with the system running almost normally (just slow from
disk rebuilds), Mike and Ben went to work getting systems cleaned up
and able to boot and mount data, so we could assess what data
corruption may have happened.

The number of rebuild processes has gone from ~90 on Monday morning
to fewer than 12 as of this email. Once the rebuilds are done,
individual disks will be replaced. More than likely, this will start
another round of automated scans, but it should be a smaller one,
since another bad disk will be out of the system and data integrity
will be improved.

Those replacement disks that were ordered over the weekend arrived 
Tuesday midday and are standing by for installation. We also now have 6 
additional spares sitting on the shelf in case another drive fails 
during this high I/O period.

As we recover more I/O capacity on the system, we will begin to
migrate volumes off to another SAN. We expect the migration process
to take the rest of this week to complete. This will happen in
parallel with repairing and rebuilding the servers that are still
affected.

Support can be reached Monday thru Friday from 8:00am until 6:00pm via
phone at 612-337-6340, or via email at [log in to unmask]

Bil

-- 
Bil K. MacLeslie - TGWGTD - ipHouse http://www.ipHouse.com/
