Services affected (warnings or errors): Customers with servers that boot off of and utilize our Compellent storage area network (SAN).

Reason for service degradation:

On Friday, our Compellent SAN marked 1 of its 16 disks with Bad Regions. The spare was brought into service and a replacement was ordered from Compellent. On Saturday, the SAN marked 2 more disks with Bad Regions, and Compellent was contacted to get more spare drives on order. On Sunday, the SAN marked another 2 disks with Bad Regions, bringing the total to 5 disks with Bad Regions, and Compellent was contacted again to order more spares. FWIW, we have seen staggered failures of multiple disks in the past, but nothing of this magnitude.

At about 1:30am Monday, one customer started seeing slower I/O and higher latency, enough for our monitoring to start barking about it (a sketch of that kind of latency check follows this timeline). Early Monday morning, 2 of the disks apparently went from Bad Regions to Failed. It is possible that the system suddenly found itself in a state it couldn't safely handle and panicked.

On Monday morning (around 8:30am), the SAN attempted to switch over to the other controller and failed. The system is dual controller, dual active; the cause of this failover attempt is unknown and being investigated. The failed failover caused the initial controller to lock its cache memory, which severely restricts the I/O performance of the SAN, and it left the SAN in a "failed failover" state that it couldn't clear on its own. Around this time, almost all systems connected to the SAN crashed due to bad I/O or latency. The failed failover also marked all the Bad Region disks as "Healthy" again, which forced the system to scan everything all over again, further increasing the I/O demand on the system.

By midday Monday, the disk ordered on Friday had arrived, and we waited for instructions from Compellent before inserting it. Late Monday afternoon, as the system was rebuilding, Compellent determined which volumes the locked cache contained. We were instructed by Compellent to do an OS copy of the data off of these volumes and then destroy the volumes to free up the cache. We tried for about an hour to copy the data off, but it came back so fragmented and full of holes (over 5%) that we aborted the process (the second sketch below shows the general idea of that check). We then couldn't destroy the volumes ourselves through the admin interface, so we asked Compellent to destroy them. Compellent had correctly identified the root cause: those two volumes were what had locked the cache, and once they were destroyed we instantly regained the use of the cache.

Compellent support then went to work getting the other controller back up and running, clearing the "failed failover" event, and getting everything back to the way it was before the failure. This was completed by early Monday evening. In the end, the disks all showed as "Healthy," even though we know some of them still have Bad Regions. Once the system was normalized, we were finally able to remove a completely failed disk and replace it with the spare received earlier in the day. Unfortunately, this started the automated disk scans over again, and the spare disk space was consumed rapidly.

Meanwhile, with the system running almost normally (just slow from the disk rebuilds), Mike and Ben went to work getting customer systems cleaned up and able to boot and mount their data, to assess what data corruption may have happened. The number of rebuild processes has dropped from ~90 Monday morning to fewer than 12 as of this email (Tuesday morning).
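For those curious how the Monday-morning slowdown was caught: below is a minimal sketch, in Python, of the kind of per-disk latency check monitoring like ours performs. It samples /proc/diskstats twice and computes the average wait per completed I/O. The device name, threshold, and interval are illustrative placeholders, not our actual monitoring configuration.

    #!/usr/bin/env python3
    # Minimal sketch of a per-disk I/O latency check (Linux). The device
    # name, threshold, and interval are illustrative placeholders.
    import time

    DEVICE = "sda"        # hypothetical device to watch
    THRESHOLD_MS = 50.0   # hypothetical alert threshold
    INTERVAL_S = 10       # seconds between samples

    def sample(dev):
        # Return (I/Os completed, milliseconds spent on I/O) for one device.
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    reads, writes = int(fields[3]), int(fields[7])
                    read_ms, write_ms = int(fields[6]), int(fields[10])
                    return reads + writes, read_ms + write_ms
        raise ValueError("device %s not found" % dev)

    ios_1, ms_1 = sample(DEVICE)
    time.sleep(INTERVAL_S)
    ios_2, ms_2 = sample(DEVICE)

    completed = ios_2 - ios_1
    if completed > 0:
        avg_wait_ms = (ms_2 - ms_1) / completed
        if avg_wait_ms > THRESHOLD_MS:
            print("ALERT: %s average I/O wait %.1f ms over %ds sample" %
                  (DEVICE, avg_wait_ms, INTERVAL_S))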
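Likewise, for those wondering what "full of holes (over 5%)" means in practice: the sketch below shows the general idea, assuming a raw block-level copy. It copies block by block, counts unreadable blocks as holes, and gives up once the hole fraction passes a threshold. The paths, block size, and cutoff here are assumptions for illustration; this is not the exact procedure Compellent had us run.

    #!/usr/bin/env python3
    # Sketch of a block-level copy that tracks unreadable "holes" and aborts
    # when too much of the source can't be read. Paths and numbers are
    # hypothetical.
    import sys

    SRC = "/dev/hypothetical_volume"   # assumed source device
    DST = "/backup/volume.img"         # assumed destination image
    BLOCK = 1 << 20                    # copy in 1 MiB blocks
    MAX_HOLE_FRACTION = 0.05           # give up past 5% unreadable

    holes = 0
    total = 0
    with open(SRC, "rb", buffering=0) as src, open(DST, "wb") as dst:
        while True:
            pos = src.tell()
            try:
                chunk = src.read(BLOCK)
            except OSError:
                # Unreadable region: count it, skip past it, write zeros.
                holes += 1
                src.seek(pos + BLOCK)
                chunk = b"\0" * BLOCK
            if not chunk:
                break
            dst.write(chunk)
            total += 1
            if holes / total > MAX_HOLE_FRACTION:
                sys.exit("aborting: %d of %d blocks unreadable" % (holes, total))
    print("copied %d blocks with %d holes" % (total, holes))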
Once the rebuilds are done, individual disks will be replaced. More than likely this will start another round of automated scans, but it should be a smaller one, since another bad disk will be out of the system and data integrity will have improved. The replacement disks ordered over the weekend arrived Tuesday midday and are standing by for installation. We also now have 6 additional spares sitting on the shelf in case another drive fails during this high-I/O period.

As we recover more I/O capacity on the system, we will begin to migrate volumes off to another SAN. We expect the migration to take the rest of this week to complete. This work is running in parallel with repairing and rebuilding the servers that are still affected.

Support can be reached Monday through Friday, 8:00am until 6:00pm, via phone at 612-337-6340 or via email at [log in to unmask]

Bil
--
Bil K. MacLeslie - TGWGTD - ipHouse
http://www.ipHouse.com/