Services affected (warnings or errors):

    Virtualization services hosted on any of our platforms saw an
    interruption of disk input/output at approximately 9:36pm Monday
    night, May 28th, 2012.  This brief interruption lasted more than
    30 seconds while fail-over occurred between the two storage
    controllers.

Reason for service degradation:

    bug #1: mpt_sas driver failure caused kernel panic on controller
    #1 which resulted in a kernel core dump

    bug #2: HA failure: resources (storage, shared IP address) were not
    released from controller #1 to controller #2 until *after* the
    system dumped core (it takes a while to write out a large system
    core dump)

    Tegile has addressed both of these bugs with a rapid release of
    new patches to the controllers.  Controller #1 was patched last
    night.  Controller #2 will be patched tomorrow night (May 30th,
    2012, after 11:15pm).

    step 1: graceful fail-over will be done from controller #2 to
    controller #1, which will interrupt disk I/O for ~3-6 seconds
    (VMware will take care of the disk I/O queue during this fail-over)

    step 2: controller #2 will be patched and rebooted

    Please note: bug #2 caused the underlying storage (and networking
    for said storage) to be offline for > 30 seconds which can cause
    disk I/O timeouts that may require a reboot.  Most server
    operating systems were unaffected. RHEL 5/6, Ubuntu 10.04/12.04,
    and Windows Server 2008 were all fine.

    Normal fail-over (tested repeatedly earlier this year) is between 3
    and 6 seconds in length and should not adversely affect the
    availability of any system connected to this storage.
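
For anyone who wants to check how a Linux guest would fare against
the outage lengths above, here is a minimal sketch. It assumes the
guest exposes SCSI-style disks with a per-disk command timeout (in
seconds) at /sys/block/<disk>/device/timeout; the function name
disks_at_risk and its parameters are made up for illustration and
are not part of any tool we run. Guests using other disk drivers
may not have this file and are simply skipped.

```python
import glob
import os


def disks_at_risk(outage_seconds, sys_block="/sys/block"):
    """Return {disk: timeout} for disks whose command timeout is no
    longer than the given storage outage (both in seconds).

    Assumption: each disk exposes its timeout in
    <sys_block>/<disk>/device/timeout, as Linux SCSI disks do.
    """
    at_risk = {}
    pattern = os.path.join(sys_block, "*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/<disk>/device/timeout
        disk = path.split(os.sep)[-3]
        try:
            with open(path) as fh:
                timeout = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or not a plain integer; skip it
        if timeout <= outage_seconds:
            at_risk[disk] = timeout
    return at_risk
```

Calling disks_at_risk(6) covers the normal 3-6 second fail-over,
while disks_at_risk(35) approximates an outage of more than 30
seconds like the one on May 28th; any disk returned by the second
call but not the first could see I/O timeouts in that situation.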

Support can be reached Monday thru Friday from 8:00am until 6:00pm via
phone at 612-337-6340, or via email at [log in to unmask]

-- 
Mike Horwath      ipHouse - Welcome home!       [log in to unmask]
        The universe is an island, surrounded by whatever it is
        that surrounds universes. - Berkeley Fortune