On Saturday, March 1st, 2014, starting at 16:44 CST, a fault occurred
on the connection between Tegile storage system #1 and our backend
storage network.

This affected only specific types of customers, and only those
located on Tegile storage system #1. Potentially affected customers
are all virtual machine clients on our VPS, VMForge, Enterprise
Managed Hosting & Enterprise Managed Cloud platforms.

Nothing else in our network would have been affected (i.e., no DSL,
email, web hosting, blog hosting, T1, etc.).


Initial analysis suggests that the network LAG for controller 1 had
both ports up on one side, but one port up and one port down on the
other side, at the 10G switch stack on the backend.

We attempted to physically transfer all control from the first
controller to the second controller, but nothing there responded at all.

Finally, we administratively shut down the backend network ports for
controller 1 on the backend 10G switch, and that triggered a
controller failover event.

The outage cleared at 18:03 CST, once this fix was in place.


We'll be reaching out to vendor support for both the storage and the
network halves of this to determine what happened and what went
wrong. The network is built redundantly: dual switches, dual LAGs
from dual controllers, etc. But a link that is up on one side and
down on the other isn't in the normal realm of outage monitoring.
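
As a rough illustration only, a check for this kind of "up on one side,
down on the other" condition might compare the link state reported by
each end of the LAG. The sketch below is not our monitoring code; the
function name, the link names and the "up"/"down" strings are all
hypothetical, and it assumes the per-port states have already been
collected from the storage controller and from the switch by some
other means.

```python
from typing import Dict, List


def find_lag_mismatches(controller_side: Dict[str, str],
                        switch_side: Dict[str, str]) -> List[str]:
    """Return the names of links whose state differs between the two ends.

    Both arguments map a (hypothetical) link name to "up" or "down".
    A link missing from one side is treated as "unknown" there, so it
    is also reported as a mismatch.
    """
    mismatched = []
    for link in sorted(set(controller_side) | set(switch_side)):
        near = controller_side.get(link, "unknown")
        far = switch_side.get(link, "unknown")
        if near != far:
            mismatched.append(link)
    return mismatched


# Hypothetical states resembling what we saw: the controller reports
# both LAG members up, while the switch reports one of them down.
controller_view = {"link1": "up", "link2": "up"}
switch_view = {"link1": "up", "link2": "down"}
print(find_lag_mismatches(controller_view, switch_view))  # ['link2']
```

The point of comparing both ends is that either side checked alone can
look healthy (or merely degraded) while the pair disagrees.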

We'll be monitoring all systems closely until we know the root
cause.

If you have any problems or questions, please let us know at
[log in to unmask], or call us at 612-337-6340.


-- 
Doug McIntyre                            <[log in to unmask]>
                    ~.~ ipHouse ~.~
       Network Engineer/Provisioning/Jack of all Trades