Ethernet links went down:	Sat Oct 30 12:33:22 CDT 2010
Ethernet links came up:		Sat Oct 30 12:33:58 CDT 2010

This caused the active load balancer to drop out and the standby unit
to take over; 36 seconds later, traffic cut back to the active unit.
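
As a quick check of the 36 second figure, here is a small sketch that
subtracts the two timestamps above.  The helper name parse_stamp is
made up for this example, and it assumes an English locale for the day
and month names.

```python
from datetime import datetime

def parse_stamp(stamp):
    # Drop the timezone token ("CDT"); both stamps share it, so the
    # difference between them is unaffected.
    parts = stamp.split()
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")

down = parse_stamp("Sat Oct 30 12:33:22 CDT 2010")
up = parse_stamp("Sat Oct 30 12:33:58 CDT 2010")

# Length of the window during which the links were down, in seconds.
gap_seconds = int((up - down).total_seconds())
print(gap_seconds)
```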

The normally standby unit's Ethernet links were unaffected during
this, even though both load balancers are plugged into the same
physical switches on adjacent ports.  This points to a problem on the
normally active load balancer, and I'll need to do more log searching
to check for a potential permanent failure.

Summary: All services handled on our cluster are back online

Techno-babble starts here...

The cutover from active to standby and back to active left the servers
behind the load balancers with a cached ARP (Address Resolution
Protocol) entry that mapped their gateway address to the now-standby
unit.

I know that sounds confusing, sorry!

To help explain:

   gateway is on 192.168.x.254 xx:xx:xx:xx:xx:17 (normal active unit)
   failure
   gateway is on 192.168.x.254 xy:xy:xy:xy:xy:4a (normal standby unit)
   cutback
   gateway is on 192.168.x.254 xx:xx:xx:xx:xx:17 (normal active unit)

but the systems had cached the ..:4a address when it changed and would
not recheck until the cached entry expired.  This is completely normal
and desired behaviour.

Oct 30 12:33:31 web-10 kernel: arp: 192.168.x.254 moved from xx:xx:xx:xx:xx:17 to xy:xy:xy:xy:xy:4a on em0
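
For anyone who wants to hunt for similar events in their own logs,
here is a rough sketch that pulls the fields out of a kernel line like
the one above.  The regular expression is written against this one
sample only, and the names ARP_MOVED and parse_arp_move are invented
for the example; other systems may word the message differently.

```python
import re

# Pattern modelled on the single log line quoted above (an assumption,
# not a documented log format).
ARP_MOVED = re.compile(
    r"arp: (?P<ip>\S+) moved from (?P<old>\S+) to (?P<new>\S+) on (?P<iface>\S+)"
)

def parse_arp_move(line):
    """Return the fields of an 'arp: ... moved from ... to ...' line, or None."""
    match = ARP_MOVED.search(line)
    return match.groupdict() if match else None

line = ("Oct 30 12:33:31 web-10 kernel: arp: 192.168.x.254 moved from "
        "xx:xx:xx:xx:xx:17 to xy:xy:xy:xy:xy:4a on em0")
event = parse_arp_move(line)
print(event)
```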

I had to log into each of the servers and clear the ARP cache entry
for the gateway address, after which everything came back online.
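
The manual fix could in principle be scripted.  The sketch below only
builds and prints the commands; it does not run them.  It assumes that
"arp -d <address>" run as root over ssh is enough to drop the stale
entry, which is not necessarily the exact command used here.  The host
list is illustrative (web-11 is a made-up name), and the gateway
address is the masked one from above.

```python
import shlex

def build_clear_commands(hosts, gateway_ip):
    """Build one 'ssh <host> arp -d <gateway>' command per server."""
    return [["ssh", host, "arp", "-d", gateway_ip] for host in hosts]

# web-10 appears in the log line above; web-11 is hypothetical.
hosts = ["web-10", "web-11"]
commands = build_clear_commands(hosts, "192.168.x.254")

for cmd in commands:
    # Print instead of executing, so the list can be reviewed first.
    print(shlex.join(cmd))
```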

Web services and mail services handled on our cluster were affected
during this event.

All services are back online.  The outage lasted approximately 15
minutes while I logged into each and every server.  During this time
some of the servers had already updated their ARP cache on their own,
so some services were back online before I had finished logging into
all of the servers.
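
To illustrate the caching behaviour described above, here is a toy
model of an ARP cache.  It is a simplification, not how any real kernel
implements it: the ArpCache class, the current_owner helper and the
1200 second max_age are all assumptions made for the example.  It shows
a server holding on to the standby unit's address after the cutback
until the entry is either deleted by hand or expires.

```python
class ArpCache:
    """Toy ARP cache: one MAC per IP, trusted until the entry expires."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.entries = {}  # ip -> (mac, expires_at)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now + self.max_age)

    def lookup(self, ip, now, resolve):
        entry = self.entries.get(ip)
        if entry is None or now >= entry[1]:
            # Missing or expired: ask the network who owns the address.
            self.learn(ip, resolve(ip), now)
        return self.entries[ip][0]

    def delete(self, ip):
        self.entries.pop(ip, None)


GATEWAY = "192.168.x.254"
ACTIVE = "xx:xx:xx:xx:xx:17"
STANDBY = "xy:xy:xy:xy:xy:4a"

def current_owner(ip):
    # After the cutback the normally active unit answers for the gateway.
    return ACTIVE

# Times are seconds after the links went down; max_age is an assumed value.
manual = ArpCache(max_age=1200)
manual.learn(GATEWAY, STANDBY, now=9)          # failover seen at 12:33:31
stale = manual.lookup(GATEWAY, now=60, resolve=current_owner)
manual.delete(GATEWAY)                         # the manual fix
fresh = manual.lookup(GATEWAY, now=61, resolve=current_owner)

untouched = ArpCache(max_age=1200)
untouched.learn(GATEWAY, STANDBY, now=9)
expired = untouched.lookup(GATEWAY, now=1209, resolve=current_owner)

print(stale, fresh, expired)
```

In this model a server that is left alone recovers by itself once its
entry ages out, which is consistent with some servers coming back
before I reached them.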

I find no errors on services at this time and everything looks good.

Now starts the investigation into why the normally active load
balancer thought that *all* of its ethernet ports went down at
precisely the same moment.

Support can be reached Monday thru Friday from 8:00am until 8:00pm via
phone at 612-337-6340, or via email at [log in to unmask]

-- 
Mike Horwath      ipHouse - Welcome home!       [log in to unmask]
        The universe is an island, surrounded by whatever it is
        that surrounds universes. - Berkeley Fortune