
A long-standing configuration issue on our Juniper routers seems to
have contributed to this routing issue.

The core of the problem is how our routing tables are generated and
how our local routes are announced to the Internet.

The rest of this message is technical in nature.

To summarize: the configuration issue mentioned above has been
corrected, and we should not see this problem again.

If you are a dedicated connectivity or colocation customer, you should
have our oncall pager number.  If you do not, contact your salesperson
to set up a couple of things:

	Monitoring of your connection and multiple services (depending
	on type of connection)

	Off-hours contact information, including the oncall pager
	number.  We can certainly send out business cards, and we
	might also have some stickers you can place on routers and
	the like.

What follows is the technical info...

When we announce routes out to the Internet, we choose to announce our
aggregated routes.  That means we take the smaller announcements that
are internal to our network into our local routing table, but we only
announce 1 larger network that encompasses all of these networks
combined.
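
As a rough illustration of the aggregation described above, here is a
small Python sketch.  It uses the /19 named later in this message; the
split into /24 networks is only an assumption for the example, since
our actual internal prefixes are not listed here.

```python
import ipaddress

# The aggregate named later in this message.
aggregate = ipaddress.ip_network("216.250.160.0/19")

# Assumed for illustration: the internal routes are the /24 networks
# inside the aggregate. The real internal prefixes may differ.
internal_routes = list(aggregate.subnets(new_prefix=24))

# Collapsing the full set of contiguous /24s yields one covering network,
# which is the only route that would be announced upstream.
announced = list(ipaddress.collapse_addresses(internal_routes))

print(len(internal_routes))  # 32
print(announced)             # [IPv4Network('216.250.160.0/19')]
```

The internal routes stay in the local routing table; only the single
collapsed network is announced.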

The misconfiguration involves how our internal routing tables are
calculated for routes connected to our customers (T1, colocation,
etc., but not dialup or DSL routes).  When a connectivity or
colocation customer's link is flapping (going up and down repeatedly),
our internal routing tables change to reflect the route going up and
down.  This is normal.  When a link is down, its network is removed
from the local routing tables.  Our monitoring can pick up a downed
interface if the interface is down for more than 30-60 seconds.

The problem was that this calculation also caused our aggregate route
to flap on the Juniper routers we use in the core of our network,
including our announcement of the aggregate route(s) to our upstream
providers.

The flapping customer link caused our aggregate network
216.250.160.0/19 (32 class C networks) to flap, so we repeatedly
inserted and removed the aggregate from our routing announcements.
This in turn caused routers out on the global Internet to 'dampen' our
routes (discard the routing announcements) until the route
stabilized.
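
For those curious how dampening behaves, here is a minimal Python
sketch of the general penalty-and-decay idea behind route flap
dampening.  The class name and all parameter values are illustrative
assumptions, not the settings of any particular router or provider.

```python
# Illustrative values only (assumed, not taken from any real configuration).
PENALTY_PER_FLAP = 1000.0  # penalty added each time the route flaps
SUPPRESS_LIMIT = 2000.0    # at or above this, the route is suppressed
REUSE_LIMIT = 750.0        # below this, a suppressed route is accepted again
HALF_LIFE_S = 900.0        # penalty halves every 15 minutes


class DampenedRoute:
    """Tracks the dampening penalty for one route (hypothetical helper).

    Calls must be made with non-decreasing timestamps (in seconds).
    """

    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _advance(self, now):
        # Decay the penalty exponentially for the time that has passed.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE_S)
        self.last_update = now
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

    def flap(self, now):
        # Each flap adds a fixed penalty on top of the decayed value.
        self._advance(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def is_suppressed(self, now):
        self._advance(now)
        return self.suppressed


route = DampenedRoute()
for t in (0, 60, 120):              # three flaps, one minute apart
    route.flap(t)
print(route.is_suppressed(120))     # True: announcements are ignored
print(route.is_suppressed(3720))    # False: an hour later the penalty has decayed
```

The point of the sketch is that a few flaps in quick succession keep a
route suppressed well after the last flap, which is why the effect
outlasts each individual flap.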

Because a customer T1 started flapping yesterday, this
misconfiguration caused intermittent routing outages for the network
above.  The individual outages did not last long enough for our
monitors to see the problem.
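
To show why short outages can slip past polling-based monitoring, here
is a small Python sketch.  The poll interval, the number of failed
polls needed before alerting, and the function names are assumptions
chosen to match the 30-60 second window mentioned earlier; they are not
a description of our actual monitoring setup.

```python
POLL_INTERVAL_S = 30   # assumed: the link is checked every 30 seconds
FAILURES_TO_ALERT = 2  # assumed: alert after two failed polls in a row


def is_down(t, outages):
    """True if time t (seconds) falls inside any (start, end) outage."""
    return any(start <= t < end for start, end in outages)


def first_alert(outages, end_s):
    """Return the poll time at which an alert would fire, or None."""
    consecutive = 0
    for t in range(0, end_s + 1, POLL_INTERVAL_S):
        if is_down(t, outages):
            consecutive += 1
            if consecutive >= FAILURES_TO_ALERT:
                return t
        else:
            consecutive = 0
    return None


# Three 15-second flaps: no poll ever sees two failures in a row.
short_flaps = [(5, 20), (65, 80), (125, 140)]
print(first_alert(short_flaps, 600))   # None

# One 100-second outage: caught on the second failed poll.
long_outage = [(200, 300)]
print(first_alert(long_outage, 600))   # 240
```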

Late last night our customer's T1 went down long enough for our
monitoring to catch it, and we were paged.  At that time we thought it
was a localized issue that was not causing wide-scale problems.

Further review shows that the problem started late yesterday afternoon.

Support can be reached Monday through Friday from 8:00am until 8:00pm
and Saturdays from 11:00am until 4:00pm, via phone at 612-337-6340 or
via email at [log in to unmask]

-- 
Mike Horwath                                    [log in to unmask]
                         ipHouse - Welcome home!