Friday 9 September 2011

ISP router ARP cache problems when replacing servers

I ran into a problem today that took a while to understand, so I figured it was worth sharing...

Our external mail gateway was due for replacement, so a new virtual machine was built, configured and tested alongside the old production server. Once we were happy that everything was functioning as expected, the only remaining task was to disconnect the old server from the network and change the IP address of the new server from its test IP to that of the old server. This would require no changes to DNS, and total downtime would be about a minute.

The change was made and... nothing. No traffic to the new server.

Huh? I tested it from another IP on the public network and it was fine. We tried from another network and... nothing.

I changed the IP back to the test address and the server sprang into life.

After a significant amount of time brainstorming with colleagues about what was happening, we hit upon a possible cause: an ARP cache issue on the ISP-provided router. Unfortunately, we don't have administrative access to this router.

Fortunately, the ISP hadn't locked down the console port of the Cisco router and I was able to connect in and run a "show ip arp" command. Sure enough, it showed the MAC address of the old server. This meant that when packets arrived from the Internet the router was trying to forward them to the old server that was no longer on the network. If I had administrative access to the router, I would have been able to flush the ARP cache and all would have been good. But because this was a "managed" router, I wasn't able to do this. I could see the problem, I knew the solution, but couldn't fix it.
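For anyone wanting to automate that check, here is a rough Python sketch of the comparison I did by eye: parse the "show ip arp" output and see whether the cached MAC for an IP matches the MAC you expect. The sample output, the addresses and the function names are all made up for illustration, and the column layout is assumed to be the usual Cisco IOS one.

```python
import re

# Hypothetical "show ip arp" output; the addresses are illustrative only.
SAMPLE_OUTPUT = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.0.2.1               -   0000.0c07.ac01  ARPA   FastEthernet0/0
Internet  192.0.2.10             57   0011.2233.4455  ARPA   FastEthernet0/0
"""


def normalise_mac(mac: str) -> str:
    """Convert any common MAC notation (Cisco dotted, colons, dashes) to aa:bb:cc:dd:ee:ff."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def parse_arp_table(output: str) -> dict:
    """Map each IP address in the output to its cached MAC address."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with "Internet"; the header row and blank lines are skipped.
        if len(fields) >= 4 and fields[0] == "Internet":
            table[fields[1]] = normalise_mac(fields[3])
    return table


def is_stale(output: str, ip: str, expected_mac: str) -> bool:
    """True if the router has an entry for ip that points at a different MAC."""
    cached = parse_arp_table(output).get(ip)
    return cached is not None and cached != normalise_mac(expected_mac)


if __name__ == "__main__":
    # Hypothetical MAC of the new server; the table above still holds the old one.
    print(is_stale(SAMPLE_OUTPUT, "192.0.2.10", "66:77:88:99:aa:bb"))
```

If this reports a stale entry, the router is still sending traffic for that IP to the old hardware address, which is exactly the situation described above.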

I did some research online to see what the default ARP cache timeout was: on Cisco IOS it is typically 4 hours (14400 seconds).

I logged a call with the ISP, which was not a particularly useful experience. The ISP is a subsidiary of Cable & Wireless, and if you've ever had the misfortune of working with that company you'll understand what I'm talking about! I was told I'd get a call back in 8 hours. Brilliant! Not.

There were a couple of other options. Pulling the Ethernet cable from the router would bring the interface down, which I *think* would cause the ARP cache to flush, but I didn't have the luxury of doing this during working hours.

The final option was to try to get the new server to send a gratuitous ARP request. This is an ARP request that a server broadcasts about its own IP address, with the sender and target IP fields both set to that address. The idea is that other devices on the network will update their ARP caches with the sender's MAC address.
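To make that a bit more concrete, here is a minimal Python sketch of what such a frame looks like on the wire. It only builds the bytes; the send function is an untested, Linux-only assumption (raw AF_PACKET socket, needs root), and the MAC address, IP address and interface name are hypothetical. Whether a given router actually accepts gratuitous ARPs and updates its cache depends on its configuration.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our MAC as source, EtherType 0x0806 (ARP).
    eth_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP payload: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp_fixed = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender and target IP are the same address, which is what makes it gratuitous.
    arp_addresses = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_fixed + arp_addresses


def send_frame(frame: bytes, interface: str) -> None:
    """Send a raw frame. Assumption: Linux only, must be run as root."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((interface, 0))
        sock.send(frame)


if __name__ == "__main__":
    # Hypothetical values for illustration.
    frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.10")
    print(len(frame), frame.hex())
    # send_frame(frame, "eth0")  # uncomment on a Linux host, as root
```

In practice a ready-made tool on the server is probably an easier route than hand-building frames, but the sketch shows how little is in the packet: the server's own MAC and its own IP, announced to everyone on the segment.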

My server, however, was hidden behind a Cisco ASA firewall.

As I was searching for ways to get this working, the ARP cache timed out (possibly because the router's timeout was configured lower than the default, although I can't see the config to confirm this) and the new server sprang into life.

At first I wasn't sure whether it was the gratuitous ARP that fixed it, but within the next hour the ISP called and confirmed they had cleared the cache. So fair play to them for getting on with it and sorting the problem.

It's been a learning experience in that even the simplest and quickest network change can have unforeseen side effects!

1 comment:

Poiter said...

Where was the Cisco router located? I have had instances of an onsite ISP router having issues, and I just cut the power to it to power cycle it, and that has resolved my issues many times. Was this possible?