Author Topic: Server disappearing off the net  (Read 3876 times)

Server disappearing off the net
« on: 22 May, 2008, 04:43:19 pm »
We've got a server that keeps disappearing off of the network.  Luckily I didn't set it up, so It's Not My Fault [tm]. ;D

It's a Debian box, and it looks like something fairly fundamental is happening, since last time it happened, I pointed nmap on another machine at it, and as far as nmap was concerned, the machine wasn't there.

There are a few entries in the logs which to me just suggest that NFS and Exim had issues seeing the outside world, and these entries are other things being upset by the net going all squirrelly and not the cause of it.

Can anyone make any suggestions of likely things that I can look at?  I'm sitting here "tail -f"-ing the output from syslog and hoping that things die, so I can watch what happens, but so far it's working fine!
Actually, it is rocket science.
 

Re: Server disappearing off the net
« Reply #1 on: 22 May, 2008, 04:47:59 pm »
I take it you've put "*.debug" in syslogd.conf too...

It may also be outputting panic info to the console. So if it's headless then you might want to whack a monitor on it (or set it up to have console over serial). And if you're on the console note that the panic info may be being pushed off the screen by your tailing...
"Yes please" said Squirrel "biscuits are our favourite things."

rae

Re: Server disappearing off the net
« Reply #2 on: 22 May, 2008, 04:48:12 pm »
Check physical first - cables, cards and duplex settings.  What does the connected switch have to say about it?

Re: Server disappearing off the net
« Reply #3 on: 22 May, 2008, 05:07:32 pm »
I'm with Rae. Look at the switch stats.
I think you'll find it's a bit more complicated than that.

Re: Server disappearing off the net
« Reply #4 on: 22 May, 2008, 05:08:28 pm »
It's connected to a dumb switch. Other machines on that switch have no problems with their network access, but when this machine goes a bit nuts they can't see it any more than remote machines can.

It doesn't permanently die, since it seems to come back eventually, and there seems to be nothing obvious in the logs.  There are entries for any debug events to be logged in syslogd's conf file.

I've checked the cabling and switch as best I can, and they all seem fine.

As usual in these situations it's doing its best impersonation of a watched kettle at the moment...

Re: Server disappearing off the net
« Reply #5 on: 22 May, 2008, 05:10:41 pm »
Try moving it to a different port on the switch. Try a new patch lead. Check the speed and duplex of the Ethernet port and see if it's the same as that on a box that isn't having any issues. Buy a managed switch. Unmanaged switches are for home use only.
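
The speed/duplex comparison can be scripted. This is a minimal sketch that assumes a Linux kernel exposing the `speed` and `duplex` attributes under `/sys/class/net/<iface>/` (older kernels may not, in which case the values come back as `None`). The function names and the `sysfs_root` parameter are illustrative, not from the thread.

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) for iface; a value that cannot be read is None."""
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            # Attribute missing, or the driver refuses the read (e.g. link down).
            return None

    speed = read("speed")
    duplex = read("duplex")
    return (int(speed) if speed is not None else None, duplex)


def settings_differ(a, b):
    """True when two (speed, duplex) tuples disagree on a value both sides could read."""
    return any(x is not None and y is not None and x != y for x, y in zip(a, b))
```

Run `read_link_settings("eth0")` on the problem server and on a healthy box (the interface name is an assumption), then pass both results to `settings_differ`.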

Re: Server disappearing off the net
« Reply #6 on: 22 May, 2008, 05:21:40 pm »
It's not a particularly heavily used machine, and I can't easily justify buying a managed switch to hang it off.

It's not my machine, and the guy who does run it doesn't really know what's happening, but he's happy to give me root access so I can poke around and see whether anything is obviously wrong.

No one other than him or me has root access (as far as I can see...).

When this happens it's more of an annoyance, since it comes back under its own steam.  I'm just not sure why it's going a bit doolally.

rae

Re: Server disappearing off the net
« Reply #7 on: 22 May, 2008, 05:32:31 pm »
Assuming it is lightly used, fire up Wireshark and leave it running on the interface.   Next time it barfs, look at the results.    Hardware problems are generally pretty obvious if you have access to the logs, and are the first place to check - duplex settings are crackers.   

By "not seeing it", are you sure that the server itself is unhappy, i.e. that it is not an external problem?  DNS perhaps?

Re: Server disappearing off the net
« Reply #8 on: 22 May, 2008, 05:36:32 pm »
Duplicate IP addresses?

Before we had a DHCP range the usual routine was to ping an IP on that subnet and if there was no response they'd use that IP address.

Of course that relied on people not typoing either address and interpreting the ping results properly.

Snooping the network traffic for arp requests is interesting. Make a note of the machine's MAC to compare when it next goes screwy.
"Yes please" said Squirrel "biscuits are our favourite things."

Re: Server disappearing off the net
« Reply #9 on: 22 May, 2008, 05:52:15 pm »
Actually, given that this is in academia somebody nicking the IP address isn't entirely implausible, so I will make sure that I've got a record of the MAC address.

I've also got a screen, keyboard and mouse on the machine now, so I can fiddle with it directly next time things go castors up, which they seem determined not to do at the moment.  Of course it's possible that keeping a TCP connection open to the machine is stopping things from failing, if it is something weird with the network setup.

Thanks for suggestions, I've got Wireshark on this machine, so it may well be useful to have it sat watching that machine if necessary.

Re: Server disappearing off the net
« Reply #10 on: 22 May, 2008, 11:00:40 pm »
It looks like the networking is failing on the server itself.  When it happens, if I go to the machine and (i) ping a machine elsewhere, (ii) restart the networking or (iii) type in "route" to show the routing table, then after a short pause the server starts responding again; any one of the three does it.

I still have no idea why the network is dying like this though...

aglet

Re: Server disappearing off the net
« Reply #11 on: 23 May, 2008, 09:05:42 am »
Quote
It looks like the networking is failing on the server itself.  When it happens, if I go to the machine and (i) ping a machine elsewhere, (ii) restart the networking or (iii) type in "route" to show the routing table, then after a short pause the server starts responding again; any one of the three does it.

I still have no idea why the network is dying like this though...

Might it be a driver bug?  See if there's a more recent version of whatever driver is required for the NIC (or bung another one in from a better different manufacturer and try that instead).  Does the driver have a debug parameter you can pass to it?  See Documentation/networking/<driver>.txt

Re: Server disappearing off the net
« Reply #12 on: 23 May, 2008, 09:48:25 am »
Could also be a duplicate IP address. If the server has been idle and the other machine with the same IP address has been active on the LAN, then all the other machines will have the rogue machine's MAC address in their arp caches (IP address to MAC address tables). If you do a ping from the server or restart one of the network services it's possible (certain with ping) that you are putting a packet onto the LAN; this will flush the rogue MAC address from the other machines' arp caches and replace it with the correct one, and off you go again.
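
A toy model of that sequence, as a sketch only: it assumes the rogue machine and the server each emit an ARP broadcast, and that listeners refresh an entry they already hold for the sender's IP (a simplified version of the merge rule in RFC 826). All addresses and names are made up for illustration.

```python
class Host:
    """A LAN host with an ARP cache mapping IP address -> MAC address."""

    def __init__(self, name):
        self.name = name
        self.arp_cache = {}

    def hear_arp(self, sender_ip, sender_mac):
        # Simplified merge: refresh the entry only if this IP is already cached.
        if sender_ip in self.arp_cache:
            self.arp_cache[sender_ip] = sender_mac


def broadcast_arp(sender_ip, sender_mac, listeners):
    """Model an ARP broadcast: every listener hears the sender's IP/MAC pair."""
    for host in listeners:
        host.hear_arp(sender_ip, sender_mac)


SERVER_IP = "192.0.2.10"          # hypothetical address
SERVER_MAC = "00:16:3e:00:00:01"  # hypothetical MAC of the real server
ROGUE_MAC = "00:16:3e:00:00:99"   # hypothetical MAC of the machine using the same IP

clients = [Host("client-a"), Host("client-b")]
for client in clients:
    client.arp_cache[SERVER_IP] = SERVER_MAC

# The rogue machine is active while the server is idle: caches now point at it.
broadcast_arp(SERVER_IP, ROGUE_MAC, clients)
poisoned = [client.arp_cache[SERVER_IP] for client in clients]

# The server sends something that triggers its own ARP broadcast: caches recover.
broadcast_arp(SERVER_IP, SERVER_MAC, clients)
recovered = [client.arp_cache[SERVER_IP] for client in clients]
```

In this model the server "disappears" between the two broadcasts, because frames for its IP are addressed to the rogue MAC.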

Re: Server disappearing off the net
« Reply #13 on: 23 May, 2008, 09:57:02 am »
From another machine, when the server is "dead", what does "nbtstat -A {ip.address}" respond with?  (It might not work, as I'm Windows-based here, but we'll often see a different computer name returned from the NetBIOS remote machine table thanks to a screwy network.)

Re: Server disappearing off the net
« Reply #14 on: 23 May, 2008, 10:03:06 am »
Another thing to try when the machine has fallen off the net is:

On a Windows box, try to ping the server that has a problem, then do:

arp -a

Check to make sure that the MAC address against the server IP address is actually the MAC address of the NIC in question on the server.

This of course assumes that the server and the Windows PC are on the same subnet; if not, you would just get the MAC address of the router's Ethernet NIC.
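
If the comparison is going to be repeated every time the server drops off, it can be scripted. The sketch below assumes the usual text layout of "arp -a" output (dash-separated MACs on Windows, colon-separated on Linux) and simply looks for an IP and a MAC on the same line. The function names are illustrative.

```python
import re

MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5})\b")
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def normalise_mac(mac):
    """Lower-case a MAC and use colons, so both output styles compare equal."""
    return mac.lower().replace("-", ":")


def parse_arp_table(text):
    """Map IP -> MAC for every line of arp output that contains both."""
    table = {}
    for line in text.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if ip and mac:
            table[ip.group(1)] = normalise_mac(mac.group(1))
    return table


def mac_matches(arp_output, server_ip, expected_mac):
    """True/False if the server has an entry; None if it is not in the table."""
    seen = parse_arp_table(arp_output).get(server_ip)
    if seen is None:
        return None
    return seen == normalise_mac(expected_mac)
```

Feed it the captured output of "arp -a" together with the MAC recorded while the server was healthy; a `False` result points at a duplicate IP address.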

Re: Server disappearing off the net
« Reply #15 on: 23 May, 2008, 10:53:20 am »
We've solved it at the moment by the somewhat bodged approach of a cron job that causes ping to fire a packet off every ten seconds to another server.  This appears to keep things up.
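
For anyone wanting to copy the bodge: standard cron only fires once a minute, so a ten-second interval needs a small loop that cron starts each minute. This is a sketch of such a loop, not the script actually used on the server; it assumes a Linux-style ping that accepts "-c 1", and the function names are made up.

```python
import subprocess
import time


def send_ping(target):
    """Send one ICMP echo request with the system ping; True if a reply came back."""
    # "-c 1" (send one packet) is the Linux spelling; other platforms differ.
    result = subprocess.run(
        ["ping", "-c", "1", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keepalive(target, interval_s=10, rounds=6, send=send_ping, sleep=time.sleep):
    """Send `rounds` pings spaced `interval_s` apart and return their results."""
    results = []
    for i in range(rounds):
        results.append(send(target))
        if i < rounds - 1:
            # No sleep after the last ping, so the run ends before the next cron tick.
            sleep(interval_s)
    return results
```

Calling `keepalive("other-server")` once a minute from cron (the host name is a placeholder) gives six pings roughly ten seconds apart. The `send` and `sleep` parameters exist only so the loop can be exercised without touching the network.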

Apparently the guy who runs it is planning to re-install the server in the next few weeks, so I think I've spent enough time on it, and hopefully after the re-install things will all work perfectly (as if...?!)

Re: Server disappearing off the net
« Reply #16 on: 18 June, 2008, 02:35:01 pm »
After a lot of faffing about, experimentation etc, the official reply from our networking people is that "It appears to be a problem due to the normal operation of the new switches."

Which is impressive.  If these switches don't see an outgoing packet from the server every 5 minutes or so, they just seem to give up on it.  The line is still there, it appears to be up and operating, but has no activity.  Once the machine sends a packet out, everything works again, but no incoming packets ever get to it when it's in this state.

I'm very impressed that we have networking support who can't actually come up with a better solution than we did ourselves, i.e. a cron job pinging every five minutes.  Even my £50 router at home doesn't lose machines when they don't do much, but apparently we buy and install expensive managed switches that do. >:(

Re: Server disappearing off the net
« Reply #17 on: 18 June, 2008, 02:40:38 pm »
It's "features" like that which really piss me off.
"Yes please" said Squirrel "biscuits are our favourite things."

tiermat

  • According to Jane, I'm a Unisex SpaceAdmin
Re: Server disappearing off the net
« Reply #18 on: 18 June, 2008, 02:51:45 pm »
Quote
After a lot of faffing about, experimentation etc, the official reply from our networking people is that "It appears to be a problem due to the normal operation of the new switches."

Which is impressive.  If these switches don't see an outgoing packet from the server every 5 minutes or so, they just seem to give up on it.  The line is still there, it appears to be up and operating, but has no activity.  Once the machine sends a packet out, everything works again, but no incoming packets ever get to it when it's in this state.

I'm very impressed that we have networking support who can't actually come up with a better solution than we did ourselves, i.e. a cron job pinging every five minutes.  Even my £50 router at home doesn't lose machines when they don't do much, but apparently we buy and install expensive managed switches that do. >:(

Hmmm, now let me guess, Cisco?  They have really, really brilliant features such as this....

They call them features, I call them a PITA...
I feel like Captain Kirk, on a brand new planet every day, a little like King Kong on top of the Empire State

rae

Re: Server disappearing off the net
« Reply #19 on: 18 June, 2008, 02:58:04 pm »
Quote
Which is impressive.  If these switches don't see an outgoing packet from the server every 5 minutes or so, they just seem to give up on it.  The line is still there, it appears to be up and operating, but has no activity.  Once the machine sends a packet out, everything works again, but no incoming packets ever get to it when it's in this state.

I'm very impressed that we have networking support who can't actually come up with a better solution than we did ourselves, i.e. a cron job pinging every five minutes.  Even my £50 router at home doesn't lose machines when they don't do much, but apparently we buy and install expensive managed switches that do.

Hmmm.  If they are Cisco, I can confirm that my managed switches (and I have a lot of 6509s....) do not exhibit this behaviour.

Re: Server disappearing off the net
« Reply #20 on: 18 June, 2008, 03:09:47 pm »
Quote
Which is impressive.  If these switches don't see an outgoing packet from the server every 5 minutes or so, they just seem to give up on it.  The line is still there, it appears to be up and operating, but has no activity.  Once the machine sends a packet out, everything works again, but no incoming packets ever get to it when it's in this state.

Eh? I don't get this one.
I can understand a MAC address table in the switch being flushed every five minutes - that would effectively "forget" the port to which the server is connected.

But why would that stop incoming packets for that server getting to it?
A switch is a multi-port learning bridge. So packets to a MAC address which has no known port should surely go out on them all till it is recognised? How else would a new device ever be recognised on the LAN, or how else could you plug a server into a new port?
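
The expected behaviour can be written down as a toy learning bridge. This is a sketch of the textbook model only, not of any vendor's firmware; the class name, the port numbers and the 300-second ageing time are assumptions for illustration.

```python
class LearningSwitch:
    """Toy multi-port learning bridge with MAC table ageing."""

    def __init__(self, ports, max_age_s=300):
        self.ports = list(ports)
        self.max_age_s = max_age_s
        self.table = {}  # MAC -> (port, time the MAC was last seen as a source)

    def handle_frame(self, in_port, src_mac, dst_mac, now):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) which port the sender is on.
        self.table[src_mac] = (in_port, now)
        # Age out entries that have not been refreshed recently.
        self.table = {
            mac: (port, seen)
            for mac, (port, seen) in self.table.items()
            if now - seen <= self.max_age_s
        }
        entry = self.table.get(dst_mac)
        if entry is None:
            # Unknown destination: flood to every port except the ingress port.
            return [port for port in self.ports if port != in_port]
        out_port = entry[0]
        return [] if out_port == in_port else [out_port]
```

In this model a server that stays silent past the ageing time is forgotten, but frames addressed to it are then flooded and still reach its port, which is why a flushed MAC table alone should not make the server unreachable.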





Re: Server disappearing off the net
« Reply #21 on: 18 June, 2008, 03:20:59 pm »
I'm sorry but that sounds like bo***cks to me. My job is configuring switches and routers and has been for the last 15 years and I have never heard of anything like this.

Pat

Cisco CCIE #2305

Re: Server disappearing off the net
« Reply #22 on: 18 June, 2008, 03:35:25 pm »
Well, I don't know the exact cause, and we don't have any access to the switch, but if we don't touch the server for just over six minutes and then ping it, nothing happens.  Watching the activity lights on the back of the server, they flicker away like mad for those six minutes and then just stop flickering; the switch seems to stop sending anything (including broadcast packets) to the machine.

Some sort of packet leaving the machine every five minutes cures it.  Moving the machine onto one of the older switches cures it.

It's a bit of a pain.

Re: Server disappearing off the net
« Reply #23 on: 19 June, 2008, 09:15:08 am »
Your ping is refreshing a cache somewhere probably. Either an arp cache or a mac-address table. The switch guys are fobbing you off.

Re: Server disappearing off the net
« Reply #24 on: 19 June, 2008, 09:27:25 am »
Unfortunately, we talk to the IT support bods, eventually they agree it's a networking issue and go and talk to the networking bods, and they tell us it isn't their fault, it's the new switches...

It wouldn't surprise me if they haven't got a clue what they're doing, I remember taking about ten minutes explaining to one of them once that it was possible for one machine to actually have more than one IP address associated with it... (ignoring the fact that until he started buggering about with the DNS it had been working like that for several years).