Network outages and consequences

Submitted by Piete Brooks on Thu, 17/03/2011 - 15:22

Since about the 11th of March, we have been having a number of problems: servers crashing, machines misbehaving, network dropouts, things going slowly, etc. We believe this to be due to some sort of local networking problem with the switches. Latest update: The supervisor in use on the central switch was swapped Tue Mar 22 09:52 and the problems did not recur within four hours.

There have been reports of general slowness for a while, some of which has been put down to the filer being overworked (some tuning of clients and the server appears to have helped a bit, and we are ordering some new hardware which we hope will fix this). This may have been exacerbated by early stages of the current problem; we're not sure.

A dire consequence is that some of the XenE servers running the 5.6 code lose all contact with the filer, which holds all the filing systems which they provide to their clients, such as the Lab Web Server. After a while, they give up and pass an error to the clients, which then cannot read or write any files. The only fix is to reboot the server, causing all (typically a dozen) clients to reboot.

Some non-virtual servers were also deeply upset by the outages, and crashed several times.

The problem is also reported to cause all the ACS machines in SW02 to freeze at the same time, only recovering about 5 minutes later.

Phones have been seen to be rebooting.

Most machines work most of the time with occasional outages, whereas a few fail most of the time and only occasionally work (one was fixed by power cycling the phone to which it was connected).

The most serious consequences were that a legacy Xen Fedora Virtual server (with 14 VMs) and an old Xen Enterprise pool (with 79 VMs) kept failing. The 14 VMs have been moved to a recent Enterprise pool, and the six servers constituting the old Enterprise pool have been upgraded (which appears to have 'lost' some of our Condor machines), and have not failed since.

OSPF anomalies have been found in the logs, and individual (redundant) links have been taken down, but it seems none was causing the problems.

This is a dynamic page, last updated Tue Mar 22 14:12:44 GMT 2011 (supervisor swap). It will be changed as things progress, so do keep coming back to read it.

Published by Piete Brooks on Thursday 17th March 2011