Submitted by Martyn Johnson on Mon, 28/11/2011 - 12:26
There was a major problem with the CL internal network during the afternoon of Friday 25th November, believed to have been caused by a faulty switch. There were a number of consequential side effects on other systems, which we now believe have been dealt with.
The problem manifested itself as a sudden and widespread loss of connectivity. Under such circumstances all normal diagnostic techniques fail, as it is not possible to access equipment remotely and system logs become inaccessible. Once we had established that it was not a straightforward failure of core equipment, it became apparent that the network was in a high-traffic "meltdown" state. Suspicion fell on a stack of switches serving a cabinet in the main computer room that was in the process of being dismantled, though the switches themselves had not been touched. When these switches were disconnected from the network, the load immediately dropped.
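For anyone wanting a quick sanity check during a similar event, the following is a minimal sketch (not part of our actual procedure) of how one might confirm, from a Linux host that is still reachable, that traffic levels have gone abnormal by sampling per-interface packet counters. It assumes the third-party psutil library, and the alarm threshold is purely illustrative.

    #!/usr/bin/env python3
    """Minimal sketch: spot a traffic "meltdown" from a still-reachable host
    by sampling per-interface packet counters over a short interval.
    Assumes the psutil library; the threshold is illustrative only."""

    import time
    import psutil

    SAMPLE_SECONDS = 5
    PACKETS_PER_SEC_ALARM = 50_000   # illustrative; tune for your own network

    before = psutil.net_io_counters(pernic=True)
    time.sleep(SAMPLE_SECONDS)
    after = psutil.net_io_counters(pernic=True)

    for nic, stats in after.items():
        if nic not in before:
            continue
        pps = (stats.packets_recv - before[nic].packets_recv) / SAMPLE_SECONDS
        flag = "  <-- possible storm" if pps > PACKETS_PER_SEC_ALARM else ""
        print(f"{nic:15s} {pps:12.0f} packets/s received{flag}")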
We also observed an anomaly on another switch, which turned out to be two ports on different VLANs that had been incorrectly connected together. At the time it was thought that this could have been what triggered the meltdown, since something similar had happened in the past. However, subsequent analysis suggests that this was a red herring; the switch had in fact detected the anomalous connection and disabled the ports before it could cause any trouble.
After the disconnection of the suspect switches, some connectivity problems remained. These turned out to be due to switches that had disabled their trunk ports in an attempt at self-protection, unfortunately cutting themselves off from remote management in the process. These switches required manual intervention; all recovered without further incident. Many other backup links also needed to be re-enabled manually.
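A simple reachability sweep can help build the list of switches that need a visit in person. The sketch below is illustrative only: the management addresses are made up, and it assumes a Linux host where the standard ping command is available.

    #!/usr/bin/env python3
    """Minimal sketch: sweep switch management addresses to find the ones
    that have cut themselves off from remote management. The address list
    is hypothetical; the ping options assume a Linux host."""

    import subprocess

    SWITCH_MGMT_ADDRESSES = [
        "10.0.0.1",   # example addresses only
        "10.0.0.2",
        "10.0.0.3",
    ]

    unreachable = []
    for addr in SWITCH_MGMT_ADDRESSES:
        # One echo request, one-second timeout, output discarded.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", addr],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            unreachable.append(addr)

    print("Switches needing a visit with a console cable:")
    for addr in unreachable:
        print(" ", addr)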
After the dust had settled, the suspect switch stack was examined and one of its component switches was found to be non-functional. Subsequent examination of the logs (which were inaccessible at the time) indicates that the incident started with a series of anomalies with this switch. The network is designed to be fault tolerant and a clean failure of this switch would certainly not have caused the effects observed. We will probably never know what this switch did during its death throes, but the circumstantial evidence is that it forwarded traffic that it was supposed to block. Most of the fault tolerance mechanisms are designed to deal with failures that cause traffic not to flow - e.g. switches that stop or links that are severed. Byzantine fault tolerance - correct behaviour of the system as a whole in the face of arbitrarily misbehaving components - is hard.
A network problem like this tends to have side effects. One of the more tiresome is that when the Xen host machine temporarily loses contact with the filestore that contains virtual discs, it returns permanent disc errors to its guest virtual machines. The guest systems assume that their discs are broken and mark them read-only, but otherwise continue running. This means that a large number of virtual machines need to be rebooted even though they are superficially running normally.
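A quick way to tell whether a particular guest was affected is to look inside it for filesystems that its kernel has remounted read-only. The sketch below is a rough illustration only; which mounts ought to be writable is of course site-specific.

    #!/usr/bin/env python3
    """Minimal sketch: run inside a guest VM to spot filesystems that the
    kernel has remounted read-only after disc errors of the kind described
    above. Flags ordinary block-device filesystems currently mounted 'ro'."""

    # Filesystem types we would normally expect to be writable.
    WRITABLE_FS_TYPES = {"ext3", "ext4", "xfs", "btrfs"}

    suspect = []
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype, options, _rest = line.split(None, 4)
            if fstype in WRITABLE_FS_TYPES and "ro" in options.split(","):
                suspect.append((device, mountpoint))

    if suspect:
        print("Read-only filesystems found - this guest probably needs a reboot:")
        for device, mountpoint in suspect:
            print(f"  {device} on {mountpoint}")
    else:
        print("No read-only filesystems detected.")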
We believe that everything has now been sorted out; any remaining issues should be reported to sys-admin as usual.