Submitted by Martyn Johnson on Thu, 28/07/2011 - 11:39
The Computing Service send their apologies for the telephone system outage this morning. A small and fairly routine reconfiguration was made to the West Cambridge routers during this morning's CUDN vulnerable period, ironically with the ultimate goal of improving network resilience. Unfortunately, a maintenance script failed to run correctly and the voice network became disconnected. Once the fault had been reported (by mobile phone), the communication channels worked well and it was attended to promptly.
It was slightly irritating that many of the phones took the opportunity to update their firmware when connectivity was restored, adding further delay. This appears to be simply how they work: if a non-urgent firmware update is pending, they apply it at the next restart, on the assumption that this is a convenient time.
An outstanding question is why our redundant connectivity via the New Museums Site did not heal the break. I can find nothing wrong with our configuration. From conversation with the Computing Service, it appears that we and they may have different expectations of how the redundancy should work. The main expert on the CUDN architecture was not available at the time; we have agreed to revisit the question when he is.