The online systems of the Ministry of Economic Development (MED) suffered a major outage lasting 3.5 days in September 2011. The outage related to a storage area network (SAN) supplied by HP. It prevented MED staff and customers from searching, filing and processing documents using the online systems of the Companies Office, the PPSR and the Intellectual Property Office of NZ. MED processes some 20,000 transactions a day.
MED has published an independent report by Deloitte into the service failure. The report explains that direct responsibility for this and two earlier outages could not be assigned to any one party or event. Deloitte did find, however, that the underlying IT architecture, including the SAN, had weaknesses that increased the risk of serious business disruption in the event of a major incident. It identified three weaknesses: the lack of a standby disaster recovery (DR) environment, reliance on tape back-ups, and a "tightly integrated application architecture" (important services could not be restored independently of others). Deloitte concluded that these weaknesses delayed the restoration of MED's online services. It recognised, however, that a tightly integrated application architecture reflected good practice in custom development at the time the systems were built, and that the Ministry had already started to implement improvements (such as moving to the all of Government IaaS service, which would provide greater DR resilience) when the outages happened.
While Deloitte concluded that MED had not appreciated the full impact on itself, its customers and NZ Inc. of a serious outage of these services, it also concluded that all parties reacted to the incidents in a professional and committed way, that MED's communications and governance were sound and appropriate, that the decisions MED made following the incidents were reasonable, and that its planned actions are sound.
The biggest risks posed by MED's affected systems were that MED's DR arrangements were limited to tape data back-up only (there were no operable servers at the DR site) and that the parties did not fully understand the boundaries of HP's responsibility to monitor the SAN. This lack of clarity resulted in gaps in monitoring and "confusion and tension". Deloitte indicates that, in light of these and the other weaknesses, a complete disaster at MED's principal processing site could have left the services unavailable for up to two weeks.
The risk and reality of severe disasters affecting production sites were brought home to all of us by the Christchurch earthquakes. You need to ensure that your DR arrangements are appropriate for your system and for the likely impact on you and your customers should a major incident affect it.