Disaster Management – “Don’t think it can’t happen to you?”

Most of us think of a disaster as a data centre that has been gutted by fire or hit by an aircraft; in other words, it’s off the air. While these types of disasters rarely happen, others like flooding and power outages (that’s when the generators don’t kick in, Murphy’s law) are more common. Sabotage is another, then there are ransomware attacks and high-profile data breaches, and finally, data corruption is also a big contender. Each of these can have catastrophic consequences for any business. Back-up data centres are the best protection you can get, but even then they are only as good as the WAN links that connect them, and those links can fail just when they are needed (Murphy’s law again). It’s all an insurance game where, by and large, you get what you pay for.

However, when the crunch comes, only people can recover systems and restore businesses. The trouble is that very little sensible information is prepared in advance of a disaster, and that includes the Disaster Management Plan, which itself needs to be kept current. Being prepared (defeating Murphy’s law) usually means that you won’t suffer a disaster, but you must have a plan. If you are a CIO or senior IT exec, you may wonder every now and then what would happen if you actually had a disaster.

I thought about it many times, perhaps more than most, because as a consultant, I had developed several Disaster Recovery (DR) Plans and Business Continuity Plans (BCPs). A detailed BCP with a matching DR Plan (DRP), a backup site and paper testing of the two plans are the minimum every business needs if it is to have a chance of surviving a disaster.

An IT disaster, in my book, comes in one of two types:

1) A network failure, which counts as a disaster because of the massive business and customer disruption it causes. It is not too bad, though, as these failures are recovered from fairly quickly, and you usually only have to replace a small amount of equipment.

2) A total Data Centre loss, which is the real worry, as it is a catastrophic loss that may not be quickly recovered from. Some of us would take comfort in the fact that most Data Centres are outsourced, so the problem would be for the Managed Services Provider (MSP), and we might personally escape somewhat unscathed.

As a senior IT exec, you are still responsible for ensuring that your DR Plan is workable and will recover you as fast as possible from a total loss. You can take some comfort in knowing that you have a backup Data Centre or that you are doing data mirroring, so whilst you might suffer some initial shock and awe, at least you would be back up and running again fairly soon.

Consider the table below, which lists the kinds of disruption that can occur at each layer; it shows that disasters can start from little things and have significant effects.

Presentation components

·       Users (end users, power users, administrators) cannot access the system through any part of the instance (e.g., client or server side, web interface or downloaded application).

·       Infrastructure and back-end services are still assumed to be active/running.

Business Intelligence / Reporting

·       The collection, logging, filtering, and delivery of reported information to end users is not functioning (with or without the user interface layer also being impacted).

·       Standard backup processes (e.g. tape backups) are not impacted, but the active/passive or mirrored processes are not functioning.

·       Specific types of disruptions could include components that process, match and transform information from the other layers. This includes business transaction processing, report processing and data parsing.

Network Layers

·       Connectivity to network resources is compromised, and/or significant latency issues in the network exist that result in lowered performance in other layers.

·       The assumption is that terminal connections, serially attached devices and inputs are still functional.

Storage Layer

·       Loss of SAN, local area storage, or other storage component.

Database Layer

·       Data within the data stores is compromised and is either inaccessible, corrupt, or unavailable.

Hardware components

·       Physical components are unavailable or affected by a given event

Virtual Layer

·       Virtual components are unavailable

·       Hardware and hosting services are accessible

Infrastructure Layer

·       Support functions disabled, such as management services, backup services, and log transfer functions.

·       Other services are presumed functional

Internal/External

·       Interfaces and intersystem communications are corrupt or compromised.

I have a record of building and commissioning Data Centres. As a result, I was once called upon to manage a Data Centre recovery. The Centre had suffered a major water leak that flooded the entire site (Admin, Server, Comms and Power rooms), so the site had to be taken down. The customer did not have a workable Disaster Management Plan or Business Continuity Plan (both of which had to be developed on the fly), and their backup site was in disarray (badly managed) and could not be used. It took four days to recover. The business disruption was catastrophic, making the daily news over the four days and affecting the share price.

To be prepared, you need to:

1.     Have a disaster recovery plan (DRP) and a business continuity plan (BCP).

2.     Determine the Maximum Tolerable Downtime (MTD) for each application.

3.     Sort applications into MTD order.

4.     Develop recovery strategies for each application.

5.     Prepare a hardware inventory (e.g., servers, desktops), applications and data inventories. Make sure that all the backups are working. Develop a list of critical applications and data and the hardware to run them. Make sure that application copies are available for re-installation on replacement equipment. Develop a priority list of hardware and application restorations.

6.     Create an emergency response team.

7.     Create procedures for declaring a disaster including an emergency communications plan.

8.     Investigate alternative backup data centres and/or processing alternatives.

9.     Document the DR plan.

10.  Practise paper-based dry runs of the DR plan and the emergency response team’s procedures.
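
Steps 2, 3 and 5 above can be sketched in a few lines of Python. This is a minimal illustration only, not a DR tool: the Application record, its field names, and the sample applications, MTD values and server names are all hypothetical, and a real inventory would come from your own asset register.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    """One entry in the application inventory (all values are illustrative)."""
    name: str
    mtd_hours: float  # Maximum Tolerable Downtime, in hours
    hardware: list[str] = field(default_factory=list)  # hardware needed to run it
    backup_verified: bool = False  # has a restore from backup been checked?


def restoration_order(apps):
    """Return applications sorted by MTD, shortest first (ties broken by name)."""
    return sorted(apps, key=lambda a: (a.mtd_hours, a.name))


def unverified_backups(apps):
    """Return the names of applications whose backups have not been verified."""
    return [a.name for a in apps if not a.backup_verified]


# Hypothetical inventory; the names and MTD figures are made up for the example.
inventory = [
    Application("payroll", 72, ["app-server-2"], backup_verified=True),
    Application("online-ordering", 4, ["web-server-1", "db-server-1"], backup_verified=True),
    Application("email", 24, ["mail-server-1"]),
]

# Print the restoration priority list, most urgent application first.
for position, app in enumerate(restoration_order(inventory), start=1):
    print(f"{position}. {app.name} (MTD {app.mtd_hours}h): {', '.join(app.hardware)}")

print("Backups still to verify:", unverified_backups(inventory))
```

Sorting by MTD puts the applications the business can least afford to lose at the top of the restoration list, and the backup check flags any entry where a restore has not yet been proven.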

Performance Questions

1.     Is there a BCP in place?

2.     Is there a DR Plan in place?

3.     Have the plans been paper tested by the disaster recovery response team?

4.     What kind of backup or alternate site arrangement is in place?

5.     What is the current level of risk?

6.     How do you rate your backup provider?

7.     What fundamental rights do you have under the backup agreement?

8.     If the backup site is a shared outsourcer’s site, what rights and privileges do you have?

Sample Task list

1.     Conduct a risk analysis.

2.     Investigate what level of DR you need.

3.     Work with the business to build a BCP.

4.     Form a DR response team.

5.     Paper test DR plans with response team.

6.     Determine what further work is required and scope it out.

7.     Break down the scope of works to task level, ready for loading into the change management project schedule.