How to improve your IT. Part 15 - Disaster Recovery
A series of posts on how to improve the performance of your IT
The best-practice IT standard is:
A detailed Business Continuity Plan (BCP) with a matching Disaster Recovery Plan (DRP) and a back-up site agreement. At a minimum, the BCP and DRP should have been paper-tested several times.
Most of us think of a disaster as a data centre that has been gutted by fire or hit by an aircraft; in other words, it’s off the air. Disasters of that type rarely happen. Flooding is a more common cause, as are power outages in which the generators don’t kick in (Murphy’s law). Sabotage is another cause, and data corruption is a big contender. Each of these can have catastrophic consequences for a business. Back-up data centres are the best protection you can get, but even then they are only as good as the WAN link that connects them, and both can fail just when they are needed (Murphy’s law again). It’s all an insurance game where, by and large, you get what you pay for.
However, when the crunch comes, only people can recover systems and restore businesses. The trouble is that very little sensible information is usually prepared in advance, and what there is must be kept current. Being prepared (defeating Murphy’s law) usually means that you won’t suffer a disaster, but you must have a plan.
Consider the list that follows. It is a small example, a reminder that disasters can start from little things (any of the components listed) and have significant effects.
User Interface / Rendering
Presentation components
Users (end users, power users, administrators) are unable to access the system through any part of the instance (e.g., client or server side, web interface or downloaded application).
Infrastructure and back-end services are still assumed to be active/running.
Business Intelligence / Reporting
Processing components
The collection, logging, filtering, and delivery of reported information to end users is not functioning (with or without the user interface layer also being impacted).
Standard backup processes (e.g. tape backups) are not impacted, but the active / passive or mirrored processes are not functioning.
Specific types of disruption could include failures in components that process, match and transform information from the other layers. This includes business transaction processing, report processing and data parsing.
Network Layers
Infrastructure components
Connectivity to network resources is compromised and/or significant latency issues in the network exist that result in lowered performance in other layers.
Assumption is that terminal connections, serially attached devices and inputs are still functional.
Storage Layer
Infrastructure components
Loss of SAN, local area storage, or other storage component.
Database Layer
Database storage components
Data within the data stores is compromised and is inaccessible, corrupt, or otherwise unavailable.
Hardware/Host Layer
Hardware components
Physical components are unavailable or affected by a given event
Virtualization (VMs)
Virtual Layer
Virtual components are unavailable
Hardware and hosting services are accessible
Administration
Infrastructure Layer
Support functions are disabled such as management services, backup services, and log transfer functions.
Other services are presumed functional
Internal/External
Dependencies
Interfaces and intersystem communications are corrupt or compromised.
To be prepared, at a minimum you need to:
Have a disaster recovery plan (DRP) and a business continuity plan (BCP).
Determine the Maximum Tolerable Downtime (MTD) for each application.
Determine a reasonable Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each application.
Sort applications into MTD or RTO order.
Develop priorities and RTOs.
Develop recovery strategies for each application.
Prepare a hardware inventory (e.g., servers, desktops), applications and data inventories. Make sure that all the back-ups are working. Develop a list of critical applications and data and the hardware to run them. Make sure that application copies are available for re-installation on replacement equipment. Develop a priority list of hardware and application restorations.
Create an emergency response team.
Create procedures for declaring a disaster.
Develop an emergency communications plan.
Investigate alternative back-up data centres and/or processing alternatives.
Document the DR plan.
Practise paper-based dry runs of the DR plan and the emergency response team’s procedures.
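The MTD, RTO and RPO steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Application` record, the function names and every figure in the sample inventory are hypothetical, and a real plan would take its numbers from the business. The sketch sorts applications into MTD order (using RTO to break ties) and flags any application whose planned recovery time is longer than the downtime the business can tolerate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    """One entry in a hypothetical application inventory (all times in hours)."""
    name: str
    mtd_hours: float  # Maximum Tolerable Downtime
    rto_hours: float  # Recovery Time Objective
    rpo_hours: float  # Recovery Point Objective


def recovery_order(apps):
    """Return the applications sorted by MTD, then RTO, most urgent first."""
    return sorted(apps, key=lambda a: (a.mtd_hours, a.rto_hours))


def rto_exceeds_mtd(apps):
    """Return the names of applications whose RTO is longer than their MTD."""
    return [a.name for a in apps if a.rto_hours > a.mtd_hours]


# Made-up sample data, for illustration only.
apps = [
    Application("payroll", mtd_hours=72, rto_hours=48, rpo_hours=24),
    Application("order-entry", mtd_hours=4, rto_hours=2, rpo_hours=0.25),
    Application("email", mtd_hours=24, rto_hours=30, rpo_hours=4),
]

for priority, app in enumerate(recovery_order(apps), start=1):
    print(priority, app.name, f"MTD={app.mtd_hours}h", f"RTO={app.rto_hours}h")

print("RTO longer than MTD:", rto_exceeds_mtd(apps))
```

An application that appears in the second list needs either a faster recovery strategy or a fresh conversation with the business about how much downtime it can really tolerate.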
Some business applications cannot tolerate any downtime.
They make use of a back-up data centre that can handle all their data processing needs, and they run parallel data mirroring between the two centres. This is a costly solution that only larger companies can afford.
However, there are other solutions available for small to medium-sized businesses with critical business applications and data to protect. Many companies have access to more than one facility. Hardware at an alternate facility can be configured to run similar hardware and software applications when needed. Assuming data is backed up off-site or data is mirrored between the two sites, data can be restored at the alternate site, and processing can continue.
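Whether restoring at the alternate site is good enough depends on how old the newest usable copy of the data is. A minimal sketch, assuming you know the timestamp of the last good off-site backup or mirror sync (the function names and the sample timestamps are hypothetical):

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy, disaster_time):
    """Return how much recent work would be lost if we restored now."""
    return disaster_time - last_good_copy


def meets_rpo(last_good_copy, disaster_time, rpo):
    """Return True if the data loss window is within the Recovery Point Objective."""
    return data_loss_window(last_good_copy, disaster_time) <= rpo


# Made-up timestamps, for illustration only.
last_backup = datetime(2024, 1, 1, 23, 0)
disaster = datetime(2024, 1, 2, 14, 30)

print("Data loss window:", data_loss_window(last_backup, disaster))
print("Meets 24-hour RPO:", meets_rpo(last_backup, disaster, timedelta(hours=24)))
print("Meets 4-hour RPO:", meets_rpo(last_backup, disaster, timedelta(hours=4)))
```

With a nightly backup, an application with a 4-hour RPO fails this check for most of the working day, which is the usual argument for mirroring that application's data between the two sites.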
Cloud-based disaster recovery as a service (DRaaS), which uses WAN-optimized replication to make highly efficient use of backup storage, is growing in popularity, especially among SMBs and mid-sized organizations. A typical service is priced on the protected capacity of your cloud platform and stores a configurable number of daily, weekly, and monthly backups for one base price.
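A "configurable number of daily, weekly, and monthly backups" can be pictured with a small retention sketch. It does not describe any particular vendor's behaviour. The function name, the default counts and the rule used here (keep the newest backup in each ISO week and in each calendar month) are assumptions made for illustration.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates retained under a daily/weekly/monthly policy."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])  # the most recent daily backups

    def keep_newest_per_period(period_key, limit):
        # Walk from newest to oldest and keep the first (newest) backup seen
        # in each period until `limit` periods have been covered.
        seen = []
        for d in newest_first:
            key = period_key(d)
            if key not in seen:
                if len(seen) == limit:
                    break
                seen.append(key)
                keep.add(d)

    keep_newest_per_period(lambda d: tuple(d.isocalendar())[:2], weekly)  # (ISO year, week)
    keep_newest_per_period(lambda d: (d.year, d.month), monthly)
    return keep


# Made-up history: one backup per day from 1 February to 31 March 2024.
history = [date(2024, 3, 31) - timedelta(days=n) for n in range(60)]
kept = backups_to_keep(history, daily=3, weekly=2, monthly=2)

for d in sorted(kept, reverse=True):
    print(d.isoformat())
```

Everything not in the returned set is a candidate for deletion, so the counts you configure directly drive how much protected capacity you pay for.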
Some vendors provide “hot sites” for IT disaster recovery. These sites are fully configured data centres with commonly used hardware and software products. Subscribers may provide unique equipment or software either at the time of a disaster or store it at the hot site ready for use.
Performance Questions
Is there a BCP in place?
Is there a DR Plan in place?
Have the plans been paper tested by the disaster recovery response team?
What kind of back-up or alternate-site arrangement is in place?
What is the current level of risk?
How do you rate your back-up provider?
What fundamental rights do you have under the back-up agreement?
If the back-up site is a shared outsourcer’s site, what rights and privileges do you have?
Sample Task list
Conduct a risk analysis.
Investigate what level of DR you need.
Work with the business to build a BCP.
Form a DR response team.
Paper-test the DR plans with the response team.
Determine what further work is required and scope it out.
Break down the scope of works to task level, ready for loading into the change management project schedule.