Sunday, February 26, 2012

High Availability, Business Continuity Planning, and Disaster Recovery

High Availability, Business Continuity Planning, and Disaster Recovery Planning have become hot keywords in the Information Technology industry as information systems have come to affect nearly all aspects of modern business. High availability is largely confined to Information Technology, but Business Continuity Planning and Disaster Recovery Planning extend far beyond the Information Technology function in modern businesses, governments, and non-profit organizations. This post gives a high-level overview of these approaches; subsequent posts in the series will give more expansive coverage of the approaches, examples, ideas, and scenarios that are not strictly limited to the IT function.

Business Continuity Planning (BCP)

Many professionals, managers, and executives don't understand Business Continuity Planning and often mistake the term for disaster recovery. Business Continuity Planning deals with events that adversely affect an organization, but not to the extent that a disaster is declared and the organizational focus shifts to rebuilding. These events can be specific to the IT function, such as the outage of a corporate phone system or e-mail system, or they can involve other parts of the organization, such as the sudden loss of a top-performing manager or exceptionally skilled employee. The business response to these events may or may not involve multiple business units, and the recovery effort is not typically an organization-wide initiative. The focus is typically on working around the issue and providing the same services and functionality with different resources (e.g., relying more on web, e-mail, and social networking communication methods during a PBX outage).

The main focuses from an IT standpoint for BCP are the continuation of critical functions, providing interim solutions for a wide range of small-scale failure scenarios, and providing a recovery plan for critical IT services in the event of a failure. Since the expected IT impact is limited, focus is placed on identifying service dependencies and working to make systems resilient to failures using traditional fault tolerance methods or implementing highly available systems.
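One way to picture the dependency identification described above is a small map of which services rely on which components. The sketch below is only illustrative: the service names and the `DEPENDS_ON` map are hypothetical, and a real inventory would come from the organization's own documentation.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on directly.
DEPENDS_ON = {
    "email": ["directory", "storage"],
    "pbx": ["network"],
    "intranet": ["directory", "database"],
    "database": ["storage"],
    "directory": ["network"],
    "storage": [],
    "network": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("storage", DEPENDS_ON)))
```

In this made-up map, a storage failure reaches e-mail, the database, and (through the database) the intranet, which is the kind of indirect impact a BCP exercise is meant to surface before an outage does.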

Disaster Recovery Planning (DRP)

Disaster Recovery addresses events that are far more serious in scope than those addressed by BCP. These events are typically serious enough that the organization cannot continue normal operation without taking corrective action, and restoring normal operation requires an organization-wide response. They can be related to the IT function, such as a major datacenter fire, or unrelated to it, such as a flu pandemic that affects more than 50% of the organization's workforce. Other events like hurricanes, earthquakes, and floods can also trigger an organization to implement its disaster recovery plan.

From an IT standpoint, a DR event is any event where a system cannot be returned to a normal operational state without rebuilding the system and recovering data from backup. DR events typically lead to a loss of data and a total loss of an IT service (vs. a degradation during a BCP event). Let's take a moment to consider the difference in the case of a datacenter fire where a traditional water fire suppression system is used instead of a dry release fire suppression system. In this scenario, a fire triggers the water release that shorts out all of the electronic circuitry and renders most of the data unrecoverable from the equipment in the datacenter. In the PBX failure scenario mentioned above, alternative communication methods exist, but in this case it is likely that the PBX, Internet, and all other forms of internal communication have been disabled. The organization has to switch to using cell phones or physical runners to communicate and the focus changes to restoring the entire datacenter's services or failing over to a secondary site.

Since DR events deal with loss of data, the focus of the organization from an IT perspective is on limiting the loss of information and minimizing the time required to restore critical services. At a minimum, creation and verification of off-site backups (to tape or to disk) using an enterprise backup software solution is required. Other activities, such as establishing a DR site and implementing highly available services, are more or less optional depending on the organization's dependence on IT services.
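To make the "verification" part of backups concrete, here is a minimal sketch that compares a source directory against a restored or mirrored copy using SHA-256 checksums. It is an assumption-laden illustration, not how any particular backup product works: the function names `sha256_of` and `verify_backup` are made up for this example, and enterprise backup software normally ships its own verification tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return (relative_path, problem) pairs for files missing or altered in the copy."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif sha256_of(source_file) != sha256_of(backup_file):
            problems.append((relative.as_posix(), "mismatch"))
    return problems
```

An empty list means every source file has a byte-identical counterpart in the copy. The more meaningful test is to run a check like this against a test restore from the off-site media, since a backup that cannot be restored is not much of a backup.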

High Availability

High Availability is the gold standard for proactively managing the impact of a routine operational outage, major system failure, or total destruction of an IT infrastructure and its related services. High Availability is a different concept from Business Continuity Planning and Disaster Recovery Planning. High Availability is an IT strategy that helps to simplify, reduce, and sometimes eliminate the business impacts from a BCP or DR event as far as IT services are concerned. When systems are highly available, service interruptions happen rarely and services are managed more from a performance management or capacity management viewpoint than an availability management standpoint.

HA systems are not always easy or cheap to design, implement, and maintain. Since the systems typically have a higher level of complexity, they require better hardware, software, facilities, and IT staff to maintain. They provide a higher benefit because organizational stakeholders receive a higher level of availability and the IT department gains flexibility in replacing failed components and performing routine maintenance activities such as patching and upgrading.

The Difference Between Fault Tolerance and High Availability

Fault Tolerance and High Availability are often confused because they both indicate a state of service resiliency. High Availability is a superset of fault tolerance, meaning that high availability implies fault tolerance, but fault tolerance does not necessarily provide high availability. High availability requires that a service be resilient to most failures (at the component, system, and site levels) and be load balanced so that the service is the same regardless of the user's location. Fault tolerance often indicates a state where a single component (disk, server, storage controller, NIC, etc.) can fail without impacting the service or the organization's stakeholders.
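Some back-of-the-envelope arithmetic helps show why component-level fault tolerance alone falls short of high availability. The sketch below uses the standard series and parallel availability formulas and assumes failures are independent, which real systems only approximate. The 99% and 99.5% figures are invented for illustration.

```python
def parallel(availability, copies):
    """Availability of redundant copies when any single copy is enough."""
    return 1 - (1 - availability) ** copies


def series(*availabilities):
    """Availability of a chain in which every element must be up."""
    result = 1.0
    for value in availabilities:
        result *= value
    return result


def downtime_hours_per_year(availability):
    """Expected yearly downtime for a given availability (365-day year)."""
    return (1 - availability) * 365 * 24


if __name__ == "__main__":
    server = 0.99   # hypothetical availability of one server
    site = 0.995    # hypothetical availability of one site (power, cooling, network)

    fault_tolerant = series(parallel(server, 2), site)  # redundant servers, one site
    highly_available = parallel(fault_tolerant, 2)      # the same setup at two sites

    for label, value in [
        ("single server, single site", series(server, site)),
        ("redundant servers, single site", fault_tolerant),
        ("redundant servers, two sites", highly_available),
    ]:
        print(f"{label}: {value:.5f} (~{downtime_hours_per_year(value):.1f} h/year)")
```

Under these assumptions, doubling the servers inside one site helps, but the site itself then caps availability at a little under 99.5%. Only the site-level redundancy moves the result past 99.99%, which matches the point that high availability has to cover component, system, and site failures.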

Some Lessons For Organizations

Implementing High Availability IT services is no longer optional.

The major stakeholders (customers, the organization's staff, and the individuals managing IT services) are all demanding higher service uptime and better quality systems. Customers want to access services 24/7/365 and will typically pick a competitor that allows them to do this. Customers aren't forgiving of outages or of issues involving a loss of data or a loss of security. Don't believe me? Consider your response to receiving a letter from your bank saying that it lost control of your private information (SSN, account number(s), etc.) or had an outage that lost your paycheck deposit. For me, this typically leads to a loss of business for the organization.

IT staff don't want to be woken up before or after hours, and the best potential employees typically survey the current state of an organization's IT systems and its on-call structure before agreeing to a new position. High-level individuals typically accept or reject an organization based on whether they think the IT function is given the appropriate resources. As a result, if your organization isn't trying to get it right, it will lose its best IT staff and struggle to recruit high-quality replacements. There are a few stories of this that I will detail from my personal experience in future posts.

Organizational staff are seeking a borderless office and the ability to do their work from any system at any location at any time of the day. This is especially true for global organizations where meetings can happen from multiple points around the globe at any time of the day. The typical 9-5 workday and "business hours" are 20th century concepts. In the 21st century, successful organizations are much more flexible and work towards a mutually beneficial relationship with employees.

The most effective organizations will be able to leverage highly available services both on premises and in the cloud to improve service uptime, improve the organizational climate for IT staff, and better serve the organization's stakeholders.
