Managing Computer and Network Equipment Failures

While companies spend tremendous amounts of time and effort selecting the most reliable equipment, sooner or later a critical piece will fail. Therefore, successfully planning for and managing equipment failures is a critical aspect of business continuity. In her book Prepare for the Worst, Plan for the Best: Disaster Preparedness and Recovery for Small Businesses, Donna R. Childs details how to properly manage equipment failures.

Sooner or later a component of your IT equipment will fail. Fortunately, equipment failures occur less frequently than human errors. Sometimes, you encounter an even more frustrating scenario: a computer that periodically malfunctions. This is a fairly common scenario. About half of data losses on desktop computers can be attributed to data corruption on the hard disk, caused in equal measure by physical damage to the disk surface and by software glitches. In about one-third of the cases, data are lost due to interruptions in the power supply. In about 10% of the cases, data are lost due to overheated parts, caused by internal fans that are clogged with dust or have otherwise malfunctioned. This happens more frequently in dusty or carpeted rooms. The remaining 10% of the cases can be attributed to other causes, such as processor and board failures.

For our present purposes, let us assume that such an event is localized, meaning that only one system is affected at any given time. As we get further along, we will discuss complete system failures, including the total destruction of the system. We include in this category equipment failures that were originally caused by human error. Imagine that, in the process of repairing a broken lamp, you accidentally pull the computer plug. Afterward, the computer no longer boots because the sudden loss of power caused major data corruption on the hard disk.

It is possible to create nearly perfect protection against system failures. This can go as far as building so-called high-availability (HA) configurations that call for guaranteed continuous operation and availability in extreme cases. This could be realized with two or more computers that monitor each other, and in the event that one of them malfunctions, the error is automatically detected and corrected and the defective computer is then shut down. These setups are used for critical trading systems, and the space shuttle has five computers in an HA configuration.
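The mutual monitoring described above can be illustrated with a small heartbeat sketch. This is only a simplified model under stated assumptions, not how any real HA product works: the `Node` class, the `check_peer` function, and the three-second timeout are hypothetical names and values chosen for illustration, and time is passed in as plain numbers rather than read from a clock.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One computer in a hypothetical two-node HA pair."""
    name: str
    last_heartbeat: float = 0.0
    active: bool = True

    def beat(self, now: float) -> None:
        """Record that this node reported itself healthy at time `now`."""
        self.last_heartbeat = now


def check_peer(peer: Node, now: float, timeout: float = 3.0) -> bool:
    """Return True if the peer is still considered healthy.

    A peer that has been silent for longer than `timeout` seconds is
    marked inactive, which stands in for shutting down the defective
    computer. The timeout value is an assumption for this sketch.
    """
    if now - peer.last_heartbeat > timeout:
        peer.active = False
    return peer.active


primary = Node("primary")
standby = Node("standby")

# Both computers report in at t = 0 seconds.
primary.beat(now=0.0)
standby.beat(now=0.0)

# At t = 2 the primary is still within the timeout window.
print(check_peer(primary, now=2.0))  # True

# The primary stops reporting; at t = 5 the standby detects the failure.
standby.beat(now=5.0)
print(check_peer(primary, now=5.0))  # False
```

In a real configuration each computer would run this check against the other over the network, and the surviving machine would take over the workload. That extra machinery is part of why such setups are costly to build and maintain.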

However, short of launching a space shuttle or running an expensive financial trading operation, HA setups for typical continuous availability requirements are not cost-effective. They are expensive to set up and to maintain. It is unlikely that your small business has such requirements. But you may have systems that require 24/7 availability, for example, revenue-producing Web sites and other e-commerce applications. Many companies would also consider it desirable to have 24/7 availability of their e-mail systems. You should outsource such systems to a large data center that can do the job for you in a much more cost-effective manner than a small business ever could. Such data centers are experienced in professionally managing thousands of servers under nearly continuous operating conditions.

The reason why e-mail is often included as a critical functionality is simply that it has become one of the most valuable resources for small businesses. In particular, businesses that have traditionally done most of their communication via regular mail or mail pouch, such as law offices, now prefer to use e-mail for daily, informal communication and the exchange of ideas. A couple of years ago, you would have visited your lawyer to discuss a contract. Today, you discuss these items efficiently via e-mail. If, like the law office in this example, your company does not do business in the IT arena and you have to call a third-party service to fix your computer system, you should consider outsourcing your e-mail operation to professionals who ensure its proper functioning around the clock.

For any IT systems that you will maintain inside your business, there are methods to improve the contingency against equipment failure. To begin, rank your systems and associated workflows into the following categories:

  • Critical: Systems that directly support your core business operation. As discussed, for a small business, these systems should be outsourced and operated from a data center, but you might have very specific reasons, such as concerns about data security, why you want to keep them in-house and you are willing to accept the occasional lack of availability to keep costs down. Examples of systems you may prefer to maintain in-house are your client relationship management system, your company shared documents database, and your internal Web server.
    The goals: System Availability: close to 24/7; Downtime: less than one hour during business hours; Data Restoration: complete and immediate access on backup server during business hours.

  • Important: Systems that provide important add-on services to your business operation, such as your meeting, scheduling, and calendar system or payroll and accounting.
    The goals: System Availability: expected during regular business hours; Downtime: less than one to two days; Data Restoration: complete with backup data availability on the same or next business day.

  • Optional: Systems that make your daily activities more efficient, such as scanners with text recognition software or video conferencing systems. However, no vital information is stored on such systems.
    The goals: System Availability: usually operational during business hours; Downtime: less than three days or uptime as requested; Data Restoration: not required, but possible on special request.
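One way to keep this ranking actionable is to record it as data. The sketch below is a minimal illustration, not a prescribed method: the `Tier` class, the `TIERS` and `SYSTEMS` tables, and the `breaches_goal` function are hypothetical names, and converting the downtime goals into calendar hours (using the upper bound of "one to two days" for the Important tier) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Recovery goals for one category of system."""
    name: str
    availability: str
    max_downtime_hours: float  # assumption: goals expressed in calendar hours
    restoration: str


TIERS = {
    "critical": Tier(
        "Critical", "close to 24/7", 1,
        "complete and immediate access on backup server during business hours",
    ),
    "important": Tier(
        "Important", "expected during regular business hours", 48,
        "complete, with backup data available the same or next business day",
    ),
    "optional": Tier(
        "Optional", "usually operational during business hours", 72,
        "not required, but possible on special request",
    ),
}

# Example inventory, using systems named in the categories above.
SYSTEMS = {
    "client relationship management": "critical",
    "shared documents database": "critical",
    "payroll and accounting": "important",
    "scanner with text recognition": "optional",
}


def breaches_goal(system: str, downtime_hours: float) -> bool:
    """Return True if an outage of this length misses the tier's downtime goal."""
    tier = TIERS[SYSTEMS[system]]
    # Each goal reads "less than N", so reaching N already counts as a miss.
    return downtime_hours >= tier.max_downtime_hours


# A four-hour outage is acceptable for payroll but not for the CRM system.
print(breaches_goal("payroll and accounting", 4))          # False
print(breaches_goal("client relationship management", 4))  # True
```

Writing the goals down this way makes it easier to compare an actual outage against what was promised for each system.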

You probably already have a good idea which category each of your computer systems would fall into. You are now ready to make a direct comparison and assess the trade-offs between system failure and contingency measures. This should give you a clear idea of the budget and manpower required to protect each computer system appropriately.

Note the importance of simple regular maintenance on your equipment. This is an effective measure in avoiding failures that are most often caused by trivial circumstances and therefore create more hassle than actual damage. For example, components like monitors, keyboards, and mice often fail because they are directly exposed to human beings who sometimes spill coffee over them. Fortunately, they are easily exchanged as long as you have replacements available. You also want to open each computer from time to time. Fan inlets that are clogged with dust do not allow sufficient cooling of the internal components. This can quickly lead to overheating. Another warning sign of pending equipment failure is grinding noises from fans or hard disks.

 

From Prepare for the Worst, Plan for the Best: Disaster Preparedness and Recovery for Small Businesses by Donna R. Childs. Copyright 2008 John Wiley & Sons, Inc. All Rights Reserved. Used by arrangement with John Wiley & Sons, Inc.