From HORSE - Holistic Operational Readiness Security Evaluation.

The standard service level agreement (SLA) is usually phrased in terms of availability. However, as we've seen, availability can be tricky to determine and hard to manage, since it depends on uptime, which is outside the control of any clustering product.

However, consider the nature of most modern Internet-delivered services (the best exemplar being the simple web server). Most users who receive an error reply after clicking a URL will try again at least once. The Internet has made most web users tolerant of any failure they can put down to latency or routing errors. Thus, to maintain the appearance of an operational website, uptime and thus availability are completely irrelevant. The only parameter that plays any role in the user's experience is downtime. As long as the web server is recovered within the time the user will tolerate before retrying (ascribing the failure to the Internet), there is no discernible outage, and the service level meets the user's expectation of being fully available.

In the example given above, which most user requirements tend to fall into, it is important to note that since uptime turns out to be largely irrelevant, any money spent on uptime features is wasted. As long as the investment is in a high-availability harness that switches over fast enough, the cheapest possible hardware may be deployed.

Availability is the percentage of time during which a system, or a hardware or software component, is operational. The simplest representation of availability is the ratio of the expected value of the uptime of a system to the sum of the expected values of its uptime and downtime, or

<math>A = \frac{E[\mathrm{Uptime}]}{E[\mathrm{Uptime}]+E[\mathrm{Downtime}]}</math>
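As a minimal sketch of the ratio above, the following Python function computes availability from expected uptime and downtime. The function name and the sample figures (999 hours up, 1 hour down) are illustrative assumptions, not measurements from any real system.

```python
def availability(expected_uptime: float, expected_downtime: float) -> float:
    """Return A = E[Uptime] / (E[Uptime] + E[Downtime]).

    Both arguments must use the same time unit (hours, seconds, ...).
    """
    if expected_uptime < 0 or expected_downtime < 0:
        raise ValueError("uptime and downtime must be non-negative")
    total = expected_uptime + expected_downtime
    if total == 0:
        raise ValueError("uptime and downtime cannot both be zero")
    return expected_uptime / total


if __name__ == "__main__":
    # Hypothetical figures: 999 hours up, 1 hour down.
    a = availability(999.0, 1.0)
    print(f"Availability: {a:.3%}")
```

With these assumed figures the ratio is 999/1000, that is, 99.9% availability.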

Availability is typically specified in nines notation. For example, 3-nines availability corresponds to 99.9% availability, and 5-nines availability corresponds to 99.999% availability.
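To make the nines notation concrete, the sketch below converts a number of nines into an availability fraction and into the downtime it permits over a period. The helper names and the choice of a 365-day year as the default period are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap, 365-day year


def nines_to_availability(nines: int) -> float:
    """Convert a count of nines to an availability fraction (3 -> 0.999)."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return 1.0 - 10.0 ** (-nines)


def allowed_downtime_seconds(nines: int,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime permitted within the period at the given nines."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    # The unavailable fraction is 10^-nines of the period.
    return period_seconds * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in (3, 5):
        pct = nines_to_availability(n) * 100
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n}-nines: {pct:.3f}% available, "
              f"{minutes:.2f} minutes of downtime per year")
```

Under the 365-day assumption, 3-nines allows about 8.76 hours of downtime per year, while 5-nines allows only about 5.26 minutes.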

Areas of risk

  • Environmental Disruption: The most significant cause of downtime for remote locations, environmental problems go beyond worries of fires and floods. Cooling and power are key points of exposure and become more so as equipment density increases.
  • Human Error: We call this the "unspoken epidemic" because no one likes to discuss human error and the tremendous impact it can have on availability. But the fact is, it’s the second greatest cause of downtime in remote or unsupervised locations because systems are often housed in janitor closets, wiring closets, and other less than optimally secured settings.
  • Physical Theft: As assets become smaller and more efficient with hot swappable drives, they become more attractive targets that are easier to steal. Physical attacks against digital assets are often overlooked and are a source of real concern. Remember to account for portable media such as USB storage devices, mobile phones with cameras and storage, and similar devices.
  • Sabotage: Thrust into the forefront of everyone’s minds as a result of current events, terrorism is now something each of us must plan for, regardless of the probability. But there are very real threats within your own building from disgruntled employees who are far more likely to strike and to succeed.
