Recovery Point Objective

From HORSE - Holistic Operational Readiness Security Evaluation.
Revision as of 11:10, 27 October 2012 by Mdpeters (talk | contribs)

A “recovery point objective”, or “RPO”, is defined by business continuity planning. It is the maximum tolerable period in which data might be lost from an IT service due to a major incident.

The RPO gives systems designers a limit to work to. For instance, if the RPO is set to 4 hours, then in practice offsite mirrored backups must be continuously maintained; a daily offsite backup on tape will not suffice. Care must be taken to avoid two common mistakes around the use and definition of RPO. First, the RPO for each service is determined by staff through a business impact analysis; it is NOT determined by the extant backup regime. Second, when any preparation of offsite data is required, the period during which data is lost very often starts near the beginning of the work to prepare the backups that are eventually off-sited, rather than at the time the backups are off-sited.

Recovery point objective (RPO)

When computers used for normal "production" business services are affected by a "major incident" that cannot be fixed quickly, the Information Technology Service Continuity (ITSC) Plan is carried out by the ITSC recovery team. This plan will always assume that the production computing equipment, and the wider geographic location it normally resides at, might become completely out of bounds at an unpredictable time, without any warning. The location chosen to rebuild the service (the recovery site) must be at least 20 miles from the normal Production site and suffer no threats in common with it (e.g. the two sites should not be near the same coastline). The ITSC Plan must also satisfy two measurements, the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO), for any potentially affected services. These measures are determined by a team of people, called the Business Continuity (BC) team, that quantifies what losses might ensue if the services are not available. It is sobering to think that "potential loss of life" appears in far more IT service risk assessments than one might assume. The RTO and RPO are time intervals, typically expressed in hours, specified by the BC team as the longest times the business can allow without incurring significant risks or significant loss. They allow system designers to specify designs that are as cost effective as the RTO and RPO will permit.

The RTO is the time it takes to recover the service. The events that mark the start and end of the RTO duration must be pre-agreed between Business Continuity and ITSC staff. It is best to agree to start the RTO clock at the moment when it is decided to proceed with the recovery: sometimes too much time is taken over the decision to invoke recovery, and Major Incidents do not always start at easily definable wall-clock times anyway. The RTO clock should be deemed to stop once the team responsible for testing the service (before it is released to the wider user community) begins work. Defined in this way, the RTO can be set to a very specific time period, which allows better decision making at all levels, accepting that this compromises a little the principle of setting the RTO to be "the amount of time the business can be without the service".
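The clock convention described above can be sketched in a few lines of Python. This is only an illustration of the start and stop events, not part of any ITSC standard; the function names `rto_elapsed` and `within_rto` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def rto_elapsed(invocation_decided: datetime, testing_started: datetime) -> timedelta:
    """Elapsed RTO time: from the decision to invoke recovery until the
    team that tests the recovered service begins work."""
    if testing_started < invocation_decided:
        raise ValueError("testing cannot start before recovery is invoked")
    return testing_started - invocation_decided

def within_rto(invocation_decided: datetime, testing_started: datetime,
               rto_hours: float) -> bool:
    """True if the recovery met the agreed RTO (in hours)."""
    elapsed = rto_elapsed(invocation_decided, testing_started)
    return elapsed <= timedelta(hours=rto_hours)

# Hypothetical timestamps, for illustration only.
decided = datetime(2012, 10, 27, 9, 30)
testing = datetime(2012, 10, 27, 12, 45)
print(rto_elapsed(decided, testing))
print(within_rto(decided, testing, rto_hours=4))
```

Note that the time between the Major Incident and the decision to invoke is deliberately outside this measurement, which is the compromise the paragraph above accepts.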

The RPO is deceptively difficult to explain. The RPO is only a measure of the maximum time period in which data might be lost if there is a Major Incident affecting an IT Service; it is not a direct measure of how much data might be lost. Knowing this period, BC staff can more easily take steps to cover it and make plans to avoid or mitigate any impact of losing data that was entered within the period defined by the RPO. Consider a very simple example: a data entry clerk transfers data to an IT Service by copy typing from paper forms. If the only consideration is RPO, the clerk needs to keep back enough recent paper forms to be certain of being able to retype all of them going back the amount of time defined in the RPO. This article does not seek to address the complexities that arise if transactions are completed electronically between organizations and the home side of such transactions is lost because of a Major Incident.
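The clerk's retention rule can be expressed as a simple time-window filter. The sketch below is illustrative only: the function `forms_to_retain`, the form identifiers, and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta

def forms_to_retain(forms, now, rpo_hours):
    """Return the ids of forms entered within the RPO window ending at `now`.

    `forms` is a list of (form_id, entered_at) pairs. Anything entered
    inside the window might be lost and so must be available for retyping.
    """
    cutoff = now - timedelta(hours=rpo_hours)
    return [form_id for form_id, entered_at in forms if entered_at >= cutoff]

# Hypothetical data: three forms and a 24 hour RPO.
now = datetime(2012, 10, 27, 17, 0)
forms = [
    ("F-001", datetime(2012, 10, 26, 10, 0)),   # entered 31 hours ago
    ("F-002", datetime(2012, 10, 27, 9, 0)),    # entered 8 hours ago
    ("F-003", datetime(2012, 10, 27, 16, 30)),  # entered 30 minutes ago
]
print(forms_to_retain(forms, now, rpo_hours=24))
```

Only the forms entered inside the window need to be kept back; older forms are, by the definition of the RPO, already covered by recoverable data.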

Data Synchronization Points

A data synchronization point is a point in time. It is used to assess the way in which data backups relate to each other. Data backups need to be related to each other correctly with respect to the time of day they were made, or their relationship to computer system activity events. A data synchronization point is a point in time at which a set of backups exists which, if restored from, can be synchronized to the same point in time. Often the point in time to which the data is synchronized is some hours before the last backup in the set is completed. Backups that have no synchronization points are generally useless. A frequent mistake when setting the RPO for traditional daily tape off-sited backups is to assume 24 hours for the RPO. This mistake is the result of not considering that the RPO time begins with the start of the first data backup used in the synchronization point, and must also include time for boxing the tapes, the inevitable contingency time that must be allowed for "waiting for courier transport", and loading and final escape from site (this is not always at exactly the same time of day, so the RPO must be increased by an amount of time equivalent to any such variability). It is also risky to assume that tapes will always be physically intact; the RPO should include enough time to use a previous synchronization point too.
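One way to read the components listed above is as a simple sum. The sketch below is an interpretation rather than a formula given by the source: the function `tape_rpo_hours`, its parameter names, and all of the example durations are assumptions chosen for illustration.

```python
def tape_rpo_hours(cycle, backup_window, boxing, courier_wait,
                   pickup_variability, fallback_sets=1):
    """Worst-case age (hours) of recoverable data for periodic tape off-siting.

    All arguments are durations in hours, except fallback_sets, which is the
    number of earlier synchronization points that must remain usable in case
    the newest off-sited set contains a bad tape.
    """
    # Time from the start of the first backup until the tapes leave the site.
    preparation = backup_window + boxing + courier_wait
    # Worst case: the incident strikes just before the next pickup, so the
    # newest off-sited set left one full cycle ago (plus any pickup drift).
    newest_set_age = cycle + pickup_variability + preparation
    # Each fallback set that must stay usable adds one more cycle.
    return newest_set_age + fallback_sets * cycle

# Hypothetical durations: daily cycle, 8 h backups, 2 h boxing,
# 3 h courier contingency, 1 h pickup variability.
print(tape_rpo_hours(24, 8, 2, 3, 1, fallback_sets=0))
print(tape_rpo_hours(24, 8, 2, 3, 1, fallback_sets=1))
```

With these made-up figures, even the case with no fallback set comes out well above the 24 hours that is often assumed.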

RTO and RPO: Effects on computer system design

The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.

When assessing the ability of system designs to meet RPO criteria, for practical reasons the RPO capability of a proposed design is tied to the times at which backups are sent offsite. If, for instance, off-siting is on tape and only daily (still quite common), then 49 hours or, better, 73 hours is the best RPO the proposed system can deliver, so as to cover for tape hardware problems (tape failure is still too frequent, and one bad tape can write off a whole daily synchronization point). As another example, if a service is properly set up to restart from any point (data is capable of synchronization at all times) and off-siting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is to all intents and purposes 0 hours, although it is normal to allow an hour for RPO in this circumstance to cover any unforeseen difficulty.

If the RTO and RPO can be set to be more than 73 hours, then daily backups to tapes (or other transportable media) that are couriered on a daily basis to an offsite location comfortably cover backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will be one belonging to a specialist recovery company, which can more cheaply provide serviced floor space and hardware as required in recovery because it manages the risks to its clients and carefully shares (or "syndicates") hardware between them according to these risks.

If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site, and recovery hardware that is close to dedicated must be available there: hardware that is always capable of being pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design, one that is much more expensive than, for instance, tape backup designs.

If very high volumes of high value transactions are to be planned for, then the production hardware can be split across two sites; with a high bandwidth network connection between the two sites, constant mirroring of data can be achieved. If the user community is dispersed, or at least split across two geographic areas, then the configuration is resilient to single site Major Incidents, with zero RTO and RPO being achievable and very often little loss of service being experienced at most times of day.
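The three design tiers described in this section can be summarized as a small lookup that picks the cheapest design meeting a given RTO and RPO. This is a simplified sketch: the capability figures are taken from the examples in the text, while the `cost_rank` ordering (in particular, ranking the split-site design as the most expensive) and the names `Design`, `DESIGNS`, and `cheapest_design` are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class Design(NamedTuple):
    name: str
    rto_hours: float   # best RTO the design can deliver
    rpo_hours: float   # best RPO the design can deliver
    cost_rank: int     # 1 = cheapest (assumed ordering, not actual costs)

DESIGNS = [
    Design("daily tape, couriered offsite", 73, 73, 1),
    Design("continuous mirror with standby recovery hardware", 4, 1, 2),
    Design("production split across two mirrored sites", 0, 0, 3),
]

def cheapest_design(required_rto: float, required_rpo: float) -> Optional[Design]:
    """Return the lowest-cost design whose capabilities satisfy both
    objectives, or None if no listed design does."""
    candidates = [d for d in DESIGNS
                  if d.rto_hours <= required_rto and d.rpo_hours <= required_rpo]
    return min(candidates, key=lambda d: d.cost_rank) if candidates else None

# Hypothetical requirements for three different services.
for rto, rpo in [(96, 96), (4, 1), (0, 0)]:
    print(rto, rpo, "->", cheapest_design(rto, rpo).name)
```

In a real assessment the capability figures would come from the proposed designs themselves, and the objectives from the business impact analysis, not from fixed constants like these.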

RPO and RTO: a worked example

Figure: Schematic ITSC and RTO, RPO, MI

The above figure is an example of how RPO and RTO might pan out in a practical situation. Tape is used for backup in this example. The tapes are sent offsite once per day at around the same time, but this timing is not fully guaranteed. The off-siting operation does happen to occur at roughly the same time of day in the chart above. The daily backup off-siting tasks in this example are as follows:

  • A set of backups is made to tape, possibly via a disk staging area. The synchronization point for each set of backups is late in the backup operation in this example, as several large databases have to be backed up and all of them are required for a Synchronization Point (this is typical of such systems).
  • After that, the tapes have to be ejected, collated, and cataloged as they are boxed. It is often the case that off-siting operations are batched across a wide spectrum of systems at a data center; generally the backups for all services have to wait for the very last one to be created and boxed before they can be sent to the loading bay for transport.
  • Pickups by offsite data repositories are expensive. Generally a daily pickup with a reasonably priced contract will have only an approximate time for pickup and will be predicated on the data center being ready with the tapes when the van turns up. Extra pickups will generally be too expensive to contemplate on a regular basis, so a data center must build contingency time into the preparation period before the pickup is due to occur.

All of this must be done before the pickup, and all of it must be included in the RPO calculation, because the synchronization point being sent offsite depends on backups that were started very near to the start of these activities. So a recovered service, after a restore from one of these daily backups, is very likely to start up as at the end of the online day, perhaps 13 or more hours before the tapes being restored were driven away from the Production data center.

Against this background, suppose that a Major Incident occurs just before an off-siting pickup (the worst case) and, as always, the assumption is "total site loss, instantly", so the prepared backups never leave the site. In this case the RPO is set to 48 hours, only twice the normal off-siting cycle. As it happens, on this occasion pickups have been regular for a while, and you might make the mistake of thinking that because two off-siting operations have occurred within the RPO period noted above, you have two sets of tapes you might be able to use and still be within the RPO. This is not the case: the earlier set of tapes will produce a recovered service as at a recovery point that is too old to meet the 48 hour RPO, in this example by perhaps 12 or 13 hours. Now consider the effect of the latest set of off-sited tapes being rendered useless by a critically defective tape in the set (perhaps a 5-10% chance?): as the example above shows, you can now NOT meet the RPO at all. Tape capacity is increasing all the time, and fewer tapes mean that individual tape defects damage more backed-up data.
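The arithmetic behind this worked example can be checked with a short sketch. It assumes a 24 hour off-siting cycle and a synchronization point about 13 hours before each pickup, as described above; the function name `recovery_point_age` and the exact figures are illustrative rather than taken from the chart.

```python
def recovery_point_age(sets_back, cycle_hours=24, sync_lead_hours=13):
    """Age (hours) at the time of the incident of the data restored from a
    tape set, in the worst case where the incident occurs just before a pickup.

    sets_back=1 is the newest set that actually left the site,
    sets_back=2 is the set before it, and so on.
    """
    # The set left the site sets_back full cycles ago, and its data is as at
    # a synchronization point sync_lead_hours before it was driven away.
    return sets_back * cycle_hours + sync_lead_hours

RPO_HOURS = 48
for sets_back in (1, 2):
    age = recovery_point_age(sets_back)
    status = "meets" if age <= RPO_HOURS else f"misses by {age - RPO_HOURS} h"
    print(f"set {sets_back}: recovery point is {age} h old, {status} the RPO")
```

Under these assumptions the newest off-sited set gives a recovery point about 37 hours old, within the 48 hour RPO, while the earlier set gives one about 61 hours old, roughly 13 hours over it. This is why losing the newest set to a single bad tape means the RPO cannot be met.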

To complete the picture, the RTO is noted above too. In this case the service was recovered well before the RTO limit was hit. It is, however, interesting to note that in this example the RTO does NOT start just after the Major Incident. In this example, as often happens in reality, there is seemingly too much delay. A quick decision to invoke the ITSC Plan is always the best decision, in principle. The rule in setting an RTO should be that the RTO is the longest period of time the business can do without the IT Service in question. On the back of this, appropriately economic decisions must be taken at the design stage about how the IT Service is built and run. It must be allowed, however, that some time has to be spent in making the decision to invoke the ITSC Plan, and this decision time is an unknown variable. Remember too that quite large sums of money are often spent as soon as the decision to invoke is taken: staff are called in for extended periods of 24 hour working cover, and large fees are charged by some recovery service providers. In the example, there is the almost inevitable fudge that the RTO is set to the maximum time the business can do without the service whilst knowing full well that there is very likely to be a period of decision making before it.