Are You Prepared for a Disaster Recovery?

Article ID: 21139

If a disaster strikes your systems, are you prepared? A single event could bring even the most successful business to its knees. As a support center rep who specializes in disaster recoveries, I asked for input from other experts and included some of their responses as I prepared this article. The information I offer here will give you the general knowledge you need to stand back up quickly after a disaster.

Many things are involved in planning for a disaster recovery, but the simple checklist includes only a backup plan and a recovery plan.

Backup Planning

Backup planning is the most important part of a disaster recovery and one of the most important system operations to be performed. Why? Because without your data, you can't recover. Period.

Backup planning can be simple or complicated. But no matter what type of environment you have, you need to make sure that your mission-critical data is backed up daily.

First, consider the different options available with the save commands performed in the system-supplied version of a full system save (GO SAVE option 21):

  • SAVSYS
  • SAVLIB LIB(*NONSYS)
  • SAVDLO
  • SAV

SAVSYS. A common misunderstanding with the SAVSYS command stems from the name itself. SAVSYS does not save the entire system. For this reason, if you are speaking to a support rep and you say the word "SAVSYS," the rep will most likely ask you to clarify between a SAVSYS and a full system save.

The SAVSYS command includes four areas: Licensed Internal Code (LIC), operating system, user profiles, and configuration objects.

The LIC is the code below the operating system. Both the LIC and the operating system can be saved only with the SAVSYS command (with the exception of a couple of methods that I won't mention because they aren't recommended).

The operating system is only the *BASE OS and does not include any other licensed programs.

User profiles are saved with the SAVSECDTA command that is included in the SAVSYS command. The SAVSECDTA command also saves the passwords, authorization lists, and authority holders.

Configuration objects are saved with the SAVCFG command that is included in the SAVSYS command. These are the descriptions that attach to a resource, such as device, controller, and line descriptions. An example is your tape drive description (e.g., TAP01).

SAVLIB LIB(*NONSYS). Once again, the naming of this command is misleading: *NONSYS includes both system-critical and nonsystem libraries. It's called *NONSYS because it saves all of the libraries except the official "system" libraries, which are QDOC, QDOCxxxx, QRCYxxxxx, QRECOVERY, QRPLOBJ, QRPLxxxxx, QSPL, QSPLxxxx, QSRV, QSYS, QSYSxxxxx, and QTEMP. SAVLIB LIB(*NONSYS) is a combination of the two special values *IBM and *ALLUSR.

SAVLIB LIB(*IBM) saves all of the IBM-supplied libraries except the ones that include user data.

SAVLIB LIB(*ALLUSR) saves all of the user libraries: the libraries whose names don't start with the letter Q, plus a few Q libraries that are considered user libraries because they contain user data (such as QUSRSYS). For a list of these libraries, see the help text by pressing F1 on the LIB parameter of the SAVLIB command, or issue a WRKLIB *ALLUSR. (Note: If anyone creates a library that starts with the letter Q, it won't be saved as a part of *ALLUSR.)

SAVDLO. DLO stands for Document Library Object. DLOs include documents and folders, and they used to be where mail (distribution objects) was stored. DLOs are still used, but most objects are now located in the IFS, which is described in the next command. You can see your DLOs by issuing a WRKFLR command.

SAV. The SAV command saves the IFS, which is where your directories and directory objects are located. In many systems nowadays, this could be where the bulk of your information is located (rather than production libraries).

The IFS did not exist before V3R2, but it's an important part of your system. So if you use an earlier CL or application to perform backups, make sure it has been updated to include the IFS. Although V3R2 was long ago, I mention this because on several occasions I've worked a disaster recovery where the IFS was not included because the scheduled backup was created before the IFS existed (or because the creator incorrectly thought that only libraries were important).

Choosing a Backup Strategy

What backup strategy is best for you? The short answer is that you should back up as much of your system as possible, as often as possible, and this must include your critical data.

If you have a backup window that is large enough to perform a full system save every night, then you should perform a full system save every night. If your backup doesn't fit on one tape and requires an operator to switch the volumes, then remember that convenience shouldn't take precedence over security.

Instead, look for ways around this, such as a tape library or virtual tape (for V5R4 and later). (See "Virtual Tape: The Real Deal," January 2007, article 20780 at SystemiNetwork.com.) We all have a budget, but statistics will justify your case by showing how much more it will cost when you don't have the proper backups. A full system backup is the easiest way to recover, and it's usually the fastest approach for smaller systems.

If you can't perform a full system backup daily, you'll need to be a little more creative. This will most likely mean performing multiple backups at multiple times. So how do you decide what to back up and how often?

Typically, when planning a backup strategy, system managers ask the wrong questions first. These questions include, "How much down time can we afford?" and "What are the fastest drives we can get for the least amount of money?"

Instead, the first question should be, "If the entire system is lost, how far back can we afford to lose our critical data?" The next question may be, "What is the maximum amount of time we can afford to be down before it is detrimental to our business?" From these two questions, you can start planning your backup strategy.

I like to list data in three categories: critical user data, noncritical user data, and system data. This list is ordered from the most important to the least important when you have a tight backup window.

Critical user data is any data that is critical for your business (e.g., orders, financing data). Critical user data should be backed up daily, and in some cases, several times a day. For instance, if you can't afford to lose your critical user data from more than three hours ago, then you should back it up at least every three hours (eight times a day).
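The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the function name backups_per_day is hypothetical, and it assumes the backups are evenly spaced across a 24-hour day.

```python
import math


def backups_per_day(max_loss_hours):
    """Smallest number of evenly spaced backups per day that keeps the
    gap between two backups at or under max_loss_hours."""
    if max_loss_hours <= 0:
        raise ValueError("max_loss_hours must be positive")
    # Round up: a partial interval still needs its own backup.
    return math.ceil(24 / max_loss_hours)


print(backups_per_day(3))  # 8, matching the example above
print(backups_per_day(5))  # 5, because 4 backups would leave 6-hour gaps
```

The result is a floor on frequency, not a schedule. Whether the backup window actually allows that many saves is a separate question.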

Noncritical user data is user data that pertains to your business but would not cause a catastrophe if it were lost. Certain user data may be noncritical for one system but critical for another. This category may or may not include application customizations and history items.

System data is the least important because you can always restore it from distribution media or another system. But if recovery time is a factor, system data may be critical to save in a certain way. System data includes LIC, operating system, *IBM or licensed program libraries, IBM DLOs, and IBM directories (/QIBM/ProdData and /QOpenSys/QIBM/ProdData). System data usually doesn't change often, so you don't need to back it up frequently if your backup window is tight.

If your backup window is tight, then backup performance is important to you. Here are a few options to speed up your backup performance:

  • Obtain faster hardware (tape drives, fiber cables, and fast IOA/IOP cards).
  • Install the latest hardware firmware.
  • Increase memory.
  • Increase processors and CPW.
  • Install the latest PTFs.
  • Run the backup during the least amount of system activity.
  • Consider a save-while-active approach.
  • Consider using multiple tape drives.
  • Consider virtual tapes (must be set up properly to be faster).
  • Consider saving only changed objects.
  • Specify *DEV for DTACPR and COMPACT on the save command.
  • Use the performance adjuster (system value QPFRADJ) or third-party tuning tools.
  • Consider performing parallel backups.

When setting up your backup strategy, ensure that all parts of your system are saved at some point so you can save time on the recovery. Also, be conscious of how the items are saved to media because there is a specific order to recovering your data or system. (I'll discuss this in a moment.)

Below is an example of a common backup strategy. Remember: This is only an example, and it's probably not the best option for your environment. I recommend that you spend plenty of time planning your backup strategy and perhaps even contact a consultant to help set it up.

  • Daily — save changed objects of critical production libraries; perform SAV of critical production directories
  • Weekly — full save of critical production libraries, noncritical user libraries, and critical user directories
  • Monthly — full system save
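To make the rotation concrete, here is a minimal Python sketch that picks the save type for a given date. The calendar rules are assumptions added for illustration and are not part of the example strategy: the monthly save runs on the first of the month, the weekly save runs on Sundays, and the larger save replaces the smaller one when they fall on the same day. The name save_type_for is hypothetical.

```python
from datetime import date


def save_type_for(day):
    """Return which save from the example strategy runs on `day`.

    Assumed rules (not from the article): monthly on the 1st,
    weekly on Sundays, daily otherwise.
    """
    if day.day == 1:
        return "monthly"  # full system save
    if day.weekday() == 6:  # Monday is 0, so Sunday is 6
        return "weekly"   # full save of production libraries and directories
    return "daily"        # changed objects plus SAV of critical directories


print(save_type_for(date(2007, 1, 1)))
print(save_type_for(date(2007, 1, 7)))
print(save_type_for(date(2007, 1, 3)))
```

A scheduler built on rules like these should still be checked against your backup window, because the monthly save takes far longer than the daily one.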

You also need to verify that backups complete successfully. Support center reps sometimes find that backups have not been checked and may not have been successful. It is common to mistakenly assume that a scheduled backup job has been successful simply because it appears to have ended normally.

There are different ways to verify that a backup completed successfully, but the best way is to review the job log. If you know what you're looking for, you can use DSPTAP DATA(*SAVRST) OUTPUT(*PRINT) to create a spool file of all your saved objects. (IFS objects are listed in a separate output.)

If you're performing a full system save, use the IBM Knowledge Base document 387819982 to search the history log to quickly verify that the backup was successful. You need the actual job log itself (with high logging levels) to determine why something wasn't saved. Even if you're not performing a full system save, the message IDs in the document are useful.

DSPLOG PERIOD(('xx:xx:xx' 'dd/mm/yy'))
       MSGID(CPC3702 CPC3707 CPC9410 CPC370C CPF3771
             CPF3777 CPF9410 CPF3837 CPC2356 CPF2361)
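If you run this check after every backup, you may prefer to generate the command text rather than retype it. The Python sketch below only assembles the command string shown above from a start time. It does not run anything on the system. The function name dsplog_command is hypothetical, and the sketch assumes the time and date are entered as hh:mm:ss and dd/mm/yy, as in the placeholder. Adjust the format to match how dates are entered on your system.

```python
from datetime import datetime

# Message IDs copied from the DSPLOG example above.
MSG_IDS = [
    "CPC3702", "CPC3707", "CPC9410", "CPC370C", "CPF3771",
    "CPF3777", "CPF9410", "CPF3837", "CPC2356", "CPF2361",
]


def dsplog_command(start):
    """Build the DSPLOG command text for a backup that began at `start`."""
    start_time = start.strftime("%H:%M:%S")
    start_date = start.strftime("%d/%m/%y")  # assumed dd/mm/yy entry format
    ids = " ".join(MSG_IDS)
    return f"DSPLOG PERIOD(('{start_time}' '{start_date}')) MSGID({ids})"


print(dsplog_command(datetime(2007, 1, 15, 20, 0, 0)))
```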

Once you've determined that your backups are good, it's smart to make a copy in case there's a problem with the media. I recommend that you keep at least one copy on site and at least one copy off site. If you're using virtual media, duplicate that to physical media to send off site, or at least send the image or virtual volume to another system that is located at a different site.

One copy is left on site for quick retrieval, and another copy is sent off site in the event that a site disaster wipes out the physical system or media. Your off-site housing is an important decision because it determines how safely your data is stored and how quickly it can be retrieved.

Recovery Planning

It's better to spend the time now to plan for a disaster than to waste valuable time during an actual disaster. The five "W" questions describe the basic steps to a disaster recovery:

Why? If you know the cause of the problem, you can usually resolve it quickly.

For example, if you arrive one morning and find that your system is down, you don't just declare a disaster. Instead, you'd probably want to know what caused the problem. I once heard a story from a customer whose system went down at night, and it turned out that it had been unplugged because the janitor needed an outlet for her vacuum. In this scenario, they just needed to plug in the system and do an IPL.

Also, you may think that you have a major problem if every user gets an error when signing on to the system. But this could be due to a simple authority issue with the initial menu. Once you determine the cause of an issue, or you can't spend any more time finding the cause, you're ready to decide what needs to be restored.

What? Determining what needs to be restored is probably the most important recovery step. The answer determines how you will recover.

First, determine whether the data is your critical user data, noncritical user data, or system data. Depending on the amount of data, you may be recovering while the system is active, putting the system in a restricted state, or recovering to another system. (You should determine this before you recover and probably before your backup planning.)

List your critical data and determine the impact on production if that data is missing. Find what backup step each piece of data is a part of. This will lead you into the question of when to recover.

When? If you can restore your data while the system is active, you can restore right away. If other parts of production must be shut down while the data is restored, you must determine whether it is more important to recover the problem data now or wait for the rest of production to reach a certain stopping point. If there is a total system loss due to hardware or software, then the recovery should take place immediately.

Where? If you're recovering to the same system, then the "where" is taken care of. However, you should always plan for the possibility that you may not be able to recover to the same system. This is where you will spend a lot of time researching.

Determine what you must have available if you ever need to recover to a different system. This could be a backup system or partition in the same room, a system in a different room, or a system in a different geographical location. If a natural disaster strikes, any systems at a single location could potentially be destroyed. This is why many people have turned to disaster recovery sites.

How do you determine which site is right for you? As with almost every answer in technology, it depends. You need to look at your different requirements:

  • travel time
  • power requirements
  • hardware compatibility
  • security requirements
  • system availability
  • technical assistance
  • usage costs

Who? After you decide about all the preceding factors, you can choose who will perform the recovery. This should be the person who has the authority and is the most experienced with performing restores. You should also have several other people listed as backups in case the main person is unable to perform the recovery tasks.

If you don't have staff with good technical skills (or even if you do), it is wise to research and decide on a disaster recovery specialist beforehand. Whomever you choose, it is important that all parties involved are the same ones who perform the disaster recovery testing.

Disaster Recovery Testing

When I sent a note to my save/restore teammates asking for their number one recommendations with disaster recoveries, 75 percent of them responded with the same answer: "Test like you're in a real disaster situation."

Make sure that your testing is exactly as it would be if it were a real disaster. This means having the same people test the recovery who would perform the real recovery. Don't take just a full system save tape; take all of your saves (monthly, weekly, and daily).

Find the most complicated time for a disaster to strike, and make that the point from which you test your recovery. Disaster recovery sites will often preload systems or give you a temporary system that doesn't match the system you'd recover to in a real disaster. Make sure that the target system is set up exactly as it would be in a real disaster recovery scenario. I caution against preloading anything besides the LIC.

Disaster recovery testing is your golden opportunity to ensure that everything is correct. If you've already performed your disaster recovery testing and find there are major changes on your system or at the disaster recovery site, then test it again! When you find a problem during your testing, note the problem, fix it, and adjust your recovery procedures so your recovery is better and faster every time.

Most calls we get at the support center involve problems that clients find with their disaster recovery tests. And every call we get like this is one call we won't get when these clients are in an actual disaster situation.

Order of Recovery Steps

The recovery steps depend on how things were saved. You want to use the Backup and Recovery guide, test recoveries, and any additional recovery reports. The main steps are listed below in the order that the system should be restored.

  1. Restore the LIC (save = SAVSYS).
  2. Restore the operating system (save = SAVSYS).
  3. Perform a GO RESTORE (option 21; see the manual steps below).
  4. Install PTFs if needed.
  5. IPL to the B side of the system.

The manual steps for GO RESTORE are

  1. Restore the user profiles: RSTUSRPRF (save = SAVSYS or SAVSECDTA).
  2. Restore configuration objects: RSTCFG (save = SAVSYS or SAVCFG).
  3. Restore libraries: RSTLIB (save = SAVLIB). This includes IBM libraries (SAVLIB *NONSYS or *IBM) and user libraries (SAVLIB *NONSYS or *ALLUSR).
  4. Restore DLOs: RSTDLO (save = SAVDLO).
  5. Restore the IFS: RST (save = SAV).
  6. Restore the spool files: commands depend on release and program used.
  7. Restore authorities: RSTAUT (no save command, no media required).
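When a recovery is driven from a hand-written checklist, it helps to confirm that the commands are listed in the order given above. The following Python sketch shows one way to do that. It is an illustration only: the helper names are hypothetical, and the spool file step is left out because its commands depend on the release and program used.

```python
# Restore commands in the order of the manual GO RESTORE steps above.
RESTORE_ORDER = ["RSTUSRPRF", "RSTCFG", "RSTLIB", "RSTDLO", "RST", "RSTAUT"]


def in_recovery_order(commands):
    """Return `commands` sorted into the documented restore order."""
    rank = {cmd: position for position, cmd in enumerate(RESTORE_ORDER)}
    return sorted(commands, key=lambda cmd: rank[cmd])


def is_in_order(commands):
    """True if `commands` already follows the documented restore order."""
    return list(commands) == in_recovery_order(commands)


plan = ["RSTLIB", "RSTUSRPRF", "RSTAUT"]
print(is_in_order(plan))        # False: profiles must come before libraries
print(in_recovery_order(plan))  # ['RSTUSRPRF', 'RSTLIB', 'RSTAUT']
```

A command name that is not in RESTORE_ORDER raises a KeyError. That is deliberate in this sketch, so an unknown step is flagged instead of being silently placed somewhere in the sequence.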

Common Problems

Hopefully, you'll find all of your problems during your disaster recovery testing rather than during a real disaster. Here are some of the most common problems that we hear about at the support center:

Not all user data backed up successfully. Make sure your backup is set up correctly to save what you need it to save. Also, verify that the backups are successful.

Not sure which data is on which tapes. Make sure you properly label your media. On your save commands, you can specify OUTPUT(*PRINT) to create a spool file of the objects saved. Another option is to DSPTAP OUTPUT(*PRINT) and specify the DATA as either *LABELS or *SAVRST, depending on the detail needed.

The best option is to use a media management program, such as BRMS, that uses a database to keep track of your media and contents for you.

Target system is incompatible or incorrect. Ensure that the target system hardware is capable of handling your OS and applications.

Target system tape drive incompatible. Ensure that the target system tape drive is compatible with your save tapes. This is extremely important because this is the drive that you'll use to restore your system.

Target system is preinstalled. One of the most common problems with a disaster recovery or migration is that the target system may already have LIC, operating system, or licensed programs installed. This will cause problems with PTF-level mismatching, missing user and licensed program data in library QSYS (which is the OS), and potential duplicate files that cause pointer and address errors.

For a disaster recovery, make sure that there is no more than LIC installed and that the disks are added to the proper auxiliary storage pools. This saves the time of scratching and adding disk units, but you should still reinstall/restore LIC from your backup media and then restore everything else from your backup media.

Problems using an alternate IPL device. Sometimes there are problems with using the target tape device as an alternate IPL device (IPLing and installing LIC directly from a backup device). If you run into this problem, it is much faster to set the device as an alternate installation device than to debug and fix the problem. The only difference is that the LIC media (I_BASE_01) must be in the optical drive so the system can IPL or "boot" from it, and then you must select the tape drive as an alternate installation device.

So make sure that you bring your original distribution media. Also, be aware that a fiber-attached tape drive cannot be used as an alternate IPL device, so it will need to be used as an alternate installation device.

Passwords incorrect or forgotten. When you're installing from tape, the passwords will be the passwords that were on the tape. If you forgot the passwords and you're unable to sign on with a user who has enough authority to change passwords, you can sign on to Dedicated Service Tools (DST) with the DST QSECOFR profile (different from i5/OS QSECOFR profile) and reset the i5/OS QSECOFR password.

However, if you've also forgotten this password, then the system must be scratch/installed with the distribution media, as a reinstall will not override security data. There is no backdoor.

Configuration objects not pointing to the correct resources. When you're restoring to a different system, it is important that you specify "Restore to a different system = Y" on the GO RESTORE option 21 parameter. If you're manually issuing the restore commands, make sure that you specify ALWOBJDIF(*ALL), and on the RSTCFG command, specify SRM(*NONE).

Duplicate/extension files created. If data already exists on the target system and you restore specifying ALWOBJDIF(*ALL), then when there are file-level differences, the system will ignore the mismatch and rename the file by adding an extension of *0001.

Make sure that the target system doesn't already have existing data. If it does, then follow the Backup and Recovery guide's chapter on synchronization.

Not all authorities restored. When you're restoring data to the target system, it is important to recover in the correct order. You must perform RSTUSRPRF *ALL before restoring the objects, and then perform RSTAUT after the objects have been restored so that all of the authorities and authorization lists are correct.

PTF level mismatch. PTFs are fixes, and without them, the system may produce unpredictable results. It is important that LIC, operating system, and licensed programs have their PTFs installed from the same media. The reason is that licensed program product PTFs can be dependent on prerequisite LIC and OS PTFs. If those PTFs do not exist, then the licensed program PTFs may be expecting certain code that is missing.

PTFs are included with the objects, so if you're restoring the entire system from tape, the PTFs will be included. If not, then you will want to reinstall the PTFs from the latest cume/group/individual sets.

Licensed programs are not installed correctly. When licensed programs are installed, part of the process is to copy certain menu, command, and user space objects into library QSYS. If the operating system (library QSYS) was installed from optical media, then it is missing these objects because the licensed programs are later restored and not "installed," so nothing is copied into QSYS.

If this happens, either reinstall the operating system from the backup media or reinstall all of the licensed programs with a GO LICPGM option 1. (Replace if current = Y.) But then you would also have to reinstall all of the PTFs again, so I suggest that you just slip (reinstall) the operating system from your backup media.

Licensed program and third-party application initialization or setup. Some licensed programs or third-party apps will require additional setup or initialization, depending on how you recovered. So make sure to contact the providers, and be sure that you document and test.

Plan for the Future

Disaster recovery planning is a necessity when it comes to protecting the future of your business. The most important part is actually the backup planning. Critical user data is the most important to your business and sometimes cannot be re-created. As long as your user data is backed up, you should be able to recover from a system disaster.

When creating a disaster recovery plan, make sure that you plan for the five "W" questions: Why, What, When, Where, and Who. Consider such things as compatibility, media retrieval time, and security. Once your disaster recovery plan is complete, make sure that you test it exactly as if a real disaster occurred!

Brian Bohner is an IBM support center rep who is an expert in disaster recoveries. You can reach him at bbohner@us.ibm.com.
