There is no more important work for IT than to ensure that our business system complex is resilient and protected. Our level of resiliency is based on many factors, including performing regular and comprehensive backups, maintaining a remote site for our people and equipment, and implementing a high-availability solution to meet our business requirements. In the area of protection, we need to maintain at least some reasonable level of security and confidentiality of our business systems and sensitive data.
In order for IT to function correctly with a view to achieving resiliency and protection of these critical business systems, there are many crucial areas that cannot be overlooked, ignored, or pushed off until we get around to it. In this article, I'll provide a short laundry list of some of the shortcomings I have found when visiting System i sites in the US and around the world.
I have often seen System i backup implementations that are not comprehensive enough to let the company recover in the event of a failure. This seems to occur most frequently when a third-party application vendor has provided an application and a System i to the company. The vendor provides application support but little in the way of system support. The vendor provides a custom backup routine that runs daily or weekly to back up the data from the main business application but does not back up other data that is required to recover.
This hit home a while ago when I visited a small banking company to perform a System i security audit. As part of the audit, I briefly examined the backup/recovery strategy. The company, like many others, has no real in-house System i expertise. The technical folks who work there are Windows administrators, and they have been trained only to set up new users and run the vendor's prescribed backups on the System i.
When I examined the "last save date" on many critical libraries and IFS objects, I was surprised -- and at the same time, not surprised. Many critical pieces of the system had not been backed up for nearly a month. If the company had experienced a RAID failure or other disaster or a simple mistake that required a restore, it would be in serious, serious trouble.
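The check I did by hand is easy to automate. The following is only a minimal sketch in Python, assuming you have already extracted each library's or IFS path's last save date by whatever means you prefer. The function name, the seven-day threshold, and the sample inventory are all hypothetical and do not come from any real system.

```python
from datetime import date, timedelta

def find_stale_objects(last_saved, today, max_age_days=7):
    """Return the names whose last save is older than max_age_days.

    An entry with no save date at all (None) is always reported.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for name, saved_on in last_saved.items():
        if saved_on is None or saved_on < cutoff:
            stale.append(name)
    return sorted(stale)

# Hypothetical inventory: object name -> last save date (None = never saved).
inventory = {
    "APPDATA": date(2024, 3, 10),      # covered by the vendor's routine
    "QUSRSYS": date(2024, 2, 12),      # nearly a month old
    "/home/interfaces": None,          # IFS directory that was never saved
    "PAYROLL": date(2024, 3, 9),
}

stale = find_stale_objects(inventory, today=date(2024, 3, 11))
print(stale)  # ['/home/interfaces', 'QUSRSYS']
```

A report like this, run after every backup, would have exposed the month-old save dates long before an auditor did.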
When we rely on an application ISV to manage our systems, we are usually not well-served in the area of backup and recovery and system security. The ISV's business is to provide the business applications that we use to run our companies and to provide support for those applications. Often, system management is outside of its core competency.
If you rely solely on an application software vendor for your backup and recovery strategy and other systems-management details, I urge you to find and contract with a local IBM Business Partner who will work with you to ensure that the systems-management functions, including backup and recovery, are intelligently covered for you. A Business Partner can also help you with the next item on the laundry list -- high availability.
Not long ago, terrible hurricanes hit the U.S. Gulf Coast. It continues to be a tough time for folks in the southeastern U.S. -- who knows when the next big one will come?
Along with individuals and families, many companies were hit extremely hard by the storms and resulting floods. I was speaking to a friend who worked in IT at one of the Gulf Coast casinos. She was absolutely thrilled that she was able to run payroll on a backup system several hundred miles away from the coast. The main IT machine room on the Gulf Coast was out of commission. Yes, payroll is a big deal when employee homes are in shambles.
Events like these are indeed terrible, but hopefully we have all learned something from the pain of those affected. Many companies have no plan for business continuity in the event of a serious outage. We need to be prepared in the event of fire, flood, storms, earthquakes, power failures, and other devastating events. Can you run YOUR payroll if the system is under six feet of water? Can your trucks still move products? Can your customers still order products off your website? When you consider the cost of a lost day (or week or month) of business, the wisdom of investing in a disaster-recovery plan with high availability becomes pretty clear.
As we consolidate our remote servers onto one big box at headquarters, we have to make sure we have a plan and the technology to recover from an outage. Having all our eggs in one basket can be a great way to reduce costs, but it brings new issues to the forefront in the realm of recoverability and business continuity.
It does not take these disastrous events to knock you down. I was talking to a friend at COMMON last fall who told me he was down for three days due to a recent RAID failure. It does happen! We need to be better prepared!
Adoption of high-availability software has grown since 9/11, and rightly so. High-availability software vendors have also developed less expensive remote journaling products, so you can choose the option that best meets your business requirements and your budget. If you have not yet invested in high-availability software, please weigh your options and make the right choice for your organization.
All of the high-availability solutions that I have seen require journaling of your database files and other objects. When you journal your files, the system writes a record to the journal each time a record is added, changed, or deleted.
Even if you do not have a high-availability solution in place, you can still journal your files and other objects. Journaling is a feature built into the operating system and requires no additional products to implement.
One major fault that I see in high-availability implementations is that the journal receivers are kept only long enough to push the journaled changes to the backup system and are deleted shortly thereafter. This eliminates two of the major benefits of journaling, namely the audit trail and database recovery.
So who updated the payroll file and increased John's salary by 50 percent? Why did we ship 25 cases of our most expensive product to a customer who isn't on our customer file, and why didn't we invoice the shipment?
These types of questions can be answered only if we are keeping our journal data online or saved to backup media. Journaling lets us view information about every change made to a database record (e.g., the user who made the change, the date and time, the workstation name where the user was signed on, the program used to make the change, and the before-and-after image of the record in question).
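To make the audit-trail idea concrete, here is a minimal Python model of it. This is a sketch under stated assumptions, not the operating system's journal format: the `JournalEntry` class, its field names, and the sample users, workstations, and programs are all invented for illustration. It only shows that once the before and after images are retained, "who raised John's salary?" becomes a simple filter.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JournalEntry:
    """Hypothetical, simplified stand-in for one journaled record change."""
    timestamp: datetime
    user: str
    workstation: str
    program: str
    file: str
    operation: str            # "ADD", "CHANGE", or "DELETE"
    before: Optional[dict]    # record image before the change
    after: Optional[dict]     # record image after the change

def changes_to_field(entries, file, field):
    """Yield CHANGE entries on `file` where `field` differs between images."""
    for e in entries:
        if e.file != file or e.operation != "CHANGE":
            continue
        if e.before[field] != e.after[field]:
            yield e

# Invented sample data: two changes to John's payroll record.
entries = [
    JournalEntry(datetime(2024, 3, 11, 9, 2), "MJONES", "QPADEV0007",
                 "EMPMAINT", "PAYMAST", "CHANGE",
                 {"emp": "JOHN", "salary": 50000, "dept": "A10"},
                 {"emp": "JOHN", "salary": 50000, "dept": "B20"}),
    JournalEntry(datetime(2024, 3, 11, 14, 47), "JSMITH", "QPADEV0004",
                 "PAYUPD", "PAYMAST", "CHANGE",
                 {"emp": "JOHN", "salary": 50000, "dept": "B20"},
                 {"emp": "JOHN", "salary": 75000, "dept": "B20"}),
]

hits = list(changes_to_field(entries, "PAYMAST", "salary"))
for e in hits:
    print(e.timestamp, e.user, e.workstation, e.program,
          e.before["salary"], "->", e.after["salary"])
```

If the receivers holding these entries have already been deleted, there is nothing left to filter, which is exactly the problem described above.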
In the current regulatory environment, this type of forensic data is often required.
We perform our weekly and daily backups faithfully. Excellent. However, let's say it's 4:00 p.m. on a very busy business day. We processed thousands of orders and wrote tons of invoices today. Payroll also ran today . . . yahoo. But Kapow! At 4:01 p.m., our system has a serious outage (e.g., RAID failure, fire, flood), and we can't recover using the usual means.
So, what do we do? We get the hardware/software problem fixed and load the operating system from our last SAVSYS (Save System) tapes. Then we load last Sunday's weekly backup. Now we load last night's backup, and then what? How do we get back all the changes that have occurred on the system during this VERY busy day? The straight answer is, "We can't!" We cannot restore all these database transactions because we did not collect them.
The way we save database transactions throughout the day is to journal the database. If we had used journaling, we could have restored our system to the exact moment of failure. Without journaling, we have a business nightmare on our hands: a complete day's worth of business transactions gone . . . poof! With journaling, we're heroes!
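The recovery sequence just described (restore the last backup, then reapply the day's journaled changes up to the moment of failure) can be modeled in a few lines of Python. This is a toy sketch, not how the operating system does it: on the system itself you would use the journal commands, such as APYJRNCHG (Apply Journaled Changes), and the tuple layout, function name, and order data here are assumptions made up for the example.

```python
from datetime import datetime

def apply_forward(restored_table, entries, up_to):
    """Replay journal entries, oldest first, onto a restored copy of a table.

    restored_table: dict of key -> record, as it stood at the last backup.
    entries: list of (timestamp, operation, key, after_image), time-ordered.
    up_to: stop replaying at the first entry later than this moment.
    """
    recovered = {k: dict(v) for k, v in restored_table.items()}
    for timestamp, operation, key, after in entries:
        if timestamp > up_to:
            break
        if operation in ("ADD", "CHANGE"):
            recovered[key] = dict(after)
        elif operation == "DELETE":
            recovered.pop(key, None)
    return recovered

# Last night's backup of a hypothetical order file.
nightly_backup = {1001: {"item": "WIDGET", "qty": 5}}

# Changes journaled during the busy day, before the 4:01 p.m. failure.
todays_journal = [
    (datetime(2024, 3, 11, 9, 15), "ADD", 1002, {"item": "GADGET", "qty": 2}),
    (datetime(2024, 3, 11, 11, 30), "CHANGE", 1001, {"item": "WIDGET", "qty": 8}),
    (datetime(2024, 3, 11, 15, 59), "ADD", 1003, {"item": "GIZMO", "qty": 1}),
]

failure = datetime(2024, 3, 11, 16, 1)
recovered = apply_forward(nightly_backup, todays_journal, up_to=failure)
print(recovered)
```

Without the journal entries, the best we could produce is `nightly_backup` itself: order 1001 with its old quantity, and no trace of orders 1002 and 1003.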
For those of us fortunate enough to have a robust HA solution, the "4:01 Kapow!" can be handled pretty nicely. However, consider another scenario...
We have a power user using an SQL interface to access the GL chart of accounts. The power user accidentally runs an erroneous SQL command, wiping out all the records from the chart of accounts file. With high-availability software in place, the act of wiping out all the records is synchronized on the backup server, and now both systems have an empty chart of accounts.
If the journal data is still around and has not yet been deleted by the high-availability software, the journaled changes made by the power user can be automatically reversed out, thereby putting the chart of accounts file back to its pre-disaster state. Even the backup box will then be synched to the correct data.
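Reversing changes works the other way around: walk the journal backward and put each before-image back. The Python sketch below models that idea only. The entry layout, the `remove_changes` function, the user name, and the account data are hypothetical; on the system you would rely on the operating system's own command for this, RMVJRNCHG (Remove Journaled Changes), rather than anything hand-written.

```python
def remove_changes(table, entries, should_undo):
    """Undo selected journal entries, newest first, using before-images.

    table: dict of key -> record in its current (damaged) state.
    entries: time-ordered list of dicts with "user", "op", "key", "before".
    should_undo: predicate that picks the entries to back out.
    """
    restored = {k: dict(v) for k, v in table.items()}
    for entry in reversed(entries):
        if not should_undo(entry):
            continue
        if entry["op"] == "ADD":
            # The record did not exist before the change, so remove it.
            restored.pop(entry["key"], None)
        else:
            # CHANGE or DELETE: put the before-image back.
            restored[entry["key"]] = dict(entry["before"])
    return restored

# Hypothetical chart of accounts before the accident.
original = {
    "1000": {"name": "Cash"},
    "2000": {"name": "Accounts Payable"},
    "4000": {"name": "Sales Revenue"},
}

# The erroneous SQL statement deleted every record; each delete was journaled.
journal = [
    {"user": "PWRUSER", "op": "DELETE", "key": key, "before": record}
    for key, record in original.items()
]
damaged = {}  # what both the production and backup systems now hold

restored = remove_changes(damaged, journal,
                          should_undo=lambda e: e["user"] == "PWRUSER")
print(restored == original)  # True
```

One caution about the design: backing out only one user's entries is safe in this example because nobody else touched the file afterward. If other users changed the same records later, a selective reversal would overwrite their work, so the range of entries to remove has to be chosen with care.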
When protecting the privacy of our customers, vendors, and employees, we often build strong safeguards into our security system. We may implement library and object security and also use network exit point software to keep users from viewing or downloading sensitive data to their PCs. We often set our security policy very rigidly to keep folks from accessing our production files, often not even providing a log-in account to a production system for IT users.
However, we often also have Test and QA systems or environments that are hardly protected at all. With HIPAA, PCI, and the privacy laws in effect, we still manage to convince ourselves that using raw production data in our test environment is okay. It's just so quick and easy to do a CPYLIB (Copy Library) from production to test.
So you want to know what terrible disease someone has? Just look at the test data! Want to know someone's Social Security number, date of birth, and spouse's name? Just look at the test data!
I have seen cases in which even bank account numbers and PIN codes are left unaltered in test data files. Leaving those, along with medical diagnoses and blood test results, sitting unprotected in test data files is unconscionable and possibly illegal.
Recently, I was teaching an i5/OS security class in one of our southern states. It was open for public registration, so there were students there from many companies. I was discussing the importance of keeping sensitive data out of the wrong hands. The example I was using dealt with payroll information and keeping it secure from prying eyes. I noticed two gentlemen at the back of the room looking puzzled at each other. I asked if there was a question. The response from one was the serious question, "How are we supposed to know what to ask for on our next review if we don't know what others are making?"
These two gentlemen were sent by their company to a security class so that they could manage security for their systems. We obviously need to protect our sensitive data better by either implementing field-level encryption or at least writing some scripts to generate test data while scrambling sensitive information.
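As one illustration of what such a scrambling script might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the field names, the sample row, the choice to zero out salaries, and the use of a keyed hash (HMAC-SHA-256) so that the same production value always maps to the same fake value, which keeps test records that share a key consistent with one another. It is a starting point, not a vetted anonymization scheme.

```python
import hashlib
import hmac

# Hypothetical secret; keep the real one off the test system.
SECRET = b"replace-with-a-key-kept-off-the-test-box"

def _digits(value, length, purpose):
    """Derive a repeatable string of `length` digits from a sensitive value."""
    message = f"{purpose}:{value}".encode()
    mac = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return str(int(mac, 16) % (10 ** length)).zfill(length)

def scramble_employee(row):
    """Return a copy of an employee row that is safer to load into test."""
    masked = dict(row)
    ssn = _digits(row["ssn"], 9, "ssn")
    masked["ssn"] = f"{ssn[:3]}-{ssn[3:5]}-{ssn[5:]}"
    masked["name"] = "EMPLOYEE " + _digits(row["name"], 6, "name")
    masked["spouse_name"] = "SPOUSE " + _digits(row["spouse_name"], 6, "spouse")
    masked["salary"] = 0  # simplest choice; a realistic random value also works
    return masked

# Invented production row.
production_row = {
    "employee_id": 4711,
    "name": "John Smith",
    "ssn": "123-45-6789",
    "spouse_name": "Jane Smith",
    "salary": 50000,
}

test_row = scramble_employee(production_row)
print(test_row)
```

Non-sensitive keys such as `employee_id` pass through untouched so that joins in the test environment still work. Note that a generated number could coincide with a real person's Social Security number purely by chance, so scrambled data still deserves sensible access controls.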
Resolving some of the issues I have discussed here may require a significant allocation of resources in time, personnel, and cash. If you are not the decision maker at your organization and cannot make these resource commitments, I have a few tips for you.
Prepare a basic document that defines the problem (maybe even include parts of this article) and your recommendation on how to best tackle it. If you feel you should, you might also want to provide some kind of ballpark estimate of associated time and other costs and the potential savings. Deliver the document to your boss and make your best case verbally.
Now you can relax. You did what you could. It's not your problem anymore. It now belongs to your boss.