
Avoiding High Availability Pitfalls in the Real World

To help you see the importance of high availability and disaster recovery, the following are some real-world scenarios in which SAP enterprise customers failed to give disaster recoverability the attention we have given it in this chapter:

  • End-to-end critical points of potential failure— This UNIX/Oracle customer spent much time working on database and SAP application layer–specific SPOFs but missed basic infrastructure SPOFs. A bad network switch and a less-than-optimally configured NIC (set to autosense rather than hard-coded for 100Mb Ethernet) caused intermittent failures for literally months. Had the customer practiced availability through redundancy and followed NIC configuration best practices, it could easily have eliminated most of its resulting downtime. Had it failed the system over to its DR site and conducted end-to-end load testing on the primary system, it might have quickly found the problem as well.
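The autosense-versus-hard-coded mismatch above is still a classic failure mode. As a sketch only (it assumes a Linux host with the standard ethtool utility, and eth0 is a placeholder interface name; the customer's UNIX platform would use its own equivalent), pinning the link settings looks like this:

```shell
# Inspect the current (possibly autonegotiated) link settings.
ethtool eth0

# Pin the NIC to 100 Mb full duplex instead of autosense.
# The switch port must be hard-coded to match, or the duplex
# mismatch will produce exactly the kind of intermittent
# failures described above.
ethtool -s eth0 speed 100 duplex full autoneg off

# Watch the per-NIC counters for errors that indicate a
# lingering mismatch.
ethtool -S eth0 | grep -i error
```

The key point is that both ends of the link must agree: hard-coding only one side is itself a mismatch.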

  • Compressing the promote-to-production process— In an effort to “get a change in fast,” this customer did what many of us have done before and compressed the promote-to-production process. Instead of keeping a change in its technical sandbox for a few weeks, and then promoting it to development, test, training, and finally production (which normally would consume another six weeks), the customer pushed the change through in less than a week. We were happy to hear that the customer at least went through the process, instead of ignoring it altogether. In the end, though, a memory leak issue that only manifested itself over a period of time (certainly greater than a week) caused the customer substantial and recurring unplanned downtime.
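The compressed timeline above is straightforward to guard against with a simple promotion gate. Here is a minimal Python sketch; the stage names and minimum soak periods are entirely hypothetical illustrations, not anything drawn from SAP's transport tooling:

```python
from datetime import date, timedelta

# Hypothetical minimum soak time per landscape stage, in days.
MIN_SOAK_DAYS = {"sandbox": 14, "development": 14, "test": 14, "training": 7}

def ready_to_promote(stage: str, entered_on: date, today: date) -> bool:
    """A change may leave a stage only after its minimum soak period.

    Slow-burn defects, like the memory leak above that took more
    than a week to surface, are exactly what the soak period
    exists to catch.
    """
    return today - entered_on >= timedelta(days=MIN_SOAK_DAYS[stage])
```

A change rushed through in under a week would fail this gate at the very first stage: `ready_to_promote("sandbox", date(2024, 3, 1), date(2024, 3, 6))` returns `False`.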

  • Poor documentation— We provided a customer several viable methods for maintaining its SAP DR processes and procedures. We covered the pros and cons of using websites, file shares, SAP Enterprise Portal, Oracle Portal, Microsoft SharePoint Portal Server, and even a simple Excel Workbook approach to publicly house its documentation. In the end, the customer never actually made a decision one way or the other. And when it suffered an unplanned outage one summer, not only did it have little consistent formal documentation to help it, but the fact that the documentation was spread out all over the place without the benefit of centralization and version control only exacerbated an already tense issue. The customer was fortunate, though; at least it could get to what it had. Imagine what would have happened if access to its various file shares and individual desktops had been compromised.

  • Change management shortcomings— There are so many “keys” to sound change management that we can only recommend reading the change management chapters (Chapter 27, “Functional Configuration Change Control,” and Chapter 34, “Technical Change Control”) in their entirety. But the real gotchas probably boil down to ignoring change management, poorly documenting and therefore poorly implementing change processes, ignoring the change control process, and overly compressing the promote-to-production process. With regard to this first point, we received a call from a frantic outsourcing partner at 6:30 one morning. It was in the middle of doing an upgrade but never actually tested the firmware updates associated with updating its disk subsystem to new technology. Four hours later, the partner was back in business, but it far exceeded its SLA for unplanned quarterly downtime that day.

  • Employee limitations— Here’s an example of a problem that could easily have been avoided. One of our small SAP-on-Windows/Oracle customers hired a new SAP operator and junior Basis specialist. Familiar with Oracle and SAP, he apparently had no real understanding of hardware, including basic RAID 5 limitations. One day he noticed a disk drive glowing amber instead of the usual green, and he soon determined that he had lost a drive on his database. Rather than replacing it, he sat around for weeks looking at the failed drive, comfortable in his ignorant assumption that he could lose plenty more drives before he really needed to worry. He had no idea that the particular RAID group to which the drive belonged was set up for single sparing, and that the single spare had already kicked in as expected and was doing its job of covering for the failed drive. Meanwhile, a couple of hot-pluggable replacement drives were sitting in a data center cabinet a few feet away. Our junior friend learned about his system’s limitations the hard way—as bad luck would have it, he lost another drive in the same RAID group (followed by his job later that year).
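The sparing confusion above can be made concrete with a toy model. This Python sketch is not tied to any particular disk subsystem; it simply encodes the rule that a RAID 5 group survives one failed member at a time, and that a hot spare, once consumed, protects nothing further until the failed drive is replaced (rebuild time is ignored for simplicity):

```python
def raid5_group_state(hot_spares: int, failures: int,
                      drives_replaced: int = 0) -> str:
    """Toy model of a RAID 5 group with hot spares.

    Each failure consumes a spare if one is available (the rebuild
    restores full redundancy). With no spare left, the group runs
    degraded on parity alone, and one more failure loses data.
    """
    spares = hot_spares + drives_replaced
    degraded = False
    for _ in range(failures):
        if spares > 0:
            spares -= 1      # spare kicks in; redundancy restored by rebuild
        elif not degraded:
            degraded = True  # nothing left beyond parity protection
        else:
            return "data loss"
    return "degraded" if degraded else "healthy"
```

In the scenario above, the operator was effectively at `raid5_group_state(1, 2)`, already degraded, while `raid5_group_state(1, 2, drives_replaced=1)` — a healthy group — was one hot-pluggable drive swap away.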

  • Eliminating documentation support— Another customer laid off two operators, and no specific person was ever held accountable again for maintaining SAP operations and monitoring tasks, including related “how-to” process documentation. As it turned out, no one actually monitored SAP on third shift for months, and a problem eventually cropped up. But with no knowledge of how to identify the issue, much less how to troubleshoot and resolve it, half of the SAP TSO was awakened that night. Imagine what would have happened if the system had needed to be failed over to its DR site; with a lack of knowledge and lack of documentation processes, the failover and system restart could easily have proved catastrophic.

  • No training— In a similar case, a company’s HA failover mechanism for SAP ERP worked nicely, but the people who were formally responsible for failing the system back to the original site were reorganized. Many individuals in this reorganized team were new to the team and had been neither trained nor apprised of where to obtain the system documentation. Think about it—no one still on the team had ever actually gone through the procedure for moving the production system back to the original site and pointing end users to it again. The company got lucky and was able to contact one of its former colleagues who had moved into a new role in the company but still remembered the process (and had the documentation, too). Fortunately, the company learned its lesson before a real disaster occurred.

  • No stress testing performed— This customer opted to incur the risk associated with forgoing stress testing, which included a provision for testing basic failover capabilities while a load was on the system (at a total cost of $45,000–$60,000). After the customer went live, it found the usual programming issues early on, and discovered some other easily resolved performance problems related to the number of background processes deployed. More importantly, though, SPOFs existed in that the customer deployed only a single server to run all background work processes. The SAP EarlyWatch Service had caught this, and the customer had fixed it weeks before go-live. But an enterprising junior Basis administrator took it upon himself to change it back without understanding the consequences. A stress test would have caught this sooner, both from a performance perspective and in the end-to-end system review that we perform as a prerequisite to testing. And everyone would have understood the consequences.

  • People-related SPOF— Only one person in the entire technical organization at this company, the DBA, knew how to set up and configure the disk subsystem for SAP. When he left the company, though, this fact was not fully understood until the system suffered a failure and many people tried unsuccessfully to access the password-protected system. After those attempts failed, the company blew away the system and reconfigured the disk subsystem, incorrectly. Lots of people were called that weekend, and it was decided that declaring a disaster would be too expensive and cumbersome versus simply working through the issues one at a time. Bottom line: Availability through redundancy does not apply only to hardware or process-related SPOFs. It also applies to people.

  • Poor communication— The SAP IT staff went to a lot of trouble to script a really nice failover/failback routine for its production SAP cluster, and it even set up an alternative access method for getting into SAP R/3. But it failed to share how to access the system with the system’s end users. This lack of communication rose to the surface as the team ran into a variety of issues during its quarterly mini-DR test, including issues with using the SAP WebGUI versus the classic fat-client SAPGUI. In failover, end users could no longer connect to the front-ending SAP system and had to fall back to logging in to R/3 directly using the fat-client SAPGUI.

  • Making promises your system can’t keep— A very large customer of ours decided after it had already sized for and purchased its SAP storage solution that it needed an SAP DR system installed 500 miles away from the primary site. And it needed to meet some really aggressive failover times. Of course, by this time in the project the customer was already live—on a cost-effective SAP-on-Windows solution with which the local IT team was comfortable. The DR solution it actually needed, however, was only supported at that time in a UNIX/Oracle environment and further required a specific model of disk subsystem running a specific version of disk controller firmware. Rather than scrapping the current infrastructure, the IT organization was forced to renegotiate its customer SLAs, able to promise less availability than initially requested. The IT team was lucky that the business was so flexible.

