Reliability (as defined in this book) is about avoiding system problems, making core processes more fault-tolerant, or, when problems do occur, having capabilities in place so symptoms are spotted early and corrective actions are taken before downstream effects become too severe. The system’s reliability is ultimately measured by the end user’s experience, based on his or her exposure to unexpected interruptions, failures, and general instability. In extreme cases, reliability problems can lead to availability issues, as failures take down whole components, although we’ll focus specifically on that later. In summary, some practical examples of reliability management tasks include the following:
Proactive avoidance of known or potential problems.