Chapter 5 Detection Patterns

12. Fault Correlation

... An error or a failure has been detected. There are a number of potential causes for it. In order to be more fault tolerant, the system has been designed to fail silently when it fails. As a result, all that is known when the failure is first detected is that it has failed. In order to process an error or failure, the system needs to identify the error and what fault caused the error.

What fault is activating?

Errors can be caused by any of several faults.

Particular errors keep occurring. The system has been Riding Over Transients (26), but the system needs to make certain that the current error is the one that it is interested in ignoring. If it is a different error, then the system should initiate error processing.

What has the error done? Has execution stopped? If so, what capabilities are no longer available? What was the size of the stack at the time of the error? Were logs collected? What data is incorrect? Is it frequently changing data or is it a constant?

The system should correct the actual fault that has caused the problem. Too large a recovery will impact too much of the system. Identification of the fault allows targeted recovery actions to be taken. [CBF 04]

Fault tolerance is about handling the unanticipated and undetectable errors that occur during execution. But as faults are being removed from the system during design and test, different common errors will have been uncovered, isolated, and corrected. The clues to error types, or their 'signatures', learned during these activities help you know what kinds of errors are likely to occur during normal execution. For example, if a large and complex system has a large number of off-by-one errors uncovered during test, the error signature found is that loops are traversed one too