If you found out you had to power off an entire data center, do a lot of maintenance, then bring it all back up, would you know how to manage the event? Some companies are lucky enough to be able to do this every quarter or once a year. SAs delay tasks that require interruption of service, such as hardware upgrades, parts replacement, or network changes, until this window. Sometimes a weekly timeslot is allocated for major and risky changes, consolidating downtime into a specific time when customers will be least affected. Other times we are forced to do this because of physical maintenance such as construction, power or cooling upgrades, or office moves, or for emergency reasons, such as a failing cooling system. This chapter describes a technique for managing such major planned outages. Along the way, we offer tips that are also useful in less dramatic settings. Projects like this require more planning, more orderly execution, and considerably more testing than routine changes. We call this approach the flight director technique, named after the role of the flight director in NASA space launches.[1]
[1] This chapter’s techniques and terminology originated with Paul Evans, an avid observer of the space program. The first flight directors wore a vest, like the one worn by the flight director in Apollo 13. The terminology helped everyone remember that the role of the SA in the vest was different from normal.
Although most people clean their houses or apartments on a weekly or monthly basis, an annual spring cleaning is certainly useful. Similarly, networks sometimes need massive, disruptive cleaning. Cooling systems must be powered off, drained, cleaned, and refilled. Messy nests of wires become impediments to working effectively and sometimes must be tidied. Large volumes of data must be moved between file servers to optimize performance for users or simply to provide room for growth. Improvements that involve many changes can be done much more efficiently if all users agree to a large window of downtime. The flight director technique guides the activities before the window, during execution, and after execution (see Table 20.1).