Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.
216 CHAPTER 9 Protocol Processing If the expected case fails, the implementation can revert to the standard BSD processing. For example, Chandranmenon and Varghese [CV98a], which describe this expected-case optimization in which the code keeps two lists and directly reuses the existing BSD code (which is hard to get right!) when the expected case fails. The expected case is reported by Chandranmenon and Varghese [CV98a] as taking 38 SPARC instructions, which is comparable with Jacobson's TCP estimates. As with header prediction, it is worth applying Caveat Q8 and examining the sensitivity of this optimization of this implementation to the assumptions. Actually, it turns out to be pretty bad. This is because measurements indicate that many recent implementations, including Linux, have senders send out fragments in reverse order! Thus fragments arrive in reverse order 9% of the time [SMC01]. This seemingly eccentric behavior is justified by the fact that it is only the last fragment that carries the length of the entire packet; by sending it first the sender allows the receiver to know what length buffer to allocate after the first fragment is received, assuming the fragments arrive in FIFO order. Note that the FIFO assumption still holds true. However, Figure 9.10 has a concealed but subtle additional assumption: that fragments will be sent in offset order. Before reading further, think how you might modify the implementation of Figure 9.10 to handle this case. The solution, of course, is to use the first fragment to decide which of two expected cases to optimize for. If the first fragment is the first fragment (offset 0), then the implementation uses the mode described in Figure 9.10. If the first fragment is the last (last bit set), the implementation jumps to a different state, where it expects fragments in reverse order. This is just the dual of Figure 9.10, where the next fragment should have its last byte number to be 1 less (as opposed to 1 more) than the start offset of the previous fragment. Similarly, the next fragment is expected to be pasted at the start of the list and not the end. 9.5 CONCLUSIONS This chapter describes techniques for efficient buffer allocation, CRC and checksum calcula- tion, protocol processing such as TCP, and finally reassembly. For buffer allocation, techniques such as the use of segregated pools and batch allocation promise fast allocation with potential trade-offs: the lack of storage efficiency (for segregated pools) versus the difficulty of coalescing noncontiguous holes (for batch allocation). Buffer sharing is important to use memory efficiently and can be done by efficiently stealing buffers from large users or by using dynamic thresholds. For CRC calculation, efficient multibit remainder calculation finesses the obvious waste (P1) of calculating CRCs one bit at a time, even using LFSR implementations. For checksum calculation, the main trick is to fit the computation to the underlying machine architecture, using large word lengths, lazy checks for carries, and even parallelism. The optimizations for TCP, UDP, and reassembly are all based on optimizing simple expected cases (e.g., FIFO receipt, no errors) that cut through a welter of corner cases that the protocol must check for but rarely occur. Figure 9.1 presents a summary of the techniques used in this chapter together with the major principles involved. Beyond the specific techniques, there are some general lessons to be gleaned. First, when considering the buffer-stealing algorithm, it is tempting to believe that finding the user with