
CHAPTER 16 Measuring Network Traffic

The idea behind probabilistic counting is to compute a metric of how uncommon a certain pattern within a flow ID is, and to keep track of the degree of "uncommonness" across all packets seen. If the algorithm sees very uncommon patterns, it concludes that it saw a large number of flows.

More precisely, for each packet seen, the algorithm computes a hash function on the flow ID and counts the number of consecutive zeroes starting from the least significant position of the hash result; this count is the measure of uncommonness. The algorithm keeps track of X, the largest number of consecutive zeroes (starting from the least significant position) seen in the hashed flow ID values of all packets so far. At the end of the interval, the algorithm converts X, the largest number of trailing zeroes seen, into the estimate 2^X for the number of flows. Intuitively, if the stream contains two distinct flows, on average one flow will have the least significant bit of its hashed value equal to zero; if the stream contains eight flows, on average one flow will have the last three bits of its hashed value equal to zero; and so on.

Note that hashing is essential for two reasons. First, implementing the algorithm directly on the sequence of flow IDs itself would make it susceptible to flow ID assignments in which the traffic stream contains a flow ID F with many trailing zeroes. If F is in the traffic stream, then even if the stream has only a few flows, the algorithm without hashing will wrongly report a large number of flows. Notice that adding multiple copies of the same flow ID to the stream will not change the algorithm's final result, because all copies hash to the same value. Second, accuracy can be boosted by using multiple independent hash functions; the basic scheme with one hash function can guarantee at most 50% accuracy.
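As a concrete illustration, the single-hash scheme can be sketched as follows. This is a minimal sketch, not the book's implementation; the function names and the choice of a 32-bit slice of SHA-256 as the hash are our own assumptions.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch


def trailing_zeros(x: int) -> int:
    """Count consecutive zero bits starting from the least significant bit."""
    if x == 0:
        return HASH_BITS  # every bit is zero; cap at the hash width
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n


def estimate_flow_count(flow_ids):
    """Single-hash probabilistic counting: return 2^X, where X is the
    largest number of trailing zeroes seen in any hashed flow ID."""
    X = 0
    for fid in flow_ids:
        # Hash the flow ID; take 32 bits of SHA-256 as a stand-in hash.
        h = int.from_bytes(hashlib.sha256(fid.encode()).digest()[:4], "big")
        X = max(X, trailing_zeros(h))
    return 2 ** X
```

Because every copy of a flow ID hashes to the same value, feeding the same packet stream with duplicated packets leaves the estimate unchanged, exactly as the text observes.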
By using N independent hash functions in parallel to compute N separate estimates of X, probabilistic counting greatly reduces the error of its final estimate: it keeps the average value of X (as a floating-point number, not an integer) and then computes 2^X. Better algorithms for networking purposes are described in Estan et al. [EVF02]. The bottom line is that a chip can approximately count the number of flows with small error but with much less memory than is required to track all flows. The computation of each hash function can be done in parallel. Flow counting can be seen as an application of Principle P3b, trading accuracy in the estimate for low storage and time.

16.6 REDUCING PROCESSING USING SAMPLED NETFLOW

So far we have restricted ourselves to packet counting. However, several applications may require packet logs, which analysts can mine retrospectively for patterns and attacks. In networking, there are general-purpose traffic measurement systems, such as Cisco's NetFlow [Net], that report per-flow records for very fine-grained flows, where a flow is identified by a TCP or UDP connection. Unfortunately, the large amount of memory needed to store packet logs requires the use of DRAM, and writing to DRAM on every packet arrival is infeasible at high speeds, just as it was for counter management. Basic NetFlow has two problems.

1. Processing Overhead: Updating the DRAM slows down the forwarding rate.