Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.
Many STL programmers, when faced with the need for a data structure offering fast lookups, immediately think of the standard associative containers, set, multiset, map, and multimap. That’s fine, as far as it goes, but it doesn’t go far enough. If lookup speed is really important, it’s almost certainly worthwhile to consider the nonstandard hashed containers as well (see Item 25). With suitable hashing functions, hashed containers can be expected to offer constant-time lookups. (With poorly chosen hashing functions or with table sizes that are too small, the performance of hash table lookups may degrade significantly, but this is relatively uncommon in practice.) For many applications, the expected constant-time lookups of hashed containers are preferable to the guaranteed logarithmic-time lookups that are the hallmark of set, map and their multi companions.
Even if logarithmic-time lookup is what you want, the standard associative containers may not be your best bet. Counterintuitively, it’s not uncommon for the standard associative containers to offer performance that is inferior to that of the lowly vector. If you want to make effective use of the STL, you need to understand when and how a vector can offer faster lookups than a standard associative container.
The standard associative containers are typically implemented as balanced binary search trees. A balanced binary search tree is a data structure that is optimized for a mixed combination of insertions, erasures, and lookups. That is, it’s designed for applications that do some insertions, then some lookups, then maybe some more insertions, then perhaps some erasures, then a few more lookups, then more insertions or erasures, then more lookups, etc. The key characteristic of this sequence of events is that the insertions, erasures, and lookups are all mixed up. In general, there’s no way to predict what the next operation on the tree will be.
Many applications use their data structures in a less chaotic manner. Their use of data structures fall into three distinct phases, which can be summarized like this:
1. | Setup. Create a new data structure by inserting lots of elements into it. During this phase, almost all operations are insertions and erasures. Lookups are rare or nonexistent. |
2. | Lookup. Consult the data structure to find specific pieces of information. During this phase, almost all operations are lookups. Insertions and erasures are rare or nonexistent. There are so many lookups, the performance of this phase makes the performance of the other phases incidental. |
3. | Reorganize. Modify the contents of the data structure, perhaps by erasing all the current data and inserting new data in its place. Behaviorally, this phase is equivalent to phase 1. Once this phase is completed, the application returns to phase 2. |
For applications that use their data structures in this way, a vector is likely to offer better performance (in both time and space) than an associative container. But not just any vector will do. It has to be a sorted vector, because only sorted containers work correctly with the lookup algorithms binary_search, lower_bound, equal_range, etc. (see Item 34). But why should a binary search through a (sorted) vector offer better performance than a binary search through a binary search tree? Because some things are trite but true, and one of them is that size matters. Others are less trite but no less true, and one of those is that locality of reference matters, too.
Consider first the size issue. Suppose we need a container to hold Widget objects, and, because lookup speed is important to us, we are considering both an associative container of Widgets and a sorted vector<Widget>. If we choose an associative container, we’ll almost certainly be using a balanced binary tree. Such a tree would be made up of tree nodes, each holding not only a Widget, but also a pointer to the node’s left child, a pointer to its right child, and (typically) a pointer to its parent. That means that the space overhead for storing a Widget in an associative container would be at least three pointers.
In contrast, there is no overhead when we store a Widget in a vector; we simply store a Widget. The vector itself has overhead, of course, and there may be empty (reserved) space at the end of the vector (see Item 14), but the per-vector overhead is typically insignificant (usually three machine words, e.g., three pointers or two pointers and an int), and the empty space at the end can be lopped off via “the swap trick” if necessary (see Item 17). Even if the extra space is not eliminated, it’s unimportant for the analysis below, because that memory won’t be referenced when doing a lookup.
Assuming our data structures are big enough, they’ll be split across multiple memory pages, but the vector will require fewer pages than the associative container. That’s because the vector requires no per-Widget overhead, while the associative container exacts three pointers per Widget. To see why this is important, suppose you’re working on a system where a Widget is 12 bytes in size, pointers are 4 bytes, and a memory page holds 4096 (4K) bytes. Ignoring the per-container overhead, you can fit 341 Widgets on a page when they are stored in a vector, but you can fit at most 170 when they are stored in an associative container. You’ll thus use about twice as much memory for the associative container as you would for the vector. If you’re working in an environment where virtual memory is available, it’s easy to see how that can translate into a lot more page faults, therefore a system that is significantly slower for large sets of data.
I’m actually being optimistic about the associative containers here, because I’m assuming that the nodes in the binary trees are clustered together on a relatively small set of memory pages. Most STL implementations use custom memory managers (implemented on top of the containers’ allocators — see Items 10 and 11) to achieve such clustering, but if your STL implementation fails to take steps to improve locality of reference among tree nodes, the nodes could end up scattered all over your address space. That would lead to even more page faults. Even with the customary clustering memory managers, associative containers tend to have more problems with page faults, because, unlike contiguous-memory containers such as vector, node-based containers find it more difficult to guarantee that container elements that are close to one another in a container’s traversal order are also close to one another in physical memory. Yet this is precisely the kind of memory organization that minimizes page faults when performing a binary search.
Bottom line: storing data in a sorted vector is likely to consume less memory than storing the same data in a standard associative container, and searching a sorted vector via binary search is likely to be faster than searching a standard associative container when page faults are taken into account.
Of course, the big drawback of a sorted vector is that it must remain sorted! When a new element is inserted, everything beyond the new element must be moved up by one. That’s as expensive as it sounds, and it gets even more expensive if the vector has to reallocate its underlying memory (see Item 14), because then all the elements in the vector typically have to be copied. Similarly, if an element is removed from the vector, all the elements beyond it must be moved down. Insertions and erasures are expensive for vectors, but they’re cheap for associative containers. That’s why it makes sense to consider using a sorted vector instead of an associative container only when you know that your data structure is used in such a way that lookups are almost never mixed with insertions and erasures.
This Item has featured a lot of text, but it’s been woefully short on examples, so let’s take a look at a code skeleton for using a sorted vector instead of a set:
vector<Widget> vw; // alternative to set<Widget>
... // Setup phase: lots of insertions,
// few lookups
sort(vw.begin(), vw.end()); // end of Setup phase. (When
// simulating a multiset, you might
// prefer stable_sort instead see
// Item 31.) When emulating a set,
// be sure vw holds no duplicates!
Widget w; // object for value to look up
... // start Lookup phase
if (binary_search(vw.begin(), vw.end(), w)) ... // lookup via binary_search
vector<Widget>::iterator i =
lower_bound(vw.begin(), vw.end(), w); // lookup via lower_bound;
if (i != vw.end() && !(w < *i)) ... // see Item 19 for an explana-
// tion of the "!(w < *i)" test
pair<vector<Widget>::iterator,
vector<Widget>::iterator> range =
equal_range(vw.begin(), vw.end(), w); // lookup via equal_range
if (range.first != range.second) ...
... // end Lookup phase, start
// Reorganize phase
sort(vw.begin(), vw.end()); // begin new Lookup phase...
As you can see, it’s all pretty straightforward. The hardest thing about it is deciding among the search algorithms (e.g., binary_search, lower_bound, etc.), and Item 45 helps you do that.
Things get a bit more interesting when you decide to replace a map or multimap with a vector, because the vector must hold pair objects. After all, that’s what map and multimap hold. Recall, however, that if you declare an object of type map<K, V> (or its multimap equivalent), the type of elements stored in the map is pair<const K, V>. To emulate a map or multimap using a vector, you must omit the const, because when you sort the vector, the values of its elements will get moved around via assignment, and that means that both components of the pair must be assignable. When using a vector to emulate a map<K, V>, then, the type of the data stored in the vector will be pair<K, V>, not pair<const K, V>.
maps and multimaps keep their elements in sorted order, but they look only at the key part of the element (the first component of the pair) for sorting purposes, and you must do the same when sorting a vector. You’ll need to write a custom comparison function for your pairs, because pair’s operator< looks at both components of the pair.
Interestingly, you’ll need a second comparison function for performing lookups. The comparison function you’ll use for sorting will take two pair objects, but lookups are performed given only a key value. The comparison function for lookups, then, must take an object of the key type (the value being searched for) and a pair (one of the pairs stored in the vector)—two different types. As an additional twist, you can’t know whether the key value or the pair will be passed as the first argument, so you really need two comparison functions for lookups: one where the key value is passed first and one where the pair is passed first.
Here’s an example of how to put all the pieces together:
typedef pair<string, int> Data; // type held in the "map"
// in this example
class DataCompare { // class for comparison
public: // functions
bool operator()(const Data& lhs, // comparison func
const Data& rhs) const // for sorting
{
return keyLess(lhs.first, rhs.first); // keyLess is below
}
bool operator()(const Data& lhs, // comparison func
const Data::first_type& k) const // for lookups
{ // (form 1)
return keyLess(lhs.first, k);
}
bool operator()(const Data::first_type& k, // comparison func
const Data& rhs) const // for lookups
{ // (form 2)
return keyLess(k, rhs.first);
}
private:
bool keyLess(const Data::first_type& k1, // the "real"
const Data::first_type& k2) const // comparison
{ // function
return k1 < k2;
}
};
In this example, we assume that our sorted vector will be emulating a map<string, int>. The code is pretty much a literal translation of the discussion above, except for the presence of the member function keyLess. That function exists to ensure consistency among the various operator() functions. Each such function simply compares two key values, so rather than duplicate the logic, we put the test inside keyLess and have the operator() functions return whatever keyLess does. This admirable act of software engineering enhances the maintainability of DataCompare, but there is a minor drawback. Its provision for operator() functions with different parameter types renders the resulting function objects unadaptable (see Item 40). Oh well.
Using a sorted vector as a map is essentially the same as using it as a set. The only major difference is the need to use DataCompare objects as comparison functions:
vector<Data> vd; // alternative to
// map<string, int>
... // Setup phase: lots of
// insertions, few lookups
sort(vd.begin(), vd.end(), DataCompare()); // end of Setup phase. Simu-
// lated maps must avoid
// dupes; simulated multimaps
// might use stable_sort
string s; // object for value to look up
... // start Lookup phase
if (binary_search(vd.begin(), vd.end(), s,
DataCompare())) ... // lookup via binary_search
vector<Data>::iterator i =
lower_bound(vd.begin(), vd.end(), s,
DataCompare()); // lookup via lower_bound;
if (i != vd.end() && !DataCompare()(s, *i))... // again, see Item 19 for info
// on the
// "!DataCompare()(s, *i)" test
pair<vector<Data>::iterator,
vector<Data>::iterator> range =
equal_range(vd.begin(), vd.end(), s,
DataCompare()); // lookup via equal_range
if (range.first != range.second) ...
... // end Lookup phase, start
// Reorganize phase
sort(vd.begin(), vd.end(), DataCompare()); // begin new Lookup phase...
As you can see, once you’ve written DataCompare, things pretty much fall into place. And once in place, they’ll often run faster and use less memory than the corresponding design using a real map as long as your program uses the data structure in the phased manner described on page 101. If your program doesn’t operate on the data structure in a phased manner, use of a sorted vector instead of a standard associative container will almost certainly end up wasting time.