This section dives into some of the more technical details of
how BigramCollocationFinder and the
Jaccard scoring function from Example 7-9 work. If this is your first reading
of the chapter or you’re not interested in these details, feel free
to skip this section and come back to it later.
A common data structure used to compute metrics related to bigrams is the contingency table. A contingency table compactly expresses how often each combination of the two terms of a bigram appears or fails to appear. Take a look at the bold entries in Table 7-5, where token1 indicates that token1 is present in the bigram, and ~token1 indicates that token1 is absent from the bigram.
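To make the idea concrete, here is a minimal sketch in plain Python that counts the four cells of a contingency table for one bigram. It is an illustration only, not how BigramCollocationFinder is implemented. The function name bigram_contingency, the sample sentence, and the token2 and ~token2 labels (assumed by analogy with token1 and ~token1) are all hypothetical.

```python
from collections import Counter

def bigram_contingency(tokens, token1, token2):
    """Count the four cells of a 2x2 contingency table for (token1, token2).

    Hypothetical helper for illustration; the cell labels are assumptions.
    """
    # Count every adjacent pair of tokens in the sequence.
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())

    # Both terms present, in order.
    both = bigrams[(token1, token2)]
    # token1 in the first slot, something other than token2 in the second.
    first_only = sum(c for (a, b), c in bigrams.items()
                     if a == token1 and b != token2)
    # token2 in the second slot, something other than token1 in the first.
    second_only = sum(c for (a, b), c in bigrams.items()
                      if a != token1 and b == token2)
    # Every remaining bigram contains neither term in its slot.
    neither = total - both - first_only - second_only

    return {
        ("token1", "token2"): both,
        ("token1", "~token2"): first_only,
        ("~token1", "token2"): second_only,
        ("~token1", "~token2"): neither,
    }

# Made-up sample text, split on whitespace.
tokens = "mr green met mr brown and colonel green".split()
table = bigram_contingency(tokens, "mr", "green")
for cell, count in table.items():
    print(cell, count)
```

The four counts always sum to the total number of bigrams in the text, so any one cell can be recovered from the other three and the total.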