This function looks a bit complicated, but it's just creating a reference to the wordlocation table for each word in the list and joining them all on their URL IDs (Figure 4-2).

    wordlocation w0: wordid = word0id, urlid
    wordlocation w1: wordid = word1id, urlid
    wordlocation w2: wordid = word2id, urlid

Figure 4-2. Table joins for getmatchrows

So a query for two words with the IDs 10 and 17 becomes:

    select w0.urlid,w0.location,w1.location
    from wordlocation w0,wordlocation w1
    where w0.urlid=w1.urlid
    and w0.wordid=10
    and w1.wordid=17

Try calling this function with your first multiple-word search:

    >>> reload(searchengine)
    >>> e=searchengine.searcher('searchindex.db')
    >>> e.getmatchrows('functional programming')
    ([(1, 327, 23), (1, 327, 162), (1, 327, 243), (1, 327, 261),
    (1, 327, 269), (1, 327, 436), (1, 327, 953),..

You'll notice that each URL ID is returned many times with different combinations of word locations. The next few sections will cover some ways to rank the results. Content-based ranking uses several possible metrics with just the content of the page to determine the relevance of the query. Inbound-link ranking uses the link structure of the site to determine what's important. We will also explore a way to look at what people actually click on when they search in order to improve the rankings over time.

Content-Based Ranking

So far you've managed to retrieve pages that match the queries, but the order in which they are returned is simply the order in which they were crawled. In a large set of pages, you would be stuck sifting through a lot of irrelevant content for any mention of each of the query terms in order to find the pages that are really related to your search. To address this issue, you need ways to give pages a score for a given query, as well as the ability to return them with the highest scoring results first.

This section will look at several ways to calculate a score based only on the query and the content of the page.
These scoring metrics include:

Word frequency
    The number of times the words in the query appear in the document can help determine how relevant the document is.

64 | Chapter 4: Searching and Ranking
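To make the join and the word-frequency idea concrete, here is a minimal sketch that builds the self-join query for a list of word IDs, runs it against a tiny in-memory table, and turns the matching rows into a score per URL. The helper names build_match_query and frequency_scores are hypothetical, the three-column wordlocation schema (urlid, wordid, location) is assumed from the query shown earlier, and counting rows per URL ID and then dividing by the largest count is just one possible way to express frequency as a score. It is not necessarily how the chapter's own code does it.

```python
import sqlite3
from collections import Counter


def build_match_query(word_ids):
    """Build a self-join over wordlocation, one alias per word ID.

    Hypothetical helper: it mirrors the query shape shown in the text.
    """
    fields = ["w0.urlid"]
    tables = []
    joins = []
    filters = []
    for i, word_id in enumerate(word_ids):
        fields.append("w%d.location" % i)
        tables.append("wordlocation w%d" % i)
        if i > 0:
            # Tie each table alias to the previous one on the URL ID.
            joins.append("w%d.urlid=w%d.urlid" % (i - 1, i))
        # int() keeps anything but a number out of the SQL string.
        filters.append("w%d.wordid=%d" % (i, int(word_id)))
    return "select %s from %s where %s" % (
        ",".join(fields), ",".join(tables), " and ".join(joins + filters))


def frequency_scores(rows):
    """Score each URL ID by how many match rows it has, scaled to 0..1.

    Assumption: row[0] is the URL ID, as in the getmatchrows output.
    """
    counts = Counter(row[0] for row in rows)
    if not counts:
        return {}
    top = max(counts.values())
    return {urlid: n / top for urlid, n in counts.items()}


# Toy data (made up for illustration): (urlid, wordid, location)
con = sqlite3.connect(":memory:")
con.execute("create table wordlocation(urlid, wordid, location)")
con.executemany(
    "insert into wordlocation values (?, ?, ?)",
    [(1, 10, 3), (1, 10, 8), (1, 17, 4),   # URL 1: word 10 twice, word 17 once
     (2, 10, 5), (2, 17, 6),               # URL 2: each word once
     (3, 10, 1)])                          # URL 3: missing word 17, no match

query = build_match_query([10, 17])
rows = con.execute(query).fetchall()
scores = frequency_scores(rows)
print(query)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Note that for a multiple-word query the row count for a URL is the number of location combinations, not the raw number of word occurrences, so this score grows quickly on pages where every query word appears often.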