Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.


  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • DownloadDownload
  • PrintPrint

Querying

You now have a working crawler and a big collection of documents indexed, and you’re ready to set up the search part of the search engine. First, create a new class in searchengine.py that you’ll use for searching:

class searcher:
  def __init_  _(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del_  _(self):
    self.con.close(  )

The wordlocation table gives an easy way to link words to tables, so it is quite easy to see which pages contain a single word. However, a search engine is pretty limited unless it allows multiple-word searches. To do this, you’ll need a query function that takes a query string, splits it into separate words, and constructs a SQL query to find only those URLs containing all the different words. Add this function to the definition for the searcher class:

  def getmatchrows(self,q):
    # Strings to build the query
    fieldlist='w0.urlid'
    tablelist=''
    clauselist=''
    wordids=[]

    # Split the words by spaces
    words=q.split(' ')
    tablenumber=0

    for word in words:
      # Get the word ID
      wordrow=self.con.execute(
        "select rowid from wordlist where word='%s'" % word).fetchone(  )
      if wordrow!=None:
        wordid=wordrow[0]
        wordids.append(wordid)
        if tablenumber>0:
          tablelist+=','
          clauselist+=' and '
          clauselist+='w%d.urlid=w%d.urlid and ' % (tablenumber-1,tablenumber)
        fieldlist+=',w%d.location' % tablenumber
        tablelist+='wordlocation w%d' % tablenumber
        clauselist+='w%d.wordid=%d' % (tablenumber,wordid)
        tablenumber+=1

    # Create the query from the separate parts
    fullquery='select %s from %s where %s' % (fieldlist,tablelist,clauselist)
    cur=self.con.execute(fullquery)
    rows=[row for row in cur]

    return rows,wordids

This function looks a bit complicated, but it’s just creating a reference to the wordlocation table for each word in the list and joining them all on their URL IDs (Figure 4-2).

Table joins for getmatchrows

Figure 4-2. Table joins for getmatchrows

So a query for two words with the IDs 10 and 17 becomes:

select w0.urlid,w0.location,w1.location
from wordlocation w0,wordlocation w1
where w0.urlid=w1.urlid
and w0.wordid=10
and w1.wordid=17

Try calling this function with your first multiple-word search:

>>reload(searchengine)
>> e=searchengine.searcher('searchindex.db')
>> e.getmatchrows('functional programming')
([(1, 327, 23), (1, 327, 162), (1, 327, 243), (1, 327, 261),
  (1, 327, 269), (1, 327, 436), (1, 327, 953),..

You’ll notice that each URL ID is returned many times with different combinations of word locations. The next few sections will cover some ways to rank the results. Content-based ranking uses several possible metrics with just the content of the page to determine the relevance of the query. Inbound-link ranking uses the link structure of the site to determine what’s important. We will also explore a way to look at what people actually click on when they search in order to improve the rankings over time.