You now have a working crawler and a big collection of documents indexed, and you’re ready to set up the search part of the search engine. First, create a new class in searchengine.py that you’ll use for searching:
class searcher:
  def __init__(self,dbname):
    self.con=sqlite.connect(dbname)

  def __del__(self):
    self.con.close()
The wordlocation table links
each word to the URLs it appears in, so it is easy to see which pages
contain a single word. However, a search engine is pretty limited unless
it allows multiple-word searches. To do this, you’ll need a query
function that takes a query string, splits it into separate words, and
constructs a SQL query to find only those URLs containing all the
different words. Add this function to the definition for the searcher class:
  def getmatchrows(self,q):
    # Strings to build the query
    fieldlist='w0.urlid'
    tablelist=''
    clauselist=''
    wordids=[]

    # Split the words by spaces
    words=q.split(' ')
    tablenumber=0

    for word in words:
      # Get the word ID; the ? placeholder lets the database driver
      # quote the word, so an apostrophe cannot break the query
      wordrow=self.con.execute(
        "select rowid from wordlist where word=?",(word,)).fetchone()
      if wordrow is not None:
        wordid=wordrow[0]
        wordids.append(wordid)
        if tablenumber>0:
          tablelist+=','
          clauselist+=' and '
          clauselist+='w%d.urlid=w%d.urlid and ' % (tablenumber-1,tablenumber)
        fieldlist+=',w%d.location' % tablenumber
        tablelist+='wordlocation w%d' % tablenumber
        clauselist+='w%d.wordid=%d' % (tablenumber,wordid)
        tablenumber+=1

    # If none of the words are in the index, the query would have
    # empty from and where parts, so return no rows instead
    if not wordids:
      return [],wordids

    # Create the query from the separate parts
    fullquery='select %s from %s where %s' % (fieldlist,tablelist,clauselist)
    cur=self.con.execute(fullquery)
    rows=[row for row in cur]

    return rows,wordids
This function looks a bit complicated, but it’s just creating a
reference to the wordlocation table
for each word in the list and joining them all on their URL IDs (Figure 4-2).
So a query for two words with the IDs 10 and 17 becomes:
select w0.urlid,w0.location,w1.location
from wordlocation w0,wordlocation w1
where w0.urlid=w1.urlid
and w0.wordid=10
and w1.wordid=17
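To see what such a join returns, here is a minimal sketch that can be run without the crawler's database. It assumes Python's standard sqlite3 module in place of the connection used by searcher, and the helper name build_match_query and the toy rows are invented for illustration. Only the wordlocation columns urlid, wordid, and location come from the text above.

```python
import sqlite3


def build_match_query(wordids):
    """Build the self-join SQL for a non-empty list of integer word IDs.

    Hypothetical helper for illustration, not part of searchengine.py.
    """
    fields = ['w0.urlid']
    tables = []
    clauses = []
    for n, wordid in enumerate(wordids):
        if n > 0:
            # Tie this alias to the previous one so all words share a URL
            clauses.append('w%d.urlid=w%d.urlid' % (n - 1, n))
        fields.append('w%d.location' % n)
        tables.append('wordlocation w%d' % n)
        clauses.append('w%d.wordid=%d' % (n, wordid))
    return 'select %s from %s where %s' % (
        ','.join(fields), ','.join(tables), ' and '.join(clauses))


# Toy index held in memory; every row here is made up
con = sqlite3.connect(':memory:')
con.execute('create table wordlocation(urlid, wordid, location)')
con.executemany(
    'insert into wordlocation values (?, ?, ?)',
    [(1, 10, 3), (1, 10, 8),   # word 10 occurs twice on URL 1
     (1, 17, 4), (1, 17, 9),   # word 17 occurs twice on URL 1
     (2, 10, 5)])              # URL 2 contains only word 10

query = build_match_query([10, 17])
rows = sorted(con.execute(query).fetchall())
con.close()

print(query)
for row in rows:
    print(row)
```

With this toy data, URL 1 comes back four times, once for each pairing of a location of word 10 with a location of word 17, and URL 2 is left out because it lacks word 17.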
Try calling this function with your first multiple-word search:
>> reload(searchengine)
>> e=searchengine.searcher('searchindex.db')
>> e.getmatchrows('functional programming')
([(1, 327, 23), (1, 327, 162), (1, 327, 243), (1, 327, 261), (1, 327, 269), (1, 327, 436), (1, 327, 953),..
You’ll notice that each URL ID is returned many times, once for every combination of word locations. The next few sections cover some ways to rank the results. Content-based ranking uses several possible metrics, computed from the content of the page alone, to determine how relevant a page is to the query. Inbound-link ranking uses the link structure of the site to determine what’s important. We will also explore a way to look at what people actually click on when they search in order to improve the rankings over time.
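Before any ranking is applied, the repeated rows can be grouped by URL ID. The following is only a sketch of that grouping step, not one of the ranking methods themselves. The first three rows are taken from the output shown earlier, the last row is invented so that a second URL appears, and the name locations_by_url is an assumption.

```python
from collections import defaultdict

rows = [
    (1, 327, 23), (1, 327, 162), (1, 327, 243),  # from the output above
    (2, 5, 6),                                   # invented row for a second URL
]

# Map each URL ID to the list of word-location combinations found on it
locations_by_url = defaultdict(list)
for row in rows:
    urlid, locations = row[0], row[1:]
    locations_by_url[urlid].append(locations)

for urlid in sorted(locations_by_url):
    print(urlid, len(locations_by_url[urlid]))
```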