Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.
The next step is to set up the database for the full-text index. As I mentioned earlier, the index is a list of all the different words, along with the documents in which they appear and their locations in the documents. In this example, you’ll be looking at the actual text on the page and ignoring nontext elements. You’ll also be indexing individual words with all the punctuation characters removed. The method for separating words is not perfect, but it will suffice for building a basic search engine.
Because covering different database software or setting up a database server is outside the scope of this book, this chapter will show you how to store the index using SQLite. SQLite is an embedded database that is very easy to set up and stores a whole database in one file. SQLite uses SQL for queries, so it shouldn’t be too difficult to convert the sample code to use a different database. The Python implementation is called pysqlite, and you can download it from http://initd.org/tracker/pysqlite.
There is a Windows installer as well as instructions for installing it on other operating systems. Appendix A contains more information on getting and installing pysqlite.
Once you have SQLite installed, add this line to the start of searchengine.py:
from pysqlite2 import dbapi2 as sqlite
You’ll also need to change the __init__, __del__, and dbcommit methods to open and close the
database:
def __init_ _(self,dbname):
self.con=sqlite.connect(dbname)
def __del_ _(self):
self.con.close( )
def dbcommit(self):
self.con.commit( )
Don’t run the code just yet—you still need to prepare the
database. The schema for the basic index is five tables. The first
table (urllist) is the list of URLs
that have been indexed. The second table (wordlist) is the list of words, and the
third table (wordlocation) is a
list of the locations of words in the documents. The remaining two
tables specify links between documents. The link table stores two URL IDs, indicating a
link from one table to another, and linkwords uses the wordid and linkid columns to store which words are
actually used in that link. The schema is shown in Figure 4-1.
All tables in SQLite have a field called rowid by default, so there’s no need to
explicitly specify an ID for these tables. To create a function for
adding all the tables, add this code to the end of searchengine.py so that it’s part of the
crawler class:
def createindextables(self):
self.con.execute('create table urllist(url)')
self.con.execute('create table wordlist(word)')
self.con.execute('create table wordlocation(urlid,wordid,location)')
self.con.execute('create table link(fromid integer,toid integer)')
self.con.execute('create table linkwords(wordid,linkid)')
self.con.execute('create index wordidx on wordlist(word)')
self.con.execute('create index urlidx on urllist(url)')
self.con.execute('create index wordurlidx on wordlocation(wordid)')
self.con.execute('create index urltoidx on link(toid)')
self.con.execute('create index urlfromidx on link(fromid)')
self.dbcommit( )
This function will create the schema for all the tables that you will be using, along with some indices to speed up searching. These indices are important, since the dataset can potentially get very large. Enter these commands in your Python session to create a database called searchindex.db:
>>reload(searchengine)>> crawler=searchengine.crawler('searchindex.db')>> crawler.createindextables( )
Later you’ll be adding an additional table to the schema for a scoring metric based on counting inbound links.
The files that you’re downloading from the Web are HTML and thus
contain a lot of tags, properties, and other information that doesn’t
belong in the index. The first step is to extract all the parts of the
page that are text. You can do this by searching the soup for text
nodes and collecting all their content. Add this code to your gettextonly function:
def gettextonly(self,soup):
v=soup.string
if v==None:
c=soup.contents
resulttext=''
for t in c:
subtext=self.gettextonly(t)
resulttext+=subtext+'\n'
return resulttext
else:
return v.strip( )
The function returns a long string containing all the text on the page. It does this by recursively traversing down the HTML document object model, looking for text nodes. Text that was in separate sections is separated into different paragraphs. It’s important to preserve the order of the sections for some of the metrics you’ll be calculating later.
Next is the separatewords
function, which splits a string into a list of separate words so that
they can be added to the index. It’s not as easy as you might think to
do this perfectly, and there has been a lot of research into improving
the technique. However, for these examples it will suffice to consider
anything that isn’t a letter or a number to be a separator. You can do
this using a regular expression. Replace the definition of separatewords with the following:
def separatewords(self,text):
splitter=re.compile('\\W*')
return [s.lower( ) for s in splitter.split(text) if s!='']
Because this function considers anything nonalphanumeric to be a separator, it will have no problem extracting English words, but it won’t properly handle terms like “C++” (no trouble searching for “python,” though). You can experiment with the regular expression to make it work better for different kinds of searches.
Another possibility is to remove suffixes from the words using a stemming algorithm. These algorithms attempt to convert the words to their stems. For example, the word “indexing” becomes “index” so that people searching for the word “index” are also shown documents containing the word “indexing.” To do this, stem the words while crawling documents and also stem the words in the search query. A full discussion of stemming is outside the scope of this chapter, but you can find a Python implementation of the well-known Porter Stemmer at http://www.tartarus.org/˜martin/PorterStemmer/index.html.
You’re ready to fill in the code for the addtoindex method. This method will call the
two functions that were defined in the previous section to get a list
of words on the page. Then it will add the page and all the words to
the index, and will create links between them with their locations in
the document. For this example, the location will be the index within
the list of words.
Here is the code for addtoindex:
def addtoindex(self,url,soup):
if self.isindexed(url): return
print 'Indexing '+url
# Get the individual words
text=self.gettextonly(soup)
words=self.separatewords(text)
# Get the URL id
urlid=self.getentryid('urllist','url',url)
# Link each word to this url
for i in range(len(words)):
word=words[i]
if word in ignorewords: continue
wordid=self.getentryid('wordlist','word',word)
self.con.execute("insert into wordlocation(urlid,wordid,location) \
values (%d,%d,%d)" % (urlid,wordid,i))
You’ll also need this to update the helper function getentryid. All this does is return the ID
of an entry. If the entry doesn’t exist, it is created and the ID is
returned:
def getentryid(self,table,field,value,createnew=True):
cur=self.con.execute(
"select rowid from %s where %s='%s'" % (table,field,value))
res=cur.fetchone( )
if res==None:
cur=self.con.execute(
"insert into %s (%s) values ('%s')" % (table,field,value))
return cur.lastrowid
else:
return res[0]
As you’re crawling, you’ll also want to be remembering which pages linked to each other. This will become important later when link-based scoring methods are introduced. Add the “addlinkref” method to your crawler class.
def addlinkref(self,urlFrom,urlTo,linkText):
words=self.separateWords(linkText)
fromid=self.getentryid('urllist','url',urlFrom)
toid=self.getentryid('urllist','url',urlTo)
if fromid==toid: return
cur=self.con.execute("insert into link(fromid,toid) values (%d,%d)" %
(fromid,toid))
linkid=cur.lastrowid
for word in words:
if word in ignorewords: continue
wordid=self.getentryid('wordlist','word',word)
self.con.execute("insert into linkwords(linkid,wordid) values (%d,%d)" %
(linkid,wordid))
Finally, you’ll need to fill in the code for isindexed, which determines whether the page
is already in the database, and if so, whether there are any words
associated with it:
def isindexed(self,url):
u=self.con.execute \
("select rowid from urllist where url='%s'" % url).fetchone( )
if u!=None:
# Check if it has actually been crawled
v=self.con.execute(
'select * from wordlocation where urlid=%d' % u[0]).fetchone( )
if v!=None: return True
return False
Now you can rerun the crawler and have it actually index the pages as it goes. You can do this in your interactive session:
>>reload(searchengine)>> crawler=searchengine.crawler('searchindex.db')>> pages= \.. ['http://kiwitobes.com/wiki/Categorical_list_of_programming_languages.html']>> crawler.crawl(pages)
The crawler will probably take a long time to run. Instead of waiting for it to finish, I recommend that you download a preloaded copy of searchindex.db from http://kiwitobes.com/db/searchindex.db and save it in the directory with your Python code.
If you’d like to make sure that the crawl worked properly, you can try checking the entries for a word by querying the database:
>>[row for row in crawler.con.execute(.. 'select rowid from wordlocation where wordid=1')][(1,), (46,), (330,), (232,), (406,), (271,), (192,),...
The list that is returned is the list of all the URL IDs containing “word,” which means that you’ve successfully run a full-text search. This is a great start, but it will only work with one word at a time, and will just return the documents in the order in which they were loaded. The next section will show you how to expand this functionality by doing these searches with multiple words in the query.