
Chapter 7. Google+: TF-IDF, Cosine Similarity, and Collocations

Note

Initial printings of this book from February 2011 through February 2012 featured Google Buzz as the backdrop for data in this chapter. This chapter has been fully revised (with as few changes made as possible) to now feature Google+ instead. Example files have been updated and renamed with the plus__ prefix, but previous buzz__ example files are still available online with the other example code.

This short chapter begins our journey into text mining,[46] and it’s something of an inflection point in this book. Earlier chapters have mostly focused on analyzing structured or semi-structured data such as records encoded as microformats, relationships among people, or specially marked #hashtags in tweets. However, this chapter begins munging and making sense of textual information in documents by introducing Information Retrieval (IR) theory fundamentals such as TF-IDF, cosine similarity, and collocation detection. As you may have already inferred from the chapter title, Google+ initially serves as our primary source of data because it’s inherently social, easy to harvest,[47] and has a lot of potential for the social web. Toward the end of this chapter, we’ll also look at what it takes to tap into your Gmail data. In the chapters ahead, we’ll investigate mining blog data and other sources of free text, as additional forms of text analytics such as entity extraction and the automatic generation of abstracts are introduced. There’s no real reason to introduce Google+ earlier in the book than blogs (the topic of Chapter 8), other than the fact that Google+ activities (notes) fill an interesting niche somewhere between Twitter and blogs, so this ordering facilitates telling a story from cover to cover. All in all, the text-mining techniques you’ll learn in any chapter of this book could just as easily be applied to any other chapter.


  
