You may recall from Chapter 7 that perhaps the most fundamental weakness of TF-IDF and cosine similarity is that these models inherently operate without any deep semantic understanding of the data. Quite the contrary, the examples in that chapter relied on very basic syntax, splitting on whitespace to break an otherwise opaque document into a bag of tokens, and then used frequency and simple statistical similarity metrics to determine which tokens were likely to be important in the data. Although you can do some really amazing things with these techniques, they give you no notion of what any given token means in the context in which it appears in the document. Look no further than a sentence containing a homograph[51] such as “fish” or “bear” as a case in point; either one could be a noun or a verb.
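To make the limitation concrete, here is a minimal sketch of the whitespace-based bag-of-tokens approach described above. The function name `bag_of_tokens` and the two sample sentences are illustrative assumptions, not code from Chapter 7. The sketch only shows that a frequency count records the same entry for “fish” whether the word is used as a noun or as a verb.

```python
from collections import Counter


def bag_of_tokens(text):
    """Lowercase the text, split it on whitespace, and count each token.

    Word order and grammatical role are discarded; only counts remain.
    """
    return Counter(text.lower().split())


# Hypothetical sample sentences: "fish" is a noun in the first
# and a verb in the second.
noun_use = "the fish swam upstream"
verb_use = "we fish upstream"

noun_bag = bag_of_tokens(noun_use)
verb_bag = bag_of_tokens(verb_use)

# Both bags hold the identical key "fish" with the same count, so any
# metric built on these counts cannot tell the two senses apart.
print(noun_bag["fish"], verb_bag["fish"])

# Tokens the two sentences share, regardless of how each is used.
shared = sorted(set(noun_bag) & set(verb_bag))
print(shared)
```

Because the counts are all that a frequency-based similarity metric sees, telling the two uses apart requires extra information about each token's role in its sentence, such as its part of speech.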