Information retrieval is an extensive field with many specialties.
This discussion focuses on TF-IDF, one of the most fundamental
techniques for retrieving relevant documents from a corpus. TF-IDF
stands for term frequency-inverse document
frequency and can be used to query a corpus by calculating
normalized scores that express the relative importance of terms in the
documents. Mathematically, TF-IDF is expressed as the product of the
term frequency and the inverse document frequency, tf_idf = tf*idf,
where tf represents the importance of a term in a specific document and
idf represents the importance of that term relative to the entire
corpus. Multiplying these two factors together produces
a score that accounts for both factors. TF-IDF, or a close variant of
it, has long been a standard building block of search engines. To get a more
intuitive idea of how TF-IDF works, let’s walk through each of the
calculations involved in computing the overall score.
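
As a preview of those calculations, here is a minimal Python sketch of the product tf*idf. It rests on assumptions that are not stated above: tf is taken to be a term's count divided by the document's length, idf is taken to be the natural log of the number of documents divided by the number of documents containing the term, and the three-document corpus and the function names are invented purely for illustration. Other weighting and smoothing choices are common, so treat this as one plausible variant rather than a definitive formula.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: occurrences of term, normalized by document length.

    Assumption: length normalization is only one of several common choices.
    """
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """Inverse document frequency: log(N / number of docs containing term).

    Assumption: unsmoothed natural log; returns 0.0 for unseen terms
    to avoid dividing by zero.
    """
    docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    if docs_with_term == 0:
        return 0.0
    return math.log(len(corpus_tokens) / docs_with_term)


def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of the two factors described in the text."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

# Score a few terms against the first document.
for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus, "the" appears in every document, so its idf is log(3/3) = 0 and its score vanishes no matter how often it occurs in a single document. "mat" appears in only one document and therefore scores higher than "cat", which appears in two.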