One of the most intuitive measurements that can be applied to unstructured text is a metric called lexical diversity. Put simply, it is the number of unique tokens in the text divided by the total number of tokens in the text. Both of these counts are elementary yet important metrics in their own right. Lexical diversity could be computed as shown in Example 1-7.
Example 1-7. Calculating lexical diversity for tweets
>>> words = []
>>> for t in tweets:
...     words += [ w for w in t.split() ]
...
>>> len(words) # total words
7238
>>> len(set(words)) # unique words
1636
>>> 1.0*len(set(words))/len(words) # lexical diversity
0.22602928985907708
>>> 1.0*sum([ len(t.split()) for t in tweets ])/len(tweets) # avg words per tweet
14.476000000000001
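The same calculation can also be wrapped in small reusable functions. The following is only a minimal sketch, assuming Python 3 (where `/` is true division, so the `1.0*` factor is unnecessary) and whitespace tokenization via `str.split()` as in the example. The names `lexical_diversity`, `average_words`, and `sample_tweets` are hypothetical and chosen for illustration, and the sample data is made up so that the arithmetic is easy to check by hand.

```python
def lexical_diversity(texts):
    """Return unique tokens divided by total tokens across all texts.

    Tokens are whitespace-delimited, as in the example above.
    Returns 0.0 for empty input to avoid dividing by zero.
    """
    words = [w for t in texts for w in t.split()]
    return len(set(words)) / len(words) if words else 0.0


def average_words(texts):
    """Return the mean number of whitespace-delimited tokens per text."""
    return sum(len(t.split()) for t in texts) / len(texts) if texts else 0.0


# Made-up sample data (not the tweets from the example above).
sample_tweets = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# 10 tokens in total, 6 of them distinct: 6 / 10 = 0.6
print(lexical_diversity(sample_tweets))
# 10 tokens spread over 3 texts
print(average_words(sample_tweets))
```

Because the tokenization is a plain `split()`, tokens that differ only in case or trailing punctuation (for example "Dog" and "dog.") count as distinct. Normalizing the text first would generally lower the measured diversity.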