Given that MASI distance is well suited to sets of terms that only partially overlap, and that overlap between titles appears to matter, let’s try to cluster job titles by comparing them to one another using MASI distance, as an extension of
Example 6-2. Example 6-6 clusters
similar titles and then displays your contacts accordingly. Take a look
at the code, especially the nested loop that invokes the
DISTANCE function and makes a greedy decision, and then
we’ll discuss.
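Before reading the full listing, it may help to see the greedy idea in isolation. The following is a minimal sketch, not the code from Example 6-6: the names `jaccard_distance` and `greedy_cluster`, the sample titles, and the 0.4 threshold are all assumptions made for illustration. It uses a hand-written Jaccard distance as a stand-in so that it runs without NLTK; passing `nltk.metrics.distance.masi_distance` as the `distance` argument should work the same way, although the exact values it returns depend on the NLTK version.

```python
def jaccard_distance(a, b):
    """Stand-in set distance (not MASI): 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / float(len(union))


def greedy_cluster(titles, distance=jaccard_distance, threshold=0.4):
    """Group titles greedily by token overlap.

    Each cluster is a list whose first element acts as its seed. A title
    joins the first cluster whose seed is closer than `threshold`;
    otherwise it starts a new cluster. Once placed, a title is never
    revisited, which is what makes the heuristic greedy.
    """
    clusters = []
    for title in titles:
        tokens = set(title.split())
        for cluster in clusters:
            seed_tokens = set(cluster[0].split())
            if distance(tokens, seed_tokens) < threshold:
                cluster.append(title)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([title])
    return clusters


# Hypothetical sample titles, for illustration only.
titles = [
    "Senior Software Engineer",
    "Software Engineer",
    "Chief Executive Officer",
    "Vice President",
]
clusters = greedy_cluster(titles)
for cluster in clusters:
    print(cluster)
```

Because a title is assigned to the first cluster that is close enough, the result depends on the order of the input. That is the usual trade-off of a greedy approach: it is cheap and simple, but not guaranteed to find the best grouping.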
Example 6-6. Clustering job titles using a greedy heuristic (linkedin__cluster_contacts_by_title.py)
# -*- coding: utf-8 -*-
import sys
import csv
from nltk.metrics.distance import masi_distance
CSV_FILE = sys.argv[1]
DISTANCE_THRESHOLD = 0.34
DISTANCE = masi_distance
def cluster_contacts_by_title(csv_file):
    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Finance Officer'),
        ('VP', 'Vice President'),
    ]
    seperators = ['/', 'and', '&']
    csvReader = csv.DictReader(open(csv_file), delimiter=',', quotechar='"')
    contacts = [row for row in csvReader]
    # Normalize and/or replace known a....