A Greedy Approach to Clustering

Given MASI’s partiality to partially overlapping terms, and given our insight that overlap in titles is important, let’s try to cluster job titles by comparing them to one another using MASI distance, as an extension of Example 6-2. Example 6-6 clusters similar titles and then displays your contacts accordingly. Take a look at the code, especially the nested loop that invokes the DISTANCE function and makes a greedy decision, and then we’ll discuss.
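Before reading the full listing, it may help to see the greedy idea in isolation. The following is a minimal sketch, not the book's listing: the names jaccard_distance and greedy_cluster are hypothetical, and plain Jaccard distance stands in for MASI distance so that the sketch runs without NLTK installed. Each title is compared with every other title, and any pair whose distance falls below the threshold is grouped immediately, with no later reconsideration.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets (a stand-in for MASI distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / float(len(union))


def greedy_cluster(titles, distance=jaccard_distance, threshold=0.34):
    """Group each title with every title closer than the threshold.

    Hypothetical helper for illustration only; titles are compared as
    sets of whitespace-separated words.
    """
    clusters = {}
    for t1 in titles:
        clusters[t1] = []
        for t2 in titles:
            # Skip pairs that were already grouped in either direction.
            if t2 in clusters[t1] or (t2 in clusters and t1 in clusters[t2]):
                continue
            # Greedy decision: accept the pair as soon as it is close enough.
            if distance(set(t1.split()), set(t2.split())) < threshold:
                clusters[t1].append(t2)
    # Keep only groups that contain more than one title.
    return [group for group in clusters.values() if len(group) > 1]


titles = [
    "Senior Software Engineer",
    "Software Engineer",
    "Chief Executive Officer",
]
print(greedy_cluster(titles))
```

Passing a different function as the distance argument (for example, masi_distance from nltk.metrics.distance) changes how overlap is scored without changing the greedy loop itself.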

Example 6-6. Clustering job titles using a greedy heuristic (linkedin__cluster_contacts_by_title.py)

# -*- coding: utf-8 -*-

import sys
import csv
from nltk.metrics.distance import masi_distance

CSV_FILE = sys.argv[1]

DISTANCE_THRESHOLD = 0.34
DISTANCE = masi_distance

def cluster_contacts_by_title(csv_file):

    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Financial Officer'),
        ('VP', 'Vice President'),
        ]

    seperators = ['/', 'and', '&']

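    # Read every contact row up front; the with block closes the file afterwards.
    with open(csv_file) as csv_handle:
        csvReader = csv.DictReader(csv_handle, delimiter=',', quotechar='"')
        contacts = [row for row in csvReader]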

# Normalize and/or replace known a....

  
