Measuring Influence

When someone shares information via a service such as Twitter, it’s only natural to wonder how far the information penetrates into the overall network by means of being retweeted. It seems fair to assume that the more followers a person has, the greater the potential for that person’s tweets to be retweeted. Users who have a relatively high percentage of their originally authored tweets retweeted can be said to be more influential than users who are retweeted infrequently. Users who have a relatively high percentage of their tweets retweeted, even if those tweets are not originally authored, might be said to be mavens: people who are exceptionally well connected and like to share information.[26]

One trivial way to measure the relative influence of two or more users is to simply compare their numbers of followers, since every follower will have a direct view of their tweets. We already know from Example 4-6 that we can get the number of followers (and friends) for a user via the /users/lookup and /users/show APIs. Once those responses have been stored in Redis, extracting the counts is trivial enough:

for screen_name in screen_names:
    _json = json.loads(r.get(getRedisIdByScreenName(screen_name, "info.json")))
    n_friends, n_followers = _json['friends_count'], _json['followers_count']
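To turn those counts into an actual comparison, one option is to sort the users by follower count. The following is only a minimal sketch under stated assumptions: a plain dictionary named stored stands in for the info.json values kept in Redis, the numbers are made up, and rank_by_followers is a hypothetical helper rather than part of twitter__util.

```python
import json

# Stand-in for the info.json values stored in Redis (made-up numbers)
stored = {
    'user_a': json.dumps({'friends_count': 120, 'followers_count': 4500}),
    'user_b': json.dumps({'friends_count': 900, 'followers_count': 310}),
    'user_c': json.dumps({'friends_count': 45, 'followers_count': 98000}),
}

def rank_by_followers(raw_by_screen_name):
    """Return (screen_name, followers_count) pairs, most followed first."""
    counts = [(name, json.loads(raw)['followers_count'])
              for name, raw in raw_by_screen_name.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

for name, n_followers in rank_by_followers(stored):
    print('%s\t%d' % (name, n_followers))
```

With real data, the dictionary lookups would be replaced by the r.get calls shown above.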

Counting numbers of followers is interesting, but there’s so much more that can be done. For example, a given user may not have the popularity of an information maven like Tim O’Reilly, but if you have him as a follower and he retweets you, you’ve suddenly tapped into a vast network of people who might just start to follow you once they’ve determined that you’re also interesting. Thus, a much better approach that you might take in calculating users’ potential influence is to not only compare their numbers of followers, but to spider out into the network a couple of levels. In fact, we can use the very same breadth-first approach that was introduced in Example 2-4.
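As a reminder of what spidering out "a couple of levels" means, here is a small, generic breadth-first walk over an in-memory graph. It is an illustrative sketch, not the code from Example 2-4: the followers dictionary is made-up data, and crawl_levels is a hypothetical name chosen for this illustration.

```python
def crawl_levels(graph, seeds, depth):
    """Breadth-first walk of graph from seeds, returning one list of newly
    discovered nodes per level, up to depth levels out."""
    seen = set(seeds)
    queue = list(seeds)
    levels = []
    for _ in range(depth):
        next_queue = []
        for node in queue:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_queue.append(neighbor)
        levels.append(next_queue)
        queue = next_queue
    return levels

# Made-up follower graph: each key maps to the accounts that follow it
followers = {
    'you': ['alice', 'bob'],
    'alice': ['carol', 'dave'],
    'bob': ['dave', 'erin'],
    'carol': ['frank'],
}

levels = crawl_levels(followers, ['you'], depth=2)
for i, level in enumerate(levels):
    print('level %d: %s' % (i + 1, ', '.join(level)))
```

Tracking the nodes already seen keeps an account that is reachable by several paths (dave, in this data) from being counted or crawled twice.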

Example 4-8 illustrates a generalized crawl function that accepts a list of screen names, a crawl depth, and parameters that control how many friends and followers to retrieve. The friends_limit and followers_limit parameters control how many items to fetch from the social graph APIs (in batches of 5,000), while friends_sample and followers_sample control how many user objects to retrieve (in batches of 100). An updated function for getUserInfo is also included to reflect the pass-through of the sampling parameters.
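The interplay between sampling and batching can be sketched in a few lines. How getUserInfo actually draws its sample is not shown here, so treat the following as an assumption-laden illustration: sample_ids and batches are hypothetical helpers, and the ID list is made up.

```python
import random

def sample_ids(ids, sample=1.0, rng=random):
    """Keep a random fraction (0.0 to 1.0) of ids."""
    k = int(len(ids) * sample)
    return rng.sample(list(ids), k)

def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

friend_ids = list(range(1, 1201))       # pretend 1,200 friend IDs came back
sampled = sample_ids(friend_ids, sample=0.2)
lookups = list(batches(sampled, 100))   # one user-lookup request per batch

print('%d IDs sampled, %d lookup batches' % (len(sampled), len(lookups)))
```

Sampling before batching is what keeps the number of user-lookup requests manageable: here a 20% sample turns 12 potential requests into 3.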

Example 4-8. Crawling friends/followers connections (friends_followers__crawl.py)

# -*- coding: utf-8 -*-

import sys
import redis
import functools
from twitter__login import login
from twitter__util import getUserInfo
from twitter__util import _getFriendsOrFollowersUsingFunc

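# Default to None so that the usage check at the bottom of the script can
# report a missing argument instead of failing here with an IndexError
SCREEN_NAME = sys.argv[1] if len(sys.argv) > 1 else None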

t = login()
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc that 
# create convenience functions

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc, 
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

def crawl(
    screen_names,
    friends_limit=10000,
    followers_limit=10000,
    depth=1,
    friends_sample=0.2, #XXX
    followers_sample=0.0,
    ):

    getUserInfo(t, r, screen_names=screen_names)
    for screen_name in screen_names:
        friend_ids = getFriends(screen_name, limit=friends_limit)
        follower_ids = getFollowers(screen_name, limit=followers_limit)

        friends_info = getUserInfo(t, r, user_ids=friend_ids, 
                                   sample=friends_sample)

        followers_info = getUserInfo(t, r, user_ids=follower_ids,
                                     sample=followers_sample)

        next_queue = [u['screen_name'] for u in friends_info + followers_info]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _screen_name in queue:
                friend_ids = getFriends(_screen_name, limit=friends_limit)
                follower_ids = getFollowers(_screen_name, limit=followers_limit)

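                # Note that getUserInfo takes a keyword argument between 0.0
                # and 1.0 called sample that allows you to crawl only a random
                # sample of nodes at any given level of the graph

                friends_info = getUserInfo(t, r, user_ids=friend_ids,
                                           sample=friends_sample)
                followers_info = getUserInfo(t, r, user_ids=follower_ids,
                                             sample=followers_sample)

                # Queue screen names rather than IDs, because getFriends and
                # getFollowers are called with screen names at the next level
                next_queue.extend([u['screen_name']
                                   for u in friends_info + followers_info])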

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Please supply at least one screen name."
    else:
        crawl([SCREEN_NAME])

        # The data is now in the system. Do something interesting. For example, 
        # find someone's most popular followers as an indicator of potential influence.
        # See friends_followers__calculate_avg_influence_of_followers.py

Assuming you’ve run crawl with high enough values for friends_limit and followers_limit to get all of a user’s friend IDs and follower IDs, all that remains is to take a large enough random sample and calculate interesting metrics, such as the average number of followers one level out. It could also be fun to look at the user’s top N followers to get an idea of who that user might be influencing. Example 4-9 demonstrates one possible approach that pulls the data out of Redis and calculates Tim O’Reilly’s most popular followers.

Example 4-9. Calculating a Twitterer’s most popular followers (friends_followers__calculate_avg_influence_of_followers.py)

# -*- coding: utf-8 -*-

import sys
import json
import locale
import redis
from prettytable import PrettyTable

# Pretty printing numbers
from twitter__util import pp 

# These functions create consistent keys from 
# screen names and user id values
from twitter__util import getRedisIdByScreenName 
from twitter__util import getRedisIdByUserId

SCREEN_NAME = sys.argv[1]

locale.setlocale(locale.LC_ALL, '')

def calculate():
    r = redis.Redis()  # Default connection settings on localhost

    follower_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME,
                        'follower_ids')))

    followers = r.mget([getRedisIdByUserId(follower_id, 'info.json')
                       for follower_id in follower_ids])
    followers = [json.loads(f) for f in followers if f is not None]

    freqs = {}
    for f in followers:
        cnt = f['followers_count']
        if cnt not in freqs:
            freqs[cnt] = []

        freqs[cnt].append({'screen_name': f['screen_name'], 'user_id': f['id']})

    # It could take a few minutes to calculate freqs, so store a snapshot for later use

    r.set(getRedisIdByScreenName(SCREEN_NAME, 'follower_freqs'),
          json.dumps(freqs))

    keys = freqs.keys()
    keys.sort()

    print 'The top 10 followers from the sample:'

    fields = ['Screen Name', 'Count']
    pt = PrettyTable(fields=fields)
    [pt.set_field_align(f, 'l') for f in fields]

    for (user, freq) in reversed([(user['screen_name'], k) for k in keys[-10:]
                                    for user in freqs[k]]):
        pt.add_row([user, pp(freq)])

    pt.printt()

    all_freqs = [k for k in keys for user in freqs[k]]
    avg = sum(all_freqs) / len(all_freqs)

    print "\nThe average number of followers for %s's followers: %s" \
        % (SCREEN_NAME, pp(avg))

# psyco can only compile functions, so wrap code in a function

try:
    import psyco
    psyco.bind(calculate)
except ImportError, e:
    pass  # psyco not installed

calculate()

Note

In many common number-crunching situations, the psyco module can dynamically compile code and produce dramatic speed improvements. It’s totally optional but definitely worth a hard look if you’re performing calculations that take more than a few seconds.

Output follows for a sample size of about 150,000 (approximately 10%) of Tim O’Reilly’s followers. For statistical analysis, a sample this large relative to the population yields a tiny margin of error at a very high confidence level.[27] That is, the results can be considered very representative, though they are not quite the same thing as the absolute truth about the population:

The top 10 followers from the sample:
aplusk 4,993,072
BarackObama 4,114,901
mashable 2,014,615
MarthaStewart 1,932,321
Schwarzenegger 1,705,177
zappos 1,689,289
Veronica 1,612,827
jack 1,592,004
stephenfry 1,531,813
davos 1,522,621

The average number of followers for timoreilly's followers: 445

Interestingly, a few familiar names show up on the list, including some of the most popular Twitterers of all time: Ashton Kutcher (@aplusk), Barack Obama, Martha Stewart, and Arnold Schwarzenegger, among others. Removing these top 10 followers and recalculating lowers the average number of followers of Tim’s followers to approximately 284. Removing any follower with fewer than 10 followers of their own, however, dramatically increases the average to more than 1,000! There are tens of thousands of followers in this range, and briefly perusing their profiles brings some reality into the situation: many of these users are spam accounts, users who are protecting their tweets, etc. Culling out both the top 10 followers and all followers having fewer than 10 followers of their own might be a reasonable basis for a metric; doing both of these things results in an average of around 800, which is still quite high. There must be something to be said for the idea of getting retweeted by a popular Twitterer who has lots of connections to other popular Twitterers.
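The culling just described is easy to experiment with. The sketch below is one possible way to do it, not the calculation used to produce the figures above: culled_average is a hypothetical helper, and the follower counts are made-up values chosen only to show the direction in which each filter moves the average.

```python
def culled_average(follower_counts, drop_top=10, min_followers=10):
    """Average the counts after dropping the drop_top largest values and
    any value below min_followers. Returns None if nothing is left."""
    ranked = sorted(follower_counts, reverse=True)
    kept = [c for c in ranked[drop_top:] if c >= min_followers]
    if not kept:
        return None
    return float(sum(kept)) / len(kept)

# Made-up follower counts for a handful of hypothetical followers
counts = [5000000, 1200, 800, 40, 9, 3, 0]

print(culled_average(counts, drop_top=0, min_followers=0))   # no culling
print(culled_average(counts, drop_top=1, min_followers=0))   # drop the outlier
print(culled_average(counts, drop_top=1, min_followers=10))  # drop both tails
```

On this toy data, dropping the single largest account pulls the average down sharply, while additionally dropping accounts with fewer than 10 followers pushes it back up, which mirrors the pattern described above.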



[25] Whenever Twitter goes over capacity, an HTTP 503 error is issued. In a browser, the error page displays an image of the now infamous “fail whale.” See http://twitter.com/503.

[26] See The Tipping Point by Malcolm Gladwell (Back Bay Books) for a great discourse on mavens.

[27] It’s about a 0.14 margin of error for a 99% confidence level.
