
Souping Up the Machine with Basic Friend/Follower Metrics

Redis should serve you well on your quest to efficiently process and analyze vast amounts of Twitter data for certain kinds of queries. Adapting Example 4-2 with some additional logic to house data in Redis requires only a simple change, and Example 4-4 is an update that computes some basic friend/follower statistics. Native functions in Redis are used to compute the set operations.

Example 4-4. Harvesting, storing, and computing statistics about friends and followers (friends_followers__friend_follower_symmetry.py)

# -*- coding: utf-8 -*-

import sys
import locale
import time
import functools
import twitter
import redis
from twitter__login import login

# A template-like function for maximizing code reuse,
# which is essentially a wrapper around makeTwitterRequest
# with some additional logic in place for interfacing with 
# Redis
from twitter__util import _getFriendsOrFollowersUsingFunc

# Creates a consistent key value for a user given a screen name
from twitter__util import getRedisIdByScreenName

SCREEN_NAME = sys.argv[1]

MAXINT = sys.maxint

# For nice number formatting
locale.setlocale(locale.LC_ALL, '')  

# You may need to set up your OAuth settings in twitter__login.py

t = login()

# Connect using default settings for localhost
r = redis.Redis()  

# Some wrappers around _getFriendsOrFollowersUsingFunc 
# that bind the first four arguments

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc, 
                               t.friends.ids, 'friend_ids', t, r)

getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

screen_name = SCREEN_NAME

# get the data

print >> sys.stderr, 'Getting friends for %s...' % (screen_name, )
getFriends(screen_name, limit=MAXINT)

print >> sys.stderr, 'Getting followers for %s...' % (screen_name, )
getFollowers(screen_name, limit=MAXINT)

# use redis to compute the numbers

n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))

n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))

n_friends_diff_followers = r.sdiffstore('temp',
                                        [getRedisIdByScreenName(screen_name,
                                        'friend_ids'),
                                        getRedisIdByScreenName(screen_name,
                                        'follower_ids')])
r.delete('temp')

n_followers_diff_friends = r.sdiffstore('temp',
                                        [getRedisIdByScreenName(screen_name,
                                        'follower_ids'),
                                        getRedisIdByScreenName(screen_name,
                                        'friend_ids')])
r.delete('temp')

n_friends_inter_followers = r.sinterstore('temp',
        [getRedisIdByScreenName(screen_name, 'follower_ids'),
        getRedisIdByScreenName(screen_name, 'friend_ids')])
r.delete('temp')

print '%s is following %s' % (screen_name, locale.format('%d', n_friends, True))
print '%s is being followed by %s' % (screen_name, locale.format('%d',
                                      n_followers, True))
print '%s of %s are not following %s back' % (locale.format('%d',
        n_friends_diff_followers, True), locale.format('%d', n_friends, True),
        screen_name)
print '%s of %s are not being followed back by %s' % (locale.format('%d',
        n_followers_diff_friends, True), locale.format('%d', n_followers, True),
        screen_name)
print '%s has %s mutual friends' % (screen_name,
        locale.format('%d', n_friends_inter_followers, True))

Aside from the use of functools.partial (http://docs.python.org/library/functools.html) to create getFriends and getFollowers from a common piece of parameter-bound code, Example 4-4 should be pretty straightforward. There’s one other very subtle thing to notice: there isn’t a call to r.save in Example 4-4, which means that the settings in redis.conf dictate when data is persisted to disk. By default, Redis stores data in memory and asynchronously snapshots data to disk according to a schedule that’s dictated by whether or not a number of changes have occurred within a specified time interval. The risk with asynchronous writes is that you might lose data if certain unexpected conditions, such as a system crash or power outage, were to occur. Redis provides an “append only” option that you can enable in redis.conf to hedge against this possibility.
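If functools.partial is new to you, the following minimal sketch may help. It is written for a current Python 3 interpreter rather than the Python 2 used in the listings, and the names get_ids, fake_friends_ids, and fake_followers_ids are hypothetical stand-ins invented for illustration; they are not the real twitter__util helpers, and no network or Redis access takes place.

```python
import functools


def get_ids(api_func, key_name, screen_name, limit=5000):
    """Hypothetical stand-in for a shared friends/followers fetch routine."""
    ids = api_func(screen_name)[:limit]
    return key_name, ids


def fake_friends_ids(screen_name):
    # Canned data in place of a real API call
    return [1, 2, 3]


def fake_followers_ids(screen_name):
    # Canned data in place of a real API call
    return [2, 3, 4, 5]


# Bind the leading positional arguments once; callers supply the rest
get_friends = functools.partial(get_ids, fake_friends_ids, 'friend_ids')
get_followers = functools.partial(get_ids, fake_followers_ids, 'follower_ids')

friends_result = get_friends('timoreilly', limit=2)
followers_result = get_followers('timoreilly')

print(friends_result)
print(followers_result)
```

The two partial objects differ only in the arguments bound up front, which is the same idea getFriends and getFollowers rely on.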

Note

It is highly recommended that you enable the appendonly option in redis.conf to protect against data loss; see the “Append Only File HOWTO” for helpful details.

Consider the following output, relating to Tim O’Reilly’s network of followers. Keeping in mind that there’s a rate limit of 350 OAuth requests per hour, you could expect this code to take a little less than an hour to run, because approximately 300 API calls would need to be made to collect all the follower ID values:

timoreilly is following 663
timoreilly is being followed by 1,423,704
131 of 663 are not following timoreilly back
1,423,172 of 1,423,704 are not being followed back by timoreilly
timoreilly has 532 mutual friends
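A quick way to sanity check numbers like these is to remember that the two differences and the intersection partition each set: 131 + 532 gives the 663 friends, and 1,423,172 + 532 gives the 1,423,704 followers. The sketch below mirrors the Redis set operations with ordinary Python sets and a handful of made-up ID values, so it runs on a current Python 3 interpreter without a Redis server; the data is purely illustrative.

```python
# Made-up ID values standing in for the contents of the two Redis sets
friends = {1, 2, 3, 4, 5}
followers = {4, 5, 6, 7, 8, 9}

n_friends = len(friends)                  # analogous to SCARD
n_followers = len(followers)

not_following_back = friends - followers  # analogous to SDIFF friends followers
not_followed_back = followers - friends   # analogous to SDIFF followers friends
mutual = friends & followers              # analogous to SINTER

# Each set splits cleanly into "one-way" members and mutual members
friends_check = len(not_following_back) + len(mutual) == n_friends
followers_check = len(not_followed_back) + len(mutual) == n_followers

print(sorted(not_following_back), sorted(not_followed_back), sorted(mutual))
print(friends_check, followers_check)
```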

Note that while you could choose to settle for harvesting a smaller number of followers to avoid the rate limit-imposed wait, the API documentation does not state that taking the first N pages’ worth of data would yield a truly random sample, and it appears that data is returned in reverse chronological order. As a result, you may not be able to extrapolate in a predictable way if your logic depends on a random sample. For example, if the first 10,000 followers returned just so happened to contain the 532 mutual friends, extrapolation from those points would result in a skewed analysis because these results are not at all representative of the larger population. For a very popular Twitterer such as Britney Spears, with well over 5,000,000 followers, somewhere in the neighborhood of 1,000 API calls would be required to fetch all of the followers over approximately a four-hour period. In general, the wait is probably worth it for this kind of data, and you could use the Twitter streaming APIs to keep your data up-to-date so that you never have to go through the entire ordeal again.
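The call counts quoted here follow from simple arithmetic, sketched below for a current Python 3 interpreter. The page size of 5,000 IDs per request is an assumption (it is consistent with the figures in the text but is not stated there), and the elapsed-time figure is only a lower bound because it ignores request latency, retries, and how the rate-limit window resets.

```python
import math

IDS_PER_CALL = 5000     # assumption: IDs returned per friends/followers request
CALLS_PER_HOUR = 350    # OAuth rate limit quoted in the text


def calls_needed(n_ids, per_call=IDS_PER_CALL):
    """Number of requests needed to page through n_ids values."""
    return math.ceil(n_ids / per_call)


def min_hours(n_calls, per_hour=CALLS_PER_HOUR):
    """Lower bound on elapsed hours; ignores latency and retries."""
    return n_calls / per_hour


tim_calls = calls_needed(1423704)
tim_hours = min_hours(tim_calls)
britney_calls = calls_needed(5000000)

print(tim_calls, round(tim_hours, 2), britney_calls)
```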

Warning

One common source of error for some kinds of analysis is to forget about the overall size of a population relative to your sample. For example, randomly sampling 10,000 of Tim O’Reilly’s friends and followers would actually give you the full population of his friends, yet only a tiny fraction of his followers. Depending on the sophistication of your analysis, the sample size relative to the overall size of a population can make a difference in determining whether the outcome of an experiment is statistically significant, and the level of confidence you can have about it.

Given even these basic friend/follower stats, a couple of questions that lead us toward other interesting analyses naturally follow. For example, who are the 131 people who are not following Tim O’Reilly back? Given the various possibilities that could be considered about friends and followers, the “Who isn’t following me back?” question is one of the more interesting ones and arguably can provide a lot of insight about a person’s interests. So, how can we answer this question?

Staring at a list of user IDs isn’t very riveting, so resolving those user IDs to actual user objects is the first obvious step. Example 4-5 extends Example 4-4 by encapsulating common error-handling code into reusable form. It also provides a function that demonstrates how to resolve those ID values to screen names using the /users/lookup API, which accepts a list of up to 100 user IDs or screen names and returns the same basic user information that you saw earlier with /users/show.

Example 4-5. Resolving basic user information such as screen names from IDs (friends_followers__get_user_info.py)

# -*- coding: utf-8 -*-

import sys
import json
import redis
from twitter__login import login

# A makeTwitterRequest call through to the /users/lookup 
# resource, which accepts a comma separated list of up 
# to 100 screen names. Details are fairly uninteresting. 
# See also http://dev.twitter.com/doc/get/users/lookup
from twitter__util import getUserInfo

if __name__ == "__main__":
    screen_names = sys.argv[1:]

    t = login()
    r = redis.Redis()

    print json.dumps(
            getUserInfo(t, r, screen_names=screen_names),
            indent=4
          )

Although not reproduced in its entirety, the getUserInfo function that’s imported from twitter__util is essentially just a makeTwitterRequest to the /users/lookup resource using a list of screen names. The following snippet demonstrates:

def getUserInfo(t, r, screen_names):
    info = []
    response = makeTwitterRequest(t, 
                                  t.users.lookup,
                                  screen_name=','.join(screen_names)
                                 )

    for user_info in response:
        r.set(getRedisIdByScreenName(user_info['screen_name'], 'info.json'),
              json.dumps(user_info))
        r.set(getRedisIdByUserId(user_info['id'], 'info.json'), 
              json.dumps(user_info))

    info.extend(response)

    return info

It’s worthwhile to note that getUserInfo stores the same user information under two different keys: the user ID and the screen name. Storing both of these keys allows us to easily look up a screen name given a user ID value and a user ID value given a screen name. Translating a user ID value to a screen name is a particularly useful operation since the social graph APIs for getting friends and followers return only ID values, which have no intuitive value until they are resolved against screen names and other basic user information. While there is redundant storage involved in this scheme, compared to other approaches, the convenience is arguably worth it. Feel free to take a leaner approach if storage is a concern.
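To make the two-key scheme concrete, here is a minimal sketch that uses a plain dictionary in place of Redis so that it runs anywhere on a current Python 3 interpreter. The key formats and the helper names (redis_id_by_screen_name, redis_id_by_user_id, save_user_info, and the two lookup functions) are assumptions made for illustration; the real helpers in twitter__util may build their keys differently.

```python
import json


def redis_id_by_screen_name(screen_name, key_name):
    # Hypothetical key format, not necessarily the one twitter__util uses
    return 'screen_name$' + screen_name + '$' + key_name


def redis_id_by_user_id(user_id, key_name):
    # Hypothetical key format, not necessarily the one twitter__util uses
    return 'user_id$' + str(user_id) + '$' + key_name


def save_user_info(store, user_info):
    """Store the same JSON blob under both the screen name and the user ID."""
    blob = json.dumps(user_info)
    store[redis_id_by_screen_name(user_info['screen_name'], 'info.json')] = blob
    store[redis_id_by_user_id(user_info['id'], 'info.json')] = blob


def screen_name_for(store, user_id):
    key = redis_id_by_user_id(user_id, 'info.json')
    return json.loads(store[key])['screen_name']


def user_id_for(store, screen_name):
    key = redis_id_by_screen_name(screen_name, 'info.json')
    return json.loads(store[key])['id']


store = {}  # a dict standing in for Redis string keys
save_user_info(store, {'id': 2384071, 'screen_name': 'timoreilly'})

print(screen_name_for(store, 2384071))
print(user_id_for(store, 'timoreilly'))
```

The cost of the redundancy is visible here: every user object is serialized into two entries, which is the trade-off to weigh if storage is tight.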

An example user information object for Tim O’Reilly follows in Example 4-6, illustrating the kind of information available about Twitterers. The sky is the limit with what you can do with data that’s this rich. We won’t mine the user descriptions and tweets of the folks who aren’t following Tim back and put them in print, but you should have enough to work with should you wish to conduct that kind of analysis.

Example 4-6. Example user object represented as JSON data for Tim O’Reilly

{
    "id": 2384071,
    "verified": true,
    "profile_sidebar_fill_color": "e0ff92",
    "profile_text_color": "000000",
    "followers_count": 1423326,
    "protected": false,
    "location": "Sebastopol, CA",
    "profile_background_color": "9ae4e8",
    "status": {
        "favorited": false,
        "contributors": null,
        "truncated": false,
        "text": "AWESOME!! RT @adafruit: a little girl asks after seeing adafruit ...",
        "created_at": "Sun May 30 00:56:33 +0000 2010",
        "coordinates": null,
        "source": "<a href=\"http://www.seesmic.com/\" rel=\"nofollow\">Seesmic</a>",
        "in_reply_to_status_id": null,
        "in_reply_to_screen_name": null,
        "in_reply_to_user_id": null,
        "place": null,
        "geo": null,
        "id": 15008936780
    },
    "utc_offset": -28800,
    "statuses_count": 11220,
    "description": "Founder and CEO, O'Reilly Media. Watching the alpha geeks...",
    "friends_count": 662,
    "profile_link_color": "0000ff",
    "profile_image_url": "http://a1.twimg.com/profile_images/941827802/IMG_...jpg",
    "notifications": false,
    "geo_enabled": true,
    "profile_background_image_url": "http://a1.twimg.com/profile_background_...gif",
    "name": "Tim O'Reilly",
    "lang": "en",
    "profile_background_tile": false,
    "favourites_count": 10,
    "screen_name": "timoreilly",
    "url": "http://radar.oreilly.com",
    "created_at": "Tue Mar 27 01:14:05 +0000 2007",
    "contributors_enabled": false,
    "time_zone": "Pacific Time (US & Canada)",
    "profile_sidebar_border_color": "87bc44",
    "following": false
}

The refactored logic for handling HTTP errors and obtaining user information in batches is provided in the following sections. Note that the handleTwitterHTTPError function intentionally doesn’t include error handling for every conceivable error case, because the action you may want to take will vary from situation to situation. For example, in the event of a urllib2.URLError (operation timed out) that is triggered because someone unplugged your network cable, you might want to prompt the user for a specific course of action rather than retry automatically.

Example 4-5 brings to light some good news and some not-so-good news. The good news is that resolving the user IDs to user objects yields a treasure trove of information, including a byline, location information, the latest tweet, and more. The not-so-good news is that it’s quite expensive to do this in terms of rate limiting, given that you can only get data back in batches of 100. For Tim O’Reilly’s friends, that’s only seven API calls. For his followers, however, it’s over 14,000, which would take nearly two days to collect, given a rate limit of 350 calls per hour (and no glitches in harvesting).
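The batching itself is straightforward. The sketch below, written for a current Python 3 interpreter, splits a list of IDs into groups of at most 100 and reproduces the call counts mentioned above. The chunks helper and the fabricated ID values are illustrative only, and joining each batch into a comma-separated string is an assumption about how you would pass the IDs to /users/lookup.

```python
import math

BATCH_SIZE = 100        # /users/lookup limit mentioned in the text
CALLS_PER_HOUR = 350    # rate limit quoted in the text


def chunks(items, size=BATCH_SIZE):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


friend_ids = list(range(663))   # fabricated IDs standing in for real ones
batches = list(chunks(friend_ids))

# Each batch would be sent as one comma-separated request parameter
params = [','.join(str(user_id) for user_id in batch) for batch in batches]

follower_calls = math.ceil(1423704 / BATCH_SIZE)
follower_hours = follower_calls / CALLS_PER_HOUR

print(len(batches), len(batches[-1]))
print(follower_calls, round(follower_hours, 1))
```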

However, given a full collection of anyone’s friends and followers ID values, you can randomly sample and calculate measures of statistical significance to your heart’s content. Redis provides the srandmember function that fits the bill perfectly. You pass it the name of a set, such as timoreilly$follower_ids, and it returns a random member of that set.
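As a rough illustration of why a random sample can stand in for the full population, the following sketch estimates the fraction of followers who are mutual friends from a sample. It uses Python's random module on fabricated ID values rather than a live Redis connection, so it is only a stand-in for repeated srandmember calls, and it draws without replacement. The population size, the sample size, and the seed are arbitrary choices made for this example.

```python
import random

# Fabricated population: 10,000 follower IDs, of which 500 are mutual friends
follower_ids = set(range(1, 10001))
mutual_ids = set(range(1, 501))

rng = random.Random(42)  # fixed seed so that repeated runs agree

# Sample without replacement; sorted() gives random.sample a sequence
sample = rng.sample(sorted(follower_ids), 1000)

estimated_fraction = sum(1 for uid in sample if uid in mutual_ids) / len(sample)
true_fraction = len(mutual_ids) / len(follower_ids)

print(estimated_fraction, true_fraction)
```

With a genuinely random sample the estimate should land close to the true fraction; a sample taken from the first N pages of results carries no such guarantee.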