When someone shares information via a service such as Twitter,
it’s only natural to wonder how far the information penetrates into the
overall network by means of being retweeted. It should be fair to assume
that the more followers a person has, the greater the potential is for
that person’s tweets to be retweeted. Users who have a relatively high
overall percentage of their originally authored tweets retweeted can be
said to be more influential than users who are retweeted infrequently.
Users who have a relatively high percentage of their tweets retweeted,
even if they are not originally authored, might be said to be
mavens—people who are exceptionally well connected
and like to share information.[26]

One trivial way to measure the relative influence of two
or more users is to simply compare their number of followers, since
every follower will have a direct view of their tweets. We already know
from Example 4-6 that we can get the number of
followers (and friends) for a user via the /users/lookup
and /users/show APIs. Extracting that information from
these APIs is trivial enough:
for screen_name in screen_names:
    _json = json.loads(r.get(getRedisIdByScreenName(screen_name, "info.json")))
    n_friends, n_followers = _json['friends_count'], _json['followers_count']
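As a minimal sketch of that comparison, the snippet below ranks a handful of users by follower count. It assumes the user info has already been loaded into plain dictionaries shaped like the stored info.json records; the screen names, the counts, and the helper name rank_by_followers are invented for illustration and are not part of the book's utility code.

```python
# Hypothetical user-info records, shaped like the JSON stored under "info.json".
# The screen names and counts are invented for illustration only.
user_info = {
    'alice': {'friends_count': 120, 'followers_count': 4500},
    'bob': {'friends_count': 900, 'followers_count': 310},
    'carol': {'friends_count': 45, 'followers_count': 12000},
}


def rank_by_followers(info_by_screen_name):
    """Return (screen_name, followers_count) pairs, most-followed first."""
    pairs = [(name, info['followers_count'])
             for name, info in info_by_screen_name.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_followers(user_info)
for name, n_followers in ranking:
    print('%s has %d followers' % (name, n_followers))
```

A ranking like this only captures direct reach, which is why the discussion that follows looks further out into the network.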
Counting numbers of followers is interesting, but there’s so much more that can be done. For example, a given user may not have the popularity of an information maven like Tim O’Reilly, but if you have him as a follower and he retweets you, you’ve suddenly tapped into a vast network of people who might just start to follow you once they’ve determined that you’re also interesting. Thus, a much better approach that you might take in calculating users’ potential influence is to not only compare their numbers of followers, but to spider out into the network a couple of levels. In fact, we can use the very same breadth-first approach that was introduced in Example 2-4.
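To make the idea of spidering out a couple of levels concrete, here is a small sketch of a depth-limited breadth-first traversal. It works on an in-memory dictionary rather than the Twitter APIs; the followers_of graph and the function name crawl_breadth_first are hypothetical stand-ins, so treat it as an illustration of the technique rather than the crawler itself.

```python
def crawl_breadth_first(graph, seeds, depth=2):
    """Return the set of nodes reachable from seeds within `depth` hops.

    graph maps a screen name to the list of its followers. It is a
    hypothetical in-memory stand-in for calls to the social graph APIs.
    """
    seen = set(seeds)
    queue = list(seeds)
    for _ in range(depth):
        next_queue = []
        for name in queue:
            for follower in graph.get(name, []):
                if follower not in seen:
                    seen.add(follower)
                    next_queue.append(follower)
        # Everything discovered at this level is expanded at the next one
        queue = next_queue
    return seen


# Invented example graph: 'you' are followed by a well-connected 'maven'
followers_of = {
    'you': ['maven', 'friend'],
    'maven': ['fan1', 'fan2', 'fan3'],
    'friend': ['fan1'],
    'fan1': ['stranger'],
}

one_hop = crawl_breadth_first(followers_of, ['you'], depth=1)
two_hops = crawl_breadth_first(followers_of, ['you'], depth=2)
print(len(one_hop) - 1)   # people reached directly (excluding 'you')
print(len(two_hops) - 1)  # people reached within two levels
```

Even in this toy graph, a single well-connected follower accounts for most of the second-level reach.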
Example 4-8 illustrates a
generalized crawl function that
accepts a list of screen names, a crawl depth, and parameters that
control how many friends and followers to retrieve. The friends_limit and followers_limit parameters control how many
items to fetch from the social graph APIs (in batches of 5,000), while
friends_sample and followers_sample control how many user objects
to retrieve (in batches of 100). An updated function for getUserInfo is also included to reflect the
pass-through of the sampling parameters.
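The interplay between sampling and batching can be sketched as follows. This is not the getUserInfo implementation, only an illustration under the assumption that a sample value between 0.0 and 1.0 selects a random fraction of IDs, which are then looked up 100 at a time; the helper names sample_ids and batches are hypothetical.

```python
import random


def sample_ids(ids, sample=1.0, rng=random):
    """Keep roughly a `sample` fraction (0.0 to 1.0) of ids, chosen at random."""
    if sample >= 1.0:
        return list(ids)
    k = int(len(ids) * sample)
    return rng.sample(list(ids), k)


def batches(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


follower_ids = list(range(1, 1251))            # 1,250 made-up user IDs
chosen = sample_ids(follower_ids, sample=0.2)  # about 20% of them
lookup_batches = list(batches(chosen, 100))    # at most 100 IDs per lookup request
print('%d ids sampled, %d lookup requests' % (len(chosen), len(lookup_batches)))
```

Sampling before batching keeps the number of user lookup requests proportional to the sample size rather than to the full list of IDs.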
Example 4-8. Crawling friends/followers connections (friends_followers__crawl.py)
# -*- coding: utf-8 -*-

import sys
import redis
import functools
from twitter__login import login
from twitter__util import getUserInfo
from twitter__util import _getFriendsOrFollowersUsingFunc

t = login()
r = redis.Redis()

# Some wrappers around _getFriendsOrFollowersUsingFunc that
# create convenience functions

getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,
                               t.friends.ids, 'friend_ids', t, r)
getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,
                                 t.followers.ids, 'follower_ids', t, r)

def crawl(
    screen_names,
    friends_limit=10000,
    followers_limit=10000,
    depth=1,
    friends_sample=0.2, #XXX
    followers_sample=0.0,
    ):

    getUserInfo(t, r, screen_names=screen_names)
    for screen_name in screen_names:
        friend_ids = getFriends(screen_name, limit=friends_limit)
        follower_ids = getFollowers(screen_name, limit=followers_limit)

        friends_info = getUserInfo(t, r, user_ids=friend_ids,
                                   sample=friends_sample)

        followers_info = getUserInfo(t, r, user_ids=follower_ids,
                                     sample=followers_sample)

        next_queue = [u['screen_name'] for u in friends_info + followers_info]

        d = 1
        while d < depth:
            d += 1
            (queue, next_queue) = (next_queue, [])
            for _screen_name in queue:
                friend_ids = getFriends(_screen_name, limit=friends_limit)
                follower_ids = getFollowers(_screen_name, limit=followers_limit)

                next_queue.extend(friend_ids + follower_ids)

                # Note that this function takes a kw between 0.0 and 1.0 called
                # sample that allows you to crawl only a random sample of nodes
                # at any given level of the graph

                getUserInfo(t, r, user_ids=next_queue)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "Please supply at least one screen name."
    else:
        # Read the screen name only after checking that one was supplied
        SCREEN_NAME = sys.argv[1]
        crawl([SCREEN_NAME])

        # The data is now in the system. Do something interesting. For example,
        # find someone's most popular followers as an indicator of potential influence.
        # See friends_followers__calculate_avg_influence_of_followers.py
Assuming you’ve run crawl with
high enough numbers for friends_limit
and followers_limit to get all of a
user’s friend IDs and follower IDs, all that remains is to take a large
enough random sample and calculate interesting metrics, such as the
average number of followers one level out. It could also be fun to look
at the user’s top N followers to get an idea of who that user
might be influencing. Example 4-9
demonstrates one possible approach that pulls the data out of Redis and
calculates Tim O’Reilly’s most popular followers.
Example 4-9. Calculating a Twitterer’s most popular followers (friends_followers__calculate_avg_influence_of_followers.py)
# -*- coding: utf-8 -*-

import sys
import json
import locale
import redis
from prettytable import PrettyTable

# Pretty printing numbers
from twitter__util import pp

# These functions create consistent keys from
# screen names and user id values
from twitter__util import getRedisIdByScreenName
from twitter__util import getRedisIdByUserId

SCREEN_NAME = sys.argv[1]

# Use the user's default locale so that pp formats numbers with separators
locale.setlocale(locale.LC_ALL, '')

def calculate():
    r = redis.Redis()  # Default connection settings on localhost

    follower_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME,
                        'follower_ids')))

    followers = r.mget([getRedisIdByUserId(follower_id, 'info.json')
                       for follower_id in follower_ids])
    followers = [json.loads(f) for f in followers if f is not None]

    # Map each follower count to the list of users who have that many followers
    freqs = {}
    for f in followers:
        cnt = f['followers_count']
        if cnt not in freqs:
            freqs[cnt] = []

        freqs[cnt].append({'screen_name': f['screen_name'], 'user_id': f['id']})

    # It could take a few minutes to calculate freqs, so store a snapshot for later use

    r.set(getRedisIdByScreenName(SCREEN_NAME, 'follower_freqs'),
          json.dumps(freqs))

    keys = sorted(freqs.keys())

    print 'The top 10 followers from the sample:'

    # Column labels: a follower's screen name and that follower's own follower count
    fields = ['Screen name', 'Followers']
    pt = PrettyTable(fields=fields)
    for f in fields:
        pt.set_field_align(f, 'l')

    for (user, freq) in reversed([(user['screen_name'], k) for k in keys[-10:]
                                 for user in freqs[k]]):
        pt.add_row([user, pp(freq)])

    pt.printt()

    # One entry per follower, so that the average is weighted correctly
    all_freqs = [k for k in keys for user in freqs[k]]
    avg = sum(all_freqs) / len(all_freqs)

    print "\nThe average number of followers for %s's followers: %s" % (SCREEN_NAME, pp(avg))

# psyco can only compile functions, so wrap code in a function

try:
    import psyco
    psyco.bind(calculate)
except ImportError:
    pass  # psyco not installed

calculate()
In many common number-crunching situations, the psyco module can dynamically
compile code and produce dramatic speed improvements. It’s totally
optional but definitely worth a hard look if you’re performing
calculations that take more than a few seconds.
Output follows for a sample size of about 150,000 (approximately 10%) of Tim O’Reilly’s followers. For statistical analysis, a sample this large relative to the population gives a tiny margin of error at a very high confidence level.[27] That is, the results can be considered highly representative of the population, though they are still estimates rather than exact figures:
The top 10 followers from the sample:

aplusk          4,993,072
BarackObama     4,114,901
mashable        2,014,615
MarthaStewart   1,932,321
Schwarzenegger  1,705,177
zappos          1,689,289
Veronica        1,612,827
jack            1,592,004
stephenfry      1,531,813
davos           1,522,621

The average number of followers for timoreilly's followers: 445
Interestingly, a few familiar names show up on the list, including some of the most popular Twitterers of all time: Ashton Kutcher (@aplusk), Barack Obama, Martha Stewart, and Arnold Schwarzenegger, among others. Removing these top 10 followers and recalculating lowers the average number of followers of Tim’s followers to approximately 284. Removing any follower with fewer than 10 followers of their own, however, dramatically increases the average to more than 1,000! Noting that there are tens of thousands of followers in this range and briefly perusing their profiles does bring some reality into the situation: many of these users are spam accounts, users who are protecting their tweets, and so on. Culling both the top 10 followers and all followers with fewer than 10 followers of their own might be a reasonable metric to work with; doing both results in an average of around 800, which is still quite high. There must be something to be said for the idea of getting retweeted by a popular Twitterer who has lots of connections to other popular Twitterers.
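The effect of culling both ends of the distribution can be sketched with a few lines of code. The follower counts below are invented purely to show the mechanics, and the helper names average and trimmed_counts are hypothetical; the figures quoted above come from the actual sample, not from this snippet.

```python
def average(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return float(sum(values)) / len(values) if values else 0.0


def trimmed_counts(counts, drop_top=10, min_followers=10):
    """Drop the `drop_top` largest counts and any count below `min_followers`."""
    ordered = sorted(counts)
    kept = ordered[:-drop_top] if drop_top else ordered
    return [c for c in kept if c >= min_followers]


# Invented follower counts: a few near-empty accounts, some typical
# accounts, and two celebrity-sized outliers
counts = [0, 2, 5, 40, 60, 100, 200, 1000000, 5000000]

print(average(counts))
print(average(trimmed_counts(counts, drop_top=2, min_followers=10)))
```

Because a handful of extreme values at either end can dominate a plain average, comparing the raw and trimmed figures side by side gives a better feel for what a typical follower looks like.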
[25] Whenever Twitter goes over capacity, an HTTP 503 error is issued. In a browser, the error page displays an image of the now infamous “fail whale.” See http://twitter.com/503.
[26] See The Tipping Point by Malcolm Gladwell (Back Bay Books) for a great discourse on mavens.
[27] It’s about a 0.14 margin of error for a 99% confidence level.