Now that we’ve prepared you to work with regressions, this chapter’s case study will focus on using regression to predict the number of page views for the top 1,000 websites on the Internet as of 2011. The top five rows of this data set, which was provided to us by Neil Kodner, are shown in Table 5-3.
For our purposes, we’re going to work with only a subset of the
columns of this data set. We’ll focus on five columns:
Rank, PageViews, UniqueVisitors,
HasAdvertising, and IsEnglish.
The Rank column tells us the website’s position in
the top 1,000 list. As you can see, Facebook is the number one site in
this data set, and YouTube is the second. Rank is an
interesting sort of measurement because it’s an ordinal value in which
numbers are used not for their true values, but simply for their order.
One way to see that the values themselves don’t matter is to recognize that
there’s no real answer to questions like, “What’s the 1.578th website in
this list?” This sort of question would have an answer if the numbers being used
were cardinal values. Another way to emphasize this distinction is to
note that we could replace the ranks 1, 2, 3, and 4 with A, B, C, and D
and not lose any information.
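
The relabeling argument can be checked with a small sketch. The Python snippet below is only an illustration and is not part of the case study's own analysis; the helper name `relabel` and the sample rank list are assumptions made up for this example. It swaps numeric ranks for letters and confirms that the ordering of the items is unchanged.

```python
from string import ascii_uppercase

def relabel(ranks):
    """Map each rank to a letter that keeps the same sort order (up to 26 ranks)."""
    ordered = sorted(set(ranks))
    mapping = {rank: ascii_uppercase[i] for i, rank in enumerate(ordered)}
    return [mapping[rank] for rank in ranks]

ranks = [3, 1, 4, 2]       # hypothetical ranks, listed in arbitrary order
labels = relabel(ranks)    # letters standing in for the numeric ranks

# Sorting the positions by rank or by letter yields the same ordering,
# so the letters carry all the information the ranks did.
by_rank = sorted(range(len(ranks)), key=lambda i: ranks[i])
by_label = sorted(range(len(labels)), key=lambda i: labels[i])
print(labels, by_rank == by_label)
```

Because only the order survives the relabeling, operations that depend on the numeric values, such as averaging two ranks, have no counterpart for the letters, which is exactly what separates ordinal values from cardinal ones.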