Collecting data can be a lot of fun, but if you have a good idea for an algorithm or want to try something out, finding data can be a pain. This appendix contains a collection of links to known datasets. These sets range in size from 20 lines to trillions of lines, so you should have no problem finding a dataset to meet your needs:
http://archive.ics.uci.edu/ml/—The best-known source of datasets for machine learning is the repository at the University of California at Irvine. We used fewer than 10 datasets in this book, but there are more than 200 datasets in this repository. Many of these datasets serve as benchmarks, so that researchers can compare the performance of algorithms objectively.
http://aws.amazon.com/publicdatasets/—If you’re a big data cowboy, then this is the link for you. Amazon has some really big datasets, including the U.S. census data, the annotated human genome data, a 150 GB log of Wikipedia’s page traffic, and a 500 GB database of Wikipedia’s link data.
http://www.data.gov—Data.gov is a website launched in 2009 to increase the public’s access to government datasets. The site was intended to make all government data public as long as the data was not private or restricted for security reasons. In 2010, the site had over 250,000 datasets. It’s uncertain how long the site will remain active. In 2011, the federal government reduced funding for the Electronic Government Fund, which pays for Data.gov. The datasets range from a list of recalled products to a list of failed banks.
http://www.data.gov/opendatasites—Data.gov keeps a list of U.S. states, cities, and countries that host similar open data sites.
http://www.infochimps.com/—Infochimps is a company that aims to give everyone access to every dataset in the world. Currently, they have more than 14,000 datasets available to download. Unlike other listed sites, some of the datasets on Infochimps are for sale. You can sell your own datasets here as well.
http://www.datawrangling.com/some-datasets-available-on-the-web—Data Wrangling is a private blog with a large number of links to various data sources on the internet. It’s a bit dated, but many of the links are still good.
http://metaoptimize.com/qa/questions/—This isn’t a data source but a question-and-answer site focused on machine learning. Many practitioners there are willing to help out.
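Once you've downloaded a file from one of the data sources above, the first step is usually to read it into memory. The following is a minimal sketch, not a loader for any particular dataset. It assumes a comma-separated text file in which every row holds numeric features followed by a class label in the last column; many files don't follow that layout, so check the dataset's description first. The function name load_dataset and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io


def load_dataset(lines):
    """Parse comma-separated rows into (features, labels).

    Assumption: each non-empty row holds numeric features followed by
    a class label in the last column. Adjust this for the file you
    actually downloaded.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels


# Made-up sample rows standing in for a downloaded file.
sample = io.StringIO("5.1,3.5,classA\n\n6.2,2.9,classB\n")
features, labels = load_dataset(sample)
print(len(features), labels)
```

To read a real file, pass an open file object instead of the in-memory sample, for example `with open(path, newline="") as f: load_dataset(f)`, where path is wherever you saved the download.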