The Data Science Pipeline: Acquisition
As a data scientist on the job, you will often be given a data set and a problem to solve. In these situations, obtaining data might not seem like a high priority. However, using external sources in addition to the original data can be a critical source of leverage. For example, if you want to predict a company's customer
The difficulty of obtaining useful data ranges from trivial (your supervisor emails you a file) to epic (years-long clinical trials). An important part of becoming a seasoned data scientist is developing a sense for when the cost of obtaining data will lead to a commensurate problem-solving payoff. Developing your knowledge of useful and readily accessible data sets helps reduce that cost, so in this section we will get you started by making some concrete suggestions for data sources.
- R packages. Many classic datasets are available as packages in R. Of particular note is the package fivethirtyeight, which includes data for more than 100 articles from the popular data journalism outfit FiveThirtyEight. You can use R from within Python using the package rpy2.
- Kaggle. The data science contest website Kaggle has about 120,000 public data sets.
- Data.gov. A database of over 200,000 open data sets shared by the U.S. Government. See also: data.gov.uk.
- UC Irvine Machine Learning Repository. About 480 datasets, hosted by UC Irvine as a service to the machine learning community.
- Academic Torrents. Datasets from academic papers. Includes a particularly well-known dataset for natural language processing: Enron senior management emails.
- Quandl. A mixture of free and paid financial datasets.
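
To illustrate the earlier point about using external data alongside the data set you were given, here is a minimal sketch that joins two small tables on a shared key. Everything in it is an assumption made for illustration: the table contents, the column names (`region`, `customers`, `median_income`), and the helper functions (`read_rows`, `left_join`) are hypothetical. In practice each table would be read from a file obtained from one of the sources above.

```python
import csv
import io

# Hypothetical stand-ins for two CSV files: the data set you were given
# and an external table obtained from some other source. The values are
# made up purely to show the mechanics of the join.
ORIGINAL_CSV = """region,customers
north,120
south,95
west,40
"""

EXTERNAL_CSV = """region,median_income
north,52000
south,47000
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def left_join(primary, external, key):
    """Attach the external columns to each primary row that shares `key`.

    Primary rows with no match are kept; their external columns are None.
    """
    lookup = {row[key]: row for row in external}
    extra_columns = [c for c in external[0] if c != key] if external else []
    joined = []
    for row in primary:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for column in extra_columns:
            merged[column] = match.get(column)
        joined.append(merged)
    return joined


enriched = left_join(read_rows(ORIGINAL_CSV), read_rows(EXTERNAL_CSV), "region")
for row in enriched:
    print(row)
```

A left join is used here so that no row of the original data is lost when the external source has no matching entry. Those rows simply carry a missing value, which you can then decide how to handle.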