1. FiveThirtyEight
FiveThirtyEight is a popular data journalism site. What you may not know is that it also makes the data sets used in its articles available online on GitHub and on its own data portal.
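Most of FiveThirtyEight’s data sets can be read straight into pandas from the raw GitHub URLs. A minimal sketch, assuming you have pandas installed; the airline-safety path is one example file, so browse the repository for current paths:

```python
import pandas as pd

# Read a FiveThirtyEight data set straight from the raw GitHub URL.
# The airline-safety path is one example file; browse the repo for others.
url = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/airline-safety/airline-safety.csv"
)
df = pd.read_csv(url)
print(df.head())
```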
2. BuzzFeed
BuzzFeed may have started as a purveyor of low-quality clickbait, but these days it also does high-quality data journalism. And, much like FiveThirtyEight, it publishes some of its data sets publicly on its GitHub page.
3. ProPublica
ProPublica is a nonprofit investigative reporting outlet that publishes data journalism focused on issues of public interest, primarily in the US. They maintain a data store that hosts quite a few free data sets in addition to some paid ones (scroll down on that page to get past the paid ones). Many of them are actively maintained and frequently updated. ProPublica also offers five data-related APIs, four of which are accessible for free.
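As a sketch of how the free APIs work: the Congress API issues a free key that you send in an X-API-Key header on every request. The endpoint below follows ProPublica’s documented pattern, but treat the details as assumptions and check their API docs:

```python
import requests

# Hypothetical call to the ProPublica Congress API. You register for a
# free key and send it in the X-API-Key header on every request.
API_KEY = "your-api-key-here"  # assumption: obtained from ProPublica

url = "https://api.propublica.org/congress/v1/116/senate/members.json"
resp = requests.get(url, headers={"X-API-Key": API_KEY})
resp.raise_for_status()

members = resp.json()["results"][0]["members"]
print(len(members), "senators returned")
```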
4. Socrata OpenData
Socrata OpenData is a portal that hosts a large number of data sets that can be explored in the browser or downloaded for visualization and analysis. The offerings here are less well curated, so you’ll have to sort through what’s available to find data that’s clean and up-to-date, but the ability to look at the data in table form right in the browser is very helpful, and it has some built-in visualization tools as well.
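Every Socrata data set is also exposed through the SODA API as plain JSON, which makes it easy to pull rows programmatically. A minimal sketch; the domain and dataset id below are placeholders for whatever data set you find on the portal:

```python
import requests

# Every Socrata data set is served as JSON at /resource/<dataset-id>.json.
# Both values below are placeholders for a data set you find on the portal.
domain = "data.cityofchicago.org"  # example Socrata-hosted portal
dataset_id = "xxxx-xxxx"           # hypothetical dataset id

url = f"https://{domain}/resource/{dataset_id}.json"
rows = requests.get(url, params={"$limit": 100}).json()
print(rows[:2])
```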
Data Sets for Data Processing Projects
Sometimes you just want to work with a large data set. The end result doesn’t matter as much as the process of reading in and analyzing the data. You might use tools like Spark (which you can learn in our Spark course) or Hadoop to distribute the processing across multiple nodes; a minimal PySpark sketch follows the list below. Things to keep in mind when looking for a good data processing data set:
- The cleaner the data, the better — cleaning a large data set can be very time consuming.
- There should be an interesting question that can be answered with the data.
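Here’s the PySpark sketch mentioned above: read a large CSV and run a distributed aggregation. It assumes a local Spark installation, and events.csv and its event_type column are placeholders for your own data:

```python
from pyspark.sql import SparkSession

# Read a large CSV and run a distributed aggregation. "events.csv" and
# its "event_type" column are placeholders for your own data.
spark = SparkSession.builder.appName("processing-example").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().orderBy("count", ascending=False).show(10)

spark.stop()
```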
Cloud hosting providers like Amazon and Google are good places to find big data sets. They have an incentive to host them, because analyzing that data on their infrastructure means paying them for the compute.
5. AWS Public Data Sets
Amazon makes large data sets available on its Amazon Web Services platform. You can download the data and work with it on your own computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works here, and check out the data sets for yourself here (although you’ll need a free AWS account first).
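Many of the AWS public data sets live in S3 buckets that allow anonymous reads, so you can explore them without spinning up EC2. A sketch using boto3 with unsigned requests; the NOAA bucket name is one example, so check each data set’s documentation for its actual location:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) S3 client: no AWS credentials needed for buckets
# that allow public reads. The NOAA bucket name is one example; check
# each data set's documentation for its actual location.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

listing = s3.list_objects_v2(Bucket="noaa-ghcn-pds", MaxKeys=5)
for obj in listing["Contents"]:
    print(obj["Key"], obj["Size"])
```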
6. Google Public Data Sets
Google also has a cloud hosting service, called Google Cloud. With Google Cloud, you can use a tool called BigQuery to explore large data sets. Google lists all of the data sets on this page. You’ll need to sign up for a Google Cloud account to see them, but the first 1TB of queries you make each month is free, so as long as you’re careful, you won’t have to pay anything. (A short BigQuery sketch follows the examples below.)
Here are some examples:
- USA Names — contains all Social Security name applications in the US, from 1879 to 2015.
- GitHub Activity — contains all public activity on over 2.8 million public GitHub repositories.
- Historical Weather — data from 9000 NOAA weather stations from 1929 to 2016.
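As promised above, here’s what a BigQuery query looks like from Python, using the USA Names table from the examples and the google-cloud-bigquery client. It assumes you’ve created a Google Cloud project and authenticated locally:

```python
from google.cloud import bigquery

# Top ten names in the public USA Names table. Assumes you've created a
# Google Cloud project and authenticated locally (gcloud auth).
client = bigquery.Client()

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.name, row.total)
```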
7. Wikipedia
Wikipedia is a free, online, community-edited encyclopedia. It covers an astonishing breadth of knowledge, with pages on everything from the Ottoman-Habsburg Wars to Leonard Nimoy. As part of Wikipedia’s commitment to advancing knowledge, all of its content is offered for free, and the site regularly generates dumps of every article. Additionally, Wikipedia offers edit history and activity data, so you can track how a page on a topic evolves over time, and who contributes to it.
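The full dumps are large downloads, but the MediaWiki API is a quick way to sample the edit history of a single page. A minimal sketch using requests:

```python
import requests

# Fetch the ten most recent revisions of one article via the MediaWiki API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "Leonard Nimoy",
        "rvlimit": 10,
        "rvprop": "timestamp|user|comment",
        "format": "json",
    },
)

for page in resp.json()["query"]["pages"].values():
    for rev in page["revisions"]:
        print(rev["timestamp"], rev["user"])
```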
Data Sets for Machine Learning Projects
When you’re working on a machine learning project, you want to be able to predict a column using information from the other columns of a data set. In order to do this, you need to make sure that:
- The data set isn’t too messy — if it is, you’ll spend all of your time cleaning the data.
- There’s an interesting target column to make predictions for.
- The other variables have some explanatory power for the target column.
There are a few online repositories of data sets curated specifically for machine learning. These data sets are typically cleaned up beforehand, and allow for testing algorithms very quickly.
8. Kaggle
Kaggle is a data science community that hosts machine learning competitions, along with a variety of interesting, externally contributed data sets. Kaggle has both live and historical competitions. You can download data for either, but you have to sign up for Kaggle and accept the terms of service for the competition.
You can download data from Kaggle by entering a competition. Each competition has its own associated data set. There are also user-contributed data sets available here, though these may be less well cleaned and curated than the data sets used for competitions. (A sketch of downloading competition data with the official Kaggle API follows the examples below.)
Here are some examples:
- Satellite Photograph Order — a set of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.
- Manufacturing Process Failures — a collection of variables that were measured during the manufacturing process. The goal is to predict faults with the manufacturing.
- Multiple Choice Questions — a data set of multiple choice questions and the corresponding correct answers. The goal is to predict the answer for any given question.
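Here’s the Kaggle API sketch mentioned above. It assumes you’ve installed the kaggle package and saved an API token to ~/.kaggle/kaggle.json; "titanic" is just a well-known example competition slug:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

# Assumes `pip install kaggle` and an API token saved to
# ~/.kaggle/kaggle.json. "titanic" is a well-known example competition.
api = KaggleApi()
api.authenticate()
api.competition_download_files("titanic", path="data/")
```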
9. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the oldest sources of data sets on the web. Although the data sets are user-contributed, and thus have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied. UCI is a great first stop when looking for interesting data sets, and its files can usually be loaded directly over HTTP (see the sketch after the examples below).
Here are some examples:
- Email spam — contains emails, along with a label of whether or not they’re spam.
- Wine classification — contains various attributes of 178 different wines.
- Solar flares — attributes of solar flares, useful for predicting characteristics of flares.
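As promised above, a minimal sketch loading the wine data set with pandas. The file has no header row, and the first column holds the class label:

```python
import pandas as pd

# Load the wine classification data set directly from the UCI archive.
# The file has no header row; the first column is the class label.
url = (
    "https://archive.ics.uci.edu/ml/"
    "machine-learning-databases/wine/wine.data"
)
wine = pd.read_csv(url, header=None)
print(wine.shape)  # (178, 14): one row per wine
```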
10. Quandl
Quandl is a repository of economic and financial data. Some of this information is free, but many data sets require purchase. Quandl is useful for building models to predict economic indicators or stock prices. Due to the large amount of available data, it’s possible to build a complex model that uses many data sets to predict values in another.
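A minimal sketch using the quandl Python package. FRED/GDP is one of the free series; the API key value is a placeholder (registering for a free key lifts the anonymous rate limits):

```python
import quandl

# FRED/GDP is one of the free series; a free API key lifts the
# anonymous rate limits.
quandl.ApiConfig.api_key = "your-api-key-here"  # placeholder

gdp = quandl.get("FRED/GDP")  # returns a pandas DataFrame
print(gdp.tail())
```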
Data Sets for Data Cleaning Projects
Sometimes, it can be very satisfying to take a data set spread across multiple files, clean it up, condense it all into a single file, and then do some analysis. In data cleaning projects, it can take hours of research to figure out what each column in the data set means. It may turn out that the data set you’re analyzing isn’t really suitable for what you’re trying to do, and you’ll need to start over.
That can be frustrating, but it’s a common part of every data science job, and it requires practice.
When looking for a good data set for a data cleaning project, you want it to:
- Be spread over multiple files.
- Have a lot of nuance, and many possible angles to take.
- Require a good amount of research to understand.
- Be as “real-world” as possible.
These types of data sets are typically found on websites that collect and aggregate data sets. These aggregators tend to have data sets from multiple sources, without much curation. In this case, that’s a good thing — too much curation gives us overly neat data sets that are hard to do extensive cleaning on.
11. data.world
Data.world is a user-driven data collection site (among other things) where you can search for, copy, analyze, and download data sets. You can also upload your own data to data.world and use it to collaborate with others.
The site includes some key tools that make working with data from the browser easier. You can write SQL queries within the site interface to explore data and join multiple data sets. They also have SDKs for R and Python that make it easier to acquire and work with data in your tool of choice (you might be interested in reading our tutorial on using the data.world Python SDK). All of the data is accessible from the main site, but you’ll need to create an account, log in, and then search for the data you’d like.
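A minimal sketch of the Python SDK mentioned above. It assumes you’ve run pip install datadotworld and configured an API token with dw configure; the dataset slug and table name are placeholders:

```python
import datadotworld as dw

# Assumes `pip install datadotworld` and an API token configured with
# `dw configure`. The dataset slug and table name are placeholders.
results = dw.query(
    "owner/dataset-id",                 # hypothetical dataset slug
    "SELECT * FROM some_table LIMIT 10",
)
print(results.dataframe)
```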
12. Data.gov
Data.gov is an aggregator of public data sets from a variety of US government agencies, part of a broader push toward more open government. Data can range from government budgets to school performance scores. Much of the data requires additional research, and it can sometimes be hard to figure out which data set is the “correct” version. Anyone can download the data, although some data sets will ask you to jump through additional hoops, like agreeing to a licensing agreement before downloading.
You can browse the data sets on Data.gov directly, without registering. You can browse by topic area, or search for a specific data set.
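Data.gov runs on CKAN, so its catalog can also be searched programmatically through the standard CKAN action API, with no account needed. A minimal sketch:

```python
import requests

# Search the Data.gov catalog through the standard CKAN action API.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "school performance", "rows": 5},
)
for result in resp.json()["result"]["results"]:
    print(result["title"])
```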
13. The World Bank
The World Bank is a global development organization that offers loans and advice to developing countries. The World Bank regularly funds programs in developing countries, then gathers data to monitor the success of these programs.
You can browse World Bank data sets directly, without registering. The data sets have many missing values (which is great for cleaning practice), and it sometimes takes several clicks to actually get to the data.
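The World Bank also exposes its indicators through a free JSON API, which is handy once you’ve identified a data set on the site. A minimal sketch; SP.POP.TOTL is the total-population indicator:

```python
import requests

# SP.POP.TOTL is the World Bank's total-population indicator. The API
# returns a two-element list: paging metadata, then the data rows.
resp = requests.get(
    "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL",
    params={"format": "json", "date": "2010:2016", "per_page": 50},
)
meta, rows = resp.json()
for row in rows[:5]:
    print(row["country"]["value"], row["date"], row["value"])
```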
14. /r/datasets
Reddit, a popular community discussion site, has a section devoted to sharing interesting data sets. It’s called the datasets subreddit, or /r/datasets. The scope and quality of these data sets varies a lot, since they’re all user-submitted, but they are often very interesting and nuanced. You can browse the subreddit here without an account (although a free account will be required to comment or submit data sets yourself). You can also see the most highly-upvoted data sets of all time here.
15. Academic Torrents
Academic Torrents is a data aggregator geared toward sharing the data sets from scientific papers. It has all sorts of interesting (and often massive) data sets, although it can sometimes be difficult to get context on a particular data set without reading the original paper and/or having some expertise in the relevant domains of science. You can browse the data sets directly on the site. Since it’s a torrent site, all of the data sets can be immediately downloaded, but you’ll need a BitTorrent client. Deluge is a good free option that’s available for Windows, Mac, and Linux.
Here are some examples:
- Enron Emails — a set of many emails from executives at Enron, a company that famously went bankrupt.
- Student Learning Factors — a set of factors that measure and influence student learning.
- News Articles — contains news article attributes and a target variable.
Bonus: Streaming data
When you’re building a data science project, it’s very common to download a data set and then process it.
However, as online services generate more and more data, an increasing amount is available in real time rather than in downloadable data sets. Examples include tweets from Twitter and stock prices. There aren’t many good sources for acquiring this kind of data in downloadable form, and a file would quickly be out of date anyway. Instead, this data is often available in real time as streaming data, via an API.
Here are a few good streaming data sources in case you want to try your hand at a streaming data project.
16. Twitter
Twitter has a good streaming API, and makes it relatively straightforward to filter and stream tweets. You can get started here. There are tons of options — you could figure out which states are the happiest, or which countries use the most complex language. If you’d like some help getting started with the Twitter API, check out our tutorial here.
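A minimal sketch using the tweepy library’s classic streaming interface (the v3-style StreamListener; newer tweepy versions use a different StreamingClient class). All four credential strings are placeholders for a Twitter developer app’s keys:

```python
import tweepy

# Classic (v3-style) tweepy streaming; newer tweepy versions replace
# this with StreamingClient. All four credentials are placeholders
# from a Twitter developer app.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)  # print each matching tweet as it arrives

stream = tweepy.Stream(auth=auth, listener=PrintListener())
stream.filter(track=["data science"])  # filter the stream by keyword
```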
17. GitHub
GitHub has an API that allows you to access repository activity and code. You can get started with the API here. The options are endless — you could build a system to automatically score code quality, or figure out how code evolves over time in large projects.
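For instance, the events endpoint returns recent public activity for any repository, and unauthenticated requests work within rate limits. A minimal sketch with requests; pandas-dev/pandas is just an example repository:

```python
import requests

# Recent public events for a repository; unauthenticated requests are
# rate-limited, so pass a token for heavier use. pandas-dev/pandas is
# just an example repository.
resp = requests.get("https://api.github.com/repos/pandas-dev/pandas/events")
for event in resp.json()[:10]:
    print(event["type"], event["created_at"])
```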
18. Quantopian
Quantopian is a site where you can develop, test, and optimize stock trading algorithms. In order to help you do that, the site gives you access to free minute-by-minute stock price data, which you can use to build a stock price prediction algorithm.
19. Wunderground
Wunderground has an API for weather forecasts that is free up to 500 API calls per day. You could use these calls to build up a set of historical weather data, and then use that to make predictions about the weather tomorrow.
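A sketch of a current-conditions call. The URL pattern and response fields below follow Wunderground’s documented format, but treat them as assumptions and verify against their docs:

```python
import requests

# Current conditions for one city. The URL pattern and response fields
# follow Wunderground's documented format, but verify against their docs.
key = "your-api-key-here"  # placeholder API key
url = f"https://api.wunderground.com/api/{key}/conditions/q/CA/San_Francisco.json"

data = requests.get(url).json()
print(data["current_observation"]["temp_f"])
```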
If you liked this, you might like to read the other posts in our ‘Build a Data Science Portfolio’ series.