If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time browsing the internet looking for interesting datasets to analyze. It can be fun to sift through dozens of datasets to find the perfect one, but it can also be frustrating to download and import several CSV files, only to realize that the data isn’t that interesting after all. Luckily, there are online repositories that curate datasets and (mostly) remove the uninteresting ones.
In this post, we’ll walk through several types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find datasets for each. Whether you want to strengthen your data science portfolio by showing that you can visualize data well, or you have a spare few hours and want to practice your machine learning skills, we’ve got you covered.
But first, let’s answer a couple of quick, foundational questions:
What is a dataset?
A dataset, or data set, is simply a collection of data.
The simplest and most common format for datasets you’ll find online is a spreadsheet or CSV file: a single file organized as a table of rows and columns. Some datasets are stored in other formats, though, and a dataset doesn’t have to be a single file. Sometimes it’s a zip file or folder containing multiple tables of related data.
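To make the “one file versus many files” point concrete, here is a minimal sketch of reading every CSV table out of a zip archive using only Python’s standard library. The helper names (`read_csv_text`, `read_zipped_tables`), the file names, and the values are all made up for illustration, and the sketch assumes the CSV files are UTF-8 encoded, which won’t be true of every dataset you download.

```python
import csv
import io
import zipfile


def read_csv_text(text):
    """Parse CSV text into a list of dicts, one per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


def read_zipped_tables(zip_source):
    """Return {file name: rows} for every .csv file inside a zip archive."""
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # Assumption: files are UTF-8 (utf-8-sig also tolerates a BOM).
                text = archive.read(name).decode("utf-8-sig")
                tables[name] = read_csv_text(text)
    return tables


# Build a tiny in-memory archive so the example runs without downloading
# anything. The file names and values are invented for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("states.csv", "state,region\nOhio,Midwest\nTexas,South\n")
    archive.writestr("incomes.csv", "state,median_income\nOhio,100\nTexas,200\n")
    archive.writestr("README.txt", "not a table")
buffer.seek(0)

tables = read_zipped_tables(buffer)
for name, rows in tables.items():
    columns = list(rows[0].keys()) if rows else []
    print(name, len(rows), "rows; columns:", columns)
```

With a real download, you would pass the path to the zip file instead of the in-memory buffer. If you already use pandas, `pandas.read_csv` does the same job for a single CSV file.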
How are datasets created?
Different datasets are created in different ways. In this post, you’ll find links to sources with all kinds of datasets. Some of them will be machine-generated data. Some will be data that’s been collected via surveys. Some may be data that’s recorded from human observations. Some may be data that’s been scraped from websites or pulled via APIs.
Whenever you’re working with a dataset, it’s important to consider: how was this dataset created? Where does the data come from? Don’t jump right into the analysis; take the time to first understand the data you are working with.
Public Datasets for Data Visualization Projects
A typical data visualization project might be something along the lines of “I want to make an infographic about how income varies across the different states in the US.” There are a few considerations to keep in mind when looking for a good dataset for a data visualization project:
It shouldn’t be messy, because you don’t want to spend a lot of time cleaning data.
It should be nuanced and interesting enough to make charts about.
Ideally, each column should be well explained, so you can be confident the visualization is accurate.
The dataset shouldn’t have too many rows or columns, so it’s easy to work with.
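A couple of the considerations above (messiness and size) can be checked quickly before you commit to a dataset. The sketch below is one possible way to do that with Python’s standard library: it counts rows and columns and reports the share of missing cells per column. The function name `summarize_table`, the list of “missing” markers, and the sample values are assumptions for illustration. What counts as “too many” rows or “too messy” is a judgment call, so the code only reports numbers and leaves the decision to you.

```python
import csv
import io


def summarize_table(csv_text, missing_markers=("", "NA", "N/A", "null")):
    """Report row count, column count, and the share of missing cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    rows = list(reader)
    columns = reader.fieldnames or []
    missing_share = {}
    for column in columns:
        # A cell counts as missing if it is absent or matches one of the markers.
        missing = sum(
            1 for row in rows if (row[column] or "").strip() in missing_markers
        )
        missing_share[column] = missing / len(rows) if rows else 0.0
    return {
        "rows": len(rows),
        "columns": len(columns),
        "missing_share": missing_share,
    }


# Made-up sample data: some incomes and most notes are missing.
sample = (
    "state,median_income,notes\n"
    "Ohio,100,\n"
    "Texas,,estimate\n"
    "Utah,300,\n"
    "Iowa,NA,\n"
)

summary = summarize_table(sample)
print(summary["rows"], "rows,", summary["columns"], "columns")
for column, share in summary["missing_share"].items():
    print(f"{column}: {share:.0%} missing")
```

A column that is mostly empty, like `notes` here, is a hint that you will either spend time cleaning the data or leave that column out of your charts. Whether each column is well explained is something only the dataset’s documentation can tell you.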