Datasets: Top 5 places to find quality datasets

Getting started in data science today requires solid mathematical skills and the study of a number of Machine Learning and Deep Learning algorithms.
To understand these algorithms and observe their performance, you’ll often need to practice on quality datasets, which are not always easy to find. You may get to work with quality data in the course of your professional experience, but if you’re practicing outside working hours you’ll need to know where to find reliable data sources.
At DataScientest, we’re pleased to present our Top 5 sites for finding relevant datasets:

Kaggle is a must-have for any data specialist looking for datasets.

Kaggle was founded in 2010 by Anthony Goldbloom and acquired by Google in 2017. It’s a web platform that organizes data competitions. The principle is quite simple: for each competition, an organizer provides a dataset and the problem under consideration. Data scientists are invited to propose solutions, using machine learning algorithms. Those with the best scores can win a prize.

The appeal of Kaggle is twofold: you’ll find quality datasets uploaded by all kinds of companies and individuals, and, through competitions, you can test your Machine Learning and Deep Learning skills against experienced Data Scientists.

The UCI Machine Learning Repository is a collection of databases created as an FTP (File Transfer Protocol) archive in 1987 by David Aha and other graduate students at the University of California, Irvine. Since then it has been widely used by students and researchers around the world. The current version of the website was designed in 2007 by Arthur Asuncion and David Newman.

You’ll find just over 500 datasets, including popular ones such as the Census Income Data Set. You can filter the datasets by task and by subject area to find those that match the problems you’re interested in.

For example, you can search for all datasets dealing with regression problems in the social sciences.
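The same kind of filtering can be sketched in a few lines of Python. This is only an illustration of the idea: the `DatasetEntry` class, the `filter_datasets` function and the three catalog entries below are assumptions typed by hand for the example, not data pulled from the repository or part of any official UCI tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog record: a dataset name, its task and its subject area."""
    name: str
    task: str
    area: str


def filter_datasets(catalog, task, area):
    """Keep the entries whose task and area both match (case-insensitive)."""
    return [
        entry for entry in catalog
        if entry.task.lower() == task.lower() and entry.area.lower() == area.lower()
    ]


# Small hand-written sample, for illustration only
catalog = [
    DatasetEntry("Census Income", "classification", "social science"),
    DatasetEntry("Communities and Crime", "regression", "social science"),
    DatasetEntry("Iris", "classification", "life science"),
]

matches = filter_datasets(catalog, task="regression", area="social science")
print([entry.name for entry in matches])
```

Only the entry that matches both criteria is kept, which mirrors combining two filters on the website.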

Data World (data.world) is a site where you’ll also find numerous datasets from various organizations such as governments and city administrations. The data covers a wide range of subjects, including the economy, the environment, health and education. You can also upload your own datasets.

Data gouv (data.gouv.fr) is the French government’s open data platform: it hosts public data and catalogs how that data is reused. You’ll find numerous datasets on news, population censuses, municipalities and real estate. Etalab, a department of the French Interministerial Digital Affairs Directorate, develops and runs the platform.

Quandl and Yahoo each offer an API that gives you easy access to financial data such as real-time stock prices. Both can be used from Python, with ready-made methods that expose a wealth of financial information. For example, with the Yahoo Finance API you’ll have easy access to moving averages, an indicator often used in technical analysis to smooth out transitory fluctuations and analyze longer-term trends, through the get_50day_moving_avg() and get_200day_moving_avg() methods.
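To make the moving-average idea concrete, here is a minimal sketch in plain Python. It does not call the Yahoo Finance or Quandl APIs: the `simple_moving_average` function and the sample closing prices are assumptions made up for the example. A 50-day or 200-day average works the same way, with `window=50` or `window=200` applied to a long enough price history.

```python
from collections import deque


def simple_moving_average(prices, window):
    """Return the average of the last `window` prices, one value per position
    from the moment `window` prices are available."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    recent = deque()       # the prices currently inside the window
    running_total = 0.0    # sum of the prices in `recent`
    for price in prices:
        recent.append(price)
        running_total += price
        if len(recent) > window:
            # Drop the oldest price so the window keeps a fixed size
            running_total -= recent.popleft()
        if len(recent) == window:
            averages.append(running_total / window)
    return averages


# Hypothetical closing prices, for illustration only
closing_prices = [10.0, 11.0, 12.0, 13.0, 14.0]
print(simple_moving_average(closing_prices, window=3))  # [11.0, 12.0, 13.0]
```

Each output value averages several consecutive prices, so a single unusual day has less influence on the curve than it has on the raw price series.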
