Datasets (or data sets) are commonly used in machine learning. They consist of a collection of coherent data that can come in various formats (text, numbers, images, videos, etc.).
What is a Dataset?
Datasets can be represented in different types, including tables, graphs, trees, and more. Algorithms in machine learning often work with structured tables.
Each value in a dataset is associated with an attribute and an observation.
For example, consider data about individuals who may or may not have COVID-19. Attributes would correspond to different characteristics such as age, weight, height, city of residence, symptoms, etc., while each observation would be associated with a different person.
The advantage of datasets is the ability to manipulate and make various modifications to the data. Let’s delve into how to manipulate them in Python.
Handling datasets in Python
It allows you to create datasets or import them and manipulate them before applying machine learning models.
When you obtain a dataset, you often need to make modifications.
For example, data may have various issues, such as missing values (which are often necessary for analysis). There can also be user input errors (misplaced commas, extra zeros, etc.). Data type issues can also arise. Often, attributes (like age, for example) are in text format, but to use statistical functions on this attribute (like calculating the average age, standard deviation, etc.), you need to convert the data of that attribute into numerical format.
The functions and methods in Pandas make it easy to perform these different manipulation steps and make the necessary changes to your dataset.
Once the dataset’s data has been processed, machine learning algorithms are often used on the datasets to predict models.
Let’s revisit the example of our dataset concerning COVID-19 patients.
When you obtain such a dataset, before creating machine learning models, you need to make several modifications:
1. There is no information available for the personal characteristics of patient 4, so you should consider removing the entire row (as it is not usable).
2. The height is given in text format (as evident from the mix of numbers and text in the cell). Therefore, you need to extract the first three characters from each cell and change the data type to numerical format.
3. It’s noticeable that there is an extra 0 in the patient’s weight.
Once all these modifications are made, you can meaningfully analyze the data and create models. With such a dataset, you can typically predict what profile of a person might be likely to have specific symptoms in a particular region, for example.
If it’s necessary to work with datasets, it’s also crucial to ensure the validity of data sources. Working with inaccurate data would be a waste of time.
An article on our blog provides a list of sites where you can find data from reputable sources.
Datasets are indeed highly effective and manipulable for data processing. Our training programs teach various tools for data manipulation and modeling. For more information, please don’t hesitate to get in touch with us.
Top 5 places to find datasets for Machine Learning
Whether you’re interested in aerospace, sports, the environment or traffic on the Paris ring road, find out where and how to retrieve datasets tailored to your needs.
Here are the top 5 sites for open source data retrieval on the Internet.
This tool developed by Google is one of the most efficient ways to find a dataset through simple keyword searches.
For example, if you want to do a Machine Learning project related to tennis, specifically focusing on Roland-Garros and adding Nadal’s performance data to your project, you can simply enter these three keywords into the search bar as you would in a regular Google search.
The search will return all datasets containing the specified keywords along with a brief description and additional information about these datasets (source, publication date, licensing, etc.).
You can refine your search further using advanced parameters such as the date of the last update, usage rights, or dataset availability.
This French government website provides access to public data concerning the French territory. It offers datasets on various topics, allowing users to specify the territorial granularity (departmental, regional, national), the data source (some Ministries provide data), and the time period covered by the data.
Numerous themes are covered, including datasets related to the economy, health, agriculture, environment, tourism, education, and European issues. The website also showcases how the datasets it contains have been reused by other platforms for surveys or publications.
The U.S. agency FEMA (Federal Emergency Management Agency) is dedicated to preventing and protecting the population from threats and dangers that pose a risk on U.S. territory. This organization has established a website to provide open access to databases collecting information on various subjects.
These datasets cover disasters that have occurred on U.S. territory, emergency management, programs to aid populations, and households that have benefited from natural disaster prevention programs.
On each page presenting a dataset, you can find information about it, its contents, and links to download the data.
NASA (National Aeronautics and Space Administration) has decided to make some of its datasets public with the goal of “stimulating your creativity to solve problems here on Earth.”
In addition to providing data, the organization also offers free access to projects conducted by researchers and APIs.
When you access the dataset catalog, you can search by keywords and apply various filters. For each dataset, you will have access to a detailed description of the columns and a preview.
The National Institute of Statistics and Economic Studies provides a wide range of French datasets sorted by themes and geographical granularity.
These datasets cover specific domains such as the economy, demographics, consumption, the labor market, as well as the environment and sustainable development.
In addition to datasets, the institute offers interactive maps, detailed statistics, and time series data.
This provides a brief overview of the sources of freely accessible data available on the Internet.
Generally, some countries and government organizations like Canada, the United Kingdom, or the European Union provide open access datasets. In France, the Open Data Paris website can be an interesting source for collecting data related to the City of Paris.
Lastly, the French company Opendatasoft is responsible for creating open data sites for certain companies and organizations such as Engie, SFR, Euler Hermes, or even the Ministry of National Education and Youth, whose datasets are freely accessible on the Internet.
Now that you know where to find quality datasets, all that’s left is to learn how to train your Machine Learning models on them!