A key algorithm in Machine Learning, linear regression is used to model the relationship between a target variable and one or more explanatory variables. To put this algorithm into practice with ease, data scientists can turn to programming languages, particularly Python. So how do you use linear regression with Python? DataScientest answers the question.
What is linear regression?
Before looking at the practical use of linear regression with Python, we need to go back to basics.
Linear regression - Definition
The linear regression model is a supervised learning algorithm used to predict a continuous target variable (dependent variable) from one or more explanatory variables (independent or predictive variables). In other words, it establishes relationships between 2 or more variables.
When there is only one explanatory variable, we speak of simple linear regression. On the other hand, if there are several, we speak of multiple linear regression.
Whether simple or multiple, linear regression can be used with Python.
Linear regression - mathematical formulation
The mathematical equation for linear regression is as follows:
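Y = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
In this equation:
- Y corresponds to the predicted value (the dependent variable);
- θ₀ is the bias term (intercept), and θ₁, …, θₙ are the coefficients applied to each explanatory variable; together they form the parameter vector θ;
- x₁, x₂, …, xₙ correspond to the feature values (the explanatory variables).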
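From a visual point of view, linear regression applies when the training data forms a point cloud that roughly follows a straight line. In this case, the aim is to identify the straight line that most closely approximates the set of points.
To make this line as accurate as possible, the parameters are chosen to minimize the mean squared error, that is, the average of the squared differences between the observed values and the values predicted by the line.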
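To make the mean squared error concrete, here is a minimal sketch that compares two candidate lines on a handful of points. The points and both candidate lines are made up for illustration, and the helper name mean_squared_error is hypothetical (it is not the scikit-learn function of the same name).

```python
import numpy as np

# Made-up illustrative points that roughly follow y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def mean_squared_error(theta0, theta1, x, y):
    """Average of the squared gaps between the line's predictions and y."""
    predictions = theta0 + theta1 * x
    return np.mean((predictions - y) ** 2)

# Two hand-picked candidate lines: y = 2x and the flat line y = 0.
mse_good = mean_squared_error(0.0, 2.0, x, y)
mse_bad = mean_squared_error(0.0, 0.0, x, y)

print("MSE for y = 2x:", mse_good)
print("MSE for y = 0 :", mse_bad)
```

The line with the smaller mean squared error is the better fit. Linear regression looks for the parameters that make this value as small as possible.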
Use cases for linear regression
If linear regression is often the first algorithm taught in machine learning, it’s because it is simple and can be applied in so many different ways.
For example:
- Identify factors influencing the profitability of an investment;
- Predict future sales by analyzing past sales;
- Anticipate consumer behavior;
- Predict the price of a house based on its characteristics;
- etc.
And for every linear regression application, you can use Python.
How to use linear regression with Python?
To explain linear regression with Python, let’s take a concrete example. Here’s the starting hypothesis:
- A restaurateur who already owns several restaurants in several cities wants to expand his business by setting up in different locations.
- To analyze the next cities in which to set up, the restaurateur has two sets of data at his disposal: the profits made in the cities where he is already established, and the populations of the cities.
Since the aim is to make as much profit as possible, he needs to predict the profit a new restaurant would make (dependent variable = Y) as a function of the city’s population (independent variable = X).
So how do you build the linear regression model with Python? Here’s how.
Formatting data
To model linear regression with Python, you need to prepare the training data in the right format. Ideally, you should prepare a CSV file with two columns: one for the population (independent variable) and another for the profits (dependent variable). Here’s what such a file might look like:
Population | Profits |
---|---|
811 000 | 175 000 € |
757 000 | 91 300 € |
551 000 | 21 000 € |
372 000 | - 6 000 € |
… | … |
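If the file stores the values exactly as displayed above, with spaces as thousands separators and a euro sign, pandas will read them as text rather than numbers. The sketch below shows one possible way to clean them. The helper to_number is hypothetical, and the DataFrame simply reuses the four sample rows from the table.

```python
import pandas as pd

def to_number(text):
    """Convert a string such as '175 000 €' or '- 6 000 €' to a float."""
    cleaned = str(text).replace("€", "").replace("\u00a0", "").replace(" ", "")
    return float(cleaned)

# The four sample rows from the table above, kept as raw strings.
raw = pd.DataFrame({
    "Population": ["811 000", "757 000", "551 000", "372 000"],
    "Profits": ["175 000 €", "91 300 €", "21 000 €", "- 6 000 €"],
})

# Apply the helper to every cell, column by column.
clean = raw.apply(lambda column: column.map(to_number))
print(clean)
```

Storing plain numbers in the CSV file (811000, 175000, and so on) avoids this step entirely.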
Load data
This training data must then be loaded into Python. Thanks to the Pandas library, you can easily read CSV files. Here’s how to do it.
import pandas as pd
df = pd.read_csv(r"D:\DEV\PYTHON_PROGRAMMING\donnees-d-entrainement-regression-lineaire.csv")
The read_csv() function returns a DataFrame, a two-dimensional table containing the dependent and independent variables. The r prefix makes the path a raw string, so the backslashes are not interpreted as escape sequences. But to use linear regression with Python, you need to separate the two columns into two Python variables.
For the first column, corresponding to population size:
X = df.iloc[:, 0]
For the second column, corresponding to profits:
Y = df.iloc[:, 1]
This gives you two one-dimensional pandas Series, X and Y, which together contain the entire training data set.
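One detail worth checking at this stage is the shape of X: scikit-learn estimators expect the explanatory variables as a two-dimensional structure of shape (n_samples, n_features), whereas a single column is one-dimensional. The sketch below uses a small stand-in DataFrame built from the four sample rows of the table instead of the real CSV file, and the names X_2d and X_array are illustrative.

```python
import pandas as pd

# Stand-in for the loaded CSV: the four sample rows from the table.
df = pd.DataFrame({
    "Population": [811_000, 757_000, 551_000, 372_000],
    "Profits": [175_000, 91_300, 21_000, -6_000],
})

X = df.iloc[:, 0]  # one-dimensional Series
Y = df.iloc[:, 1]

# Two equivalent ways to obtain a single-column, two-dimensional X.
X_2d = X.to_frame()                    # DataFrame with one column
X_array = X.to_numpy().reshape(-1, 1)  # NumPy array with one column

print(X.shape, X_2d.shape, X_array.shape)  # (4,) (4, 1) (4, 1)
```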
Visualize data
To better understand linear regression with Python, it can be useful to visualize the data. This will enable you to see the individual points and get a sense of how widely they are dispersed.
To obtain a scatter plot, you can use Matplotlib, a Python library. Here’s how to get it:
import matplotlib.pyplot as plt
axes = plt.axes()
axes.grid()
plt.scatter(X,Y)
plt.show()
Apply the algorithm
From there, the aim is to find a predictive function f(x) with population size as input and expected profits as output.
To model linear regression with Python, the easiest way is to use the scikit-learn library. Start with this import:
from sklearn.linear_model import LinearRegression
From there, you can build your model. Since scikit-learn expects the explanatory variables as a two-dimensional structure, the X Series is converted to a single-column DataFrame before fitting. Here’s the code to write:
reg = LinearRegression()
reg.fit(X.to_frame(), Y)
And to find the line f(x) = ax + b with minimum squared error, read the slope and the intercept from the fitted model:
a = reg.coef_[0]
b = reg.intercept_
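For a single explanatory variable, the slope and intercept that minimize the squared error can also be computed directly from the classic least-squares formulas. The sketch below is an illustration on the four sample rows from the table, not a replacement for scikit-learn. The helper fit_line is hypothetical, and np.polyfit serves only as a cross-check.

```python
import numpy as np

# The four sample rows from the table (population, profit in euros).
x = np.array([811_000, 757_000, 551_000, 372_000], dtype=float)
y = np.array([175_000, 91_300, 21_000, -6_000], dtype=float)

def fit_line(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - a * x_mean
    return a, b

a, b = fit_line(x, y)

# Cross-check against NumPy's degree-1 polynomial fit (slope, intercept).
a_ref, b_ref = np.polyfit(x, y, deg=1)

print("slope:", a, "intercept:", b)
```

With only four points the resulting line is purely illustrative. A real analysis would use the full training file.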
Make predictions
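To plot the regression line with Python on top of the scatter plot, you can use the code below. The x values for the line are assumed here to span the observed populations, from the smallest to the largest:
import numpy as np
ordonne = np.linspace(X.min(), X.max(), 100)  # evenly spaced population values
plt.scatter(X, Y)
plt.plot(ordonne, a * ordonne + b, color='r')
plt.show()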
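Once the model is fitted, predicting the profit for a new city comes down to calling predict() on its population. The sketch below is self-contained and assumes the four sample rows from the table as training data. The population of 600 000 is a hypothetical candidate city, chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The four sample rows from the table; X has one column, as scikit-learn expects.
X = np.array([[811_000], [757_000], [551_000], [372_000]], dtype=float)
Y = np.array([175_000, 91_300, 21_000, -6_000], dtype=float)

reg = LinearRegression().fit(X, Y)

# Hypothetical candidate city with 600 000 inhabitants.
new_population = np.array([[600_000]], dtype=float)
predicted_profit = reg.predict(new_population)

# The same value computed by hand from the slope and the intercept.
manual = reg.coef_[0] * 600_000 + reg.intercept_

print("Predicted profit:", predicted_profit[0])
```

Keep in mind that a prediction for a population far outside the range seen during training is an extrapolation and should be treated with caution.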
Master linear regression with Python
Linear regression is undoubtedly the algorithm to master in data science. And if using it via Python still seems complex, that’s only temporary.
With the right training, you’ll be able to evaluate any machine learning algorithm across different programming languages.
But which course should you choose? DataScientest, of course. Discover our program.