Rather than explaining relationships between two variables, multiple linear regression establishes relationships between one variable and several explanatory variables. This multidimensional approach allows us to delve deeper into the links between different data sets, while reducing the risk of misinterpretation. Find out more about the multiple linear regression model, its mathematical translations and its advantages.
What is multiple linear regression?
Definition
Before understanding multiple linear regression (also known as multidimensional linear regression or MLR), we need to redefine the basics. And more specifically, linear regression.
The aim of this classification algorithm is to establish relationships between an explanatory variable Y (known as the dependent variable or response variable) and an explanatory variable X (known as the independent variable).
This model takes shape on a point cloud, with X on the y-axis and Y on the x-axis. In this case, linear regression must determine a straight line capable of passing as close as possible to the points in the cloud.
This is done using the method of least squares (or OLS for ordinary least squares), which determines the relationship between X and Y.
From here, it is possible to explain a dependent variable Y on the basis of an independent variable X (this is simple linear regression), or to explain the dependent variable Y on the basis of several independent variables X (at least 2). This is what multiple linear regression is all about. By establishing relationships between different variables, predictions can be made with a minimum of error.
Whatever the model, the dependent variable is always continuous numerical, unlike the independent variables, which can be continuous or categorical (but always numerical).
Mathematical translation
Multiple linear regression can be used whenever this type of data set is available:
Y | X1 | X2 | … | Xn | |
---|---|---|---|---|---|
1 | 15 | 54 | … | … | … |
2 | 58 | 65 | … | … | … |
… | … | … | … | … | … |
n | … | … | … | … | … |
From this table, the RLM takes the following form:
yi = β0 + β1xi1 +…+ βpxip + ϵi
Where
yi = the dependent variables ;
i = index of observations ;
xij = the observed values of the independent variables;
βp = unknown parameters (also sometimes called “partial slopes”) ;
ϵi = residuals (in other words, prediction error).
Like any linear regression, multiple regression is formalized through a point cloud. But unlike simple regression, which is projected onto a two-dimensional graphical plane, multiple linear regression is projected onto a multi-dimensional graph. This is what makes it possible to model the different explanatory variables.
Why use multivariate linear regression?
Making predictions
By identifying correlations between a result (the dependent variable) and several explanatory and independent variables, multiple linear regression enables predictions to be made and insights to be gained.
That’s why this mathematical method is used in so many fields. Here are just a few examples:
- Sales performance: companies can predict sales of a product using various characteristics of the typical buyer, such as age, salary level, geographical location, etc.
- Weather forecasting: meteorologists can predict the week’s weather based on air temperature, humidity levels, atmospheric pressure, etc.
- Medicine: health professionals can anticipate the spread of a virus in a region based on the number of people infected, the speed of contamination, the consumption of a particular food, weather conditions, etc.
- The stock market: financial analysts can predict a share price based on a company’s financial health, current events, economic conditions, etc.
Limiting confusion between explanatory variables
In addition to making predictions, multiple linear regression can also overcome the limitations of simple linear regression.
In some cases, there may be an apparent link between a variable to be explained and an explanatory variable. However, this link does not seem logical.
For example, there is a strong correlation between mint consumption and respiratory capacity. Breathing capacity decreased as mint consumption increased. Does this mean that mint consumption explains these respiratory weaknesses? No, there’s another factor.
Again, there’s a clear correlation between mint consumption and smoking. But also between smoking and respiratory capacity.
In this hypothesis, the mint consumption variable is linked to both the response variable (respiratory capacity) and the explanatory variable (smoking). It thus becomes a confounding factor.
To detect this confounding, simple linear regression is insufficient.
Instead, multiple linear regression should be used. Multiple linear regression is used to determine the relationship between the variable and the explanatory variables. All these explanatory variables are taken into account.
Linear regression and Machine Learning
In addition to explaining a variable in terms of several independent pieces of data, multiple linear regression is also capable of assimilating new rules on its own. As such, this mathematical tool is a must-have for artificial intelligence.
The idea is to start with a training phase using clouds of training points. This produces a high-performance machine learning model capable of accurately establishing the relationships between a variable to be explained and other explanatory variables.
But to develop these models and interpret the results obtained, you need in-depth training. DataScientest makes it possible.
Through our data science training, you’ll learn everything you need to know about multiple linear regression and all the other Machine Learning tools. Come and join us!