In data science, it is vital to discover and quantify the extent to which two variables are linked. These relationships can be complex and are not necessarily visible. Some of these dependencies, such as linear relationships between features (collinearity), can weaken the performance of Machine Learning algorithms. It is therefore important to prepare your data properly.

Here we will look at how to obtain dependency between two categorical variables and between **categorical and continuous variables.**

First of all, remember that a categorical variable is a variable which has a finite number of distinct categories or groups. For example, the gender of individuals, the type of equipment or the method of payment. In contrast, **continuous variables** can theoretically take on an infinite number of values.

### Correlation between two categorical variables:

To find out whether two **categorical variables** are related, we use the famous chi-square test. If you’re not familiar with statistical tests, don’t panic!

A statistical test is a procedure for deciding between two hypotheses.

It consists of rejecting or not rejecting a statistical hypothesis, called the null hypothesis H0, based on a set of data.

In the test we are interested in, the null hypothesis is simply “the two variables being tested are independent”. The test is accompanied by a test statistic which is used to decide whether or not to reject the null hypothesis. Because of the way the test is constructed, this statistic conveniently follows a chi-square distribution with a certain number of degrees of freedom.

#### But how do you decide whether or not to reject the null hypothesis?

Without going into the mathematical details, each statistical test has a so-called p-value. This can be seen as a reference value for deciding whether or not to reject the null hypothesis.

If the p-value is below 5%, the null hypothesis is rejected. The 5% threshold is a convention among practitioners and may vary depending on the field.

The **test is easily implemented in Python** using the scipy library and its chi2_contingency function, which quickly returns the p-value of the test as well as the associated statistic and degrees of freedom. In practice, the chi-square test requires a little work on the data beforehand: you first need to build the contingency table, a cross-tabulation between the two variables, which is easily obtained with the Pandas crosstab function. The test is then performed on the contingency table:
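The article’s original dataset is not reproduced here, so the sketch below uses made-up data for two hypothetical categorical variables, gender and payment method:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: gender vs. preferred payment method (hypothetical example)
df = pd.DataFrame({
    "gender": ["M"] * 40 + ["F"] * 60,
    "payment": ["card"] * 30 + ["cash"] * 10 + ["card"] * 15 + ["cash"] * 45,
})

# Step 1: the contingency table, a cross-tabulation of the two variables
contingency = pd.crosstab(df["gender"], df["payment"])

# Step 2: the chi-square test on the contingency table
stat, p_value, dof, expected = chi2_contingency(contingency)

print(f"statistic = {stat:.3f}, p-value = {p_value:.5f}, dof = {dof}")
```

Note that chi2_contingency also returns the table of expected frequencies under independence, which is useful for checking the usual validity condition of the test (expected counts of at least 5 per cell).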

In our example above, the p-value is well below 5%, so we can reject the hypothesis that the two variables being tested are independent.

Finally, we can also measure the strength of the correlation between the two variables using **Cramér’s V.** This is calculated from the test statistic, the degrees of freedom and the dimensions of the contingency table, and returns a value between 0 and 1. A value above 0.9 indicates a very strong relationship; a value below 0.1 indicates a weak one.
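A minimal sketch of Cramér’s V, computed by hand from the chi-square statistic on a made-up contingency table (recent versions of scipy also offer scipy.stats.contingency.association for this):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table (same hypothetical gender/payment example)
contingency = np.array([[30, 10],
                        [15, 45]])

stat, p_value, dof, expected = chi2_contingency(contingency)

# Cramer's V: the chi-square statistic normalised by the sample size and the
# smallest table dimension, giving a value between 0 and 1
n = contingency.sum()
r, k = contingency.shape
cramers_v = np.sqrt(stat / (n * min(r - 1, k - 1)))

print(f"Cramer's V = {cramers_v:.3f}")
```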

### Correlation between two continuous variables:

As with **categorical variables**, there is a test to determine whether two continuous variables are related: Pearson’s correlation test. The null hypothesis is similar: “the two variables tested are independent” (strictly speaking, the test checks whether their linear correlation coefficient is zero). As with the chi-square test, it is accompanied by a test statistic and a p-value that determines whether or not the null hypothesis is rejected.

This test can be **implemented very easily** using the scipy library and its pearsonr function. There is no need to work on the **data beforehand,** provided it contains no missing values. Here is an example of implementation using Python:
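Since the article’s dataset is not reproduced here, the sketch below generates made-up, linearly related data; the coefficient it prints will therefore differ from the figure quoted in the text:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Made-up data: two continuous variables with a linear relationship plus noise
x = rng.uniform(0, 100, size=200)
y = 3 * x + rng.normal(0, 65, size=200)

coef, p_value = pearsonr(x, y)
print(f"Pearson coefficient = {coef:.5f}, p-value = {p_value:.2e}")
```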

In our example, the p-value is less than 5%, so we reject the hypothesis that the variables are independent. The Pearson coefficient measures the level of correlation between the two variables and returns a value between -1 and 1: close to 1 means the variables are positively correlated, close to 0 that they are uncorrelated, and close to -1 that they are negatively correlated. In our example, the coefficient is 0.80319, which means the variables are highly correlated.

### Correlation between a continuous variable and a categorical variable:

To study this type of correlation, a one-way analysis of variance (ANOVA) is used to compare sample means. The aim of this test is to determine whether a categorical variable influences the distribution of a continuous variable to be explained.

Imagine that you have 3 variables: the first gives a customer number, the second a category (1, 2 or 3) and the last the amount spent. The question is: does the category variable influence the amounts spent? Let µ1, µ2 and µ3 denote the average amounts spent for each of the 3 categories. A simple line of reasoning says that if the category variable has no influence on the amounts spent, then these averages should be identical.

In other words, **µ1 = µ2 = µ3**. This is exactly the hypothesis we test when we use analysis of variance. As with the chi-square and Pearson tests, this test is accompanied by a test statistic and a p-value which determines whether or not the null hypothesis is rejected.

This test is **easily implemented in Python** using the statsmodels library. Here is an example of implementation:

In our example, df indicates the degrees of freedom of the test statistic F, which follows a Fisher distribution, and PR(>F) gives the p-value of the test. Since it is below 5%, we can conclude that the main_category variable has an influence on the pledged variable.

You now have all the tools you need to study correlations within a dataset. Datascientest will give you the opportunity to go further by learning how to manage a data project from A to Z. Find out more about our training courses!