The chi-squared test (or χ² test) is a statistical test for variables that take a finite number of possible values, that is, categorical variables. As a reminder, a statistical test is a method used to determine whether a hypothesis, known as the null hypothesis, is consistent with the data.
What is the purpose of the chi-squared test?
The advantage of the chi-squared test is its wide range of applications:
- Test of goodness of fit to a predefined law or family of laws, for example: Does the height of individuals in a population follow a normal distribution?
- Test of independence, for example: Is hair color independent of gender?
- Homogeneity test: Are two sets of data identically distributed?
How does the chi-squared test work?
Its principle is to compare the proximity or divergence between the distribution of the sample and a theoretical distribution using the Pearson statistic $\chi^2_{\text{Pearson}}$, which is based on the chi-squared distance.
First problem: since we only have a limited amount of data, we cannot know the sample's underlying distribution exactly, but only an approximation of it, the empirical measure.
The empirical measure $\widehat{\mathbb{P}}_{n,X}$ represents the frequency of the different observed values:

$$\widehat{\mathbb{P}}_{n,X}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i = x\}}$$

with $n$ the sample size, $X_1, \dots, X_n$ the observations, and $\mathbf{1}_{\{X_i = x\}}$ the indicator equal to 1 if $X_i = x$ and 0 otherwise.
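As a quick illustration, here is a minimal Python sketch of the empirical measure on a hypothetical sample (the values are purely illustrative):

```python
from collections import Counter

# Hypothetical sample of a categorical variable (illustrative values)
sample = ["son", "daughter", "son", "son", "daughter"]
n = len(sample)

# Empirical measure: observed frequency of each value
empirical = {value: count / n for value, count in Counter(sample).items()}
print(empirical)  # {'son': 0.6, 'daughter': 0.4}
```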
The Pearson statistic is defined as:

$$\chi^2_{\text{Pearson}} = n \sum_{x} \frac{\left(\widehat{\mathbb{P}}_{n,X}(x) - \mathbb{P}(x)\right)^2}{\mathbb{P}(x)}$$

where $\mathbb{P}$ is the theoretical distribution.
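To make the definition concrete, here is a minimal sketch that computes the statistic directly from this formula; the counts and theoretical probabilities are hypothetical:

```python
import numpy as np

# Hypothetical counts and theoretical probabilities (illustrative)
observed = np.array([18, 22, 20])        # counts per category, n = 60
theoretical = np.array([1/3, 1/3, 1/3])  # theoretical distribution P

n = observed.sum()
empirical = observed / n                 # empirical measure

# Pearson statistic: n * sum of (P_hat(x) - P(x))^2 / P(x)
stat = n * np.sum((empirical - theoretical) ** 2 / theoretical)
print(stat)
```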
Under the null hypothesis, i.e. when the distribution of the sample equals the theoretical distribution, the Pearson statistic converges in distribution to the chi-squared distribution with d degrees of freedom.
The number of degrees of freedom, d, depends on the dimensions of the problem and is generally equal to the number of possible values minus 1.
As a reminder, the chi-squared distribution with $d$ degrees of freedom is the distribution of a sum of squares of $d$ independent standard (centered and reduced) Gaussians: if $Z_1, \dots, Z_d$ are i.i.d. $\mathcal{N}(0, 1)$, then $Z_1^2 + \dots + Z_d^2 \sim \chi^2(d)$.
Otherwise, this statistic diverges to infinity, reflecting the distance between the empirical and theoretical distributions:

$$\chi^2_{\text{Pearson}} \xrightarrow[n \to \infty]{} +\infty$$
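As an aside, the characterization of the chi-squared law above is easy to check by simulation; the sketch below compares a sum of squared standard Gaussians with scipy's chi2 distribution (the sample size and d are arbitrary choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
d = 3  # degrees of freedom

# Sum of squares of d independent standard (centered, reduced) Gaussians
samples = (rng.standard_normal((100_000, d)) ** 2).sum(axis=1)

# Mean and variance should be close to those of chi2(d): d and 2d
print(samples.mean(), chi2.mean(d))  # both close to 3
print(samples.var(), chi2.var(d))    # both close to 6
```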
What are the benefits of the chi-squared test?
So we have a simple decision rule: if the Pearson statistic exceeds a certain threshold, we reject the null hypothesis (the theoretical distribution does not fit the data); otherwise, we accept it.
The advantage of the chi-squared test is that this threshold depends only on the chi-squared distribution and the significance level $\alpha$: it is the $(1 - \alpha)$ quantile of $\chi^2(d)$, so it is independent of the distribution of the sample.
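As an illustration of this decision rule, here is a minimal goodness-of-fit sketch using scipy.stats.chisquare; the die-roll counts are hypothetical:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts for 60 rolls of a die (illustrative data)
observed = np.array([8, 12, 9, 11, 7, 13])
expected = np.full(6, observed.sum() / 6)  # uniform law: 10 rolls per face

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
print(f"stat = {stat:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "cannot reject H0")
```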
The test of independence:
Let’s take an example to illustrate this test: we want to determine whether the genders X and Y of the first two children in a couple are independent.
We have gathered the data in a contingency table counting each combination of values of (X, Y).
The Pearson statistic will determine whether the empirical measure of the joint distribution (X, Y) equals the product of the empirical marginal measures, which characterizes independence. Written with observed and expected counts, it reads:

$$\chi^2_{\text{Pearson}} = \sum_{x, y} \frac{\left(\text{Observation}(x, y) - \text{Theory}(x, y)\right)^2}{\text{Theory}(x, y)}$$
Here, Observation(x, y) represents the observed count of the value (x, y): for example, Observation(son, son) is the number of couples in the sample whose first two children are both sons.
For Theory(x, y), X and Y are assumed to be independent, so the theoretical distribution is the product of the marginal distributions: $\text{Theory}(x, y) = n \, \widehat{\mathbb{P}}_X(x) \, \widehat{\mathbb{P}}_Y(y)$. Thus, the theoretical count for (son, son) is $n \, \widehat{\mathbb{P}}_X(\text{son}) \, \widehat{\mathbb{P}}_Y(\text{son})$.
Let’s calculate the test statistic using the following Python code:
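The original snippet is not reproduced here; the sketch below uses scipy.stats.chi2_contingency, with hypothetical counts standing in for the survey data:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical contingency table (illustrative counts, not the original data):
# rows = gender of the first child, columns = gender of the second child
#                      son  daughter
observed = np.array([[120, 110],    # first child: son
                     [105, 115]])   # first child: daughter

# chi2_contingency returns the Pearson statistic, its p-value,
# the degrees of freedom and the expected counts under independence.
# correction=False disables Yates' continuity correction so the result
# matches the plain Pearson statistic described above.
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

alpha = 0.05
threshold = chi2.ppf(1 - alpha, dof)  # (1 - alpha) quantile of chi2(dof)

print(f"Pearson statistic: {stat:.4f}")
print(f"p-value:           {p_value:.4f}")
print(f"threshold:         {threshold:.4f}")
# We reject independence when stat > threshold (equivalently, p_value < alpha).
```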
In our case, the variables X and Y each have only 2 possible values, daughter or son, so the number of degrees of freedom is (2 - 1)(2 - 1) = 1.
Therefore, we compare the test statistic to the $(1 - \alpha)$ quantile of the chi-squared distribution with 1 degree of freedom, obtained with the chi2.ppf function from scipy.stats.
If the test statistic is below this quantile (equivalently, if the p-value is greater than the significance level of 0.05), we cannot reject the null hypothesis at the 95% confidence level.
Thus, we conclude that the genders of the first two children are independent.
What are its limits?
While the chi-squared test is very practical, it does have limitations. It can only detect the existence of an association; it measures neither its strength nor any causal relationship.
It relies on approximating the distribution of the Pearson statistic by the chi-squared distribution, which is only valid with a sufficient amount of data. In practice, the usual validity condition is that every theoretical count Theory(x, y) is at least 5.
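One way to check this condition in practice is to inspect the expected counts returned by scipy.stats.chi2_contingency; a minimal sketch with a hypothetical table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (illustrative counts)
observed = np.array([[12, 5],
                     [8, 15]])

# The fourth return value holds the expected counts under independence
_, _, _, expected = chi2_contingency(observed, correction=False)

print(expected)
# Rule of thumb: the chi-squared approximation is reliable
# when every expected count is at least 5.
print("approximation valid" if (expected >= 5).all() else "prefer an exact test")
```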
The Fisher exact test can address this limitation but requires significant computational power. In practice, it is often limited to 2×2 contingency tables.
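For completeness, here is a minimal sketch of Fisher's exact test on a hypothetical 2×2 table, using scipy.stats.fisher_exact:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table with small counts (illustrative)
table = [[3, 1],
         [1, 4]]

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p-value = {p_value:.3f}")
```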
Statistical tests are crucial in Data Science to assess the relevance of explanatory variables and validate modeling assumptions. You can find more information about the chi-squared test and other statistical tests in our module 104 – Exploratory Statistics.