
Chi squared test: Find out more about this essential statistical test


The chi-squared test (also written χ² test) is a statistical test for variables that take a finite number of possible values, that is, categorical variables. As a reminder, a statistical test is a method used to determine whether a hypothesis, known as the null hypothesis, is consistent with the data or not.

What is the purpose of the Chi squared test?

A key strength of the chi-squared test is its wide range of applications:

  • Goodness-of-fit test against a predefined distribution or family of distributions, for example: does the height of individuals in a population follow a normal distribution? (A short code sketch of this case follows the list.)
  • Test of independence, for example: is hair color independent of gender?
  • Homogeneity test, for example: are two data sets identically distributed?
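To make the first use case concrete, here is a minimal goodness-of-fit sketch; the die-roll counts are invented for illustration, and scipy.stats.chisquare computes the Pearson statistic described below:

```python
# Goodness-of-fit: are these 600 die rolls compatible with a fair die?
from scipy.stats import chisquare

observed = [95, 108, 92, 103, 101, 101]  # hypothetical counts for faces 1..6
expected = [100] * 6                     # fair die: 600 rolls / 6 faces

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, p-value = {p_value:.3f}")
# A large p-value means the fair-die hypothesis is not rejected.
```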

How does the Chi squared test work?

Its principle is to compare the proximity or divergence between the distribution of the sample and a theoretical distribution using the Pearson statistic \chi_{Pearson}, which is based on the chi-squared distance.

First problem: since we have only a limited amount of data, we cannot know the distribution of the sample perfectly, but only an approximation of it: the empirical measure.

The empirical measure \widehat{\mathbb{P}}_{n,X} represents the frequency of the different observed values:

\forall x \in \mathbb{X} \quad \widehat{\mathbb{P}}_{n,X}(x) = \frac{1}{n} \sum_{k=1}^{n} 1_{X_{k} = x}

where X_{1}, \dots, X_{n} is the sample and \mathbb{X} is the set of possible values.
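Concretely, the empirical measure is just a table of observed frequencies; a minimal sketch, with a made-up sample:

```python
# Empirical measure: frequency of each observed value in a sample
from collections import Counter

sample = ["son", "daughter", "son", "son", "daughter"]  # hypothetical sample
n = len(sample)
empirical = {x: count / n for x, count in Counter(sample).items()}
print(empirical)  # {'son': 0.6, 'daughter': 0.4}
```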

The Pearson statistic is defined as:

\chi_{Pearson} = n \times \chi^{2}(\widehat{\mathbb{P}}_{n,X}, P_{theoretical}) = n \times \sum_{x \in \mathbb{X}} \frac{(\widehat{\mathbb{P}}_{n,X}(x) - P_{theoretical}(x))^{2}}{P_{theoretical}(x)}

Under the null hypothesis, i.e. when the distribution of the sample equals the theoretical distribution, the Pearson statistic converges in distribution to the chi-squared distribution with d degrees of freedom.

The number of degrees of freedom, d, depends on the dimensions of the problem and is generally equal to the number of possible values minus 1.

As a reminder, the chi-squared distribution with d degrees of freedom, \chi^{2}_{law}(d), is the distribution of a sum of squares of d independent standard (centered and reduced) Gaussians:

\chi^{2}_{law}(d) := \sum_{k=1}^{d} X_{k}^{2} \quad \text{with} \quad X_{k} \sim \mathcal{N}(0,1) \text{ i.i.d.}
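This characterization is easy to check by simulation; a minimal sketch (the choice d = 3 and the Kolmogorov–Smirnov comparison are illustrative assumptions):

```python
# Empirically verify that the sum of d squared standard Gaussians follows chi2(d)
import numpy as np
from scipy.stats import chi2, kstest

d, n_sim = 3, 100_000
samples = (np.random.standard_normal((n_sim, d)) ** 2).sum(axis=1)

# Kolmogorov-Smirnov test against the chi2(d) distribution
stat, p_value = kstest(samples, chi2(df=d).cdf)
print(f"KS p-value = {p_value:.3f}")  # large p-value: the distributions match
```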

Otherwise, the statistic diverges to infinity, reflecting the gap between the empirical and theoretical distributions.

\text{Under } H_{0}: \quad \lim_{n \rightarrow \infty} \chi_{Pearson} = \chi^{2}_{law}(d) \qquad \text{Under } H_{1}: \quad \lim_{n \rightarrow \infty} \chi_{Pearson} = \infty

What are the benefits of the Chi squared test?

So we have a simple decision rule: if the Pearson statistic exceeds a certain threshold, we reject the null hypothesis (the theoretical distribution does not fit the data); otherwise, we do not reject it.

The advantage of the chi-squared test is that this threshold depends only on the chi-squared distribution and the significance level α, so it is independent of the distribution of the sample.
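Concretely, the threshold is the (1 − α) quantile of the chi-squared distribution; a minimal sketch with α = 0.05 and d = 1:

```python
# Rejection threshold: the (1 - alpha) quantile of chi2 with d degrees of freedom
from scipy.stats import chi2

alpha, d = 0.05, 1
threshold = chi2.ppf(1 - alpha, df=d)
print(f"Reject H0 at level {alpha} if the Pearson statistic exceeds {threshold:.3f}")
# chi2.ppf(0.95, df=1) ~= 3.841
```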

The test of independence

Let’s take an example to illustrate this test: we want to determine whether the genders X and Y of a couple’s first two children are independent.

We have gathered the data in a contingency table:

\begin{array}{|c|c|c|c|} \hline X / Y & \text{Child 2: son} & \text{Child 2: daughter} & \text{Total} \\ \hline \text{Child 1: son} & 857 & 801 & 1658 \\ \hline \text{Child 1: daughter} & 813 & 828 & 1641 \\ \hline \text{Total} & 1670 & 1629 & 3299 \\ \hline \end{array}

The Pearson statistic will determine if the empirical measure of the joint distribution (X, Y) is equal to the product of the empirical marginal measures, which characterizes independence:

\chi_{Pearson} = n \times \chi^{2}(\widehat{\mathbb{P}}_{X \times Y}, \widehat{\mathbb{P}}_{X} \times \widehat{\mathbb{P}}_{Y}) = n \times \sum_{x, y \in \{daughter, son\}} \frac{(Observation_{x,y} - Theory_{x,y})^{2}}{Theory_{x,y}}

Here, Observation(x,y) represents the frequency of the value (x, y):

\forall x, y \in \{daughter, son\} \quad Observation_{x,y} = \frac{1}{n} \sum_{k=1}^{n} 1_{(X_{k},Y_{k}) = (x, y)}

for example:

Observation(daughter, daughter) = \frac{828}{3299} = 0.251

For Theory(x, y), X and Y are assumed to be independent under the null hypothesis, so the theoretical distribution is the product of the marginal distributions:

\forall x, y \in \{daughter, son\} \quad Theory_{x,y} = Observation^{X}_{x} \times Observation^{Y}_{y} = \left( \sum_{y' \in \{daughter, son\}} Observation_{x,y'} \right) \times \left( \sum_{x' \in \{daughter, son\}} Observation_{x',y} \right)

Thus, the theoretical probability for (son, son) is:

Theory(son, son) = \frac{857+801}{3299} \times \frac{857+813}{3299} = \frac{1658 \times 1670}{3299^{2}} = 0.254

Let’s calculate the test statistic using the following Python code:
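A minimal version of this computation, using scipy.stats.chi2_contingency (passing correction=False so that it returns the plain Pearson statistic defined above, without Yates’ continuity correction):

```python
# Chi-squared test of independence on the contingency table above
import numpy as np
from scipy.stats import chi2, chi2_contingency

table = np.array([[857, 801],    # child 1: son
                  [813, 828]])   # child 1: daughter

# correction=False: plain Pearson statistic, without Yates' continuity correction
stat, p_value, dof, expected = chi2_contingency(table, correction=False)
threshold = chi2.ppf(0.95, df=dof)

print(f"chi2 = {stat:.3f}, p-value = {p_value:.3f}, dof = {dof}")
print(f"threshold at 5%: {threshold:.3f}")
# chi2 ~= 1.52 < 3.841 and p-value ~= 0.22 > 0.05: H0 is not rejected
```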

In our case, the variables X and Y each take only 2 possible values (daughter or son), so the number of degrees of freedom is (2 − 1) × (2 − 1) = 1.

Therefore, we compare the test statistic to the chi-squared quantile with 1 degree of freedom using the chi2.ppf function from scipy.stats.

Since the test statistic is lower than the quantile and the p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis at the 5% level.

Thus, we conclude that the data are compatible with the genders of the first two children being independent.

What are its limits?

While the chi-squared test is very practical, it does have limitations: it can only detect the existence of a dependence; it measures neither its strength nor any causality.

It also relies on approximating the distribution of the Pearson statistic by the chi-squared distribution, which is valid only with a sufficient amount of data. In practice, the validity condition is as follows:

\forall x \in \mathbb{X} \quad n \times P_{theoretical}(x) (1- P_{theoretical}(x)) \geq 5
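As an illustration, this condition can be checked for the example above; the cell probabilities below are the products of the empirical marginals of the contingency table, rounded to three decimals:

```python
# Check the validity condition n * p * (1 - p) >= 5 for each cell
import numpy as np

n = 3299
p_theoretical = np.array([[0.254, 0.248],   # theoretical cell probabilities
                          [0.252, 0.246]])  # (products of the marginals)
valid = n * p_theoretical * (1 - p_theoretical) >= 5
print(valid.all())  # True: the chi-squared approximation is justified
```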

The Fisher exact test can address this limitation but requires significant computational power. In practice, it is often limited to 2×2 contingency tables.
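For comparison, a minimal sketch of Fisher’s exact test on the same 2×2 table, using scipy.stats.fisher_exact:

```python
# Fisher's exact test on the same 2x2 contingency table
from scipy.stats import fisher_exact

table = [[857, 801],
         [813, 828]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p-value = {p_value:.3f}")
# The p-value is close to the chi-squared one: independence is again not rejected
```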

Statistical tests are crucial in Data Science to assess the relevance of explanatory variables and validate modeling assumptions. You can find more information about the chi-squared test and other statistical tests in our module 104 – Exploratory Statistics.
