
Analysis of variance (ANOVA): a basic tool for data analysis


Analysis of Variance (ANOVA) is a straightforward and widely used statistical technique for examining the relationship between two or more variables, especially between an explanatory variable and a target (or dependent) variable. ANOVA helps us understand if the explanatory variable influences the target variable and in what way.

ANOVA is thus employed in diverse contexts and for various issues, ranging from marketing to scientific studies in multiple fields (medicine, biology, demography, etc.). Let's consider two concrete cases where ANOVA can be used.

The director of a chain of 80 stores wants to know whether increasing the brightness of advertising posters can have a positive impact on sales. He divides his stores into four groups and asks the first group to leave the brightness of its posters unchanged.

However, he instructs the second, third, and fourth groups of stores to increase the brightness of the advertising posters by 20%, 40%, and 60%, respectively.

One month later, he calculates the average sales for each of the four groups. He observes differences: the brightness level of the posters seems to have favored sales.

Demographers want to study the effect of educational level (below high school diploma, high school diploma, bachelor’s degree, master’s degree) on income.

Using a national study comprising 150,000 individuals across France, they calculate the average income for each of these educational levels. They find that the averages differ, and educational level seems to have a positive effect on income.

How can the store director and the demographers be sure that there is a significant relationship between the variables they are examining (poster brightness and sales on one hand, educational level and income on the other) and that the differences they have detected are real?

Fortunately, they can rely on a statistical test developed in 1918 by the British biologist and statistician Ronald A. Fisher: the Analysis of Variance (ANOVA).

What is ANOVA?

ANOVA is an inferential statistical technique developed to test the existence of a significant relationship between two variables in two or more groups. Specifically, it is used when we want to determine if an explanatory variable (in our examples, the brightness level of posters and the level of education) influences a dependent variable (in our examples, store sales and income).

It’s important to note that in the case of ANOVA, the explanatory variable is a categorical variable, meaning it contains values that represent a quality or characteristic that is not quantifiable. On the other hand, the target variable is a quantitative variable, which can be expressed as numerical values.

ANOVA follows the same logic as a mean-comparison test such as the t-test, but unlike the t-test it is not limited to two groups: it can compare several groups at once, which is its strength.

ANOVA tests the null hypothesis, which states that there is no significant difference between the groups being examined, against the alternative hypothesis, which asserts that the differences detected between the groups are indeed real. If the test leads us to reject the null hypothesis, we retain the alternative.

To do this, as its name suggests, ANOVA relates between-group variance to within-group variance.

Between-group variance indicates the variance that exists among the groups, such as the variance between the different groups defined by their level of education.

Within-group variance indicates the variance within each group defined by their level of education.

The fundamental idea of ANOVA is that the greater the ratio of between-group variance to within-group variance, the greater the likelihood that the observed differences between the groups are real.

In other words, if between-group variance is greater than within-group variance, it allows us to believe that the observed differences are genuinely related to the membership in different groups, and we can then reject the null hypothesis. The ratio of between-group variance to within-group variance is expressed as the F ratio.

How to calculate the F Ratio

In order to calculate the F ratio, we can break down our analysis of variance problem into several steps. We start by calculating the interclass (between-group) variance and the intraclass (within-group) variance.

To do this, we need to calculate the sum of squared deviations between groups (SCE, from the French somme des carrés des écarts).

The formula is as follows:

SCEinterclass = \sum_{k=1}^{K} n_{k} \times (\overline{Y_{k}}- \overline{Y})^{2}

where:

K = the number of groups
n_{k} = the number of observations in group k
\overline{Y_{k}} = the mean of group k
\overline{Y} = the overall mean

The between-group sum of squares (SCEinterclass) can also be understood as the portion of the total variation in the dependent variable that is explained by the independent variable.
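
As a quick illustration, here is how SCEinterclass could be computed with NumPy; the sales figures below are invented purely for the example:

```python
import numpy as np

# Hypothetical monthly sales for the four store groups (invented data)
groups = {
    "no change": np.array([102.0, 98.0, 105.0, 101.0, 99.0]),
    "+20%":      np.array([104.0, 107.0, 103.0, 108.0, 106.0]),
    "+40%":      np.array([110.0, 108.0, 112.0, 109.0, 111.0]),
    "+60%":      np.array([115.0, 118.0, 113.0, 117.0, 116.0]),
}

all_obs = np.concatenate(list(groups.values()))
grand_mean = all_obs.mean()

# SCEinterclass: each group's size times the squared gap between
# its mean and the overall mean, summed over the groups
sce_between = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups.values())
print(f"SCEinterclass = {sce_between:.2f}")
```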

Next, we will calculate the within-group sum of squares (SCEintraclass), which is the sum of squared deviations within the groups.

The formula to calculate the sum of squares of deviations within each group is as follows:

SCEintraclass = \sum_{k=1}^{K} \sum_{i=1}^{n_{k}} (Y_{ik}- \overline{Y_{k}})^{2}

where:

Y_{ik} = the score of individual i in group k
\overline{Y_{k}} = the mean of group k
Together, the interclass variance and the intraclass variance make up the total variance in our observations. This can be represented as follows:

SCEtotal = SCEinterclass + SCEintraclass
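
Continuing the sketch above, we can compute SCEintraclass and verify the decomposition numerically:

```python
# SCEintraclass: squared deviations of each observation from its group mean
sce_within = sum(((y - y.mean()) ** 2).sum() for y in groups.values())

# SCEtotal: squared deviations of every observation from the grand mean
sce_total = ((all_obs - grand_mean) ** 2).sum()

# The decomposition holds: SCEtotal = SCEinterclass + SCEintraclass
assert np.isclose(sce_total, sce_between + sce_within)
print(f"SCEintraclass = {sce_within:.2f}, SCEtotal = {sce_total:.2f}")
```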

Next, we can calculate our degrees of freedom. For SCEinterclass, the degrees of freedom are given by:

DDLinterclass = K – 1

where K is the number of groups. For SCEintraclass, the degrees of freedom are given by:

DDLintraclass = N – K

where:
  • N = the total number of observations
  • K = the number of groups
We can now calculate the average of the interclass squares by dividing SCEinterclass by the interclass degrees of freedom:

Average of interclass squares = SCEinterclass / DDLinterclass

We proceed in the same way to calculate the average of the intraclass squares:

Average of intraclass squares = SCEintraclass / DDLintraclass

We’ve reached the end of our journey and can finally calculate the F ratio (Fisher’s F):

F ratio = Average of interclass squares / Average of intraclass squares
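
Putting the steps together, the degrees of freedom, mean squares, and F ratio follow directly from the sums of squares computed in the earlier sketches; the result can be cross-checked against scipy.stats.f_oneway:

```python
from scipy import stats

K = len(groups)      # number of groups (4)
N = len(all_obs)     # total number of observations (20)

ms_between = sce_between / (K - 1)   # average of interclass squares
ms_within = sce_within / (N - K)     # average of intraclass squares
f_ratio = ms_between / ms_within

# p-value from the F distribution with (K - 1, N - K) degrees of freedom
p_value = stats.f.sf(f_ratio, K - 1, N - K)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")

# Cross-check against scipy's built-in one-way ANOVA
f_check, p_check = stats.f_oneway(*groups.values())
assert np.isclose(f_ratio, f_check)
```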

A high F ratio indicates that the interclass variance is greater than the intraclass variance. This increases the chances of rejecting the null hypothesis and being able to assert that there is indeed a difference between our groups of interest.

It’s important to point out that, in order to perform an ANOVA on our data, we need to check that they meet a number of conditions, including the normality of the distributions and the independence of our samples.

More specifically, the quantitative variable under examination needs to have a normal distribution; this is particularly important for small sample sizes. We also need to examine homoscedasticity: to perform an ANOVA, all the groups studied must have equal (or similar) variances. Finally, before performing an ANOVA, we need to check that the observations are independent.
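
These checks translate into a few lines of code; here is a minimal sketch, reusing the invented store data and scipy's standard tests (Shapiro-Wilk for normality, Levene for homoscedasticity):

```python
# Normality of each group (Shapiro-Wilk); a small p-value suggests non-normality
for name, y in groups.items():
    stat, p = stats.shapiro(y)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Homoscedasticity across groups (Levene's test); a large p-value
# is consistent with equal (or similar) variances
stat, p = stats.levene(*groups.values())
print(f"Levene p = {p:.3f}")
```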

Two-way ANOVA and post-hoc tests

The examples of ANOVA we’ve discussed so far examine the relationship between one explanatory variable (brightness level of posters or level of education) and one dependent variable. This simpler version of ANOVA is also known as one-way ANOVA. However, in many real-world scenarios, we are interested in assessing the impact of two or more variables on the dependent variable. For instance, we might want to know if an individual’s income is affected not only by their level of education but also by their gender. In such cases, we use a more complex version of ANOVA called two-way ANOVA.

When we have only one explanatory variable, we can calculate a single F ratio.

However, when significant differences are influenced by multiple independent variables, we need to calculate multiple F ratios. Two-way ANOVA allows us to evaluate the main effects of each independent variable and determine if there is an interaction between them.
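
As an illustration, here is how a two-way ANOVA could be run with statsmodels; the income, education, and gender data below are synthetic, invented purely for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic data: income by education level and gender (invented for the example)
rng = np.random.default_rng(0)
education = np.repeat(["<HS", "HS", "Bachelor", "Master"], 50)
gender = np.tile(["F", "M"], 100)
base = {"<HS": 20000, "HS": 25000, "Bachelor": 32000, "Master": 40000}
income = [base[e] + (1500 if g == "M" else 0) + rng.normal(0, 3000)
          for e, g in zip(education, gender)]
df = pd.DataFrame({"education": education, "gender": gender, "income": income})

# Two-way ANOVA: main effects of education and gender, plus their interaction
model = ols("income ~ C(education) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # one F ratio per effect
```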

ANOVA (one-way or two-way) helps us test for the existence of a significant difference between two or more groups. However, it does not tell us where this difference lies. In other words, using the example of the brightness level of advertising posters, if we observe that increasing brightness has a positive effect on sales, we may wonder which level of brightness is responsible for this increase.

We might hypothesize that only a 60% increase in brightness has a positive effect on sales, while 20% and 40% increases have no effect. To investigate such hypotheses, we need to use post-hoc tests. The most commonly used post-hoc tests are Tukey’s Honestly Significant Difference (HSD) test and the Bonferroni correction.
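
Returning to the poster-brightness example, Tukey's HSD test can be run with statsmodels to see which pairs of groups actually differ; this sketch reuses the invented sales data from the earlier examples:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Flatten the store data into (value, group label) form
values = np.concatenate(list(groups.values()))
labels = np.concatenate([[name] * len(y) for name, y in groups.items()])

# Pairwise comparisons with a 5% family-wise error rate
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```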

ANOVA (one-way or two-way) combined with these post-hoc tests provides a good understanding of the relationship between our variables of interest. These techniques are part of the toolkit that a Data Scientist can use daily to understand their data. They help determine if an explanatory variable influences a target variable and in what way. Therefore, gaining expertise in analysis of variance is an important step in embarking on a career as a Data Scientist.
