Contrasts and Pairwise Comparisons

Contrasts are used in ANOVA to further analyze results from the ANOVA table.  Think of the significance tests in the ANOVA table as “the forest”, while contrasts are “the trees”.

You can design your study around a set of specific contrasts, as well. Often, your research hypotheses are going to be better stated as contrasts among various groups or treatments rather than as overall omnibus ANOVA tests of significance.

Pairwise comparison methods are just a type of contrast.  These essentially calculate a difference between every pair of groups or treatments.  Some common ones that you may know about are the Fisher Least Significant Difference, Tukey’s Honest Significant Difference, and the Bonferroni adjustment, but there are many more.

Let’s discuss these pairwise comparison methods a bit here.  The concept of pairwise comparison methods is simple:  We are going to try to explain the effects of some treatment or factor by comparing each possible level to every other possible level of the treatment.

Now, if we have only two levels of a treatment, there is only one pairwise comparison, and the ANOVA test of the treatment effect is it.  Done and stick a fork in it.

But, if we have more than two levels, then we start to have more pairs.  The general count is given by the binomial coefficient “k choose 2”, so with k levels we have k(k-1)/2 pairs.  For example, with k=3 levels we have 3 pairs.  With k=4 levels we have 6 pairs.  And so on.
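To make the counting concrete, here is a minimal Python sketch (the factor levels are invented for illustration) that enumerates the pairs directly and checks the k(k-1)/2 count:

```python
from itertools import combinations
from math import comb

levels = ["low", "medium", "high", "very high"]  # hypothetical k = 4 levels

pairs = list(combinations(levels, 2))    # every unordered pair of levels
print(len(pairs), comb(len(levels), 2))  # both print 6, i.e. k(k-1)/2 with k = 4

for a, b in pairs:
    print(f"{a} vs. {b}")
```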

Remember that the huge bugaboo in statistical analysis is usually the Type I error.  This is also called the false positive rate and is usually symbolized by α.  The traditional level of α is 0.05, or one in twenty, or 5%.  That is the chance we are willing to accept of declaring a finding when there really is no finding, that is, of being fooled by randomness (a rant for another day).

Why 5%?  You can thank Sir Ronald Fisher for that.  Now, it has become a tradition, nearly set in stone.

So here’s the deal:  When we do a statistical test, we try to set the Type I error to 5%.  Then, we hope that this means that only one in twenty times are we going to declare a finding when actually there is nothing but noise in our data.

Think of it this way: The 5% is kind of like a lottery ticket.  A bad lottery ticket.  When you “win”, you win the booby prize of thinking that you found something interesting in your data to tell everyone about, but really you are sadly mistaken.  So, we actually don’t want to win this lottery.

Performing a single pairwise comparison is kind of like playing this lottery once.  Now, if I give you a single lottery ticket with a 5% chance of winning, you might be okay with that.  But, if I give you another ticket, now your chances of winning are going up.   Think about the situation with four levels and six pairwise comparisons—that’s like getting six lottery tickets!

Obviously, the more comparisons you do, the more you stand to win the booby prize.  What to do?

Well, here’s a possible solution.  Let’s make the chance of winning on any one ticket small enough so that your overall chance of winning, even with all those tickets, is controlled back down to the soothing 5% level.  Aha!  Now we can make all those comparisons without losing sleep over our soon-to-be-destroyed reputation!
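To put rough numbers on the lottery analogy, here is a small Python sketch of the arithmetic, under the simplifying assumption that the comparisons are independent (real pairwise comparisons are correlated, so this is only a caricature).  It shows how the chance of at least one false positive grows with the number of comparisons, and how a Bonferroni-style per-comparison level pulls it back toward 5%:

```python
alpha = 0.05  # the familywise Type I error level we want to control

for m in (1, 3, 6, 10):  # number of pairwise comparisons ("lottery tickets")
    # Chance of at least one false positive if each test is run at 5%,
    # assuming (unrealistically) that the tests are independent
    fwer_uncorrected = 1 - (1 - alpha) ** m
    # Bonferroni: run each individual test at alpha/m instead
    per_test = alpha / m
    fwer_corrected = 1 - (1 - per_test) ** m
    print(f"m={m:2d}  uncorrected FWER={fwer_uncorrected:.3f}  "
          f"per-test alpha={per_test:.4f}  corrected FWER={fwer_corrected:.3f}")
```

With six comparisons, the uncorrected chance of at least one false positive is already about 26% under these assumptions, while the Bonferroni-adjusted version sits just under 5%.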

But, as with everything in life, sadly, there is a price to pay for this peace of mind.  That price is simple:  It is going to be much harder to actually find anything to be different.

So, with that in mind, we generally have three categories of pairwise comparisons.  We have conservative ones, liberal ones, and at least nominally exact ones (the libertarians).  Conservative tests correct too much, so that you are less likely to be able to detect differences but are definitely controlling your Type I error. Liberal tests correct too little or not at all, so you are more likely to spot differences but you are not controlling your Type I error.  Nominally exact ones attempt to exactly control the Type I error to the level you set (invariably 5%).

Why “nominally exact”?  Well, that’s a bit of a made-up term on my part.  I just want to alert you to the fact that the “exactness” of these tests comes with a heavy reliance upon a variety of assumptions in the background.

Here’s a little list to give you some idea of how to think about these types of tests:

  • The Fisher LSD is a liberal test.  It is more likely to find differences.  It under-controls false positives.
  • The Tukey HSD is a nominally exact test.  It controls the false positives to our specification (usually, to 5%).
  • The Bonferroni is a conservative test.  It is less likely to find differences.  It over-controls false positives.

One way to interpret all this is as follows:  If you can’t see it using the Fisher LSD tests, it isn’t there.  If you see something using the Bonferroni test, it’s there.

In exploratory studies where you don’t want to miss interesting results, you would probably want to use a more liberal test.  On the other hand, in a confirmatory study, you would probably want to use either an exact control or the over-control provided by the Bonferroni method.
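To show how these three flavors look side by side, here is a hedged Python sketch using SciPy and statsmodels, with made-up data and group names.  The unadjusted two-sample t-tests stand in for the spirit of the Fisher LSD (the real LSD uses the pooled error term from the ANOVA), the Bonferroni adjustment is applied to those same p-values, and Tukey’s HSD is the nominally exact option:

```python
from itertools import combinations
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Invented data: three groups of 20 observations each
rng = np.random.default_rng(1)
groups = {"A": rng.normal(100, 10, 20),
          "B": rng.normal(105, 10, 20),
          "C": rng.normal(112, 10, 20)}

# Omnibus one-way ANOVA (the "forest")
f_stat, p_omnibus = stats.f_oneway(*groups.values())
print(f"Omnibus ANOVA: F={f_stat:.2f}, p={p_omnibus:.4f}")

# Liberal end: unadjusted pairwise t-tests (in the spirit of Fisher's LSD)
names = list(groups)
raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue
         for a, b in combinations(names, 2)]

# Conservative end: Bonferroni adjustment of those same p-values
p_bonf = multipletests(raw_p, alpha=0.05, method="bonferroni")[1]
for (a, b), p_raw, p_adj in zip(combinations(names, 2), raw_p, p_bonf):
    print(f"{a} vs {b}: unadjusted p={p_raw:.4f}, Bonferroni p={p_adj:.4f}")

# Nominally exact: Tukey's HSD on the same data
values = np.concatenate(list(groups.values()))
labels = np.repeat(names, [len(v) for v in groups.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```

Run on the same data, the unadjusted p-values will typically flag the most pairs and the Bonferroni-adjusted ones the fewest, with Tukey’s HSD somewhere in between.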

Here is an interesting note:  You do not have to perform ANOVA first before you use the pairwise comparisons.  There is no theoretical justification for that requirement.  But, do not waste your time trying to buck convention on this one; you’ll get nowhere.  So, just do what everyone else does:  Report the omnibus ANOVA test first, then give the pairwise comparisons.

It used to be thought that running the ANOVA first, followed by the Fisher LSD tests was a way to control the Type I error.  However, it is not methodologically sound.  But, the procedure was called the “protected Fisher LSD comparisons” method, and so that has to be good, right?

Wrong.  Statisticians have a bad habit of naming things in ways that sound great.  After all, who can argue with statistics that are “most powerful”, or “unbiased”, or “protected”, or presented as “objective”?  This trend seems to have been started by Fisher himself, but I haven’t really researched the topic.  There is also another bad habit at work, that of creating a massive amount of jargon.  This jargon seems designed only to keep people out of statistics, if you ask me.

How to Report ANOVA Results

Here’s the basic layout that I recommend for writing about ANOVA results:

1. Provide the ANOVA table.  (This may be optional.)

2. Provide a graphical or tabular summary of the data, or both.

3. Give the results of general tests of significance.

4. Give specific quantitative or qualitative results.

By general tests of significance, I really mean the so-called “omnibus” tests of significance that are shown in the ANOVA table. These are generally tests that try to answer questions like “Does Factor X have any effect on growth?”  In some cases, you may need to report such things as overall model significance, which answers the really general question of “Is there any explanatory value of any of these factors compared to just taking the mean of all the data?”

These omnibus tests include tests of main effects and tests of interaction effects.  Without going into detail just yet, main effects are probably what most people think of when they wonder whether some experimental factor has any effect.  However, interaction effects can be very interesting results themselves, so don’t be upset if you find statistically significant interactions!

An interaction effect means that the effect of one factor depends somehow on another factor.  Or, more precisely, the effect of one factor differs depending upon the level of some other factor.
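As a sketch of what the omnibus main-effect and interaction tests look like in code, here is a small Python example using statsmodels with an invented two-factor data set (the column names y, A, and B are hypothetical); the formula C(A) * C(B) expands into both main effects plus the A:B interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented two-factor data: factor A (2 levels) crossed with factor B (3 levels)
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "A": np.repeat(["a1", "a2"], 30),
    "B": np.tile(np.repeat(["b1", "b2", "b3"], 10), 2),
})
df["y"] = (
    50
    + 5 * (df["A"] == "a2")                          # main effect of A
    + 3 * (df["B"] == "b3")                          # main effect of B
    + 4 * ((df["A"] == "a2") & (df["B"] == "b3"))    # built-in A:B interaction
    + rng.normal(0, 2, len(df))
)

# Fit the two-factor model and print the omnibus tests for A, B, and A:B
model = smf.ols("y ~ C(A) * C(B)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```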

By specific quantitative results, I mean such items as estimates of differences or other specific quantities, perhaps along with confidence limits.  By qualitative results, I generally mean breakout analysis such as pairwise comparisons or specific contrasts created from the model.  These are sort of the “drill-down” of the model.
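For the drill-down itself, here is one hedged way to pull a specific estimated difference and its confidence limits out of a fitted model, again with invented data: a one-factor statsmodels fit and a contrast vector built against the model’s coefficient names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical one-factor data: IQ-like scores at three levels of "Factor X"
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "x": np.repeat(["low", "medium", "high"], 15),
    "iq": np.concatenate([rng.normal(m, 8, 15) for m in (92, 100, 102)]),
})

fit = smf.ols("iq ~ C(x)", data=df).fit()
print(fit.params.index.tolist())  # see which coefficients the model actually has

# Build a contrast vector by coefficient name: the "low minus medium" difference
contrast = pd.Series(0.0, index=fit.params.index)
contrast["C(x)[T.low]"] = 1.0
contrast["C(x)[T.medium]"] = -1.0

result = fit.t_test(contrast.values)
print(result)                       # estimate, standard error, t, p-value
print(result.conf_int(alpha=0.05))  # 95% confidence limits for the difference
```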

So, the general template for a unit of ANOVA reporting would be something like this, assuming an imaginary study of IQ in response to Factor X:

4.3 Factor X Effect on IQ

Figure 1:  (The raw data or the data in summary form)

Table 1:   (The ANOVA table)

Figure 1 displays IQ scores by levels of Factor X.    Table 1 gives the results of ANOVA.  Factor X seems to have affected IQ (p=0.0498).  Specifically, low levels of Factor X were associated with an IQ drop of about 10 points with a 95% CI of (0.9, 21.1) when compared to the average of medium and high levels. Medium and high levels of Factor X were associated with very similar levels of IQ, 100 and 102 points respectively.

If there is more than one factor in the analysis, then things start to get more complicated.  However, you can use the same basic structure, either folding the other factor(s) into the mix or giving each factor its own little paragraph.  I tend to start with them separated, and after all analyses have been completed it may be that some general organizational scheme will present itself.