Contrasts and Pairwise Comparisons

Contrasts are used in ANOVA to further analyze results from the ANOVA table.  Think of the significance tests in the ANOVA table as “the forest”, while contrasts are “the trees”.

You can design your study around a set of specific contrasts, as well. Often, your research hypotheses are going to be better stated as contrasts among various groups or treatments rather than as overall omnibus ANOVA tests of significance.
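To make that concrete, here is a minimal sketch of a single planned contrast, worked out by hand in Python.  The data and group names are made up purely for illustration; the weights (1, -1/2, -1/2) ask whether a control differs from the average of two treatments.

```python
# A made-up example: does the control group differ from the average of two treatments?
# The contrast weights c = (1, -1/2, -1/2) are applied to the three group means.
import numpy as np
from scipy import stats

control = np.array([10.1, 9.8, 10.5, 10.0, 9.7])
treat_a = np.array([11.2, 10.9, 11.5, 11.0, 10.8])
treat_b = np.array([10.9, 11.1, 11.4, 10.7, 11.3])
groups  = [control, treat_a, treat_b]
weights = np.array([1.0, -0.5, -0.5])

means = np.array([g.mean() for g in groups])
ns    = np.array([len(g) for g in groups])

# Pooled within-group variance (the ANOVA mean squared error) and its degrees of freedom
df_error = int(ns.sum()) - len(groups)
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_error

estimate = weights @ means                      # the estimated contrast
se = np.sqrt(mse * np.sum(weights ** 2 / ns))   # its standard error
t = estimate / se
p = 2 * stats.t.sf(abs(t), df_error)            # two-sided p-value
print(f"contrast = {estimate:.3f}, SE = {se:.3f}, t = {t:.2f}, p = {p:.4f}")
```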

Pairwise comparison methods are just a type of contrast.  These essentially calculate a difference between every pair of groups or treatments.  Some common ones that you may know about are the Fisher Least Significant Difference, Tukey’s Honestly Significant Difference, and the Bonferroni adjustment, but there are many more.

Let’s discuss these pairwise comparison methods a bit here.  The concept is simple:  We are going to try to explain the effects of some treatment or factor by comparing each possible level of the treatment to every other possible level.

Now, if we have only two levels of a treatment, there is only one pairwise comparison, and the ANOVA test of the treatment effect is it.  Done and stick a fork in it.

But, if we have more than two levels, then we start to have more pairs.  The number of pairs is given by the binomial coefficient “k choose 2”, so with k levels we have k(k-1)/2 pairs.  For example, with k=3 levels we have 3 pairs.  With k=4 levels we have 6 pairs.  And so on.
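A quick sanity check of that count, using Python’s math.comb:

```python
from math import comb

# Number of pairwise comparisons among k levels: "k choose 2" = k(k-1)/2
for k in (2, 3, 4, 5, 6):
    print(f"k = {k} levels -> {comb(k, 2)} pairs")
# k = 2 -> 1, k = 3 -> 3, k = 4 -> 6, k = 5 -> 10, k = 6 -> 15
```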

Remember that the huge bugaboo in statistical analysis is usually the Type I error.  This is also called the false positive rate and is usually symbolized by α.  The traditional level of α is 0.05, or one in twenty, or 5%.  That is the chance we are willing to allow of declaring a finding when there really is no finding; that is, of being fooled by randomness (a rant for another day).

Why 5%?  You can thank Sir Ronald Fisher for that.  Now, it has become a tradition, nearly set in stone.

So here’s the deal:  When we do a statistical test, we try to set the Type I error to 5%.  Then, we hope that this means that only one in twenty times are we going to declare a finding when actually there is nothing but noise in our data.
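If you like to see that “one in twenty” with your own eyes, here is a tiny simulation sketch.  Both samples come from the same population, so every “finding” is a false positive, and roughly 5% of the tests should produce one.

```python
import numpy as np
from scipy import stats

# Simulate many two-sample t-tests where there is truly NO difference between groups.
rng = np.random.default_rng(0)
n_sims, alpha = 10_000, 0.05
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, 30)   # both samples drawn from the same population
    b = rng.normal(0, 1, 30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
print(false_positives / n_sims)  # should land close to 0.05
```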

Think of it this way: The 5% is kind of like a lottery ticket.  A bad lottery ticket.  When you “win”, you win the booby prize of thinking that you found something interesting in your data to tell everyone about, but really you are sadly mistaken.  So, we actually don’t want to win this lottery.

Performing a single pairwise comparison is kind of like playing this lottery once.  Now, if I give you a single lottery ticket with a 5% chance of winning, you might be okay with that.  But, if I give you another ticket, now your chances of winning are going up.   Think about the situation with four levels and six pairwise comparisons—that’s like getting six lottery tickets!
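Here is the lottery arithmetic, under the simplifying assumption that the comparisons are independent (they are not, quite, but it makes the point):

```python
# Chance of at least one false positive ("winning" at least once) across m comparisons,
# each run at alpha = 0.05, assuming independence: 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 3, 6, 10):
    print(f"{m:>2} comparisons -> {1 - (1 - alpha) ** m:.3f}")
# 1 -> 0.050, 3 -> 0.143, 6 -> 0.265, 10 -> 0.401
```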

Obviously, the more comparisons you do, the more you stand to win the booby prize.  What to do?

Well, here’s a possible solution.  Let’s make the chance of winning on any one ticket small enough so that your overall chance of winning, even with all those tickets, is controlled back down to the soothing 5% level.  Aha!  Now, we can make all those comparisons without losing sleep over our soon-to-be-destroyed reputation!
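That is essentially the Bonferroni idea: shrink the per-comparison α to α/m so that the family of comparisons as a whole stays at (or under) the 5% level.  A small sketch:

```python
# Test each of m = 6 comparisons at alpha/m instead of alpha
alpha, m = 0.05, 6
per_test = alpha / m               # 0.05 / 6, about 0.0083 per comparison
overall = 1 - (1 - per_test) ** m  # about 0.049 under independence; m * per_test = 0.05 bounds it in general
print(f"per-comparison alpha = {per_test:.4f}, overall chance of a false positive = {overall:.3f}")
```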

But, as with everything in life, sadly, there is a price to pay for this peace of mind.  That price is simple:  It is going to be much harder to actually find anything to be different.

So, with that in mind, we generally have three categories of pairwise comparisons.  We have conservative ones, liberal ones, and at least nominally exact ones (the libertarians).  Conservative tests correct too much, so that you are less likely to be able to detect differences but are definitely controlling your Type I error. Liberal tests correct too little or not at all, so you are more likely to spot differences but you are not controlling your Type I error.  Nominally exact ones attempt to exactly control the Type I error to the level you set (invariably 5%).

Why “nominally exact”?  Well, that’s a bit of a made-up term on my part.  I just want to alert you to the fact that the “exactness” of these tests comes with a heavy reliance upon a variety of assumptions in the background.

Here’s a little list to give you some idea of how to think about these types of tests:

  • The Fisher LSD is a liberal test.  It is more likely to find differences.  It under-controls false positives.
  • The Tukey HSD is a nominally exact test.  It controls the false positives to our specification (usually, to 5%).
  • The Bonferroni is a conservative test.  It is less likely to find differences.  It over-controls false positives.

One way to interpret all this is as follows:  If you can’t see it using the Fisher LSD tests, it isn’t there.  If you see something using the Bonferroni test, it’s there.

In exploratory studies where you don’t want to miss interesting results, you would probably want to use a more liberal test.  On the other hand, in a confirmatory study, you would probably want to use either an exact control or the over-control provided by the Bonferroni method.
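If you want to see the three flavors side by side in software, here is one possible sketch in Python using scipy and statsmodels.  The data and group names are invented, and plain unadjusted t-tests stand in for the liberal (LSD-style) approach, since the real Fisher LSD uses the pooled ANOVA error term rather than per-pair variances.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Made-up data: three groups of 20 observations each
rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, 20)
b = rng.normal(11.5, 2.0, 20)
c = rng.normal(10.2, 2.0, 20)

# Liberal route: unadjusted pairwise t-tests (an LSD-like, no-correction approach)
pairs = [("A vs B", a, b), ("A vs C", a, c), ("B vs C", b, c)]
raw_p = [stats.ttest_ind(x, y).pvalue for _, x, y in pairs]

# Conservative route: Bonferroni adjustment of those same p-values
reject, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (label, _, _), p, pb in zip(pairs, raw_p, p_bonf):
    print(f"{label}: unadjusted p = {p:.4f}, Bonferroni-adjusted p = {pb:.4f}")

# "Nominally exact" route: Tukey's HSD on all pairs at once
values = np.concatenate([a, b, c])
labels = ["A"] * 20 + ["B"] * 20 + ["C"] * 20
print(pairwise_tukeyhsd(values, labels, alpha=0.05).summary())
```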

Here is an interesting note:  You do not have to perform the ANOVA before you use the pairwise comparisons.  There is no theoretical justification for requiring it.  But, do not waste your time trying to buck convention on this one; you’ll get nowhere.  So, just do what everyone else does:  Report the significant omnibus ANOVA test first, then give the pairwise comparisons.
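In code, that conventional write-up looks something like the sketch below (again with made-up data): the omnibus F-test first, then the pairwise follow-up.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up data for three treatment levels
rng = np.random.default_rng(2)
data = {"low": rng.normal(10, 2, 15), "mid": rng.normal(12, 2, 15), "high": rng.normal(11, 2, 15)}

# Step 1 (the convention): report the omnibus ANOVA
F, p = stats.f_oneway(*data.values())
print(f"Omnibus ANOVA: F = {F:.2f}, p = {p:.4f}")

# Step 2: follow up with the pairwise comparisons
values = np.concatenate(list(data.values()))
labels = np.repeat(list(data.keys()), [len(v) for v in data.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05).summary())
```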

It used to be thought that running the ANOVA first, followed by the Fisher LSD tests, was a way to control the Type I error.  However, it is not methodologically sound.  But, the procedure was called the “protected Fisher LSD comparisons” method, and so that has to be good, right?

Wrong.  Statisticians have a bad habit of naming things in ways that sound great.  After all, who can argue with statistics that are “most powerful”, or “unbiased”, or “protected”, or presented as “objective”?  This trend seems to have been started by Fisher himself, but I haven’t really researched the topic.  There is also another bad habit at work, that of creating a massive amount of jargon.  This jargon seems designed only to keep people out of statistics, if you ask me.
