Analyzing Larger Contingency Tables

If you have a table larger than two-by-two, you can still use the χ² test for association. The problem in this case is presenting and interpreting the results.

The overall test, if statistically significant, simply shows that there is some association other than “chance” in the table. But it does not tell you what the association is.

Many times, it is fairly obvious what the association is because there are clear patterns in the data. In those cases, you can usually get by with a bit of slightly vague verbiage.

This association is evidently due to the generally increasing incidence of darker feather color with the larger breeds of parrot.

This type of verbiage, along with some graphical displays, may be more than enough to make the point.

What are your choices if you want more quantitative statements? Not many, unfortunately. You will likely have to use a log-linear model to make comparisons that are analogous to those available in the linear model situation.

Sometimes it is reasonable to collapse the data in the table to fewer categories in order to summarize the relationships.
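If you want something a little more quantitative without going all the way to a log-linear model, one informal trick is to look at the cell-by-cell (Pearson) residuals from the χ² test to see where the association is concentrated.  Here is a minimal sketch in Python using scipy; the parrot breed and feather color counts are entirely made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 table: rows = parrot breed size (small, medium, large),
# columns = feather color (light, medium, dark).  Counts are invented.
observed = np.array([[30, 15,  5],
                     [20, 25, 15],
                     [10, 20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")

# Pearson residuals: cells far from zero (say, beyond about +/- 2) point to
# where the association in the table is concentrated.
residuals = (observed - expected) / np.sqrt(expected)
print(np.round(residuals, 2))
```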

TGSGTS Progress Report

At this point, The Graduate Student’s Guide to Statistics is at the 41% completion mark.   To make this book more useful to you and to optimize my writing effort, I need your feedback!

Here is how it will work:  When you purchase the book from Leanpub, make sure I can see your e-mail address from Leanpub.  Then, let me know what you need!  If I can help you by writing up a section for the book, then you will have most likely helped many more people with the same issues.

Pilot Study

If you are planning a research program that requires long-term experiments, say on the order of several months or more, then you are really going to want to do some sort of pilot study. The same is true for studies that will require intensive resources of time or money, such as large surveys. If the studies that your research program will comprise are relatively small scale, then the need for a pilot study is less critical.

A pilot study is generally a limited version of your study, perhaps in a different population or even a different organism, or at a much smaller scale, that is used to shake out problems in advance of performing the real study. Flat out, pilot studies are always worth the time it takes to set them up and perform them.

Another great use of pilot studies is to gain information that can help you to better design your research studies. By getting some initial estimates of the variation in your data, you may be able to optimize your study in terms of sample size.

How the pilot study should be carried out depends heavily on the research study you are proposing. Remember, the goal is to shake out any serious problems that might kill your research study at an awkward time—such as after it has been underway for a year.

Using imagination and forethought is a necessary prerequisite to designing a research program. But, no matter how good your imagination, it will inevitably fall short of the reality that your study and your subsequent statistical analysis will actually experience. So, the goal of the pilot study is to find problems at all levels of the process, including:

  • Logistics
  • Recruitment
  • Randomization
  • Sample handling
  • Measurement
  • Data collection

Here are a few ideas to help flesh out the concept:

  • For a survey, a pilot study could consist of administering the survey to your classmates, friends, and relatives. Or, it could consist of administering the survey to a small sample from the population you intend to survey.
  • For an animal observational study, you could apply your techniques to watching the local animals or the neighborhood cats and dogs.
  • For a crop study, you can often shake out issues at the back end of the study by buying produce and running your tests on that as a proxy.

In cases where you cannot perform a pilot study of some sort, you may be able to get some of the benefit of one by a careful review of existing literature and data, or by asking other people who have run similar studies for their advice on the matter. In the best cases, your committee members are going to be able to provide that sort of information.

General Strategy for Getting a Sample Size

Here is a general strategy for figuring out the sample size for your study.  In essence, it presupposes that you have specified your statistical analysis at least down to the design and the exact statistical test you are going to use.

  1. Figure out the size of the signal that is important to you to detect.
  2. Determine the amount of variation to expect in the samples.
  3. Set the false positive rate (nearly always the traditional 5% level).
  4. Set the power.  A good starting level is 90%, but 80% is kind of traditional, too.
  5. Determine the sample size based on these quantities and the proposed statistical test.
  6. Adjust the sample size if needed for expected drop-out, non-completion, or similar issues.

If the total sample size you obtain is unrealistic in terms of time, cost, availability, or physical reality, then you need to make changes in your proposed design.  Or, you may have some serious soul-searching to do, and may need to rethink your plan of research.  Be happy you figured that out now!

Let’s talk about each point in turn.  First, you need to figure out the size of the signal that is important to you to detect.  This is going to be based on practical or academic considerations.  Examples of such considerations could be “Demonstrate a 15 lb improvement in yield per acre over control”, or “Show a 50% increase in mortality”, or “Show an increase of at least 5 points on the scale due to intervention”.  Examples like “Show a positive association of behavior with treatment” are going to require some sort of quantification to move forward.

As mentioned above, determining the amount of variation in the samples can be done in a few ways.  One very good way is to mine the available literature for similar studies, then collect estimates of variability from those studies.  Another good way is to conduct a pilot study with the express purpose of estimating the variability.  Another useful method is to elicit estimates from subject-matter experts.  There are some handy tricks for this that statisticians know.

The false positive rate is nearly always going to be set at the 5% level.  (That is, the Type I error rate, or α, is going to be 0.05.)  You could set it lower, but trying to set it higher is going to most likely just cause you problems.

Most people never consider the so-called “power” of their study.  The power in simple language is

The probability that my study will actually give me a statistically significant result if there really is something there to see.

In other words, if what you think is happening really is happening, how likely is your study to actually show it?

Now you can see why setting the power to 90% might be a good idea!  If there is really something to see in the study, don’t you want the study to have a good chance of seeing it?
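To make the whole strategy concrete, here is a minimal sketch of a sample size calculation for a simple two-group comparison with a t-test, done in Python with statsmodels.  The 10-point signal and 15-point standard deviation are invented numbers standing in for steps 1 and 2.

```python
from statsmodels.stats.power import TTestIndPower

signal = 10.0              # step 1: smallest difference worth detecting (invented)
sd = 15.0                  # step 2: expected standard deviation (invented)
effect_size = signal / sd  # standardized effect size (Cohen's d)

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05,    # step 3: false positive rate
                                   power=0.90,    # step 4: power
                                   alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.1f}")

# Step 6: inflate for expected drop-out, here assuming 20% attrition.
dropout = 0.20
print(f"With drop-out allowance: {n_per_group / (1 - dropout):.1f} per group")
```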

TGSGTS Progress Report

At this point, The Graduate Student’s Guide to Statistics is at about the 40% completion mark. I added a chapter on sample size considerations, and have made some minor edits to other sections of the book. It looks to me like it will be useful to have a set of chapters that cover how to write up the most common statistical methods. There are already stub chapters on ANOVA and regression.

Contrasts and Pairwise Comparisons

Contrasts are used in ANOVA to further analyze results from the ANOVA table.  Think of the significance tests in the ANOVA table as “the forest”, while contrasts are “the trees”.

You can design your study around a set of specific contrasts, as well. Often, your research hypotheses are going to be better stated as contrasts among various groups or treatments rather than as overall omnibus ANOVA tests of significance.

Pairwise comparison methods are just a type of contrast.  These essentially calculate a difference between every pair of groups or treatments.  Some common ones that you may know about are the Fisher Least Significant Difference, Tukey’s Honest Significant Difference, and the Bonferroni adjustment, but there are many more.

Let’s discuss these pairwise comparison methods a bit here.  The concept of pairwise comparison methods is simple:  We are going to try to explain the effects of some treatment or factor by comparing each possible level to every other possible level of the treatment.

Now, if we have only two levels of a treatment, there is only one pairwise comparison, and the ANOVA test of the treatment effect is it.  Done and stick a fork in it.

But, if we have more than two levels, then we start to have more pairs.  The number of pairs is given by the binomial coefficient “k choose 2”, so with k levels we have k(k-1)/2 pairs.  For example, with k=3 levels we have 3 pairs.  With k=4 levels we have 6 pairs.  And so on.

Remember that the huge bugaboo in statistical analysis is usually the Type I error.  This is also called the false positive rate and is usually symbolized by α.  The traditional level of α is 0.05, or one in twenty, or 5%.  That is the chance we are willing to accept of declaring a finding when there really is no finding, that is, of being fooled by randomness (a rant for another day).

Why 5%?  You can thank Sir Ronald Fisher for that.  Now, it has become a tradition, nearly set in stone.

So here’s the deal:  When we do a statistical test, we try to set the Type I error to 5%.  Then, we hope that this means that only one in twenty times are we going to declare a finding when actually there is nothing but noise in our data.

Think of it this way: The 5% is kind of like a lottery ticket.  A bad lottery ticket.  When you “win”, you win the booby prize of thinking that you found something interesting in your data to tell everyone about, but really you are sadly mistaken.  So, we actually don’t want to win this lottery.

Performing a single pairwise comparison is kind of like playing this lottery once.  Now, if I give you a single lottery ticket with a 5% chance of winning, you might be okay with that.  But, if I give you another ticket, now your chances of winning are going up.   Think about the situation with four levels and six pairwise comparisons—that’s like getting six lottery tickets!

Obviously, the more comparisons you do, the more you stand to win the booby prize.  What to do?

Well, here’s a possible solution.  Let’s make the chance of winning on any one ticket small enough so that your overall chance of winning even with all those tickets is controlled back down to the soothing 5% level.  Aha!  Now, we can make all those comparisons without losing sleep over our soon-to-be-destroyed reputation!
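You can put rough numbers on the lottery analogy.  If the comparisons were independent, the chance of at least one false positive across m comparisons would be 1 - (1 - α)^m, and the Bonferroni fix simply plays each ticket at α/m instead.  A quick sketch:

```python
# Family-wise error rate for k treatment levels, assuming independent tests
# (a simplification, but it shows the trend).
alpha = 0.05

for k in (2, 3, 4, 6):
    m = k * (k - 1) // 2             # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m      # chance of at least one false positive
    bonf = alpha / m                 # per-comparison level under Bonferroni
    print(f"k={k}: {m} pairs, FWER ~ {fwer:.3f}, Bonferroni per-test alpha = {bonf:.4f}")
```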

But, as with everything in life, sadly, there is a price to pay for this peace of mind.  That price is simple:  It is going to be much harder to actually find anything to be different.

So, with that in mind, we generally have three categories of pairwise comparisons.  We have conservative ones, liberal ones, and at least nominally exact ones (the libertarians).  Conservative tests correct too much, so that you are less likely to be able to detect differences but are definitely controlling your Type I error. Liberal tests correct too little or not at all, so you are more likely to spot differences but you are not controlling your Type I error.  Nominally exact ones attempt to exactly control the Type I error to the level you set (invariably 5%).

Why “nominally exact”?  Well, that’s a bit of a made-up term on my part.  I just want to alert you to the fact that the “exactness” of these tests comes with a heavy reliance upon a variety of assumptions in the background.

Here’s a little list to give you some idea of how to think about these types of tests:

  • The Fisher LSD is a liberal test.  It is more likely to find differences.  It under-controls false positives.
  • The Tukey HSD is a nominally exact test.  It controls the false positives to our specification (usually, to 5%).
  • The Bonferroni is a conservative test.  It is less likely to find differences.  It over-controls false positives.

One way to interpret all this is as follows:  If you can’t see it using the Fisher LSD tests, it isn’t there.  If you see something using the Bonferroni test, it’s there.

In exploratory studies where you don’t want to miss interesting results, you would probably want to use a more liberal test.  On the other hand, in a confirmatory study, you would probably want to use either an exact control or the over-control provided by the Bonferroni method.
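As a concrete illustration, here is a minimal sketch of how the Tukey HSD and a Bonferroni adjustment might be run in Python with statsmodels and scipy; the four group labels and the response values are invented.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
groups = np.repeat(["A", "B", "C", "D"], 10)   # 4 levels -> 6 pairwise comparisons
values = rng.normal(loc=[0.0, 0.0, 1.0, 1.5], scale=1.0, size=(10, 4)).T.ravel()

# Nominally exact control: Tukey's Honest Significant Difference.
print(pairwise_tukeyhsd(values, groups, alpha=0.05))

# Conservative control: raw pairwise t-tests with a Bonferroni adjustment.
labels = ["A", "B", "C", "D"]
pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
pvals = [ttest_ind(values[groups == a], values[groups == b]).pvalue
         for a, b in pairs]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f}, reject = {r}")
```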

Here is an interesting note:  You do not have to perform ANOVA first before you use the pairwise comparisons.   There is no theoretical justification for that.  But, do not waste your time trying to buck convention on this one—you’ll get nowhere.  So, just do what everyone else does:  Report the omnibus ANOVA test first as significant, then give the pairwise comparisons.

It used to be thought that running the ANOVA first, followed by the Fisher LSD tests was a way to control the Type I error.  However, it is not methodologically sound.  But, the procedure was called the “protected Fisher LSD comparisons” method, and so that has to be good, right?

Wrong.  Statisticians have a bad habit of naming things in ways that sound great.  After all, who can argue with statistics that are “most powerful”, or “unbiased”, or “protected”, or presented as “objective”?  This trend seems to have been started by Fisher himself, but I haven’t really researched the topic.  There is also another bad habit at work, that of creating a massive amount of jargon.  This jargon seems designed only to keep people out of statistics, if you ask me.

How to Report ANOVA Results

Here’s the basic layout that I recommend for writing about ANOVA results:

1. Provide the ANOVA table.  (This may be optional.)

2. Provide a graphical or tabular summary of the data, or both.

3. Give the results of general tests of significance.

4. Give specific quantitative or qualitative results.

By general tests of significance, I really mean the so-called “omnibus” tests of significance that are shown in the ANOVA table. These are generally tests that try to answer questions like “Does Factor X have any effect on growth?”  In some cases, you may need to report such things as overall model significance, which answers the really general question of “Is there any explanatory value of any of these factors compared to just taking the mean of all the data?”

These omnibus tests include tests of main effects and tests of interaction effects.  Without going into detail just yet, main effects are probably what most people think of when they wonder if some experimental factor has any effect.  However, interaction effects can be very interesting results themselves, so don’t be upset if you find statistically significant interactions!

An interaction effect means that the effect of one factor depends somehow on the effect of another factor.  Or, more precisely, the effect of one factor differs depending upon the level of some other factor.
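Here is a minimal sketch of what that looks like in a model formula, using Python’s statsmodels with a made-up two-factor data set: the * in the formula expands to both main effects plus their interaction, and the interaction row of the ANOVA table is the omnibus test of whether the effect of one factor depends on the level of the other.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Invented two-factor data: 'growth' under factor_a (low/high) and factor_b (x/y).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "factor_a": np.repeat(["low", "high"], 20),
    "factor_b": np.tile(np.repeat(["x", "y"], 10), 2),
})
df["growth"] = rng.normal(5, 1, size=len(df)) + (df["factor_a"] == "high") * 2

# C(factor_a) * C(factor_b) expands to both main effects plus the interaction.
fit = smf.ols("growth ~ C(factor_a) * C(factor_b)", data=df).fit()
print(anova_lm(fit, typ=2))  # one row per main effect, one for the interaction, plus residuals
```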

By specific quantitative results, I mean such items as estimates of differences or other specific quantities, perhaps along with confidence limits.  By qualitative results, I generally mean breakout analysis such as pairwise comparisons or specific contrasts created from the model.  These are sort of the “drill-down” of the model.

So, the general model for a unit of ANOVA would be something like this, assuming an imaginary study of IQ in response to Factor X:

4.3 Factor X Effect on IQ

Figure 1:  (The raw data or the data in summary form)

Table 1:   (The ANOVA table)

Figure 1 displays IQ scores by levels of Factor X.    Table 1 gives the results of ANOVA.  Factor X seems to have affected IQ (p=0.0498).  Specifically, low levels of Factor X were associated with an IQ drop of about 10 points with a 95% CI of (0.9, 21.1) when compared to the average of medium and high levels. Medium and high levels of Factor X were associated with very similar levels of IQ, 100 and 102 points respectively.
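Here is a minimal sketch, in Python with statsmodels, of how numbers like those in the write-up above might be produced: the omnibus ANOVA table first, then a specific contrast of the low level against the average of the medium and high levels.  The data, the group means, and the level names are all invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Invented data: IQ measured at three levels of Factor X.
rng = np.random.default_rng(3)
df = pd.DataFrame({"factor_x": np.repeat(["low", "medium", "high"], 15)})
true_means = {"low": 91, "medium": 100, "high": 102}
df["iq"] = [rng.normal(true_means[lvl], 12) for lvl in df["factor_x"]]

fit = smf.ols("iq ~ C(factor_x)", data=df).fit()
print(anova_lm(fit, typ=2))   # omnibus test of the Factor X effect

# Contrast: low versus the average of medium and high.  With the default
# treatment coding ('high' is the alphabetical baseline), the parameters are
# [Intercept, T.low, T.medium], so the contrast vector is [0, 1, -0.5].
contrast = fit.t_test([0, 1, -0.5])
print(contrast)               # estimate, 95% CI, and p-value for the contrast
```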

If there is more than one factor in the analysis, then things start to get more complicated.  However, you can use the same basic structure, either folding the other factor(s) into the mix or giving each factor its own little paragraph.  I tend to start with them separated, and after all analyses have been completed it may be that some general organizational scheme will present itself.

TGSGTS Blog Progress

Moving material from the book to the blog turns out to be a little harder than a simple copy and paste!  So, I have been settling for moving blocks of text, primarily.   That is going to leave out tables, figures, examples, and any little vignettes.

Design Considerations

In the following, “study” will be used to stand for experiments, trials, studies, surveys, and so forth.

Here are some basic design principles to keep in mind:

First, use the simplest design possible that will still answer your research questions.  There are many academically fascinating experimental designs that are a nightmare to set up, administer, carry out, and then analyze.  Simple designs usually have simple statistical analysis methods associated with them that are more robust to problems such as missing data or weird data distributions.

In contrast, more complex designs can create real problems in the statistical analysis when things go wrong.

For example, in a design that has a complex structure, it may turn out that, due to physical reality, some treatment combinations cannot actually be studied together.  In such cases the damage to the design may be so total as to make most of the data worthless.

Second, make sure that the design can actually produce data that has a chance of answering your research questions.  The statistical power of the design is one quality that can be examined for this; but, more fundamentally, you need to make sure that the data can even logically bear on your research question!

Use a design that can be collapsed onto itself if necessary.

You may need to collect data on multiple variables.  For example, a nutritional study may collect a complete lipid profile for each subject in a cross-over design, along with hormone measurements at sparser time points, plus information on diet and mood, along with the usual demographic and baseline characteristics.

Avoid designs that have complex structure such as balanced incomplete blocks, or heavily aliased fractional factorial designs, or the like, unless your study can be replicated with little investment of time and expense.

Repeated Measures

What are “repeated measures”?  Do I need to do MANOVA?  What is MANOVA?

Repeated measures is a bit of a slippery term that is actually not well-defined.

The classic use of the term repeated measures refers to measurements of the same quantity repeated across time.  For example, if we measure blood levels of glucose in a single subject on a daily basis, we have repeated measures of glucose.

Nowadays, the term has also come to encompass measurements of the same quantity repeated across space.  For example, if we analyze soil community composition in terms of a single parameter at different depths at the same location, we have repeated measures of the parameter.

A slightly stretched version of repeated measures encompasses measurements of different quantities at the same time and space.  An example of this would be measurement of gene expression profiles on a set of samples.  In this case, the different probes represent the expression levels of different genes.  This is an abuse of terminology, included only so you will be aware of the possibility.  The usual way to describe measurements of different quantities at the same time and space would be the term multivariate.

For the following discussion, assume that we have a classic repeated measures situation.  That is, we have individual subjects with measurements of the same quantity taken across time.  The times are assumed to be the same for all subjects, and ideally the times are equally spaced as well.

The conflation of repeated measures with multivariate analysis probably has more to do with the limited computational ability of days gone by than with anything else.  With limited computational power, it was necessary to create statistical methods that could be computed in reasonable time on reasonably sized data sets.

A multivariate analysis of variance (MANOVA) treats all of the repeated measures from each subject as a single vector of responses.  A MANOVA is appropriate for the case where the responses arbitrarily covary, but can still be thought of as a (vector-valued) linear function of the independent variables.  Since the covariance can be general, it certainly encompasses more restrictive situations such as this case, where you might reasonably expect values from the same subject to be correlated.
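For what it is worth, here is a minimal sketch of the MANOVA approach in Python with statsmodels, assuming the data are in wide format with one column per time point; the column names, group labels, and values are invented.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Invented wide-format data: glucose measured at three times for each subject.
rng = np.random.default_rng(4)
n = 30
df = pd.DataFrame({
    "group": np.repeat(["control", "treated"], n // 2),
    "t1": rng.normal(100, 10, n),
    "t2": rng.normal(102, 10, n),
    "t3": rng.normal(105, 10, n),
})

# The three repeated measures form the response vector for each subject.
mv = MANOVA.from_formula("t1 + t2 + t3 ~ group", data=df)
print(mv.mv_test())   # Wilks' lambda, Pillai's trace, etc. for the group effect
```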

With a little imagination, you can see how one could devise various tests for whether the repeated measures differ from each other.  For example, one could form a new data set consisting of differences between adjacent time points.  Then, a formal statistical test of whether all of the differences were equal to zero would do the job.

If we continue the statistical bad habit of making assumptions, then we may as well make some more.  Suppose that the correlation amongst the repeated measures takes a form that is known as compound symmetry.  Without going into details, this then implies a correlation structure known as sphericity.  The bottom line on sphericity is that the variance of the difference between any two time points is assumed to be the same.

You may ask what this heroic assumption brings.  Well, in this case, it brings a big computational simplification for the analysis.  A so-called univariate analysis may be run instead.  The analysis rather depends on the assumption of sphericity being true, so in cases where sphericity might be violated, some methods of correcting for the violation have been proposed.

An unfortunate limitation of the old-school methods of MANOVA and univariate repeated measures is that they rely on having complete data for each subject.  Using these methodologies requires you to essentially toss out the data from subjects with any missing data.  Also, the MANOVA approach requires many subjects per time point in order to provide decent estimation of the covariance structure of the data.

With the advent of linear mixed effects models, we can probably dispense for the most part with using MANOVA or the old school univariate analysis for repeated measures.  Essentially, the linear mixed effect model framework can provide us with pretty close to equivalent modeling capability with much greater flexibility. Also, since this framework is likelihood-based, it allows for missing data, which is a huge improvement.
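As an illustration of the mixed-model route, here is a minimal sketch in Python using statsmodels with a random intercept for each subject; the long-format column names and the data are invented.  Unlike the classic approaches, a subject with a missing time point keeps its remaining rows in the analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented long-format data: one row per subject per time point.
rng = np.random.default_rng(5)
n_subjects, n_times = 20, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_times),
    "time": np.tile(np.arange(n_times), n_subjects),
})
subject_effect = rng.normal(0, 5, n_subjects)
df["glucose"] = (100 + 2 * df["time"]
                 + subject_effect[df["subject"]]
                 + rng.normal(0, 3, len(df)))

# Random intercept per subject; time as a fixed effect.
model = smf.mixedlm("glucose ~ time", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())
```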

For more reading on repeated measures, check out these links: