General Strategy for Getting a Sample Size

Here is a general strategy for figuring out the sample size for your study.  In essence, it presupposes that you have specified your statistical analysis at least down to the design and the exact statistical test you are going to use.

  1. Figure out the size of the signal that is important to you to detect.
  2. Determine the amount of variation to expect in the samples.
  3. Set the false positive rate (nearly always the traditional 5% level).
  4. Set the power.  A good starting level is 90%, but 80% is kind of traditional, too.
  5. Determine the sample size based on these quantities and the proposed statistical test.
  6. Adjust the sample size if needed for expected drop-out, non-completion, or similar issues.

If the total sample size you obtain is unrealistic in terms of time, cost, availability, or physical reality, then you need to make changes in your proposed design.  Or, you may have some serious soul-searching to do, and may need to rethink your plan of research.  Be happy you figured that out now!

Let’s talk about each point in turn.  First, you need to figure out the size of the signal that is important to you to detect.  This is going to be based on practical or academic considerations.  Examples of such considerations could be “Demonstrate a 15 lb improvement in yield per acre over control”, or “Show a 50% increase in mortality”, or “Show an increase of at least 5 points on the scale due to intervention”.  Examples like “Show a positive association of behavior with treatment” are going to require some sort of quantification to move forward.

As mentioned above, determining the amount of variation in the samples can be done in a few ways.  One very good way is to mine the available literature for similar studies, then collect estimates of variability from those studies.  Another good way is to conduct a pilot study with the express purpose of estimating the variability.  Another useful method is to elicit estimates from subject-matter experts.  There are some handy tricks for this that statisticians know.
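
For instance, if several published studies report standard deviations for the same outcome, one quick way to combine them is a degrees-of-freedom-weighted pooled standard deviation.  The sketch below uses made-up numbers, not real studies.

    import numpy as np

    sds = np.array([11.2, 13.5, 9.8])   # standard deviations reported in the literature (placeholders)
    ns = np.array([24, 40, 18])         # the corresponding sample sizes

    # Pool the variances, weighting each study by its degrees of freedom.
    pooled_var = np.sum((ns - 1) * sds**2) / np.sum(ns - 1)
    print(np.sqrt(pooled_var))          # pooled SD to carry into the power calculation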

The false positive rate is nearly always going to be set at the 5% level.  (That is, the Type I error rate, or α, is going to be 0.05.)  You could set it lower, but trying to set it higher is most likely just going to cause you problems.

Most people never consider the so-called “power” of their study.  The power in simple language is

The probability that my study will actually give me a statistically significant result if there really is something there to see.

In other words, if what you think is happening really is happening, how likely is your study to actually show it?

Now you can see why setting the power to 90% might be a good idea!  If there is really something to see in the study, don’t you want the study to have a good chance of seeing it?
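
To make the strategy concrete, here is a minimal sketch of steps 1 through 6 for a two-group comparison analyzed with a two-sample t test, using Python's statsmodels package.  The signal, standard deviation, and drop-out figures are placeholders; substitute your own.

    from statsmodels.stats.power import TTestIndPower

    signal = 5.0    # step 1: smallest difference worth detecting (say, 5 points on the scale)
    sd = 12.0       # step 2: expected standard deviation, from the literature or a pilot study
    alpha = 0.05    # step 3: false positive rate
    power = 0.90    # step 4: power

    # Step 5: solve for the per-group sample size of a two-sample t test.
    effect_size = signal / sd   # Cohen's d
    n_per_group = TTestIndPower().solve_power(
        effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
    )
    print(n_per_group)          # roughly 120 per group for these placeholder numbers

    # Step 6: inflate for an expected 10% drop-out rate.
    print(n_per_group / 0.9)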

Design Considerations

In the following, “study” will be used to stand for experiments, trials, studies, surveys, and so forth.

Here are some basic design principles to keep in mind:

First, use the simplest design possible that will still answer your research questions.  There are many academically fascinating experimental designs that are a nightmare to set up, administer, carry out, and then analyze.  Simple designs usually have simple statistical analysis methods associated with them that are more robust to problems such as missing data or weird data distributions.

In contrast, more complex designs can create real problems in the statistical analysis when things go wrong.

For example, in a design that has a complex structure, it may turn out that, due to physical reality, some treatment combinations cannot actually be studied together.  In such cases the damage to the design may be so total as to make most of the data worthless.

Second, make sure that the design can actually produce data that has a chance of answering your research questions.  The statistical power of the design is one quality that can be examined for this; but, more fundamentally, you need to make sure that the data can even logically bear on your research question!

Third, use a design that can be collapsed onto itself if necessary; that is, one that still supports a simpler but valid analysis if some factors, time points, or measurements cannot be collected as planned.

You may need to collect data on multiple variables.  For example, a nutritional study may collect a complete lipid profile for each subject in a cross-over design, along with hormone measurements at sparser time points, plus information on diet and mood, along with the usual demographic and baseline characteristics.

Fourth, avoid designs that have complex structure, such as balanced incomplete blocks, heavily aliased fractional factorial designs, or the like, unless your study can be replicated with little investment of time and expense.

Repeated Measures

What are “repeated measures”?  Do I need to do MANOVA?  What is MANOVA?

Repeated measures is a bit of a slippery term that is actually not well-defined.

The classic use of the term repeated measures refers to measurements of the same quantity repeated across time.  For example, if we measure blood levels of glucose in a single subject on a daily basis, we have repeated measures of glucose.

Nowadays, the term has also come to encompass measurements of the same quantity repeated across space.  For example, if we analyze soil community composition in terms of a single parameter at different depths at the same location, we have repeated measures of the parameter.

A slightly stretched version of repeated measures encompasses measurements of different quantities at the same time and space.  An example of this would be measurement of gene expression profiles on a set of samples.  In this case the different probes represent expression levels of different genes.  This is an abuse of terminology, included only so you will be aware of the possibility.  The usual term for measurements of different quantities at the same time and space is multivariate.

For the following discussion, assume that we have a classic repeated measures situation.  That is, we have individual subjects with measurements of the same quantity taken across time.  The times are assumed to be the same for all subjects and, further, had probably best be equally spaced.

The conflation of repeated measures with multivariate analysis probably has its roots in the limited computing power of days gone by.  With that limitation, it was necessary to create statistical methods that could be computed in reasonable time on reasonably sized data sets.

A multivariate analysis of variance (MANOVA) treats all of the repeated measures from each subject as a single vector of responses.  A MANOVA is appropriate for the case where the responses covary arbitrarily, but can still be thought of as a (vector-valued) linear function of the independent variables.  Since the covariance can be general, it certainly encompasses more restrictive situations such as this case, where you might reasonably expect values from the same subject to be correlated.
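
As one illustration (not the only way to do it), here is how the vector-of-responses view looks with the MANOVA class in Python's statsmodels.  The data are simulated and the column names (t1, t2, t3, group) are invented for the sketch.

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(0)
    n = 40
    df = pd.DataFrame({
        "group": np.repeat(["control", "treated"], n // 2),
        "t1": rng.normal(10, 2, n),   # measurement at time 1
        "t2": rng.normal(11, 2, n),   # measurement at time 2
        "t3": rng.normal(12, 2, n),   # measurement at time 3
    })

    # Each subject's three measurements form one response vector; "group" is the
    # independent variable.  mv_test() reports Wilks' lambda, Pillai's trace, etc.
    fit = MANOVA.from_formula("t1 + t2 + t3 ~ group", data=df)
    print(fit.mv_test())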

With a little imagination, you can see how one could devise various tests for whether the repeated measures differ from each other.  For example, one could form a new data set consisting of differences between adjacent time points.  Then, a formal statistical test of whether all of the differences were equal to zero would do the job.
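
One way to carry out that idea is a one-sample Hotelling T-squared test that the vector of adjacent-time differences has mean zero.  The helper below is hand-rolled for illustration, and the data are simulated placeholders.

    import numpy as np
    from scipy import stats

    def hotelling_one_sample(diffs):
        """Test H0: the mean of each column of diffs is zero (diffs is an n x p matrix)."""
        n, p = diffs.shape
        mean = diffs.mean(axis=0)
        cov = np.cov(diffs, rowvar=False)
        t2 = n * mean @ np.linalg.solve(cov, mean)        # Hotelling's T^2
        f_stat = (n - p) / (p * (n - 1)) * t2             # convert to an F statistic
        return f_stat, stats.f.sf(f_stat, p, n - p)

    # Made-up example: 30 subjects measured at 4 time points.
    rng = np.random.default_rng(1)
    y = rng.normal(10, 2, size=(30, 4))
    diffs = np.diff(y, axis=1)           # differences between adjacent time points
    print(hotelling_one_sample(diffs))   # F statistic and p-value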

If we continue the statistical bad habit of making assumptions, then we may as well make some more.  Suppose that the correlation amongst the repeated measures takes a form that is known as compound symmetry.  Without going into details, this then implies a correlation structure known as sphericity.  The bottom line on sphericity is that it implies the variance of the difference between any two time points is the same.
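
For the curious, the calculation behind that statement is short.  Under compound symmetry every measurement has the same variance and every pair of measurements has the same correlation, so the variance of any pairwise difference works out to the same constant:

    \operatorname{Var}(Y_i - Y_j)
      = \operatorname{Var}(Y_i) + \operatorname{Var}(Y_j) - 2\operatorname{Cov}(Y_i, Y_j)
      = \sigma^2 + \sigma^2 - 2\rho\sigma^2
      = 2\sigma^2(1 - \rho) \quad \text{for every pair } i \neq j .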

You may ask what this heroic assumption brings.  Well, in this case, it brings a big computational simplification for the analysis.  A so-called univariate analysis may be run instead.  The analysis rather depends on the assumption of sphericity being true, so for cases where sphericity might be violated, some methods of correcting for the violation (such as the Greenhouse-Geisser and Huynh-Feldt adjustments) have been proposed.
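
For reference, the univariate flavor is available in statsmodels as AnovaRM; a sketch with invented long-format glucose data follows.  Note that AnovaRM itself assumes sphericity and needs complete, balanced data for every subject; the corrections mentioned above would have to come from elsewhere.

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Simulate 20 subjects, each measured at 4 time points (placeholder numbers).
    rng = np.random.default_rng(2)
    subjects = np.repeat(np.arange(20), 4)
    time = np.tile([1, 2, 3, 4], 20)
    glucose = rng.normal(100, 10, size=80) + 2 * time   # made-up trend over time

    long = pd.DataFrame({"subject": subjects, "time": time, "glucose": glucose})

    # One within-subject factor (time); prints the univariate repeated-measures ANOVA table.
    res = AnovaRM(long, depvar="glucose", subject="subject", within=["time"]).fit()
    print(res)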

An unfortunate limitation of the old-school methods of MANOVA and univariate repeated measures is that they rely on having complete data for each subject.  Using these methodologies requires you to essentially toss out the data from subjects with any missing data.  Also, the MANOVA approach requires many subjects relative to the number of time points in order to provide decent estimation of the covariance structure of the data.

With the advent of linear mixed effects models, we can probably dispense for the most part with using MANOVA or the old-school univariate analysis for repeated measures.  Essentially, the linear mixed effects model framework can provide us with pretty close to equivalent modeling capability with much greater flexibility.  Also, since this framework is likelihood-based, it accommodates subjects with missing data rather than discarding them outright, which is a huge improvement.
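
Here is a minimal mixed-model sketch, again with invented glucose data in long format, using a random intercept per subject in statsmodels.  A few dropped rows do not force us to discard whole subjects.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate 20 subjects at 4 time points with a subject-level random intercept (placeholders).
    rng = np.random.default_rng(3)
    n_subj, n_time = 20, 4
    subj = np.repeat(np.arange(n_subj), n_time)
    time = np.tile(np.arange(1, n_time + 1), n_subj)
    y = 100 + 2 * time + rng.normal(0, 3, n_subj)[subj] + rng.normal(0, 5, n_subj * n_time)

    long = pd.DataFrame({"subject": subj, "time": time, "glucose": y})
    long = long.drop(index=[3, 17, 42])   # a few missing observations are tolerated here

    # Random intercept for each subject; fixed effect of time.
    model = smf.mixedlm("glucose ~ time", data=long, groups=long["subject"])
    print(model.fit().summary())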

For more reading on repeated measures, check out these links: