The General Purpose of Data Transformations

For the most part, data transformations are used to allow us to apply linear model techniques to the data.

I have come to the following thought process on transformations for this purpose.  There are four primary reasons for transformation, listed here in rough order of desirability from my point of view:

Scientific:  It may be that scientific theory—whether physical, chemical, or biological—has already mathematically described a relationship among the variables.  In that case, a transformation of the data may make the relationship linear.

Operational:  In many settings, for some unknown reason or sometimes for a good reason, a data transformation will linearize the data quite well.  For example, when data span several orders of magnitude, most measurement systems will naturally show larger variability around larger measurements—a good example of this is found in assay systems that use serial dilution.

Statistical: Now, in some cases, if you assume a probability model, you can find that some transformations do something interesting called “variance stabilization”.  (Well, it’s interesting if you are a statistician, anyway.)  The reason this is relevant is that the linear model methods all pretty much require equal variances in each group (a slight simplification).  The variance stabilizing transformation can create that situation.
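
Here is a quick sketch of what variance stabilization looks like in practice.  The data are simulated, purely for illustration: Poisson-style counts have variance equal to their mean, and the square root is the classic variance-stabilizing transformation for that situation.

    # Simulated illustration: for Poisson counts the variance equals the mean,
    # so groups with larger means have larger variances; the square-root
    # transform roughly equalizes the variances across groups.
    import numpy as np

    rng = np.random.default_rng(1)
    for m in [5, 20, 80]:
        x = rng.poisson(m, size=10_000)
        print(f"mean {m:3d}:  var(x) = {x.var():6.1f}   var(sqrt(x)) = {np.sqrt(x).var():.3f}")
    # var(x) tracks the mean, while var(sqrt(x)) stays near 0.25 for every group.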

Empirical: Finally, at the bottom of the barrel, you have the bright idea of “Let’s just transform the hell out of the data until we find a transformation that linearizes it!”  There is a certain charm to this, but I am less convinced by that line of reasoning.  It is usually carried out either haphazardly (“Let’s keep trying transformations till we find a good one.”) or in something dressed up more formally, like finding the Box-Cox transformation.
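
For the record, the dressed-up version is easy enough to run.  This is only a sketch on simulated data, using scipy.stats.boxcox to pick the power by maximum likelihood; the numbers are illustrative, not from any real study.

    # Minimal sketch: let Box-Cox pick a power transformation by maximum
    # likelihood (simulated positive data, illustrative only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    y = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # skewed, positive data

    y_transformed, lam = stats.boxcox(y)               # lam is the estimated lambda
    print(f"estimated lambda: {lam:.2f}")              # near 0, i.e. close to a log transform
    # lambda = 1 means leave the data alone, lambda = 0 corresponds to the log,
    # lambda = 0.5 to the square root, and so on.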

I rank the operational reason above the statistical on the grounds that there is something real happening to create the need for a data transformation.  It is important to stay grounded in reality and avoid drifting into the nether worlds of statistical theory.

In fact, there is a pronounced tendency in statistics to elevate statistical theory to the level of reality.  Perhaps there is some deep-seated psychological reason for this.  We need not concern ourselves with this, but simply try to recognize that much of statistical critique is based on this idea of making assumptions about the real world that may or may not be warranted.

The Box-Cox transformation is to me an example of putting the statistical cart before the reality horse: It gives primary importance to the supposed existence of normally distributed errors and transforms the data to suit.  While one might perhaps make a Bayesian argument for this, in fact this is never done.

Untransforming

There is a major problem in transforming data.  All of your analysis is now on a scale that is different from your original scale!  The units are now transformed units.  And, the nonlinearity of the transformation is going to make trouble for you.  For example, you may want to present means and standard deviations on the original scale, but these are going to seem off compared to results on the transformed scale.
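
A small simulated example shows what I mean by “seem off”.  Back-transforming the mean of the logged data gives you the geometric mean, which is not the arithmetic mean you would report on the original scale; the gap grows with the skewness of the data.

    # Why log-scale summaries "seem off": the back-transformed mean of the logs
    # is the geometric mean, not the arithmetic mean (simulated skewed data).
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.lognormal(mean=2.0, sigma=1.0, size=5_000)

    arithmetic_mean = x.mean()
    geometric_mean = np.exp(np.log(x).mean())
    print(f"arithmetic mean: {arithmetic_mean:.1f}")   # roughly exp(2 + 0.5) ~ 12.2
    print(f"geometric mean:  {geometric_mean:.1f}")    # roughly exp(2)       ~  7.4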

There are a few ways around this.  The first way is a bit disingenuous.  Present your data on the original scale, but perform the statistical analysis on the transformed data.  Be sure to label things appropriately, with some or all of the following:

  • The Statistical Methods section should indicate something like “The logarithmic transformation was used where appropriate.”
  • The Results section needs to indicate whether data were transformed or untransformed in the text and any figures or tables.
  • The Results section may need to indicate that the transformed and untransformed analyses are similar.

There is nothing formally wrong with this.  You have informed the reader of exactly what has been done, and everything is clearly labeled so that there is (in theory) no confusion.  However, it is slightly disingenuous because we all know that most people are going to gloss over the text under the implicit assumption that the statistical analysis is on the same scale as the data presentation.

The second way is to look at the results from transformed and untransformed data.  If they are similar in outcome, you can consider simply presenting the untransformed analysis with perhaps a mention of the transformed analysis.  This is methodologically wrong, but may be pragmatic.

The third way is to bite the bullet and do the math necessary to back-transform everything.  This can be a headache sometimes, and you still won’t avoid the problem of reviewers noting the differences between the untransformed and transformed analyses.  You can alleviate this somewhat by very clearly indicating that results have been back-transformed in the Statistical Methods and Results sections.
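
The math itself is usually not too bad.  Here is a hedged sketch, on simulated data with made-up group labels, of the most common case: analyze a two-group comparison on the natural-log scale, then exponentiate the difference and its confidence limits so you can report a ratio of geometric means.

    # Back-transformation arithmetic: analyze on the log scale, then exponentiate
    # the difference and its confidence limits to report a ratio (simulated data;
    # group names and numbers are illustrative only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    a = np.log(rng.lognormal(mean=2.0, sigma=0.5, size=30))   # log of group A
    b = np.log(rng.lognormal(mean=2.3, sigma=0.5, size=30))   # log of group B

    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    df = a.size + b.size - 2                                  # rough pooled df
    tcrit = stats.t.ppf(0.975, df)
    lo, hi = diff - tcrit * se, diff + tcrit * se

    # A difference of natural logs becomes a ratio of geometric means.
    print(f"ratio B/A: {np.exp(diff):.2f}  (95% CI {np.exp(lo):.2f} to {np.exp(hi):.2f})")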

Really, when you transform the data you enter into a no-win situation many times.  Depending on the whim or character of reviewers, you can get complaints from any of these approaches.  The first way may be criticized as being inconsistent.  The second way may be criticized as wrong.  The third way may be criticized as confusing or perhaps unnecessary.

Logarithmic Transformation

The logarithmic transformation, or log transform for short, is a very useful transformation for several situations.

When scientific theory indicates some sort of multiplicative relationship between the predictors and the result, the log transformation is a useful device.  For example, in chemical reaction data or in neurotransmitter data you may have multiplicative relationships that are linearized with the logarithmic transformation.
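
To see the linearization in action, here is a simulated power-law example (the coefficients 3.0 and 1.5 are arbitrary): y = a·x^b with multiplicative noise becomes a straight line once you take logs of both sides, and an ordinary least-squares fit on the log scale recovers a and b.

    # Simulated illustration: a multiplicative relationship y = a * x**b with
    # multiplicative noise becomes linear after taking logs of both sides.
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(1, 100, size=200)
    y = 3.0 * x**1.5 * rng.lognormal(sigma=0.1, size=200)     # y = a * x^b * noise

    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    print(f"slope ~ b:          {slope:.2f}")                 # close to 1.5
    print(f"exp(intercept) ~ a: {np.exp(intercept):.2f}")     # close to 3.0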

Also, the way that measurement systems work may create an operational need for the log transformation.  When data span several orders of magnitude, it is common for the measurement error to depend upon the level of the measurement.  The log transform changes this to a situation where the variance is similar across the board.
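
A quick simulated check of that claim, assuming the error is proportional to the level (a constant coefficient of variation of about 20%, chosen arbitrarily): on the raw scale the standard deviation grows with the level, while on the log scale it is roughly constant.

    # Simulated illustration: measurement error proportional to the level.
    # On the raw scale the SD grows with the mean; on the log scale it does not.
    import numpy as np

    rng = np.random.default_rng(6)
    for level in [1, 100, 10_000]:
        x = level * rng.lognormal(sigma=0.2, size=10_000)     # ~20% relative error
        print(f"level {level:6d}:  sd(x) = {x.std():9.1f}   sd(log(x)) = {np.log(x).std():.3f}")
    # sd(log(x)) is about 0.2 at every level, which is what the linear model wants.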

Statistically, the logarithmic transformation is also a variance-stabilizing transformation in the case of the log-normal distribution.

I prefer to use natural logarithms to perform the log transform.  The natural logarithm has the base e ≈ 2.718282, which may seem anything but “natural”.  As with statisticians, mathematicians have their own challenges in dealing with reality.  Suffice it to say that for mathematicians, the base e works nicely.

The usual logarithm that you probably met first was the base-10 logarithm.  The base-10 logarithm is also called the common logarithm.  It is natural to use when thinking in terms of orders of magnitude, or quantities analogous to decibels or the Richter scale.  An increase of one unit in the base-10 logarithm corresponds to multiplying by a factor of 10 on the original data scale.

The natural logarithm is arguably the most common default when you see “log” as the function name in software, though you need to check whether it really means the base-10 logarithm.  To make things more confusing, the usual mathematical notation for the natural logarithm is “ln” (for the Latin logarithmus naturalis), and so sometimes you see that used as the function name in computer software.
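
In Python, at least, “log” does mean the natural logarithm; a quick check like this is worth doing in whatever software you happen to be using.

    # What "log" means in Python's math and numpy libraries.
    import math
    import numpy as np

    print(np.log(np.e))        # 1.0  -> np.log is the natural log
    print(np.log10(100.0))     # 2.0  -> base 10 has its own function
    print(np.log2(8.0))        # 3.0  -> so does base 2
    print(math.log(100, 10))   # 2.0  -> math.log takes an optional base argument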

For work with linear models, such as regression or ANOVA, it does not matter which base you use!  The change of base formula for logarithms is a simple rescaling.  So although the estimates and coefficients will be changed, the statistical significance results will be completely untouched.
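
A small simulated check of that claim: switching from natural logs to base-10 logs rescales the slope by ln(10) ≈ 2.303, but the correlation between predictor and response (and hence the t statistic and p-value) is identical.  The data here are made up purely for the demonstration.

    # Changing the base of the log rescales the slope but not the correlation.
    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(1, 50, size=100)
    y = 2.0 * x + rng.normal(scale=5.0, size=100)

    slope_e, _ = np.polyfit(np.log(x), y, deg=1)
    slope_10, _ = np.polyfit(np.log10(x), y, deg=1)
    print(f"slope ratio:  {slope_10 / slope_e:.3f}  (= ln(10) = {np.log(10):.3f})")
    print(f"correlations: {np.corrcoef(np.log(x), y)[0, 1]:.6f} "
          f"vs {np.corrcoef(np.log10(x), y)[0, 1]:.6f}")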

There is a great little trick with natural logarithms, too.  For small differences between natural logarithms, say below 0.50, you can interpret the differences directly as approximate percent changes on the original data scale.  The approximation is more accurate the smaller the difference.  That’s handy for interpreting the results.
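
To put some numbers on the trick: a difference of d in natural logs corresponds to an actual change of exp(d) − 1 on the original scale, and the two are close when d is small.

    # How good the "log difference ~ percent change" shortcut is at various sizes.
    import numpy as np

    for d in [0.02, 0.05, 0.10, 0.25, 0.50]:
        true_change = np.exp(d) - 1
        print(f"log difference {d:.2f}  ->  actual change {100 * true_change:5.1f}%")
    # 0.05 -> 5.1% and 0.10 -> 10.5%, but 0.50 -> 64.9%, so the shortcut
    # degrades as the differences grow.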

Depending on the situation, a base other than e or 10 may be useful.  When using the base-2 logarithm, the differences represent fold changes.  That is, a change of one unit means a 2-fold change, or a doubling.  A change of two units means a 4-fold change, or a quadrupling.
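
A tiny numeric example, with made-up values, of reading base-2 log differences as fold changes:

    # Base-2 log differences read directly as fold changes: each unit is a doubling.
    import numpy as np

    baseline = 100.0
    for value in [200.0, 400.0, 800.0]:
        diff = np.log2(value) - np.log2(baseline)
        print(f"{value:6.0f} vs {baseline:.0f}:  log2 difference = {diff:.1f}  "
              f"->  {2 ** diff:.0f}-fold")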

Now, what about the case where you have zeroes in the data?  The logarithm of zero is undefined, so that creates a bit of a problem!

There are several approaches to this issue:

1. Analyze the data without the zero values.

2. Add a positive offset only to the zero values.

3. Add a small positive offset to all of the data (the more usual dodge).

The effects of the second and third choices are going to depend heavily on the number of zeroes in the data, as well as the size of the offset added.  If you add a very tiny number in comparison to the rest of the data, then on the log scale the data that were zeroes are going to be hanging out quite a way away from the rest of the data, making them very high leverage data points.

The first choice is often a good choice.  However, with many zeroes, the amount of missing data can dramatically affect the results, and even the ability to perform some statistical analyses.  A rule of thumb might be to reserve this approach for cases where the number of zeroes is somewhere under 10%.  The interpretation of the results will then have to be conditional upon actually seeing a positive value.
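
Here is a simulated illustration of the leverage problem with the offset approaches.  The data and offsets are arbitrary; the point is that how far the former zeroes land from the rest of the data on the log scale depends entirely on the offset you choose.

    # How the offset choice positions the former zeroes on the log scale.
    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.lognormal(mean=3.0, sigma=0.5, size=100)   # positive data, roughly 5-60
    x[:5] = 0.0                                        # sprinkle in a few zeroes

    for offset in [0.001, 0.1, 1.0]:
        logged = np.log(x + offset)
        gap = logged[5:].min() - logged[:5].max()      # distance from zeroes to the rest
        print(f"offset {offset:5.3f}:  former zeroes sit at {logged[0]:6.2f}, "
              f"gap to the rest = {gap:5.2f} log units")
    # With a tiny offset the zeroes sit at log(0.001) ~ -6.9, far below everything
    # else, which is exactly the high-leverage problem described above.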

Read More about Transformations

For a nice discussion from the statistical point of view of the logarithmic transformation and how to interpret it, check out these topics on Cross Validated:

For some additional discussion about the case where you have zeroes:

For some more information about the use of the Box-Cox transformation and the aftermath, check out these topics on Cross Validated:

And, for some general discussion about transformations, see:

The online Engineering Statistics Handbook discusses some of the operational details of finding data transformations in Section 4.6.3.3 Transformations to Improve Fit.