Untransforming

There is a major problem in transforming data.  All of your analysis is now on a scale that is different from your original scale!  The units are now transformed units.  And, the nonlinearity of the transformation is going to make trouble for you.  For example, you may want to present means and standard deviations on the original scale, but these will not match up neatly with the results computed on the transformed scale.

There are a few ways around this.  The first way is a bit disingenuous.  Present your data on the original scale, but perform the statistical analysis on the transformed data.  Be sure to label things appropriately, with some or all of the following:

  • The Statistical Methods section should indicate something like “The logarithmic transformation was used where appropriate.”
  • The Results section needs to indicate whether data were transformed or untransformed in the text and any figures or tables.
  • The Results section may need to indicate that the transformed and untransformed analyses are similar.

There is nothing formally wrong with this.  You have informed the reader of exactly what has been done, and everything is clearly labeled so that there is (in theory) no confusion.  However, it is slightly disingenuous because we all know that most people are going to gloss over the text under the implicit assumption that the statistical analysis is on the same scale as the data presentation.

The second way is to look at the results from transformed and untransformed data.  If they are similar in outcome, you can consider simply presenting the untransformed analysis with perhaps a mention of the transformed analysis.  This is methodologically wrong, but may be pragmatic.

The third way is to bite the bullet and do the math necessary to back-transform everything.  This can be a headache sometimes, and you still won’t avoid the problem of reviewers noting the differences between the untransformed and transformed analyses.  You can alleviate this somewhat by very clearly indicating that results have been back-transformed in the Statistical Methods and Results sections.
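
As a concrete illustration, here is a minimal sketch of back-transforming a log-scale mean and confidence interval, assuming Python with NumPy and SciPy and a hypothetical sample of positive values.  Note that exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean, which is part of why back-transformed summaries can look different from naive summaries on the original scale.

    import numpy as np
    from scipy import stats

    # Hypothetical positive-valued sample (e.g., concentrations)
    x = np.array([1.2, 3.5, 0.8, 7.9, 2.4, 15.0, 4.1])

    logx = np.log(x)                      # analysis on the log scale
    mean_log = logx.mean()
    se_log = stats.sem(logx)
    ci_log = stats.t.interval(0.95, df=len(x) - 1, loc=mean_log, scale=se_log)

    # Back-transform: exp() of the log-scale mean is the geometric mean,
    # and the back-transformed CI is a multiplicative (asymmetric) interval.
    geo_mean = np.exp(mean_log)
    ci_original = np.exp(ci_log)

    print(geo_mean, ci_original)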

Really, when you transform the data you often enter a no-win situation.  Depending on the whim or character of reviewers, you can get complaints about any of these approaches.  The first way may be criticized as being inconsistent.  The second way may be criticized as wrong.  The third way may be criticized as confusing or perhaps unnecessary.

Logarithmic Transformation

The logarithmic transformation, or log transform for short, is a very useful transformation for several situations.

When scientific theory indicates some sort of multiplicative relationship between the predictors and the result, the log transformation is a useful device.  For example, in chemical reaction data or in neurotransmitter data you may have multiplicative relationships that are linearized with the logarithmic transformation.
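
For instance, a power-law relationship y = a·x^b becomes a straight line on the log scale, since log(y) = log(a) + b·log(x).  Here is a minimal sketch of fitting it as a log-log regression, assuming NumPy and hypothetical simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(1, 100, 50)
    y = 2.0 * x**1.5 * rng.lognormal(sigma=0.2, size=x.size)   # y = a*x^b with multiplicative noise

    # Linearize: log(y) = log(a) + b*log(x), then fit an ordinary straight line
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    print(b, np.exp(log_a))   # recovers roughly b = 1.5 and a = 2.0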

Also, the way that measurement systems work may create an operational need for the log transformation.  When data span several orders of magnitude, it is common for the measurement error to grow with the level being measured.  The log transform converts this to a situation where the variance is roughly constant across the range.
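
A quick way to see this is a small simulation, sketched here with NumPy and hypothetical data whose error is proportional to the level being measured:

    import numpy as np

    rng = np.random.default_rng(2)
    levels = np.array([1.0, 10.0, 100.0, 1000.0])           # several orders of magnitude

    for level in levels:
        x = level * rng.lognormal(sigma=0.1, size=1000)      # error proportional to the level
        print(level, x.std(), np.log(x).std())               # raw SD grows with level; log-scale SD stays near 0.1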

Statistically, the logarithmic transformation is also a variance-stabilizing transformation in the case of the log-normal distribution.

I prefer to use natural logarithms to perform the log transform.   The natural logarithm has the base e≈2.718282, which may seem anything but “natural”.  Like statisticians, mathematicians have their own challenges in dealing with reality.  Suffice it to say that for mathematicians, the base e works nicely.

The usual logarithm that you probably met first was the base-10 logarithm.  The base-10 logarithm is also called the common logarithm.  It is a natural choice when thinking in orders of magnitude, or in quantities analogous to decibels or the Richter scale.  An increase of one unit in the base-10 logarithm corresponds to multiplying by a factor of 10 on the original scale.

The natural logarithm is arguably the most common default when you see “log” as the function name in software, though you need to check whether a particular package really means the base-10 logarithm.  To make things more confusing, the usual mathematical notation for the natural logarithm is “ln” (for the Latin logarithmus naturalis), so sometimes you see that used as the function name in computer software instead.
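
For example, in Python the function spelled log is the natural logarithm, and the other bases have their own names; a quick check like this in whatever software you use is worth the few seconds (a sketch assuming Python's math module and NumPy):

    import math
    import numpy as np

    print(math.log(100))               # natural log: about 4.605
    print(math.log10(100))             # base-10 log: 2.0
    print(np.log(np.e))                # NumPy's log is also the natural log: 1.0
    print(np.log10(100), np.log2(8))   # explicit base-10 and base-2 versions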

For work with linear models, such as regression or ANOVA, it does not matter which base you use!  The change of base formula for logarithms is a simple rescaling.  So although the estimates and coefficients will be changed, the statistical significance results will be completely untouched.
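
Since log10(x) = ln(x) / ln(10), switching bases just rescales the transformed variable, and rescaling a variable in a linear model changes the coefficients but not the t-statistics or p-values.  A minimal sketch with SciPy's linregress and hypothetical data makes the point:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.uniform(1, 50, size=40)
    y = np.exp(0.05 * x) * rng.lognormal(sigma=0.3, size=40)   # hypothetical positive response

    fit_ln = stats.linregress(x, np.log(y))      # natural-log response
    fit_l10 = stats.linregress(x, np.log10(y))   # base-10 response

    print(fit_ln.slope / fit_l10.slope)          # ratio is ln(10), about 2.3026
    print(fit_ln.pvalue, fit_l10.pvalue)         # the p-values are identical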

There is a great little trick with natural logarithms, too.  For small differences between natural logarithms, say below 0.50, you can read the difference directly as an approximate percent change on the original data scale.  The approximation is more accurate the smaller the difference.  That’s handy for interpreting the results.
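
The reason is that a difference d in natural logs corresponds to an exact multiplicative change of exp(d), and exp(d) − 1 ≈ d when d is small.  A quick check of how the approximation degrades (assuming NumPy):

    import numpy as np

    for d in [0.01, 0.05, 0.10, 0.25, 0.50]:
        exact_pct = (np.exp(d) - 1) * 100
        print(f"log difference {d:.2f} -> exact change {exact_pct:.1f}%, read-off {d * 100:.0f}%")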

Depending on the situation, a base other than e or 10 may be useful.  When using the base-2 logarithm, the differences represent fold changes.  That is, a change of one unit means a 2-fold change, or a doubling.  A change of two units means a 4-fold change, or a quadrupling.
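
In other words, a difference of d units on the base-2 log scale corresponds to a 2^d-fold change on the original scale.  For instance (a sketch assuming NumPy and made-up numbers):

    import numpy as np

    before, after = 50.0, 400.0
    diff_log2 = np.log2(after) - np.log2(before)
    print(diff_log2, 2**diff_log2)   # a difference of 3 log2 units is an 8-fold change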

Now, what about the case where you have zeroes in the data?  The logarithm of zero is undefined, so that creates a bit of a problem!

There are several approaches to this issue:

1. Analyze the data with the zeroes removed.

2. Add a positive offset only to the zero values.

3. Add a small positive offset to all of the data (the more usual dodge).

The effects of the second and third choices are going to depend heavily on the number of zeroes in the data, as well as the size of the offset added.  If you add a very tiny number in comparison to the rest of the data, then on the log scale the data that were zeroes are going to end up a long way from the rest of the data, making them very high-leverage data points.
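
Here is a sketch, assuming NumPy and hypothetical data containing zeroes, of how the choice of offset moves the former zeroes around on the log scale; a tiny offset parks them far from everything else, while a larger offset tucks them in near the smallest positive values:

    import numpy as np

    x = np.array([0.0, 0.0, 1.3, 2.8, 4.5, 9.0, 22.0])    # hypothetical data with zeroes

    for offset in [1e-6, 0.01, 1.0]:
        logged = np.log(x + offset)
        gap = np.sort(logged)[2] - logged.min()            # distance from the former zeroes to the nearest real value
        print(f"offset {offset:g}: former zeroes at {logged.min():.2f}, gap to the rest = {gap:.2f}")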

The first choice is often a good one.  However, with many zeroes, the amount of missing data can dramatically affect the results, or even the ability to perform some statistical analyses at all.  A rule of thumb might be to reserve this approach for cases where the zeroes make up somewhere under 10% of the data.  The interpretation of the results will then have to be conditional upon actually seeing a positive value.

Read More about Transformations

For a nice discussion from the statistical point of view of the logarithmic transformation and how to interpret it, check out these topics on Cross Validated:

For some additional discussion about the case where you have zeroes:

For some more information about the use of the Box-Cox transformation and the aftermath, check out these topics on Cross Validated:

And, for some general discussion about transformations, see:

The online Engineering Statistics Handbook discusses some of the operational details of finding data transformations in Section 4.6.3.3 Transformations to Improve Fit.