The logarithmic transformation, or log transform for short, is useful in several situations.
When scientific theory indicates some sort of multiplicative relationship between the predictors and the response, the log transformation is a useful device. For example, in chemical reaction data or in neurotransmitter data you may have multiplicative relationships that the logarithmic transformation linearizes.
Also, the way that measurement systems work may create an operational need for the log transformation. When data span several orders of magnitude, it is common for the measurement error to depend on the level of measurement. The log transform changes this to a situation where the variance is roughly constant across the range.
Statistically, the logarithmic transformation is also a variance-stabilizing transformation in the case of the log-normal distribution.
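As a quick illustration of that variance-stabilizing effect, here is a minimal simulation sketch (the seed, levels, and sigma value are arbitrary choices for demonstration): the raw-scale spread grows with the level of measurement, while the spread of the logged data stays nearly constant.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate measurements whose error grows with the level of measurement:
# the noise is multiplicative (log-normal), so on the raw scale the
# standard deviation scales with the mean level.
for level in [1.0, 10.0, 100.0, 1000.0]:
    x = level * rng.lognormal(mean=0.0, sigma=0.25, size=5000)
    print(f"level {level:7.1f}:  sd(raw) = {x.std():8.2f},  "
          f"sd(log) = {np.log(x).std():.3f}")

# sd(raw) grows roughly in proportion to the level, while sd(log)
# stays essentially constant (near the sigma of 0.25) in every group.
```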
I prefer to use natural logarithms to perform the log transform. The natural logarithm has the base e ≈ 2.718281828, which may seem anything but “natural”. Like statisticians, mathematicians have their own challenges in dealing with reality. Suffice it to say that for mathematicians, the base e works nicely.
The usual logarithm that you probably met first was the base-10 logarithm, also called the common logarithm. It is natural to use when thinking of orders of magnitude, or quantities such as decibels or the Richter scale. An increase of one unit in the base-10 logarithm corresponds to multiplying by a factor of 10 on the original data.
The natural logarithm is arguably the most common default when you see “log” as the function name in software, though you need to check whether a given package really means the natural or the base-10 logarithm. To make things more confusing, the usual mathematical notation for the natural logarithm is “ln” (for the Latin logarithmus naturalis), and so sometimes you see that used as the function name in computer software.
For work with linear models, such as regression or ANOVA, it does not matter which base you use! The change-of-base formula, log_b(x) = ln(x) / ln(b), shows that switching bases is a simple rescaling. So although the coefficient estimates will be rescaled, the statistical significance results will be completely untouched.
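A quick way to convince yourself of this is to fit the same regression with both bases and compare. The sketch below uses simulated data (the seed and parameter values are arbitrary); the slopes differ by a factor of ln(10), but the p-values match exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 50)
y = np.exp(0.3 * x) * rng.lognormal(sigma=0.2, size=x.size)

# Fit the same simple regression with a natural-log and a base-10-log response.
fit_ln = stats.linregress(x, np.log(y))
fit_10 = stats.linregress(x, np.log10(y))

print(fit_ln.slope / fit_10.slope)   # ~2.302585..., i.e. ln(10)
print(fit_ln.pvalue, fit_10.pvalue)  # identical significance results
```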
There is a great little trick with natural logarithms, too. For small differences between natural logarithms, say below 0.50, you can interpret the differences directly as approximate percent changes on the original data scale; a log difference of 0.05, for instance, corresponds to roughly a 5% change. The approximation is more accurate the smaller the difference. That’s handy for interpreting the results.
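Here is a small numerical check of the approximation (the particular percent changes chosen are arbitrary examples):

```python
import math

# Compare exact percent changes with the corresponding natural-log differences.
for pct in [0.01, 0.05, 0.10, 0.25, 0.50]:
    log_diff = math.log(1.0 + pct)
    print(f"exact change {pct:5.0%}  ->  log difference {log_diff:.4f}")

# For small changes the two nearly coincide (log(1.05) ~ 0.0488), but the
# gap widens as the change grows (log(1.50) ~ 0.4055, well below 0.50).
```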
Depending on the situation, a base other than e or 10 may be useful. When using the base-2 logarithm, the differences represent fold changes. That is, a change of one unit means a 2-fold change, or a doubling. A change of two units means a 4-fold change, or a quadrupling.
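The fold-change reading is easy to verify numerically; the values below are arbitrary examples:

```python
import numpy as np

before = np.array([50.0, 50.0, 50.0])
after = np.array([100.0, 200.0, 400.0])   # 2-, 4-, and 8-fold increases

diff = np.log2(after) - np.log2(before)
print(diff)         # [1. 2. 3.] -> one log2 unit per doubling
print(2.0 ** diff)  # [2. 4. 8.] -> the fold changes recovered
```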
Now, what about the case where you have zeroes in the data? The logarithm of zero is undefined, so that creates a bit of a problem!
There are several approaches to this issue:
1. Analyze only the nonzero data.
2. Add a positive offset only to the zero values.
3. Add a small positive offset to all of the data (the more usual dodge).
The effects of the second and third choices will depend heavily on the number of zeroes in the data, as well as on the size of the offset added. If you add a very tiny number in comparison to the rest of the data, then on the log scale the former zeroes are going to be hanging out quite a way away from the rest of the data, making them very high-leverage data points.
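A tiny simulated illustration of that leverage problem (the seed, sample size, and the offset of 1e-6 are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(mean=2.0, sigma=0.5, size=20)
x[:3] = 0.0                  # a few zeroes among otherwise positive data

offset = 1e-6                # "tiny" relative to the rest of the data
logged = np.log(x + offset)
print(np.sort(logged))

# The former zeroes land near log(1e-6) ~ -13.8 while everything else
# sits around 2, so on the log scale they are extreme, high-leverage points.
```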
The first choice is often a good one. However, with many zeroes, the amount of missing data can dramatically affect the results, and even the ability to perform some statistical analyses. A rule of thumb might be to reserve this approach for when the proportion of zeroes is somewhere under 10%. The interpretation of the results will then have to be conditional upon actually seeing a positive value.