Laying Out the Manuscript

It is going to be most helpful to you to find examples that you can use to guide your writing.  If you are writing a thesis or dissertation, then there should be plenty of examples of these on file at the school.  There may be departmental requirements or expectations, so it would be good to outline the headings for the manuscript first.

If you are writing for publication, then simply review a number of journal articles from the journal(s) that you are considering.  Also, each journal has its own set of author guidelines, which you should download and follow.  This will make your life easier.

Doing these tasks will give you a vision and a feel for how things ought to look.  You get absolutely no points for learning things the hard way in life!

Your immediate goal should be to put together a skeleton to guide your efforts moving forward.  Consider the skeleton of the manuscript to be the headings and sub-headings.  This is the logical organization of the manuscript rather than the content.

The Rough Draft

I think you should get a rough draft written as soon as you possibly can do so.  In many ways, this is the hardest part of the process.  The way to move on this is in chunks:

  • Write the Introduction, Background, Methods, and References sections first, and add to them and edit them as you go along.  These sections are not going to change too much.
  • Write the Abstract last, obviously, after you have written the Results and Discussion sections.  However, you can write the skeleton part of the Abstract since it usually has a little bit of introductory text and background.
  • Plan on using the Statistical Analysis Plan layout to organize the Results section of your manuscript.
  • Plan on writing the Discussion section in parallel to the Results.

Note:  Until you have the Rough Draft done, I recommend not showing it to anyone on your committee, including your advisor!

The reason for this is that you will simply get sidetracked by your committee members brainstorming, knee-jerk criticism, and helpful ideas.  It is all too easy to get caught up in an endless cycle of changing the statistical analyses and results in order to try to hit illusory targets.

The Unescapable Chores

I spend probably a third to a half of my statistical analysis time performing data cleaning and normalization.  My experience of this type of work ranges from dreary repetitiveness to zen-like absorption.

By data cleaning, I mean the tasks of formatting or reformatting the data to allow statistical analysis, identifying and fixing incorrect data, and merging or subsetting data as needed.

By normalization I mean the tasks of setting data to standard units, identifying useful or necessary categories, and changing variable names or data to standard values.  It also includes setting up the correct data structure and file formats.

Basics of Data Cleaning

We will look at a simple example.  Suppose I am given an Excel file of the data, which has been marked up for readability, perhaps also containing various summary statistics such as averages or standard deviations calculated inside the spreadsheet.  Let’s say that the name of the Excel file is “Experiment 7 Pressure Readings.xlsx”, and that
it contains data in a mix of formats.

My first job is going to be to copy the data to a new file to avoid over-writing the original data, which I will consider to be frozen at this point.  There, I will do the following tasks at a minimum:

  • Remove all formatting.
  • Remove all non-alphanumeric characters from the column names.
  • Probably shorten the column names to more or less standardized, one- or two-word labels.
  • Remove character data from numeric columns.
  • Make all cells with missing values empty rather than coded.  If the missing values have different causes and there are only a few of them, I will write analysis notes documenting them.  If there are many, they may need a separate column for that information.
  • Move or transpose data as necessary to follow the skinny data format.
  • Delete blank columns and blank rows outside of the data area—sometimes these hide cells that have been formatted and they will be read as spurious data.
  • Depending on the amount of data and how organized the data are, I may merge various data sets by hand or I may do it using computer programming.

After this chore, I save the resulting data file as a comma-separated values (CSV) file, which I usually call something like “Clean-Data-2013-11-21.csv”.  This is then going to serve as my frozen analysis data set.  Any subsequent changes in the data are going to be performed using either a computer program or a well-defined, written set of steps.  Which, when you think about it, are pretty much the same thing.
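
To make this concrete, here is a minimal sketch of these chores in Python using pandas.  It is illustrative only: it assumes the pandas and openpyxl packages are available, and the column names used in the numeric-coercion step (“pressure” and “temperature”) are hypothetical placeholders rather than the actual columns in the example file.

    # A minimal data-cleaning sketch using pandas.
    # Assumes pandas is installed and the openpyxl engine is available for .xlsx files.
    import re
    import pandas as pd

    # Read the original file; cell formatting is not carried over by the reader.
    raw = pd.read_excel("Experiment 7 Pressure Readings.xlsx")

    # Work on a copy so the original, frozen data stay untouched.
    clean = raw.copy()

    # Delete rows and columns that are entirely blank.
    clean = clean.dropna(axis=0, how="all").dropna(axis=1, how="all")

    # Any summary rows that came in from the spreadsheet (averages, standard
    # deviations) would still need to be dropped by hand or by filtering here.

    # Strip non-alphanumeric characters from the column names and standardize them.
    clean.columns = [re.sub(r"[^0-9A-Za-z]+", "_", str(c)).strip("_").lower()
                     for c in clean.columns]

    # Coerce numeric columns; stray character data becomes missing (NaN).
    for col in ["pressure", "temperature"]:   # hypothetical column names
        clean[col] = pd.to_numeric(clean[col], errors="coerce")

    # Save the frozen analysis data set; missing values are written as empty cells.
    clean.to_csv("Clean-Data-2013-11-21.csv", index=False)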

Results Section

Many people confuse items that belong in the Results section with items that belong in the Discussion section.

Results are either data, summaries of data, or directly computed from data with no subjective input.  Here are examples of results:

  • A listing of the data.
  • A listing of a subset of the data.
  • A figure generated from the data.
  • Any summary statistics generated from the data.
  • Other statistics generated from the data.
  • Observations about the relationships in the data that are qualitative or quantitative.

Here is one way to organize your Results section into sub-sections:

Data quality:  In this sub-section, you describe the steps that were taken to ensure the integrity of the data that are about to be analyzed to death.  Here, you describe the inclusion or exclusion criteria for the data.

Sample characteristics:  Here you give a summary of the data.  How many complete records were obtained per group?  How many data are missing?  That sort of thing.

Little picture results:  Here you give results parameter by parameter.  Typically these types of analyses are done in cookie-cutter fashion.

Big picture results:  Here you aggregate results from the individual parameter analyses or use statistical methods that are designed to work on sets of parameters.

Manuscript Organization

A very typical organization of the thesis or dissertation manuscript is as follows:

Abstract:  The abstract dramatically summarizes the results and interpretation of the research.

Introduction: In addition to introducing the topics of the manuscript, the introduction is also typically used to lay out the research questions.

Background:  The Background section should distill your literature review into a picture of the existing state of knowledge about the research questions.  This is very important because often there is a generally accepted “way of doing things” that you had better follow if you want your manuscript to be accepted.  From our point of view, the literature review will also most likely reveal a set of statistical methods of particular interest.

Methods:  The Methods section should describe the actual materials and methods used to carry out the study.  Of particular interest to us will be the Statistical Methods section, which will describe the statistical techniques and methodology used to analyze the data.

Results:  The Results section contains presentations of the data, or direct summaries of the data in the form of statistics.  It can also contain text that ties different results together; for example, you may make global observations here.  Other words that go with Results are “Observations” and “Findings”.

Discussion:  The Discussion section, generally speaking, contains text that integrates the Results with information from the Introduction and Background sections.  Essentially, this is the section that connects your study with the outside world.  Other words that go with Discussion include “Conclusions”, “Interpretations”, “Recommendations”, and “Limitations”.

References: You will want to be extremely nit-picky in providing references to support all of your choices and conclusions.

You will want to use parallelism to an extreme degree in this type of writing.  Basically, first organize your Results in a way that makes sense, then write the other sections to match.

The Artifacts of a Carefully Planned Research Program

If you are serious about producing high quality research results, then you need to plan ahead.  The time to plan is before you start your research, not afterward.  However, going through things backward once or twice is usually a great motivator for doing things right in the future.

Here are the documents that you should minimally expect to produce in a well-planned research program:

Research Proposal:  Lays out the suggested course of research.
Study Protocol:  Describes how a study is going to be conducted.
Data Collection Tools:  The forms or files that will be used to collect the data.
Statistical Analysis Plan:  Describes the plan for the statistical analysis.
Master Data:  The final form of the data transferred from the data collection tools.
Analysis Data:  Data derived from the master data after cleaning and normalization.
Statistical Code:  Programs or scripts used to carry out the analysis against the analysis data.
Statistical Analysis:  The output from statistical packages or statistical code.
TLG:  Tables, listings, and graphs created by statistical packages or statistical code.
Analysis Notes:  Describe analysis choices and decisions.
Manuscript:  The actual manuscript using the statistical analysis material.
Reviewer Comments:  The helpful comments from various reviewers.
Reviewer Response:  Documented responses to reviewer comments.
Final Manuscript:  Dissertation, thesis, or published work.

Now, this might seem like overkill for a graduate student.  But, if you think about it, you are going to be covering all of these items either explicitly or implicitly!  It is always going to be better if you explicitly consider each item and plan accordingly.  The item itself might be a single sheet of paper, or even a simple e-mail or paragraph, but it still needs to be considered.

Also, you are going to be spending years of your life on this course of research, as well as oodles of money.  At a minimum you will be working with your committee, but you may also be working with dozens or hundreds of other people to carry out the studies that form your research program.  You simply cannot afford to hope that everything will just work out by itself, or that someone else will hold your hand the whole way through so that everything goes well.

My point is this:  You are not going to be doing any extra work by thinking through each of these items, planning a bit, and writing down your resulting thoughts.  On the contrary, I believe that you are actually going to be saving yourself quite a bit of time and potentially stress and grief!

Actually, the more you work ahead and plan, the easier it is to carry out the tasks that you have to do now.  So, if you can think through your Statistical Analysis Plan in advance, it will help you to design your data collection tools and to clarify your Study Protocol.  You will be able to clearly identify the questions that you can actually answer with these data, which will help you formulate your research hypotheses much more realistically and defensibly.  If you have to formally present a Research Proposal, this will pay off even in terms of reducing stress and enhancing your credibility.

The General Purpose of Data Transformations

For the most part, data transformations are used to allow us to apply linear model techniques to the data.

I have come to the following thought process on transformations for this purpose.  There are four primary reasons for transformation, listed here in rough order of desirability from my point of view:

Scientific:  It may be that the scientific theory—whether physical, chemical, or biological—has already mathematically described a relationship among the variables.  In that case, it may be that a transformation of the data makes the relationship linear.

Operational:  In many settings, for some unknown reason or sometimes for a good reason, a data transformation will linearize the data quite well.  For example, when data span several orders of magnitude, most measurement systems will naturally show larger variability around larger measurements—a good example of this is found in assay systems that use serial dilution.

Statistical:  Now, in some cases, if you assume a probability model, you can find that some transformations do something interesting called “variance stabilization”.  (Well, it’s interesting if you are a statistician, anyway.)  The reason this is relevant is that the linear model methods all pretty much require equal variances in each group (a slight simplification).  The variance stabilizing transformation can create that situation.

Empirical:  Finally, at the bottom of the barrel, you have the bright idea of “Let’s just transform the hell out of the data until we find a transformation that linearizes it!”  There is a certain charm to this, but I am less convinced by that line of reasoning.  This is usually carried out either via a haphazard “Let’s keep trying transformations till we find a good one” approach or via something dressed up more formally, like finding the Box-Cox transformation.

I rank the operational reason above the statistical on the grounds that there is something real happening to create the need for a data transformation.  It is important to stay grounded in reality and avoid drifting into the nether worlds of statistical theory.

In fact, there is a pronounced tendency in statistics to elevate statistical theory to the level of reality.  Perhaps there is some deep-seated psychological reason for this.  We need not concern ourselves with this, but simply try to recognize that much of statistical critique is based on this idea of making assumptions about the real world that may or may not be warranted.

The Box-Cox transformation is to me an example of putting the statistical cart before the reality horse: It gives primary importance to the supposed existence of normally distributed errors and transforms the data to suit.  While one might perhaps make a Bayesian argument for this, in fact this is never done.
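
That said, if you do want to see what the formal machinery suggests, here is a minimal sketch using scipy.stats.boxcox.  The data vector is simulated and purely illustrative, and the sketch assumes SciPy is available and that the data are strictly positive.

    # Minimal Box-Cox sketch; the data must be strictly positive.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.lognormal(mean=1.0, sigma=0.8, size=200)   # made-up positive data

    # boxcox() picks the lambda that makes the transformed data look most normal.
    y_transformed, lam = stats.boxcox(y)
    print(round(lam, 2))

A lambda near zero points at the log transform, and a lambda near one says to leave the data alone; anything in between tends to be hard to interpret.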

Untransforming

There is a major problem in transforming data.  All of your analysis is now on a scale that is different from your original scale!  The units are now transformed units.  And, the nonlinearity of the transformation is going to make trouble for you.  For example, you may want to present means and standard deviations on the original scale, but these are going to seem off compared to results on the transformed scale.

There are a few ways around this.  The first way is a bit disingenuous.  Present your data on the original scale, but perform the statistical analysis on the transformed data.  Be sure to label things appropriately, with some or all of the following:

  • The Statistical Methods section should indicate something like “The logarithmic transformation was used where appropriate.”
  • The Results section needs to indicate whether data were transformed or untransformed in the text and any figures or tables.
  • The Results section may need to indicate that the transformed and untransformed analyses are similar.

There is nothing formally wrong with this.  You have informed the reader of exactly what has been done, and everything is clearly labeled so that there is (in theory) no confusion.  However, it is slightly disingenuous because we all know that most people are going to gloss over the text under the implicit assumption that the statistical analysis is on the same scale as the data presentation.

The second way is to look at the results from transformed and untransformed data.  If they are similar in outcome, you can consider simply presenting the untransformed analysis with perhaps a mention of the transformed analysis.  This is methodologically wrong, but may be pragmatic.

The third way is to bite the bullet and do the math necessary to back-transform everything.  This can be a headache sometimes, and you still won’t avoid the problem of reviewers noting the differences between the untransformed and transformed analyses.  You can alleviate this somewhat by very clearly indicating that results have been back-transformed in the Statistical Methods and Results sections.
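
To illustrate what the third way involves in the common case of a log transform, here is a minimal sketch with made-up data: the mean and confidence interval are computed on the log scale and then exponentiated, which gives the geometric mean and its interval on the original scale.

    # Back-transforming a log-scale mean and confidence interval (illustrative data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    y = rng.lognormal(mean=2.0, sigma=0.5, size=30)   # made-up positive data

    log_y = np.log(y)
    m = log_y.mean()
    se = log_y.std(ddof=1) / np.sqrt(len(log_y))
    t_crit = stats.t.ppf(0.975, df=len(log_y) - 1)

    # Mean and 95% confidence interval on the log scale.
    lo, hi = m - t_crit * se, m + t_crit * se

    # Back-transformed: the geometric mean and its 95% confidence interval.
    print(np.exp(m), (np.exp(lo), np.exp(hi)))

Note that exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean, which is exactly the kind of mismatch reviewers will notice if it is not labeled.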

Really, when you transform the data you often enter a no-win situation.  Depending on the whim or character of the reviewers, you can get complaints about any of these approaches.  The first way may be criticized as being inconsistent.  The second way may be criticized as wrong.  The third way may be criticized as confusing or perhaps unnecessary.

Logarithmic Transformation

The logarithmic transformation, or log transform for short, is a very useful transformation for several situations.

When scientific theory indicates some sort of multiplicative relationship between the predictors and the result, the log transformation is a useful device.  For example, in chemical reaction data or in neurotransmitter data you may have multiplicative relationships that are linearized with the logarithmic transformation.
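
For example, a power-law relationship y = a * x^b is one common multiplicative form: taking logarithms gives log(y) = log(a) + b * log(x), which is linear in log(x), with the exponent b appearing as the slope.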

Also, the way that measurement systems work may create an operational need for the log transformation.  When data span various orders of magnitude, it is common for the measurement scale to be subject to error dependent upon the level of measurement.  The log transform changes this to a situation where the variance is similar across the board.

Statistically, the logarithmic transformation is also a variance-stabilizing transformation in the case of the log-normal distribution.
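
A quick way to see the variance-stabilizing effect is to simulate two groups of log-normal data whose spread grows with their level.  This is a minimal sketch with made-up parameters, not a recipe:

    # Variance stabilization by the log transform for log-normal data.
    import numpy as np

    rng = np.random.default_rng(2)

    # Two made-up groups with the same sigma on the log scale,
    # so the raw-scale spread grows with the raw-scale level.
    low = rng.lognormal(mean=1.0, sigma=0.4, size=1000)
    high = rng.lognormal(mean=3.0, sigma=0.4, size=1000)

    print(low.std(), high.std())                   # raw scale: very different
    print(np.log(low).std(), np.log(high).std())   # log scale: both near 0.4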

I prefer to use natural logarithms to perform the log transform.  The natural logarithm has the base e ≈ 2.718281, which may seem anything but “natural”.  As with statisticians, mathematicians have their own challenges in dealing with reality.  Suffice it to say that for mathematicians, the base e works nicely.

The usual logarithm that you probably met first was the base-10 logarithm.  The base-10 logarithm is also called the common logarithm.  It is natural to use when thinking in terms of orders of magnitude, or of quantities analogous to decibels or the Richter scale.  An increase of one unit in the base-10 logarithm corresponds to multiplying by a factor of 10 on the original data.

The natural logarithm is arguably the most common default when you see “log” as the function name in software, though you need to check whether they really mean the base-10 logarithm.  To make things more confusing, the usual mathematical notation for the natural logarithm is “ln” (for the Latin logarithmus naturalis), and so sometimes you see that used as the function name in computer software.

For work with linear models, such as regression or ANOVA, it does not matter which base you use!  The change of base formula for logarithms is a simple rescaling.  So although the estimates and coefficients will be changed, the statistical significance results will be completely untouched.
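
If you want to convince yourself of this, here is a small sketch with simulated data.  Switching the response from natural to base-10 logarithms rescales the slope by a factor of ln(10) but leaves the p-value untouched; the data and model here are hypothetical.

    # The base of the logarithm rescales coefficients but not significance.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = np.linspace(1, 10, 50)
    y = np.exp(0.3 * x) * rng.lognormal(sigma=0.2, size=50)   # made-up data

    fit_ln = stats.linregress(x, np.log(y))
    fit_10 = stats.linregress(x, np.log10(y))

    print(fit_ln.slope / fit_10.slope)    # equals ln(10), about 2.303
    print(fit_ln.pvalue, fit_10.pvalue)   # identical p-values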

There is a great little trick with natural logarithms, too.  For small differences between natural logarithms, say below 0.50, you can directly interpret the differences as percent changes on the original data scale.  This is more accurate the smaller the difference.  That’s handy for interpreting the results.

Depending on the situation, a base other than e or 10 may be useful.  When using the base-2 logarithm, the differences represent fold changes.  That is, a change of one unit corresponds to a 2-fold change, or a doubling, and a change of two units corresponds to a 4-fold change, or a quadrupling.
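
A couple of quick computed examples of these interpretations, sketched in Python:

    # Interpreting differences on the log scale.
    import numpy as np

    # Natural logs: a small difference reads roughly as the relative change.
    d = np.log(110) - np.log(100)
    print(d, np.exp(d) - 1)    # about 0.095 versus the exact relative change of 0.10

    # Base-2 logs: differences count doublings (fold changes).
    print(np.log2(400) - np.log2(100))   # 2.0, i.e. two doublings, a 4-fold change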

Now, what about the case where you have zeroes in the data?  The logarithm of zero is undefined, so that creates a bit of a problem!

There are several approaches to this issue:

1. Analyze the data without the zero data.

2. Add a positive offset only to the zero data.

3. Take the more usual dodge of adding a small positive offset to all of the data.

The effects of the second and third choices are going to depend heavily on the number of zeroes in the data, as well as the size of the offset added.  If you add a very tiny number in comparison to the rest of the data, then on the log scale the data that were zeroes are going to be hanging out quite a way away from the rest of the data, making them very high leverage data points.

The first choice is often a good choice.  However, with many zeroes, the amount of missing data can dramatically affect the results, and even the ability to perform some statistical analyses.  A rule of thumb might be to generally save this approach for when the proportion of zeroes is somewhere under 10%.  The interpretation of the results will then have to be conditional upon actually seeing a positive value.
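
Here is a minimal sketch of the three approaches on a made-up vector containing zeroes.  The offset rule used in the second and third approaches (half the smallest positive value) is just one common ad hoc choice, not a recommendation.

    # Three ways to handle zeroes before a log transform (illustrative only).
    import numpy as np

    y = np.array([0.0, 0.0, 1.2, 3.5, 4.1, 7.8, 12.0])   # made-up data with zeroes

    # 1. Drop the zeroes and analyze the positive values only.
    log_dropped = np.log(y[y > 0])

    # 2. Add an offset only to the zero values.
    offset = 0.5 * y[y > 0].min()
    log_offset_zeros = np.log(np.where(y == 0, offset, y))

    # 3. Add a small offset to every value.
    log_offset_all = np.log(y + offset)

    # A very tiny offset would push the former zeroes far below the rest of the
    # data on the log scale: for example, log(1e-6) is about -13.8, which makes
    # those points very high leverage.
    print(np.log(1e-6))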