The Principle of Analyzing Only One Data Source

Do not violate this principle.  Violations of this principle are a leading cause of a wide variety of problems, and at a minimum are almost guaranteed to waste your time, effort, and money, if not your credibility.

Usually, violations of this principle occur because of either poor planning or a perceived need for haste (which is probably itself a symptom of poor planning).  They can also happen when more than one person is analyzing the same data, or simply because it is “just the way things are done around here.”

Here is what we especially do not want:  Different analysis data sets floating around with different statistical analyses attached to them.  We have no way of knowing whether the numbers in different analysis data sets are different because the analyses are different or because the data sets underneath the analyses are different.  We may not even know which data are correct!

A typical way this happens is as follows:  You start by creating some simple summary statistical tables by hand using Excel.  Since you need a variety of different summary statistics, you copy the data into a worksheet and have at it.  A while later, you have manually selected a lot of data, created summary statistics, and manually copied the numbers over to a nice little table.

Later, you start working on some statistical analysis, and notice some outlier points.  You decide after reviewing the data that a couple of the points were entered wrong.  Good catch!  You fix those data in your statistical analysis program, and keep on working.  But maybe you remember to change all of the numbers in the Excel summary table, and maybe you do not.

Now you have two different data sets floating around.  Multiply this by every other piece of software that you might be using, each with its own changes, like aggregated categories or filters on the data that will be analyzed.
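One way to avoid this drift is to keep a single raw data source and apply every correction in code, so that tables and analyses are all derived from the same cleaned records.  The following is only a minimal sketch of that idea in Python, not a prescribed workflow.  The data, the column names, the `CORRECTIONS` dictionary, and the function names are all hypothetical and chosen for illustration.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would be one file on disk.
RAW_CSV = """id,group,weight
1,a,10.0
2,a,12.0
3,b,110.0
4,b,13.0
"""

# Every correction lives here, keyed by record id, so it is applied once
# and every downstream result sees it.
CORRECTIONS = {3: {"weight": 11.0}}  # hypothetical data-entry fix


def load_analysis_data(raw_text=RAW_CSV):
    """Read the raw data and apply all corrections in one place."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        row = {
            "id": int(rec["id"]),
            "group": rec["group"],
            "weight": float(rec["weight"]),
        }
        row.update(CORRECTIONS.get(row["id"], {}))
        rows.append(row)
    return rows


def summary_table(rows):
    """Per-group sample size and mean, computed from the shared rows."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["weight"])
    return {g: {"n": len(v), "mean": statistics.mean(v)} for g, v in groups.items()}


def overall_mean(rows):
    """A stand-in for a later analysis that uses the same rows."""
    return statistics.mean(r["weight"] for r in rows)


data = load_analysis_data()
table = summary_table(data)      # the summary table
grand_mean = overall_mean(data)  # the later analysis
```

Because both the summary table and the later analysis call `load_analysis_data()`, fixing a data-entry error means editing `CORRECTIONS` once and rerunning, rather than hunting down every hand-copied number.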

Here are some symptoms of not analyzing only one data source:

  • You have multiple different copies of your analysis data in different formats.
  • Your figures do not match your tables.  Or, neither your figures nor your tables match your statistical results.
  • Your numbers are slightly off when you decide to cross-check some simple stuff like sample size or means between programs.
  • You start to wonder where you got a particular number.
  • You cannot figure out where you got a particular number.
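Some of these symptoms can be checked for mechanically.  As a rough sketch, assuming you can export the same variable from two programs as lists of numbers, the Python below compares sample size, mean, and an order-independent fingerprint of the values.  The function names, the tolerance, and the example values are hypothetical.

```python
import hashlib
import math
import statistics


def fingerprint(values):
    """Order-independent hash of the values, rounded to limit float noise."""
    canon = ",".join(f"{v:.6f}" for v in sorted(values))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def cross_check(a, b, tol=1e-9):
    """Return a list of human-readable discrepancies between two copies."""
    problems = []
    if len(a) != len(b):
        problems.append(f"sample size differs: {len(a)} vs {len(b)}")
    if a and b and not math.isclose(
        statistics.mean(a), statistics.mean(b), abs_tol=tol
    ):
        problems.append("means differ")
    if fingerprint(a) != fingerprint(b):
        problems.append("values differ")
    return problems


spreadsheet_copy = [10.0, 12.0, 110.0, 13.0]  # hypothetical: typo never fixed
stats_copy = [10.0, 12.0, 11.0, 13.0]         # hypothetical: typo fixed

issues = cross_check(spreadsheet_copy, stats_copy)
```

An empty list from `cross_check` only says the two copies agree on these particular checks.  It does not say which copy is correct, which is one more reason to keep a single source in the first place.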
