Information Flow

It will help you to understand the overall flow of information for your study.  Most people concentrate only on one small part, the statistical test, and forget the rest.

Starting with the physical phenomena that are being studied, we have to use some sort of measurement tool to quantify the aspects we want to study.  Then, some sort of data collection tool is used to capture the information.

Then, from the data collection tools, we create one or more master data sets that will be frozen.  “Frozen” means that you will not ever edit those data again.  Ever.
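One way to make “frozen” something you can check, rather than just promise, is to record a checksum of the master data file at the moment you freeze it.  The Python sketch below is only an illustration under assumptions: the file name master.csv and the function names file_sha256 and is_unchanged are hypothetical, and a throwaway file stands in for a real master data set.

```python
import hashlib
import os
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large data file need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """True if the file still matches the checksum recorded at freeze time."""
    return file_sha256(path) == recorded_checksum


# Demonstration with a throwaway file standing in for the master data set.
with tempfile.TemporaryDirectory() as folder:
    master_path = os.path.join(folder, "master.csv")  # hypothetical name
    with open(master_path, "w", newline="") as handle:
        handle.write("id,weight_kg\n1,70.5\n2,82.0\n")

    recorded = file_sha256(master_path)  # write this down when you freeze
    print(is_unchanged(master_path, recorded))  # True

    with open(master_path, "a", newline="") as handle:
        handle.write("3,64.2\n")  # someone edits the "frozen" file
    print(is_unchanged(master_path, recorded))  # False
```

If you keep the recorded checksum in your notes, anyone can later confirm that the master data set is still the one you froze.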

If any data changes need to be made, they will be made in an analysis data set that is derived from the master data set.  Ideally, this will be done using some sort of automated system or programming steps, so that all changes from the master data set are documented.
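As a rough illustration of what “programming steps” can look like, here is a minimal Python sketch that builds an analysis data set from a master data set and records every change as it goes.  Everything in it is assumed for the example: the function name apply_corrections, the field names, and the sample values are hypothetical, and a real study would read the master data from a file rather than from a list typed into the script.

```python
import copy


def apply_corrections(master_rows, corrections):
    """Return (analysis_rows, change_log) without modifying master_rows.

    Each correction is a dict with keys: id, field, new_value, reason.
    """
    analysis_rows = copy.deepcopy(master_rows)  # the master stays untouched
    by_id = {row["id"]: row for row in analysis_rows}
    change_log = []
    for fix in corrections:
        row = by_id[fix["id"]]
        old_value = row[fix["field"]]
        row[fix["field"]] = fix["new_value"]
        # Record what changed and why, so the change is documented.
        change_log.append(
            f"id={fix['id']} {fix['field']}: {old_value!r} -> "
            f"{fix['new_value']!r} ({fix['reason']})"
        )
    return analysis_rows, change_log


# Hypothetical master data with one suspected data-entry error.
master = [
    {"id": 1, "weight_kg": 70.5},
    {"id": 2, "weight_kg": 820.0},
]
corrections = [
    {"id": 2, "field": "weight_kg", "new_value": 82.0,
     "reason": "decimal point misplaced at entry"},
]

analysis, log = apply_corrections(master, corrections)
for line in log:
    print(line)
```

Because the corrections live in the script, rerunning it on the frozen master data always reproduces the same analysis data set, and the log doubles as the documentation of what was changed.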

If the analysis data set is created by hand, then you need to make notes of every change that was made.  I suggest including a text document with the analysis data set.  At some point, you are going to want to freeze the analysis data set so that you can follow the principle of analyzing only one data source.

Next, the data are manipulated using a statistical package, a statistical language, or some other software in order to produce statistical analyses, listings, tables, and figures.  Note that you may need to create further derived data sets in these tools in order to accomplish these tasks.

Again, if any manual steps are performed here, they need to be documented.

Finally, the statistical analyses, listings, tables, and figures are used to write the thesis or dissertation, to create presentations, or to write for publication.

The Principle of Analyzing Only One Data Source

Do not violate this principle.  Violations of this principle are a leading cause of a wide variety of problems; at a minimum, they are almost guaranteed to waste your time, effort, and money, and they may cost you your credibility.

Usually, violations of this principle occur due to either poor planning or a perceived need for haste (which is probably a symptom of poor planning).  They can also happen when more than one person is working on analyzing the same data, or when it is “Just the way things are done around here.”

Here is what we especially do not want:  Different analysis data sets floating around with different statistical analyses attached to them.  We have no way of knowing whether the numbers in different analysis data sets are different because the analyses are different or because the data sets underneath the analyses are different.  We may not even know which data are correct!
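If you do end up with two copies and need to find out whether the data underneath differ, a record-by-record comparison settles that question before you start comparing analyses.  The Python sketch below is one assumed way to do it: the function name diff_data_sets, the "id" key, and the sample records are all hypothetical.

```python
def diff_data_sets(rows_a, rows_b, key="id"):
    """List the differences between two copies of a data set.

    Each data set is a list of dicts; `key` names the field that
    identifies a record.
    """
    a = {row[key]: row for row in rows_a}
    b = {row[key]: row for row in rows_b}
    differences = []
    for record_id in sorted(set(a) | set(b)):
        if record_id not in a:
            differences.append(f"{key}={record_id}: only in second copy")
        elif record_id not in b:
            differences.append(f"{key}={record_id}: only in first copy")
        else:
            # Compare every field that appears in either copy of the record.
            for field in sorted(set(a[record_id]) | set(b[record_id])):
                value_a = a[record_id].get(field)
                value_b = b[record_id].get(field)
                if value_a != value_b:
                    differences.append(
                        f"{key}={record_id} {field}: {value_a!r} vs {value_b!r}"
                    )
    return differences


# Hypothetical example: a spreadsheet copy and a corrected, filtered copy.
spreadsheet_copy = [
    {"id": 1, "weight_kg": 70.5},
    {"id": 2, "weight_kg": 820.0},
    {"id": 3, "weight_kg": 64.2},
]
stats_copy = [
    {"id": 1, "weight_kg": 70.5},
    {"id": 2, "weight_kg": 82.0},
]

for line in diff_data_sets(spreadsheet_copy, stats_copy):
    print(line)
```

An empty result means the two copies hold the same records, so any difference in the numbers must come from the analyses themselves.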

A typical way this happens is as follows:  You start by creating some simple summary statistical tables by hand using Excel.  Since you need a variety of different summary statistics, you copy the data into a worksheet and have at it.  A while later, you have manually selected a lot of data, created summary statistics, and manually copied the numbers over to a nice little table.

Later, you start working on some statistical analysis, and notice some outlier points.  You decide after reviewing the data that a couple of the points were entered wrong.  Good catch!  You fix those data in your statistical analysis program, and keep on working.  But maybe you remember to change all of the numbers in the Excel summary table, and maybe you do not.

Now you have two different data sets floating around.  Multiply this by the other software that you might be using, where you make changes like aggregating categories or creating filters on the data that will be analyzed.

Here are some symptoms of not analyzing only one data source:

  • You have multiple different copies of your analysis data in different formats.
  • Your figures do not match your tables.  Or, neither your figures nor your tables match your statistical results.
  • Your numbers are slightly off when you decide to cross-check some simple stuff like sample size or means between programs.
  • You start to wonder where you got a particular number.
  • You cannot figure out where you got a particular number.