We will look at a simple example. Suppose I am given an Excel file of the data, which has been marked up for readability, perhaps also containing various summary statistics such as averages or standard deviations calculated inside the spreadsheet. Let’s say that the name of the Excel file is “Experiment 7 Pressure Readings.xlsx”, and that
it contains data in a mix of formats.
My first job is to copy the data to a new file so that I do not overwrite the original data, which I consider frozen at this point. In that copy I will do at least the following tasks (a sketch of how several of them might be scripted appears after the list):
- Remove all formatting.
- Remove all non-alphanumeric characters from the column names.
- Probably shorten the column names to more or less standardized, one- or two-word labels.
- Remove character data from numeric columns.
- Make all cells with missing values empty rather than coded. If the missing values have different causes and there are only a few, I will document them in my analysis notes; if there are many, they may need a separate column to record that information.
- Move or transpose data as necessary to follow the skinny data format.
- Delete blank columns and blank rows outside of the data area; these sometimes hide cells that have been formatted, and they will be read in as spurious data.
- Depending on how much data there is and how well organized it is, I may merge the various data sets by hand or I may do it with a computer program.
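Several of these steps lend themselves to a short script. Here is a minimal pandas sketch, assuming a hypothetical copy of the workbook and made-up column names ("run", "pressure"), missing-value codes, and a wide layout; the details would of course change with the actual spreadsheet.

```python
import re

import pandas as pd

# Read the working copy of the workbook (cell formatting is not carried over).
# The file name is a placeholder for the copy made from the frozen original.
df = pd.read_excel("Experiment 7 Pressure Readings - copy.xlsx")

# Shorten and standardize column names: drop non-alphanumeric characters,
# lower-case, and turn runs of whitespace into single underscores.
df.columns = [
    re.sub(r"\s+", "_", re.sub(r"[^0-9a-zA-Z ]", "", str(c)).strip().lower())
    for c in df.columns
]

# Make coded missing values empty instead (the codes here are hypothetical).
df = df.replace({"N/A": pd.NA, "-999": pd.NA, -999: pd.NA})

# Remove character data from a numeric column: anything that does not parse
# as a number becomes an empty (NaN) cell.
df["pressure"] = pd.to_numeric(df["pressure"], errors="coerce")

# Delete blank rows and blank columns outside the data area.
df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Transpose/melt wide data into the skinny format, one observation per row;
# the id column "run" is an assumption about the layout.
skinny = df.melt(id_vars=["run"], var_name="variable", value_name="value")
```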
After this chore, I save the resulting data file as a comma-separated values (CSV) file, which I usually call something like
“Clean-Data-2013-11-21.csv”. This then serves as my frozen analysis data set. Any subsequent changes to the data are made using either a computer program or a well-defined, written set of steps. Which, when you think about it, are pretty much the same thing.
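To give a flavor of what such a subsequent change might look like, here is a minimal sketch. The file names, the affected column, and the correction itself are hypothetical; the point is only that the change is written down and repeatable, never made by hand in the frozen file.

```python
import pandas as pd

# Load the frozen analysis data set; it is never modified in place.
clean = pd.read_csv("Clean-Data-2013-11-21.csv")

# Documented correction (a made-up example of the kind of fix that would be
# noted in the analysis log): run 12 was recorded in kPa rather than psi.
kpa_to_psi = 0.145038
clean.loc[clean["run"] == 12, "pressure"] *= kpa_to_psi

# Write a new dated file so the original frozen file stays untouched.
clean.to_csv("Clean-Data-2013-11-21-corrected.csv", index=False)
```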