1.5 Tabling the Data and the Need to Round

So when I left off we had just seen that missing data was going to be a problem.

1. Exploring the Data

Before I discuss the options for dealing with missing data: when I started looking at the data again, I realized we need to round it as well. Let's examine the data in tabular form:

table(clean_data$TMAX)

# Read the output like this:
#
#   Value:                  -11.92  -11.02  -9.94  -9.04  ...
#   Number of occurrences:       2       2      3      2  ...
#
# and so on for each distinct value.
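To make the value/count pairing concrete, here is a toy example. The values are made up to mimic the output above, not taken from the actual dataset:

```r
# table() counts how many times each distinct value occurs;
# the names of the result are the values, the entries are the counts.
x <- c(-11.92, -11.92, -11.02, -11.02, -9.94, -9.94, -9.94)
tab <- table(x)
print(tab)             # top row: values; bottom row: counts
print(names(tab))      # the distinct values (sorted), as strings
print(as.vector(tab))  # the counts: 2 2 3
```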

Aside:
The table function's output can take some getting used to, but it's incredibly helpful. If you maximize your console window (assuming you're working from the command line), you will find that R still only displays 80 characters per line. To increase this, enter:

options(width=160)

However, on my machine this sometimes leads to trouble later on if I start entering commands that are longer than 160 characters.
:Aside done.

Back to rounding. If you look carefully at the table output, you can see that the decimal places are not random (we don't see all possible values from 0.1 to 0.9 for any single temperature). This is an artifact of converting from raw readings recorded in tenths of a degree Celsius, which only carries roughly whole-degree Fahrenheit precision.
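To see where those particular decimals come from, here is a quick sketch. The Celsius values below are hypothetical, chosen so the converted values line up with the table output above:

```r
# Hypothetical raw readings in tenths of a degree Celsius
celsius <- seq(-24.4, -23.3, by = 0.1)
# Standard Celsius-to-Fahrenheit conversion
fahrenheit <- celsius * 9/5 + 32
print(round(fahrenheit, 2))
# The Fahrenheit values step by 0.18 (-11.92, -11.74, ..., -9.94),
# so we never see every decimal from .1 to .9 -- only these artifacts.
```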

So the cleanup script should have included two more lines. Starting from reading the CSV in step 0, it now looks like:

clean_data = read.csv("ncdc_galesburg_daily_clean.csv", stringsAsFactors = FALSE)
clean_data$Date = as.Date(clean_data$Date)
## new: round the converted temperatures to whole degrees Fahrenheit
clean_data$TMAX = round(clean_data$TMAX)
clean_data$TMIN = round(clean_data$TMIN)
##
clean_data$avgTemp = (clean_data$TMAX + clean_data$TMIN) / 2.0
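As a sanity check, here is the same round-then-average step on a couple of made-up rows (the values are hypothetical, not from the Galesburg file):

```r
# Hypothetical TMAX/TMIN values, already converted to Fahrenheit
toy <- data.frame(TMAX = c(-9.94, 30.02), TMIN = c(-11.92, 19.94))
toy$TMAX <- round(toy$TMAX)  # rounds to -10, 30
toy$TMIN <- round(toy$TMIN)  # rounds to -12, 20
toy$avgTemp <- (toy$TMAX + toy$TMIN) / 2.0
print(toy$avgTemp)           # -11, 25
```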

Looks great now. All is well in the land of unit conversions.
Let's examine that missing data problem next.
1.7 All About Missing Data and Datatypes in R
