1.7 All About Missing Data (NA) and Datatypes in R.

So by default R is very conservative with missing data, as it should be.

Suppose I try to compute the mean of the entire Galesburg daily average temperature time series:

mean(clean_data$avg_temp)
"[1] NA
Warning message:
In mean.default(clean_data$avg_temp) :
argument is not numeric or logical: returning NA"

This warning means I made a typo: the column avg_temp doesn’t exist, so clean_data$avg_temp is NULL, and mean() of NULL returns NA.
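Here is a minimal sketch of why a misspelled column name produces this result rather than an error. It uses a small made-up data frame; toy_data and its values are hypothetical stand-ins, not the Galesburg data.

```r
# Hypothetical stand-in for clean_data; the values are invented for illustration.
toy_data <- data.frame(avgTemp = c(33.5, 30.5, NA, 30))

# `$` with a column name that does not exist returns NULL instead of failing.
missing_col <- toy_data$avg_temp
is.null(missing_col)             # TRUE

# Checking the column names catches the typo before any computation.
"avg_temp" %in% names(toy_data)  # FALSE
"avgTemp" %in% names(toy_data)   # TRUE
```

Running names() on a data frame is a quick first check whenever a column lookup gives an unexpected NA or NULL.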

Instead enter:
mean(clean_data$avgTemp)
#This returns:
"[1] NA"

What? Why?
If there is even one NA value in a vector, R by default will return NA when computing the mean.
This is good conservative behavior.
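To see this propagation on something small, here is a sketch with a made-up vector (temps is hypothetical, not the real series):

```r
# Hypothetical temperatures with a single missing reading.
temps <- c(33.5, 30.5, NA, 30)

mean(temps)   # NA: one missing value makes the whole result unknown
sum(temps)    # NA: the same rule applies to sum()
anyNA(temps)  # TRUE: confirms that at least one NA is present
```

The logic is that the mean of "33.5, 30.5, unknown, 30" really is unknown, so R reports NA rather than silently guessing.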

Okay so how many NA values are we dealing with here?

Two of the most common R functions for inspecting a variable are:
str()
summary()

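The str() (structure) function tells you what kind of variable you have: numeric (floating point, i.e. decimal), integer, or character (text). It is similar to the is* functions in MATLAB or var_dump() in PHP.
summary() prints a synopsis of the variable; for a numeric vector that is the minimum, quartiles, median, mean, maximum, and, if any are present, the number of NA’s.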

str(clean_data$avgTemp)
#This returns:
num [1:24789] 33.5 30.5 30 30 32.5 30 34.5 38.5 34 28.5

We can read this as: the “column” avgTemp of the data frame clean_data is numeric (num), i.e. a floating point (decimal) number, with 24,789 data points, followed by the first few values.
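As a rough illustration of both functions, along with R's own is.* type checks, here is a sketch on a made-up vector (temps is hypothetical):

```r
# Hypothetical numeric vector with one missing value.
temps <- c(33.5, 30.5, NA, 30)

str(temps)         # prints the type (num), the index range, and the first values
temps_summary <- summary(temps)
temps_summary      # min, quartiles, median, mean, max, plus an "NA's" count

# R also has is.* predicates for checking a type directly.
is.numeric(temps)    # TRUE
is.character(temps)  # FALSE
```

The "NA's" entry only appears in the summary() output when the vector actually contains missing values.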

Aside: What are data frames?
1.7b What are dataframes in R?

Well how many NA’s are there?
summary(clean_data$avgTemp)
## the NA's entry shows 542 missing values, out of
str(clean_data$avgTemp)
## 24789 total: 542/24789 or ~2% of the data is missing.
## We can ignore these NAs with the argument na.rm=T (T is shorthand for TRUE),
## i.e. remove NA = TRUE
mean(clean_data$avgTemp,na.rm=T)
## returns [1] 50.58537
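The count and the fraction of missing values can also be computed directly instead of being read off the printed output. A minimal sketch on a made-up vector (temps is hypothetical):

```r
# Hypothetical temperatures with two missing readings.
temps <- c(33.5, 30.5, NA, 30, NA, 28.5)

n_missing <- sum(is.na(temps))      # number of NAs (is.na() gives TRUE/FALSE)
frac_missing <- mean(is.na(temps))  # fraction of values that are NA

# na.rm = TRUE drops the NAs before averaging ...
mean_observed <- mean(temps, na.rm = TRUE)
# ... which is equivalent to filtering them out by hand.
mean_by_hand <- mean(temps[!is.na(temps)])
```

Here n_missing is 2, frac_missing is 2/6, and both means equal 30.625, the average of the four observed values.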

So the mean daily avg temperature for all of the data that we have is 50.6 degrees Fahrenheit. This corresponds to a horizontal line y=50.6 across the time series.

meanDailyAvgTemp=mean(clean_data$avgTemp,na.rm=T)
abline(h=meanDailyAvgTemp,col='red',lwd=3)
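Note that abline() draws on top of an existing plot, so a plot has to be drawn first. A self-contained sketch with base graphics and a made-up series (temps and the axis labels are hypothetical; the real post plots the Galesburg series):

```r
# Hypothetical short series with one missing reading.
temps <- c(33.5, 30.5, NA, 30, 34.5, 38.5)
mean_temp <- mean(temps, na.rm = TRUE)

# Draw the series first; with type = "l" the line breaks at the NA.
plot(temps, type = "l", xlab = "Day", ylab = "Avg temp (F)")

# Then overlay the mean as a horizontal red line.
abline(h = mean_temp, col = "red", lwd = 3)
```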

[Figure: avg_temp_xts_with_mean (the average temperature time series with the mean drawn as a red horizontal line)]

How does adding na.rm=T actually work?
2. Dealing with Missing Data in R: Omit, Approx, or Spline Part 1
