Previous Post 1.5 Tabling the Data and the need to Round
So by default R is very conservative with missing data, as it should be.
Suppose I try to compute the mean of the entire Galesburg daily average temperature time series:
mean(clean_data$avg_temp)
"[1] NA
Warning message:
In mean.default(clean_data$avg_temp) :
argument is not numeric or logical: returning NA"
That warning means I entered a typo: the variable avg_temp doesn’t exist, so clean_data$avg_temp is NULL and mean() has nothing numeric to work with.
Instead enter:
mean(clean_data$avgTemp)
#This returns:
"[1] NA"
What? Why?
If there is even one NA value in a vector, R by default will return NA when computing the mean.
This is good conservative behavior.
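To see this behavior in isolation, here is a minimal sketch on a tiny made-up vector (the name temps and its values are invented for illustration; they are not the Galesburg data):

```r
# A small made-up vector with a single missing value
temps <- c(33.5, NA, 30, 32.5)

# One NA is enough to make the default mean NA
default_mean <- mean(temps)
print(default_mean)   # [1] NA

# Dropping the NA first gives the mean of the remaining three values
clean_mean <- mean(temps, na.rm = TRUE)
print(clean_mean)     # [1] 32
```

The same rule applies to sum(), min(), max() and friends: the NA propagates unless you explicitly ask for it to be removed.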
Okay so how many NA values are we dealing with here?
In R, here are two of the most common functions used to inspect variables:
str()
summary()
The str() or structure function tells you about the variable: whether it’s a float (decimal), an integer, or a character (text). It is similar to the is* functions in MATLAB or var_dump() in PHP.
And summary() prints a synopsis of the variable.
str(clean_data$avgTemp)
#This returns:
num [1:24789] 33.5 30.5 30 30 32.5 30 34.5 38.5 34 28.5
We can read this as: the “column” avgTemp of the data frame clean_data is numeric (num), i.e. floating-point or decimal numbers, with 24,789 data points, followed by the first few values.
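If you want to try both functions without the full dataset, a rough sketch on a made-up vector looks like this (temps is a hypothetical stand-in for clean_data$avgTemp):

```r
# Made-up numeric vector with one missing value
temps <- c(33.5, 30.5, NA, 30)

# str() shows the type (num), the length, and the first few values
str(temps)

# summary() shows min, quartiles, mean, max, and a count of NA's
temp_summary <- summary(temps)
print(temp_summary)

# The NA count can be pulled out of the summary by name
na_count <- as.numeric(temp_summary["NA's"])
print(na_count)   # [1] 1
```

Note that summary() only adds the NA's entry when the vector actually contains missing values.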
Aside: What are data frames?
1.7b What are dataframes in R?
Well how many NA’s are there?
summary(clean_data$avgTemp)
## summary() reports 542 NA's, out of the total given by
str(clean_data$avgTemp)
## 542/24789, or ~2% of the data, is missing.
## We can ignore these NAs with the argument na.rm=T (short for na.rm=TRUE),
## i.e. "remove NA = TRUE".
mean(clean_data$avgTemp,na.rm=T)
## returns [1] 50.58537
So the mean daily average temperature, over all of the data that we have, is about 50.6 degrees Fahrenheit. This corresponds to a horizontal line y=50.6 across the time series, which we can draw on top of an existing plot:
meanDailyAvgTemp=mean(clean_data$avgTemp,na.rm=T)
abline(h=meanDailyAvgTemp,col='red',lwd=3)
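The whole sequence above (count the NAs, take the mean without them, draw the line) can be sketched end to end on a small made-up series. Everything here is invented for illustration: temps stands in for clean_data$avgTemp, and counting with is.na() is just an alternative to reading the numbers off summary() and str().

```r
# Made-up series with two missing values
temps <- c(33.5, 30.5, NA, 30, 32.5, NA, 34.5, 38.5)

# is.na() gives TRUE/FALSE per element; summing counts the TRUEs
n_missing <- sum(is.na(temps))

# The mean of a logical vector is the share of TRUEs, i.e. the fraction missing
share_missing <- mean(is.na(temps))

# Mean of the six observed values only
mean_temp <- mean(temps, na.rm = TRUE)

print(n_missing)       # [1] 2
print(share_missing)   # [1] 0.25
print(mean_temp)       # [1] 33.25

# abline() needs an existing plot, so draw the series first
plot(temps, type = "b", xlab = "Day", ylab = "Avg temp (F)")
abline(h = mean_temp, col = "red", lwd = 3)
```

The gaps in the plotted line are the NA days; the red line sits at the mean of the days we do have.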
How does adding na.rm=T actually work?
2. Dealing with Missing Data in R: Omit, Approx, or Spline Part 1