2. Dealing with Missing Data in R: Omit, Approx, or Spline Part 2

So, after going over na.omit(), na.approx(), and na.spline() in Part 1, I decided to try splining the missing temperature data.

Reaching way back to step 1, let's revisit the missing data in February 2014 and 2013.

plot(avgTempXTS['2014'])
points(avgTempXTS['2014'])
## yup there is the missing data

[Figure: 2014 daily average temperature, xts plot with points]
## let's create a new splined XTS variable to see how polynomial interpolation does
avgTempSplineXTS=na.spline(avgTempXTS)
## to have 2 graph windows open at the same time
dev.new()
plot(avgTempSplineXTS['2014'])
points(avgTempSplineXTS['2014'])
### this looks good, I don't see any crazy jumps to -200 or -300 or something
plot(avgTempSplineXTS)
points(avgTempSplineXTS)
## to compare to the original data, first change back to graph window 2
dev.set(2)
plot(avgTempXTS)
points(avgTempXTS)

[Figure: splined 2014 daily average temperature]

I don't see the splined data changing any of the record highs or lows for daily average temperature, so in that respect it looks good to me. There are more rigorous ways to check exactly what we are introducing into the data (a small hold-out check along those lines is sketched after the output below), but for my purposes here, with 2% of the data missing, this is an acceptable approach. The main thing to look for would be the spline output claiming it was 90 degrees in January, for example. Let's quickly check for this.

### save the locations (or indices)
### where the missing data is in the vector
missingTempDataIndices=which(is.na(clean_data$avgTemp))
### output only the interpolated times
avgTempSplineXTS[missingTempDataIndices]
### scrolling quickly through this list
### the 1.23 deg on Dec 31st 1998 caught my eye
### so using the magic of XTS
avgTempSplineXTS['1998-12-25/1999-01-05']
[,1]
1998-12-26 23.000000
1998-12-27 34.500000
1998-12-28 28.000000
1998-12-29 20.000000
1998-12-30 4.000000
1998-12-31 1.230667
1999-01-01 9.500000
1999-01-02 16.500000
1999-01-03 8.000000
1999-01-04 -10.000000
1999-01-05 0.000000
## note: for the above slice of data, only Dec 31st 1998
## was missing; everything else should be good
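
As an aside, here is the kind of more rigorous check mentioned earlier: hide some values we actually know, re-spline, and see how far off the interpolation lands. This is only a sketch, not part of the original workflow; the 100-day sample and the seed are arbitrary choices, and it reuses the avgTempXTS object from above.

library(xts)
### hide 100 known values, re-spline, and measure the error on days we can verify
set.seed(42)
knownIdx=which(!is.na(coredata(avgTempXTS)))
holdout=sort(sample(knownIdx, 100))
testXTS=avgTempXTS
testXTS[holdout]=NA
refilledXTS=na.spline(testXTS)
### distribution of (interpolated - actual) on the held-out days
summary(as.numeric(refilledXTS[holdout]) - as.numeric(avgTempXTS[holdout]))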

Apparently we have no data from Nov 2nd 2008 to May 21st 2009; that is a lot of missing data. Looking at the temperatures closely, the -17.5 deg F interpolated for 2009-01-16 caught my eye, as did the 1.23 deg F on Dec 31st 1998. Would I expect these to be accurate? No, I wouldn't. Given the large block of missing data, perhaps using the climate average for each calendar day would be better justified than relying on polynomial interpolation across such a large chunk of data (a sketch of that idea is below).
But for the scope of this project, with the missing data being only 2% of the total record, I'm going to keep it simple and use na.spline().
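
For illustration only, here is a minimal sketch of what that day-of-year climatology fill could look like. It is not part of the workflow; it assumes the avgTempXTS object from above, and the names doy, climatology, and avgTempClimFillXTS are made up for the sketch.

library(xts)
### label every date with its calendar day ("01-01", "01-02", ...)
doy=format(index(avgTempXTS), "%m-%d")
### long-term mean temperature for each calendar day, ignoring NAs
climatology=tapply(as.numeric(coredata(avgTempXTS)), doy, mean, na.rm=TRUE)
### copy the series and replace each missing day with its climate average
avgTempClimFillXTS=avgTempXTS
missingIdx=which(is.na(coredata(avgTempClimFillXTS)))
avgTempClimFillXTS[missingIdx]=climatology[doy[missingIdx]]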
Let’s quickly compare it to linear interpolation and see if it does better.

avgTempLinearInterpXTS=na.approx(avgTempXTS)
avgTempLinearInterpXTS[missingTempDataIndices]
### this outputs the same temperature for 2009-01-16 = -17.5
avgTempLinearInterpXTS['1998-12-25/1999-01-05']
### but for Dec 31st 1998 this outputs 6.75 instead of 1.23, which does seem better, so it's a bit of a wash
[,1]
1998-12-26 23.00
1998-12-27 34.50
1998-12-28 28.00
1998-12-29 20.00
1998-12-30 4.00
1998-12-31 6.75
1999-01-01 9.50
1999-01-02 16.50
1999-01-03 8.00
1999-01-04 -10.00
1999-01-05 0.00

Rather than declaring one method superior to the other, let's take another perspective: dealing with missing data has already introduced some uncertainty into the analysis, and we haven't really done anything yet. I view the choice of NA handler as another parameter to consider: na.omit(), na.approx(), or na.spline(), depending on how uncomfortable you are with interpolating data. If you want to be absolutely sure the choice doesn't change the output of whatever you're looking at, then where possible you would want to re-run your test or plot with all three of the NA functions and see if the results differ. If the results don't change under all three cases, then I would argue that the results are robust. A quick sketch of that kind of check is below.
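
This is only an illustration, not part of the original workflow: it compares summary() of the avgTempXTS series under all three NA functions; in practice you would substitute whatever statistic or plot you actually care about.

library(xts)
### the same series handled three ways; if the summaries (and, more importantly,
### whatever statistic you actually care about) barely move, then the choice of
### NA function is not driving your results
summary(na.omit(avgTempXTS))
summary(na.approx(avgTempXTS))
summary(na.spline(avgTempXTS))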

Next, let's compare 2013 as well.
dev.set(2)
plot(avgTempXTS['2013'])
points(avgTempXTS['2013'])

[Figure: 2013 daily average temperature, xts plot]
dev.set(3) # or dev.new() if it doesn't exist
plot(avgTempSplineXTS['2013'])
points(avgTempSplineXTS['2013'])

[Figure: splined 2013 daily average temperature]
The results for the second half of 2013 look really good.

Lastly, about the missing precipitation data: if we don't want to throw out 25 days of temperature data because of missing precip values, I would argue the best step is to simply set the missing precip values to 0. Linearly interpolating or splining a daily total precipitation time series would be wildly inappropriate, unless we had radar data suggesting it was raining continuously over Galesburg before, during, and after the missing days.

To update the data frame appropriately:

### before
summary(clean_data)
### set missing precip to 0, then spline the temperature columns
clean_data$PRCP[is.na(clean_data$PRCP)]=0
clean_data$TMAX=na.spline(clean_data$TMAX)
clean_data$TMIN=na.spline(clean_data$TMIN)
clean_data$avgTemp=na.spline(clean_data$avgTemp)
### after
summary(clean_data)
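
One optional extra step, sketched here and not part of the original code (the column name tempInterpolated is made up): keep a flag marking which rows were filled, using the avgTemp indices we saved earlier, so the interpolated days can be identified or excluded in later analysis.

### flag the rows whose temperatures were interpolated
clean_data$tempInterpolated=FALSE
clean_data$tempInterpolated[missingTempDataIndices]=TRUE
### quick sanity check: how many days were filled
sum(clean_data$tempInterpolated)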

Now we just have to remember that we splined the data (the flag above is one way to keep track). But at long last, we are ready to rumble!
Stay tuned!

See the code recap for what we have so far.
2b Complete Code Recap
