Mean and Standard Error

At the end of this level, you should love traces of data. You need to be able to correctly estimate the equilibration length, mean, variance, autocorrelation and standard error of a data trace. Ideally, you will also be able to plot the data in trace and histogram format and to identify unequilibrated data, infinite variance and correlation.

Equilibrated and Uncorrelated Data

For equilibrated and uncorrelated data, the mean and standard error are easy to estimate:
#!/usr/bin/env python3
import numpy as np

data = np.loadtxt("data.txt")
# standard error of the mean = sample standard deviation / sqrt(number of points)
print(data.mean(), data.std(ddof=1) / np.sqrt(len(data)))

To see what this kind of “good” data looks like, add the lines
import matplotlib.pyplot as plt
plt.plot(data)
plt.show()
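To get the histogram view mentioned above (useful for spotting heavy tails), add one more matplotlib call to the same script; the choice of 50 bins is arbitrary:
plt.hist(data, bins=50)
plt.show()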

BTW, the data is generated from an FN-DMC calculation of the clamped-ion ground-state energy of Li\(^+\) with a non-relativistic Hamiltonian that includes only the kinetic energy of the electrons and the Coulomb interaction among the electrons and the nucleus. You should get an answer of -7.27987(6) Ha, in agreement, within one standard deviation, with the exact answer of -7.279913 Ha calculated with the ICI method.

Raw Data

Equilibration

Raw data from a simulation often contains an initial transient that should not be used in statistics collection. The correct thing to do is simply to throw out this part of the data. However, determining the equilibration length can be tricky, especially if the variance of the data is large. One good heuristic is to start collecting data from the end, updating the mean and variance of these supposedly converged data, and to take the equilibration length as the index of the first point that falls more than, for example, 1.5 standard deviations away from the running mean. This heuristic will fail if the data is not equilibrated from beginning to end or if the data is highly correlated. It is advisable to always plot and look at the data and “eye-ball” the mean and standard deviation to make sure your analysis script is working correctly.
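Here is a minimal sketch of this back-to-front heuristic. The function name, the 1.5-sigma threshold and the minimum tail length are illustrative choices, not part of any standard library.
import numpy as np

def estimate_equilibration_length(data, nsigma=1.5, min_tail=10):
    """Scan from the end of the trace, accumulating the mean and variance of the
    (presumably equilibrated) tail, and return the index of the first point that
    lies more than nsigma standard deviations from the tail mean."""
    tail_sum = 0.0
    tail_sq = 0.0
    n = 0
    for i in range(len(data) - 1, -1, -1):
        x = data[i]
        if n >= min_tail:
            mean = tail_sum / n
            std = np.sqrt(max(tail_sq / n - mean**2, 0.0))
            if abs(x - mean) > nsigma * std:
                return i + 1   # discard everything up to and including index i
        tail_sum += x
        tail_sq += x**2
        n += 1
    return 0  # no transient detected

data = np.loadtxt("data.txt")
neq = estimate_equilibration_length(data)
print("discard the first %d points" % neq)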

Correlation

If the data is correlated (the next data point has “memory” of the current data point), the naive estimate of the standard error will be too small, because the “effective” number of points is smaller than the actual number of data points: you can partially predict the next few data points from the current one. Note that correlation does not affect the correct evaluation of the estimated mean, but it can have a drastic effect on the estimated error. There are two ways to overcome this problem.
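To see the effect, here is a small illustrative sketch using synthetic AR(1) data (not the Li\(^+\) trace): the naive formula visibly underestimates the true scatter of the trace means. All parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
ntraces, npoints, a = 400, 2000, 0.9      # a controls the correlation strength

means, naive_errors = [], []
for _ in range(ntraces):
    noise = rng.normal(size=npoints)
    x = np.empty(npoints)
    x[0] = noise[0]
    for t in range(1, npoints):
        x[t] = a * x[t - 1] + noise[t]    # each point "remembers" the previous one
    means.append(x.mean())
    naive_errors.append(x.std(ddof=1) / np.sqrt(npoints))

print("naive error estimate      :", np.mean(naive_errors))
print("actual scatter of the means:", np.std(means, ddof=1))   # noticeably larger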

If we still wish to use the standard error formula for uncorrelated data, we can use a technique called “blocking”, which chops the data into blocks and uses the averages of the blocks as the uncorrelated data set. If the block size is large enough that the averages of the blocks are indeed uncorrelated, then we will obtain the correct estimate for the error of the estimated mean. Beware that if the block size is so large that you only have a few blocks left, then the error of the estimated error will be large and the statistics are not meaningful.
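A minimal blocking sketch might look as follows; the block sizes in the loop are placeholders, and in practice one increases the block size until the estimated error stops growing.
import numpy as np

def block_error(data, block_size):
    # assumes many more data points than the largest block size
    nblocks = len(data) // block_size            # drop the incomplete last block
    blocks = data[:nblocks * block_size].reshape(nblocks, block_size)
    block_means = blocks.mean(axis=1)
    return block_means.mean(), block_means.std(ddof=1) / np.sqrt(nblocks)

data = np.loadtxt("data.txt")
for bs in (1, 10, 100, 1000):
    mean, err = block_error(data, bs)
    print("block size %4d : mean = %.6f +/- %.6f" % (bs, mean, err))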

If we are willing to learn the concept of the “autocorrelation length” \(\kappa = 1 + 2\sum_{t=1}^{\infty}\rho(t)\), where \(\rho(t)\) is the normalized autocorrelation of the data at lag \(t\), then we can calculate the effective number of data points by dividing the total number of data points by the autocorrelation length, \(N_{eff}=N/\kappa\).
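A minimal sketch of this estimate is shown below; the cutoff rule (stop summing once the normalized autocorrelation drops to zero) is one common convention, not the only one.
import numpy as np

def autocorrelation_length(data):
    x = data - data.mean()
    var = x.var()
    n = len(x)
    kappa = 1.0
    for t in range(1, n):
        rho = np.dot(x[:-t], x[t:]) / ((n - t) * var)   # normalized autocorrelation at lag t
        if rho <= 0:
            break
        kappa += 2.0 * rho
    return kappa

data = np.loadtxt("data.txt")
kappa = autocorrelation_length(data)
neff = len(data) / kappa
err = data.std(ddof=1) / np.sqrt(neff)
print("kappa = %.1f, N_eff = %.0f, corrected error = %.6f" % (kappa, neff, err))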

Both blocking and autocorrelation are used together in serious data analysis to ensure that the error is correctly estimated for any reasonable block size (more than 20 blocks for reasonable statistics, and \(\kappa<1000\)). In this case blocking is merely a mechanism to reduce the amount of data that needs to be saved. Beware that the variance of the raw data cannot be properly calculated without knowing the block weights.
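To see why the block weights matter, here is a small sketch that reconstructs the raw-data variance from per-block weights, means and variances via the law of total variance; the function name and the input values are illustrative placeholders.
import numpy as np

def raw_variance(weights, block_means, block_vars):
    """weights: number of raw samples per block; block_vars: within-block variances."""
    w = np.asarray(weights, dtype=float)
    m = np.asarray(block_means)
    v = np.asarray(block_vars)
    total_mean = np.sum(w * m) / np.sum(w)
    # law of total variance: within-block part + between-block part
    return np.sum(w * (v + (m - total_mean) ** 2)) / np.sum(w)

# placeholder block data, purely for illustration
weights     = [100, 100, 50]
block_means = [-7.2800, -7.2795, -7.2802]
block_vars  = [0.040, 0.050, 0.045]
print("raw-data variance:", raw_variance(weights, block_means, block_vars))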

Core Concepts:

  1. Ergodicity
  2. Central Limit Theorem
  3. Autocorrelation Length