Week 5

Estimating errors

Why computing errors is tricky

Using a molecular dynamics simulation, we can compute the expectation value of a physical observable $A$ as the time average

$$\langle A \rangle \approx \bar{A} = \frac{1}{T} \int_0^T A(t)\, dt\,,$$

where $T$ is the total simulation time (we only consider the data after the equilibration phase). This average would converge to the actual thermodynamic average if we could go to $T \to \infty$. Since $T$ in our simulation is always finite, the time average is only an estimate of the true expectation value. Obviously, the longer we run the simulation, the more accurate the value will be. To quantify how accurate the result is, we need to compute the error.

If we had $N$ statistically independent random data points $A_1, \dots, A_N$ of a physical observable $A$, we could estimate the standard deviation of its mean with the well-known formula

$$\sigma_{\bar{A}} = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} \left(A_i - \bar{A}\right)^2}\,.$$

However, this does not work with molecular dynamics, as we do not have statistically independent random data. We are doing a time evolution, meaning that every new configuration is a small variation of the previous configuration - the data is thus correlated.
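
As a minimal sketch (assuming the samples live in a 1D NumPy array; the function name is my own), the naive estimator looks like this:

```python
import numpy as np

def naive_error(data):
    """Standard error of the mean, valid only for statistically independent samples."""
    n = len(data)
    return np.sqrt(np.var(data, ddof=1) / n)
```

Applied to correlated data, this formula underestimates the true error, because it treats every correlated sample as if it carried independent information.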

The autocorrelation function

Correlated data means that the data sequence has a memory of the previous configurations. Let's now assume that we have a sequence of correlated data $A_1, A_2, \dots, A_N$. We can quantify the amount of correlation by the normalized autocorrelation function (also seen as the Pearson correlation coefficient in the literature):

$$\chi(k) = \frac{\langle A_i A_{i+k} \rangle - \langle A_i \rangle \langle A_{i+k} \rangle}{\sqrt{\left(\langle A_i^2 \rangle - \langle A_i \rangle^2\right)\left(\langle A_{i+k}^2 \rangle - \langle A_{i+k} \rangle^2\right)}}$$

that compares the fluctuations at a certain time distance $k$, assuming a stationary process. Typically the autocorrelation function has an exponential decay $\chi(k) \sim e^{-k/\tau}$, where $\tau$ is the correlation time of the simulation (note that in my definition here the correlation "time" refers to the index $k$ and is thus dimensionless). To get statistically independent data, one would thus have to take snapshots with a waiting time of more than $\tau$. In particular, knowing the correlation time $\tau$, the error on our simulation result is given as

$$\sigma_{\bar{A}} = \sqrt{\frac{2\tau}{N} \left(\langle A^2 \rangle - \langle A \rangle^2\right)}\,.$$

The formula above is valid for computing the autocorrelation function for an infinitely long time series. To compute it from your finite-length simulation data, you can use the formula

$$\chi(k) = \frac{(N-k) \sum_{i=1}^{N-k} A_i A_{i+k} - \sum_{i=1}^{N-k} A_i \sum_{i=1}^{N-k} A_{i+k}}{\sqrt{(N-k) \sum_{i=1}^{N-k} A_i^2 - \left(\sum_{i=1}^{N-k} A_i\right)^2}\ \sqrt{(N-k) \sum_{i=1}^{N-k} A_{i+k}^2 - \left(\sum_{i=1}^{N-k} A_{i+k}\right)^2}}$$

for $k < N$. To get $\tau$, you would then have to fit $e^{-k/\tau}$ to $\chi(k)$.
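
Here is a sketch of how one might compute $\chi(k)$ and extract $\tau$ numerically (assuming NumPy and SciPy; I use `np.corrcoef` as a shortcut for the Pearson formula above, and `curve_fit` for the exponential fit - function names are my own):

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorrelation(data, k_max):
    """Normalized autocorrelation chi(k) for k = 1 .. k_max."""
    n = len(data)
    chi = np.empty(k_max)
    for k in range(1, k_max + 1):
        # Pearson correlation coefficient between the sequence and its shifted self
        chi[k - 1] = np.corrcoef(data[:n - k], data[k:])[0, 1]
    return chi

def correlation_time(chi):
    """Fit exp(-k/tau) to chi(k) and return the fitted tau."""
    k = np.arange(1, len(chi) + 1)
    popt, _ = curve_fit(lambda k, tau: np.exp(-k / tau), k, chi, p0=[10.0])
    return popt[0]
```

With $\tau$ in hand, the error follows from the formula above, e.g. as `np.sqrt(2 * tau / len(data) * np.var(data))`.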

Example

Let's illustrate with an example how this procedure works for numerically generated correlated random data. Instead of using actual molecular dynamics data, here I use correlated random data that I generated using a [mathematical prescription](https://www.cmu.edu/biolphys/deserno/pdf/corr_gaussian_random.pdf) that allows me to specify the correlation.
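
For illustration, here is a sketch of one simple way to generate such data, a first-order autoregressive recursion (whether this matches the linked prescription in every detail is an assumption on my part; parameter names are mine):

```python
import numpy as np

def correlated_gaussian(n, tau, mu=0.0, sigma=1.0, seed=0):
    """Correlated Gaussian samples whose autocorrelation decays as exp(-k/tau)."""
    rng = np.random.default_rng(seed)
    f = np.exp(-1.0 / tau)
    data = np.empty(n)
    data[0] = rng.normal()
    for i in range(1, n):
        # Each sample is a damped copy of the previous one plus fresh noise
        data[i] = f * data[i - 1] + np.sqrt(1.0 - f * f) * rng.normal()
    return mu + sigma * data
```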

This is some example data where I set the correlation time $\tau = 50$ and the mean $\mu = 0$, together with an intrinsic standard deviation $\sigma$ (remember, $\sigma$ is not the error, but the square root of the intrinsic variance):

*[Plot: random correlated data]*

The calculated autocorrelation function then looks like this:

*[Plot: autocorrelation function with fit]*

In this plot, I have also fitted the numerical autocorrelation function with $e^{-k/\tau}$, with fit parameter $\tau$. As you see, I get a $\tau$ close to the value I set initially. I don't get exactly 50, as I have a finite sequence of data - it would converge to 50 as the sequence length increases.

With this approach, I can calculate the average of my data including its error; the result is compatible with 0, which fits with the fact that I set the desired mean to be 0 for an infinitely long series.

When fitting the autocorrelation function, it is important to be aware of the fact that for larger $k$ it usually shows quite some fluctuations instead of decaying to 0, as you can see in this plot of the autocorrelation function for long times:

*[Plot: full autocorrelation function at long times]*

Hence, only fit to the well-behaved part. Sometimes you also see a different way to compute the correlation time as

$$\tau = \sum_{k=1}^{\infty} \chi(k)$$

(this approach is based on $\sum_{k=1}^{\infty} e^{-k/\tau} \approx \int_0^{\infty} e^{-k/\tau}\, dk = \tau$). Doing this naively usually does not give nice results for numerical data because of the fluctuations at larger $k$. A work-around is to use a cut-off and only sum up to some maximum $k_\mathrm{max}$.
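
As a sketch (reusing the `autocorrelation` function from above; `k_cut` is my own name):

```python
import numpy as np

def integrated_correlation_time(chi, k_cut):
    """Summed correlation time, truncated at k_cut to suppress the noisy tail."""
    return np.sum(chi[:k_cut])
```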

The approach of calculating the error using the autocorrelation function is conceptually the easiest, but as you see one has to take some care. Alternatively, there is another way to compute the error that avoids the calculation of the autocorrelation function, as described in the next section.

Data blocking

The idea of data blocking is to replace the time series $A_1, \dots, A_N$ with a block-averaged version, where blocks of $b$ subsequent entries are always replaced by the average of that block:

$$\tilde{A}_j = \frac{1}{b} \sum_{i=(j-1)b + 1}^{jb} A_i$$

with $j = 1, \dots, N_b$ and $N_b = N/b$. The idea of this method is that once the block length $b$ becomes equal to or larger than the correlation time, the block averages form statistically independent random variables, and hence we can compute the error as

$$\sigma_{\bar{A}} = \sqrt{\frac{1}{N_b(N_b - 1)} \sum_{j=1}^{N_b} \left(\tilde{A}_j - \bar{A}\right)^2}\,.$$

Without explicitly calculating the autocorrelation function, we do not know a priori what value of $b$ to use. But in this case we can make use of the fact that in our definition the error $\sigma_{\bar{A}}(b)$ is a function of the block size $b$. We know that the block size is large enough once $\sigma_{\bar{A}}(b)$ has converged to a (roughly) constant value!
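
A minimal data-blocking sketch (assuming NumPy; trailing samples that do not fill a complete block are discarded):

```python
import numpy as np

def blocking_error(data, block_size):
    """Error of the mean estimated from non-overlapping block averages."""
    n_blocks = len(data) // block_size
    blocks = data[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    return np.sqrt(np.var(blocks, ddof=1) / n_blocks)

# Scan block sizes and look for the plateau in the resulting errors:
# errors = [blocking_error(data, b) for b in range(1, 200)]
```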

For the correlated example data shown above, the error $\sigma_{\bar{A}}$ as a function of block size $b$ looks like this:

*[Plot: error vs. block size from data blocking]*

From this we can read off the error as the plateau value, which agrees with what we found with the autocorrelation function approach.

Errors of derived quantities

Error propagation

Some of the observables mentioned in week 4 are not simply calculated as the average of a time series. One particular example is the specific heat, which is a function of $\langle E_\mathrm{kin} \rangle$ and $\langle E_\mathrm{kin}^2 \rangle$, where $E_\mathrm{kin}$ is the kinetic energy. For each of the individual averages ($\langle E_\mathrm{kin} \rangle$ and $\langle E_\mathrm{kin}^2 \rangle$) we can compute a proper error (taking into account the correlation), and the error of the specific heat then follows using the rules of error propagation.
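
For a derived quantity $f(x, y)$ with, say, $x = \langle E_\mathrm{kin} \rangle$ and $y = \langle E_\mathrm{kin}^2 \rangle$, the standard first-order propagation formula reads

$$\sigma_f^2 \approx \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2\,,$$

which assumes the errors of the two averages are independent; since both averages come from the same time series, this is only approximate - one more reason to also consider the bootstrap described below.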

Block bootstrap

An alternative to explicitly deriving an analytical formula for the error of the derived quantity is to use the bootstrap method. Bootstrapping is a resampling method: instead of calculating a quantity $Q$ from the simulation data directly, we generate new random data sets from it:

Say you have $N$ data points. Then you make a new random set by drawing $N$ random data points from the original set. This is not just a reshuffling: you will pick certain data points several times, and others not at all. From this new data set you then compute, in the usual way, the quantity $Q$ you want. You repeat this procedure $n$ times. Then you have a set of $n$ values for $Q$, and from these you can compute an estimate of the error as

$$\sigma_Q = \sqrt{\langle Q^2 \rangle - \langle Q \rangle^2}\,,$$

where the average is now over the $n$ data points for $Q$. Note that there is no factor $1/\sqrt{n}$ here - and there shouldn't be, because the error shouldn't get smaller by doing more resampling (it is determined by the original data set). For a large enough $n$ the error will thus be independent of $n$.

Note that for the bootstrap we need statistically independent data. Hence, to use it for correlated data, we need to replace the original data set by a block-averaged version as for the data blocking method.
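
A sketch of the block bootstrap under these assumptions (NumPy only; `estimator` is any function mapping a data set to the quantity $Q$, and all names are my own):

```python
import numpy as np

def block_bootstrap_error(data, block_size, estimator, n_resamples=1000, seed=0):
    """Bootstrap error of estimator(data), resampling block averages."""
    rng = np.random.default_rng(seed)
    n_blocks = len(data) // block_size
    # Replace the correlated series by (approximately independent) block averages
    blocks = data[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    q = np.array([estimator(rng.choice(blocks, size=n_blocks, replace=True))
                  for _ in range(n_resamples)])
    # No 1/sqrt(n_resamples) factor: the error is set by the original data set
    return np.sqrt(np.mean(q**2) - np.mean(q)**2)

# Example: error of the mean, which should match the data blocking result
# err = block_bootstrap_error(data, block_size=100, estimator=np.mean)
```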

Milestones

  • Implement calculation of errors and test your implementation on data with a known correlation time. You can get the code to generate random data with a specified autocorrelation time here.
  • Compute observables including errors.
  • Make sure your code is structured logically.
  • Make a plan for simulations to go into the report: How do you want to validate your simulation, and which observables/simulations do you want to run?
  • Make a brief statement about the efficiency of your code: can you run the simulations you would like to do? Would it make sense to improve performance?