Week 5

Estimating errors

Why computing errors is tricky

Using a molecular dynamics simulation, we can compute the expectation value of a physical observable $A$ as the time average

$$\langle A \rangle \approx \bar{A} = \frac{1}{T} \int_0^T A(t)\, dt\,,$$

where $T$ is the total simulation time (we only consider the data after the equilibration phase). This average would converge to the actual thermodynamic average if we could go to $T \to \infty$. Since $T$ in our simulation is always finite, the time average is only an estimate of the true expectation value. Obviously, the longer we run the simulation, the more accurate the value will be. To quantify how accurate the result is, we need to compute the error.

If we had $N$ statistically independent random data points $A_1, \dots, A_N$ of a physical observable $A$, we could estimate the standard deviation of its mean with the well-known formula

$$\sigma_{\bar{A}} = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} \left(A_i - \bar{A}\right)^2}\,.$$

However, this does not work with molecular dynamics, as we do not have statistically independent random data. We are doing a time evolution, meaning that every new configuration is a small variation of the previous configuration - the data is thus correlated.
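
As a minimal sketch (assuming the samples live in a 1D NumPy array; the function name is my own), the naive estimator looks like this:

```python
import numpy as np

def naive_error(data):
    """Standard error of the mean, valid only for statistically independent samples."""
    n = len(data)
    return np.sqrt(np.var(data, ddof=1) / n)
```

Applied to correlated data, this formula underestimates the true error, because it treats every correlated sample as if it carried independent information.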

The autocorrelation function

Correlated data means that the data sequence has a memory of the previous configurations. Let's now assume that we have a sequence of correlated data $A_1, A_2, \dots, A_N$. We can quantify the amount of correlation by the normalized autocorrelation function (also seen as the Pearson correlation coefficient in the literature):

$$\chi(k) = \frac{\langle A_i A_{i+k} \rangle - \langle A_i \rangle \langle A_{i+k} \rangle}{\sqrt{\left(\langle A_i^2 \rangle - \langle A_i \rangle^2\right)\left(\langle A_{i+k}^2 \rangle - \langle A_{i+k} \rangle^2\right)}}$$

that compares the fluctuations at a certain time distance $k$, assuming a stationary process. Typically the autocorrelation function has an exponential decay $\chi(k) \sim e^{-k/\tau}$, where $\tau$ is the correlation time of the simulation (note that in my definition here the correlation "time" refers to the index $k$ and is thus dimensionless). To get statistically independent data, one would thus have to take snapshots with a waiting time of more than $\tau$. In particular, knowing the correlation time $\tau$, the error on our simulation result is given as

$$\sigma_{\bar{A}} = \sqrt{\frac{2\tau}{N} \left(\langle A^2 \rangle - \langle A \rangle^2\right)}\,.$$

The formula above is valid for computing the autocorrelation function for an infinitely long time series. To compute it from your finite-length simulation data, you can use the formula

$$\chi(k) = \frac{(N-k) \sum_{i=1}^{N-k} A_i A_{i+k} - \sum_{i=1}^{N-k} A_i \sum_{i=1}^{N-k} A_{i+k}}{\sqrt{(N-k) \sum_{i=1}^{N-k} A_i^2 - \left(\sum_{i=1}^{N-k} A_i\right)^2}\ \sqrt{(N-k) \sum_{i=1}^{N-k} A_{i+k}^2 - \left(\sum_{i=1}^{N-k} A_{i+k}\right)^2}}$$

for $k < N$. To get $\tau$, you would then have to fit $e^{-k/\tau}$ to $\chi(k)$.
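
Here is a sketch of how one might compute $\chi(k)$ and extract $\tau$ numerically (assuming NumPy and SciPy; I use `np.corrcoef` as a shortcut for the Pearson formula above, and `curve_fit` for the exponential fit - function names are my own):

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorrelation(data, k_max):
    """Normalized autocorrelation chi(k) for k = 1 .. k_max."""
    n = len(data)
    chi = np.empty(k_max)
    for k in range(1, k_max + 1):
        # Pearson correlation coefficient between the sequence and its shifted self
        chi[k - 1] = np.corrcoef(data[:n - k], data[k:])[0, 1]
    return chi

def correlation_time(chi):
    """Fit exp(-k/tau) to chi(k) and return the fitted tau."""
    k = np.arange(1, len(chi) + 1)
    popt, _ = curve_fit(lambda k, tau: np.exp(-k / tau), k, chi, p0=[10.0])
    return popt[0]
```

With $\tau$ in hand, the error follows from the formula above, e.g. as `np.sqrt(2 * tau / len(data) * np.var(data))`.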

Example

Let's illustrate with an example how this procedure works for numerically generated correlated random data. Instead of using actual molecular dynamics data, here I use correlated random data that I generated using a [mathematical prescription](https://www.cmu.edu/biolphys/deserno/pdf/corr_gaussian_random.pdf) that allows me to specify the correlation.
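
For illustration, here is a sketch of one simple way to generate such data, a first-order autoregressive recursion (whether this matches the linked prescription in every detail is an assumption on my part; parameter names are mine):

```python
import numpy as np

def correlated_gaussian(n, tau, mu=0.0, sigma=1.0, seed=0):
    """Correlated Gaussian samples whose autocorrelation decays as exp(-k/tau)."""
    rng = np.random.default_rng(seed)
    f = np.exp(-1.0 / tau)
    data = np.empty(n)
    data[0] = rng.normal()
    for i in range(1, n):
        # Each sample is a damped copy of the previous one plus fresh noise
        data[i] = f * data[i - 1] + np.sqrt(1.0 - f * f) * rng.normal()
    return mu + sigma * data
```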

This is some example data where I set the correlation time $\tau = 50$ and the mean $\mu = 0$, together with an intrinsic standard deviation $\sigma$ (remember, $\sigma$ is not the error, but the square root of the intrinsic variance):

*[Plot: random correlated data]*

The calculated autocorrelation function then looks like this:

*[Plot: autocorrelation function with fit]*

In this plot, I have also fitted the numerical autocorrelation function with $e^{-k/\tau}$, with fit parameter $\tau$. As you see, I get a $\tau$ close to the value I set initially. I don't get exactly 50, as I have a finite sequence of data - it would converge to 50 as the sequence length increases.

With this approach, I can calculate the average of my data including its error; the result is compatible with 0, which fits with the fact that I set the desired mean to be 0 for an infinitely long series.

When fitting the autocorrelation function, it is important to be aware of the fact that for larger $k$ it usually shows quite some fluctuations instead of decaying to 0, as you can see in this plot of the autocorrelation function for long times:

*[Plot: full autocorrelation function at long times]*

Hence, only fit to the well-behaved part. Sometimes you also see a different way to compute the correlation time as

$$\tau = \sum_{k=1}^{\infty} \chi(k)$$

(this approach is based on $\sum_{k=1}^{\infty} e^{-k/\tau} \approx \int_0^{\infty} e^{-k/\tau}\, dk = \tau$). Doing this naively usually does not give nice results for numerical data because of the fluctuations at larger $k$. A work-around is to use a cut-off and only sum up to some maximum $k_\mathrm{max}$.
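
As a sketch (reusing the `autocorrelation` function from above; `k_cut` is my own name):

```python
import numpy as np

def integrated_correlation_time(chi, k_cut):
    """Summed correlation time, truncated at k_cut to suppress the noisy tail."""
    return np.sum(chi[:k_cut])
```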

The approach of calculating the error using the autocorrelation function is conceptually the easiest, but as you see one has to take some care. Alternatively, there is another way to compute the error that avoids the calculation of the autocorrelation function, as described in the next section.

Data blocking

The idea of data blocking is to replace the time series $A_1, \dots, A_N$ with a block-averaged version, where blocks of $b$ subsequent entries are always replaced by the average of that block:

$$\tilde{A}_j = \frac{1}{b} \sum_{i=(j-1)b + 1}^{jb} A_i$$

with $j = 1, \dots, N_b$ and $N_b = N/b$. The idea of this method is that once the block length $b$ becomes equal to or larger than the correlation time, the block averages form statistically independent random variables, and hence we can compute the error as

$$\sigma_{\bar{A}} = \sqrt{\frac{1}{N_b(N_b - 1)} \sum_{j=1}^{N_b} \left(\tilde{A}_j - \bar{A}\right)^2}\,.$$

Without explicitly calculating the autocorrelation function, we do not know a priori what value of $b$ to use. But in this case we can make use of the fact that in our definition the error $\sigma_{\bar{A}}(b)$ is a function of the block size $b$. We know that the block size is large enough once $\sigma_{\bar{A}}(b)$ has converged to a (roughly) constant value!
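
A minimal data-blocking sketch (assuming NumPy; trailing samples that do not fill a complete block are discarded):

```python
import numpy as np

def blocking_error(data, block_size):
    """Error of the mean estimated from non-overlapping block averages."""
    n_blocks = len(data) // block_size
    blocks = data[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    return np.sqrt(np.var(blocks, ddof=1) / n_blocks)

# Scan block sizes and look for the plateau in the resulting errors:
# errors = [blocking_error(data, b) for b in range(1, 200)]
```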

For the correlated example data shown above, the error $\sigma_{\bar{A}}$ as a function of block size $b$ looks like this:

*[Plot: error vs. block size from data blocking]*

From this we can read off the error as the plateau value, which agrees with what we found with the autocorrelation function approach.

Errors of derived quantities

Error propagation

Some of the observables mentioned in week 4 are not simply calculated as the average of a time series. One particular example is the specific heat, which is a function of $\langle E_\mathrm{kin} \rangle$ and $\langle E_\mathrm{kin}^2 \rangle$, where $E_\mathrm{kin}$ is the kinetic energy. For each of the individual averages ($\langle E_\mathrm{kin} \rangle$ and $\langle E_\mathrm{kin}^2 \rangle$) we can compute a proper error (taking into account the correlation), and the error of the specific heat then follows using the rules of error propagation.
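
For a derived quantity $f(x, y)$ with, say, $x = \langle E_\mathrm{kin} \rangle$ and $y = \langle E_\mathrm{kin}^2 \rangle$, the standard first-order propagation formula reads

$$\sigma_f^2 \approx \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2\,,$$

which assumes the errors of the two averages are independent; since both averages come from the same time series, this is only approximate - one more reason to also consider the bootstrap described below.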

Block bootstrap

An alternative to explicitly deriving an analytical formula for the error of the derived quantity is to use the bootstrap method. Bootstrapping is a resampling method: instead of calculating a quantity $Q$ from the simulation data directly, we generate new random data sets from it:

Say you have $N$ data points. Then you make a new random set by drawing $N$ random data points from the original set. This is not just a reshuffling: you will pick certain data points several times, and others not at all. From this new data set you then compute, in the usual way, the quantity $Q$ you want. You repeat this procedure $n$ times. Then you have a set of $n$ values for $Q$, and from these you can compute an estimate of the error as

$$\sigma_Q = \sqrt{\langle Q^2 \rangle - \langle Q \rangle^2}\,,$$

where the average is now over the $n$ data points for $Q$. Note that there is no factor $1/\sqrt{n}$ here - and there shouldn't be, because the error shouldn't get smaller by doing more resampling (it is determined by the original data set). For a large enough $n$ the error will thus be independent of $n$.

Note that for the bootstrap we need statistically independent data. Hence, to use it for correlated data, we need to replace the original data set by a block-averaged version as for the data blocking method.
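
A sketch of the block bootstrap under these assumptions (NumPy only; `estimator` is any function mapping a data set to the quantity $Q$, and all names are my own):

```python
import numpy as np

def block_bootstrap_error(data, block_size, estimator, n_resamples=1000, seed=0):
    """Bootstrap error of estimator(data), resampling block averages."""
    rng = np.random.default_rng(seed)
    n_blocks = len(data) // block_size
    # Replace the correlated series by (approximately independent) block averages
    blocks = data[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    q = np.array([estimator(rng.choice(blocks, size=n_blocks, replace=True))
                  for _ in range(n_resamples)])
    # No 1/sqrt(n_resamples) factor: the error is set by the original data set
    return np.sqrt(np.mean(q**2) - np.mean(q)**2)

# Example: error of the mean, which should match the data blocking result
# err = block_bootstrap_error(data, block_size=100, estimator=np.mean)
```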

Milestones

  • Implement calculation of errors and test your implementation on data with a known correlation time. You can get the code to generate random data with a specified autocorrelation time here.
  • Compute observables including errors.
  • Make sure your code is structured logically.
  • Make a plan for simulations to go into the report: How do you want to validate your simulation, and which observables/simulations do you want to run?
  • Make a brief statement about the efficiency of your code: can you run the simulations you would like to do? Would it make sense to improve performance?