qmcpack/docs/lab_qmc_statistics.rst

.. _lab-qmc-statistics:

Lab 1: MC Statistical Analysis
==============================

Topics covered in this lab
--------------------------

This lab focuses on the basics of analyzing data from MC calculations.
In this lab, participants will use data from VMC calculations of a
simple 1-electron system with an analytically soluble system (the ground
state of the hydrogen atom) to understand how to interpret an MC
situation. Most of these analyses will also carry over to DMC
simulations. Topics covered include:

-  Averaging MC variables

-  The statisical error bar of mean values

-  The effects of autocorrelation and variance on the error bar

-  The relationship between MC time step and autocorrelation

-  The use of blocking to reduce autocorrelation

-  The significance of the acceptance ratio

-  The significance of the sample size

-  How to determine whether an MC run was successful

-  The relationship between wavefunction quality and variance

-  Gauging the efficiency of MC runs

-  The cost of scaling up to larger system sizes

Lab directories and files
-------------------------

::

  labs/lab1_qmc_statistics/
  │
  ├── atom                              - H atom VMC calculation
  │   ├── H.s000.scalar.dat                - H atom VMC data
  │   └── H.xml                            - H atom VMC input file
  │
  ├── autocorrelation                   - varying autocorrelation
  │   ├── H.dat                            - data for gnuplot
  │   ├── H.plt                            - gnuplot for time step vs. E_L, tau_c
  │   ├── H.s000.scalar.dat                - H atom VMC data: time step = 10
  │   ├── H.s001.scalar.dat                - H atom VMC data: time step =  5
  │   ├── H.s002.scalar.dat                - H atom VMC data: time step =  2
  │   ├── H.s003.scalar.dat                - H atom VMC data: time step =  1
  │   ├── H.s004.scalar.dat                - H atom VMC data: time step =  0.5
  │   ├── H.s005.scalar.dat                - H atom VMC data: time step =  0.2
  │   ├── H.s006.scalar.dat                - H atom VMC data: time step =  0.1
  │   ├── H.s007.scalar.dat                - H atom VMC data: time step =  0.05
  │   ├── H.s008.scalar.dat                - H atom VMC data: time step =  0.02
  │   ├── H.s009.scalar.dat                - H atom VMC data: time step =  0.01
  │   ├── H.s010.scalar.dat                - H atom VMC data: time step =  0.005
  │   ├── H.s011.scalar.dat                - H atom VMC data: time step =  0.002
  │   ├── H.s012.scalar.dat                - H atom VMC data: time step =  0.001
  │   ├── H.s013.scalar.dat                - H atom VMC data: time step =  0.0005
  │   ├── H.s014.scalar.dat                - H atom VMC data: time step =  0.0002
  │   ├── H.s015.scalar.dat                - H atom VMC data: time step =  0.0001
  │   └── H.xml                            - H atom VMC input file
  │
  ├── average                            - Python scripts for average/std. dev.
  │   ├── average.py                         - average five E_L from H atom VMC
  │   ├── stddev2.py                         - standard deviation using (E_L)^2
  │   └── stddev.py                          - standard deviation around the mean
  │
  ├── basis                              - varying basis set for orbitals
  │   ├── H__exact.s000.scalar.dat           - H atom VMC data using STO basis
  │   ├── H_STO-2G.s000.scalar.dat           - H atom VMC data using STO-2G basis
  │   ├── H_STO-3G.s000.scalar.dat           - H atom VMC data using STO-3G basis
  │   └── H_STO-6G.s000.scalar.dat           - H atom VMC data using STO-6G basis
  │
  ├── blocking                           - varying block/step ratio
  │   ├── H.dat                              - data for gnuplot
  │   ├── H.plt                              - gnuplot for N_block vs. E, tau_c
  │   ├── H.s000.scalar.dat                  - H atom VMC data 50000:1 blocks:steps
  │   ├── H.s001.scalar.dat                  - "  "    "    "  25000:2 blocks:steps
  │   ├── H.s002.scalar.dat                  - "  "    "    "  12500:4 blocks:steps
  │   ├── H.s003.scalar.dat                  - "  "    "    "  6250: 8 blocks:steps
  │   ├── H.s004.scalar.dat                  - "  "    "    "  3125:16 blocks:steps
  │   ├── H.s005.scalar.dat                  - "  "    "    "  2500:20 blocks:steps
  │   ├── H.s006.scalar.dat                  - "  "    "    "  1250:40 blocks:steps
  │   ├── H.s007.scalar.dat                  - "  "    "    "  1000:50 blocks:steps
  │   ├── H.s008.scalar.dat                  - "  "    "    "  500:100 blocks:steps
  │   ├── H.s009.scalar.dat                  - "  "    "    "  250:200 blocks:steps
  │   ├── H.s010.scalar.dat                  - "  "    "    "  125:400 blocks:steps
  │   ├── H.s011.scalar.dat                  - "  "    "    "  100:500 blocks:steps
  │   ├── H.s012.scalar.dat                  - "  "    "    "  50:1000 blocks:steps
  │   ├── H.s013.scalar.dat                  - "  "    "    "  40:1250 blocks:steps
  │   ├── H.s014.scalar.dat                  - "  "    "    "  20:2500 blocks:steps
  │   ├── H.s015.scalar.dat                  - "  "    "    "  10:5000 blocks:steps
  │   └── H.xml                             - H atom VMC input file
  │
  ├── blocks                             -  varying total number of blocks
  │   ├── H.dat                             - data for gnuplot
  │   ├── H.plt                             - gnuplot for N_block vs. E
  │   ├── H.s000.scalar.dat                 - H atom VMC data    500 blocks
  │   ├── H.s001.scalar.dat                 - "  "    "    "    2000 blocks
  │   ├── H.s002.scalar.dat                 - "  "    "    "    8000 blocks
  │   ├── H.s003.scalar.dat                 - "  "    "    "   32000 blocks
  │   ├── H.s004.scalar.dat                 - "  "    "    "  128000 blocks
  │   └── H.xml                             - H atom VMC input file
  │
  ├── dimer                          - comparing no and simple Jastrow factor
  │   ├── H2_STO___no_jastrow.s000.scalar.dat - H dimer VMC data without Jastrow
  │   └── H2_STO_with_jastrow.s000.scalar.dat - H dimer VMC data with Jastrow
  │
  ├──  docs                               - documentation
  │   ├──  Lab_1_MC_Analysis.pdf             - this document
  │   └──  Lab_1_Slides.pdf                  - slides presented in the lab
  │
  ├── nodes                              - varying number of computing nodes
  │   ├──  H.dat                             - data for gnuplot
  │   ├──  H.plt                             - gnuplot for N_node vs. E
  │   ├──  H.s000.scalar.dat                 - H atom VMC data with  32 nodes
  │   ├──  H.s001.scalar.dat                 - H atom VMC data with 128 nodes
  │   └──  H.s002.scalar.dat                 - H atom VMC data with 512 nodes
  │
  ├── problematic                        - problematic VMC run
  │   └──  H.s000.scalar.dat                 - H atom VMC data with a problem
  │
  └── size                                - scaling with number of particles
      ├──  01________H.s000.scalar.dat       - H atom VMC data
      ├──  02_______H2.s000.scalar.dat       - H dimer "   "
      ├──  06________C.s000.scalar.dat       - C atom  "   "
      ├──  10______CH4.s000.scalar.dat       - methane "   "
      ├──  12_______C2.s000.scalar.dat       - C dimer "   "
      ├──  16_____C2H4.s000.scalar.dat       - ethene  "
      ├──  18___CH4CH4.s000.scalar.dat       - methane dimer VMC data
      ├──  32_C2H4C2H4.s000.scalar.dat       - ethene dimer   "   "
      ├──  nelectron_tcpu.dat                - data for gnuplot
      └──  Nelectron_tCPU.plt                - gnuplot for N_elec vs. t_CPU

Atomic units
------------

QMCPACK operates in Ha atomic units to reduce the number of factors in
the Schrödinger equation. Thus, the unit of length is the bohr (5.291772
:math:`\times 10^{-11}` m = 0.529177 Å); the unit of energy is the Ha
(4.359744 :math:`\times 10^{-18}` J = 27.211385 eV). The energy of the
ground state of the hydrogen atom in these units is -0.5 Ha.

.. _review:

Reviewing statistics
--------------------

We will practice taking the average (mean) and standard deviation of
some MC data by hand to review the basic definitions.

Enter Python’s command line by typing ``python [Enter]``. You will see a
prompt “>>>.”

The mean of a dataset is given by:

.. math::
  :label: eq63

  \overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_i\:.

To calculate the average of five local energies from an MC calculation
of the ground state of an electron in the hydrogen atom, input (truncate
at the thousandths place if you cannot copy and paste; script versions
are also available in the ``average`` directory):

::

  (
  (-0.45298911858) +
  (-0.45481953564) +
  (-0.48066105923) +
  (-0.47316713469) +
  (-0.46204733302)
  )/5.

Then, press ``[Enter]`` to get:

::

  >>> ((-0.45298911858) + (-0.45481953564) + (-0.48066105923) +
  (-0.47316713469) + (-0.4620473302))/5.
  -0.46473683566800006

To understand the significance of the mean, we also need the standard deviation
around the mean of the data (also called the error bar), given by:

.. math::
  :label: eq64

  \sigma = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N} ({x_i} - \overline{x})^2}\:.

To calculate the standard deviation around the mean (-0.464736835668) of these
five data points, put in:

::

  ( (1./(5.*(5.-1.))) * (
  (-0.45298911858-(-0.464736835668))**2 + \\
  (-0.45481953564-(-0.464736835668))**2 +
  (-0.48066105923-(-0.464736835668))**2 +
  (-0.47316713469-(-0.464736835668))**2 +
  (-0.46204733302-(-0.464736835668))**2 )
  )**0.5

Then, press ``[Enter]`` to get:

::

  >>> ( (1./(5.*(5.-1.))) * ( (-0.45298911858-(-0.464736835668))**2 +
  (-0.45481953564-(-0.464736835668))**2 + (-0.48066105923-(-0.464736835668))**2 +
  (-0.47316713469-(-0.464736835668))**2 + (-0.46204733302-(-0.464736835668))**2
  ) )**0.5
  0.0053303187464332066

Thus, we might report this data as having a value -0.465 +/- 0.005 Ha.
This calculation of the standard deviation assumes that the average for this
data is fixed, but we can continually add MC samples to the data, so it
is better to use an estimate of the error bar that does not rely on the overall
average.  Such an estimate is given by:

.. math::
  :label: eq65

  \tilde{\sigma} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left[{(x^2)}_i - ({x_i})^2\right]}\:.

To calculate the standard deviation with this formula, input the following,
which includes the square of the local energy calculated with each
corresponding local energy:

::

  ( (1./(5.-1.)) * (
  (0.60984565298-(-0.45298911858)**2) + \\
  (0.61641291630-(-0.45481953564)**2) +
  (1.35860151160-(-0.48066105923)**2) + \\
  (0.78720769003-(-0.47316713469)**2) +
  (0.56393677687-(-0.46204733302)**2) )
  )**0.5

and press ``[Enter]`` to get:

::

  >>> ((1./(5.-1.))*((0.60984565298-(-0.45298911858)**2)+
  (0.61641291630-(-0.45481953564)**2)+(1.35860151160-(-0.48066105923)**2)+
  (0.78720769003-(-0.47316713469)**2)+(0.56393677687-(-0.46204733302)**2))
  )**0.5
  0.84491636672906634

This much larger standard deviation, acknowledging that the mean of this
small data set is not the average in the limit of infinite sampling,
more accurately reports the value of the local energy as -0.5 +/- 0.8
Ha.

Type ``quit()`` and press ``[Enter]`` to exit the Python command line.

.. _inspect-data:

Inspecting MC Data
------------------

QMCPACK outputs data from MC calculations into files ending in
``scalar.dat``. Several quantities are calculated and written for each
block of MC steps in successive columns to the right of the step index.

Change directories to ``atom``, and open the file ending in
``scalar.dat`` with a text editor (e.g., **vi \*.scalar.dat** or **emacs
\*.scalar.dat**. If possible, adjust the terminal so that lines do not
wrap. The data will begin as follows (broken into three groups to fit on
this page):

::

  #   index    LocalEnergy         LocalEnergy_sq      LocalPotential     ...
           0   -4.5298911858e-01    6.0984565298e-01   -1.1708693521e+00
           1   -4.5481953564e-01    6.1641291630e-01   -1.1863425644e+00
           2   -4.8066105923e-01    1.3586015116e+00   -1.1766446209e+00
           3   -4.7316713469e-01    7.8720769003e-01   -1.1799481122e+00
           4   -4.6204733302e-01    5.6393677687e-01   -1.1619244081e+00
           5   -4.4313854290e-01    6.0831516179e-01   -1.2064503041e+00
           6   -4.5064926960e-01    5.9891422196e-01   -1.1521370176e+00
           7   -4.5687452611e-01    5.8139614676e-01   -1.1423627617e+00
           8   -4.5018503739e-01    8.4147849706e-01   -1.1842075439e+00
           9   -4.3862013841e-01    5.5477715836e-01   -1.2080979177e+00

The first line begins with a #, indicating that this line does not
contain MC data but rather the labels of the columns. After a blank
line, the remaining lines consist of the MC data. The first column,
labeled index, is an integer indicating which block of MC data is on
that line. The second column contains the quantity usually of greatest
interest from the simulation: the local energy. Since this simulation
did not use the exact ground state wavefunction, it does not produce
-0.5 Ha as the local energy although the value lies within about 10%.
The value of the local energy fluctuates from block to block, and the
closer the trial wavefunction is to the ground state the smaller these
fluctuations will be. The next column contains an important ingredient
in estimating the error in the MC average—the square of the local
energy—found by evaluating the square of the Hamiltonian.

::

  ...   Kinetic             Coulomb             BlockWeight        ...
         7.1788023352e-01   -1.1708693521e+00    1.2800000000e+04
         7.3152302871e-01   -1.1863425644e+00    1.2800000000e+04
         6.9598356165e-01   -1.1766446209e+00    1.2800000000e+04
         7.0678097751e-01   -1.1799481122e+00    1.2800000000e+04
         6.9987707508e-01   -1.1619244081e+00    1.2800000000e+04
         7.6331176120e-01   -1.2064503041e+00    1.2800000000e+04
         7.0148774798e-01   -1.1521370176e+00    1.2800000000e+04
         6.8548823555e-01   -1.1423627617e+00    1.2800000000e+04
         7.3402250655e-01   -1.1842075439e+00    1.2800000000e+04
         7.6947777925e-01   -1.2080979177e+00    1.2800000000e+04

The fourth column from the left consists of the values of the local potential
energy.  In this simulation, it is identical to the Coulomb potential
(contained in the sixth column) because the one electron in the simulation has
only the potential energy coming from its interaction with the nucleus.  In
many-electron simulations, the local potential energy contains contributions
from the electron-electron Coulomb interactions and the nuclear potential or
pseudopotential.  The fifth column contains the local kinetic energy value for
each MC block, obtained from the Laplacian of the wavefunction.  The sixth
column shows the local Coulomb interaction energy.  The seventh column displays
the weight each line of data has in the average (the weights are identical in
this simulation).

::

  ...    BlockCPU            AcceptRatio
         6.0178991748e-03    9.8515625000e-01
         5.8323097461e-03    9.8562500000e-01
         5.8213412744e-03    9.8531250000e-01
         5.8330412549e-03    9.8828125000e-01
         5.8108362256e-03    9.8625000000e-01
         5.8254170264e-03    9.8625000000e-01
         5.8314813086e-03    9.8679687500e-01
         5.8258469971e-03    9.8726562500e-01
         5.8158433545e-03    9.8468750000e-01
         5.7959401123e-03    9.8539062500e-01

The eighth column shows the CPU time (in seconds) to calculate the data
in that line. The ninth column from the left contains the acceptance
ratio (1 being full acceptance) for MC steps in that line’s data. Other
than the block weight, all quantities vary from line to line.

Exit the text editor (**[Esc] :q! [Enter]** in vi, **[Ctrl]-x [Ctrl]-c**
in emacs).

.. _averaging:

Averaging quantities in the MC data
-----------------------------------

QMCPACK includes the qmca Python tool to average quantities in the
``scalar.dat`` file (and also the ``dmc.dat`` file of DMC simulations).
Without any flags, qmca will output the average of each column with a
quantity in the ``scalar.dat`` file as follows.

Execute qmca by ``qmca *.scalar.dat``, which for this data outputs:

::

  H  series 0
  LocalEnergy           =          -0.45446 +/-          0.00057
  Variance              =             0.529 +/-            0.018
  Kinetic               =            0.7366 +/-           0.0020
  LocalPotential        =           -1.1910 +/-           0.0016
  Coulomb               =           -1.1910 +/-           0.0016
  LocalEnergy_sq        =             0.736 +/-            0.018
  BlockWeight           =    12800.00000000 +/-       0.00000000
  BlockCPU              =        0.00582002 +/-       0.00000067
  AcceptRatio           =          0.985508 +/-         0.000048
  Efficiency            =        0.00000000 +/-       0.00000000

After one blank, qmca prints the title of the subsequent data, gleaned
from the data file name. In this case, ``H.s000.scalar.dat`` became “H
series 0.” Everything before the first “``.s``” will be interpreted as
the title, and the number between “``.s``” and the next “.” will be
interpreted as the series number.

The first column under the title is the name of each quantity qmca
averaged. The column to the right of the equal signs contains the
average for the quantity of that line, and the column to the right of
the plus-slash-minus is the statistical error bar on the quantity. All
quantities calculated from MC simulations have and must be reported with
a statistical error bar!

Two new quantities not present in the ``scalar.dat`` file are computed
by qmca from the data—variance and efficiency. We will look at these
later in this lab.

To view only one value, **qmca** takes the **-q (quantity)** flag. For
example, the output of ``qmca -q LocalEnergy *.scalar.dat`` in this
directory produces a single line of output:

::

  H  series 0  LocalEnergy = -0.454460 +/- 0.000568

Type ``qmca --help`` to see the list of all quantities and their
abbreviations.

Evaluating MC simulation quality
--------------------------------

There are several aspects of a MC simulation to consider in deciding how well
it went.  Besides the deviation of the average from an expected value (if there
is one), the stability of the simulation in its sampling, the autocorrelation
between MC steps, the value of the acceptance ratio (accepted steps over total
proposed steps), and the variance in the local energy all indicate the quality
of an MC simulation.  We will look at these one by one.

Tracing MC quantities
~~~~~~~~~~~~~~~~~~~~~

Visualizing the evolution of MC quantities over the course of the
simulation by a *trace* offers a quick picture of whether the random
walk had the expected behavior. qmca plots traces with the -t flag.

Type ``qmca -q e -t H.s000.scalar.dat``, which produces a graph of the
trace of the local energy:

.. image:: /figs/lab_qmc_statistics_tracing1.png
  :width: 500
  :align: center

The solid black line connects the values of the local energy at each MC
block (labeled “samples”). The average value is marked with a
horizontal, solid red line. One standard deviation above and below the
average are marked with horizontal, dashed red lines.

The trace of this run is largely centered on the average with no
large-scale oscillations or major shifts, indicating a good-quality MC
run.

Try tracing the kinetic and potential energies, seeing that their
behavior is comparable with the total local energy.

Change to directory ``problematic`` and type
``qmca -q e -t H.s000.scalar.dat`` to produce this graph:

.. image:: /figs/lab_qmc_statistics_tracing2.png
  :width: 500
  :align: center

Here, the local energy samples cluster around the expected -0.5 Ha for the
first 150 samples or so and then begin to oscillate more wildly and increase
erratically toward 0, indicating a poor-quality MC run.

Again, trace the kinetic and potential energies in this run and see how their
behavior compares with the total local energy.

Blocking away autocorrelation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*Autocorrelation* occurs when a given MC step biases subsequent MC
steps, leading to samples that are not statistically independent. We
must take this autocorrelation into account to obtain accurate
statistics. qmca outputs autocorrelation when given the --sac flag.

Change to directory ``autocorrelation`` and type
``qmca -q e --sac H.s000.scalar.dat``.

::

  H  series 0  LocalEnergy = -0.454982 +/- 0.000430    1.0

The value after the error bar on the quantity is the autocorrelation
(1.0 in this case).

Proposing too small a step in configuration space, the MC *time step*,
can lead to autocorrelation since the new samples will be in the
neighborhood of previous samples. Type ``grep timestep H.xml`` to see
the varying time step values in this QMCPACK input file (``H.xml``):

::

  <parameter name="timestep">10</parameter>
  <parameter name="timestep">5</parameter>
  <parameter name="timestep">2</parameter>
  <parameter name="timestep">1</parameter>
  <parameter name="timestep">0.5</parameter>
  <parameter name="timestep">0.2</parameter>
  <parameter name="timestep">0.1</parameter>
  <parameter name="timestep">0.05</parameter>
  <parameter name="timestep">0.02</parameter>
  <parameter name="timestep">0.01</parameter>
  <parameter name="timestep">0.005</parameter>
  <parameter name="timestep">0.002</parameter>
  <parameter name="timestep">0.001</parameter>
  <parameter name="timestep">0.0005</parameter>
  <parameter name="timestep">0.0002</parameter>
  <parameter name="timestep">0.0001</parameter>

Generally, as the time step decreases, the autocorrelation will increase
(caveat: very large time steps will also have increasing
autocorrelation). To see this, type ``qmca -q e --sac *.scalar.dat`` to
see the energies and autocorrelation times, then plot with gnuplot by
inputting ``gnuplot H.plt``:

.. image:: /figs/lab_qmc_statistics_blocking1.png
  :width: 500
  :align: center

The error bar also increases with the autocorrelation.

Press ``q [Enter]`` to quit gnuplot.

To get around the bias of autocorrelation, we group the MC steps into
blocks, take the average of the data in the steps of each block, and
then finally average the averages in all the blocks. QMCPACK outputs the
block averages as each line in the ``scalar.dat`` file. (For DMC
simulations, in addition to the ``scalar.dat``, QMCPACK outputs the
quantities at each step to the ``dmc.dat`` file, which permits
reblocking the data differently from the specification in the input
file.)

Change directories to ``blocking``. Here we look at the time step of the
last dataset in the ``autocorrelation`` directory. Verify this by typing
``grep timestep H.xml`` to see that all values are set to 0.001. Now to
see how we will vary the blocking, type ``grep -A1 blocks H.xml``. The
parameter “steps” indicates the number of steps per block, and the
parameter “blocks” gives the number of blocks. For this comparison, the
total number of MC steps (equal to the product of “steps” and “blocks”)
is fixed at 50,000. Now check the effect of blocking on
autocorrelation—type ``qmca -q e --sac *scalar.dat`` to see the data and
``gnuplot H.plt`` to visualize the data:

.. image:: /figs/lab_qmc_statistics_blocking2.png
  :width: 500
  :align: center

The greatest number of steps per block produces the smallest
autocorrelation time. The larger number of blocks over which to average
at small step-per-block number masks the corresponding increase in error
bar with increasing autocorrelation.

Press ``q [Enter]`` to quit gnuplot.

Balancing autocorrelation and acceptance ratio
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adjusting the time step value also affects the ratio of accepted steps to
proposed steps.  Stepping nearby in configuration space implies that the
probability distribution is similar and thus more likely to result in an
accepted move.  Keeping the acceptance ratio high means the algorithm is
efficiently exploring configuration space and not sticking at particular
configurations.  Return to the ``autocorrelation`` directory.  Refresh your
memory on the time steps in this set of simulations by ``grep timestep
H.xml``. Then, type ``qmca -q ar *scalar.dat`` to see the acceptance ratio
as it varies with decreasing time step:

::

  H  series 0  AcceptRatio = 0.047646 +/- 0.000206
  H  series 1  AcceptRatio = 0.125361 +/- 0.000308
  H  series 2  AcceptRatio = 0.328590 +/- 0.000340
  H  series 3  AcceptRatio = 0.535708 +/- 0.000313
  H  series 4  AcceptRatio = 0.732537 +/- 0.000234
  H  series 5  AcceptRatio = 0.903498 +/- 0.000156
  H  series 6  AcceptRatio = 0.961506 +/- 0.000083
  H  series 7  AcceptRatio = 0.985499 +/- 0.000051
  H  series 8  AcceptRatio = 0.996251 +/- 0.000025
  H  series 9  AcceptRatio = 0.998638 +/- 0.000014
  H  series 10  AcceptRatio = 0.999515 +/- 0.000009
  H  series 11  AcceptRatio = 0.999884 +/- 0.000004
  H  series 12  AcceptRatio = 0.999958 +/- 0.000003
  H  series 13  AcceptRatio = 0.999986 +/- 0.000002
  H  series 14  AcceptRatio = 0.999995 +/- 0.000001
  H  series 15  AcceptRatio = 0.999999 +/- 0.000000

By series 8 (time step = 0.02), the acceptance ratio is in excess of
99%.

Considering the increase in autocorrelation and subsequent increase in
error bar as time step decreases, it is important to choose a time step
that trades off appropriately between acceptance ratio and
autocorrelation. In this example, a time step of 0.02 occupies a spot
where the acceptance ratio is high (99.6%) and autocorrelation is not
appreciably larger than the minimum value (1.4 vs. 1.0).

Considering variance
~~~~~~~~~~~~~~~~~~~~

Besides autocorrelation, the dominant contributor to the error bar is
the *variance* in the local energy. The variance measures the
fluctuations around the average local energy, and, as the fluctuations
go to zero, the wavefunction reaches an exact eigenstate of the
Hamiltonian. qmca calculates this from the local energy and local energy
squared columns of the ``scalar.dat``.

Type ``qmca -q v H.s009.scalar.dat`` to calculate the variance on the
run with time step balancing autocorrelation and acceptance ratio:

::

  H  series 9  Variance = 0.513570 +/- 0.010589

Just as the total energy does not tell us much by itself, neither does
the variance. However, comparing the ratio of the variance with the
energy indicates how the magnitude of the fluctuations compares with the
energy itself. Type ``qmca -q ev H.s009.scalar.dat`` to calculate the
energy and variance on the run side by side with the ratio:

::

                      LocalEnergy               Variance        ratio
  H  series 0  -0.454460 +/- 0.000568   0.529496 +/- 0.018445   1.1651

The very high ration of 1.1651 indicates the square of the fluctuations is on
average larger than the value itself.  In the next section, we will approach
ways to improve the variance that subsequent labs will build on.

Reducing statistical error bars
-------------------------------

Increasing MC sampling
~~~~~~~~~~~~~~~~~~~~~~

Increasing the number of MC samples in a dataset reduces the error bar
as the inverse of the square root of the number of samples. There are
two ways to increase the number of MC samples in a simulation: (1)
running more samples in parallel and (2) increasing the number of blocks
(with fixed number of steps per block, this increases the total number
of MC steps).

To see the effect of running more samples in parallel, change to the
directory ``nodes``. The series here increases the number of nodes by
factors of four from 32 to 128 to 512. Type ``qmca -q ev *scalar.dat``
and note the change in the error bar on the local energy as the number
of nodes. Visualize this with **gnuplot H.plt**:

.. image:: /figs/lab_qmc_statistics_nodes.png
  :width: 500
  :align: center

Increasing the number of blocks, unlike running in parallel, increases
the total CPU time of the simulation.

Press ``q [Enter]`` to quit gnuplot.

To see the effect of increasing the block number, change to the
directory ``blocks``. To see how we will vary the number of blocks, type
``grep -A1 blocks H.xml``. The number of steps remains fixed, thus
increasing the total number of samples. Visualize the tradeoff by
inputting ``gnuplot H.plt``:

.. image:: /figs/lab_qmc_statistics_blocks.png
  :width: 500
  :align: center

Press ``q [Enter]`` to quit gnuplot.

Improving the basis sets
~~~~~~~~~~~~~~~~~~~~~~~~

In all of the previous examples, we are using the sum of two Gaussian
functions (STO-2G) to approximate what should be a simple decaying
exponential for the wavefunction of the ground state of the hydrogen
atom. The sum of multiple copies of a function varying each copy’s width
and amplitude with coefficients is called a *basis set*. As we add
Gaussians to the basis set, the approximation improves, the variance
goes toward zero, and the energy goes to -0.5 Ha. In nearly every other
case, the exact function is unknown, and we add basis functions until
the total energy does not change within some threshold.

Change to the directory ``basis`` and look at the total energy and
variance as we change the wavefunction by typing ``qmca -q ev H_``:

::

                            LocalEnergy               Variance        ratio
  H_STO-2G  series 0  -0.454460 +/- 0.000568   0.529496 +/- 0.018445   1.1651
  H_STO-3G  series 0  -0.465386 +/- 0.000502   0.410491 +/- 0.010051   0.8820
  H_STO-6G  series 0  -0.471332 +/- 0.000491   0.213919 +/- 0.012954   0.4539
  H__exact  series 0  -0.500000 +/- 0.000000   0.000000 +/- 0.000000   -0.0000

qmca also puts out the ratio of the variance to the local energy in a
column to the right of the variance error bar. A typical high-quality
value for this ratio is lower than 0.1 or so—none of these few-Gaussian
wavefunctions satisfy that rule of thumb.

Use qmca to plot the trace of the local energy, kinetic energy, and
potential energy of H__exact. The total energy is constantly -0.5 Ha
even though the kinetic and potential energies fluctuate from
configuration to configuration.

Adding a Jastrow factor
~~~~~~~~~~~~~~~~~~~~~~~

Another route to reducing the variance is the introduction of a Jastrow
factor to account for electron-electron correlation (not the statistical
autocorrelation of MC steps but the physical avoidance that electrons
have of one another). To do this, we will switch to the hydrogen dimer
with the exact ground state wavefunction of the atom (STO basis)—this
will not be exact for the dimer. The ground state energy of the hydrogen
dimer is -1.174 Ha.

Change directories to ``dimer`` and put in ``qmca -q ev *scalar.dat`` to
see the result of adding a simple, one-parameter Jastrow to the STO
basis for the hydrogen dimer at experimental bond length:

::

                                  LocalEnergy               Variance
  H2_STO___no_jastrow  series 0  -0.876548 +/- 0.005313   0.473526 +/- 0.014910
  H2_STO_with_jastrow  series 0  -0.912763 +/- 0.004470   0.279651 +/- 0.016405

The energy reduces by 0.044 +/- 0.006 HA and the variance by 0.19 +/-
0.02. This is still 20% above the ground state energy, and subsequent
labs will cover how to improve on this with improved forms of the
wavefunction that capture more of the physics.

Scaling to larger numbers of electrons
--------------------------------------

Calculating the efficiency
~~~~~~~~~~~~~~~~~~~~~~~~~~

The inverse of the product of CPU time and the variance measures the
*efficiency* of an MC calculation. Use qmca to calculate efficiency by
typing ``qmca -q eff *scalar.dat`` to see the efficiency of these two
H :math:`_2` calculations:

::

  H2_STO___no_jastrow  series 0  Efficiency = 16698.725453 +/- 0.000000
  H2_STO_with_jastrow  series 0  Efficiency = 52912.365609 +/- 0.000000

The Jastrow factor increased the efficiency in these calculations by a factor
of three, largely through the reduction in variance (check the average block
CPU time to verify this claim).

Scaling up
~~~~~~~~~~

To see how MC scales with increasing particle number, change directories
to ``size``. Here are the data from runs of increasing numbers of
electrons for H, H\ :math:`_2`, C, CH\ :math:`_4`, C\ :math:`_2`,
C\ :math:`_2`\ H\ :math:`_4`, (CH\ :math:`_4`)\ :math:`_2`, and
(C\ :math:`_2`\ H\ :math:`_4`)\ :math:`_2` using the STO-6G basis set
for the orbitals of the Slater determinant. The file names begin with
the number of electrons simulated for those data.

Use ``qmca -q bc *scalar.dat`` to see that the CPU time per block
increases with the number of electrons in the simulation; then plot the
total CPU time of the simulation by **gnuplot Nelectron_tCPU.plt**:

.. image:: /figs/lab_qmc_statistics_scaling.png
  :width: 500
  :align: center

The green pluses represent the CPU time per block at each electron
number. The red line is a quadratic fit to those data. For a fixed basis
set size, we expect the time to scale quadratically up to 1,000s of
electrons, at which point a cubic scaling term may become dominant.
Knowing the scaling allows you to roughly project the calculation time
for a larger number of electrons.

Press ``q [Enter]`` to quit gnuplot.

This is not the whole story, however. The variance of the energy also
increases with a fixed basis set as the number of particles increases at
a faster rate than the energy decreases. To see this, type
``qmca -q ev *scalar.dat``:

::

              LocalEnergy               Variance
  01________H  series 0  -0.471352 +/- 0.000493      0.213020 +/- 0.012950
  02_______H2  series 0  -0.898875 +/- 0.000998      0.545717 +/- 0.009980
  06________C  series 0  -37.608586 +/- 0.020453   184.322000 +/- 45.481193
  10______CH4  series 0  -38.821513 +/- 0.022740   169.797871 +/- 24.765674
  12_______C2  series 0  -72.302390 +/- 0.037691   491.416711 +/- 106.090103
  16_____C2H4  series 0  -75.488701 +/- 0.042919   404.218115 +/- 60.196642
  18___CH4CH4  series 0  -58.459857 +/- 0.039309   498.579645 +/- 92.480126
  32_C2H4C2H4  series 0  -91.567283 +/- 0.048392   632.114026 +/- 69.637760

The increase in variance is not uniform, but the general trend is upward with a
fixed wavefunction form and basis set.  Subsequent labs will address how to
improve the wavefunction to keep the variance manageable.