NiO QMC Performance Benchmarks

1. Introduction

These benchmarks for VMC and DMC represent real research runs and are
large enough to be used for performance measurements. This is in
contrast to the conventional integration tests, where the particle
counts are too small to be representative. Care is still needed to
exclude initialization and I/O and to compute a representative
performance measure.

The ctest integration is sufficient to run the benchmarks, measure
relative performance from version to version of QMCPACK, and assess
proposed code changes. To obtain the highest performance on a
particular platform, you will have to run the benchmarks in a
standalone manner and tune thread counts, thread placement, walker
count, etc.
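For standalone tuning, the thread-related knobs are usually set through
the standard OpenMP environment variables. A minimal sketch; the values
below are examples only, not recommendations:

```shell
# Example values only; tune for your own hardware.
export OMP_NUM_THREADS=16      # threads per MPI task
export OMP_PROC_BIND=spread    # thread placement policy
export OMP_PLACES=cores        # bind threads to physical cores
```

The walker count is controlled through the QMCPACK input files rather
than the environment.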

2. Simulated system and QMC methods tested

The simulated systems consist of a number of repeats of a NiO
primitive cell.

Name    Atoms   Electrons   Electrons per spin
S1          4          48                   24
S2          8          96                   48
S4         16         192                   96
S8         32         384                  192
S16        64         768                  384
S24        96        1152                  576
S32       128        1536                  768
S48       192        2304                 1152
S64       256        3072                 1536
S128      512        6144                 3072
S256     1024       12288                 6144

Runs consist of a number of short blocks of (i) VMC without drift,
(ii) VMC with the drift term included, and (iii) DMC with constant
population.

These different runs vary the ratio between value, gradient, and
Laplacian evaluations of the wavefunction. The most important
performance is for DMC, which dominates supercomputer time usage. For
a large enough supercell, the runs scale cubically in cost with the
number of "electrons per spin".

Two sets of wavefunctions are tested: splined orbitals with one- and
two-body Jastrow functions, and a more complex form with an additional
three-body Jastrow function. The Jastrows are the same for each run
and are not reoptimized, as might be done for research.

On early-2017-era hardware and QMCPACK code, it is very likely that
only the first three supercells are easily runnable due to memory
limitations.

3. Requirements

Download the necessary NiO h5 orbital files of different sizes from
the following link:

https://anl.box.com/s/yxz1ic4kxtdtgpva5hcmlom9ixfl3v3c

Alternatively, download the files directly on the command line via
curl -L -O -J <URL>:

# NiO-fcc-supertwist111-supershift000-S1.h5
https://anl.box.com/shared/static/uduxhujxkm1st8pau9muin255cxr2blb.h5
# NiO-fcc-supertwist111-supershift000-S2.h5
https://anl.box.com/shared/static/g5ceycyjhb2b6segk7ibxup2hxnd77ih.h5
# NiO-fcc-supertwist111-supershift000-S4.h5
https://anl.box.com/shared/static/47sjyru249ct438j450o7nos6siuaft2.h5
# NiO-fcc-supertwist111-supershift000-S8.h5
https://anl.box.com/shared/static/3sgw5wsfkbptptxyuu8r4iww9om0grwk.h5
# NiO-fcc-supertwist111-supershift000-S16.h5
https://anl.box.com/shared/static/f2qftlejohkv48alidi5chwjspy1fk15.h5
# NiO-fcc-supertwist111-supershift000-S24.h5
https://anl.box.com/shared/static/hiysnip3o8e3sp15e3e931ca4js3zsnw.h5
# NiO-fcc-supertwist111-supershift000-S32.h5
https://anl.box.com/shared/static/tjdc8o3yt69crl8xqx7lbmqts03itfve.h5
# NiO-fcc-supertwist111-supershift000-S48.h5
https://anl.box.com/shared/static/7jzdg0yp2njanz5roz5j40lcqc4poxqj.h5
# NiO-fcc-supertwist111-supershift000-S64.h5
https://anl.box.com/shared/static/yneul9l7rq2ad35vkt4mgmr2ijxt5vb6.h5
# NiO-fcc-supertwist111-supershift000-S128.h5
https://anl.box.com/shared/static/a0j8gjrfvco0mnko00wq5ujt5oidlg0y.h5
# NiO-fcc-supertwist111-supershift000-S256.h5
https://anl.box.com/shared/static/373klkrpmc362aevkt7gb0s8rg1hs9ps.h5

The above direct links were verified in June 2023 but may be fragile.

Only the specific problem sizes needed for the intended benchmark
runs need to be downloaded.
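To fetch several sizes in one go, the direct links above can be fed to
a small loop. A sketch that only prints the curl commands for the
chosen sizes, so you can inspect them before piping the output to sh
to actually download (the list below copies three of the links given
above; extend it as needed):

```shell
# Choose the problem sizes you need; S1 and S2 shown as examples.
sizes="S1 S2"
cmds=$(while read -r size url; do
  case " $sizes " in
    *" $size "*) printf 'curl -L -O -J %s\n' "$url" ;;
  esac
done <<'EOF'
S1 https://anl.box.com/shared/static/uduxhujxkm1st8pau9muin255cxr2blb.h5
S2 https://anl.box.com/shared/static/g5ceycyjhb2b6segk7ibxup2hxnd77ih.h5
S4 https://anl.box.com/shared/static/47sjyru249ct438j450o7nos6siuaft2.h5
EOF
)
echo "$cmds"
```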

Please check the md5 values of the h5 files before starting any
benchmarking.

$ md5sum *.h5
e09c619f93daebca67f51a6d235bfcb8  NiO-fcc-supertwist111-supershift000-S1.h5
0beffc63ef597f27b70d10be43825515  NiO-fcc-supertwist111-supershift000-S2.h5
10c4f5150b1e77bbb73da8b1e4aa2b7a  NiO-fcc-supertwist111-supershift000-S4.h5
6476972b54b58c89d15c478ed4e10317  NiO-fcc-supertwist111-supershift000-S8.h5
b47f4be12f98f8a3d4b65d0ae048b837  NiO-fcc-supertwist111-supershift000-S16.h5
2a149cfe4153f7d56409b5ce4eaf35d6  NiO-fcc-supertwist111-supershift000-S24.h5
ee1f6c6699a24e30d7e6c122cde55ac1  NiO-fcc-supertwist111-supershift000-S32.h5
d159ef4d165d2d749c5557dfbf8cbdce  NiO-fcc-supertwist111-supershift000-S48.h5
40ecaf05177aa4bbba7d3bf757994548  NiO-fcc-supertwist111-supershift000-S64.h5
0a530594a3c7eec4f0155b5b2ca92eb0  NiO-fcc-supertwist111-supershift000-S128.h5
cff0101debb11c8c215e9138658fbd21  NiO-fcc-supertwist111-supershift000-S256.h5
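The check can be automated with md5sum -c, which reads "hash filename"
pairs from a checksum file. A sketch using a stand-in file; substitute
a real .h5 name and the corresponding value from the list above:

```shell
# Stand-in file and its known md5; replace with a real .h5 and its hash.
printf 'hello\n' > sample.h5
echo "b1946ac92492d2347c6235b4d2611184  sample.h5" > sample.md5
md5sum -c sample.md5    # prints "sample.h5: OK" when the hash matches
rm sample.h5 sample.md5
```

Note the two spaces between the hash and the filename, which md5sum -c
expects.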

$ ls -l *.h5
  42861392 NiO-fcc-supertwist111-supershift000-S1.h5
  75298480 NiO-fcc-supertwist111-supershift000-S2.h5
 141905680 NiO-fcc-supertwist111-supershift000-S4.h5
 275701688 NiO-fcc-supertwist111-supershift000-S8.h5
 545483396 NiO-fcc-supertwist111-supershift000-S16.h5
 818687172 NiO-fcc-supertwist111-supershift000-S24.h5
1093861616 NiO-fcc-supertwist111-supershift000-S32.h5
1637923716 NiO-fcc-supertwist111-supershift000-S48.h5
2180300396 NiO-fcc-supertwist111-supershift000-S64.h5
4375340300 NiO-fcc-supertwist111-supershift000-S128.h5
8786322376 NiO-fcc-supertwist111-supershift000-S256.h5

The data files should be placed in a directory named NiO.

4. Throughput metric

A key result that can be extracted from the benchmarks is a
throughput metric, the "time to move one walker", measured on a
per-step basis. One can also compute the "walkers moved per second
per node", a throughput metric that factors in the available hardware
(threads, cores, GPUs).

Higher throughput measures are better. Note, however, that the metric
does not account for the equilibration period of the Monte Carlo or
consider the reasonable minimum and maximum number of walkers usable
for a specific scientific calculation. Hence doubling the throughput
does not automatically halve the time to scientific solution,
although for many scenarios it will.
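As an illustration of the arithmetic with invented numbers (not
measured results): suppose one node advances 128 walkers for 100 steps
in 16.0 seconds. Then:

```shell
# Hypothetical run statistics; substitute your own measurements.
walkers=128 steps=100 seconds=16.0
awk -v w="$walkers" -v s="$steps" -v t="$seconds" 'BEGIN {
  printf "time to move one walker one step: %.6f s\n", t / (w * s)
  printf "walker steps per second per node: %.1f\n", (w * s) / t
}'
# -> time to move one walker one step: 0.001250 s
# -> walker steps per second per node: 800.0
```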

5. Benchmarking with ctest

This is the simplest way to calibrate performance, though it has some
limitations. The current choice uses a fixed 1 MPI task with 16
threads on a single node on CPU systems. If you need to change either
of these numbers, or you need to control more hardware behaviors such
as thread affinity, please read the next section.

To activate the ctest route, add the following option to your cmake
command line before building your binary:

-DQMC_DATA=<full path to your data folder>

All the h5 files must be placed in a subdirectory
<full path to your data folder>/NiO.

Run tests with the command "ctest -R performance-NiO" after building
QMCPACK. Add "-VV" to capture the QMCPACK output. Enabling the timers
is not essential, but it activates fine-grained timers and counters
useful for analysis, such as the number of times specific kernels are
called and their speed.
6. Running the benchmarks manually

1) Complete step 5 (above), which will cause cmake to generate all
input cases and adjust parameters.

2) Copy all the files in YOUR_BUILD_FOLDER/tests/performance/NiO
(not YOUR_QMCPACK_REPO/tests/performance/NiO) to a work directory
(WDIR) where you would like to run the benchmark.

3) Copy or softlink all the h5 files to your WDIR if the existing
links are broken.

4) Run benchmarks

(i) On a standalone workstation

Directly enter the dmc-xxx folders and run individual benchmarks:

YOUR_BUILD_FOLDER/bin/qmcpack NiO-fcc-SX-dmc.xml

(ii) On a cluster with a job queue system

Prepare one job script for each dmc-xxx folder. We provide two
samples for CPU (qmcpack-cpu-cetus.sub) and GPU
(qmcpack-gpu-cooley.sub) runs at ALCF Cetus and Cooley to give you a
basic idea of how to run QMCPACK manually.

a) Customize the header based on your machine.

b) You always need to point the variable "exe" to the binary that
you would like to benchmark.

c) Update SXX in "file_prefix" based on the problem size to run.

d) Customize the mpirun command based on the job dispatcher on your
system, and pick the number of MPI tasks and OMP threads as well as
any other controls you would like to add.

e) Submit your job.
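The steps above can be sketched as follows. The paths, the launcher,
the input-file naming, and the counts here are all hypothetical
placeholders to adapt; only the variable names "exe" and "file_prefix"
follow the sample scripts' convention. The sketch prints the command
rather than running it, so it can be checked first:

```shell
# All values here are hypothetical placeholders; adjust for your system.
exe=$HOME/qmcpack/build/bin/qmcpack      # b) binary to benchmark
file_prefix=NiO-fcc-S8                   # c) problem size S8
nmpi=4                                   # d) MPI tasks
nthreads=16                              # d) OMP threads per task
cmd="mpirun -np $nmpi $exe ${file_prefix}-dmc.xml"
echo "OMP_NUM_THREADS=$nthreads $cmd"    # print for checking, then run
```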

5) Collect performance results

One simple performance metric is the time per block, which indicates
how fast walkers are advancing. This metric can be measured with
qmca, an analysis tool shipped with QMCPACK. Add
YOUR_QMCPACK_REPO/nexus/bin to the environment variable PATH.

In your WDIR, use

qmca -q bc -e 0 dmc*/*.scalar.dat

to collect the timing for all the runs. Or, in each subfolder, type

qmca -q bc -e 0 *.scalar.dat

The current benchmarks contain three run sections:

I) VMC + no drift
II) VMC + drift
III) DMC with constant population

i.e. three timings are given per run. Timing information is also
included in the standard output of QMCPACK and a *.info.xml file
produced by each run. In the standard output, "QMC Execution time" is
the time per run section, e.g. all blocks of VMC with drift, while
fine-grained timing information is printed at the end.
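Making qmca visible (step 5 above) is a one-line PATH addition.
A sketch, assuming purely for illustration that the repository is
cloned at $HOME/qmcpack:

```shell
# Adjust the path to wherever your QMCPACK clone actually lives.
QMCPACK_REPO=$HOME/qmcpack                  # hypothetical location
export PATH="$QMCPACK_REPO/nexus/bin:$PATH"
```

After this, the qmca commands above can be run from any directory.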

7. Additional considerations

The performance runs have settings to recompute the inverse of the
Slater determinants at a different frequency from the defaults. The
recomputation is controlled by the blocks_between_recompute parameter
and is set so that this occurs once per QMC section. This ensures
that the relevant code paths are exercised in these short runs and
included in any performance analysis based on them. The defaults,
which depend on whether the build is mixed or full precision, are
described in https://qmcpack.readthedocs.io/en/develop/methods.html .
To maintain numerical accuracy in production QMC calculations, this
recomputation must be performed at a minimum frequency determined by
the electron count, precision, and details of the simulated system.