Expanded README with more run details and information for non-specialists

This commit is contained in:
Paul Kent 2017-02-21 12:29:21 -05:00
parent 002d84f3ad
commit 53de3c8731
3 changed files with 147 additions and 41 deletions

NiO QMC Performance Benchmarks
1. Introduction
These benchmarks for VMC and DMC represent real research runs and are
large enough to be used for performance measurements. This is in
contrast to the conventional integration tests, where the particle
counts are too small to be representative. Care is still needed to
exclude initialization and I/O and to compute a representative
performance measure.
The ctest integration is sufficient to run the benchmarks, measure
relative performance from version to version of QMCPACK, and assess
proposed code changes. To obtain the highest performance on a particular
platform, you must run the benchmarks in a standalone manner and tune
thread counts, placement, walker count, etc.
2. Simulated system and QMC methods tested
The simulated systems consist of a number of repeats of a NiO
primitive cell.
Name    Atoms   Electrons   Electrons per spin
S8         32        384          192
S16        64        768          384
S32       128       1536          768
S64       256       3072         1536
S128      512       6144         3072
S256     1024      12288         6144
Runs consist of a number of short blocks of (i) VMC without drift, (ii)
VMC with the drift term included, and (iii) DMC with constant population.
These different runs vary the ratio between value, gradient, and
Laplacian evaluations of the wavefunction. The most important
performance is for DMC, which dominates supercomputer time usage. For
a large enough supercell, the runs scale cubically in cost with the
"electrons per spin".
Two sets of wavefunctions are tested: splined orbitals with one- and
two-body Jastrow functions, and a more complex form with an additional
three-body Jastrow function. The Jastrows are the same for each run
and are not reoptimized, as might be done for research.
On early 2017 era hardware and QMCPACK code, it is very likely that
only the first 3 supercells are easily runnable due to memory
limitations.
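To put the cubic cost scaling above in concrete terms, the rough
illustration below (an added example, not benchmark output) tabulates
the relative DMC cost per step per walker implied by that model,
normalized to the S8 cell:
$ awk 'BEGIN { split("192 384 768 1536 3072 6144", e, " ");
               for (i = 1; i <= 6; i++)
                 printf "%5d electrons/spin -> ~%.0f x S8 cost\n", e[i], (e[i]/192)^3 }'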
3. Requirements
Download the necessary NiO h5 orbital files of different sizes from
the following link:
https://anl.box.com/s/pveyyzrc2wuvg5tmxjzzwxeo561vh3r0
This link will be updated when a longer term storage host is
identified. You only need to download the sizes you would like to
include in your benchmarking runs.
Please check the md5 value of the h5 files before starting any
benchmarking. On Linux distributions, the md5sum tool is widely
available.
$ md5sum *.h5
6476972b54b58c89d15c478ed4e10317 NiO-fcc-supertwist111-supershift000-S8.h5
b47f4be12f98f8a3d4b65d0ae048b837 NiO-fcc-supertwist111-supershift000-S16.h5
ee1f6c6699a24e30d7e6c122cde55ac1 NiO-fcc-supertwist111-supershift000-S32.h5
0a530594a3c7eec4f0155b5b2ca92eb0 NiO-fcc-supertwist111-supershift000-S128.h5
cff0101debb11c8c215e9138658fbd21 NiO-fcc-supertwist111-supershift000-S256.h5
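If you prefer an automated check, one possible convenience (not part of
the benchmark scripts) is to save the checksum lines above into a file
and let md5sum verify all downloads at once:
$ # checksums.md5 is a hypothetical file holding the md5 lines listed above;
$ # md5sum prints "<file>: OK" for every file that matches.
$ md5sum -c checksums.md5
For reference, the h5 files have the following sizes in bytes: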
$ ls -l *.h5
275701688 NiO-fcc-supertwist111-supershift000-S8.h5
545483396 NiO-fcc-supertwist111-supershift000-S16.h5
1093861616 NiO-fcc-supertwist111-supershift000-S32.h5
2180300396 NiO-fcc-supertwist111-supershift000-S64.h5
4375340300 NiO-fcc-supertwist111-supershift000-S128.h5
8786322376 NiO-fcc-supertwist111-supershift000-S256.h5
The data files should be placed in a directory labeled NiO.
4. Throughput metric
A key result that can be extracted from the benchmarks is a throughput
metric, or the "time to move one walker", measured on a per-step
basis. One can also compute the "walkers moved per second per node",
factoring in the available hardware (threads, cores, GPUs).
Higher throughput measures are better. Note, however, that the metric
does not account for the equilibration period of the Monte Carlo or
consider the reasonable minimum and maximum numbers of walkers usable
for a specific scientific calculation. Hence doubling the throughput
does not automatically halve the time to scientific solution, although
for many scenarios it will.
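As an illustration only, the sketch below shows how these throughput
numbers could be derived from a measured block time; the walker count,
steps per block, and time per block are made-up placeholder values, not
results from these benchmarks:
$ # placeholder inputs: 256 walkers on the node, 5 steps per block,
$ # 12.5 seconds per block (e.g. the time per block reported by qmca,
$ # described in section 6)
$ awk 'BEGIN { walkers = 256; steps = 5; block_time = 12.5;
               printf "time to move one walker:           %.4f s/step\n", block_time / (steps * walkers);
               printf "walkers moved per second per node: %.1f\n", walkers * steps / block_time }'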
5. Benchmarking with ctest
This is the simplest way to calibrate performance, though it has some
limitations. The current choice uses a fixed 1 MPI task with 16 threads
on a single node on CPU systems. If you need to change either of these
numbers or you need to control more hardware behaviors such as thread
affinity, please read the next section.
To activate the ctest route, add the following options to your cmake
command line before building your binary:
-DQMC_DATA=YOUR_DATA_FOLDER -DENABLE_TIMERS=1
YOUR_DATA_FOLDER contains a folder called NiO with the h5 files in it.
Run the tests with the command "ctest -R performance-NiO" after building
QMCPACK. Add "-VV" to capture the QMCPACK output. Enabling the timers
is not essential, but it activates fine grained timers and counters
useful for analysis, such as the number of times specific kernels are
called and their speed.
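Below is a minimal sketch of this route, assuming an out-of-source
build; all paths are placeholders for your own source, build, and data
locations, and the build step may differ on your system:
$ cd build                    # your QMCPACK build directory
$ cmake -DQMC_DATA=/path/to/YOUR_DATA_FOLDER -DENABLE_TIMERS=1 /path/to/qmcpack
$ make -j 16
$ ctest -R performance-NiO -VV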
6. Running the benchmarks manually
1) Copy the whole current folder (tests/performance/NiO) to a work
directory (WDIR) in which you would like to run the benchmark.
2) Copy or softlink all the h5 files to your WDIR.
3) Prepare an example job script for submitting a single calculation
to a job queuing system. We provide two samples for CPU
(qmcpack-cpu-cetus.sub) and GPU (qmcpack-gpu-cooley.sub) runs at
ALCF Cetus and Cooley to give you a basic idea of how to run QMCPACK
manually.
a) Customize the header based on your machine.
b) You always need to point the variable "exe" to the binary that you would like to benchmark.
c) "file_prefix" should not be changed; the run script will
update it to point to the right size.
d) Customize the mpirun invocation based on the job dispatcher on
your system and pick the MPI/THREADS settings as well as any other
controls you would like to add (a hypothetical job script skeleton
is sketched at the end of this section).
*If your system does not have a job queue, remove everything before
$exe in that line.
4) Customize run scripts.
The files submit_cpu_cetus.sh and submit_gpu_cooley.sh are example
job submission scripts that provide a basic scan with a single run
for each system size. We suggest making a customized version for
your benchmark machines.
These scripts create individual folders for each benchmark run
and submit it to the job queue.
ATTENTION: the GPU run uses a default of 32 walkers per MPI task. You
may adjust this in run_gpu.sh based on your hardware capability.
Generally, more walkers lead to higher performance.
*If your system does not have a job queue, use "subjob=sh" in the script.
5) Collect performance results
A simple performance metric is the time per block, which
reflects how fast the walkers are advancing.
It can be measured with qmca, an analysis tool shipped with
QMCPACK.
In your WDIR, use
qmca -q bc -e 0 dmc*/*.scalar.dat
to collect the timing for all the runs.
Or, in each subfolder, type
qmca -q bc -e 0 *.scalar.dat
The current benchmarks contain 3 run sections:
I) VMC + no drift
II) VMC + drift
III) DMC with constant population
So three timings are given per run. Timing information is also
included in the standard output of QMCPACK and in a *.info.xml file
produced by each run. In the standard output, "QMC Execution time" is
the time per run section, e.g. all blocks of VMC with drift, while the
fine grained timing information is printed at the end.
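For steps 3) and 4) above, a hypothetical job script skeleton is
sketched below. Only the variables "exe" and "file_prefix" come from
the provided samples; the scheduler header, launcher line, thread
setting, and input file name are assumptions that must be adapted to
your machine and to the actual scripts.
#!/bin/bash
# Hypothetical skeleton only; adapt the header and launcher to your system.
#SBATCH --nodes=1 --time=01:00:00     # a) scheduler header (example syntax)
exe=/path/to/qmcpack                  # b) the binary you want to benchmark
file_prefix=NiO-S8-example            # c) placeholder; the run script sets this
export OMP_NUM_THREADS=16             # d) threads per MPI task
mpirun -np 1 $exe $file_prefix.xml    # d) launcher, MPI count, other controls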
Please ask in the QMCPACK Google group if you have any questions:
https://groups.google.com/forum/#!forum/qmcpack