mirror of https://github.com/QMCPACK/qmcpack.git
Expanded README with more run details and information for non-specialists
This commit: 53de3c8731 (parent: 002d84f3ad)
NiO QMC Performance Benchmarks

0. Benchmark Release Notes

v.0
Initial benchmark set. Supercell sizes up to S256 (1024 atoms), but only
checked up to S64 (256 atoms).

v.1
Add J3 to the CPU tests.

v.2
Add one more QMC section to each test:
from I) VMC + no drift, II) DMC with constant population
to   I) VMC + no drift, II) VMC + drift, III) DMC with constant population.

1. Introduction

These benchmarks for VMC and DMC represent real research runs and are
large enough to be used for performance measurements. This is in
contrast to the conventional integration tests, where the particle
counts are too small to be representative. Care is still needed to
exclude initialization and I/O and to compute a representative
performance measure.

The ctest integration is sufficient to run the benchmarks, measure
relative performance from version to version of QMCPACK, and assess
proposed code changes. To obtain the highest performance on a particular
platform, you must run the benchmarks in a standalone manner and tune
thread counts, placement, walker counts, etc.
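
For standalone runs, thread counts and placement are usually set through
the standard OpenMP/MPI controls, while walker counts are set in the
inputs or run scripts (see section 6). The launcher, binding flag, and
input file name below are illustrative assumptions only:

$ export OMP_NUM_THREADS=16
$ mpirun -np 2 --bind-to socket /path/to/qmcpack NiO-dmc-example.xml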

2. Simulated system and QMC methods tested

The simulated systems consist of a number of repeats of a NiO
primitive cell:

   Name    Atoms   Electrons   Electrons per spin
   S8         32        384           192
   S16        64        768           384
   S32       128       1536           768
   S64       256       3072          1536
   S128      512       6144          3072
   S256     1024      12288          6144

Runs consist of a number of short blocks of (i) VMC without drift,
(ii) VMC with the drift term included, and (iii) DMC with constant
population.

These different runs vary the ratio between value, gradient, and
Laplacian evaluations of the wavefunction. The most important
performance is for DMC, which dominates supercomputer time usage. For
a large enough supercell, the runs scale cubically in cost with the
"electrons per spin".

Two sets of wavefunctions are tested: splined orbitals with one- and
two-body Jastrow functions, and a more complex form with an additional
three-body Jastrow function. The Jastrows are the same for each run
and are not reoptimized, as might be done for research.

On early-2017-era hardware and QMCPACK code, it is very likely that
only the first three supercells are easily runnable due to memory
limitations.

3. Requirements

Download the necessary NiO h5 orbital files of different sizes from
the following link:

https://anl.box.com/s/pveyyzrc2wuvg5tmxjzzwxeo561vh3r0

This link will be updated when a longer term storage host is
identified. You only need to download the sizes you would like to
include in your benchmarking runs.

Please check the md5 values of the h5 files before starting any
benchmarking. On Linux distributions, the md5sum tool is widely
available.

$ md5sum *.h5
6476972b54b58c89d15c478ed4e10317 NiO-fcc-supertwist111-supershift000-S8.h5
b47f4be12f98f8a3d4b65d0ae048b837 NiO-fcc-supertwist111-supershift000-S16.h5
ee1f6c6699a24e30d7e6c122cde55ac1 NiO-fcc-supertwist111-supershift000-S32.h5
0a530594a3c7eec4f0155b5b2ca92eb0 NiO-fcc-supertwist111-supershift000-S128.h5
cff0101debb11c8c215e9138658fbd21 NiO-fcc-supertwist111-supershift000-S256.h5

$ ls -l *.h5
275701688  NiO-fcc-supertwist111-supershift000-S8.h5
545483396  NiO-fcc-supertwist111-supershift000-S16.h5
1093861616 NiO-fcc-supertwist111-supershift000-S32.h5
2180300396 NiO-fcc-supertwist111-supershift000-S64.h5
4375340300 NiO-fcc-supertwist111-supershift000-S128.h5
8786322376 NiO-fcc-supertwist111-supershift000-S256.h5

The data files should be placed in a directory labeled NiO.
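
A minimal sketch of one possible layout for the ctest route described
in section 5 (the /scratch/you/qmc_data path is only an assumption; use
any location you like):

$ mkdir -p /scratch/you/qmc_data/NiO
$ mv NiO-fcc-supertwist111-supershift000-S*.h5 /scratch/you/qmc_data/NiO/

Then pass -DQMC_DATA=/scratch/you/qmc_data to cmake as described in
section 5. For manual runs (section 6), the h5 files are instead copied
or linked into the work directory.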

4. Throughput metric

A key result that can be extracted from the benchmarks is a throughput
metric, or the "time to move one walker", as measured on a per-step
basis. One can also compute the "walkers moved per second per node",
factoring in the available hardware (threads, cores, GPUs).
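
As a sketch of the arithmetic (the numbers below are hypothetical, not
measured results): if one node advances 256 walkers through 10 steps in
8 seconds, then

$ echo "256 * 10 / 8" | bc -l     # 320 walker-steps per second per node

and the corresponding "time to move one walker" per step is the
reciprocal, 1/320 s.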

Higher throughput measures are better. Note, however, that the metric
does not account for the equilibration period of the Monte Carlo, nor
does it consider the reasonable minimum and maximum numbers of walkers
usable for a specific scientific calculation. Hence doubling the
throughput does not automatically halve the time to scientific
solution, although for many scenarios it will.

5. Benchmarking with ctest

This is the simplest way to calibrate performance, though it has some
limitations. The current choice is a fixed 1 MPI task with 16 threads
on a single node on CPU systems. If you need to change either of these
numbers, or you need to control more hardware behaviors such as thread
affinity, please read the next section.

To activate the ctest route, add the following options to your cmake
command line before building your binary:

  -DQMC_DATA=YOUR_DATA_FOLDER -DENABLE_TIMERS=1

YOUR_DATA_FOLDER contains a folder called NiO with the h5 files in it.
Run the tests with the command "ctest -R performance-NiO" after building
QMCPACK. Add "-VV" to capture the QMCPACK output. Enabling the timers
is not essential, but it activates fine grained timers and counters
useful for analysis, such as the number of times specific kernels are
called and their speed.
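
A minimal end-to-end sketch, assuming an out-of-source build directory
and that /scratch/you/qmc_data contains the NiO folder (both paths are
assumptions, and any compiler or library options your machine needs
must still be added):

$ cd qmcpack/build
$ cmake -DQMC_DATA=/scratch/you/qmc_data -DENABLE_TIMERS=1 ..
$ make -j 16
$ ctest -R performance-NiO -VV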

6. Running the benchmarks manually

1) Copy the whole current folder (tests/performance/NiO) to a work
   directory (WDIR) where you would like to run the benchmark.

2) Copy or softlink all the h5 files to your WDIR.

3) Prepare an example job script for submitting a single calculation
   to a job queuing system. We provide two samples for CPU
   (qmcpack-cpu-cetus.sub) and GPU (qmcpack-gpu-cooley.sub) runs at
   ALCF Cetus and Cooley to give you a basic idea of how to run
   QMCPACK manually.

   a) Customize the header based on your machine.
   b) You always need to point the variable "exe" to the binary that
      you would like to benchmark.
   c) "file_prefix" should not be changed; the run script will update
      it to point to the right size.
   d) Customize the mpirun line based on the job dispatcher on your
      system and pick the MPI/THREADS settings as well as any other
      controls you would like to add.

   * If your system does not have a job queue, remove everything
     before $exe in that line.

4) Customize the run scripts.

   The files submit_cpu_cetus.sh and submit_gpu_cooley.sh are example
   job submission scripts that provide a basic scan with a single run
   for each system size. We suggest making a customized version for
   your benchmark machines.

   These scripts create an individual folder for each benchmark run
   and submit it to the job queue.

   ATTENTION: the GPU runs default to 32 walkers per MPI task. You may
   adjust this in the GPU submission script based on your hardware
   capability. Generally, more walkers lead to higher performance.

   * If your system does not have a job queue, use "subjob=sh" in the
     script.

5) Collect performance results.

   A simple performance metric is the time per block, which reflects
   how fast the walkers are advancing. It can be measured with qmca,
   an analysis tool shipped with QMCPACK.

   In your WDIR, use

     qmca -q bc -e 0 dmc*/*.scalar.dat

   to collect the timings for all the runs, or type

     qmca -q bc -e 0 *.scalar.dat

   in each subfolder.

   The current benchmarks contain 3 run sections:

     I)   VMC + no drift
     II)  VMC + drift
     III) DMC with constant population

   so three timings are given per run. Timing information is also
   included in the standard output of QMCPACK and in the *.info.xml
   file produced by each run. In the standard output, "QMC Execution
   time" is the time per run section, e.g. all blocks of VMC with
   drift, while fine grained timing information is printed at the end.
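
A quick way to pull these per-section timings out of all the runs is to
grep the standard output; the dmc*/ folders and the *.out extension
below are assumptions that depend on how your job scripts name their
output files:

$ grep "QMC Execution time" dmc*/*.out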

Please ask in the QMCPACK Google group if you have any questions:
https://groups.google.com/forum/#!forum/qmcpack