diff --git a/tests/performance/NiO/README b/tests/performance/NiO/README
index 4e25ded1b..fb22efb4f 100644
--- a/tests/performance/NiO/README
+++ b/tests/performance/NiO/README
@@ -1,24 +1,65 @@
-0. Benchmark Release Notes.
+NiO QMC Performance Benchmarks
 
-v.0
-Initial benchmark set. Supercell size up to 256 (1024 atoms), but only checked up to 64 (256 atoms)
+1. Introduction
 
-v.1
-Add J3 to CPU tests.
+These benchmarks for VMC and DMC represent real research runs and are
+large enough to be used for performance measurements. This is in
+contrast to the conventional integration tests, where the particle
+counts are too small to be representative. Care is still needed to
+exclude initialization and I/O costs and to compute a representative
+performance measure.
 
-v.2
-Add one more qmc section to each test,
-From I) VMC + no drift, II) DMC with constant population
-To I) VMC + no drift, II) VMC + drift, III) DMC with constant population
+The ctest integration is sufficient to run the benchmarks, to measure
+relative performance from version to version of QMCPACK, and to
+assess proposed code changes. To obtain the highest performance on a
+particular platform, you must run the benchmarks in a standalone
+manner and tune thread counts, placement, walker counts, etc.
+
+2. Simulated system and QMC methods tested
+
+The simulated systems consist of a number of repeats of a NiO
+primitive cell.
+
+Name  Atoms  Electrons  Electrons per spin
+  S8     32        384                 192
+ S16     64        768                 384
+ S32    128       1536                 768
+ S64    256       3072                1536
+S128    512       6144                3072
+S256   1024      12288                6144
+
+Runs consist of a number of short blocks of (i) VMC without drift,
+(ii) VMC with the drift term included, and (iii) DMC with constant
+population.
+
+These different runs vary the ratio between value, gradient, and
+Laplacian evaluations of the wavefunction. DMC performance is the
+most important, since DMC dominates supercomputer time usage. For a
+large enough supercell, the cost of the runs scales cubically with
+the "electrons per spin".
+
+Two sets of wavefunctions are tested: splined orbitals with one- and
+two-body Jastrow functions, and a more complex form with an
+additional three-body Jastrow function. The Jastrow factors are the
+same for each run and are not reoptimized, as might be done for
+research.
+
+On early-2017-era hardware and QMCPACK code, it is very likely that
+only the first three supercells are easily runnable due to memory
+limitations.
+
+3. Requirements
+
+Download the necessary NiO h5 orbital files of different sizes from
+the following link:
 
-1. Before your run.
-Download the necessary NiO h5 orbital files of different sizes from the following link
 https://anl.box.com/s/pveyyzrc2wuvg5tmxjzzwxeo561vh3r0
-This link will be updated when the storage host is changed.
-You only need to download the sizes you would like to include in your benchmarking runs.
-Please check the md5 value of h5 files before starting any benchmarking.
-Only Linux distributions, md5sum tool is widely available
+
+This link will be updated when a longer term storage host is
+identified. You only need to download the sizes you would like to
+include in your benchmarking runs.
+
+Please check the md5 values of the h5 files before starting any
+benchmarking. The expected values are:
+
+$ md5sum *.h5
 6476972b54b58c89d15c478ed4e10317  NiO-fcc-supertwist111-supershift000-S8.h5
 b47f4be12f98f8a3d4b65d0ae048b837  NiO-fcc-supertwist111-supershift000-S16.h5
@@ -27,45 +68,110 @@ ee1f6c6699a24e30d7e6c122cde55ac1 NiO-fcc-supertwist111-supershift000-S32.h5
 0a530594a3c7eec4f0155b5b2ca92eb0  NiO-fcc-supertwist111-supershift000-S128.h5
 cff0101debb11c8c215e9138658fbd21  NiO-fcc-supertwist111-supershift000-S256.h5
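+
+On Linux distributions the md5sum tool is widely available and can
+also verify the files automatically against a saved checksum list. A
+minimal sketch for the S8 file alone (the NiO.md5 file name is
+arbitrary; extend the list to the sizes you downloaded):
+
+$ cat > NiO.md5 <<EOF
+6476972b54b58c89d15c478ed4e10317  NiO-fcc-supertwist111-supershift000-S8.h5
+EOF
+$ md5sum -c NiO.md5
+NiO-fcc-supertwist111-supershift000-S8.h5: OK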
 
-2. Benchmarking with ctest.
-This is the simplest way to calibrate performance though with limitations.
-The current choice is using 1 MPI with 16 threads on a single node. If you need to change either of these numbers
-or you need to control more hardware behaviors such as thread affinity, please read the next section.
-To activate the ctest route, add the following option in your cmake command line when building your binary.
--D QMC_DATA=YOUR_DATA_FOLDER
-YOUR_DATA_FOLDER contains a folder called NiO with h5 files in it.
-Running tests with command "ctest -R performance-NiO" after your building is complete.
+For reference, the file sizes are:
+
+$ ls -l *.h5
+ 275701688 NiO-fcc-supertwist111-supershift000-S8.h5
+ 545483396 NiO-fcc-supertwist111-supershift000-S16.h5
+1093861616 NiO-fcc-supertwist111-supershift000-S32.h5
+2180300396 NiO-fcc-supertwist111-supershift000-S64.h5
+4375340300 NiO-fcc-supertwist111-supershift000-S128.h5
+8786322376 NiO-fcc-supertwist111-supershift000-S256.h5
 
-3. Running the benchmark manually.
-1) Copy the whole current folder to a work directory (WDIR) you would like to run the benchmark.
+The data files should be placed in a directory named NiO.
+
+4. Throughput metric
+
+A key result that can be extracted from the benchmarks is a
+throughput metric, or the "time to move one walker", measured on a
+per-step basis. One can also compute the "walkers moved per second
+per node", factoring in the available hardware (threads, cores,
+GPUs).
+
+Higher throughput measures are better. Note, however, that the metric
+does not account for the equilibration period of the Monte Carlo run
+or consider the reasonable minimum and maximum numbers of walkers
+usable for a specific scientific calculation. Hence doubling the
+throughput does not automatically halve the time to scientific
+solution, although in many scenarios it will.
+
+5. Benchmarking with ctest
+
+This is the simplest way to calibrate performance, though it has some
+limitations. The current choice uses a fixed 1 MPI task with 16
+threads on a single node for CPU systems. If you need to change
+either of these numbers, or you need to control more hardware
+behaviors such as thread affinity, please read the next section.
+
+To activate the ctest route, add the following options to your cmake
+command line before building your binary:
+
+-DQMC_DATA=YOUR_DATA_FOLDER -DENABLE_TIMERS=1
+
+YOUR_DATA_FOLDER contains a folder called NiO with the h5 files in
+it. Run the tests with the command "ctest -R performance-NiO" after
+building QMCPACK. Add "-VV" to capture the QMCPACK output. Enabling
+the timers is not essential, but it activates fine-grained timers and
+counters useful for analysis, such as the number of times specific
+kernels are called and their speed.
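+
+Putting it together, a complete ctest session might look like the
+following sketch, where /scratch/data is a placeholder path
+containing the NiO folder with the h5 files, and qmcpack/build is the
+usual out-of-source build directory:
+
+$ cd qmcpack/build
+$ cmake -DQMC_DATA=/scratch/data -DENABLE_TIMERS=1 ..
+$ make -j 16
+$ ctest -R performance-NiO -VV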
+
+
+6. Running the benchmarks manually
+
+1) Copy the whole current folder (tests/performance/NiO) to a work
+   directory (WDIR) in which you would like to run the benchmarks.
 2) Copy or softlink all the h5 files to your WDIR.
-3) prepare a job script for submitting a single calculation to the job queue system.
- We provide two samples for CPU (qmcpack-cpu-cetus.sub) and GPU (qmcpack-gpu-cooley.sub) runs at ALCF Cetus and Cooley to give you basic ideas how to run qmcpack manually.
+3) Prepare an example job script for submitting a single calculation
+   to a job queuing system. We provide two samples for CPU
+   (qmcpack-cpu-cetus.sub) and GPU (qmcpack-gpu-cooley.sub) runs at
+   ALCF Cetus and Cooley to give you a basic idea of how to run
+   QMCPACK manually.
+
   a) Customize the header based on your machine.
   b) You always need to point the variable "exe" to the binary that you would like to benchmark.
-  c) "file_prefix" should not be changed and and the run script will update them by pointing to the right size.
-  d) Customize the mpirun based on the job dispatcher on your system and pick the MPI/THREADS as well as other controls you would like to add.
-  *If your system do not have a job queue, remove everything before $exe in that line.
+  c) "file_prefix" should not be changed; the run script updates it
+     to point to the right system size.
+  d) Customize the mpirun invocation based on the job dispatcher on
+     your system, and pick the MPI/THREADS settings as well as any
+     other controls you would like to add.
+
+4) Customize run scripts
-4) Customize run scripts.
-  The run_cpu.sh and run_gpu.sh run scripts provide a basic scan with a single run for each size.
-  These scripts create individual folder for each benchmark run and submit it to the job queue.
-  ATTENTION: the GPU run has default 32 walkers per MPI. You may adjust it in the run_gpu.sh based on your hardware capability.
-  *If your system do not have a job queue, use "subjob=sh" in the run script.
+
+   The files submit_cpu_cetus.sh and submit_gpu_cooley.sh are example
+   job submission scripts that provide a basic scan with a single run
+   for each system size. We suggest making a customized version for
+   your benchmark machines.
+
+   These scripts create individual folders for each benchmark run and
+   submit them to the job queue.
+
+   ATTENTION: the GPU runs use a default of 32 walkers per MPI task.
+   You may adjust this in submit_gpu_cooley.sh based on your hardware
+   capability. Generally, more walkers lead to higher performance.
+
+   If your system does not have a job queue, use "subjob=sh" in the
+   script.
 
-5) Collect performance results.
-  A simple performance metric can be the time per block which reflects fast walkers are advancing.
-  It can be measured with qmca, an analysis tool shipped with QMCPACK.
+5) Collect performance results
+
+   A simple performance metric is the time per block, which reflects
+   how fast the walkers are advancing. It can be measured with qmca,
+   an analysis tool shipped with QMCPACK.
+
+   In your WDIR, use
+
 qmca -q bc -e 0 dmc*/*.scalar.dat
+
+   to collect the timings for all the runs, or, in each subfolder,
+   type
+
 qmca -q bc -e 0 *.scalar.dat
 
-  The current benchmarks contains 3 run sections.
+   The current benchmarks contain three run sections:
+
 I) VMC + no drift
 II) VMC + drift
 III) DMC with constant population
-  So three timing are given per run.
+
+   So three timings are given per run. Timing information is also
+   included in the standard output of QMCPACK and in the *.info.xml
+   file produced by each run. In the standard output, "QMC Execution
+   time" is the time per run section, e.g. all blocks of VMC with
+   drift, while the fine-grained timing information is printed at the
+   end.
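+
+   The block timings can be converted into the throughput metric of
+   section 4 by dividing the walker steps per block by the time per
+   block. A minimal sketch, assuming 256 walkers, 5 steps per block,
+   and a measured 12.5 s per block (all illustrative numbers):
+
+   $ echo "256 5 12.5" | awk '{printf "%.1f walkers moved/s\n", $1*$2/$3}'
+   102.4 walkers moved/s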
 
-Please ask in QMCPACK google group if you have any questions.
+Please ask in the QMCPACK Google group if you have any questions:
 
 https://groups.google.com/forum/#!forum/qmcpack
diff --git a/tests/performance/NiO/run_cpu.sh b/tests/performance/NiO/submit_cpu_cetus.sh
similarity index 100%
rename from tests/performance/NiO/run_cpu.sh
rename to tests/performance/NiO/submit_cpu_cetus.sh
diff --git a/tests/performance/NiO/run_gpu.sh b/tests/performance/NiO/submit_gpu_cooley.sh
similarity index 100%
rename from tests/performance/NiO/run_gpu.sh
rename to tests/performance/NiO/submit_gpu_cooley.sh