\chapter{Running QMCPACK}
\label{chap:running}
QMCPACK requires at least one XML input file and is invoked via:
{\texttt{qmcpack [command line options] <XML input file(s)>}}
\section{Command line options}
\label{sec:commandline}
QMCPACK offers several command-line options that affect how calculations
are performed. If a flag is absent, the corresponding option is disabled.
An example invocation using some of these options is shown after the list.
\begin{description}
\item[\texttt{-{}-dryrun}]{ Validate the input file without performing the simulation.
This is a good way to ensure that QMCPACK will do what you think it will. }
\item[\texttt{-{}-enable-timers=none|coarse|medium|fine}]{ Control the timer granularity
when the build option \texttt{ENABLE\_TIMER} is enabled. }
\item[\texttt{-{}-help}]{ Print version information as well as a list of optional
command-line arguments. }
\item[\texttt{-{}-noprint}]{ Do not print extra information on Jastrow factors or pseudopotentials.
If this flag is not present, QMCPACK creates several \texttt{.dat} files
that contain information about the pseudopotentials (one file per pseudopotential) and Jastrow
factors (one file per Jastrow factor). These files may be useful, for example, for visual
inspection of the Jastrow factors. }
\item[\texttt{-{}-save\_wfs}]{ Write a \texttt{.h5} file containing the real-space B-spline
coefficients of the single-particle wave functions. See section~\ref{sec:spo_spline}
for more information.}
\item[\texttt{-{}-vacuum X}]{ Removed. Use the \texttt{vacuum} input tag described in chapter~\ref{chap:simulationcell}. }
\item[\texttt{-{}-verbosity=low|high|debug}]{ Control the output verbosity. }
\item[\texttt{-{}-version}]{ Print version information and optional arguments.
Same as \texttt{-{}-help}. }
\end{description}
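
For example, an invocation enabling coarse timers and high verbosity might look like the
following; the input file name \texttt{qmc.xml} is only a placeholder:
\begin{verbatim}
qmcpack --enable-timers=coarse --verbosity=high qmc.xml
\end{verbatim}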
\section{Input files}
\label{sec:inputs}
The input is one or more XML file(s), documented in chapter~\ref{chap:input_overview}.
\section{Output files}
QMCPACK generates multiple files, documented in chapter~\ref{chap:output_overview}.
\section{Running in parallel}
\label{sec:parallelrunning}
%considerations for mpi, threads, gpu.
\subsection{MPI}
QMCPACK is fully parallelized with MPI. When performing an ensemble job, all
the MPI ranks are first equally divided into groups that perform individual
QMC calculations. Within one calculation, all the walkers are fully distributed
across the MPI ranks in the group. Since MPI uses distributed memory,
there must be at least one MPI rank per node. To maximize efficiency, a few
additional considerations apply. When using MPI+threads on compute nodes with more
than one NUMA domain (e.g., the AMD Interlagos CPUs on Titan or a node with multiple
CPU sockets), it is recommended to place one MPI rank per NUMA domain
if the memory is sufficient. On clusters with more than one
GPU per node (e.g., NVIDIA Tesla K80), use the same number of MPI
ranks as GPUs per node so that each MPI rank is assigned one GPU.
An example launch line is sketched below.
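
As an illustration only: on a hypothetical two-socket (two NUMA domain) node, one MPI rank
can be placed on each socket. The launcher, the Open MPI-style placement flag, and the input
file name below are assumptions; the corresponding options differ between \texttt{mpirun},
\texttt{srun}, and \texttt{aprun}.
\begin{verbatim}
# 4 nodes, 2 MPI ranks per node (one per NUMA domain), 8 ranks in total
mpirun -np 8 --map-by ppr:2:node qmcpack qmc.xml
\end{verbatim}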
\subsection{Use of OpenMP threads}
\label{sec:openmprunning}
Modern processors integrate multiple identical cores, often with hardware threads,
on a single die to increase the total performance while maintaining a reasonable
power draw. QMCPACK takes advantage of this compute capability by using threads
via the OpenMP programming model as well as threaded linear algebra libraries.
By default, QMCPACK is always built with OpenMP enabled. When launching calculations,
users should instruct QMCPACK to create the desired number of threads per MPI rank
by setting the environment variable \texttt{OMP\_NUM\_THREADS}, as in the example below.
Even in the GPU-accelerated version, using threads significantly reduces the time spent
on the calculations performed by the CPU.
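
For example, to run with 16 OpenMP threads per MPI rank (the \texttt{mpirun} launcher and
the input file name are placeholders; use whatever is appropriate on your system):
\begin{verbatim}
export OMP_NUM_THREADS=16
mpirun -np 4 qmcpack qmc.xml
\end{verbatim}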
\subsubsection{Performance consideration}
\label{sec:cpu:performance}
As walkers are the basic units of workload in QMC algorithms, they are loosely coupled and distributed across all the threads. For this reason, the best strategy to run QMCPACK efficiently is to provide enough walkers for the available threads.
In a VMC calculation, the code automatically raises the actual number of walkers per MPI rank to the number of available threads if the user-specified number of walkers is smaller; see ``walkers/mpi=XXX'' in the VMC output.
In a DMC calculation, the target number of walkers should be chosen slightly smaller than a multiple of the total number of available threads across all the MPI ranks belonging to the calculation. Since the walker population fluctuates from generation to generation, its instantaneous value then stays slightly below or equal to that multiple most of the time; a worked example follows.
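
As a purely illustrative example: with 32 MPI ranks and 16 threads per rank, a calculation
has $32\times16=512$ threads in total. A target population of 2040 walkers, slightly below
the multiple $4\times512=2048$, keeps close to four walkers per thread while leaving
headroom for population fluctuations.
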
To achieve better performance, a mixed-precision version (experimental) has been introduced for the CPU code. The mixed-precision CPU code is more aggressive than the GPU version in using single-precision (SP) operations. The current implementation uses SP for most calculations, except for matrix inversions and reductions, where double precision is required to retain high accuracy. All the constant spline data in the wavefunction, pseudopotentials, and Coulomb potentials are initialized in double precision and later stored in single precision. The mixed-precision code is as accurate as the double-precision code up to a certain system size. Cross checking and verification of accuracy are encouraged for systems with more than approximately 1500 electrons. The mixed-precision code has been tested on solids with real/complex builds with VMC, VMC using drift, and DMC runs with wavefunctions including a single Slater determinant and one- and two-body Jastrow factors. Wavefunction optimization will be fixed in a following update.
\subsubsection{Memory consideration}
When using threads, some memory objects are shared by all the threads. These objects are usually read-only while the walkers are evolving, for instance the ionic distance table and the wavefunction coefficients.
If a wavefunction is represented by B-splines, the whole coefficient table is shared by all the threads. It usually takes a large amount of memory when a large primitive cell is used in the simulation. Its actual size is reported as ``MEMORY increase XXX MB BsplineSetReader'' in the output file.
See section~\ref{sec:spo_spline} for details on how to reduce it.
The other memory objects, which are distinct for each walker during the random walk, must be associated with individual walkers and cannot be shared. This part of the memory grows linearly with the number of walkers per MPI rank. These objects include wavefunction values (Slater determinants) at the given electronic configurations and electron-related distance tables (the electron-electron distance table). These matrices dominate the $N^2$ scaling of the memory usage per walker.
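
As a rough, hedged estimate of this scaling: the electron-electron distance table alone is an
$N\times N$ object, occupying $8N^2$ bytes in double precision, i.e., about 8\,MB for
$N=1000$ electrons. With several per-walker objects of comparable size (determinant matrices
and their inverses, distance tables), the memory per walker can reach tens of megabytes for
systems of roughly a thousand electrons.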
\subsection{Running on GPU machines}
\label{sec:gpurunning}
The GPU version on the NVIDIA CUDA platform is fully incorporated into
the main source code. Commonly used functionalities for
solid-state and molecular systems using B-spline single-particle
orbitals are supported. Use of Gaussian basis sets or three-body
Jastrow functions is not yet supported. A detailed description of the GPU
implementation can be found in Ref. \cite{EslerKimCeperleyShulenburger2012}.
The current GPU implementation assumes one MPI process per GPU. To use
nodes with multiple GPUs, use multiple MPI processes per node.
Vectorization is achieved over walkers, that is, all walkers are
propagated in parallel. In each GPU kernel, loops over electrons,
atomic cores or orbitals are further vectorized to exploit an
additional level of parallelism and to allow coalesced memory access.
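
Following the one-MPI-process-per-GPU rule above, a node with 4 GPUs would be given 4 MPI
ranks. The launcher, thread count, and input file name below are placeholders, and GPU binding
options are system dependent.
\begin{verbatim}
# one MPI rank per GPU: 4 ranks on a node with 4 GPUs
export OMP_NUM_THREADS=8
mpirun -np 4 qmcpack qmc.xml
\end{verbatim}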
%----------------------------------------------------------------------------%
\subsubsection{Performance consideration}
\label{sec:gpu:performance}
The relative speedup of the GPU implementation increases with both the number of electrons and the number of walkers running on a GPU. Typically, 128--256 walkers per GPU provide a sufficient number of GPU threads to operate the GPU efficiently and to hide memory-access latency.
To achieve better performance, the current implementation uses single-precision operations for most GPU calculations, except for matrix inversions and the Coulomb interaction, where double precision is required to retain high accuracy. The mixed-precision GPU code is as accurate as the double-precision CPU code up to a certain system size. Cross checking and verification of accuracy are encouraged for systems with more than approximately 1500 electrons.
%------------------------------------------------------------------------------%
\subsubsection{Memory consideration}
In the GPU implementation, each walker has an anonymous buffer on the GPU's global memory to store temporary data associated with the wavefunctions. Therefore, the amount of memory available on a GPU limits the number of walkers and eventually the system size that it can process.
If the GPU memory is exhausted, reduce the number of walkers per GPU.
Coarsening the grid of the B-spline representation (by decreasing the value of \texttt{meshfactor} in the input file) can also lower the memory usage,
at the risk of obtaining inaccurate results. Proceed with caution if this option has to be considered.
It is also possible to distribute the B-spline coefficient table between the host and GPU memory; see the option \texttt{Spline\_Size\_Limit\_MB} in section~\ref{sec:spo_spline}.