[skip-CI] CP user guide updated

This commit is contained in:
giannozz 2021-09-02 18:44:12 +00:00
parent c8919b9124
commit 6b9d6677b5
4 changed files with 151 additions and 262 deletions

View File

@ -3,42 +3,36 @@
Introduction
============

This guide covers the usage of the `CP` package, version 6.8, a core
component of the Quantum ESPRESSO distribution. Further documentation,
beyond what is provided in this guide, can be found in the directory
`CPV/Doc/`, containing a copy of this guide.

This guide assumes that you know the physics that `CP` describes and the
methods it implements. It also assumes that you have already installed,
or know how to install, Quantum ESPRESSO. If not, please read the
general User's Guide for Quantum ESPRESSO, found in directory `Doc/` two
levels above the one containing this guide; or consult the web site:
`http://www.quantum-espresso.org`.

People who want to modify or contribute to `CP` should read the
Developer Manual: `https://gitlab.com/QEF/q-e/-/wikis/home`.

`CP` can perform Car-Parrinello molecular dynamics, including
variable-cell dynamics. The `CP` package is based on the original code
written by Roberto Car and Michele Parrinello. `CP` was developed by
Alfredo Pasquarello (EPF Lausanne), Kari Laasonen (Oulu), Andrea Trave,
Roberto Car (Princeton), Nicola Marzari (EPF Lausanne), Paolo Giannozzi,
and others. FPMD, later merged with `CP`, was developed by Carlo
Cavazzoni (Leonardo), Gerardo Ballabio (CINECA), Sandro Scandolo (ICTP),
Guido Chiarotti, Paolo Focher, and others. We quote in particular:

- Federico Grasselli and Riccardo Bertossa (SISSA) for bug fixes and
  extensions to Autopilot;
- Biswajit Santra, Hsin-Yu Ko, Marcus Calegari Andrade (Princeton) for
  various contributions, notably the SCAN functional;
- Robert DiStasio (Cornell), Biswajit Santra, and Hsin-Yu Ko for
  hybrid functionals with MLWF (maximally localized Wannier
@ -50,21 +44,21 @@ others. We quote in particular:
- Paolo Umari (Univ. Padua) for finite electric fields and conjugate
  gradients;
- Paolo Umari and Ismaila Dabo (Penn State) for ensemble-DFT;
- Xiaofei Wang (Princeton) for META-GGA;
- The Autopilot feature was implemented by Targacept, Inc.

The original version of this guide was mostly written by Gerardo Ballabio
and Carlo Cavazzoni.

`CP` is free software, released under the GNU General Public License.\
See `http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt`, or the file
`License` in the distribution.

We would greatly appreciate it if scientific work done using the Quantum
ESPRESSO distribution contained an acknowledgment of the following
references:

> P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni,
@ -95,22 +89,22 @@ Users of the GPU-enabled version should also cite the following paper:
> Ferretti, N. Marzari, I. Timrov, A. Urru, S. Baroni, J. Chem. Phys.
> 152, 154105 (2020)

Note the form `Quantum ESPRESSO` (in small caps) for textual citations
of the code. Please also see other package-specific documentation for
further recommended citations. Pseudopotentials should be cited as
(for instance)

> \[ \] We used the pseudopotentials C.pbe-rrjkus.UPF and O.pbe-vbc.UPF
> from `http://www.quantum-espresso.org`.

Compilation
===========

`CP` is included in the core Quantum ESPRESSO distribution. Instructions
on how to install it can be found in the general documentation (User's
Guide) for Quantum ESPRESSO.

Typing `make cp` from the main Quantum ESPRESSO directory or `make` from
the `CPV/` subdirectory produces the following codes in `CPV/src`:

- `cp.x`: Car-Parrinello Molecular Dynamics code
@ -125,52 +119,44 @@ Symlinks to executable programs will be placed in the `bin/`
subdirectory.

As a final check that compilation was successful, you may want to run
some or all of the tests and examples. Automated tests for `cp.x` are in
directory `test-suite/` and can be run via the `Makefile` found there.
Please see the general User's Guide for their setup.
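
In summary, a typical build-and-check sequence might look like this (a
sketch only, assuming you start from the top-level Quantum ESPRESSO
source directory and that `./configure` succeeds with default options):

```
./configure        # add machine-specific options if needed
make cp            # builds the CP executables in CPV/src
ls bin/            # the cp.x symlink should now be present
```
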
You may take the tests and examples distributed with `CP` as templates
for writing your own input files. Input files for tests are contained in
subdirectories `test-suite/cp_*` with file type `*.in1`, `*.in2`, \... .
Input files for examples are produced, if you run the examples, in the
`results/` subdirectories, with names ending with `.in`.

For general information on parallelism and how to run in parallel
execution, please see the general User's Guide. `CP` can currently take
advantage of both MPI and OpenMP parallelization, as well as of GPU
acceleration. The "plane-wave", "linear-algebra" and "task-group"
parallelization levels are implemented.

Input data
==========

Input data for `cp.x` is organized into several namelists, followed by
other fields ("cards") introduced by keywords. The namelists are:

> &CONTROL: general variables controlling the run\
> &SYSTEM: structural information on the system under investigation\
> &ELECTRONS: electronic variables, electron dynamics\
> &IONS: ionic variables, ionic dynamics\
> &CELL (optional): variable-cell dynamics

The `&CELL` namelist may be omitted for fixed-cell calculations. This
depends on the value of variable `calculation` in namelist &CONTROL.

Most variables in namelists have default values. Only the following
variables in &SYSTEM must always be specified:

> `ibrav` (integer) Bravais-lattice index\
> `celldm` (real, dimension 6) crystallographic constants\
> `nat` (integer) number of atoms in the unit cell\
> `ntyp` (integer) number of types of atoms in the unit cell\
> `ecutwfc` (real) kinetic energy cutoff (Ry) for wavefunctions
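
For orientation, a minimal skeleton combining these mandatory `&SYSTEM`
variables with the namelist structure described above could look like
the following (all values, names and directories are placeholders, not a
validated input):

```
&CONTROL
  calculation  = 'cp'
  restart_mode = 'from_scratch'
  prefix       = 'mysys'       ! placeholder prefix
  outdir       = './out/'      ! placeholder scratch directory
  pseudo_dir   = './pseudo/'   ! placeholder pseudopotential directory
/
&SYSTEM
  ibrav     = 1                ! simple cubic cell
  celldm(1) = 12.0             ! lattice parameter in Bohr (placeholder)
  nat       = 3
  ntyp      = 2
  ecutwfc   = 25.0             ! wavefunction cutoff in Ry (placeholder)
/
&ELECTRONS
  electron_dynamics = 'sd'     ! see "Reaching the electronic ground state"
/
&IONS
  ion_dynamics = 'none'
/
```

The cards described below (`ATOMIC_SPECIES`, `ATOMIC_POSITIONS`, \...)
would follow the namelists.
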
Explanations for the meaning of variables `ibrav` and `celldm`, as well
as on alternative ways to input structural data, are contained in files
@ -178,34 +164,31 @@ as on alternative ways to input structural data, are contained in files
describe a large number of other variables as well. Almost all variables
have default values, which may or may not fit your needs.

After the namelists, you have several fields ("cards") introduced by
keywords with self-explanatory names:

> ATOMIC\_SPECIES\
> ATOMIC\_POSITIONS\
> CELL\_PARAMETERS (optional)\
> OCCUPATIONS (optional)

The keywords may be followed on the same line by an option. Unknown
fields are ignored. See the files mentioned above for details on the
available "cards".

Comment lines in namelists can be introduced by a \"!\", exactly as in
Fortran code. Comment lines in "cards" can be introduced by either a "!"
or a "\#" character in the first position of a line.
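
As an illustration only (the species, masses, pseudopotential file names
and coordinates below are made up), a pair of cards with comment lines
could look like:

```
ATOMIC_SPECIES
! label  mass (a.m.u.)  pseudopotential file (placeholder names)
 O  16.0  O.pbe-rrkjus.UPF
 H   2.0  H.pbe-rrkjus.UPF
ATOMIC_POSITIONS (bohr)
# the coordinates below are arbitrary placeholders
 O   0.00   0.00   0.00
 H   1.80   0.00   0.00
 H  -0.45   1.75   0.00
```
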
Data files
----------

The output data files are written in the directory specified by variable
`outdir`, with names specified by variable `prefix` (a string that is
prepended to all file names, whose default value is `prefix=cp_$ndw`,
where `ndw` is an integer specified in input). In order to use the data
on a different machine, you may need to compile `CP` with HDF5 enabled.

The execution stops if you create a file `prefix.EXIT` either in the
working directory (i.e. where the program is executed), or in the
@ -215,58 +198,13 @@ this procedure is that all files are properly closed, whereas just
killing the process may leave data and output files in an unusable
state.
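
For example (a sketch, assuming `prefix='cp_run'` and that the file is
created in the working directory of the running job):

```
# request a clean stop of a running cp.x job
touch cp_run.EXIT
```
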
The format of arrays containing charge density, potential, etc., is
described in the Developer Manual.

Output files
============

The `cp.x` code produces many output files, which together build up the
trajectory. The positions are written to a file called `prefix.pos`,
where `prefix` is defined in the input file, formatted as follows:
@ -280,35 +218,40 @@ the input file, that is formatted like:
0.42395189282719E+01 0.55766875434652E+01 0.31291744042209E+01
0.45445534106843E+01 0.36049553522533E+01 0.55864387532281E+01

where the first line is a header containing the step number and the
elapsed time, in ps, at that step; the following lines contain the
positions, in Bohr radii, of all the atoms (3 in this example), in the
same order as in the input file (since v6.6 -- previously, atoms were
sorted by type; the type must be deduced from the input file). The same
structure is repeated for the second step and so on. The printout is
made every `iprint` steps (10 in this case, so at steps 10, 20, etc.).
Note that the atomic coordinates are not wrapped into the simulation
cell, so they may lie outside it.
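
As a quick sanity check (a sketch, assuming `prefix='cp_run'`), the
number of frames stored in the trajectory can be counted by selecting
the two-field header lines:

```
# header lines have two fields (step, time); coordinate lines have three
awk 'NF==2' cp_run.pos | wc -l
```
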
The velocities are written to a similar file named `prefix.vel`, where
`prefix` is defined in the input file, formatted like the `.pos` file.
The units are the usual Hartree atomic units (note that the velocities
in the `pw.x` code are in _Rydberg_ a.u. and differ by a factor of 2).

The `prefix.for` file, formatted like the previous two, contains the
computed forces, in Hartree atomic units as well. It is written only if
a molecular dynamics calculation is performed, or if `tprnfor = .true.`
is set in the input.

The file `prefix.evp` has one line per printed step and contains some
thermodynamic data. The first line of the file names the columns:

```
# nfi time(ps) ekinc Tcell(K) Tion(K) etot enthal econs econt Volume Pressure(GPa)
```

where:

- `ekinc` is the fictitious kinetic energy of the electrons, $`K_{ELECTRONS}`$
- `enthal` is the enthalpy, $`E_{DFT}+PV`$
- `etot` is the DFT (potential) energy of the system, $`E_{DFT}`$
- `econs` is $`E_{DFT} + K_{NUCLEI}`$, a physically meaningful constant of
  motion in the limit of zero electronic fictitious mass
- `econt` is the constant of motion of the Lagrangian,
  $`E_{DFT} + K_{IONS} + K_{ELECTRONS}`$. If the time step `dt` is small
  enough, this is conserved to very good precision. It is not a physical
  quantity, since $`K_{ELECTRONS}`$ has _nothing_ to do with the quantum
  kinetic energy of the electrons.
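
As a usage sketch (assuming `prefix='cp_run'` and the column layout
shown above), the ionic temperature versus time can be extracted with:

```
# column 2 = time(ps), column 5 = Tion(K); skip the header line
awk '!/^#/ {print $2, $5}' cp_run.evp > Tion_vs_time.dat
```
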
Using `CP`
@ -317,7 +260,7 @@ Using `CP`
It is important to understand that a CP simulation is a sequence of
different runs, some of them used to \"prepare\" the initial state of
the system, and others performed to collect statistics, or to modify
the state of the system itself, i.e. to modify the temperature or the
pressure.

To prepare and run a CP simulation you should first of all define the
system:
@ -393,8 +336,7 @@ An example of input file (Benzene Molecule):
H -2.2 2.2 0.0
H 2.2 2.2 0.0

You can find the description of the input variables in file `Doc/INPUT_CP.*`.

Reaching the electronic ground state
------------------------------------
@ -403,7 +345,7 @@ The first run, when starting from scratch, is always an electronic
minimization, with fixed ions and cell, to bring the electronic system
on the ground state (GS) relative to the starting atomic configuration.
This step is conceptually very similar to self-consistency in a
`pw.x` run.

Sometimes a single run is not enough to reach the GS. In this case, you
need to re-run the electronic minimization stage. Use the input of the
@ -428,14 +370,12 @@ $`< 10^{-5}`$. You could check the value of the fictitious kinetic energy
on the standard output (column EKINC).

Different strategies are available to minimize electrons, but the most
frequently used is _damped dynamics_: set `electron_dynamics = 'damp'`
and `electron_damping` to a number typically ranging from 0.1 to 0.5.
See the input description to compute the optimal damping factor.
Steepest descent (`electron_dynamics = 'sd'`) is also available, but it
is typically slower than damped dynamics and should be used only to
start the minimization.
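
For instance, a minimal `&ELECTRONS` fragment for damped-dynamics
minimization might read (the damping value is only an illustrative
choice within the range quoted above):

```
&ELECTRONS
  electron_dynamics = 'damp'  ! damped dynamics for the electronic minimization
  electron_damping  = 0.2     ! illustrative value in the typical 0.1-0.5 range
/
```
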
Relax the system
----------------
@ -860,14 +800,6 @@ ranges between 4 and 7.
All the other parameters have the same meaning in the usual `CP` input,
and they are discussed above.

### Treatment of USPPs

The cutoff `ecutrho` defines the resolution on the real space FFT mesh
@ -1030,99 +962,62 @@ An example input is listed as following:
O 16.0D0 O_HSCV_PBE-1.0.UPF
H 2.0D0 H_HSCV_PBE-1.0.UPF

Parallel Performances
=====================

`cp.x` can run in principle on any number of processors. The
effectiveness of parallelization is ultimately judged by the "scaling",
i.e. how the time needed to perform a job scales with the number of
processors. Ideally one would like to have linear scaling, i.e.
$`T \sim T_0/N_p`$ for $`N_p`$ processors, where $`T_0`$ is the estimated
time for serial execution. In addition, one would like to have linear
scaling of the RAM per processor: $`O_N \sim O_0/N_p`$, so that
large-memory systems fit into the RAM of each processor.

We refer to the "Parallelization" section of the general User's Guide
for a description of the MPI and OpenMP parallelization paradigms, of
the various MPI parallelization levels, and of how to activate them.

A judicious choice of the various levels of parallelization, together
with the availability of suitable hardware (e.g. fast communications),
is fundamental to reach good performances. _VERY IMPORTANT_: for each
system there is an optimal range for the number of processors on which
to run the job. Too large a number of processors or a bad
parallelization style will yield performance degradation.

For `CP` with hybrid functionals, see the related section above this one.
For all other cases, the relevant MPI parallelization levels are:

- "plane waves" (PW);
- "tasks" (activated by command-line option `-nt N`);
- "linear algebra" (`-nd N`);
- "bands" parallelization (`-nb N`), to be used only in special cases;
- "images" parallelization (`-ni N`), used only in code `manycp.x`
  (see the header of `CPV/src/manycp.f90` for documentation).

As a rule of thumb:

- Start with PW parallelization only (e.g. `mpirun -np N cp.x ...` with
  no other parallelization options); the code will scale well unless `N`
  exceeds the third FFT grid dimensions `nr3` and/or `nr3s`.
- To further increase the number of processors, use "task groups",
  typically 4 to 8 (e.g. `mpirun -np N cp.x -nt 8 ...`).
- Alternatively, or in addition, you may compile with OpenMP:
  `./configure --enable-openmp ...`, then `export OMP_NUM_THREADS=n`
  and run on `n` threads (4 to 8 typically).
  _Beware conflicts between MPI and OpenMP threads_! Don't do this
  unless you know what you are doing.
- Finally, the optimal number of processors for \"linear-algebra\"
  parallelization can be found by observing the performances of `ortho`
  in the final time report for different numbers of processors in the
  linear-algebra group (this must be a square integer, not larger than
  the number of processors used for plane-wave parallelization).
  Linear-algebra parallelization distributes $`M\times M`$ matrices,
  with `M` the number of bands, so it may be useful if you are
  memory-constrained.
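
Putting these options together, a combined invocation might look like
the following (a sketch only; the process counts, thread count and file
names are placeholders to be tuned to your machine and system):

```
# 64 MPI tasks, 8 task groups, a 16-process (4x4) linear-algebra group,
# and 4 OpenMP threads per task (requires an OpenMP-enabled build)
export OMP_NUM_THREADS=4
mpirun -np 64 cp.x -nt 8 -nd 16 -inp cp.in > cp.out
```
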
Note: optimal serial performances are achieved when the data are kept in
the cache as much as possible. As a side effect, PW parallelization may
yield superlinear (better than linear) scaling, thanks to the increase in
serial speed coming from the reduction of data size (making it easier for
the machine to keep data in the cache).


View File

@ -1,5 +1,6 @@
New in development version:
* RMM-DIIS for CPU (S. Nisihara) and GPU (E. de Paoli, P. Delugas)
* DFT-D3: MPI parallelization and GPU acceleration with OpenACC

Fixed in development version:
* Some build problems occurring under special circumstances

View File

@ -11,10 +11,10 @@ Installation
This version requires the nvfortran (previously PGI) compiler from the
freely available NVidia HPC SDK. You are advised to use a recent version
of NVidia software. Any version later than 17.4 should work, but many
glitches are known to exist in older versions.

The `configure` script checks for the presence of the nvfortran compiler
and of a few CUDA libraries. For this reason the path pointing to the
CUDA toolkit must be present in `LD_LIBRARY_PATH`.
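
For instance (a sketch; the actual path is installation-specific, and
`$CUDA_HOME` is assumed to be set by your environment or module system):

```
# make the CUDA toolkit libraries visible before running configure
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```
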
A template for the configure command is:
@ -26,7 +26,8 @@ where `XX` is the location of the CUDA Toolkit (in HPC environments is
generally `$CUDA_HOME`), `YY` is the version of the CUDA toolkit and `ZZ`
is the compute capability of the card.

If you have no idea what these numbers are, you may give the automatic
tool `get_device_props.py` a try. Go to directory `dev-tools/` and
run `python get_device_props.py`. An example using Slurm:

```
$ module load cuda
@ -46,12 +47,12 @@ Compute capabilities for dev 3: 6.0
```

It is generally a good idea to disable ScaLAPACK when running small test
cases, since the serial GPU eigensolver outperforms the parallel CPU
eigensolver in many circumstances.

From time to time PGI links to the wrong CUDA libraries and fails,
reporting a problem in `cusolver` missing `GOmp` (GNU OpenMP). This
problem can be solved by removing the CUDA toolkit from the
`LD_LIBRARY_PATH` before compiling.

Serial compilation is also supported.

View File

@ -48,16 +48,8 @@ If you encounter problems when adding the flag `__GPU_MPI` it might
be that the MPI library does not support some CUDA-aware APIs.

Testing
=======

Partial unit testing is available in the `tests` sub-directory. See the
README.md file in that directory for further information.