Quantum ESPRESSO GPU
====================

[![License: GPL v2](https://img.shields.io/badge/License-GPL%20v2-blue.svg)](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)

This repository contains the up-to-date, GPU-accelerated version of Quantum ESPRESSO.

Installation
============

This version is tested against PGI compilers v. >= 17.4. The configure
script checks for the presence of a PGI compiler and of a few CUDA libraries.
For this reason, a path pointing to the CUDA Toolkit must be present in
`LD_LIBRARY_PATH`.

A template for the configure command is:

```
./configure --with-cuda=XX --with-cuda-runtime=YY --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]
```

where `XX` is the location of the CUDA Toolkit (in HPC environments this is
generally `$CUDA_HOME`), `YY` is the version of the CUDA Toolkit and `ZZ`
is the compute capability of the card.
If you do not know these values, you can try the automatic tool
`get_device_props.py` in `dev-tools`. An example using Slurm is:

```
$ module load cuda
$ cd dev-tools
$ salloc -n1 -t1
[...]
salloc: Granted job allocation xxxx
$ srun python get_device_props.py
[...]
Compute capabilities for dev 0: 6.0
Compute capabilities for dev 1: 6.0
Compute capabilities for dev 2: 6.0
Compute capabilities for dev 3: 6.0

If all compute capabilities match, configure QE with:
./configure --with-cuda=$CUDA_HOME --with-cuda-cc=60 --with-cuda-runtime=9.2
```

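
The compute-capability step above can be scripted. The following is a minimal sketch, assuming that the output of `get_device_props.py` has the format shown in the example and that `--with-cuda-cc` takes the capability with the dot removed (6.0 becomes 60, as above). The helper names `cc_to_flag` and `common_cc_flag` are hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with QE.

# Turn a compute capability such as "6.0" into the form used in the
# --with-cuda-cc example above ("60") by dropping the dot.
cc_to_flag() {
    printf '%s\n' "$1" | tr -d '.'
}

# Read get_device_props.py-style output on stdin. Print the value for
# --with-cuda-cc if every device reports the same capability; return a
# non-zero status if no capability is found or if the devices differ.
common_cc_flag() {
    caps=$(sed -n 's/^[[:space:]]*Compute capabilities for dev [0-9]*:[[:space:]]*//p' | sort -u)
    [ -n "$caps" ] || return 1
    # After "sort -u", more than one line means the devices do not match.
    [ "$(printf '%s\n' "$caps" | grep -c .)" -eq 1 ] || return 1
    cc_to_flag "$caps"
}
```

With these definitions loaded, `srun python get_device_props.py | common_cc_flag` could supply the value of `ZZ` inside an allocation like the one shown above.
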
It is generally a good idea to disable ScaLAPACK when running small test
cases, since the serial GPU eigensolver can outperform the parallel CPU
eigensolver in many circumstances.

From time to time PGI links to the wrong CUDA libraries and fails, reporting
a problem in `cusolver` about a missing `GOmp` (GNU OpenMP). The solution to this
problem is to remove the CUDA Toolkit from `LD_LIBRARY_PATH` before compiling.

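
The workaround above can be applied to a single build without changing the environment permanently. This is a sketch under the assumption that the CUDA Toolkit directories can be recognised by the substring `cuda` in their path, which may not hold on every system. `strip_path_entries` is a hypothetical helper, not something shipped with QE.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with QE.

# Print the colon-separated list given as $1 with every entry that
# contains the fixed string $2 removed (case-insensitive match).
strip_path_entries() {
    printf '%s' "$1" | tr ':' '\n' | grep -v -i -F -- "$2" | paste -s -d ':' -
}
```

One possible use is `LD_LIBRARY_PATH=$(strip_path_entries "$LD_LIBRARY_PATH" cuda) make pw`, which alters the variable only for that command and leaves the calling shell untouched.
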
Serial compilation is also supported.

Active branches
===============

The currently active branches are:

* gpu_develop
* gpu_exx
* gpu_forces

These branches are aligned with the develop branch of `QEF/q-e`.

Execution
=========

By default, GPU support is active. The following message will appear at
the beginning of the output:

```
GPU acceleration is ACTIVE.
```

GPU acceleration can be switched off by setting the following environment
variable:

```
$ export USEGPU=no
```

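
When runs are scripted, it can be convenient to check which mode was actually used. The sketch below relies only on the banner quoted above. The helper name `gpu_was_active` is hypothetical, and it is an assumption that the banner is absent from the output when `USEGPU=no` is set.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with QE.

# Succeed if the output file given as $1 contains the GPU banner.
gpu_was_active() {
    grep -q -F 'GPU acceleration is ACTIVE.' "$1"
}
```

For example, `gpu_was_active pw.out || echo "run used the CPU path"` could be added to a job script, where `pw.out` stands for whatever file the output of `pw.x` was redirected to.
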
Testing
=======

The current GPU version passes all 186 tests with both parallel and serial
compilation. The test suite should only be used to check the correctness of `pw.x`;
therefore, only `make run-tests-pw-parallel` and `make run-tests-pw-serial`
should be used.

Naming conventions
==================

* Variables allocated on the device must end with `_d`.
* Subroutines and functions replicating an algorithm on the GPU must end with `_gpu`.
* Modules must end with `_gpum`.
* Files with duplicated source code must end with `_gpu.f90`.

Porting functionalities
=======================

PW functionalities are ported to the GPU by duplicating the subroutines and
functions that operate on CPU variables.
The number of arguments should not change, but input and output data may
refer to device variables when applicable.

Bifurcations in code flow happen at runtime with constructs similar to:

```
use control_flags, only : use_gpu
[...]
if (use_gpu) then
   ! GPU path: same argument list, data refers to device variables
   call subroutine_gpu(arg_d)
else
   ! CPU path: original routine
   call subroutine(arg)
end if
```

At each bifurcation point it should be possible to remove the call to the
accelerated routine without breaking the code. Note, however, that calling
both the CPU and the GPU version of a subroutine in the same place may
break code execution.

Memory management
=================

[ DISCLAIMER STARTS ]
What is described below is not the method that will be integrated
in the final release. Nonetheless, it happens to be a good approach for:

1) simplifying the alignment of this fork with the main repository,
2) debugging,
3) tracing the evolution of memory paths as the CPU version evolves,
4) (in the future) reporting on the set of global variables that should be
kept to guarantee a certain speedup.

For example, this simplified the integration of the changes that took
place to modernize the I/O.
[ DISCLAIMER ENDS ]

Global GPU data are tightly linked to global CPU data. One cannot allocate
global variables on the GPU manually. The global GPU variables follow the
allocation and deallocation of the CPU ones. This is an automatic mechanism
enforced by the managed memory system. In what follows, I will refer to
a duplicated GPU variable as the "duplicated variable" and to the equivalent
CPU variable as the "parent variable".

Global variables in modules are synchronized through calls to subroutines
named `using_xxx` and `using_xxx_d`, with `xxx` being the name of the module
variable that is accessed globally by multiple subroutines.
These subroutines accept one argument that replicates the role of the `intent`
attribute.

Acceptable values are:
```
0: variable will only be read (equal to intent in)
1: variable will be read and written (equal to intent inout)
2: variable will be only (entirely) updated (equal to intent out).
```

Function and subroutine calls that have global variables among their arguments
should be guarded by calls to `using_xxx` with the appropriate argument.
Calls with argument 0 or 1 must always be placed before the guarded call.

The actual allocation of a duplicated variable happens when `using_xxx_d`
is called and the parent variable is allocated.
Deallocation happens when `using_xxx_d(2)` is called and the parent variable
is not allocated.
Data synchronization (done with synchronous copies, i.e. overloaded `cudaMemcpy`)
happens when either the CPU or the GPU memory is found to be flagged
"out of date" by a previous call to `using_xxx(1)`, `using_xxx(2)`,
`using_xxx_d(1)` or `using_xxx_d(2)`.

Calls to `using_xxx_d` should only happen in GPU functions and subroutines.
An exception can be made if the call is protected by ifdefs.
This is useful if you are lazy and a global variable is updated only a few times.
An example is the g vectors, which are set in a few places (at
initialization, after a scaling of the Hamiltonian, etc.) and are used
everywhere in the code.

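
A rough way to find calls that may need such a guard is to search for `using_xxx_d` calls in files that are not GPU duplicates. This is only a sketch: it assumes that GPU routines live in files following the `_gpu.f90` naming convention above, it does not check whether a call is already wrapped in an ifdef, and the helper name `find_using_d_outside_gpu_files` is hypothetical. Its output is a list of candidates for manual review, nothing more.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with QE.

# Print every file among the arguments that contains a call to a
# using_xxx_d routine but is not named *_gpu.f90.
find_using_d_outside_gpu_files() {
    for f in "$@"; do
        case "$f" in
            *_gpu.f90) continue ;;   # GPU duplicate: such calls are expected here
        esac
        # Fortran is case-insensitive, hence -i.
        if grep -q -i -E 'call[[:space:]]+using_[a-z0-9_]+_d[[:space:]]*\(' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

It could be run from the top of the source tree as `find_using_d_outside_gpu_files PW/src/*.f90`.
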
Finally, there are global variables that are only updated by subroutines
residing inside the same module. The allocation and the update of the
duplicated counterpart then become trivial and are simply done at the same time
as for the CPU variable. At the time of writing this constitutes an exception
to the general rule, but it is actually the result of the efforts made in
the last year to modularize the code and is probably the correct method
to deal with duplicated data in the code.