---
title: Basic parallelization in ABINIT
authors: YP, XG
---

# Tutorial on basic parallelism

## Parallelism in ABINIT, generalities and environments

There are many situations where a sequential code is not enough, often because
it would take too much time to get a result. There are also cases where you
just want things to go as fast as your computational resources allow.
By using more than one processor, you might also have access to more memory than
with a single processor.
To this end, it is possible to use ABINIT in
parallel, with dozens, hundreds or even thousands of processors.

This tutorial offers you a quick guided tour inside the complex world
that emerges as soon as you want to use more than one processor.
From now on, we will suppose that you are already familiar with ABINIT and that you have
gone through all four basic tutorials. If this is not the case, we strongly
advise you to do so, in order to truly benefit from this tutorial.

We also strongly recommend that you acquaint yourself with some basic concepts of
[parallel computing](https://en.wikipedia.org/wiki/Parallel_computing).
In particular, have a look at [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law),
which formalizes the fact that, beyond some number of processors, the
inherently sequential parts of a computation dominate the parallel parts, and thus
limit the maximal speedup that can be achieved.
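
As a reminder (this is a standard result, not specific to ABINIT): if a fraction $p$ of the run
can be parallelized over $N$ processors while the remaining fraction $1-p$ stays sequential,
the speedup is bounded by

$$ S(N) = \frac{1}{(1-p) + p/N}, \qquad \lim_{N\to\infty} S(N) = \frac{1}{1-p}. $$

For instance, with $p = 0.95$ the speedup can never exceed 20, however many processors are used.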

This tutorial describes only the most basic possibilities of parallel computing with ABINIT.
After reading it, you will likely benefit from reading the other tutorials related
to parallelism in ABINIT, as explained later.

## Generalities

With the broad availability of multi-core processors, everybody now has a
parallel machine at hand. ABINIT can take advantage of the
availability of several cores for most of its capabilities, be it ground-state
calculations, molecular dynamics, linear response, many-body perturbation theory, ...

Such tightly integrated multi-core processors (or so-called SMP machines,
for Symmetric Multi-Processing) can be interlinked within networks, based
on Ethernet or other types of connections.
The number of cores in such composite machines can easily exceed one hundred, and
reaches several million on the largest machines these days.
Most ABINIT capabilities can efficiently use several hundred computing cores.
In some cases, even more than ten thousand computing cores can be used efficiently.

Before actually starting this tutorial and the associated ones, we strongly
advise you to get familiar with your own parallel environment. It might be
relatively simple for an SMP machine, but more involved for very powerful
machines. You will need at least MPI (see next section) installed on
your machine. Take some time to determine how to launch a job in parallel
with MPI, what resources are available, and what the limitations are.
You might also have to use a batch system
(typically the `qsub` or `sbatch` command and an associated shell script).
Do not hesitate to
discuss with your system administrator if something is not clear to you.
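
For illustration only, here is a minimal sketch of a batch script for a Slurm-based cluster.
The partition name, module name, number of tasks and the way ABINIT is launched are all
assumptions about a typical installation; your own machine will certainly differ, so adapt
this (or its PBS/`qsub` equivalent) with the help of your system administrator.

```bash
#!/bin/bash
#SBATCH --job-name=abinit_test   # name shown in the queue
#SBATCH --ntasks=4               # number of MPI processes
#SBATCH --time=00:30:00          # wall-clock time limit
#SBATCH --partition=debug        # hypothetical partition name

# Load an MPI-enabled ABINIT installation (module name is an assumption)
module load abinit

# Launch ABINIT in parallel; some sites use "srun" instead of "mpirun"
mpirun -n ${SLURM_NTASKS} abinit tbasepar_1.abi > log 2> err
```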

We will suppose in the following that you know how to run a parallel program
and that you are familiar with the peculiarities of your system.
Please remember that, as there is no standard way of setting up a parallel
environment, we are not able to provide you with support beyond ABINIT itself.

## Characteristics of parallel environments

Different software solutions can be used to benefit from parallelism.
Most of the parallelism in ABINIT is based on MPI, but significant additional speedup (or a better
distribution of data, allowing bigger calculations to be run) can be obtained with OpenMP and multi-threaded libraries.
At the time of writing, efforts also focus on Graphical Processing Units (GPUs), with
CUDA and MAGMA; GPU usage is not described in the present tutorial.

### MPI

MPI stands for Message Passing Interface. The goal of MPI, simply stated, is
to develop a widely used standard for writing message-passing programs.
As such, the interface attempts to establish a practical, portable, efficient, and
flexible standard for message passing.

The main advantages of establishing a message-passing standard are portability
and ease of use. In a distributed-memory communication environment in which
the higher-level routines and/or abstractions are built upon lower-level
message-passing routines, the benefits of standardization are particularly
obvious. Furthermore, the definition of a message-passing standard provides
vendors with a clearly defined base set of routines that they can implement
efficiently, or in some cases provide hardware support for, thereby enhancing
scalability (see <http://mpi-forum.org>).

At some point in its history MPI reached a critical popularity level, and a
bunch of projects popped up like daisies in the grass. The tendency is now
back to gathering and merging. For instance, Open MPI is a project
combining technologies and resources from several other projects
(FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) in order to build the best MPI library available.
Open MPI is a completely new MPI-3.1-compliant implementation, offering
advantages for system and software vendors, application developers and
computer science researchers (see <https://www.open-mpi.org>).

### OpenMP

The OpenMP Application Program Interface (API) supports multi-platform
**shared-memory** parallel programming in C/C++ and Fortran on all
architectures, including Unix platforms and Windows platforms.
Jointly defined by a group of major computer hardware and software vendors, OpenMP is
a portable, scalable model that gives shared-memory parallel programmers a
simple and flexible interface for developing parallel applications for
platforms ranging from the desktop to the supercomputer (<http://www.openmp.org>).

OpenMP was rarely used within ABINIT versions earlier than 8.8, and only for specific purposes.
Versions later than 8.8 benefit from the speedup of multi-threaded libraries such as MKL and FFTW3.
Although still not mandatory, on recent architectures multithreading can show better performance
than MPI (*if and only if* a multithreaded version of the linear-algebra library is provided).
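
As a purely illustrative sketch (assuming an ABINIT executable built with OpenMP support and
linked against multi-threaded libraries), the number of OpenMP threads is usually controlled
through the standard `OMP_NUM_THREADS` environment variable, and threads can be combined with
MPI processes:

```bash
# 4 MPI processes x 2 OpenMP threads each = 8 cores (the numbers are only an example)
export OMP_NUM_THREADS=2
mpirun -n 4 abinit tbasepar_1.abi > log 2> err
```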

### ScaLAPACK

ScaLAPACK is the parallel version of the popular LAPACK library (for linear
algebra). It can play some role in the parallelism of several parts of ABINIT,
especially the LOBPCG algorithm in ground-state calculations,
and the parallelism for the Bethe-Salpeter equation. ScaLAPACK being itself based on MPI, we will not discuss
its use in ABINIT in this tutorial.

!!! warning

    ScaLAPACK is not thread-safe in many versions.
    Combining OpenMP and ScaLAPACK can result in unpredictable behaviour.

### Fast/slow communications

Characterizing the data-transfer efficiency between two computing cores (or
among the whole set of cores) is a complex task. At a quite basic level, one has to
recognize that not only the quantity of data that can be transferred per unit
of time is important, but also the time needed to initialize such a
transfer (the so-called *latency*).

Broadly speaking, one can categorize computers according to the speed of their
communications. On machines with fast communications, the latency is very low
and the transfer time, once initialized, is very low too. For the parallelized
parts of ABINIT, SMP machines and machines with a fast interconnect
will usually not be limited by their network characteristics, but
by the existence of residual sequential parts. The tutorials that have been
developed for ABINIT have been based on fast-communication machines.

If the set of computing cores that you plan to use is not entirely linked
using a fast network, but includes some connections based e.g. on Ethernet,
then you might not be able to benefit from the speed-up announced in the
tutorials. You have to perform some tests on your actual machine to gain
knowledge of it, and perhaps consider using multithreading.

## What parts of ABINIT are parallel?

Parallelizing a code is a very delicate and complicated task, so do not
expect that things will systematically go faster just because you are using
more processors. Please also keep in mind that in some situations
parallelization is simply impossible. At the present time, the parts of ABINIT
that have been parallelized, and for which a tutorial is available, include:

* [parallelism over bands and plane waves](/tutorial/paral_bandpw),
* [ground state with wavelets](/tutorial/paral_gswvl),
* [molecular dynamics](/tutorial/paral_moldyn),
* [parallelism on "images"](/tutorial/paral_images),
* [density-functional perturbation theory (DFPT)](/tutorial/paral_dfpt),
* [Many-Body Perturbation Theory](/tutorial/paral_mbt).

Note that the tutorial on [parallelism over bands and plane waves](/tutorial/paral_bandpw) presents a complete overview of the
parallelism for the ground state, including up to four levels of parallelisation and, as such, is rather complex.
Of course, it is also quite powerful, and allows one to use several hundred processors.

By contrast, the two levels based on

* the treatment of k-points in reciprocal space;
* the treatment of spins, for spin-polarized collinear situations ([[nsppol]] = 2);

are quite easy to use. Examples of such parallelism will
be given in the next sections.

## A simple example of parallelism in ABINIT

[TUTORIAL_README]

### Running a job

*Before starting, you might consider working in a different subdirectory than
for the other tutorials. Why not Work_paral?*

First, you need to copy the input file *tbasepar_1.abi* from the *\$ABI_TESTS/tutorial*
directory to your work directory.

```sh
cd $ABI_TESTS/tutorial/Input
mkdir Work_paral
cd Work_paral
cp ../tbasepar_1.abi .
```

{% dialog tests/tutorial/Input/tbasepar_1.abi %}

You can immediately start a sequential run with

    abinit tbasepar_1.abi >& log 2> err &

in order to obtain a reference CPU time.
On an Intel Xeon 20C 2.1 GHz, it runs in about 40 seconds.

The input file (*.abi) might have to be modified for parallel execution, in order to avoid
unnecessary network communications. If every node has its own temporary or
scratch directory (which is not the case on a single multi-core machine), you can direct the temporary files
to a local disk by providing a path with the [[tmpdata_prefix]] variable in the input file. Supposing each processor has access
to a local temporary disk space named `/scratch/user`, you might add the following line to the input *.abi file:

    tmpdata_prefix="/scratch/user/tbasepar_1"

Note that determining ahead of time the precise resources you will need for
your run will save you a lot of time if you are using a batch queue system.

Also note that, for parallel runs, the individual _log_ files **will not** be written; only the main log file is produced.
You can change this behaviour and enforce the creation of all log files by creating a file named `_LOG`:

```bash
touch _LOG
```

On the contrary, you can create a *_NOLOG* file if you want to suppress all log files.

### Parallelism over the k-points

The most favorable case for a parallel run is to treat the k-points
concurrently, since most calculations can be done independently for each one of them.

Actually, *tbasepar_1.abi* corresponds to the investigation of an *FCC* crystal of
lead, which requires a large number of k-points if one wants to get an
accurate description of the ground state. Examine this file. Note that the
cut-off is realistic, as well as the grid of k-points (giving 182 k-points in
the irreducible Brillouin zone).
By now, the output files for the sequential run that you launched while starting to read this section have likely been produced.
Examine the timing in the output file (the last line gives the `Overall time`, `cpu` and `wall`), and keep note of it.

We assume you have compiled ABINIT with `with_mpi="yes"` at the configuration step.

On a multi-core PC, you might succeed in using two compute cores by issuing the run command of your MPI
implementation, mentioning the number of processors you want to use, followed
by the abinit command:

```bash
mpirun -n 2 abinit tbasepar_1.abi >& tbasepar_1.log &
```

Depending on your particular machine, *mpirun* might have to be replaced by
*mpiexec*, and `-n` by some other option.

By contrast, on a cluster with the MPICH implementation of MPI, you have to set up a file
with the addresses of the different CPUs. Let's suppose you call it _cluster_.
For a dual-processor machine, this file could have only one line, like the following:

    sleepy.pcpm.ucl.ac.be:2

For a cluster of four machines, you might have something like:

    tux0
    tux1
    tux2
    tux3

Then, you have to issue the run command for your MPI implementation, and
mention the number of processors you want to use, as well as the abinit
command and the file containing the CPU addresses.

On a dual-processor machine, this gives the following:

```bash
mpirun -np 2 -machinefile cluster ../../src/main/abinit tbasepar_1.abi >& tbasepar_1.log &
```

Now, examine the corresponding output file. If you have kept the output from
the sequential job, you can make a diff between the two files.

{% dialog tests/tutorial/Refs/tbasepar_1.abo %}

You will notice
that the numerical results are practically identical. You will also see that 182
k-points have been kept in memory in the sequential case (keyword `mkmem`), while 91
k-points have been kept in memory (per processor!) in the parallel case.

The timing can be found at the end of the file. Here is an example:

    - Proc. 0 individual time (sec): cpu= 20.0 wall= 20.1

    ================================================================================

    Calculation completed.
    Delivered 0 WARNINGs and 1 COMMENTs to log file.
    +Overall time at end (sec) : cpu= 40.1 wall= 40.3

This corresponds effectively to a speed-up of the job by a factor of two.
Let's examine it. The line beginning with `Proc. 0` gives the CPU and
wall-clock timing seen by processor number 0 (processor indexing always
starts at 0; here the other one is number 1): 20.0 sec of CPU time, and nearly the same
amount of wall-clock time. The line that starts with `+Overall time`
corresponds to the sum of the CPU times and wall-clock timings over all processors.
The summation is quite meaningful for the CPU time, but not so for the wall-clock
time: the job was finished after 20.1 sec, not 40.3 sec.

Now, you might try to increase the number of processors, and see whether the
CPU time is shared equally amongst the different processors, so that the wall-clock
time seen by each processor decreases. At some point (depending on your
machine and on the sequential part of ABINIT), you will not be able to decrease
the wall-clock time seen by one processor any further, and it is then not worth
using more processors. Let us define the speedup
as the time taken by the
sequential calculation divided by the time of your parallel calculation (hopefully > 1).
You should get a curve similar to this one:



_Speedup with k-point parallelization_

The red curve materializes the speedup achieved, while the green one is the
$y = x$ line. The shape of the red curve will vary depending on your hardware
configuration.
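
If you want to automate such a small scaling study, the following sketch of a shell loop reruns
the same input with an increasing number of MPI processes and extracts the timing line from each
output file. The processor counts and the direct use of `mpirun` are only examples, and assume an
interactive machine (on a cluster, the loop would go inside your batch script instead).

```bash
for n in 1 2 4 8 16; do
  # use one fresh directory per run so the output files do not collide
  mkdir -p run_${n}proc && cp tbasepar_1.abi run_${n}proc/ && cd run_${n}proc
  mpirun -n ${n} abinit tbasepar_1.abi > log 2> err
  # the last lines of the output file contain the overall cpu/wall timings
  grep "Overall time" tbasepar_1.abo
  cd ..
done
```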

One last remark: the number of k-points need not be a multiple of the number
of processors. As an example, you might try to run the above case with 16
processors: all of them will treat at least $\lfloor 182/16 \rfloor=11$ k-points, but $182-16\times11=6$ processors
will have to treat one more k-point, so that $6\times12+10\times11=182$.
The maximal speedup will then only be $15.2$ ($=182/12$), instead of 16.
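
In general (this is simply the load-balance argument above written as a formula), when $n_{kpt}$
k-points are distributed over $n_{proc}$ processors, the busiest processor treats
$\lceil n_{kpt}/n_{proc} \rceil$ k-points, so the maximal speedup of the k-point-parallel part is

$$ S_{\max} = \frac{n_{kpt}}{\lceil n_{kpt}/n_{proc} \rceil}, $$

which gives $182/\lceil 182/16 \rceil = 182/12 \approx 15.2$ in the example above.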

Try to avoid leaving an idle processor, as this can make ABINIT fail with
certain compilers. An idle processor occurs, for example, if you use more processors
than there are k-points.
The extra processors do no useful work, but have to run anyway, just to confirm to ABINIT
once in a while that all processors are alive.

### Parallelism over the spins

The parallelization over the spins (up, down) is done along with the one over
the k-points, so it works exactly the same way. The file
*tbasepar_2.abi* in *\$ABI_TESTS/tutorial* treats a spin-polarized system
(distorted FCC iron) with only one k-point in the irreducible Brillouin zone.
This is quite unphysical, and its sole purpose is to show the spin
parallelism with as few as two processors: the k-point parallelism takes
precedence over the spin parallelism, so that with 2 processors, one must
have only one k-point to see the spin parallelism at work.

{% dialog tests/tutorial/Input/tbasepar_2.abi %}

If needed, modify the input file to provide a local temporary disk space.
Run this test case, first in sequential, then in parallel.
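
For instance (reusing the same `mpirun` syntax as above, to be adapted to your MPI implementation),
you could proceed along these lines:

```bash
# sequential reference run
abinit tbasepar_2.abi >& tbasepar_2_seq.log
# remove or rename the previous output files (or move to another directory),
# then launch the parallel run on 2 cores
mpirun -n 2 abinit tbasepar_2.abi >& tbasepar_2.log &
```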

While the jobs are running, read the input file. Then look closely
at the output and log files of the sequential and parallel cases. They are quite similar.
Actually, apart from the mention of two processors and the speedup, there is no other
manifestation of the parallelism.

{% dialog tests/tutorial/Refs/tbasepar_2.abo %}

If you have more than 2 processors at hand, you might increase the value of
[[ngkpt]], so that more than one k-point is available, and see that the
k-point and spin parallelism indeed work concurrently.

### Number of computing cores to accomplish a task

Balancing the load on the processors efficiently is not always
straightforward. When using k-point and spin parallelism, the ideal numbers
of processors to use are those that divide the product of [[nsppol]] and
[[nkpt]] (e.g. for [[nsppol]] * [[nkpt]] = 12, it is quite efficient to use 2, 3, 4,
6 or 12 processors). ABINIT will nevertheless handle other numbers
of processors correctly, albeit slightly less efficiently, since the final time is
determined by the processor with the biggest share of the work to do.

### Evidencing overhead

Beyond a certain number of processors, the efficiency of parallelism
saturates, and may even decrease. This is due to the inevitable overhead
resulting from the increasing amount of communication between the processors.
The loss of efficiency is highly dependent on the implementation, and is also linked to
the decreasing load on each processor.

<!--
## Details of the implementation

### The MPI toolbox in ABINIT

The ABINIT-specific MPI routines are located in different subdirectories of
`~abinit/src`: `12_hide_mpi/`, `51_manage_mpi/`, `56_io_mpi/`, `79_seqpar_mpi/`. They include:

* low-level communication handlers;
* header I/O helpers (hdr_io, hdr_io_netcdf);
* wavefunction I/O helpers (Wff*);
* a multiprocess-aware output routine (wrtout);
* a clean exit routine (leave_new).

They are used by a wide range of routines.

You might want to have a look at the routine headers for more detailed descriptions.

### How to parallelize a routine: some hints

Here we will give you some advice on how to parallelize a subroutine of
ABINIT. Do not expect too much, and remember that you remain mostly on your
own for most decisions. Furthermore, we will suppose that you are familiar
with ABINIT internals and source code. Anyway, you can skip this section
without hesitation, as it is primarily intended for advanced developers.

First, every call to an MPI routine and every purely parallel section of your
subroutine **must** be surrounded by the following preprocessing directives
if you don't use the wrapper functions provided by the `m_xmpi` module.

```fortran
#if defined HAVE_MPI
...
#endif
```

Usually, the function you write will take a communicator as one of its arguments.
You can then retrieve the number of processors in this communicator with `xmpi_comm_size(comm)`,
and the rank of a processor with `xmpi_comm_rank(comm)`.
Example of such a function:

```fortran
subroutine dosomething(arg, comm)
 !define arguments
 integer, intent(in) :: arg
 integer, intent(in) :: comm
 !define local variables
 integer :: rank
 integer :: size
 integer, parameter :: master = 0 ! proc 0 will be our master

 !retrieve size and rank
 size = xmpi_comm_size(comm)
 rank = xmpi_comm_rank(comm)

 !do things
 call xmpi_sum(...)

 !use non-wrapped MPI functions
#if defined HAVE_MPI
 call mpi_comm_split(comm, ...)
#endif

 !only the master does something
 if ( rank == master ) then
   !do things
 endif

 call wrtout(std_out, "Only master will write", 'COLL')
 call wrtout(std_out, "Each proc will write in its own std_out", 'PERS')
```
-->