Info by F. Spiga (spiga -dot- filippo -at- gmail -dot- com) -- Jun 19, 2013

Machine name : MonteRosa at CSCS (CH)
Machine spec : http://user.cscs.ch/hardware/monte_rosa_cray_xe6/index.html

# IMPORTANT NOTE:
Other CRAY XE6 systems might have different modules; please check for
equivalents if the ones mentioned here are missing.
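On a Cray login node you can check what is actually available and loaded
before configuring anything (standard "module" commands, shown only as a
hint):

$ module avail PrgEnv     # list the programming environments installed
$ module list             # show the modules currently loaded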

0. Architecture peculiarities

CRAY XE6 systems currently in operation are equipped with XE6 nodes that have
two 16-core AMD Interlagos CPUs. Interlagos is composed of a number of
"Bulldozer modules" or "Compute Units". A compute unit has shared and
dedicated components:
- two independent integer units
- one SHARED 256-bit floating-point pipeline supporting the SSEx and AVX
  extensions

For this reason it is better to use the CPU in SINGLE STREAM MODE (aprun -j 1),
reducing the maximum number of OpenMP threads per node from 32 to 16 (split
across the two sockets), so that the floating-point pipeline can be exploited
to the maximum. In this way the L2 cache is effectively twice as large and
the peak performance (in double precision) should not be affected.
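As a rough sketch of a single-stream launch (one MPI task per node, one
OpenMP thread per Bulldozer module; executable and input names are
placeholders, see section 3 for complete scripts):

export OMP_NUM_THREADS=16
aprun -n 8 -N 1 -j 1 -d 16 ./pw.x -input pw.in | tee out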

The interconnection topology might have "holes" due to service nodes and I/O
nodes. Cray's Application Level Placement Scheduler (ALPS) should be able to
help the resource manager identify a subset of free nodes in the cluster so
as to minimize hops. Please refer to the specific user guide provided by your
HPC centre.


1. Compile the code

All the compilers tested work. I prefer to use PGI (or possibly Intel).


1.1 MonteRosa modules (PGI):

There is no default compiler loaded after login...

$ module load PrgEnv-pgi
$ module unload atp totalview-support xt-totalview hss-llm

$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack

or, to also use ELPA:

$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack --with-elpa

# NOTE:
It is possible to push more aggressive PGI compiler flags and optimizations
by editing make.sys directly...

CFLAGS   = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 $(DFLAGS) $(IFLAGS)
F90FLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 -Mcache_align -r8 -Mpreprocess -mp=nonuma $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS   = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 -r8 -mp=nonuma

Real benefits have to be proven...
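After editing, a quick sanity check that the intended flags actually ended up
in make.sys:

$ grep -E '^(CFLAGS|F90FLAGS|FFLAGS)' make.sys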


1.2 MonteRosa modules (Intel):

There is no default compiler loaded after login...

$ module load PrgEnv-intel
$ module unload atp totalview-support xt-totalview hss-llm

$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack

or, to also use ELPA:

$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack --with-elpa
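If PrgEnv-pgi is already loaded from a previous session, the usual Cray way
to change compiler suite is a swap rather than a load:

$ module swap PrgEnv-pgi PrgEnv-intel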


2. Good practices

- If your calculation is FFT-bound, use the hybrid (MPI+OpenMP) version of
  the code. The reason is that there is 1 GByte of RAM per core, and putting
  32 MPI tasks in a single node is going to stress the GEMINI interconnect.

- The CRAY LIBSCI library works well with all the compilers; I do not see
  any advantage in using ACML explicitly.

- Use ScaLAPACK (--with-scalapack) and let configure detect and use the
  default library (it will be the CRAY libsci; make.sys will not show
  anything because everything is done by the CRAY wrappers ftn/cc).

- Try the ELPA library (--with-elpa), but check the results carefully.

- The environment is exported automatically by 'sbatch' at submission time,
  so make sure you have the right modules loaded when you submit (see the
  sketch after this list).
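As a sketch of the last point, a line like the following near the top of the
batch script records what was actually loaded at submission time (the output
file name is just an example):

module list 2>&1 | tee loaded_modules.$SLURM_JOB_ID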


3. Example scripts

SLURM user guide at CSCS:
http://user.cscs.ch/running_batch_jobs/slurm_at_cscs/index.html#c1130
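Submission and monitoring follow the usual SLURM pattern (the script name is
a placeholder):

$ sbatch job-pw.slurm     # submit the job script
$ squeue -u $USER         # check its status in the queue
$ scancel <jobid>         # remove it if needed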


3.1 MonteRosa (SLURM) - SINGLE STREAM MODE

#!/bin/bash

# This script runs pw.x using 128 cores (32 MPI tasks, 4 MPI tasks per node,
# 4 OMP threads per MPI task, in SINGLE STREAM MODE).

#SBATCH --job-name="QE-BENCH"
#SBATCH --nodes=8
#SBATCH --time=01:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>

module load slurm

# Useful information...
echo "The current job ID is $SLURM_JOB_ID"
echo "Running on $SLURM_JOB_NUM_NODES nodes"
echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
echo "A total of $SLURM_NTASKS tasks is used"

export OMP_NUM_THREADS=4

aprun -n $SLURM_NTASKS -j 1 -d 4 -S 1 ./pw.x -input ausurf_gamma.in | tee out

# NOTE:
The flag "-S" is the number of MPI tasks per NUMA node. Each XE6 node
contains 2 x 16-core CPUs, i.e. 4 NUMA nodes in total (each NUMA node has 4
Bulldozer modules). The value of "-S" has to change according to the MPI x OMP
combination in the node:

-d 2 --> -S 2 (because there are 8 MPI tasks to distribute across 4 NUMA nodes)
-d 4 --> -S 1 (because there are 4 MPI tasks to distribute across 4 NUMA nodes)

"-S" is OPTIONAL. The resource manager should be smart enough to place the
MPI processes in the right place, but I have never double-checked this.
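In other words, with 4 NUMA nodes per XE6 node, "-S" is simply the number of
MPI tasks per node divided by 4. A tiny illustrative helper (assuming the
SLURM variables echoed above are set):

# one MPI task per NUMA node slice: tasks per node / 4 NUMA nodes
S=$(( SLURM_NTASKS_PER_NODE / 4 ))
aprun -n $SLURM_NTASKS -j 1 -d $OMP_NUM_THREADS -S $S ./pw.x -input ausurf_gamma.in | tee out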


3.2 MonteRosa (SLURM) - DUAL STREAM MODE (or default)

#!/bin/bash

# This script runs pw.x using 256 cores (32 MPI tasks, 4 MPI tasks per node,
# 8 OMP threads per MPI task).

#SBATCH --job-name="QE-BENCH"
#SBATCH --nodes=8
#SBATCH --time=01:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>

module load slurm

# Useful information...
echo "The current job ID is $SLURM_JOB_ID"
echo "Running on $SLURM_JOB_NUM_NODES nodes"
echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
echo "A total of $SLURM_NTASKS tasks is used"

export OMP_NUM_THREADS=8

aprun -n $SLURM_NPROCS -N 4 -d 8 -S 1 ./pw.x -input ausurf_gamma.in -npool 4 | tee out

# NOTE (1):
The flag "-S" is the number of MPI tasks per NUMA node. Each XE6 node
contains 2 x 16-core CPUs, i.e. 4 NUMA nodes in total (each NUMA node has 4
Bulldozer modules). The value of "-S" has to change according to the MPI x OMP
combination in the node:

-N 8 -d 4 --> -S 2 (because there are 8 MPI tasks to distribute across 4 NUMA nodes)
-N 4 -d 8 --> -S 1 (because there are 4 MPI tasks to distribute across 4 NUMA nodes)

"-S" is OPTIONAL. The resource manager should be smart enough to place the
MPI processes in the right place, but I have never double-checked this.

# NOTE (2):
Two other useful options for aprun are:
-ss (Optional) Demands strict memory containment per NUMA node.
-cc (Optional) Controls how tasks are bound to cores and NUMA nodes.
    The recommended setting for most codes is "-cc cpu", which restricts
    each task to run on a specific core.

Try them and use them wisely (an example is sketched below).
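For example, the aprun line of the script in 3.2 with both options added
(an untested variant, same placeholders as above):

aprun -n $SLURM_NPROCS -N 4 -d 8 -S 1 -ss -cc cpu ./pw.x -input ausurf_gamma.in -npool 4 | tee out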


4. Benchmarks

[TO BE ADDED]