Old stuff

git-svn-id: http://qeforge.qe-forge.org/svn/q-e/trunk/espresso@11978 c92efa57-630b-4861-b058-cf58834340f0
This commit is contained in:
spigafi 2016-01-12 05:09:26 +00:00
parent d55fa05fb7
commit cb2e008d8d
1 changed files with 0 additions and 188 deletions


@@ -1,188 +0,0 @@
Info by F. Spiga (spiga -dot- filippo -at- gmail -dot- com) -- Jun 19, 2013
Machine name : MonteRosa at CSCS(CH)
Machine spec : http://user.cscs.ch/hardware/monte_rosa_cray_xe6/index.html
# IMPORTANT NOTE:
Other CRAY XE6 systems might have different modules; please check for
equivalents if the ones mentioned here are missing.
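# NOTE:
A quick way to check what is available on your own machine (the output format
depends on the modules implementation in use, so this is only a sketch):
$ module avail 2>&1 | grep -i PrgEnv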
0. Architecture peculiarities
CRAY XE6 systems currently in operation are equipped with nodes that have
two 16-core AMD Interlagos CPUs. Each Interlagos CPU is composed of a number
of "Bulldozer modules" or "Compute Units". A compute unit has shared and
dedicated components:
- there are two independent integer units
- a SHARED 256-bit floating-point pipeline supporting the SSEx and AVX extensions
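# NOTE:
If you want to verify the node layout yourself, a standard Linux tool run
through aprun from an allocation is usually enough (this assumes numactl is
present in the compute-node image):
$ aprun -n 1 numactl --hardware    # prints the NUMA nodes with their cores and memory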
Because the floating-point pipeline is shared between the two integer cores,
it is better to use the CPU in SINGLE STREAM MODE (aprun -j 1), reducing the
maximum number of OpenMP threads per node from 32 to 16 (split across the two
sockets), so that each thread can fully exploit the floating-point pipeline.
In this way the L2 cache per task is effectively twice as large and the peak
performance (in double precision) should not be affected.
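# NOTE:
A minimal sketch of the difference between the two modes, assuming 4 MPI
tasks per node on 8 nodes (as in the example scripts of section 3) and
OMP_NUM_THREADS set equal to the value of "-d":
$ aprun -n 32 -N 4 -j 1 -d 4 ./pw.x -input ausurf_gamma.in   # SINGLE STREAM: 16 cores per node in use
$ aprun -n 32 -N 4 -d 8 ./pw.x -input ausurf_gamma.in        # DUAL STREAM (default): 32 cores per node in use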
The interconnection topology might have "holes" due to service nodes and I/O
nodes. Cray's Application Level Placement Scheduler (ALPS), together with the
resource manager, should be able to identify a subset of free nodes in the
cluster that minimizes hops. Please refer to the specific user guide provided
by your HPC centre.
1. Compile the code
All the compilers tested work. I prefer to use PGI (or, alternatively, Intel).
1.1 MonteRosa modules (PGI):
There is no default compiler loaded after login...
$ module load PrgEnv-pgi
$ module unload atp totalview-support xt-totalview hss-llm
$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack
$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack --with-elpa
# NOTE:
It is possible to try more aggressive PGI compiler flags and optimizations
by editing make.sys directly, e.g.:
CFLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 $(DFLAGS) $(IFLAGS)
F90FLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 -Mcache_align -r8 -Mpreprocess -mp=nonuma $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 -r8 -mp=nonuma
Real benefits have yet to be proven...
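# NOTE:
A minimal rebuild sequence after changing the flags (this assumes a standard
QE source tree where pw.x is the target of interest):
$ vi make.sys     # edit CFLAGS / F90FLAGS / FFLAGS as shown above
$ make clean
$ make pw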
1.2 MonteRosa modules (Intel):
There is no default compiler loaded after login...
$ module load PrgEnv-intel
$ module unload atp totalview-support xt-totalview hss-llm
$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack
$ ./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack --with-elpa
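# NOTE:
Before running configure it can be worth double-checking which compiler the
CRAY wrappers will actually invoke (the version flag depends on the underlying
compiler, so this is only a sketch):
$ module list 2>&1 | grep PrgEnv
$ ftn -V            # PGI back-end
$ ftn --version     # Intel back-end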
2. Good practices
- if your calculation is FFT-bound, use the hybrid (MPI+OpenMP) version of
  the code. The reason is that there is 1 GByte of RAM per core, and if you
  place 32 MPI tasks on a single node you are going to stress the GEMINI
  interconnect.
- the CRAY LibSci library works well with all the compilers; I do not see
  any advantage in using ACML explicitly.
- use ScaLAPACK (--with-scalapack) and let configure detect and use the
  default library (it will be the CRAY LibSci; make.sys will not show
  anything because everything is handled by the CRAY wrappers ftn/cc).
- try the ELPA library (--with-elpa) but check the results carefully.
- the environment is exported automatically by 'sbatch' at submission time,
  so make sure the right modules are loaded before submitting (see the
  snippet below).
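# NOTE:
A small sketch of how to record the environment from inside the job script
itself, right after the module commands (any equivalent logging works):
module list 2>&1 | tee modules.$SLURM_JOB_ID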
3. Example scripts
SLURM user-guide at CSCS
http://user.cscs.ch/running_batch_jobs/slurm_at_cscs/index.html#c1130
3.1 MonteRosa (SLURM) - SINGLE STREAM MODE
#!/bin/bash
# This script runs pw.x on 128 cores (32 MPI tasks, 4 MPI tasks per node,
# 4 OpenMP threads per MPI task, in SINGLE STREAM MODE).
#SBATCH --job-name="QE-BENCH"
#SBATCH --nodes=8
# explicit task counts so that $SLURM_NTASKS / $SLURM_NTASKS_PER_NODE are set
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>
module load slurm
# Useful information...
echo "The current job ID is $SLURM_JOB_ID"
echo "Running on $SLURM_JOB_NUM_NODES nodes"
echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
echo "A total of $SLURM_NTASKS tasks is used"
export OMP_NUM_THREADS=4
aprun -n $SLURM_NTASKS -j 1 -d 4 -S 1 ./pw.x -input ausurf_gamma.in | tee out
# NOTE:
The flag "-S" is the number of MPI tasks per NUMA node. Each XE6 node contains
2 x 16-core CPUs, i.e. 4 NUMA nodes in total (each NUMA node has 4 Bulldozer
modules). The value of "-S" has to change according to the MPI x OMP
combination within the node:
-d 2 --> -S 2 (because there are 8 MPI tasks to distribute across 4 NUMA nodes)
-d 4 --> -S 1 (because there are 4 MPI tasks to distribute across 4 NUMA nodes)
"-S" is OPTIONAL. The resource manager should be smart enough to place the
MPI processes in the right place, but I have never double-checked.
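# NOTE (sketch):
For completeness, the "-d 2 --> -S 2" combination above would look like this
(same 8 nodes and input file, still in SINGLE STREAM MODE; the task count is
only illustrative):
export OMP_NUM_THREADS=2
aprun -n 64 -N 8 -j 1 -d 2 -S 2 ./pw.x -input ausurf_gamma.in | tee out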
3.2 MonteRosa (SLURM) - DUAL STREAM MODE (or default)
#!/bin/bash
# This script runs pw.x on 256 cores (32 MPI tasks, 4 MPI tasks per node,
# 8 OpenMP threads per MPI task).
#SBATCH --job-name="QE-BENCH"
#SBATCH --nodes=8
# explicit task counts so that $SLURM_NPROCS / $SLURM_NTASKS_PER_NODE are set
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>
module load slurm
# Useful information...
echo "The current job ID is $SLURM_JOB_ID"
echo "Running on $SLURM_JOB_NUM_NODES nodes"
echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
echo "A total of $SLURM_NTASKS tasks is used"
export OMP_NUM_THREADS=8
aprun -n $SLURM_NPROCS -N 4 -d 8 -S 1 ./pw.x -input ausurf_gamma.in -npool 4 | tee out
# NOTE (1):
The flag "-S" is the number of MPI tasks per NUMA node. Each XE6 node contains
2 x 16-core CPUs, i.e. 4 NUMA nodes in total (each NUMA node has 4 Bulldozer
modules). The value of "-S" has to change according to the MPI x OMP
combination within the node:
-N 8 -d 4 --> -S 2 (because there are 8 MPI tasks to distribute across 4 NUMA nodes)
-N 4 -d 8 --> -S 1 (because there are 4 MPI tasks to distribute across 4 NUMA nodes)
"-S" is OPTIONAL. The resource manager should be smart enough to place the
MPI processes in the right place, but I have never double-checked.
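# NOTE (sketch):
For completeness, the "-N 8 -d 4 --> -S 2" combination above would look like
this (same 8 nodes and input file; the task count is only illustrative):
export OMP_NUM_THREADS=4
aprun -n 64 -N 8 -d 4 -S 2 ./pw.x -input ausurf_gamma.in -npool 4 | tee out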
# NOTE (2):
Two other useful aprun options are:
-ss (optional) demands strict memory containment per NUMA node.
-cc (optional) controls how tasks are bound to cores and NUMA nodes.
    The recommended setting for most codes is "-cc cpu", which restricts
    each task to run on a specific core.
Try them and use them wisely (see the sketch below).
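# NOTE (sketch):
Applied to the aprun line of script 3.2, both options would be added like
this (same assumptions as above):
aprun -n $SLURM_NPROCS -N 4 -d 8 -S 1 -ss -cc cpu ./pw.x -input ausurf_gamma.in -npool 4 | tee out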
4. Benchmarks
[TO BE ADDED]