quantum-espresso/install
spigafi 39dabfb394 Several updates
git-svn-id: http://qeforge.qe-forge.org/svn/q-e/trunk/espresso@9546 c92efa57-630b-4861-b058-cf58834340f0
2012-10-17 16:37:08 +00:00
Make.BGP eliminated Multigrid dep in Make.something in dir install 2011-03-24 15:44:01 +00:00
Make.BGP-openMP eliminated Multigrid dep in Make.something in dir install 2011-03-24 15:44:01 +00:00
Make.BGP-openMP+FFTW Recent make.sys sample for BG, may be useful 2012-06-28 15:44:23 +00:00
Make.altix eliminated Multigrid dep in Make.something in dir install 2011-03-24 15:44:01 +00:00
Make.cray-xt4 Minor addition to installation system: makefile for cray-xt4 with openmp, 2010-02-25 20:29:47 +00:00
Makefile_iotk extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
Makefile_lapack extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
Makefile_lapack_testing_lin extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
README.CINECA_fermi UPdated with latest performance improvements 2012-10-11 21:17:02 +00:00
README.CSCS_rosa Several updates 2012-10-17 16:37:08 +00:00
clean.sh An error in previous commit. 2012-08-20 15:29:26 +00:00
config.guess More minor tweaking: obsolete or useless variables removed, 2006-09-21 20:07:55 +00:00
config.sub added autoconf-based configure (file "configure.new") and related files 2003-11-13 13:35:10 +00:00
configure Correction to a previous commit (now it integrates ELPA updated with revision 9449 2012-09-26 15:53:20 +00:00
configure.ac Correction to a previous commit (now it integrates ELPA updated with revision 9449 2012-09-26 15:53:20 +00:00
configure.msg.in - better support for SCALAPACK library. 2009-08-20 13:24:31 +00:00
extlibs_makefile extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
includedep.sh More miscellanous cleanup from Axel: 2006-12-12 11:02:09 +00:00
install-sh added autoconf-based configure (file "configure.new") and related files 2003-11-13 13:35:10 +00:00
iotk_config.h extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
make.sys.in A trick to specify macros manually without editing the make.sys file. 2012-09-29 17:47:27 +00:00
make_blas.inc.in extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
make_lapack.inc.in extlibs deleted moved to archive and main install 2012-01-03 11:33:44 +00:00
make_wannier90.sys.in make_wannier90.sys.in added to dir install 2010-11-23 11:57:58 +00:00
makedeps.sh The Intel compiler has a very informative backtrace routine. 2012-10-15 10:03:26 +00:00
moduledep.sh More miscellanous cleanup from Axel: 2006-12-12 11:02:09 +00:00
namedep.sh More miscellanous cleanup from Axel: 2006-12-12 11:02:09 +00:00
plugins_list plugin_list now new url 2012-08-28 13:00:27 +00:00
plugins_makefile Instances of "test ! -e" (true if file does not exist) replaced with "test ! -s" (true if file does not exist, or is empty). This should mitigate the recent problems with empty archives. Tested on SUSE and AIX. 2012-08-22 13:53:59 +00:00
update_version IBM machines do not like "diff -q" 2012-07-20 13:37:00 +00:00

README.CSCS_rosa

Info by Filippo Spiga, Oct. 2012, valid for any version of QE after 5.


Machine name    : MonteRosa (Cray XE6) at CSCS, Lugano (CH)
Machine spec    : http://user.cscs.ch/hardware/monte_rosa_cray_xe6/index.html
Similar systems : HECToR (EPCC, UK), HERMIT (HLRS, DE), Cielo (LANL, US), 
                  HOPPER (NERSC, US), Beagle (ANL, US)

IMPORTANT NOTE: the CRAY XK7 system has a different compute node spec because
                one CPU has been replaced by an NVIDIA K20 GPU. Instructions to
                compile and run are similar, BUT some parameters have to change
                to match the different architecture.
 

1. Compile the code

The code can be compiled using INTEL, PGI or GNU compilers. On this
machine, surprisingly, the GNU compilers seem to deliver slightly better
performance.

The default package available is built using GNU but it does not
use OpenMP. I think this is not the best choice in many cases.

NOTE: these flags are just a starting point; further investigation
      might identify better combinations.

1.1 Using GNU compilers (thanks to Luca Marsella, CSCS)

module load PrgEnv-gnu
module load fftw
module load perftools
export CC="cc"
export FC="ftn"
export CXX="CC"
export MPICC="cc"
export MPIF90="ftn"
export CFLAGS="-O3 -fopenmp"
export FCFLAGS="-O3 -fopenmp"
export CXXFLAGS="-O3 -fopenmp"
./configure LIBDIRS="$LIBDIRS" --enable-parallel --enable-openmp --with-scalapack --disable-shared

1.2 Using PGI compilers

module load PrgEnv-pgi
module switch pgi/12.5.0 pgi/12.8.0
module unload atp totalview-support xt-totalview hss-llm
./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack 

and then, if needed, edit "make.sys" manually as follows:

CFLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 $(DFLAGS) $(IFLAGS)
F90FLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 -Mcache_align -r8 -Mpreprocess -mp $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline -tp bulldozer-64 -r8 -mp 

These flags improve performance by up to ~10% over the defaults.

1.3 Using INTEL compilers

module load PrgEnv-intel
module unload atp totalview-support xt-totalview hss-llm
export CFLAGS="-O3 -openmp"
export FFLAGS="-O2 -assume byterecl -g -traceback -par-report0 -vec-report0 -openmp"
export F90FLAGS="-nomodule -openmp"
export FFLAGS_NOOPT="-O0 -assume byterecl -g -traceback"
export FFLAGS_NOMAIN="-nofor_main"
./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack 

NOTE: configure does not properly detect the Intel compilers on CRAY because
      historically CRAY systems shipped with the PGI, CRAY and PATHSCALE compilers.
      

2. Good practices

- if your calculation is FFT-bound, use the hybrid (MPI+OpenMP) version of
the code. The reason is that there is only 1 GByte of RAM per core, and if
you put 32 MPI processes on a single node you are going to stress the
GEMINI interconnect.

- the CRAY LIBSCI library works well with all the compilers; I do not see
any advantage in using ACML explicitly.

- use ScaLAPACK (--with-scalapack) and let configure detect and use the
default library (it will be the CRAY libsci; make.sys will not show
anything because everything is handled by the CRAY wrappers ftn/cc).

- the new ELPA library (--with-elpa) has not yet been tested.

- the environment is exported automatically by 'sbatch' at submission
time, so check that the right modules are loaded before submitting.
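
Since 'sbatch' exports the current environment, a quick check of the loaded
modules before submitting can catch a wrong PrgEnv. The following is only a
sketch: "has_module" is a hypothetical helper, the sample module list and its
version numbers are made up, and it assumes that "module list" prints to
stderr, as it commonly does.

```shell
# Hypothetical helper: succeeds if the given module list text mentions
# the wanted module (e.g. PrgEnv-gnu).
has_module() {
    # $1 = text of the module list, $2 = module name to look for
    printf '%s\n' "$1" | grep -q -- "$2"
}

# On the login node one would capture the real list, for example:
#   loaded=$(module list 2>&1)
# A made-up sample is used here so that the sketch runs anywhere.
loaded="  1) modules  2) PrgEnv-gnu/4.0.46  3) fftw/3.3.0.1"

if has_module "$loaded" "PrgEnv-gnu" && has_module "$loaded" "fftw"; then
    echo "modules OK, safe to sbatch"
else
    echo "missing modules, load them before sbatch" >&2
fi
```

Replace the module names with the ones matching the compiler used to build
pw.x.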



3. Example scripts 

This script runs pw.x over 6400 cores (1600 MPI processes, 8 MPI processes
per node, 4 OpenMP threads per MPI process).

#!/bin/bash
#SBATCH --job-name="QE-BENCH-SPIGA"
#SBATCH --nodes=200
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4
#SBATCH --time=06:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>

export OMP_NUM_THREADS=4
aprun -n $SLURM_NPROCS -N 8 -d 4 -S 2 ./pw.x -input SiGe25.in -npool 4 | tee out


This script runs pw.x over 6400 cores (800 MPI processes, 4 MPI processes
per node, 8 OpenMP threads per MPI process).

#!/bin/bash
#SBATCH --job-name="QE-BENCH-SPIGA"
#SBATCH --nodes=200
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=06:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>

export OMP_NUM_THREADS=8
aprun -n $SLURM_NPROCS -N 4 -d 8 -S 1 ./pw.x -input SiGe25.in -npool 4 | tee out
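
The arithmetic behind the two layouts can be checked with a few lines of
shell. This is a minimal sketch: "check_layout" is a hypothetical helper, not
part of QE or of the CRAY tools, and it assumes 32 cores per XE6 node
(2 x 16-core CPUs).

```shell
# Hypothetical helper: check an MPI x OpenMP layout against the node size.
# $1 = nodes, $2 = MPI tasks per node, $3 = OpenMP threads per MPI task
check_layout() {
    cores_per_node=32              # 2 x 16-core CPUs per XE6 node
    used=$(( $2 * $3 ))
    if [ "$used" -gt "$cores_per_node" ]; then
        echo "oversubscribed: $used > $cores_per_node cores per node" >&2
        return 1
    fi
    echo "MPI tasks: $(( $1 * $2 )), cores: $(( $1 * used ))"
}

check_layout 200 8 4   # --ntasks-per-node=8, --cpus-per-task=4
check_layout 200 4 8   # --ntasks-per-node=4, --cpus-per-task=8
```

Both calls report 6400 cores: 200 nodes with 8 tasks each give 1600 MPI tasks,
while 200 nodes with 4 tasks each give 800.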


The flag "-S" is the number of MPI tasks per NUMA node. Each XE6 node
contains 2 x 16-core CPUs, 4 NUMA nodes in total. The value of "-S" has
to change according to the MPI x OpenMP combination within the node:

-N 8 -d 4 --> -S 2 (because there are 8 MPI to distribute across 4 NUMA nodes)
-N 4 -d 8 --> -S 1 (because there are 4 MPI to distribute across 4 NUMA nodes)
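
The same rule can be written as a one-line computation. This is a sketch
only: "tasks_per_numa" is a hypothetical helper, and rounding up when the
tasks do not divide evenly across the 4 NUMA nodes is an assumption, not
something stated above.

```shell
# Hypothetical helper: aprun -S value for a given number of MPI tasks per node.
# Assumes 4 NUMA nodes per XE6 compute node; rounds up if the division is
# not exact (the rounding is an assumption).
tasks_per_numa() {
    numa_nodes=4
    echo $(( ($1 + numa_nodes - 1) / numa_nodes ))
}

echo "-N 8 -> -S $(tasks_per_numa 8)"
echo "-N 4 -> -S $(tasks_per_numa 4)"
```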


NOTE (1): "-S" is optional. The resource manager should be smart enough to
      place the MPI processes in the right place, but I never double-checked.
      
NOTE (2): two other useful options for aprun are:

-ss	(Optional) Demands strict memory containment per NUMA node. 

-cc	(Optional) Controls how tasks are bound to cores and NUMA nodes. 
               The recommended setting for most codes is -cc cpu which restricts 
               each task to run on a specific core. 
               
          Try and use them wisely.
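
As an illustration, the sketch below only prints (it does not run) an aprun
line that combines "-S" with "-ss" and "-cc cpu". "aprun_line" is a
hypothetical helper, and whether these options actually help pw.x on this
machine has not been measured here.

```shell
# Hypothetical helper: print an aprun line with explicit placement options.
# $1 = total MPI tasks, $2 = MPI tasks per node, $3 = OpenMP threads per task
aprun_line() {
    per_numa=$(( ($2 + 3) / 4 ))   # 4 NUMA nodes per XE6 node, rounded up
    echo "aprun -n $1 -N $2 -d $3 -S $per_numa -ss -cc cpu ./pw.x -input SiGe25.in -npool 4"
}

aprun_line 1600 8 4
aprun_line 800 4 8
```

Compare timings with and without the extra options before adopting them in
production scripts.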