README.CSCS_rosa
Info by Filippo Spiga, Oct. 2012, valid for any version of QE after 5.

Machine name    : Monte Rosa (Cray XE6) at CSCS, Lugano (CH)
Machine spec    : http://user.cscs.ch/hardware/monte_rosa_cray_xe6/index.html
Similar systems : HECToR (EPCC, UK), HERMIT (HLRS, DE), Cielo (LANL, US),
                  HOPPER (NERSC, US), Beagle (ANL, US)

IMPORTANT NOTE: the CRAY XK7 system has a different compute-node spec,
because one CPU has been replaced by an NVIDIA K20 GPU. The instructions
to compile and run are similar, BUT some parameters have to change
according to the different architecture.

1. Compile the code

The code can be compiled with the INTEL, PGI or GNU compilers. On this
machine, surprisingly, the GNU compiler seems to deliver slightly better
performance. The default package available on the system is built with
GNU but does not use OpenMP, which is not the best choice in many cases.

NOTE: these flags are just a starting point; further investigation might
identify better combinations.

1.1 Using GNU compilers (thanks to Luca Marsella, CSCS)

module load PrgEnv-gnu
module load fftw
module load perftools

export CC="cc"
export FC="ftn"
export CXX="CC"
export MPICC="cc"
export MPIF90="ftn"
export CFLAGS="-O3 -fopenmp"
export FCFLAGS="-O3 -fopenmp"
export CXXFLAGS="-O3 -fopenmp"

./configure LIBDIRS="$LIBDIRS" --enable-parallel --enable-openmp \
            --with-scalapack --disable-shared

1.2 Using PGI compilers

module load PrgEnv-pgi
module switch pgi/12.5.0 pgi/12.8.0
module unload atp totalview-support xt-totalview hss-llm

./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack

then, if needed, edit "make.sys" manually in this way:

CFLAGS   = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline \
           -tp bulldozer-64 $(DFLAGS) $(IFLAGS)
F90FLAGS = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline \
           -tp bulldozer-64 -Mcache_align -r8 -Mpreprocess -mp \
           $(FDFLAGS) $(IFLAGS) $(MODFLAGS)
FFLAGS   = -Minfo=all -Mneginfo=all -O3 -fastsse -Mipa=fast,inline \
           -tp bulldozer-64 -r8 -mp

These flags raise performance by up to ~10% over the defaults.

1.3 Using INTEL compilers

module load PrgEnv-intel
module unload atp totalview-support xt-totalview hss-llm

export CFLAGS="-O3 -openmp"
export FFLAGS="-O2 -assume byterecl -g -traceback -par-report0 -vec-report0 -openmp"
export F90FLAGS="-nomodule -openmp"
export FFLAGS_NOOPT="-O0 -assume byterecl -g -traceback"
export FFLAGS_NOMAIN="-nofor_main"

./configure ARCH=crayxt --enable-openmp --enable-parallel --with-scalapack

NOTE: configure does not properly detect the Intel compiler on CRAY,
because historically CRAY systems shipped with the PGI, CRAY and
PATHSCALE compilers.
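A possible workaround is to name the compiler wrappers explicitly so
that configure does not have to guess. This is a minimal sketch, not
verified on this machine; it assumes the standard Cray wrapper names
(ftn/cc) and that this version of configure honors the MPIF90 and CC
variables:

module load PrgEnv-intel
module unload atp totalview-support xt-totalview hss-llm

# With PrgEnv-intel loaded, the Cray wrappers ftn/cc drive the Intel
# compilers underneath; passing them explicitly bypasses the detection.
./configure ARCH=crayxt MPIF90=ftn CC=cc \
            --enable-openmp --enable-parallel --with-scalapack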
2. Good practices

- If your calculation is FFT-bound, use the hybrid (MPI+OpenMP) version
  of the code. The reason is that there is 1 GByte of RAM per core, and
  if you put 32 MPI tasks on a single node you are going to stress the
  GEMINI interconnect.

- The CRAY LIBSCI library works well with all the compilers; I do not
  see any advantage in using ACML explicitly.

- Use ScaLAPACK (--with-scalapack) and let configure detect and use the
  default library (it will be CRAY libsci; make.sys will not show
  anything because everything is handled by the CRAY wrappers ftn/cc).

- The new ELPA library (--with-elpa) has not been tested yet.

- The environment is exported automatically by 'sbatch' at submission
  time, so check that you have loaded the right modules.

3. Example scripts

This script runs pw.x over 6400 cores (1600 MPI tasks, 8 MPI tasks per
node, 4 OpenMP threads per MPI task).

#!/bin/bash
#SBATCH --job-name="QE-BENCH-SPIGA"
#SBATCH --nodes=200
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4
#SBATCH --time=06:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>

export OMP_NUM_THREADS=4
aprun -n $SLURM_NPROCS -N 8 -d 4 -S 2 ./pw.x -input SiGe25.in -npool 4 | tee out

This script runs pw.x over 6400 cores (800 MPI tasks, 4 MPI tasks per
node, 8 OpenMP threads per MPI task).

#!/bin/bash
#SBATCH --job-name="QE-BENCH-SPIGA"
#SBATCH --nodes=200
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=06:00:00
#SBATCH --output=QE-BENCH.%j.o
#SBATCH --error=QE-BENCH.%j.e
#SBATCH --account=<...>

export OMP_NUM_THREADS=8
aprun -n $SLURM_NPROCS -N 4 -d 8 -S 1 ./pw.x -input SiGe25.in -npool 4 | tee out

The flag "-S" is the number of MPI tasks per NUMA node. Each XE6 node
contains 2 x 16-core CPUs, i.e. 4 NUMA nodes in total. The value of
"-S" has to change according to the MPI x OMP combination within the
node:

-N 8 -d 4 --> -S 2 (because there are 8 MPI tasks to distribute across 4 NUMA nodes)
-N 4 -d 8 --> -S 1 (because there are 4 MPI tasks to distribute across 4 NUMA nodes)

NOTE (1): "-S" is optional. The resource manager should be smart enough
to place the MPI processes correctly, but I have never double-checked
this.

NOTE (2): two other useful options for aprun are:

-ss  (optional) demands strict memory containment per NUMA node.
-cc  (optional) controls how tasks are bound to cores and NUMA nodes.
     The recommended setting for most codes is "-cc cpu", which
     restricts each task to a specific core.

Try them and use them wisely.
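For other MPI x OMP combinations the same arithmetic can be scripted.
Below is a minimal sketch, not part of the original scripts; it assumes
SLURM exports SLURM_NTASKS_PER_NODE (set when --ntasks-per-node is
given) and the 32-core, 4-NUMA-node XE6 layout described above:

#!/bin/bash
# Derive aprun placement flags from the per-node MPI task count.
TASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-8}        # e.g. 8 or 4
NUMA_PER_NODE=4                                   # 4 NUMA nodes per XE6 node
export OMP_NUM_THREADS=$(( 32 / TASKS_PER_NODE )) # fill all 32 cores per node
S=$(( TASKS_PER_NODE / NUMA_PER_NODE ))           # 8 MPI/node -> -S 2, 4 -> -S 1
aprun -n $SLURM_NPROCS -N $TASKS_PER_NODE -d $OMP_NUM_THREADS -S $S \
      -ss -cc cpu ./pw.x -input SiGe25.in -npool 4 | tee out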