From e741259d1bea511dfab5e153db275029578ff5b7 Mon Sep 17 00:00:00 2001 From: Mark Dewing Date: Fri, 25 Mar 2022 11:23:37 -0500 Subject: [PATCH 01/26] Add chapter on performance portable implementation. This is a major change, so it seems worthwhile to highlight in its own chapter. Then it also shows up in the table of contents. --- docs/index.rst | 1 + docs/methods.rst | 28 ---------------------------- docs/performance_portable.rst | 29 +++++++++++++++++++++++++++++ 3 files changed, 30 insertions(+), 28 deletions(-) create mode 100644 docs/performance_portable.rst diff --git a/docs/index.rst b/docs/index.rst index 5799efb3d..1f7edc6ee 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -18,6 +18,7 @@ User's Guide and Developer's Manual introduction features + performance_portable installation running units diff --git a/docs/methods.rst b/docs/methods.rst index 5ab2275bd..56c594116 100644 --- a/docs/methods.rst +++ b/docs/methods.rst @@ -114,34 +114,6 @@ To continue a run, specify the ``mcwalkerset`` element before your VMC/DMC block In the project id section, make sure that the series number is different from any existing ones to avoid overwriting them. -.. _batched_drivers: - -Batched drivers ---------------- - -Under the Exascale Computing Project effort a new set of QMC drivers was developed -to eliminate the divergence of legacy CPU and GPU code paths at the QMC driver level and make the drivers CPU/GPU agnostic. -The divergence came from the the fact that the CPU code path favors executing all the compute tasks within a step -for one walker and then advance walker by walker. Multiple CPU threads process their own assigned walkers in parallel. -In this way, walkers are not synchronized with each other and maximal throughout can be achieved on CPU. -The GPU code path favors executing the same compute task over all the walkers together to maximize GPU throughput. -This compute dispatch pattern minimizes the overhead of dispatching computation and host-device data transfer. -However, the legacy GPU code path only leverages the OpenMP main host thread for handling -all the interaction between the host and GPUs and limit the kernel dispatch capability. -In brief, the CPU code path handles computation with a walker batch size of one and many batches -while the GPU code path uses only one batch containing all the walkers. -The new drivers that implement this flexible batching scheme are called "batched drivers". - -The batched drivers introduce a new concept, "crowd", as a sub-organization of walker population. -A crowd is a subset of the walkers that are operated on as as single batch. -Walkers within a crowd operate their computation in lock-step, which helps the GPU efficiency. -Walkers between crowds remain fully asynchronous unless operations involving the full population are needed. -With this flexible batching capability the new drivers are capable of delivering maximal performance on given hardware. -In the new driver design, all the batched API calls may fallback to an existing single walker implementation. -Consequently, batched drivers allow mixing and matching CPU-only and GPU-accelerated features -in a way that is not feasible with the legacy GPU implementation. - -For OpenMP GPU offload users, batched drivers are essential to effectively use GPUs. .. _transition_guide: diff --git a/docs/performance_portable.rst b/docs/performance_portable.rst new file mode 100644 index 000000000..59f3462a6 --- /dev/null +++ b/docs/performance_portable.rst @@ -0,0 +1,29 @@ +.. 
_performance_portable:
+
+Performance Portable Implementation
+===================================
+
+Under the Exascale Computing Project effort a new set of QMC drivers was developed
+to eliminate the divergence of legacy CPU and GPU code paths at the QMC driver level and make the drivers CPU/GPU agnostic.
+The divergence came from the fact that the CPU code path favors executing all the compute tasks within a step
+for one walker and then advancing walker by walker. Multiple CPU threads process their own assigned walkers in parallel.
+In this way, walkers are not synchronized with each other and maximal throughput can be achieved on CPU.
+The GPU code path favors executing the same compute task over all the walkers together to maximize GPU throughput.
+This compute dispatch pattern minimizes the overhead of dispatching computation and host-device data transfer.
+However, the legacy GPU code path only leverages the OpenMP main host thread for handling
+all the interaction between the host and GPUs and limits the kernel dispatch capability.
+In brief, the CPU code path handles computation with a walker batch size of one and many batches
+while the GPU code path uses only one batch containing all the walkers.
+The new drivers that implement this flexible batching scheme are called "batched drivers".
+
+The batched drivers introduce a new concept, "crowd", as a sub-organization of walker population.
+A crowd is a subset of the walkers that are operated on as a single batch.
+Walkers within a crowd operate their computation in lock-step, which helps the GPU efficiency.
+Walkers between crowds remain fully asynchronous unless operations involving the full population are needed.
+With this flexible batching capability the new drivers are capable of delivering maximal performance on given hardware.
+In the new driver design, all the batched API calls may fall back to an existing single walker implementation.
+Consequently, batched drivers allow mixing and matching CPU-only and GPU-accelerated features
+in a way that is not feasible with the legacy GPU implementation.
+
+For OpenMP GPU offload users, batched drivers are essential to effectively use GPUs.
+

From 637811914ea31b77c9ce0c6df4314bb64d699d9f Mon Sep 17 00:00:00 2001
From: Mark Dewing
Date: Mon, 28 Mar 2022 13:12:30 -0500
Subject: [PATCH 02/26] Move details about batched drivers

Move details about batched drivers back to Methods chapter.
Add a link to that section from the performance portable chapter.
---
 docs/methods.rst | 16 ++++++++++++++++
 docs/performance_portable.rst | 10 +---------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/docs/methods.rst b/docs/methods.rst
index 56c594116..22c5ce3a2 100644
--- a/docs/methods.rst
+++ b/docs/methods.rst
@@ -115,6 +115,22 @@ To continue a run, specify the ``mcwalkerset`` element before your VMC/DMC block
 In the project id section, make sure that the series number is different from any
 existing ones to avoid overwriting them.

+.. _batched_drivers:
+
+Batched drivers
+---------------
+
+The batched drivers introduce a new concept, "crowd", as a sub-organization of walker population.
+A crowd is a subset of the walkers that are operated on as a single batch.
+Walkers within a crowd operate their computation in lock-step, which helps the GPU efficiency.
+Walkers between crowds remain fully asynchronous unless operations involving the full population are needed.
+With this flexible batching capability the new drivers are capable of delivering maximal performance on given hardware.
+In the new driver design, all the batched API calls may fall back to an existing single walker implementation.
+Consequently, batched drivers allow mixing and matching CPU-only and GPU-accelerated features
+in a way that is not feasible with the legacy GPU implementation.
+
+
+
 .. _transition_guide:

 Transition from classic drivers

diff --git a/docs/performance_portable.rst b/docs/performance_portable.rst
index 59f3462a6..584175a6f 100644
--- a/docs/performance_portable.rst
+++ b/docs/performance_portable.rst
@@ -16,14 +16,6 @@ In brief, the CPU code path handles computation with a walker batch size of one
 while the GPU code path uses only one batch containing all the walkers.
 The new drivers that implement this flexible batching scheme are called "batched drivers".

-The batched drivers introduce a new concept, "crowd", as a sub-organization of walker population.
-A crowd is a subset of the walkers that are operated on as a single batch.
-Walkers within a crowd operate their computation in lock-step, which helps the GPU efficiency.
-Walkers between crowds remain fully asynchronous unless operations involving the full population are needed.
-With this flexible batching capability the new drivers are capable of delivering maximal performance on given hardware.
-In the new driver design, all the batched API calls may fall back to an existing single walker implementation.
-Consequently, batched drivers allow mixing and matching CPU-only and GPU-accelerated features
-in a way that is not feasible with the legacy GPU implementation.
-
 For OpenMP GPU offload users, batched drivers are essential to effectively use GPUs.

+More information about the new drivers is given in the :ref:`Batched Drivers <batched_drivers>` section.

From 3e36d085a6bf30c60711d285d725e7f0dad38cf0 Mon Sep 17 00:00:00 2001
From: Mark Dewing
Date: Tue, 12 Apr 2022 12:38:09 -0500
Subject: [PATCH 03/26] Add links to other relevant manual sections

---
 docs/installation.rst | 3 +++
 docs/performance_portable.rst | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/docs/installation.rst b/docs/installation.rst
index ad96d7656..daf703ca7 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -398,6 +398,9 @@ the path to the source directory.

 See :ref:`Sanitizer-Libraries` for more information.

+
+.. _offloadbuild:
+
 Notes for OpenMP target offload to accelerators (experimental)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 QMCPACK is currently being updated to support OpenMP target offload and obtain performance

diff --git a/docs/performance_portable.rst b/docs/performance_portable.rst
index 584175a6f..dad901e9e 100644
--- a/docs/performance_portable.rst
+++ b/docs/performance_portable.rst
@@ -18,4 +18,10 @@ The new drivers that implement this flexible batching scheme are called "batched

 For OpenMP GPU offload users, batched drivers are essential to effectively use GPUs.

-More information about the new drivers is given in the :ref:`Batched Drivers <batched_drivers>` section.
+Links to more information in other sections of the manual:
+
+  - **Build instructions:** :ref:`OpenMP target offload <offloadbuild>` section of the :ref:`obtaininginstalling` chapter.
+
+  - **Supported features:** :ref:`gpufeatures` section of the :ref:`chap:features` chapter.
+
+  - **Driver Inputs:** :ref:`batched_drivers` section of the :ref:`qmcmethods` chapter.
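The crowd concept described in the performance-portable chapter above can be pictured with a small, self-contained C++ sketch. This is not QMCPACK's driver API; the names ``Walker``, ``Crowd``, and ``advance_step`` are invented for illustration only, and a real batched driver would dispatch one batched compute call per crowd (for example to a GPU) rather than the plain loop shown here::

  // Toy model of crowd-based batching: walkers inside a crowd move through a
  // step together, while different crowds progress independently.
  #include <cstddef>
  #include <iostream>
  #include <vector>

  struct Walker
  {
    double position = 0.0;
  };

  // A crowd is a batch of walkers operated on as a single unit.
  using Crowd = std::vector<Walker>;

  // Apply the same compute task to every walker of one crowd (lock-step).
  void advance_step(Crowd& crowd, double drift)
  {
    for (auto& w : crowd)
      w.position += drift;
  }

  int main()
  {
    const std::size_t total_walkers = 8;
    const std::size_t num_crowds    = 2;

    // Partition the walker population into independent crowds.
    std::vector<Crowd> crowds(num_crowds, Crowd(total_walkers / num_crowds));

    // Crowds could be advanced by different host threads; walkers within a
    // crowd always pass through each step together.
    for (int step = 0; step < 3; ++step)
      for (auto& crowd : crowds)
        advance_step(crowd, 0.1);

    std::cout << "walker 0 of crowd 0 at " << crowds[0][0].position << '\n';
    return 0;
  }

Choosing how many crowds the population is split into is what lets the same driver favor either many asynchronous batches (CPU-like) or a few large batches (GPU-like), which is the flexibility the chapter describes.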
From 96998d0bb9b409aa84eab315d235418a4adfbcd6 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Thu, 21 Apr 2022 12:10:03 -0500 Subject: [PATCH 04/26] Refine CI script. --- tests/test_automation/github-actions/ci/run_step.sh | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tests/test_automation/github-actions/ci/run_step.sh b/tests/test_automation/github-actions/ci/run_step.sh index 7b9d12c33..1b38ea6c0 100755 --- a/tests/test_automation/github-actions/ci/run_step.sh +++ b/tests/test_automation/github-actions/ci/run_step.sh @@ -41,7 +41,7 @@ case "$1" in export CUDACXX=/usr/local/cuda-11.2/bin/nvcc # Make current environment variables available to subsequent steps - echo "PATH=/usr/local/cuda-11.2/bin:$PATH" >> $GITHUB_ENV + echo "PATH=$PATH" >> $GITHUB_ENV echo "CUDACXX=/usr/local/cuda-11.2/bin/nvcc" >> $GITHUB_ENV else @@ -286,6 +286,12 @@ case "$1" in # Build using ninja (~ 25 minutes on GitHub-hosted runner) build) + # CUDA toolchain can be used implicitly by the compiler. Double check the location. + if [[ "${GH_JOBNAME}" =~ (CUDA) ]] + then + which nvcc + fi + cd ${GITHUB_WORKSPACE}/../qmcpack-build ninja ;; From 3dfd107fc9acae8d504057f858943555a30b59da Mon Sep 17 00:00:00 2001 From: camelto2 Date: Fri, 22 Apr 2022 12:39:43 -0600 Subject: [PATCH 05/26] add evalGradSource to SpinorSet --- src/QMCWaveFunctions/SpinorSet.cpp | 26 +++++++++++++++++ src/QMCWaveFunctions/SpinorSet.h | 16 +++++++++++ src/QMCWaveFunctions/tests/test_MO_spinor.cpp | 28 +++++++++++++++++++ 3 files changed, 70 insertions(+) diff --git a/src/QMCWaveFunctions/SpinorSet.cpp b/src/QMCWaveFunctions/SpinorSet.cpp index 197a43127..dc9825e82 100644 --- a/src/QMCWaveFunctions/SpinorSet.cpp +++ b/src/QMCWaveFunctions/SpinorSet.cpp @@ -419,6 +419,32 @@ void SpinorSet::evaluate_spin(const ParticleSet& P, int iat, ValueVector& psi, V dpsi = eye * (eis * psi_work_up - emis * psi_work_down); } +void SpinorSet::evaluateGradSource(const ParticleSet& P, + int first, + int last, + const ParticleSet& source, + int iat_src, + GradMatrix& gradphi) +{ + IndexType nelec = P.getTotalNum(); + + GradMatrix gradphi_up(nelec, OrbitalSetSize); + GradMatrix gradphi_dn(nelec, OrbitalSetSize); + spo_up->evaluateGradSource(P, first, last, source, iat_src, gradphi_up); + spo_dn->evaluateGradSource(P, first, last, source, iat_src, gradphi_dn); + + for (int iat = 0; iat < nelec; iat++) + { + ParticleSet::Scalar_t s = P.activeSpin(iat); + RealType coss = std::cos(s); + RealType sins = std::sin(s); + ValueType eis(coss, sins); + ValueType emis(coss, -sins); + for (int imo = 0; imo < OrbitalSetSize; imo++) + gradphi(iat, imo) = gradphi_up(iat, imo) * eis + gradphi_dn(iat, imo) * emis; + } +} + std::unique_ptr SpinorSet::makeClone() const { auto myclone = std::make_unique(); diff --git a/src/QMCWaveFunctions/SpinorSet.h b/src/QMCWaveFunctions/SpinorSet.h index 30a5445e2..233b94a71 100644 --- a/src/QMCWaveFunctions/SpinorSet.h +++ b/src/QMCWaveFunctions/SpinorSet.h @@ -129,6 +129,22 @@ public: */ void evaluate_spin(const ParticleSet& P, int iat, ValueVector& psi, ValueVector& dpsi) override; + /** evaluate the gradients of this single-particle orbital + * for [first,last) target particles with respect to the given source particle + * @param P current ParticleSet + * @param first starting index of the particles + * @param last ending index of the particles + * @param iat_src source particle index + * @param gradphi gradients + * + */ + virtual void evaluateGradSource(const ParticleSet& P, + int first, + int last, + const ParticleSet& 
source, + int iat_src, + GradMatrix& gradphi) override; + std::unique_ptr makeClone() const override; private: diff --git a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp index 231b41891..d859d4a50 100644 --- a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp +++ b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp @@ -118,6 +118,20 @@ void test_lcao_spinor() CHECK(d2psiM[iat][0] == ComplexApprox(vlp).epsilon(eps)); } + /** this is a somewhat simple example. We have an ion at the origin + * and a gaussian basis function centered on the ion as a orbital. + * In this case, the ion derivative is actually just the negative of + * the electron gradient. + */ + SPOSet::GradMatrix gradIon(elec_.R.size(), spo->getOrbitalSetSize()); + spo->evaluateGradSource(elec_, 0, elec_.R.size(), ions_, 0, gradIon); + for (int iat = 0; iat < 1; iat++) + { + CHECK(gradIon[iat][0][0] == ComplexApprox(-vdx).epsilon(eps)); + CHECK(gradIon[iat][0][1] == ComplexApprox(-vdy).epsilon(eps)); + CHECK(gradIon[iat][0][2] == ComplexApprox(-vdz).epsilon(eps)); + } + int OrbitalSetSize = spo->getOrbitalSetSize(); //temporary arrays for holding the values of the up and down channels respectively. SPOSet::ValueVector psi_work; @@ -434,6 +448,20 @@ void test_lcao_spinor_excited() CHECK(d2psiM[iat][0] == ComplexApprox(vlp).epsilon(eps)); } + /** this is a somewhat simple example. We have an ion at the origin + * and a gaussian basis function centered on the ion as a orbital. + * In this case, the ion derivative is actually just the negative of + * the electron gradient. + */ + SPOSet::GradMatrix gradIon(elec_.R.size(), spo->getOrbitalSetSize()); + spo->evaluateGradSource(elec_, 0, elec_.R.size(), ions_, 0, gradIon); + for (int iat = 0; iat < 1; iat++) + { + CHECK(gradIon[iat][0][0] == ComplexApprox(-vdx).epsilon(eps)); + CHECK(gradIon[iat][0][1] == ComplexApprox(-vdy).epsilon(eps)); + CHECK(gradIon[iat][0][2] == ComplexApprox(-vdz).epsilon(eps)); + } + //temporary arrays for holding the values of the up and down channels respectively. 
SPOSet::ValueVector psi_work; From cbc9f67f820cd94530598e4368c047aeb8015e48 Mon Sep 17 00:00:00 2001 From: camelto2 Date: Fri, 22 Apr 2022 15:54:33 -0600 Subject: [PATCH 06/26] add unit test of evaluateGradSource for spinor with multiple atoms --- src/QMCWaveFunctions/tests/CMakeLists.txt | 2 + .../tests/lcao_spinor_molecule.h5 | Bin 0 -> 22992 bytes .../tests/lcao_spinor_molecule_test.py | 231 ++++++++++++++++++ src/QMCWaveFunctions/tests/test_MO_spinor.cpp | 96 ++++++++ 4 files changed, 329 insertions(+) create mode 100644 src/QMCWaveFunctions/tests/lcao_spinor_molecule.h5 create mode 100644 src/QMCWaveFunctions/tests/lcao_spinor_molecule_test.py diff --git a/src/QMCWaveFunctions/tests/CMakeLists.txt b/src/QMCWaveFunctions/tests/CMakeLists.txt index 84f91400b..3f88e9f00 100644 --- a/src/QMCWaveFunctions/tests/CMakeLists.txt +++ b/src/QMCWaveFunctions/tests/CMakeLists.txt @@ -23,6 +23,7 @@ set(UTEST_HDF_INPUT6 ${qmcpack_SOURCE_DIR}/src/QMCWaveFunctions/tests/lcao_spino set(UTEST_HDF_INPUT7 ${qmcpack_SOURCE_DIR}/tests/molecules/LiH_ae_MSD/LiH.orbs.h5) set(UTEST_HDF_INPUT8 ${qmcpack_SOURCE_DIR}/tests/molecules/LiH_ae_MSD/LiH.Multidet.h5) set(UTEST_HDF_INPUT9 ${qmcpack_SOURCE_DIR}/tests/converter/test_Bi_dirac/gold.orbs.h5) +set(UTEST_HDF_INPUT10 ${qmcpack_SOURCE_DIR}/src/QMCWaveFunctions/tests/lcao_spinor_molecule.h5) maybe_symlink(${UTEST_HDF_INPUT0} ${UTEST_DIR}/diamondC_1x1x1.pwscf.h5) maybe_symlink(${UTEST_HDF_INPUT1} ${UTEST_DIR}/diamondC_2x1x1.pwscf.h5) @@ -34,6 +35,7 @@ maybe_symlink(${UTEST_HDF_INPUT6} ${UTEST_DIR}/lcao_spinor.h5) maybe_symlink(${UTEST_HDF_INPUT7} ${UTEST_DIR}/LiH.orbs.h5) maybe_symlink(${UTEST_HDF_INPUT8} ${UTEST_DIR}/LiH.Multidet.h5) maybe_symlink(${UTEST_HDF_INPUT9} ${UTEST_DIR}/Bi.orbs.h5) +maybe_symlink(${UTEST_HDF_INPUT10} ${UTEST_DIR}/lcao_spinor_molecule.h5) set(FILES_TO_COPY he_sto3g.wfj.xml diff --git a/src/QMCWaveFunctions/tests/lcao_spinor_molecule.h5 b/src/QMCWaveFunctions/tests/lcao_spinor_molecule.h5 new file mode 100644 index 0000000000000000000000000000000000000000..a7d17330891168084cb8e041ac01571b0c3ffbb0 GIT binary patch literal 22992 zcmeHPO>7%Q6dosSFh!uG1(B9NmtU735oi%9h-TB~$3<1@HYubhdt+~tm7Mj;c0}Oh z3z9h?^;kF=PH;g+2r0eLA|zCPRm3Go2*IrtAoYMCC(6#g_q=%B+4VYgyta2E?c166 zcJ|FTZ{B-5JMmkwiK*Ln?AgKi5e~C$Y|y-<({(&9pN1ML6F-R;A*3awAN0XRi19zh zhOvGK+qcWjot&Ixh9tF56mMg{ZilNVRRpxB5Qt5V$9ae6vDY!Am(b3z$mBwHId2r= zhmV+5vaV;;lw44=nps|u^J+e?6i9w{MOSi(x!2S@RFNIiED+gmtRGrfLc4Uc-%)H( zq6Y`Grw}0ftzfT{NGo^3g~-m5cLBe6x3gUFKH!%h07~cCnd1{7w!P}rD`>A|=l1(X z@y5~#xFER-mM_mjJ^6F}!CJXcc=-1m_F2XP-ah*GrGpwpYRO<$c&tn@ETa zm~jkFJW>Gyp%C8UO$WlBxN2qr~g``ua@KFt*bl8bzp(n9_{x#Jjn0 z{-M+YV@2e&myp{vJ7K8u?mPTAKL+%xvHj-pBhd?3kM#TnyY|$MoyORrqA58wMT}!d z#*ZJKRSHy2$+?1(S7j|WEiWti{8coxG(O?&KR+%1^JV2wCc8K-ujIL;mZh8RTEs?1 zNC;Th#(-{ z+V>8^gQ)lSj#cdUU`@a3`(3Qzx4WRNEH4-fH`y}@Sn;vSd0k1V3SR@_hn~%=eDR0M z$W5qesG%~AI=vZ?e}Pt-KfA39g?B^SVCBs}C%CgS5ukOUc~qb5u#D?H)GKb~YAOL{ zZd9h0n6TzXJKpHjEjJZP@S! z?X5HAGx0rE-HomySed-%MqI0o#!J7~s_Vq7*4x$DjSTK9){Y}YN)5kCROw9x1#3! z95!A=xDNcg2j~)JQ{%^GCylzflczW_8RB{T9-2D!#JuAMdBYMgw&)D`L8zJ^3`e%; z!Cm$k(RWV!cSAS7C_M_mps{orN@Fh~Q z-+LcTxppDd@A-?}FOEFYB(Obxo5(NP_dmG##rY=yNP0i*aaen<`*y!b>q*ya`)6hr zCNs*iq7`sql-7@6o#`;&C1-517{MHlv8W4@-u6z{2EW)} z%@<>jJ3lvq?8;McLD}y*mN%Y;dYaeeC-H_ZC*JMKinrWS#PRmy2JNZjD}+UH!fq!J zC;Ei@sf*7*x5z(iya?r0U-dRnrd0?~oT%Uf&LIsy4;Lakiys2ME;-8;r-5HS3Y5;? 
z=d;&1alXNMT_ly-N+C{gF9nBA6H=fR`TtAL3Iwp4yw)(bNAbiyei4*k-|yTWsdeRm zvs}Bl-tEeY6So>li{r%SI5b?!I_mQdLfn#R=hbHL;I<;h<7>o&EDCzr&3Ftr;_(gQ z!KJLD`|$AUhkrm^kRUYmxD+kWd~h|hV7~iI=dvq0e>IY|#g&YlV+y=2&AqOh%SlXG z)nzSxIAH=DZ|yW)E`#dh@<+7<2W1kL?E&)n9hH^!kK^geO3$0KOZ-w4pU1Nn&d4kL%O=vm-41TywCeQ^_nmJ-C!IjqDJi{ z_>pXgIUk(Mx8Z&&|z zlwGwRj{#hWA!+IS%rq zcf$$N!{N8HZS+3W(TW42$JHoC+@z(yk zTW;}o{uEf!Xs;Pm&|eb)d%TSyF4tyj0a8@sceRY)K}=vkwaN)~Szct-GVT+z%D|^M zI*85I&u57A^t%ltPpjGXa`V62un*@qX#xt>uLmsE0?^_T#0l&FiDPrl52QFze#Kdi p@~6tH&T=&WFU~v5`JLN7yE^DgyJ{aEM*}>J$zuIp<7r>w@gFi?xbOe~ literal 0 HcmV?d00001 diff --git a/src/QMCWaveFunctions/tests/lcao_spinor_molecule_test.py b/src/QMCWaveFunctions/tests/lcao_spinor_molecule_test.py new file mode 100644 index 000000000..17fa96e64 --- /dev/null +++ b/src/QMCWaveFunctions/tests/lcao_spinor_molecule_test.py @@ -0,0 +1,231 @@ +import h5py +import numpy as np +from scipy.special import sph_harm, factorial2 + +def write_h5_file(): + hf = h5py.File('lcao_spinor_molecule.h5','w') + + #atoms + atoms = hf.create_group('atoms') + nat = np.array([2]) + nsp = np.array([1]) + pos = np.array([[0.1,0.2,0.3],[-0.3,-0.2,-0.1]]) + ids = np.array([0,0]) + atoms.create_dataset('number_of_atoms', data=nat) + atoms.create_dataset('number_of_species', data=nsp) + atoms.create_dataset('positions', data=pos) + atoms.create_dataset('species_ids', data=ids) + sp = atoms.create_group('species_0') + + atnum = np.array([1]) + charge = np.array([1]) + core = np.array([1]) + name = "H" + mylen = "S"+str(len(name)) + strList = [name] + asciiList = [n.encode("ascii", "ignore") for n in strList] + sp.create_dataset("atomic_number", data=atnum) + sp.create_dataset("charge", data=charge) + sp.create_dataset("core", data=core) + sp.create_dataset("name", (1,), mylen, asciiList) + + #PBC + pbc = hf.create_group("PBC") + pbc.create_dataset("PBC",(1,), dtype="b1", data=False) + + #application + app = hf.create_group("application") + code = "generic" + mylen = "S"+str(len(code)) + strList = [code] + asciiList = [n.encode("ascii", "ignore") for n in strList] + app.create_dataset("code",(1,), mylen, asciiList) + + #basisset + bs = hf.create_group("basisset") + bs.create_dataset("NbElements", data=np.array([1])) + name="LCAOBSet" + mylen="S"+str(len(name)) + strList=[name] + asciiList=[n.encode("ascii","ignore") for n in strList] + bs.create_dataset("name", (1,), mylen, asciiList) + atbs = bs.create_group("atomicBasisSet0") + + atbs.create_dataset("NbBasisGroups", data=np.array([1])) + mystr = "cartesian" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + atbs.create_dataset("angular",(1,), mylen, asciiList) + mystr = "H" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + atbs.create_dataset("elementType",(1,), mylen, asciiList) + mystr = "Gamess" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + atbs.create_dataset("expandYlm",(1,), mylen, asciiList) + atbs.create_dataset("grid_npts", data=np.array([1001])) + atbs.create_dataset("grid_rf", data=np.array([100])) + atbs.create_dataset("grid_ri", data=np.array([1e-06])) + mystr = "log" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + atbs.create_dataset("grid_type",(1,), mylen, asciiList) + mystr = "Gaussian" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + 
atbs.create_dataset("name",(1,), mylen, asciiList) + mystr = "no" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + atbs.create_dataset("normalized",(1,), mylen, asciiList) + + bg = atbs.create_group("basisGroup0") + bg.create_dataset("NbRadFunc", data=np.array([1])) + bg.create_dataset("l", data=np.array([0])) + bg.create_dataset("n", data=np.array([0])) + mystr = "H00" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + bg.create_dataset("rid",(1,), mylen, asciiList) + mystr = "Gaussian" + mylen = "S"+str(len(mystr)) + strList = [mystr] + asciiList = [n.encode("ascii","ignore") for n in strList] + bg.create_dataset("type",(1,), mylen, asciiList) + rf = bg.create_group("radfunctions") + dr = rf.create_group("DataRad0") + dr.create_dataset("contraction", data=np.array([1.0])) + dr.create_dataset("exponent", data=np.array([2.5])) + + + kpts = hf.create_group("Super_Twist") + kpts.create_dataset("eigenset_0", data=np.array([[0.075, 0.15]])) + kpts.create_dataset("eigenset_0_imag", data=np.array([[0.225, 0.45]])) + kpts.create_dataset("eigenset_1", data=np.array([[-0.12, -0.06]])) + kpts.create_dataset("eigenset_1_imag", data=np.array([[0.48, 0.24]])) + hf.close() + +class cartGauss: + def __init__(self,expt,l=0,i=0,j=0,k=0): + self.expt = expt + self.l = l + self.i = i + self.j = j + self.k = k + assert(i+j+k == l) + def norm(self): + n = (2*self.expt / np.pi)**(3./4.) + n *= np.sqrt(2.**(self.l) / factorial2(2*self.i - 1) / factorial2(2*self.j - 1) / factorial2(2*self.k - 1)) * np.sqrt(2*self.expt)**self.l + return n + def val(self,pos): + r = np.linalg.norm(pos) + norm = self.norm() + return norm *pos[0]**self.i * pos[1]**self.j * pos[2]**self.k * np.exp(-self.expt * r * r) + +def get_reference_values(pos, s): + cs = np.cos(s) + ss = np.sin(s) + eis = cs + 1.j*ss + emis = cs - 1.j*ss + + print("Position: {}".format(pos)) + print("Spin: {}".format(s)) + + g0 = cartGauss(2.5, 0, 0, 0, 0) + g1 = cartGauss(2.5, 0, 0, 0, 0) + + R0 = np.array([0.1,0.2,0.3]) + R1 = np.array([-0.3,-0.2,-0.1]) + + c0 = 0.3 + c1 = 0.6 + + upcoef = (0.25 + 0.75j) + dncoef = (-0.2 + 0.8j) + + dr = 1e-7 + + g0val = g0.val(pos-R0) + g0px = g0.val(pos-(R0 + np.array([dr,0,0]))) + g0mx = g0.val(pos-(R0 - np.array([dr,0,0]))) + g0py = g0.val(pos-(R0 + np.array([0,dr,0]))) + g0my = g0.val(pos-(R0 - np.array([0,dr,0]))) + g0pz = g0.val(pos-(R0 + np.array([0,0,dr]))) + g0mz = g0.val(pos-(R0 - np.array([0,0,dr]))) + + g1val = g1.val(pos-R1) + g1px = g1.val(pos-(R1 + np.array([dr,0,0]))) + g1mx = g1.val(pos-(R1 - np.array([dr,0,0]))) + g1py = g1.val(pos-(R1 + np.array([0,dr,0]))) + g1my = g1.val(pos-(R1 - np.array([0,dr,0]))) + g1pz = g1.val(pos-(R1 + np.array([0,0,dr]))) + g1mz = g1.val(pos-(R1 - np.array([0,0,dr]))) + + #atom 0 + uppx = c0*g0px + c1*g1val + upmx = c0*g0mx + c1*g1val + updx = (uppx - upmx) / (2*dr) + dnpx = c1*g0px + c0*g1val + dnmx = c1*g0mx + c0*g1val + dndx = (dnpx - dnmx) / (2*dr) + uppy = c0*g0py + c1*g1val + upmy = c0*g0my + c1*g1val + updy = (uppy - upmy) / (2*dr) + dnpy = c1*g0py + c0*g1val + dnmy = c1*g0my + c0*g1val + dndy = (dnpy - dnmy) / (2*dr) + uppz = c0*g0pz + c1*g1val + upmz = c0*g0mz + c1*g1val + updz = (uppz - upmz) / (2*dr) + dnpz = c1*g0pz + c0*g1val + dnmz = c1*g0mz + c0*g1val + dndz = (dnpz - dnmz) / (2*dr) + + spdx = upcoef * updx * eis + dncoef * dndx * emis + spdy = upcoef * updy * eis + dncoef * dndy * emis + spdz = upcoef * updz * eis + dncoef * dndz * emis + + 
print("grad atom 0: {}, {}, {}".format(spdx, spdy, spdz)) + + #atom 1 + uppx = c0*g0val + c1*g1px + upmx = c0*g0val + c1*g1mx + updx = (uppx - upmx) / (2*dr) + dnpx = c1*g0val + c0*g1px + dnmx = c1*g0val + c0*g1mx + dndx = (dnpx - dnmx) / (2*dr) + uppy = c0*g0val + c1*g1py + upmy = c0*g0val + c1*g1my + updy = (uppy - upmy) / (2*dr) + dnpy = c1*g0val + c0*g1py + dnmy = c1*g0val + c0*g1my + dndy = (dnpy - dnmy) / (2*dr) + uppz = c0*g0val + c1*g1pz + upmz = c0*g0val + c1*g1mz + updz = (uppz - upmz) / (2*dr) + dnpz = c1*g0val + c0*g1pz + dnmz = c1*g0val + c0*g1mz + dndz = (dnpz - dnmz) / (2*dr) + + spdx = upcoef * updx * eis + dncoef * dndx * emis + spdy = upcoef * updy * eis + dncoef * dndy * emis + spdz = upcoef * updz * eis + dncoef * dndz * emis + + print("grad atom 1: {}, {}, {}".format(spdx, spdy, spdz)) + + + +if __name__ == "__main__": + write_h5_file() + pos = np.array([0.01, -0.02, 0.03]) + s = 0.6 + get_reference_values(pos, s) + diff --git a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp index d859d4a50..008be5854 100644 --- a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp +++ b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp @@ -673,8 +673,104 @@ void test_lcao_spinor_excited() } } +void test_lcao_spinor_ion_derivs() +{ + app_log() << "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"; + app_log() << "!!!! LCAO SpinorSet from HDF (ion derivatives) !!!!\n"; + app_log() << "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"; + + using ValueType = SPOSet::ValueType; + using RealType = SPOSet::RealType; + Communicate* c = OHMMS::Controller; + + ParticleSetPool ptcl = ParticleSetPool(c); + auto ions_uptr = std::make_unique(ptcl.getSimulationCell()); + auto elec_uptr = std::make_unique(ptcl.getSimulationCell()); + ParticleSet& ions_(*ions_uptr); + ParticleSet& elec_(*elec_uptr); + + ions_.setName("ion"); + ptcl.addParticleSet(std::move(ions_uptr)); + ions_.create({2}); + + ions_.R[0][0] = 0.10000000; + ions_.R[0][1] = 0.20000000; + ions_.R[0][2] = 0.30000000; + ions_.R[1][0] = -0.30000000; + ions_.R[1][1] = -0.20000000; + ions_.R[1][2] = -0.10000000; + SpeciesSet& ispecies = ions_.getSpeciesSet(); + int hIdx = ispecies.addSpecies("H"); + ions_.update(); + + elec_.setName("elec"); + ptcl.addParticleSet(std::move(elec_uptr)); + elec_.create({1}); + elec_.R[0][0] = 0.01; + elec_.R[0][1] = -0.02; + elec_.R[0][2] = 0.03; + elec_.spins[0] = 0.6; + elec_.setSpinor(true); + + SpeciesSet& tspecies = elec_.getSpeciesSet(); + int upIdx = tspecies.addSpecies("u"); + int chargeIdx = tspecies.addAttribute("charge"); + tspecies(chargeIdx, upIdx) = -1; + + + elec_.addTable(ions_); + elec_.update(); + + const char* particles = " \ + \ + \ + \ + \ + \ +"; + + Libxml2Document doc; + bool okay = doc.parseFromString(particles); + REQUIRE(okay); + + xmlNodePtr root = doc.getRoot(); + + xmlNodePtr bnode = xmlFirstElementChild(root); + SPOSetBuilderFactory fac(c, elec_, ptcl.getPool()); + const auto spo_builder_ptr = fac.createSPOSetBuilder(bnode); + auto& bb = *spo_builder_ptr; + + // only pick up the last sposet + std::unique_ptr spo; + processChildren(bnode, [&](const std::string& cname, const xmlNodePtr element) { + if (cname == "sposet") + spo = bb.createSPOSet(element); + }); + REQUIRE(spo); + + //reference values from finite difference in lcao_spinor_molecule_test.py + ValueType dx0(-0.0492983, -0.3192778); + ValueType dy0(-0.1205071, -0.7804567); + ValueType dz0(-0.1478950, -0.9578333); + ValueType dx1(-0.0676367, 1.0506422); + ValueType 
dy1(-0.0392729, 0.6100503); + ValueType dz1(-0.0283638, 0.4405919); + + const RealType eps = 1e-4; + SPOSet::GradMatrix gradIon(elec_.R.size(), spo->getOrbitalSetSize()); + spo->evaluateGradSource(elec_, 0, elec_.R.size(), ions_, 0, gradIon); + CHECK(gradIon[0][0][0] == ComplexApprox(dx0).epsilon(eps)); + CHECK(gradIon[0][0][1] == ComplexApprox(dy0).epsilon(eps)); + CHECK(gradIon[0][0][2] == ComplexApprox(dz0).epsilon(eps)); + spo->evaluateGradSource(elec_, 0, elec_.R.size(), ions_, 1, gradIon); + CHECK(gradIon[0][0][0] == ComplexApprox(dx1).epsilon(eps)); + CHECK(gradIon[0][0][1] == ComplexApprox(dy1).epsilon(eps)); + CHECK(gradIon[0][0][2] == ComplexApprox(dz1).epsilon(eps)); +} + TEST_CASE("ReadMolecularOrbital GTO spinor", "[wavefunction]") { test_lcao_spinor(); } TEST_CASE("ReadMolecularOrbital GTO spinor with excited", "[wavefunction]") { test_lcao_spinor_excited(); } +TEST_CASE("spinor ion derivatives for molecule", "[wavefunction]") { test_lcao_spinor_ion_derivs(); } } // namespace qmcplusplus From bdf902f73bf85a5ab18153a20d8087b57c2b553b Mon Sep 17 00:00:00 2001 From: camelto2 Date: Fri, 22 Apr 2022 15:59:26 -0600 Subject: [PATCH 07/26] clang --- src/QMCWaveFunctions/SpinorSet.cpp | 8 ++++---- src/QMCWaveFunctions/tests/test_MO_spinor.cpp | 12 ++++++------ 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/src/QMCWaveFunctions/SpinorSet.cpp b/src/QMCWaveFunctions/SpinorSet.cpp index dc9825e82..7609800b5 100644 --- a/src/QMCWaveFunctions/SpinorSet.cpp +++ b/src/QMCWaveFunctions/SpinorSet.cpp @@ -419,9 +419,9 @@ void SpinorSet::evaluate_spin(const ParticleSet& P, int iat, ValueVector& psi, V dpsi = eye * (eis * psi_work_up - emis * psi_work_down); } -void SpinorSet::evaluateGradSource(const ParticleSet& P, +void SpinorSet::evaluateGradSource(const ParticleSet& P, int first, - int last, + int last, const ParticleSet& source, int iat_src, GradMatrix& gradphi) @@ -436,8 +436,8 @@ void SpinorSet::evaluateGradSource(const ParticleSet& P, for (int iat = 0; iat < nelec; iat++) { ParticleSet::Scalar_t s = P.activeSpin(iat); - RealType coss = std::cos(s); - RealType sins = std::sin(s); + RealType coss = std::cos(s); + RealType sins = std::sin(s); ValueType eis(coss, sins); ValueType emis(coss, -sins); for (int imo = 0; imo < OrbitalSetSize; imo++) diff --git a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp index 008be5854..2fa966bb2 100644 --- a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp +++ b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp @@ -747,23 +747,23 @@ void test_lcao_spinor_ion_derivs() spo = bb.createSPOSet(element); }); REQUIRE(spo); - + //reference values from finite difference in lcao_spinor_molecule_test.py ValueType dx0(-0.0492983, -0.3192778); ValueType dy0(-0.1205071, -0.7804567); ValueType dz0(-0.1478950, -0.9578333); - ValueType dx1(-0.0676367, 1.0506422); - ValueType dy1(-0.0392729, 0.6100503); - ValueType dz1(-0.0283638, 0.4405919); + ValueType dx1(-0.0676367, 1.0506422); + ValueType dy1(-0.0392729, 0.6100503); + ValueType dz1(-0.0283638, 0.4405919); const RealType eps = 1e-4; SPOSet::GradMatrix gradIon(elec_.R.size(), spo->getOrbitalSetSize()); spo->evaluateGradSource(elec_, 0, elec_.R.size(), ions_, 0, gradIon); - CHECK(gradIon[0][0][0] == ComplexApprox(dx0).epsilon(eps)); + CHECK(gradIon[0][0][0] == ComplexApprox(dx0).epsilon(eps)); CHECK(gradIon[0][0][1] == ComplexApprox(dy0).epsilon(eps)); CHECK(gradIon[0][0][2] == ComplexApprox(dz0).epsilon(eps)); spo->evaluateGradSource(elec_, 0, elec_.R.size(), ions_, 
1, gradIon); - CHECK(gradIon[0][0][0] == ComplexApprox(dx1).epsilon(eps)); + CHECK(gradIon[0][0][0] == ComplexApprox(dx1).epsilon(eps)); CHECK(gradIon[0][0][1] == ComplexApprox(dy1).epsilon(eps)); CHECK(gradIon[0][0][2] == ComplexApprox(dz1).epsilon(eps)); } From e41530d3536c205ac32b18bdb42f3037e6c55505 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Tue, 19 Apr 2022 18:33:40 -0500 Subject: [PATCH 08/26] Default CUDA host compiler as CXX compiler. 1) Require CUDA 11 2) -allow-unsupported-compiler by default. --- CMakeLists.txt | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index e2edef0ad..abb5f6308 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -715,8 +715,12 @@ if(QMC_CUDA OR ENABLE_CUDA) endif() set(CMAKE_CUDA_STANDARD_REQUIRED TRUE) set(CMAKE_CUDA_EXTENSIONS OFF) + set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER} CACHE STRING "nvcc host compiler passed via -ccbin") + if(NOT CMAKE_CUDA_FLAGS MATCHES "allow-unsupported-compiler") + set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --allow-unsupported-compiler") + endif() enable_language(CUDA) - find_package(CUDAToolkit REQUIRED) + find_package(CUDAToolkit 11.0 REQUIRED) if(NOT TARGET CUDA::cublas) message( FATAL_ERROR From 41956a9a535bea508b8a808a2df3ca4dbb645dca Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Fri, 22 Apr 2022 17:05:46 -0500 Subject: [PATCH 09/26] Update QE pw2qmcpack install guide --- docs/installation.rst | 38 ++++++++++++++++++++++++++++++-------- 1 file changed, 30 insertions(+), 8 deletions(-) diff --git a/docs/installation.rst b/docs/installation.rst index 0f67110b3..78e7616dd 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -1777,8 +1777,8 @@ for the creation of projectors in UPF can introduce severe errors and inaccuraci .. _buildqe: -Installing and patching Quantum ESPRESSO ----------------------------------------- +Installing Quantum ESPRESSO with pw2qmcpack +------------------------------------------- For trial wavefunctions obtained in a plane-wave basis, we mainly support QE. Note that ABINIT and QBox were supported historically @@ -1789,6 +1789,10 @@ QE stores wavefunctions in a nonstandard internal we have developed a converter---pw2qmcpack---which is an add-on to the QE distribution. + +Quantum ESPRESSO (<=6.8) +~~~~~~~~~~~~~~~~~~~~~~~~ + To simplify the process of patching QE we have developed a script that will automatically download and patch the source code. The patches are specific to each version. For example, to download and @@ -1821,15 +1825,33 @@ the HDF5 capability enabled in either way: The complete process is described in external\_codes/quantum\_espresso/README. -The tests involving pw.x and pw2qmcpack.x have been integrated into the test suite of QMCPACK. -By adding ``-D QE_BIN=your_QE_binary_path`` in the CMake command line when building your QMCPACK, -tests named with the "qe-" prefix will be included in the test set of your build. -You can test the whole ``pw > pw2qmcpack > qmcpack workflow`` by +Quantum ESPRESSO (6.7, 6.8 and 7.0) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +After patching the QE source code like above, users may use CMake instead of configure to build QE with pw2qmcpack. +Options needed to enable pw2qmcpack have been set ON by default. +A HDF5 library installation with Fortran support is required. -:: + :: - ctest -R qe + mkdir build_mpi + cd build_mpi + cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 .. 
+ make -j 16 +Quantum ESPRESSO (>7.0) +~~~~~~~~~~~~~~~~~~~~~~~ +There is no more need of patching QE. Users may use upstream QE. Full QE CMake documentation can be found at +https://gitlab.com/QEF/q-e/-/wikis/Developers/CMake-build-system . + :: + + mkdir build_mpi + cd build_mpi + cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 -DQE_ENABLE_PLUGINS=pw2qmcpack .. + make -j 16 + +Post QE installation +~~~~~~~~~~~~~~~~~~~~ +Testing QE to QMCPACK workflow after building QE and QMCPACK is highly recommended. See :ref:`integtestqe` and the testing section for more details. .. _buildperformance: From 6e79dd3337c0895ceac704f42b1231a180d3204d Mon Sep 17 00:00:00 2001 From: "Paul R. C. Kent" Date: Fri, 22 Apr 2022 19:13:48 -0400 Subject: [PATCH 10/26] Polish text --- docs/installation.rst | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/docs/installation.rst b/docs/installation.rst index 78e7616dd..ef3507dff 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -1777,8 +1777,8 @@ for the creation of projectors in UPF can introduce severe errors and inaccuraci .. _buildqe: -Installing Quantum ESPRESSO with pw2qmcpack -------------------------------------------- +Installing Quantum ESPRESSO and pw2qmcpack +------------------------------------------ For trial wavefunctions obtained in a plane-wave basis, we mainly support QE. Note that ABINIT and QBox were supported historically @@ -1840,8 +1840,12 @@ A HDF5 library installation with Fortran support is required. Quantum ESPRESSO (>7.0) ~~~~~~~~~~~~~~~~~~~~~~~ -There is no more need of patching QE. Users may use upstream QE. Full QE CMake documentation can be found at +Due to incorporation of pw2qmcpack as a plugin, there is no longer any need to patch QE. +Users may use upstream QE and activate the plugin by specifying ``-DQE_ENABLE_PLUGINS=pw2qmcpack`` at the CMake configure step. +Full QE CMake documentation can be found at https://gitlab.com/QEF/q-e/-/wikis/Developers/CMake-build-system . +A HDF5 library installation with Fortran support is required. + :: mkdir build_mpi @@ -1849,9 +1853,9 @@ https://gitlab.com/QEF/q-e/-/wikis/Developers/CMake-build-system . cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 -DQE_ENABLE_PLUGINS=pw2qmcpack .. make -j 16 -Post QE installation -~~~~~~~~~~~~~~~~~~~~ -Testing QE to QMCPACK workflow after building QE and QMCPACK is highly recommended. +Testing QE after installation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Testing the QE to QMCPACK workflow after building QE and QMCPACK is highly recommended. See :ref:`integtestqe` and the testing section for more details. .. _buildperformance: From 002d2bc7358a31d671fc608a1c039e6f1b375ade Mon Sep 17 00:00:00 2001 From: "Paul R. C. Kent" Date: Fri, 22 Apr 2022 19:21:49 -0400 Subject: [PATCH 11/26] no hdf5 with plugin --- docs/installation.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/installation.rst b/docs/installation.rst index ef3507dff..872779a59 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -1844,7 +1844,6 @@ Due to incorporation of pw2qmcpack as a plugin, there is no longer any need to p Users may use upstream QE and activate the plugin by specifying ``-DQE_ENABLE_PLUGINS=pw2qmcpack`` at the CMake configure step. Full QE CMake documentation can be found at https://gitlab.com/QEF/q-e/-/wikis/Developers/CMake-build-system . -A HDF5 library installation with Fortran support is required. 
:: From a3ca8c2abf12726fb1c6baf739e297589f6169dd Mon Sep 17 00:00:00 2001 From: camelto2 Date: Fri, 22 Apr 2022 18:08:01 -0600 Subject: [PATCH 12/26] modernize xml --- src/QMCWaveFunctions/tests/test_MO_spinor.cpp | 48 +++++++++---------- 1 file changed, 22 insertions(+), 26 deletions(-) diff --git a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp index 2fa966bb2..354bea6d4 100644 --- a/src/QMCWaveFunctions/tests/test_MO_spinor.cpp +++ b/src/QMCWaveFunctions/tests/test_MO_spinor.cpp @@ -67,13 +67,12 @@ void test_lcao_spinor() elec_.addTable(ions_); elec_.update(); - const char* particles = " \ - \ - \ - \ - \ - \ -"; + const char* particles = R"XML( + + + + + )XML"; Libxml2Document doc; bool okay = doc.parseFromString(particles); @@ -390,17 +389,16 @@ void test_lcao_spinor_excited() elec_.addTable(ions_); elec_.update(); - const char* particles = " \ - \ - \ - \ - \ - -1 2 \ - \ - \ - \ - \ -"; + const char* particles = R"XML( + + + + + -1 2 + + + + )XML"; Libxml2Document doc; bool okay = doc.parseFromString(particles); @@ -721,13 +719,12 @@ void test_lcao_spinor_ion_derivs() elec_.addTable(ions_); elec_.update(); - const char* particles = " \ - \ - \ - \ - \ - \ -"; + const char* particles = R"XML( + + + + + )XML"; Libxml2Document doc; bool okay = doc.parseFromString(particles); @@ -768,7 +765,6 @@ void test_lcao_spinor_ion_derivs() CHECK(gradIon[0][0][2] == ComplexApprox(dz1).epsilon(eps)); } - TEST_CASE("ReadMolecularOrbital GTO spinor", "[wavefunction]") { test_lcao_spinor(); } TEST_CASE("ReadMolecularOrbital GTO spinor with excited", "[wavefunction]") { test_lcao_spinor_excited(); } TEST_CASE("spinor ion derivatives for molecule", "[wavefunction]") { test_lcao_spinor_ion_derivs(); } From 01e631f757dbe920545cadcd3fec9a3b24e3f630 Mon Sep 17 00:00:00 2001 From: Paul Kent Date: Sun, 24 Apr 2022 12:12:28 -0400 Subject: [PATCH 13/26] CUDA minver in docs --- CHANGELOG.md | 4 ++++ README.md | 3 ++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 2c924a6c6..ccf41247a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,10 @@ Notable changes to QMCPACK are documented in this file. +## [Unreleased] + +- Minimum CUDA version increased to 11.0 [\#3957](https://github.com/QMCPACK/qmcpack/pull/3957) + ## [3.14.0] - 2022-04-06 This release focuses on performance improvements to the OpenMP target offload version for GPUs as well as ongoing minor diff --git a/README.md b/README.md index c0960973c..256150501 100644 --- a/README.md +++ b/README.md @@ -24,13 +24,14 @@ particular emphasis is placed on code quality and reproducibility. * C++ 17 and C99 capable compilers. * CMake v3.15.0 or later, build utility, http://www.cmake.org - * BLAS/LAPACK, numerical library. Use platform-optimized libraries. + * BLAS/LAPACK, numerical library. Use vendor and platform-optimized libraries. * LibXml2, XML parser, http://xmlsoft.org/ * HDF5, portable I/O library, http://www.hdfgroup.org/HDF5/ * BOOST v1.61.0 or newer, peer-reviewed portable C++ source libraries, http://www.boost.org * FFTW, FFT library, http://www.fftw.org/ * MPI, parallel library. Optional, but a near requirement for production calculations. * Python3. Older versions are not supported as of January 2020. + * CUDA v11.0 or later. Optional, but required for builds with NVIDIA GPU support. We aim to support open source compilers and libraries released within two years of each QMCPACK release. 
Use of software versions over two years old may work but is discouraged and untested. Proprietary compilers (Intel, NVHPC) are generally supported over the From f049684947aaa3bb69358d70cbef488660edf8ba Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sat, 23 Apr 2022 12:35:20 -0500 Subject: [PATCH 14/26] Remove dead CUDA code. --- src/Platforms/CUDA/CUDAallocator.hpp | 11 --- src/Platforms/tests/CUDA/CMakeLists.txt | 7 -- .../tests/CUDA/test_device_value_kernels.cu | 85 ------------------- .../tests/CUDA/test_device_value_kernels.hpp | 38 --------- .../tests/test_DiracMatrixComputeCUDA.cpp | 1 - 5 files changed, 142 deletions(-) delete mode 100644 src/Platforms/tests/CUDA/test_device_value_kernels.cu delete mode 100644 src/Platforms/tests/CUDA/test_device_value_kernels.hpp diff --git a/src/Platforms/CUDA/CUDAallocator.hpp b/src/Platforms/CUDA/CUDAallocator.hpp index 450add33f..a964dc9fe 100644 --- a/src/Platforms/CUDA/CUDAallocator.hpp +++ b/src/Platforms/CUDA/CUDAallocator.hpp @@ -188,17 +188,6 @@ struct qmc_allocator_traits> static const bool is_host_accessible = false; static const bool is_dual_space = false; static void fill_n(T* ptr, size_t n, const T& value) { qmcplusplus::CUDAfill_n(ptr, n, value); } - static void updateTo(CUDAAllocator& alloc, T* host_ptr, size_t n) - { - T* device_ptr = alloc.getDevicePtr(host_ptr); - copyToDevice(device_ptr, host_ptr, n); - } - - static void updateFrom(CUDAAllocator& alloc, T* host_ptr, size_t n) - { - T* device_ptr = alloc.getDevicePtr(host_ptr); - copyFromDevice(host_ptr, device_ptr, n); - } }; /** allocator for CUDA host pinned memory diff --git a/src/Platforms/tests/CUDA/CMakeLists.txt b/src/Platforms/tests/CUDA/CMakeLists.txt index 56c1ebdd5..96df02abe 100644 --- a/src/Platforms/tests/CUDA/CMakeLists.txt +++ b/src/Platforms/tests/CUDA/CMakeLists.txt @@ -13,13 +13,6 @@ set(SRC_DIR CUDA) set(UTEST_EXE test_${SRC_DIR}) set(UTEST_NAME deterministic-unit_test_${SRC_DIR}) -if(NOT QMC_CUDA2HIP) - add_library(cuda_device_value_test_kernels test_device_value_kernels.cu) -else() - hip_add_library(cuda_device_value_test_kernels test_device_value_kernels.cu) -endif() -target_link_libraries(cuda_device_value_test_kernels PUBLIC platform_runtime) - add_executable(${UTEST_EXE} test_CUDAallocator.cpp) target_link_libraries(${UTEST_EXE} platform_runtime containers catch_main) diff --git a/src/Platforms/tests/CUDA/test_device_value_kernels.cu b/src/Platforms/tests/CUDA/test_device_value_kernels.cu deleted file mode 100644 index a27526399..000000000 --- a/src/Platforms/tests/CUDA/test_device_value_kernels.cu +++ /dev/null @@ -1,85 +0,0 @@ -////////////////////////////////////////////////////////////////////////////////////// -// This file is distributed under the University of Illinois/NCSA Open Source License. -// See LICENSE file in top directory for details. -// -// Copyright (c) 2021 QMCPACK developers. 
-// -// File developed by: Peter Doak, doakpw@ornl.gov, Oak Ridge National Laboratory -// -// File created by: Peter Doak, doakpw@ornl.gov, Oak Ridge National Laboratory -////////////////////////////////////////////////////////////////////////////////////// - -#include "test_device_value_kernels.hpp" -#include "CUDA/CUDAallocator.hpp" - -namespace qmcplusplus -{ -namespace testing -{ - -/** checking an on device pointer's value against a passed value - */ -template -__global__ void checkValue_kernel(T* device_value_ptr, const T value, bool* result) -{ - if (*device_value_ptr == value) - *result = true; - else - *result = false; -} - -/** checking an on device pointer's value against a passed value - */ -template -cudaError_t checkValueCUDA(cudaStream_t hstream, T* device_value_ptr, T value, bool& result) -{ - CUDAAllocator bool_allocator; - CUDAHostAllocator host_bool_allocator; - bool* device_result = bool_allocator.allocate(1); - bool* host_result = host_bool_allocator.allocate(1); - dim3 dim_block(1); - dim3 dim_grid(1); - checkValue_kernel<<>>(device_value_ptr, value, device_result); - cudaCheck(cudaStreamSynchronize(hstream)); - cudaError_t kernel_error = cudaPeekAtLastError(); - cudaCheck(cudaMemcpyAsync(host_result, device_result, sizeof(bool), cudaMemcpyDeviceToHost, hstream)); - cudaCheck(cudaStreamSynchronize(hstream)); - result = *host_result; - bool_allocator.deallocate(device_result, 1); - host_bool_allocator.deallocate(host_result, 1); - return kernel_error; -} - -__global__ void checkDualStruct_kernel(DualStruct* device_struct_ptr, const DualStruct dual_struct, bool* result) -{ - if (device_struct_ptr->index == dual_struct.index && device_struct_ptr->value == dual_struct.value) - *result = true; - else - *result = false; -} - -/** check a particular test structure at device pointer against passed by value struct - */ -cudaError_t checkDualStruct(cudaStream_t hstream, DualStruct* device_struct_ptr, DualStruct dual_struct, bool& result) -{ - CUDAAllocator bool_allocator; - CUDAHostAllocator host_bool_allocator; - bool* device_result = bool_allocator.allocate(1); - bool* host_result = host_bool_allocator.allocate(1); - dim3 dim_block(1); - dim3 dim_grid(1); - checkDualStruct_kernel<<>>(device_struct_ptr, dual_struct, device_result); - cudaCheck(cudaStreamSynchronize(hstream)); - cudaError_t kernel_error = cudaPeekAtLastError(); - cudaCheck(cudaMemcpyAsync(host_result, device_result, sizeof(bool), cudaMemcpyDeviceToHost, hstream)); - cudaCheck(cudaStreamSynchronize(hstream)); - result = *host_result; - bool_allocator.deallocate(device_result, 1); - host_bool_allocator.deallocate(host_result, 1); - return kernel_error; -} - -template cudaError_t checkValueCUDA(cudaStream_t hstream, double* device_value_ptr, double value, bool& result); - -} // namespace testing -} // namespace qmcplusplus diff --git a/src/Platforms/tests/CUDA/test_device_value_kernels.hpp b/src/Platforms/tests/CUDA/test_device_value_kernels.hpp deleted file mode 100644 index be1a266be..000000000 --- a/src/Platforms/tests/CUDA/test_device_value_kernels.hpp +++ /dev/null @@ -1,38 +0,0 @@ -////////////////////////////////////////////////////////////////////////////////////// -// This file is distributed under the University of Illinois/NCSA Open Source License. -// See LICENSE file in top directory for details. -// -// Copyright (c) 2021 QMCPACK developers. 
-// -// File developed by: Peter Doak, doakpw@ornl.gov, Oak Ridge National Laboratory -// -// File created by: Peter Doak, doakpw@ornl.gov, Oak Ridge National Laboratory -////////////////////////////////////////////////////////////////////////////////////// - -#ifndef QMCPLUSPLUS_TEST_DEVICE_VALUES_KERNELS_HPP -#define QMCPLUSPLUS_TEST_DEVICE_VALUES_KERNELS_HPP - -#include "CUDA/CUDAruntime.hpp" - -namespace qmcplusplus -{ -namespace testing -{ - -template -cudaError_t checkValueCUDA(cudaStream_t hstream, T* device_value_ptr, T value, bool& result); - -/** just an arbitrary struct for testing */ -struct DualStruct -{ - int index; - double value; -}; -cudaError_t checkDualStruct(cudaStream_t hstream, DualStruct* device_struct_ptr, DualStruct dual_struct, bool& result); - -extern template cudaError_t checkValueCUDA(cudaStream_t hstream, double* device_value_ptr, double value, bool& result); - -} // namespace testing -} // namespace qmcplusplus - -#endif diff --git a/src/QMCWaveFunctions/tests/test_DiracMatrixComputeCUDA.cpp b/src/QMCWaveFunctions/tests/test_DiracMatrixComputeCUDA.cpp index 327b3f1a8..9b7c15daf 100644 --- a/src/QMCWaveFunctions/tests/test_DiracMatrixComputeCUDA.cpp +++ b/src/QMCWaveFunctions/tests/test_DiracMatrixComputeCUDA.cpp @@ -21,7 +21,6 @@ #include "Utilities/for_testing/RandomForTest.h" #include "Platforms/DualAllocatorAliases.hpp" #include "Platforms/CUDA/CUDALinearAlgebraHandles.h" -#include "Platforms/tests/CUDA/test_device_value_kernels.hpp" // Legacy CPU inversion for temporary testing #include "QMCWaveFunctions/Fermion/DiracMatrix.h" From 073f2af2f943d0f6ff1b9596ded713e7f0d65810 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sat, 23 Apr 2022 14:12:44 -0500 Subject: [PATCH 15/26] Add SYCLallocator. --- src/Platforms/DualAllocator.hpp | 2 + src/Platforms/DualAllocatorAliases.hpp | 32 ++- src/Platforms/PinnedAllocator.h | 12 +- src/Platforms/SYCL/CMakeLists.txt | 2 +- src/Platforms/SYCL/SYCLallocator.cpp | 19 ++ src/Platforms/SYCL/SYCLallocator.hpp | 260 +++++++++++++++++++++++++ 6 files changed, 313 insertions(+), 14 deletions(-) create mode 100644 src/Platforms/SYCL/SYCLallocator.cpp create mode 100644 src/Platforms/SYCL/SYCLallocator.hpp diff --git a/src/Platforms/DualAllocator.hpp b/src/Platforms/DualAllocator.hpp index 3327aa9c9..f06b732bf 100644 --- a/src/Platforms/DualAllocator.hpp +++ b/src/Platforms/DualAllocator.hpp @@ -23,6 +23,8 @@ #include "PinnedAllocator.h" #if defined(ENABLE_CUDA) #include "CUDA/CUDAallocator.hpp" +#elif defined(ENABLE_SYCL) +#include "SYCL/SYCLallocator.hpp" #endif namespace qmcplusplus diff --git a/src/Platforms/DualAllocatorAliases.hpp b/src/Platforms/DualAllocatorAliases.hpp index 62b9fd297..291aae51c 100644 --- a/src/Platforms/DualAllocatorAliases.hpp +++ b/src/Platforms/DualAllocatorAliases.hpp @@ -23,16 +23,7 @@ #include "PinnedAllocator.h" -#if defined(ENABLE_CUDA) && !defined(ENABLE_OFFLOAD) -#include "DualAllocator.hpp" -namespace qmcplusplus -{ - template - using UnpinnedDualAllocator = DualAllocator, aligned_allocator>; - template - using PinnedDualAllocator = DualAllocator, PinnedAlignedAllocator>; -} -#else +#if defined(ENABLE_OFFLOAD) #include "OMPTarget/OffloadAlignedAllocators.hpp" namespace qmcplusplus { @@ -41,6 +32,27 @@ namespace qmcplusplus template using PinnedDualAllocator = OffloadPinnedAllocator; } +#else +#include "DualAllocator.hpp" +#if defined(ENABLE_CUDA) +namespace qmcplusplus +{ + template + using UnpinnedDualAllocator = DualAllocator, aligned_allocator>; + template + using PinnedDualAllocator = 
DualAllocator, PinnedAlignedAllocator>; +} +#elif defined(ENABLE_SYCL) +namespace qmcplusplus +{ + template + using UnpinnedDualAllocator = DualAllocator, aligned_allocator>; + template + using PinnedDualAllocator = DualAllocator, PinnedAlignedAllocator>; +} +#else +#error unhandled platform +#endif #endif #endif diff --git a/src/Platforms/PinnedAllocator.h b/src/Platforms/PinnedAllocator.h index fa12a763c..331f738ce 100644 --- a/src/Platforms/PinnedAllocator.h +++ b/src/Platforms/PinnedAllocator.h @@ -15,8 +15,10 @@ #include #include "CPU/SIMD/aligned_allocator.hpp" -#ifdef ENABLE_CUDA +#if defined(ENABLE_CUDA) #include "CUDA/CUDAallocator.hpp" +#elif defined(ENABLE_SYCL) +#include "SYCL/SYCLallocator.hpp" #endif namespace qmcplusplus @@ -24,15 +26,19 @@ namespace qmcplusplus /** The fact that the pinned allocators are not always pinned hurts readability elsewhere. */ template -#ifdef ENABLE_CUDA +#if defined(ENABLE_CUDA) using PinnedAllocator = CUDALockedPageAllocator; +#elif defined(ENABLE_SYCL) +using PinnedAllocator = SYCLHostAllocator; #else using PinnedAllocator = std::allocator; #endif template -#ifdef ENABLE_CUDA +#if defined(ENABLE_CUDA) using PinnedAlignedAllocator = CUDALockedPageAllocator>; +#elif defined(ENABLE_SYCL) +using PinnedAlignedAllocator = SYCLHostAllocator; #else using PinnedAlignedAllocator = aligned_allocator; #endif diff --git a/src/Platforms/SYCL/CMakeLists.txt b/src/Platforms/SYCL/CMakeLists.txt index 82b20f181..46b7692bc 100644 --- a/src/Platforms/SYCL/CMakeLists.txt +++ b/src/Platforms/SYCL/CMakeLists.txt @@ -10,7 +10,7 @@ #////////////////////////////////////////////////////////////////////////////////////// -set(SYCL_RT_SRCS SYCLDeviceManager.cpp) +set(SYCL_RT_SRCS SYCLDeviceManager.cpp SYCLallocator.cpp) add_library(platform_sycl_runtime ${SYCL_RT_SRCS}) target_link_libraries(platform_sycl_runtime PUBLIC SYCL::host diff --git a/src/Platforms/SYCL/SYCLallocator.cpp b/src/Platforms/SYCL/SYCLallocator.cpp new file mode 100644 index 000000000..431979c95 --- /dev/null +++ b/src/Platforms/SYCL/SYCLallocator.cpp @@ -0,0 +1,19 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2021 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + + +#include +#include + +namespace qmcplusplus +{ + std::atomic SYCLallocator_device_mem_allocated(0); +} diff --git a/src/Platforms/SYCL/SYCLallocator.hpp b/src/Platforms/SYCL/SYCLallocator.hpp new file mode 100644 index 000000000..7cd082bc6 --- /dev/null +++ b/src/Platforms/SYCL/SYCLallocator.hpp @@ -0,0 +1,260 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2019 QMCPACK developers. 
+// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// +// -*- C++ -*- +/** @file SYCLallocator.hpp + * this file provides three C++ memory allocators using SYCL specific memory allocation functions. + * + * SYCLManagedAllocator allocates SYCL shared memory + * SYCLAllocator allocates SYCL device memory + * SYCLHostAllocator allocates SYCL host memory + * They are based on CUDA*Allocator implementation + */ +#ifndef QMCPLUSPLUS_SYCL_ALLOCATOR_H +#define QMCPLUSPLUS_SYCL_ALLOCATOR_H + +#include +#include +#include +#include +#include +#include +#include "allocator_traits.hpp" +#include "DeviceManager.h" + +namespace qmcplusplus +{ +extern sycl::queue* get_default_queue(); + +extern std::atomic SYCLallocator_device_mem_allocated; + +inline size_t getSYCLdeviceMemAllocated() { return SYCLallocator_device_mem_allocated; } + +/** allocator for SYCL shared memory + * @tparm T data type + * @tparm ALIGN alignment in bytes + */ +template +struct SYCLSharedAllocator +{ + typedef T value_type; + typedef size_t size_type; + typedef T* pointer; + typedef const T* const_pointer; + + static constexpr size_t alignment = ALIGN; + + SYCLSharedAllocator() = default; + template + SYCLSharedAllocator(const SYCLSharedAllocator&) + {} + + template + struct rebind + { + typedef SYCLSharedAllocator other; + }; + + T* allocate(std::size_t n) + { + T* pt= sycl::aligned_alloc_shared(ALIGN, n, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + return pt; + } + void deallocate(T* p, std::size_t) { + sycl::free(p, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + } +}; + +template +bool operator==(const SYCLSharedAllocator&, const SYCLSharedAllocator&) +{ + return true; +} +template +bool operator!=(const SYCLSharedAllocator&, const SYCLSharedAllocator&) +{ + return false; +} + +/** allocator for SYCL device memory + * @tparm T data type + * @tparm ALIGN alignment in bytes + * + * using this with something other than Ohmms containers? + * -- use caution, write unit tests! -- + * It's not tested beyond use in some unit tests using std::vector with constant size. + * SYCLAllocator appears to meet all the nonoptional requirements of a c++ Allocator. + * + * Some of the default implementations in std::allocator_traits + * of optional Allocator requirements may cause runtime or compilation failures. + * They assume there is only one memory space and that the host has access to it. + */ +template +class SYCLAllocator +{ +public: + typedef T value_type; + typedef size_t size_type; + typedef T* pointer; + typedef const T* const_pointer; + + static constexpr size_t alignment = ALIGN; + + SYCLAllocator() = default; + template + SYCLAllocator(const SYCLAllocator&) + {} + + template + struct rebind + { + typedef SYCLAllocator other; + }; + + T* allocate(std::size_t n) + { + T* pt=sycl::aligned_alloc_device(ALIGN, n, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + SYCLallocator_device_mem_allocated += n * sizeof(T); + return pt; + } + + void deallocate(T* p, std::size_t n) + { + sycl::free(p, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + SYCLallocator_device_mem_allocated -= n * sizeof(T); + } + + /** Provide a construct for std::allocator_traits::contruct to call. + * Don't do anything on construct, pointer p is on the device! 
+ * + * For example std::vector calls this to default initialize each element. You'll segfault + * if std::allocator_traits::construct tries doing that at p. + * + * The standard is a bit confusing on this point. Implementing this is an optional requirement + * of Allocator from C++11 on, its not slated to be removed. + * + * Its deprecated for the std::allocator in c++17 and will be removed in c++20. But we are not implementing + * std::allocator. + * + * STL containers only use Allocators through allocator_traits and std::allocator_traits handles the case + * where no construct method is present in the Allocator. + * But std::allocator_traits will call the Allocators construct method if present. + */ + template + static void construct(U* p, Args&&... args) + {} + + /** Give std::allocator_traits something to call. + * The default if this isn't present is to call p->~T() which + * we can't do on device memory. + */ + template + static void destroy(U* p) + {} + + void copyToDevice(T* device_ptr, T* host_ptr, size_t n) + { + DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue().memcpy(device_ptr,host_ptr,n*sizeof(T)).wait(); + } + + void copyFromDevice(T* host_ptr, T* device_ptr, size_t n) + { + DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue().memcpy(host_ptr,device_ptr,n*sizeof(T)).wait(); + } + + void copyDeviceToDevice(T* to_ptr, size_t n, T* from_ptr) + { + DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue().memcpy(to_ptr,from_ptr,n*sizeof(T)).wait(); + } +}; + +template +bool operator==(const SYCLAllocator&, const SYCLAllocator&) +{ + return true; +} +template +bool operator!=(const SYCLAllocator&, const SYCLAllocator&) +{ + return false; +} + +template +struct qmc_allocator_traits> +{ + static const bool is_host_accessible = false; + static const bool is_dual_space = false; + static void fill_n(T* ptr, size_t n, const T& value) { + //THINK + //qmcplusplus::SYCLfill_n(ptr, n, value); + } + static void updateTo(SYCLAllocator& alloc, T* host_ptr, size_t n) + { + T* device_ptr = alloc.getDevicePtr(host_ptr); + alloc.copyToDevice(device_ptr, host_ptr, n); + } + + static void updateFrom(SYCLAllocator& alloc, T* host_ptr, size_t n) + { + T* device_ptr = alloc.getDevicePtr(host_ptr); + alloc.copyFromDevice(host_ptr, device_ptr, n); + } + +}; + +/** allocator for SYCL host pinned memory + * @tparm T data type + * @tparm ALIGN alignment in bytes + */ +template +struct SYCLHostAllocator +{ + typedef T value_type; + typedef size_t size_type; + typedef T* pointer; + typedef const T* const_pointer; + + static constexpr size_t alignment = ALIGN; + + SYCLHostAllocator() = default; + template + SYCLHostAllocator(const SYCLHostAllocator&) + {} + + template + struct rebind + { + typedef SYCLHostAllocator other; + }; + + T* allocate(std::size_t n) + { + return sycl::aligned_alloc_host(ALIGN, n, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + } + void deallocate(T* p, std::size_t) { + sycl::free(p, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + } +}; + +template +bool operator==(const SYCLHostAllocator&, const SYCLHostAllocator&) +{ + return true; +} + +template +bool operator!=(const SYCLHostAllocator&, const SYCLHostAllocator&) +{ + return false; +} + +} // namespace qmcplusplus + +#endif From 67e47983307e500a58fb9aecc26bc12ca271cd48 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sat, 23 Apr 2022 15:59:29 -0500 Subject: [PATCH 16/26] Add test_SYCLallocator. 
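
The new unit test covers the device, host, and shared SYCL allocators through the
Ohmms Vector container. For reference, a minimal sketch of the pattern the
device-allocator case exercises (template arguments written out for illustration,
not a standalone program):

    sycl::queue q = getSYCLDefaultDeviceDefaultQueue();

    Vector<double, SYCLAllocator<double>> vec(1024); // device USM, not host accessible
    Vector<double> vec_h(1024);                      // plain host memory

    double* V = vec.data();
    // initialize the device buffer from a SYCL kernel
    sycl::event e = q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { V[i] = i + 1; });
    // copy back to the host after the kernel completes, then check values
    q.memcpy(vec_h.data(), vec.data(), 1024 * sizeof(double), {e}).wait();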
--- src/Platforms/SYCL/CMakeLists.txt | 3 +- src/Platforms/SYCL/SYCLallocator.cpp | 4 +- src/Platforms/SYCL/SYCLallocator.hpp | 38 ++++------ src/Platforms/SYCL/SYCLruntime.cpp | 22 ++++++ src/Platforms/SYCL/SYCLruntime.hpp | 22 ++++++ src/Platforms/tests/CMakeLists.txt | 4 + src/Platforms/tests/SYCL/CMakeLists.txt | 19 +++++ .../tests/SYCL/test_SYCLallocator.cpp | 73 +++++++++++++++++++ 8 files changed, 159 insertions(+), 26 deletions(-) create mode 100644 src/Platforms/SYCL/SYCLruntime.cpp create mode 100644 src/Platforms/SYCL/SYCLruntime.hpp create mode 100644 src/Platforms/tests/SYCL/CMakeLists.txt create mode 100644 src/Platforms/tests/SYCL/test_SYCLallocator.cpp diff --git a/src/Platforms/SYCL/CMakeLists.txt b/src/Platforms/SYCL/CMakeLists.txt index 46b7692bc..d77931e18 100644 --- a/src/Platforms/SYCL/CMakeLists.txt +++ b/src/Platforms/SYCL/CMakeLists.txt @@ -10,8 +10,9 @@ #////////////////////////////////////////////////////////////////////////////////////// -set(SYCL_RT_SRCS SYCLDeviceManager.cpp SYCLallocator.cpp) +set(SYCL_RT_SRCS SYCLDeviceManager.cpp SYCLallocator.cpp SYCLruntime.cpp) add_library(platform_sycl_runtime ${SYCL_RT_SRCS}) +target_include_directories(platform_sycl_runtime PRIVATE "../") target_link_libraries(platform_sycl_runtime PUBLIC SYCL::host PRIVATE platform_host_runtime) diff --git a/src/Platforms/SYCL/SYCLallocator.cpp b/src/Platforms/SYCL/SYCLallocator.cpp index 431979c95..450bbd6f0 100644 --- a/src/Platforms/SYCL/SYCLallocator.cpp +++ b/src/Platforms/SYCL/SYCLallocator.cpp @@ -2,7 +2,7 @@ // This file is distributed under the University of Illinois/NCSA Open Source License. // See LICENSE file in top directory for details. // -// Copyright (c) 2021 QMCPACK developers. +// Copyright (c) 2022 QMCPACK developers. // // File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory // @@ -15,5 +15,5 @@ namespace qmcplusplus { - std::atomic SYCLallocator_device_mem_allocated(0); +std::atomic SYCLallocator_device_mem_allocated(0); } diff --git a/src/Platforms/SYCL/SYCLallocator.hpp b/src/Platforms/SYCL/SYCLallocator.hpp index 7cd082bc6..696ab92ef 100644 --- a/src/Platforms/SYCL/SYCLallocator.hpp +++ b/src/Platforms/SYCL/SYCLallocator.hpp @@ -2,7 +2,7 @@ // This file is distributed under the University of Illinois/NCSA Open Source License. // See LICENSE file in top directory for details. // -// Copyright (c) 2019 QMCPACK developers. +// Copyright (c) 2022 QMCPACK developers. 
// // File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory // @@ -26,13 +26,12 @@ #include #include #include +#include "config.h" #include "allocator_traits.hpp" -#include "DeviceManager.h" +#include "SYCLruntime.hpp" namespace qmcplusplus { -extern sycl::queue* get_default_queue(); - extern std::atomic SYCLallocator_device_mem_allocated; inline size_t getSYCLdeviceMemAllocated() { return SYCLallocator_device_mem_allocated; } @@ -64,12 +63,10 @@ struct SYCLSharedAllocator T* allocate(std::size_t n) { - T* pt= sycl::aligned_alloc_shared(ALIGN, n, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + T* pt = sycl::aligned_alloc_shared(ALIGN, n, getSYCLDefaultDeviceDefaultQueue()); return pt; } - void deallocate(T* p, std::size_t) { - sycl::free(p, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); - } + void deallocate(T* p, std::size_t) { sycl::free(p, getSYCLDefaultDeviceDefaultQueue()); } }; template @@ -120,14 +117,14 @@ public: T* allocate(std::size_t n) { - T* pt=sycl::aligned_alloc_device(ALIGN, n, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + T* pt = sycl::aligned_alloc_device(ALIGN, n, getSYCLDefaultDeviceDefaultQueue()); SYCLallocator_device_mem_allocated += n * sizeof(T); return pt; } void deallocate(T* p, std::size_t n) { - sycl::free(p, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); + sycl::free(p, getSYCLDefaultDeviceDefaultQueue()); SYCLallocator_device_mem_allocated -= n * sizeof(T); } @@ -161,17 +158,17 @@ public: void copyToDevice(T* device_ptr, T* host_ptr, size_t n) { - DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue().memcpy(device_ptr,host_ptr,n*sizeof(T)).wait(); + getSYCLDefaultDeviceDefaultQueue().memcpy(device_ptr, host_ptr, n * sizeof(T)).wait(); } void copyFromDevice(T* host_ptr, T* device_ptr, size_t n) { - DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue().memcpy(host_ptr,device_ptr,n*sizeof(T)).wait(); + getSYCLDefaultDeviceDefaultQueue().memcpy(host_ptr, device_ptr, n * sizeof(T)).wait(); } void copyDeviceToDevice(T* to_ptr, size_t n, T* from_ptr) { - DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue().memcpy(to_ptr,from_ptr,n*sizeof(T)).wait(); + getSYCLDefaultDeviceDefaultQueue().memcpy(to_ptr, from_ptr, n * sizeof(T)).wait(); } }; @@ -191,9 +188,10 @@ struct qmc_allocator_traits> { static const bool is_host_accessible = false; static const bool is_dual_space = false; - static void fill_n(T* ptr, size_t n, const T& value) { + static void fill_n(T* ptr, size_t n, const T& value) + { //THINK - //qmcplusplus::SYCLfill_n(ptr, n, value); + //qmcplusplus::SYCLfill_n(ptr, n, value); } static void updateTo(SYCLAllocator& alloc, T* host_ptr, size_t n) { @@ -206,7 +204,6 @@ struct qmc_allocator_traits> T* device_ptr = alloc.getDevicePtr(host_ptr); alloc.copyFromDevice(host_ptr, device_ptr, n); } - }; /** allocator for SYCL host pinned memory @@ -234,13 +231,8 @@ struct SYCLHostAllocator typedef SYCLHostAllocator other; }; - T* allocate(std::size_t n) - { - return sycl::aligned_alloc_host(ALIGN, n, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); - } - void deallocate(T* p, std::size_t) { - sycl::free(p, DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue()); - } + T* allocate(std::size_t n) { return sycl::aligned_alloc_host(ALIGN, n, getSYCLDefaultDeviceDefaultQueue()); } + void deallocate(T* p, std::size_t) { sycl::free(p, getSYCLDefaultDeviceDefaultQueue()); } }; template diff --git a/src/Platforms/SYCL/SYCLruntime.cpp 
b/src/Platforms/SYCL/SYCLruntime.cpp new file mode 100644 index 000000000..d4e48fcb7 --- /dev/null +++ b/src/Platforms/SYCL/SYCLruntime.cpp @@ -0,0 +1,22 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2022 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + +#include +#include "DeviceManager.h" +#include "SYCLruntime.hpp" + +namespace qmcplusplus +{ +sycl::queue getSYCLDefaultDeviceDefaultQueue() +{ + return DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue(); +} +} // namespace qmcplusplus diff --git a/src/Platforms/SYCL/SYCLruntime.hpp b/src/Platforms/SYCL/SYCLruntime.hpp new file mode 100644 index 000000000..152c4ccb2 --- /dev/null +++ b/src/Platforms/SYCL/SYCLruntime.hpp @@ -0,0 +1,22 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2022 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + +#ifndef QMCPLUSPLUS_SYCL_RUNTIME_H +#define QMCPLUSPLUS_SYCL_RUNTIME_H + +#include + +namespace qmcplusplus +{ +sycl::queue getSYCLDefaultDeviceDefaultQueue(); +} // namespace qmcplusplus + +#endif diff --git a/src/Platforms/tests/CMakeLists.txt b/src/Platforms/tests/CMakeLists.txt index a7014a693..3bdc4bbd3 100644 --- a/src/Platforms/tests/CMakeLists.txt +++ b/src/Platforms/tests/CMakeLists.txt @@ -15,6 +15,10 @@ if(ENABLE_CUDA) add_subdirectory(CUDA) endif() +if(ENABLE_SYCL) + add_subdirectory(SYCL) +endif() + if(ENABLE_OFFLOAD) add_subdirectory(OMPTarget) endif(ENABLE_OFFLOAD) diff --git a/src/Platforms/tests/SYCL/CMakeLists.txt b/src/Platforms/tests/SYCL/CMakeLists.txt new file mode 100644 index 000000000..57ecaac58 --- /dev/null +++ b/src/Platforms/tests/SYCL/CMakeLists.txt @@ -0,0 +1,19 @@ +#////////////////////////////////////////////////////////////////////////////////////// +#// This file is distributed under the University of Illinois/NCSA Open Source License. +#// See LICENSE file in top directory for details. +#// +#// Copyright (c) 2022 QMCPACK developers. 
+#// +#// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +#// +#// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +#////////////////////////////////////////////////////////////////////////////////////// + + +set(UTEST_EXE test_sycl) +set(UTEST_NAME deterministic-unit_${UTEST_EXE}) + +add_executable(${UTEST_EXE} test_SYCLallocator.cpp) +target_link_libraries(${UTEST_EXE} SYCL::device platform_runtime containers catch_main) + +add_unit_test(${UTEST_NAME} 1 1 $) diff --git a/src/Platforms/tests/SYCL/test_SYCLallocator.cpp b/src/Platforms/tests/SYCL/test_SYCLallocator.cpp new file mode 100644 index 000000000..458cf64c4 --- /dev/null +++ b/src/Platforms/tests/SYCL/test_SYCLallocator.cpp @@ -0,0 +1,73 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2019 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + + +#include "catch.hpp" + +#include +#include +#include "SYCL/SYCLallocator.hpp" +#include "OhmmsPETE/OhmmsVector.h" + +namespace qmcplusplus +{ +TEST_CASE("SYCL_allocator", "[SYCL]") +{ + // SYCLAllocator + sycl::queue m_queue = getSYCLDefaultDeviceDefaultQueue(); + Vector> vec(1024); + Vector vec_h(1024); + + sycl::event e; + { + double* V = vec.data(); + + e = m_queue.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> item) { V[item] = item + 1; }); + } + + //copy to host + m_queue.memcpy(vec_h.data(), vec.data(), 1024 * sizeof(double), {e}).wait(); + + CHECK(vec_h[0] == 1); + CHECK(vec_h[77] == 78); +} + +TEST_CASE("SYCL_host_allocator", "[SYCL]") +{ + sycl::queue m_queue = getSYCLDefaultDeviceDefaultQueue(); + // SYCLHostAllocator + Vector> vec(1024); + vec = 1; + + { + double* V = vec.data(); + m_queue.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> item) { V[item] += item + 1; }).wait(); + } + + CHECK(vec[0] == 2); + CHECK(vec[77] == 79); +} + +TEST_CASE("SYCL_shared_allocator", "[SYCL]") +{ + sycl::queue m_queue = getSYCLDefaultDeviceDefaultQueue(); + Vector> vec(1024); + + std::cout << "Size " << vec.size() << std::endl; + { + double* V = vec.data(); + m_queue.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> item) { V[item] = item + 1; }).wait(); + } + CHECK(vec[0] == 1); + CHECK(vec[77] == 78); +} + +} // namespace qmcplusplus From c1593d48c6c52125363606a1052353bc81382a27 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sat, 23 Apr 2022 16:03:31 -0500 Subject: [PATCH 17/26] Minor change. 
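
Construct the pinned host vector with its initial value instead of assigning after
construction, and comment out the SYCL_shared_allocator test case. Sketch of the
updated construction (element type and allocator as used elsewhere in this test):

    // every element is value-initialized to 1 at construction time
    Vector<double, SYCLHostAllocator<double>> vec(1024, 1);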
--- src/Platforms/tests/SYCL/test_SYCLallocator.cpp | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/Platforms/tests/SYCL/test_SYCLallocator.cpp b/src/Platforms/tests/SYCL/test_SYCLallocator.cpp index 458cf64c4..233c3922d 100644 --- a/src/Platforms/tests/SYCL/test_SYCLallocator.cpp +++ b/src/Platforms/tests/SYCL/test_SYCLallocator.cpp @@ -44,8 +44,7 @@ TEST_CASE("SYCL_host_allocator", "[SYCL]") { sycl::queue m_queue = getSYCLDefaultDeviceDefaultQueue(); // SYCLHostAllocator - Vector> vec(1024); - vec = 1; + Vector> vec(1024, 1); { double* V = vec.data(); @@ -56,6 +55,7 @@ TEST_CASE("SYCL_host_allocator", "[SYCL]") CHECK(vec[77] == 79); } +/* TEST_CASE("SYCL_shared_allocator", "[SYCL]") { sycl::queue m_queue = getSYCLDefaultDeviceDefaultQueue(); @@ -69,5 +69,6 @@ TEST_CASE("SYCL_shared_allocator", "[SYCL]") CHECK(vec[0] == 1); CHECK(vec[77] == 78); } +*/ } // namespace qmcplusplus From 67d91a0fa3e4079ba883dfee4bec190f2b6c6ee5 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sat, 23 Apr 2022 20:22:45 -0500 Subject: [PATCH 18/26] Add MKL::sycl and syclBLAS.hpp placeholder --- CMake/FindMKL.cmake | 15 ++++ CMakeLists.txt | 3 + src/Platforms/CMakeLists.txt | 1 + src/Platforms/SYCL/CMakeLists.txt | 5 ++ src/Platforms/SYCL/syclBLAS.cpp | 100 +++++++++++++++++++++ src/Platforms/SYCL/syclBLAS.hpp | 44 +++++++++ src/Platforms/tests/SYCL/CMakeLists.txt | 10 +++ src/Platforms/tests/SYCL/test_syclBLAS.cpp | 94 +++++++++++++++++++ 8 files changed, 272 insertions(+) create mode 100644 src/Platforms/SYCL/syclBLAS.cpp create mode 100644 src/Platforms/SYCL/syclBLAS.hpp create mode 100644 src/Platforms/tests/SYCL/test_syclBLAS.cpp diff --git a/CMake/FindMKL.cmake b/CMake/FindMKL.cmake index ec634e7c8..76fac00ba 100644 --- a/CMake/FindMKL.cmake +++ b/CMake/FindMKL.cmake @@ -78,3 +78,18 @@ else(HAVE_MKL) set(MKL_FOUND FALSE) message(STATUS "MKL header files not found") endif(HAVE_MKL) + +# check for mkl_sycl +if(HAVE_MKL AND ENABLE_SYCL) + find_library(MKL_SYCL mkl_sycl + HINTS ${MKL_ROOT} $ENV{MKLROOT} $ENV{MKL_ROOT} $ENV{MKL_HOME} + PATH_SUFFIXES lib/intel64 + REQUIRED + ) + + if(MKL_SYCL) + add_library(MKL::sycl INTERFACE IMPORTED) + target_include_directories(MKL::sycl INTERFACE "${MKL_INCLUDE}") + target_link_libraries(MKL::sycl INTERFACE ${MKL_SYCL}) + endif() +endif() diff --git a/CMakeLists.txt b/CMakeLists.txt index abb5f6308..d0ee59ae6 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -851,6 +851,9 @@ if(ENABLE_SYCL) find_package(IntelDPCPP REQUIRED CONFIGS IntelDPCPPConfig-modified.cmake PATHS ${PROJECT_CMAKE}) target_link_libraries(SYCL::host INTERFACE OneAPI::DPCPP-host) target_link_libraries(SYCL::device INTERFACE OneAPI::DPCPP-device) + if(TARGET MKL::sycl) + target_link_libraries(MKL::sycl INTERFACE OneAPI::DPCPP-host) + endif() endif(ENABLE_SYCL) #------------------------------------------------------------------- diff --git a/src/Platforms/CMakeLists.txt b/src/Platforms/CMakeLists.txt index e024bb46c..ec0155cb9 100644 --- a/src/Platforms/CMakeLists.txt +++ b/src/Platforms/CMakeLists.txt @@ -63,6 +63,7 @@ endif(ENABLE_ROCM) if(ENABLE_SYCL) add_subdirectory(SYCL) target_link_libraries(platform_runtime PUBLIC platform_sycl_runtime) + target_link_libraries(platform_LA INTERFACE platform_sycl_LA) endif(ENABLE_SYCL) if(BUILD_UNIT_TESTS) diff --git a/src/Platforms/SYCL/CMakeLists.txt b/src/Platforms/SYCL/CMakeLists.txt index d77931e18..df2ec9409 100644 --- a/src/Platforms/SYCL/CMakeLists.txt +++ b/src/Platforms/SYCL/CMakeLists.txt @@ -11,8 +11,13 @@ set(SYCL_RT_SRCS 
SYCLDeviceManager.cpp SYCLallocator.cpp SYCLruntime.cpp) +set(SYCL_LA_SRCS syclBLAS.cpp) add_library(platform_sycl_runtime ${SYCL_RT_SRCS}) target_include_directories(platform_sycl_runtime PRIVATE "../") target_link_libraries(platform_sycl_runtime PUBLIC SYCL::host PRIVATE platform_host_runtime) + +add_library(platform_sycl_LA ${SYCL_LA_SRCS}) +target_link_libraries(platform_sycl_LA PUBLIC platform_sycl_runtime MKL::sycl + PRIVATE SYCL::device platform_sycl_runtime) diff --git a/src/Platforms/SYCL/syclBLAS.cpp b/src/Platforms/SYCL/syclBLAS.cpp new file mode 100644 index 000000000..ef6fce3b8 --- /dev/null +++ b/src/Platforms/SYCL/syclBLAS.cpp @@ -0,0 +1,100 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2022 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + + +#include "syclBLAS.hpp" +#include "oneapi/mkl/blas.hpp" + +namespace qmcplusplus +{ +namespace syclBLAS +{ +inline oneapi::mkl::transpose convertTransEnum(char trans) +{ + return trans == 'T' ? oneapi::mkl::transpose::trans : oneapi::mkl::transpose::nontrans; +} + +template +sycl::event gemv(sycl::queue& handle, + const char trans, + const int m, + const int n, + const T alpha, + const T* const A, + const int lda, + const T* const x, + const int incx, + const T beta, + T* const y, + const int incy, + const std::vector& events) +{ + return oneapi::mkl::blas::gemv(handle, convertTransEnum(trans), m, n, alpha, A, lda, x, incx, beta, y, incy, events); +} + +template sycl::event gemv(sycl::queue& handle, + const char trans, + const int m, + const int n, + const double alpha, + const double* const A, + const int lda, + const double* const x, + const int incx, + const double beta, + double* const y, + const int incy, + const std::vector& events); + +template sycl::event gemv(sycl::queue& handle, + const char trans, + const int m, + const int n, + const float alpha, + const float* const A, + const int lda, + const float* const x, + const int incx, + const float beta, + float* const y, + const int incy, + const std::vector& events); + +template sycl::event gemv(sycl::queue& handle, + const char trans, + const int m, + const int n, + const std::complex alpha, + const std::complex* const A, + const int lda, + const std::complex* const x, + const int incx, + const std::complex beta, + std::complex* const y, + const int incy, + const std::vector& events); + +template sycl::event gemv(sycl::queue& handle, + const char trans, + const int m, + const int n, + const std::complex alpha, + const std::complex* const A, + const int lda, + const std::complex* const x, + const int incx, + const std::complex beta, + std::complex* const y, + const int incy, + const std::vector& events); +} // namespace syclBLAS + +} // namespace qmcplusplus diff --git a/src/Platforms/SYCL/syclBLAS.hpp b/src/Platforms/SYCL/syclBLAS.hpp new file mode 100644 index 000000000..08fd0001d --- /dev/null +++ b/src/Platforms/SYCL/syclBLAS.hpp @@ -0,0 +1,44 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. 
+// See LICENSE file in top directory for details. +// +// Copyright (c) 2022 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + + +#ifndef QMCPLUSPLUS_SYCL_BLAS_H +#define QMCPLUSPLUS_SYCL_BLAS_H + +#include +#include + +namespace qmcplusplus +{ +namespace syclBLAS +{ +using syclBLAS_int = std::int64_t; +using syclBLAS_status = sycl::event; +using syclBLAS_handle = sycl::queue; + +template +sycl::event gemv(sycl::queue& handle, + const char trans, + const int m, + const int n, + const T alpha, + const T* const A, + const int lda, + const T* const x, + const int incx, + const T beta, + T* const y, + const int incy, + const std::vector& events = {}); +} // namespace syclBLAS + +} // namespace qmcplusplus +#endif // QMCPLUSPLUS_OMPBLAS_H diff --git a/src/Platforms/tests/SYCL/CMakeLists.txt b/src/Platforms/tests/SYCL/CMakeLists.txt index 57ecaac58..1237cae17 100644 --- a/src/Platforms/tests/SYCL/CMakeLists.txt +++ b/src/Platforms/tests/SYCL/CMakeLists.txt @@ -10,6 +10,7 @@ #////////////////////////////////////////////////////////////////////////////////////// +## runtime set(UTEST_EXE test_sycl) set(UTEST_NAME deterministic-unit_${UTEST_EXE}) @@ -17,3 +18,12 @@ add_executable(${UTEST_EXE} test_SYCLallocator.cpp) target_link_libraries(${UTEST_EXE} SYCL::device platform_runtime containers catch_main) add_unit_test(${UTEST_NAME} 1 1 $) + +## gemv/ger +set(UTEST_EXE test_sycl_blas) +set(UTEST_NAME deterministic-unit_${UTEST_EXE}) + +add_executable(${UTEST_EXE} test_syclBLAS.cpp) +target_link_libraries(${UTEST_EXE} catch_main containers platform_LA) + +add_unit_test(${UTEST_NAME} 1 1 $) diff --git a/src/Platforms/tests/SYCL/test_syclBLAS.cpp b/src/Platforms/tests/SYCL/test_syclBLAS.cpp new file mode 100644 index 000000000..3004fe02e --- /dev/null +++ b/src/Platforms/tests/SYCL/test_syclBLAS.cpp @@ -0,0 +1,94 @@ +////////////////////////////////////////////////////////////////////////////////////// +// This file is distributed under the University of Illinois/NCSA Open Source License. +// See LICENSE file in top directory for details. +// +// Copyright (c) 2021 QMCPACK developers. +// +// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +// +// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory +////////////////////////////////////////////////////////////////////////////////////// + +#include "catch.hpp" + +#include +#include +#include +#include "OMPTarget/OMPallocator.hpp" +#include "SYCL/SYCLruntime.hpp" +#include "SYCL/SYCLallocator.hpp" +#include "SYCL/syclBLAS.hpp" +#include +#include +#include "CPU/BLAS.hpp" + +namespace qmcplusplus +{ +template +void test_gemv(const int M_b, const int N_b, const char trans) +{ + const int M = trans == 'T' ? M_b : N_b; + const int N = trans == 'T' ? 
N_b : M_b; + + using vec_t = Vector; + using mat_t = Matrix; + + sycl::queue handle = getSYCLDefaultDeviceDefaultQueue(); + + vec_t A(N); // Input vector + mat_t B(M_b, N_b); // Input matrix + vec_t C(M); // Result vector ompBLAS + vec_t D(M); // Result vector BLAS + + // Fill data + for (int i = 0; i < N; i++) + A[i] = i; + + for (int j = 0; j < M_b; j++) + for (int i = 0; i < N_b; i++) + B[j][i] = i + j * 2; + + // Fill C and D with 0 + for (int i = 0; i < M; i++) + C[i] = D[i] = T(-0.1); + + A.updateTo(); + B.updateTo(); + + T alpha(1); + T beta(0); + + // in Fortran, B[M][N] is viewed as B^T + // when trans == 'T', the actual calculation is B * A[N] = C[M] + // when trans == 'N', the actual calculation is B^T * A[M] = C[N] + //ompBLAS::gemv(handle, trans, N_b, M_b, alpha, B.device_data(), N_b, A.device_data(), 1, beta, C.device_data(), 1); + + syclBLAS::gemv(handle, trans, M_b, M_b, alpha, B.device_data(), N_b, A.device_data(), 1, beta, C.device_data(), 1) + .wait(); + + C.updateFrom(); + + if (trans == 'T') + BLAS::gemv_trans(M_b, N_b, B.data(), A.data(), D.data()); + else + BLAS::gemv(M_b, N_b, B.data(), A.data(), D.data()); + + for (int index = 0; index < M; index++) + CHECK(C[index] == D[index]); +} + +TEST_CASE("OmpSYCL gemv", "[SYCL]") +{ + const int M = 137; + const int N = 79; + const int batch_count = 23; + + // Non-batched test + std::cout << "Testing TRANS gemv" << std::endl; +#if defined(ENABLE_OFFLOAD) + test_gemv>(M, N, 'T'); + test_gemv>(M, N, 'T'); +#endif +} + +} // namespace qmcplusplus From 2354c3aa2dbafa80ef77a59a74816c527f14ee66 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sat, 23 Apr 2022 23:27:08 -0500 Subject: [PATCH 19/26] Make default SYCL queue static. --- src/Platforms/DeviceManager.cpp | 2 +- src/Platforms/SYCL/CMakeLists.txt | 1 - src/Platforms/SYCL/SYCLDeviceManager.cpp | 15 ++++++++++++--- src/Platforms/SYCL/SYCLDeviceManager.h | 13 ++++++------- src/Platforms/SYCL/SYCLruntime.cpp | 7 ++----- 5 files changed, 21 insertions(+), 17 deletions(-) diff --git a/src/Platforms/DeviceManager.cpp b/src/Platforms/DeviceManager.cpp index 319bcc1e9..30c45b628 100644 --- a/src/Platforms/DeviceManager.cpp +++ b/src/Platforms/DeviceManager.cpp @@ -57,7 +57,7 @@ void DeviceManager::initializeGlobalDeviceManager(int local_rank, int local_size const DeviceManager& DeviceManager::getGlobal() { if (!global) - throw std::runtime_error("DeviceManager::getGlobal cannot access initialized the global instance."); + throw std::runtime_error("DeviceManager::getGlobal the global instance was not initialized."); return *global; } } // namespace qmcplusplus diff --git a/src/Platforms/SYCL/CMakeLists.txt b/src/Platforms/SYCL/CMakeLists.txt index df2ec9409..a1a73c5d2 100644 --- a/src/Platforms/SYCL/CMakeLists.txt +++ b/src/Platforms/SYCL/CMakeLists.txt @@ -14,7 +14,6 @@ set(SYCL_RT_SRCS SYCLDeviceManager.cpp SYCLallocator.cpp SYCLruntime.cpp) set(SYCL_LA_SRCS syclBLAS.cpp) add_library(platform_sycl_runtime ${SYCL_RT_SRCS}) -target_include_directories(platform_sycl_runtime PRIVATE "../") target_link_libraries(platform_sycl_runtime PUBLIC SYCL::host PRIVATE platform_host_runtime) diff --git a/src/Platforms/SYCL/SYCLDeviceManager.cpp b/src/Platforms/SYCL/SYCLDeviceManager.cpp index b8d07ac55..583cafbe8 100644 --- a/src/Platforms/SYCL/SYCLDeviceManager.cpp +++ b/src/Platforms/SYCL/SYCLDeviceManager.cpp @@ -16,6 +16,7 @@ #include #include #include +#include "config.h" #include "OutputManager.h" #include "determineDefaultDeviceNum.h" #if defined(_OPENMP) @@ -27,7 +28,6 @@ namespace qmcplusplus 
{ - #if defined(_OPENMP) /** create SYCL device/contexts from OpenMP owned ones to ensure interoperability. * CUDA has the notion of primary context while SYCL requires explicitly sharing context. @@ -86,11 +86,20 @@ SYCLDeviceManager::SYCLDeviceManager(int& default_device_num, int& num_devices, else if (default_device_num != sycl_default_device_num) throw std::runtime_error("Inconsistent assigned SYCL devices with the previous record!"); - default_device_queue = - sycl::queue(visible_devices[sycl_default_device_num].context, visible_devices[sycl_default_device_num].device); + default_device_queue = std::make_unique(visible_devices[sycl_default_device_num].context, + visible_devices[sycl_default_device_num].device); } } +std::unique_ptr SYCLDeviceManager::default_device_queue; + +sycl::queue& SYCLDeviceManager::getDefaultDeviceQueue() +{ + if (!default_device_queue) + throw std::runtime_error("SYCLDeviceManager::getDefaultDeviceQueue() the global instance not initialized."); + return *default_device_queue; +} + #if defined(_OPENMP) static std::vector xomp_get_sycl_devices() { diff --git a/src/Platforms/SYCL/SYCLDeviceManager.h b/src/Platforms/SYCL/SYCLDeviceManager.h index 73ddfd846..3b5dff0c3 100644 --- a/src/Platforms/SYCL/SYCLDeviceManager.h +++ b/src/Platforms/SYCL/SYCLDeviceManager.h @@ -15,13 +15,12 @@ #ifndef QMCPLUSPLUS_SYCLDEVICEMANAGER_H #define QMCPLUSPLUS_SYCLDEVICEMANAGER_H -#include #include -#include "config.h" +#include +#include namespace qmcplusplus { - struct syclDeviceInfo { sycl::context context; @@ -34,7 +33,9 @@ class SYCLDeviceManager { int sycl_default_device_num; std::vector visible_devices; - sycl::queue default_device_queue; + + /// the global singleton which can be used to access the default queue of the default device. + static std::unique_ptr default_device_queue; public: SYCLDeviceManager(int& default_device_num, int& num_devices, int local_rank, int local_size); @@ -42,10 +43,8 @@ public: /** access the the DeviceManager owned default queue. * Restrict the use of it to performance non-critical operations. * Note: CUDA has a default queue but all the SYCL queues are explicit. - * Right now we return a copy of the default queue. Queues hold contexts and devices by referece. - * So making a copy is expected to be cheap. If this is not the case, we will find a cheap solution. */ - sycl::queue getDefaultDeviceQueue() const { return default_device_queue; } + static sycl::queue& getDefaultDeviceQueue(); }; } // namespace qmcplusplus diff --git a/src/Platforms/SYCL/SYCLruntime.cpp b/src/Platforms/SYCL/SYCLruntime.cpp index d4e48fcb7..ebab17478 100644 --- a/src/Platforms/SYCL/SYCLruntime.cpp +++ b/src/Platforms/SYCL/SYCLruntime.cpp @@ -10,13 +10,10 @@ ////////////////////////////////////////////////////////////////////////////////////// #include -#include "DeviceManager.h" +#include "SYCLDeviceManager.h" #include "SYCLruntime.hpp" namespace qmcplusplus { -sycl::queue getSYCLDefaultDeviceDefaultQueue() -{ - return DeviceManager::getGlobal().getSYCLDM().getDefaultDeviceQueue(); -} +sycl::queue getSYCLDefaultDeviceDefaultQueue() { return SYCLDeviceManager::getDefaultDeviceQueue(); } } // namespace qmcplusplus From 5d38fe84d98dd48fecec9bed223962ef12be0edd Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sun, 24 Apr 2022 11:26:37 -0500 Subject: [PATCH 20/26] Fix SYCL context storage. 
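
In the Level Zero interoperability path, the context created from the OpenMP native
handle was assigned to a function-local variable and then dropped, so it was never
stored in the corresponding syclDeviceInfo entry. Keep it in the per-device record
instead. Roughly (arguments elided):

    // before: the interop context only lived in a local variable
    //   const sycl::context sycl_context = sycl::ext::oneapi::level_zero::make_context(...);
    // after: it is kept with the device it belongs to
    //   devices[id].context = sycl::ext::oneapi::level_zero::make_context(...);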
--- src/Platforms/SYCL/SYCLDeviceManager.cpp | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/src/Platforms/SYCL/SYCLDeviceManager.cpp b/src/Platforms/SYCL/SYCLDeviceManager.cpp index 583cafbe8..e9478451b 100644 --- a/src/Platforms/SYCL/SYCLDeviceManager.cpp +++ b/src/Platforms/SYCL/SYCLDeviceManager.cpp @@ -139,10 +139,9 @@ static std::vector xomp_get_sycl_devices() devices[id].device = sycl::ext::oneapi::level_zero::make_device(sycl_platform, reinterpret_cast(hDevice)); - const sycl::context sycl_context = - sycl::ext::oneapi::level_zero::make_context({devices[id].device}, - reinterpret_cast(hContext), - true /* keep the ownership, no transfer */); + devices[id].context = sycl::ext::oneapi::level_zero::make_context({devices[id].device}, + reinterpret_cast(hContext), + true /* keep the ownership, no transfer */); } else if (omp_backend.find("opencl") == 0) { From 3f2c8b1d723483f7d700933739ce632f8d7cedf1 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Sun, 24 Apr 2022 22:39:15 -0500 Subject: [PATCH 21/26] Fix bad ifdef. --- src/Platforms/DualAllocatorAliases.hpp | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/src/Platforms/DualAllocatorAliases.hpp b/src/Platforms/DualAllocatorAliases.hpp index 291aae51c..41effcc24 100644 --- a/src/Platforms/DualAllocatorAliases.hpp +++ b/src/Platforms/DualAllocatorAliases.hpp @@ -22,17 +22,7 @@ #define QMCPLUSPLUS_DUAL_ALLOCATOR_ALIASES_HPP #include "PinnedAllocator.h" - -#if defined(ENABLE_OFFLOAD) -#include "OMPTarget/OffloadAlignedAllocators.hpp" -namespace qmcplusplus -{ - template - using UnpinnedDualAllocator = OffloadAllocator; - template - using PinnedDualAllocator = OffloadPinnedAllocator; -} -#else +#if (defined(ENABLE_CUDA) || defined(ENABLE_SYCL)) && !defined(ENABLE_OFFLOAD) #include "DualAllocator.hpp" #if defined(ENABLE_CUDA) namespace qmcplusplus @@ -53,6 +43,16 @@ namespace qmcplusplus #else #error unhandled platform #endif + +#else // ENABLE_OFFLOAD or no CUDA or SYCL +#include "OMPTarget/OffloadAlignedAllocators.hpp" +namespace qmcplusplus +{ + template + using UnpinnedDualAllocator = OffloadAllocator; + template + using PinnedDualAllocator = OffloadPinnedAllocator; +} #endif #endif From cba655f562b0cb357a943c1a4e28faf8968b5af6 Mon Sep 17 00:00:00 2001 From: Mark Dewing Date: Mon, 25 Apr 2022 10:13:47 -0500 Subject: [PATCH 22/26] Overview of input file changes for batched drivers --- docs/input_overview.rst | 2 ++ docs/performance_portable.rst | 25 +++++++++++++++++++++++++ 2 files changed, 27 insertions(+) diff --git a/docs/input_overview.rst b/docs/input_overview.rst index 6a32e06bf..3753f94ca 100644 --- a/docs/input_overview.rst +++ b/docs/input_overview.rst @@ -118,6 +118,8 @@ Batched drivers check against ``max_seconds`` and make efforts to stop the execu In addition, a file named ``id`` plus ``.STOP``, in this case ``vmc.STOP``, stops QMCPACK execution on the fly cleanly once being found in the working directory. +.. _driver-version-parameter: + Driver version ~~~~~~~~~~~~~~ The ``driver_version`` parameter selects between the new performance-portable batched drivers and the previous drivers (now referred to as the 'legacy drivers'). 
diff --git a/docs/performance_portable.rst b/docs/performance_portable.rst index dad901e9e..16151c5fd 100644 --- a/docs/performance_portable.rst +++ b/docs/performance_portable.rst @@ -18,10 +18,35 @@ The new drivers that implement this flexible batching scheme are called "batched For OpenMP GPU offload users, batched drivers are essential to effectively use GPUs. + Links to more information in other sections of the manual: - **Build instructions:** :ref:`OpenMP target offload ` section of the :ref:`obtaininginstalling` chapter. - **Supported features:** :ref:`gpufeatures` section of the :ref:`chap:features` chapter. + - **Enabling batch drivers** :ref:`driver-version-parameter` section of the :ref:`input-overview` chapter. + - **Driver Inputs:** :ref:`batched_drivers` section of the :ref:`qmcmethods` chapter. + + +Input files for batched drivers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use the following changes to update input files to use the batched drivers. + +1. Update the project block with the ``driver_version`` parameter. For example: + +:: + + + batch + + +See :ref:`driver-version-parameter` for more. + +2. Modify the QMC algorithm blocks + +The most significant change is the ``walkers`` parameter has been replaced with ``walkers_per_rank`` or ``total_walkers``. + +See :ref:`batched_drivers` for details. From eccabb37df01682cea7c49456b1cd9606ac0877a Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Mon, 25 Apr 2022 15:06:59 -0500 Subject: [PATCH 23/26] Mandate CMake 3.20 when using SYCL and add a compiler workaround. --- CMake/IntelCompilers.cmake | 7 +++++-- CMakeLists.txt | 8 ++++++-- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/CMake/IntelCompilers.cmake b/CMake/IntelCompilers.cmake index 8c0ea214a..561d4d603 100644 --- a/CMake/IntelCompilers.cmake +++ b/CMake/IntelCompilers.cmake @@ -126,8 +126,11 @@ if(NOT CMAKE_SYSTEM_NAME STREQUAL "CrayLinuxEnvironment") endif() #(CMAKE_CXX_FLAGS MATCHES "-march=" AND CMAKE_C_FLAGS MATCHES "-march=") else() #(CMAKE_CXX_FLAGS MATCHES "-march=" OR CMAKE_C_FLAGS MATCHES "-march=") # use -march=native - set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=native") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native") + # skipped in OneAPI 2022.0 when using SYCL which caused linking failure. + if (NOT (CMAKE_CXX_COMPILER_VERSION VERSION_EQUAL 2022.0 AND ENABLE_SYCL)) + set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=native") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native") + endif() endif() #(CMAKE_CXX_FLAGS MATCHES "-march=" OR CMAKE_C_FLAGS MATCHES "-march=") endif() diff --git a/CMakeLists.txt b/CMakeLists.txt index abb5f6308..101493331 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -843,8 +843,12 @@ endif(ENABLE_HIP) # set up SYCL compiler options and libraries #------------------------------------------------------------------- if(ENABLE_SYCL) - if(NOT (CMAKE_CXX_COMPILER_ID MATCHES "IntelLLVM" OR INTEL_ONEAPI_COMPILER_FOUND)) - message(FATAL_ERROR "Only LLVM-based Intel compiler supports SYCL.") + # require 3.20 to recognize IntelLLVM compiler ID and check accurate version numbers. 
+ if(CMAKE_VERSION VERSION_LESS 3.20.0) + message(FATAL_ERROR "ENABLE_SYCL require CMake 3.20.0 or later") + endif() + if(NOT CMAKE_CXX_COMPILER_ID MATCHES "IntelLLVM") + message(FATAL_ERROR "QMCPACK only supports SYCL with LLVM-based Intel compiler (icpx).") endif() add_library(SYCL::host INTERFACE IMPORTED) add_library(SYCL::device INTERFACE IMPORTED) From 7791e2f0607976551445c8ea9bd7802af402796a Mon Sep 17 00:00:00 2001 From: Paul Kent Date: Tue, 26 Apr 2022 18:43:05 -0400 Subject: [PATCH 24/26] More explanation --- docs/methods.rst | 2 +- docs/performance_portable.rst | 36 ++++++++++++++++++++++------------- 2 files changed, 24 insertions(+), 14 deletions(-) diff --git a/docs/methods.rst b/docs/methods.rst index 87f25ee6d..7de0eb192 100644 --- a/docs/methods.rst +++ b/docs/methods.rst @@ -123,7 +123,7 @@ Batched drivers The batched drivers introduce a new concept, "crowd", as a sub-organization of walker population. A crowd is a subset of the walkers that are operated on as as single batch. Walkers within a crowd operate their computation in lock-step, which helps the GPU efficiency. -Walkers between crowds remain fully asynchronous unless operations involving the full population are needed. +Walkers in different crowds remain fully asynchronous unless operations involving the full population are needed. With this flexible batching capability the new drivers are capable of delivering maximal performance on given hardware. In the new driver design, all the batched API calls may fallback to an existing single walker implementation. Consequently, batched drivers allow mixing and matching CPU-only and GPU-accelerated features diff --git a/docs/performance_portable.rst b/docs/performance_portable.rst index 16151c5fd..991cae123 100644 --- a/docs/performance_portable.rst +++ b/docs/performance_portable.rst @@ -3,21 +3,31 @@ Performance Portable Implementation =================================== -Under the Exascale Computing Project effort a new set of QMC drivers was developed -to eliminate the divergence of legacy CPU and GPU code paths at the QMC driver level and make the drivers CPU/GPU agnostic. -The divergence came from the the fact that the CPU code path favors executing all the compute tasks within a step -for one walker and then advance walker by walker. Multiple CPU threads process their own assigned walkers in parallel. -In this way, walkers are not synchronized with each other and maximal throughout can be achieved on CPU. -The GPU code path favors executing the same compute task over all the walkers together to maximize GPU throughput. -This compute dispatch pattern minimizes the overhead of dispatching computation and host-device data transfer. -However, the legacy GPU code path only leverages the OpenMP main host thread for handling -all the interaction between the host and GPUs and limit the kernel dispatch capability. -In brief, the CPU code path handles computation with a walker batch size of one and many batches -while the GPU code path uses only one batch containing all the walkers. -The new drivers that implement this flexible batching scheme are called "batched drivers". +The so-called performance portable implementation was developed to present a unified way to run QMC on CPU and GPU +systems, eliminate the divergence between CPU and GPU code paths that had been introduced in the past, while still +maintaining high performance. 
This required generalizing all the driver inputs to potentially drive larger batches of +walkers and also eliminating ambiguities in the various input blocks of QMCPACK. Internally many new code paths have +been created, including new QMC drivers for VMC, DMC, and the wavefunction optimizer. -For OpenMP GPU offload users, batched drivers are essential to effectively use GPUs. +Once this implementation is sufficiently matured and enough features are available, the old non-performance portable +drivers will be deprecated and eventually deleted. The number of changes required to old input files is usually very +small, so use of the new performance portable implementation is encouraged, particularly for new projects. +The performance portable implementation load balances the total number of walkers onto MPI tasks, as per the old +drivers. The new implementation is able to subdivide the walkers of each MPI task into multiple similarly-sized crowds. +The walkers in each crowd are then updated simultaneously. This structure enables the walkers to be efficiently mapped +e.g. to CPU threads where even a single walker can be computed efficiently with a single thread. For efficient GPU +execution, each crowd is first owned by a distinct CPU thread, which in turn executes batched operations over all the +walkers in its crowd on the GPU. This batching enables efficient GPU execution, while the use of multiple crowds can +reduce synchronization and allow higher performance to be obtained. For these reasons the new performance portable +drivers are also referred to as batched drivers, since this is the largest change from the old code. + +The new implementation currently largely uses OpenMP offload for portability, although other technologies are also used +and the implementation has flexible dispatch to help obtain high performance on every platform. + +This implementation was designed and implemented as part of the Exascale Computing Project, with a view to bringing +QMCPACK to GPUs from multiple vendors with high-efficiency while creating a more maintainable and easy to contribute to +codebase. Links to more information in other sections of the manual: From a8906110f7ed631adbeb8e9c78b64365763ac392 Mon Sep 17 00:00:00 2001 From: Paul Kent Date: Tue, 26 Apr 2022 18:50:00 -0400 Subject: [PATCH 25/26] Improve text --- docs/performance_portable.rst | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/performance_portable.rst b/docs/performance_portable.rst index 991cae123..9435ec5a6 100644 --- a/docs/performance_portable.rst +++ b/docs/performance_portable.rst @@ -4,7 +4,7 @@ Performance Portable Implementation =================================== The so-called performance portable implementation was developed to present a unified way to run QMC on CPU and GPU -systems, eliminate the divergence between CPU and GPU code paths that had been introduced in the past, while still +systems, and eliminate the divergence between CPU and GPU code paths that had been introduced in the past, while still maintaining high performance. This required generalizing all the driver inputs to potentially drive larger batches of walkers and also eliminating ambiguities in the various input blocks of QMCPACK. Internally many new code paths have been created, including new QMC drivers for VMC, DMC, and the wavefunction optimizer. @@ -14,16 +14,18 @@ drivers will be deprecated and eventually deleted. 
The number of changes require small, so use of the new performance portable implementation is encouraged, particularly for new projects. The performance portable implementation load balances the total number of walkers onto MPI tasks, as per the old -drivers. The new implementation is able to subdivide the walkers of each MPI task into multiple similarly-sized crowds. -The walkers in each crowd are then updated simultaneously. This structure enables the walkers to be efficiently mapped -e.g. to CPU threads where even a single walker can be computed efficiently with a single thread. For efficient GPU -execution, each crowd is first owned by a distinct CPU thread, which in turn executes batched operations over all the -walkers in its crowd on the GPU. This batching enables efficient GPU execution, while the use of multiple crowds can -reduce synchronization and allow higher performance to be obtained. For these reasons the new performance portable -drivers are also referred to as batched drivers, since this is the largest change from the old code. +drivers. The new implementation is then able to subdivide the walkers of each MPI task into multiple similarly-sized +crowds. The walkers in each crowd can then be updated simultaneously. This structure enables the walkers to be +efficiently mapped to both CPUs and GPUs. On CPU systems, they then are mapped to OpenMP threads where a single walker +can be computed efficiently by even a single thread. On GPU systems, large numbers of GPU threads must be used +concurrently for high efficiency: Each crowd is first owned by a distinct CPU thread, which in turn executes batched +operations over all the walkers in its crowd on the GPU. Provided the batches are sufficiently large, this facilitates +efficient GPU execution, while the use of multiple crowds can reduce synchronization and allow higher performance to be +obtained. For these reasons the new performance portable drivers are also referred to as batched drivers, since this is +the largest change from the older code. -The new implementation currently largely uses OpenMP offload for portability, although other technologies are also used -and the implementation has flexible dispatch to help obtain high performance on every platform. +The new implementation largely uses OpenMP offload for portability, although other technologies are also used and the +implementation has flexible dispatch to help obtain high performance on every platform. This implementation was designed and implemented as part of the Exascale Computing Project, with a view to bringing QMCPACK to GPUs from multiple vendors with high-efficiency while creating a more maintainable and easy to contribute to From 5a46ed4644924959ebf1db39a711e41949a8c550 Mon Sep 17 00:00:00 2001 From: Ye Luo Date: Tue, 26 Apr 2022 19:43:21 -0500 Subject: [PATCH 26/26] Expand test_dual_allocators_ohmms_containers to SYCL. 
--- .../test_dual_allocators_ohmms_containers.cpp | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/src/Containers/tests/test_dual_allocators_ohmms_containers.cpp b/src/Containers/tests/test_dual_allocators_ohmms_containers.cpp index 1605702e7..7e0506a25 100644 --- a/src/Containers/tests/test_dual_allocators_ohmms_containers.cpp +++ b/src/Containers/tests/test_dual_allocators_ohmms_containers.cpp @@ -18,6 +18,9 @@ #if defined(ENABLE_CUDA) #include "DualAllocator.hpp" #include "CUDA/CUDAallocator.hpp" +#elif defined(ENABLE_SYCL) +#include "DualAllocator.hpp" +#include "SYCL/SYCLallocator.hpp" #endif #include "OhmmsPETE/OhmmsMatrix.h" #include "OhmmsSoA/VectorSoaContainer.h" @@ -33,7 +36,10 @@ template using OffloadPinnedAllocator = OMPallocator>; #if defined(ENABLE_CUDA) template -using CUDAPinnedAllocator = DualAllocator, PinnedAlignedAllocator>; +using VendorDualPinnedAllocator = DualAllocator, PinnedAlignedAllocator>; +#elif defined(ENABLE_SYCL) +template +using VendorDualPinnedAllocator = DualAllocator, PinnedAlignedAllocator>; #endif template @@ -109,9 +115,9 @@ TEST_CASE("OhmmsMatrix_VectorSoaContainer_View", "[Integration][Allocators]") { testDualAllocator>(); testDualAllocator>>(); -#if defined(ENABLE_CUDA) - testDualAllocator>(); - testDualAllocator>>(); +#if defined(ENABLE_CUDA) || defined(ENABLE_SYCL) + testDualAllocator>(); + testDualAllocator>>(); #endif } } // namespace qmcplusplus