From: Robert Sawko (RSawko_at_uk.ibm.com)
Date: Thu Dec 01 2016 - 13:10:12 CST
Hi,
I am struggling with a strange issue. I am trying to run the GPU version
of NAMD 2.12b on multiple nodes over ibverbs. I am running the
following script (only the relevant parts are shown):
#BSUB -W 01:00
#BSUB -R "span[ptile=4]"
#BSUB -n 8
AFFINITY="+commap 0,8,112,120 +pemap 16-111:8.2"
charmrun +p48 ++ppn 6 \
++mpiexec ++remote-shell ./mympiexec \
\${NAMDBIN} +devices 0,1,2,3 \${AFFINITY} \
29.conf
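For reference, the launch geometry these settings imply can be checked with a few lines of shell arithmetic. This is only a sketch under assumptions: that LSF places 4 slots on each of 2 nodes, that charmrun starts +p / ++ppn processes in total, and that the +pemap range is read as start-end:stride.run. The variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the submission script above.
total_pes=48       # charmrun +p48
pes_per_proc=6     # ++ppn 6
slots=8            # #BSUB -n 8
slots_per_node=4   # span[ptile=4]

# Assumption: processes = +p / ++ppn, spread evenly over the nodes.
procs=$((total_pes / pes_per_proc))
nodes=$((slots / slots_per_node))
procs_per_node=$((procs / nodes))
echo "processes=$procs nodes=$nodes processes_per_node=$procs_per_node"

# Assumption: +pemap 16-111:8.2 means blocks starting every 8 cores
# from 16 up to 111, with 2 consecutive cores taken from each block.
pemap_blocks=$(( (111 - 16) / 8 + 1 ))
pemap_cores=$((pemap_blocks * 2))
workers_per_node=$((procs_per_node * pes_per_proc))
echo "pemap_cores_per_node=$pemap_cores worker_threads_per_node=$workers_per_node"
```

Under these assumptions the script asks for 8 processes, 4 per node, which matches the four entries in +devices and in +commap, and the 24 cores in +pemap match the 24 worker threads per node.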
NAMD correctly reports the bindings to each of the 8 GPUs. However, when I
run the nvidia-smi utility on the same nodes, I get perplexing output:
Node 1
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID Type   Process name                                 Usage    |
|=============================================================================|
|    0     32241    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    87MiB |
|    1     32242    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    87MiB |
|    2     32244    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    86MiB |
|    3     32246    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    88MiB |
+-----------------------------------------------------------------------------+
Node 2
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID Type   Process name                                 Usage    |
|=============================================================================|
|    0     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   140MiB |
|    0     15989    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   115MiB |
|    1     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   122MiB |
|    1     15989    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   133MiB |
|    2     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   122MiB |
|    2     15991    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    86MiB |
|    3     15988    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2   122MiB |
|    3     15993    C   ...s/panther/local/apps/ibm/namd/2.12b/namd2    87MiB |
+-----------------------------------------------------------------------------+
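The per-GPU process counts in the two tables can be tallied mechanically. The helper below is a minimal sketch, assuming the process-table layout shown above (GPU index and PID as the first two columns after the leading "|"); the function name count_procs_per_gpu is made up for illustration and newer driver versions may print additional columns.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print
# "<gpu index> <number of compute processes>" for every GPU listed
# in the process table.
count_procs_per_gpu() {
    # A process row starts with "|" followed by two purely numeric
    # fields (GPU index, PID); header and border lines do not match.
    awk '$1 == "|" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { n[$2]++ }
         END { for (g in n) print g, n[g] }' | sort -n
}

# Usage on a compute node (not run here):
#   nvidia-smi | count_procs_per_gpu
```

Fed the node 1 table this prints one process for each of GPUs 0 to 3; fed the node 2 table it prints two processes for each GPU.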
The output for node 2 cannot be correct: process 15988 appears on all four
GPUs, so every GPU there hosts two processes. Also, I tried the STMV 1, 20
and 210 benchmarks to measure scaling and I fail to see any, so I am sure
that something is wrong, but I cannot see what I am doing wrong in my
submission script.
Please advise,
Robert
--
Dr Robert Sawko
Research Staff Member, IBM
Daresbury Laboratory
Keckwick Lane, Warrington
WA4 4AD
United Kingdom
--
Email (IBM): RSawko_at_uk.ibm.com
Email (STFC): robert.sawko_at_stfc.ac.uk
Phone (office): +44 (0) 1925 60 3967
Phone (mobile): +44 778 830 8522
Profile page: http://researcher.watson.ibm.com/researcher/view.php?person=uk-RSawko
--
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU