From: Josh Vermaas
Date: Wed Nov 25 2020 - 10:11:58 CST
Hi Rene,
The expedient thing to do is usually just to go with +ignoresharing. It
*should* also be possible for this to work if +ppn is set correctly. This
is a runscript that I've used in a Slurm environment to correctly map GPUs
on a 2-socket, 4-GPU system, where I was oversubscribing the GPUs (64
replicas, only 32 GPUs):
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=8
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=6
#SBATCH --gpu-bind=closest
#SBATCH --time=4:0:0
set -x
module load gompi/2020a CUDA
# This isn't obvious, but this is a Linux-x86_64-ucx-smp-CUDA build compiled
# from source.
srun $HOME/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +ppn 6 +replicas 64 \
    run0.namd +stdout %d/run0.%d.log
It worked out that each replica had 6 dedicated cores, which is where the
+ppn 6 came from. Thus, even though each replica saw multiple GPUs
(--gpu-bind=closest meant that each replica saw only the 2 GPUs closest to
the CPU socket its 6 cores came from, rather than all 4 on the node), I
didn't need to specify devices or +ignoresharing.
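As a rough sanity check on those numbers, the layout arithmetic can be sketched as below. This is only an illustration: it assumes Slurm spreads the 64 tasks evenly over the 8 nodes, and the variable names are hypothetical and not part of the runscript above.

```shell
# Values copied from the #SBATCH lines in the runscript.
nodes=8
gpus_per_node=4      # --gres=gpu:4
ntasks=64            # one Slurm task per replica (+replicas 64)
cpus_per_task=6      # --cpus-per-task=6, matching +ppn 6

# Assumption: tasks are distributed evenly across the nodes.
replicas_per_node=$(( ntasks / nodes ))
cores_per_node=$(( replicas_per_node * cpus_per_task ))
total_gpus=$(( nodes * gpus_per_node ))
replicas_per_gpu=$(( ntasks / total_gpus ))

echo "replicas per node:       $replicas_per_node"
echo "cores used per node:     $cores_per_node"
echo "GPUs in the job:         $total_gpus"
echo "replicas sharing a GPU:  $replicas_per_gpu"
```

Under these assumptions the job places 8 replicas on each node, occupies 48 cores per node, and shares each of the 32 GPUs between 2 replicas, which is the oversubscription mentioned above.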
Hope this helps!
On Wed, Nov 25, 2020 at 6:47 AM René Hafner TUK wrote:
> Update:
> I am ONLY able to run both NAMD2.13 and NAMD3alpha7 netlrts-smp-CUDA
> versions with
> +p2 +replicas 2, i.e. 1 core per replica.
> *But as soon as I use more than 1 core per replica, it fails.*
> Anyone ever experienced that?
> Any hints are appreciated!
> Kind regards
> René
> On 11/23/2020 2:22 PM, René Hafner TUK wrote:
> Dear all,
> I am trying to get an (e)ABF simulation running with the multi-copy
> algorithm on a multi-GPU node.
> I tried as described in
> :
> charmrun ++local namd2 myconf_file.conf +p16 +replicas 2 +stdout logfile%d.log
> I am using the precompiled binaries from the Download page: NAMD 2.13
> Linux-x86_64-netlrts-smp-CUDA (Multi-copy algorithms, single process per
> copy)
> And for both NAMD2.13 and NAMD2.14 I get the error:
> FATAL ERROR: Number of devices (2) is not a multiple of number of
> processes (8). Sharing devices between processes is inefficient. Specify
> +ignoresharing (each process uses all visible devices) if not all devices
> are visible to each process, otherwise adjust number of processes to evenly
> divide number of devices, specify subset of devices with +devices argument
> (e.g., +devices 0,2), or multiply list shared devices (e.g., +devices
> 0,1,2,0).
> But even when using +devices 0,1 I obtain the same error. Why should the
> number of devices be a multiple of the number of processes at all?
> Shouldn't it be the other way around? In my example, that would be 8 cores
> + 1 GPU PER replica.
> Can anyone give me some support here?
> Kind regards
> René Hafner
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany