Re: CPU multinode NAMD 2.14 execution

From: Antonio Frances Monerris (antonio.frances_at_uv.es)
Date: Mon Sep 19 2022 - 10:58:01 CDT

Hi Josh,

Thanks for your quick answer. Your point makes a lot of sense. I've tried your command, and a new error appears:

OPENING EXTENDED SYSTEM TRAJECTORY FILE
FATAL ERROR: Unable to open text file output/abf_1.xst: File exists
[Partition 0][Node 0] End of program

This seems to happen on only one of the nodes, which does not expect the file to exist; the other 9 nodes do not report any error. It looks like a parallelization problem, but I'm not sure. Any help?
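
(For reference: the error indicates NAMD will not overwrite the existing .xst file, so the stale outputs left by the earlier failed launches need to be cleared, or the output prefix changed, before resubmitting. A minimal sketch, assuming everything under output/ can be regenerated; the exact file list is hypothetical, remove whatever the failed runs actually left behind:

rm -f output/abf_1.xst output/abf_1.dcd

Alternatively, keep the old files and point the NAMD config at a fresh prefix, e.g. "outputName output/abf_1_retry", where the retry prefix is just an example.)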

Best regards,
Antonio

On Monday, September 19, 2022 17:21 CEST, Josh Vermaas <vermaasj_at_msu.edu> wrote:
> Hi Antonio,
>
> I think it's because you have both srun *and* charmrun in the execution
> line. The srun is asking for 10 tasks, each of which runs the same
> charmrun arguments, so you get 10 copies of the same simulation, each
> using ++n 10 and ++ppn 35.
>
> What I might try is the following:
>
> srun -n 10 -c 36 namd2 +ppn 35 +setcpuaffinity $NAMD_INPUT > $NAMD_OUTPUT
>
> This is very similar to what I use on local GPU clusters:
>
> #!/bin/bash
> #SBATCH --gres=gpu:4                  # 4 GPUs per node
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=4           # one NAMD process per GPU
> #SBATCH --cpus-per-task=12
> #SBATCH --gpu-bind=map_gpu:0,1,2,3    # bind each task to its own GPU
> #SBATCH --time=4:0:0
> #SBATCH --job-name=jobname
>
> cd $SLURM_SUBMIT_DIR
> module use /mnt/home/vermaasj/modules
> module load NAMD/2.14-gpu
> # 11 worker threads + 1 comm thread fill each 12-CPU task
> srun namd2 +ppn 11 +ignoresharing configfile.namd > logfile.log
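>
> For a CPU-only run like yours, a minimal sketch adapting the same
> pattern to 10 nodes of 36 cores (the module name and time limit are
> placeholders, and this assumes your NAMD build can be launched
> directly by srun, as in the command above):
>
> #!/bin/bash
> #SBATCH --nodes=10
> #SBATCH --ntasks-per-node=1           # one NAMD process per node
> #SBATCH --cpus-per-task=36
> #SBATCH --time=4:0:0
> #SBATCH --job-name=jobname
>
> cd $SLURM_SUBMIT_DIR
> module load NAMD/2.14                 # placeholder module name
> # 35 worker threads + 1 comm thread per process fill the 36 cores
> srun namd2 +ppn 35 +setcpuaffinity $NAMD_INPUT > $NAMD_OUTPUT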
>
>
> -Josh
>
>
> On 9/19/22 10:40 AM, Antonio Frances Monerris wrote:
> > Dear NAMD users,
> >
> > I am trying to run NAMD 2.14 on a scientific cluster managed by the Slurm scheduler. My goal is to distribute the simulation across several nodes to speed it up. Each node has 36 physical CPUs (2 sockets of 18 cores each).
> >
> > Some info on the software versions:
> >
> > Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
> > Info: NAMD 2.14 for Linux-x86_64-verbs-smp
> >
> > This is the command I run:
> >
> > srun -N 10 charmrun ++n 10 ++ppn 35 namd2 +setcpuaffinity +idlepoll $NAMD_INPUT > $NAMD_OUTPUT
> >
> > It runs, but it does not do what I want. The output prints these lines, 10 times each:
> >
> > Charm++> Running in SMP mode: 10 processes, 35 worker threads (PEs) + 1 comm threads per process, 350 PEs total
> > Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
> > Charm++> Warning: the number of SMP threads (360) is greater than the number of physical cores (36), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.
> > Info: Running on 350 processors, 10 nodes, 1 physical nodes.
> >
> > The first two lines are consistent with what I want, but the last two are not. Later, NAMD prints the statistics for the same steps ten times each. It seems that instead of running one simulation across 10 nodes, it is repeating the same simulation 10 times, once per node. This seems to be confirmed by the .dcd file, which contains only the number of frames reported in the output (not multiplied by 10). The time per step also does not change significantly with the number of nodes, consistent with this diagnosis.
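> >
> > (A quick check, assuming the log format above: grepping "Running on" in the output log of a properly distributed run should report 10 hosts and 10 physical nodes a single time, rather than "1 hosts" and "1 physical nodes" printed 10 times.)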
> >
> > What am I missing? Can someone help me with the submission, please?
> >
> > Many thanks for reading.
> >
> > With sincere regards,
> > Antonio
> >
>
> --
> Josh Vermaas
>
> vermaasj_at_msu.edu
> Assistant Professor, Plant Research Laboratory and Biochemistry and Molecular Biology
> Michigan State University
> vermaaslab.github.io
>