Re: NAMD hangs with replica option

From: jing liang (jingliang2015_at_gmail.com)
Date: Mon Jan 10 2022 - 13:12:53 CST

Hi,

Thanks for your comments. outputname is set to just "meta", without any
reference to the replicas you mentioned. May I ask about the Tcl function
you mentioned, where can I find its description? I get the following
output files:

mymtd-replicas.txt
meta-distance.5.files.txt.BAK
meta-distance.5.files.txt
meta-distance.0.files.txt.BAK
meta-distance.0.files.txt
meta.xst.BAK
meta.restart.xsc.old
meta.restart.vel.old
meta.restart.coor.old
meta.restart.colvars.state.old
meta.restart.colvars.state
meta.pmf.BAK
meta.partial.pmf.BAK
meta.dcd.BAK
meta.colvars.traj.BAK
meta.colvars.traj
meta.colvars.state.old
meta.colvars.meta-distance.5.state
meta.colvars.meta-distance.5.hills.traj
meta.colvars.meta-distance.5.hills
meta.colvars.meta-distance.0.hills.traj
meta.xst
meta.restart.xsc
meta.restart.vel
meta.restart.coor
meta.pmf
meta.partial.pmf
meta.dcd
meta.colvars.state
meta.colvars.meta-distance.0.state
meta.colvars.meta-distance.0.hills

plus the NAMD log file, which contains the information about the replicas I
used here. Because I requested 8 replicas, I expected more output files. The
content of mymtd-replicas.txt (written by NAMD, not by me) is:

0 meta-distance.0.files.txt
5 meta-distance.5.files.txt

This tells me that somehow NAMD is setting up only 2 replicas, although I
requested 8:

mpirun -np 112 namd2 +replicas 8 script.inp
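
If it helps, I could also print the replica index from the config script to
confirm how many replicas are actually started (a minimal check, assuming the
config is plain Tcl and that the myReplica and numReplicas commands are
available when +replicas is used):

# write one line from each replica to its own log
puts stdout "This is replica [myReplica] of [numReplicas]"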

The colvars config file contains the lines:

metadynamics {
   name meta-distance
   colvars distance1
   hillWeight 0.1
   newHillFrequency 1000
   writeHillsTrajectory on
   hillWidth 1.0

   multipleReplicas on
   replicasRegistry mymtd-replicas.txt
   replicaUpdateFrequency 50000
   writePartialFreeEnergyFile on
}
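
If I understand the suggestion correctly, each walker should also get its own
output prefix in the NAMD config so the replicas do not overwrite each
other's files. Would something like this be right (just a sketch; the Colvars
file name is a placeholder)?

# per-replica output prefix so walkers write distinct files
outputname     meta.[myReplica]

colvars        on
colvarsConfig  metadynamics.colvars.in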

I am running on a parallel file system for HPC. Any comments will be
appreciated. Thanks again.

On Mon, Jan 10, 2022 at 17:22, Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
wrote:

> Jing, you're probably using different values for outputName if you're
> using multipleReplicas on (i.e. multiple walkers), but still, please
> confirm that that's what you are using.
>
> Note also that by using file-based communication the replicas don't need
> to be launched with the same command, but can also be run as independent
> jobs:
>
> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
> In that framework, the main advantage of +replicas is that the value of
> replicaID is filled in automatically, so that your Colvars config file
> can be identical for all replicas.
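>
> As an illustration (a minimal sketch, with arbitrary ID strings), a job
> launched on its own could set the identifier by hand in its Colvars config:
>
>    metadynamics {
>       name              meta-distance
>       colvars           distance1
>       # ... other metadynamics options as in your config ...
>       multipleReplicas  on
>       replicaID         walker0   # unique string per job, e.g. walker0..walker7
>       replicasRegistry  mymtd-replicas.txt
>    }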
>
> If you are experiencing file I/O issues also when launching replicas
> independently (i.e. not with a single NAMD run with +replicas), can you
> find out what kind of filesystem you have on the compute nodes?
>
> Thanks
> Giacomo
>
>
>
> On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu> wrote:
>
>> There is definitely a bug in the 2.14 MPI version. One of my students
>> has noticed that anything that calls NAMD's die routine isn't taking
>> down all the replicas, so the jobs continue to burn resources until
>> they reach their wallclock limit.
>>
>> However, the key is figuring out *why* you are getting an error. I'm
>> less familiar with metadynamics, but at least for umbrella sampling, it
>> is pretty typical for each replica to write out its own set of files.
>> This is usually done with something like:
>>
>> outputname somename.[myReplica]
>>
>> Where [myReplica] is a Tcl function that evaluates to the replica ID for
>> each semi-independent simulation. For debugging purposes, it can be very
>> helpful for each replica to spit out its own log file. This is usually
>> done by setting the +stdout option on the command line.
>>
>> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout outputlog.%d.log
>>
>> -Josh
>>
>> On 1/9/22 2:34 PM, jing liang wrote:
>> > Hi,
>> >
>> > I am running a metadynamics simulation with the NAMD 2.14 MPI version.
>> > SLURM is used for job scheduling; the way to run it with 2 replicas on
>> > a 14-core node is as follows:
>> >
>> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
>> >
>> > In fact, I have tried up to 8 replicas and the resulting PMF looks very
>> > similar to what I obtain with other methods such as ABF. The problem is
>> > that with the replicas option, the simulation hangs right at the end. I
>> > have looked at the output files, and it seems that right at the end NAMD
>> > wants to access some files (for example, *.xsc, *hills*, ...) that
>> > already exist, and NAMD throws an error.
>> >
>> > My guess is that this is either a misunderstanding on my side about
>> > running NAMD with replicas, or a bug in the MPI version.
>> >
>> > Have you observed that issue previously? Any comment is welcome. Thanks
>> >
>>
>> --
>> Josh Vermaas
>>
>> vermaasj_at_msu.edu
>> Assistant Professor, Plant Research Laboratory and Biochemistry and
>> Molecular Biology
>> Michigan State University
>>
>> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>>
>>
>>
