Re: NAMD hangs with replica option

From: jing liang (jingliang2015_at_gmail.com)
Date: Fri Feb 11 2022 - 03:21:27 CST

Hi,

The replica simulation seems to be working fine now with all your useful
comments. I explored the possibility of having more than 200 replicas. The
simulation finished, but the resulting PMF looks worse than with only 4
replicas. The simulation ran for 50000000 steps. The part of the Colvars
input file for metadynamics looks like:

   hillWeight 0.1
   newHillFrequency 1000
   writeHillsTrajectory on
   hillWidth 1.0

   multipleReplicas on
   replicasRegistry myrep.txt
   replicaUpdateFrequency 50000
   writePartialFreeEnergyFile on

Is there any recommendation for when the number of replicas is larger than
4-8? Also, I noticed that well-tempered metadynamics became unstable: the
simulation crashed when run with replicas. Thanks in advance.
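
P.S. In case it is relevant: in Colvars, the well-tempered variant is enabled
inside the same metadynamics block with the wellTempered and biasTemperature
keywords; a minimal sketch (the collective variable names and the
biasTemperature value below are only illustrative) would be:

   metadynamics {
      name meta-distance
      colvars distance1
      hillWeight 0.1
      newHillFrequency 1000
      hillWidth 1.0

      wellTempered on
      biasTemperature 3000.0   # illustrative value, in kelvin

      multipleReplicas on
      replicasRegistry myrep.txt
      replicaUpdateFrequency 50000
      writePartialFreeEnergyFile on
   }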

On Wed, Jan 12, 2022 at 3:32 PM Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
wrote:

> Nope, Colvars already combines them for you into a single PMF, which gets
> written by all replicas, each according to its "outputName" prefix; the
> contents will be the same, minus small deviations in between
> synchronizations:
>
> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>
> If you need to analyze the contributions of each replica, you can use
> "writePartialFreeEnergyFile on", as you have.
>
> Giacomo
>
> On Wed, Jan 12, 2022 at 4:21 AM jing liang <jingliang2015_at_gmail.com>
> wrote:
>
>> Hi,
>>
>> your suggestion of using a different name for the output files worked.
>> Thanks!
>>
>> A question arising from this simulation: in a run with X replicas one gets
>> X PMFs, so how do you combine all of them? Do you use NAMD (somehow)?
>> Or maybe just take the average with a simple bash script?
>>
>> Have a nice day!
>>
>>
>> On Mon, Jan 10, 2022 at 8:27 PM Giacomo Fiorin (<
>> giacomo.fiorin_at_gmail.com>) wrote:
>>
>>> Hi Jing,
>>>
>>>
>>> On Mon, Jan 10, 2022 at 2:13 PM jing liang <jingliang2015_at_gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> thanks for your comments. outputname is set to just "meta", without the
>>>> per-replica suffix that you mentioned.
>>>>
>>>
>>> Please make outputName different for each replica, as suggested;
>>> otherwise the replicas will overwrite each other's output.
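>>>
>>> A minimal sketch of that, in the NAMD input file (Tcl), using the
>>> [myReplica] function that Josh mentioned (the "meta" prefix is only an
>>> example):
>>>
>>>    outputname meta.[myReplica]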
>>>
>>>
>>>> May I ask you about the tcl function you mentioned, where could I find
>>>> its description? I get the following output files:
>>>>
>>>
>>>
>>> https://www.ks.uiuc.edu/Research/namd/2.14/ug/node9.html#SECTION00052300000000000000
>>>
>>>
>>>>
>>>> mymtd-replicas.txt
>>>> meta-distance.5.files.txt.BAK
>>>> meta-distance.5.files.txt
>>>> meta-distance.0.files.txt.BAK
>>>> meta-distance.0.files.txt
>>>> meta.xst.BAK
>>>> meta.restart.xsc.old
>>>> meta.restart.vel.old
>>>> meta.restart.coor.old
>>>> meta.restart.colvars.state.old
>>>> meta.restart.colvars.state
>>>> meta.pmf.BAK
>>>> meta.partial.pmf.BAK
>>>> meta.dcd.BAK
>>>> meta.colvars.traj.BAK
>>>> meta.colvars.traj
>>>> meta.colvars.state.old
>>>> meta.colvars.meta-distance.5.state
>>>> meta.colvars.meta-distance.5.hills.traj
>>>> meta.colvars.meta-distance.5.hills
>>>> meta.colvars.meta-distance.0.hills.traj
>>>> meta.xst
>>>> meta.restart.xsc
>>>> meta.restart.vel
>>>> meta.restart.coor
>>>> meta.pmf
>>>> meta.partial.pmf
>>>> meta.dcd
>>>> meta.colvars.state
>>>> meta.colvars.meta-distance.0.state
>>>> meta.colvars.meta-distance.0.hills
>>>>
>>>
>>> This is consistent with your setup: each of those files is being
>>> overwritten multiple times, but those that contain the replica ID are
>>> distinct (because Colvars detects the replica ID internally from NAMD
>>> when you launch NAMD with +replicas).
>>>
>>>
>>>> plus the NAMD log file, which contains the information about the
>>>> replicas I used here. Because I requested 8 replicas, I expected more
>>>> output files. The content of mymtd-replicas.txt (written by NAMD, not by
>>>> me) is:
>>>>
>>>> 0 meta-distance.0.files.txt
>>>> 5 meta-distance.5.files.txt
>>>>
>>>> This tells me that somehow NAMD is setting up only 2 replicas although I
>>>> requested 8: mpirun -np 112 namd2 +replicas 8 script.inp
>>>>
>>>
>>> Not quite: normally that list would be populated by the replicas, one by
>>> one. You ask for 8, but because all the replicas write at the same time
>>> *onto the same files*, they run into I/O errors, the simulation does not
>>> proceed smoothly, and the replicas don't get to the registration step.
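>>>
>>> When everything works, that registry gets one entry per replica, i.e.
>>> something along the lines of:
>>>
>>>    0 meta-distance.0.files.txt
>>>    1 meta-distance.1.files.txt
>>>    ...
>>>    7 meta-distance.7.files.txt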
>>>
>>>
>>>>
>>>> The colvars config file contains the lines:
>>>>
>>>> metadynamics {
>>>>    name meta-distance
>>>>    colvars distance1
>>>>    hillWeight 0.1
>>>>    newHillFrequency 1000
>>>>    writeHillsTrajectory on
>>>>    hillWidth 1.0
>>>>
>>>>    multipleReplicas on
>>>>    replicasRegistry mymtd-replicas.txt
>>>>    replicaUpdateFrequency 50000
>>>>    writePartialFreeEnergyFile on
>>>> }
>>>>
>>>> I am running on a parallel file system for HPC. Any comment will be
>>>> appreciated. Thanks again.
>>>>
>>>
>>> For now the problem seems to be that the output prefix is not
>>> differentiated between replicas. If the problem persists after fixing
>>> that, please also report what kind of parallel file system you have
>>> (NFS, GPFS, Lustre, ...).
>>>
>>>
>>>>
>>>> On Mon, Jan 10, 2022 at 5:22 PM Giacomo Fiorin (<
>>>> giacomo.fiorin_at_gmail.com>) wrote:
>>>>
>>>>> Jing, you are probably already using different values of outputName if
>>>>> you are using multipleReplicas on (i.e. multiple walkers), but still,
>>>>> please confirm that that is what you are doing.
>>>>>
>>>>> Note also that by using file-based communication the replicas don't
>>>>> need to be launched with the same command, but can also be run as
>>>>> independent jobs:
>>>>>
>>>>> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>>>>> In that framework, the main advantage of +replicas is that the value of
>>>>> replicaID is filled in automatically, so that your Colvars config file
>>>>> can be identical for all replicas.
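>>>>>
>>>>> For the independent-jobs route, a minimal sketch of what would differ
>>>>> per job (the replicaID string and output prefix below are hypothetical;
>>>>> each job gets its own values) is:
>>>>>
>>>>>    # Colvars config of one of the replicas, for example:
>>>>>    metadynamics {
>>>>>       name meta-distance
>>>>>       colvars distance1
>>>>>       hillWeight 0.1
>>>>>       newHillFrequency 1000
>>>>>       hillWidth 1.0
>>>>>
>>>>>       multipleReplicas on
>>>>>       replicasRegistry mymtd-replicas.txt
>>>>>       replicaUpdateFrequency 50000
>>>>>       replicaID walker3       # unique label for this replica
>>>>>    }
>>>>>
>>>>>    # and in that replica's NAMD input file:
>>>>>    outputname meta.walker3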
>>>>>
>>>>> If you are experiencing file I/O issues also when launching replicas
>>>>> independently (i.e. not with a single NAMD run with +replicas), can you
>>>>> find out what kind of filesystem you have on the compute nodes?
>>>>>
>>>>> Thanks
>>>>> Giacomo
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu> wrote:
>>>>>
>>>>>> There is definitely a bug in the 2.14 MPI version. One of my students
>>>>>> has noticed that anything that calls NAMD's "die" routine isn't taking
>>>>>> down all the replicas, so the jobs continue to burn resources until
>>>>>> they reach their wallclock limit.
>>>>>>
>>>>>> However, the key is figuring out *why* you are getting an error. I'm
>>>>>> less familiar with metadynamics, but at least for umbrella sampling,
>>>>>> it is pretty typical for each replica to write out its own set of
>>>>>> files. This is usually done with something like:
>>>>>>
>>>>>> outputname somename.[myReplica]
>>>>>>
>>>>>> Where [myReplica] is a Tcl function that evaluates to the replica ID
>>>>>> for each semi-independent simulation. For debugging purposes, it can
>>>>>> be very helpful for each replica to spit out its own log file. This
>>>>>> is usually done by setting the +stdout option on the command line.
>>>>>>
>>>>>> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout
>>>>>> outputlog.%d.log
>>>>>>
>>>>>> -Josh
>>>>>>
>>>>>> On 1/9/22 2:34 PM, jing liang wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I am running a metadynamics simulation with the NAMD 2.14 MPI
>>>>>> > version. SLURM is used for job scheduling; the way to run it using 2
>>>>>> > replicas on a 14-core node is as follows:
>>>>>> >
>>>>>> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
>>>>>> >
>>>>>> > In fact, I have tried up to 8 replicas and the resulting PMF looks
>>>>>> > very similar to what I obtain with other methods such as ABF. The
>>>>>> > problem is that by using the replicas option, the simulation hangs
>>>>>> > right at the end. I have looked at the output files, and it seems
>>>>>> > that right at the end NAMD wants to access some files (for example,
>>>>>> > *.xsc, *hills*, ...) that already exist, and NAMD throws an error.
>>>>>> >
>>>>>> > My guess is that this could be either a misunderstanding on my side
>>>>>> > in running NAMD with replicas or a bug in the MPI version.
>>>>>> >
>>>>>> > Have you observed this issue previously? Any comment is welcome.
>>>>>> > Thanks
>>>>>> >
>>>>>>
>>>>>> --
>>>>>> Josh Vermaas
>>>>>>
>>>>>> vermaasj_at_msu.edu
>>>>>> Assistant Professor, Plant Research Laboratory and Biochemistry and
>>>>>> Molecular Biology
>>>>>> Michigan State University
>>>>>>
>>>>>> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>>>>>>
>>>>>>
>>>>>>
