Re: NAMD hangs with replica option

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Wed Jan 12 2022 - 08:32:14 CST

Nope, Colvars already combines them for you into a single PMF, which is
written by every replica:
https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
Each replica writes it under its own "outputName" prefix, and the contents
are the same across replicas, apart from small deviations between
synchronizations.

If you need to analyze the contributions of each replica, you can use
"writePartialFreeEnergyFile on", as you have.

Giacomo

On Wed, Jan 12, 2022 at 4:21 AM jing liang <jingliang2015_at_gmail.com> wrote:

> Hi,
>
> your suggestion of using a different name for the output files worked.
> Thanks!
>
> A follow-up question from this simulation: in a simulation with X replicas
> one gets X PMFs; how do you combine all of them? Do you use NAMD (somehow),
> or just take the average with a simple bash script?
>
> Have a nice day!
>
>
> On Mon, Jan 10, 2022 at 8:27 PM Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
> wrote:
>
>> Hi Jing,
>>
>>
>> On Mon, Jan 10, 2022 at 2:13 PM jing liang <jingliang2015_at_gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> thanks for your comments. outputName is set to just "meta", without the
>>> reference to the replica ID that you mentioned.
>>>
>>
>> Please make outputName different for each replica, as suggested;
>> otherwise the replicas will overwrite each other's output.
>>
>>
>>> May I ask about the Tcl function you mentioned, where could I find
>>> its description? I get the following output files:
>>>
>>
>>
>> https://www.ks.uiuc.edu/Research/namd/2.14/ug/node9.html#SECTION00052300000000000000
>>
>>
>>>
>>> mymtd-replicas.txt
>>> meta-distance.5.files.txt.BAK
>>> meta-distance.5.files.txt
>>> meta-distance.0.files.txt.BAK
>>> meta-distance.0.files.txt
>>> meta.xst.BAK
>>> meta.restart.xsc.old
>>> meta.restart.vel.old
>>> meta.restart.coor.old
>>> meta.restart.colvars.state.old
>>> meta.restart.colvars.state
>>> meta.pmf.BAK
>>> meta.partial.pmf.BAK
>>> meta.dcd.BAK
>>> meta.colvars.traj.BAK
>>> meta.colvars.traj
>>> meta.colvars.state.old
>>> meta.colvars.meta-distance.5.state
>>> meta.colvars.meta-distance.5.hills.traj
>>> meta.colvars.meta-distance.5.hills
>>> meta.colvars.meta-distance.0.hills.traj
>>> meta.xst
>>> meta.restart.xsc
>>> meta.restart.vel
>>> meta.restart.coor
>>> meta.pmf
>>> meta.partial.pmf
>>> meta.dcd
>>> meta.colvars.state
>>> meta.colvars.meta-distance.0.state
>>> meta.colvars.meta-distance.0.hills
>>>
>>
>> This is consistent with your setup: each of those files is being written
>> over multiple times, but the ones that contain the replica ID are distinct
>> (because Colvars detects the replica ID internally from NAMD when you
>> launch NAMD with +replicas).
>>
>>
>>> plus the NAMD log file, which contains the information about the replicas
>>> I used here. Because I requested 8 replicas, I expected more output files.
>>> The content of mymtd-replicas.txt (written by NAMD, not by me) is:
>>>
>>> 0 meta-distance.0.files.txt
>>> 5 meta-distance.5.files.txt
>>>
>>> this tells me that somehow NAMD is setting up only 2 replicas, although I
>>> requested 8: mpirun -np 112 namd2 +replicas 8 script.inp
>>>
>>
>> Not quite: normally that list is populated by the replicas, one by one.
>> You asked for 8, but because the replicas all write at the same time
>> *onto the same files*, they run into I/O errors, the simulation does not
>> proceed smoothly, and the replicas never reach the registration step.
>>
>>
>>>
>>> The colvars config file contains the lines:
>>>
>>> metadynamics {
>>>     name                        meta-distance
>>>     colvars                     distance1
>>>     hillWeight                  0.1
>>>     newHillFrequency            1000
>>>     writeHillsTrajectory        on
>>>     hillWidth                   1.0
>>>
>>>     multipleReplicas            on
>>>     replicasRegistry            mymtd-replicas.txt
>>>     replicaUpdateFrequency      50000
>>>     writePartialFreeEnergyFile  on
>>> }
>>>
>>> I am running on a parallel file system for HPC. Any comment will be
>>> appreciated. Thanks again.
>>>
>>
>> For now, the problem seems to be that the output prefix was not
>> differentiated between replicas. If the problem persists after fixing
>> that, please also report what kind of parallel file system you are using
>> (NFS, GPFS, Lustre, ...).
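>>
>> To find that out (a sketch; run it on a compute node, in the directory
>> where the job writes its output):
>>
>> df -T .
>>
>> The "Type" column will show nfs, gpfs, lustre, etc.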
>>
>>
>>>
>>> On Mon, Jan 10, 2022 at 5:22 PM Giacomo Fiorin (<giacomo.fiorin_at_gmail.com>)
>>> wrote:
>>>
>>>> Jing, if you're using multipleReplicas on (i.e. multiple walkers), you
>>>> are probably already using different values for outputName, but please
>>>> confirm that this is the case.
>>>>
>>>> Note also that because the communication is file-based, the replicas
>>>> don't need to be launched with the same command, but can also be run as
>>>> independent jobs:
>>>> https://colvars.github.io/colvars-refman-namd/colvars-refman-namd.html#sec:colvarbias_meta_mr
>>>> In that framework, the main advantage of +replicas is that the value of
>>>> replicaID is filled in automatically, so that your Colvars config file
>>>> can be identical for all replicas.
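>>>>
>>>> As a minimal sketch (the "meta" prefix here is only an example), the NAMD
>>>> config of each replica could contain
>>>>
>>>> outputName meta.[myReplica]
>>>>
>>>> while the Colvars config with "multipleReplicas on" stays identical for
>>>> all replicas. If you instead run the replicas as fully independent jobs
>>>> (without +replicas), you would set replicaID explicitly in each replica's
>>>> Colvars config, as described in the manual section linked above.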
>>>>
>>>> If you are experiencing file I/O issues also when launching replicas
>>>> independently (i.e. not with a single NAMD run with +replicas), can you
>>>> find out what kind of filesystem you have on the compute nodes?
>>>>
>>>> Thanks
>>>> Giacomo
>>>>
>>>>
>>>>
>>>> On Mon, Jan 10, 2022 at 9:37 AM Josh Vermaas <vermaasj_at_msu.edu> wrote:
>>>>
>>>>> There is definitely a bug in the 2.14 MPI version. One of my students
>>>>> has noticed that anything that calls NAMD's die routine doesn't take
>>>>> down all the replicas, so the jobs continue to burn resources until
>>>>> they reach their wallclock limit.
>>>>>
>>>>> However, the key is figuring out *why* you are getting an error. I'm
>>>>> less familiar with metadynamics, but at least for umbrella sampling it
>>>>> is pretty typical for each replica to write out its own set of files.
>>>>> This is usually done with something like:
>>>>>
>>>>> outputname somename.[myReplica]
>>>>>
>>>>> where [myReplica] is a Tcl function that evaluates to the replica ID for
>>>>> each semi-independent simulation. For debugging purposes, it can be very
>>>>> helpful for each replica to spit out its own log file. This is usually
>>>>> done by setting the +stdout option on the command line:
>>>>>
>>>>> mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp +stdout outputlog.%d.log
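>>>>>
>>>>> Under SLURM (a sketch; the task-count variable and log file name are
>>>>> just placeholders), the same idea would look like:
>>>>>
>>>>> mpirun -np $SLURM_NTASKS namd2 +replicas 8 namd_metadynamics.inp +stdout replica-%d.log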
>>>>>
>>>>> -Josh
>>>>>
>>>>> On 1/9/22 2:34 PM, jing liang wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I am running a metadynamics simulation with the NAMD 2.14 MPI version.
>>>>> > SLURM is used for job scheduling, and the way to run it with 2
>>>>> > replicas on a 14-core node is as follows:
>>>>> >
>>>>> > mpirun -np 28 namd2 +replicas 2 namd_metadynamics.inp
>>>>> >
>>>>> > In fact, I have tried up to 8 replicas, and the resulting PMF looks
>>>>> > very similar to what I obtain with other methods such as ABF. The
>>>>> > problem is that with the replicas option the simulation hangs right at
>>>>> > the end. I have looked at the output files, and it seems that right at
>>>>> > the end NAMD wants to access some files (for example, *.xsc, *hills*,
>>>>> > ...) that already exist, and NAMD throws an error.
>>>>> >
>>>>> > My guess is that this could be either a misunderstanding on my side
>>>>> > about running NAMD with replicas, or a bug in the MPI version.
>>>>> >
>>>>> > Have you observed this issue previously? Any comment is welcome.
>>>>> > Thanks
>>>>> >
>>>>>
>>>>> --
>>>>> Josh Vermaas
>>>>>
>>>>> vermaasj_at_msu.edu
>>>>> Assistant Professor, Plant Research Laboratory and Biochemistry and
>>>>> Molecular Biology
>>>>> Michigan State University
>>>>>
>>>>> https://prl.natsci.msu.edu/people/faculty/josh-vermaas/
>>>>>
>>>>>
>>>>>
