Re: NAMD3 runs failing with no error - XST failing to load and jumping to negative timestep

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Thu Mar 10 2022 - 08:26:38 CST

Hi Nicole, this looks clearly like overflowing a 32-bit integer:
https://en.wikipedia.org/wiki/Integer_overflow
and your DCD files are probably affected, too, although applications like
VMD will generally ignore the step counter contained in the DCD header, and
just count the number of frames (which is a much lower number).
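
The arithmetic can be checked directly. The sketch below is only an illustration, not NAMD code: `wrap_int32` is a hypothetical helper that mimics how a signed 32-bit two's-complement integer wraps around, and the assumption that the overflowing quantity is the final step (starting step plus run length) is inferred from the numbers in the log, not from the NAMD source.

```python
INT32_MAX = 2**31 - 1  # 2147483647, largest value a signed 32-bit int can hold


def wrap_int32(value: int) -> int:
    """Return what a signed 32-bit two's-complement integer would store for value."""
    return (value + 2**31) % 2**32 - 2**31


first_step = 2_100_000_000  # step the failed run restarts from (from the log)
run_steps = 100_000_000     # "TCL: Running for 100000000 steps"
last_step = first_step + run_steps  # 2_200_000_000 as an exact Python int

print(last_step > INT32_MAX)   # True: the end of the run does not fit in 32 bits
print(wrap_int32(last_step))   # -2094967296, the step reported in the failed log
```

The previous chunk ended at step 2100000000, which is still below 2147483647; that would explain why the runs up to 2100ns completed normally.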

Using a 64-bit integer to represent the cumulative simulation step would be
appropriate, but it needs to be done consistently throughout the code, and
in NAMD this just hasn't happened yet. Some of the code is very modern
(e.g. the fast GPU code that you're using), but other parts of NAMD haven't
been touched in two decades. This is one of them.

My recommendation would be *not to use firstTimeStep* in the NAMD
configuration file, so that the step number can be safely represented by a
32-bit int, and just add up the numbers of steps yourself during analysis.
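
A minimal sketch of that bookkeeping, assuming every chunk restarts its step counter at 0 and runs for 100000000 steps with a 1 fs timestep (inferred from 100ns per chunk in the original message). The function names and the zero-based chunk index are hypothetical and only illustrate the idea; Python integers do not overflow, so the cumulative step is exact.

```python
STEPS_PER_CHUNK = 100_000_000  # steps per 100ns chunk (from the log)
TIMESTEP_FS = 1.0              # assumed: 100ns / 100000000 steps = 1 fs per step


def global_step(chunk_index: int, local_step: int,
                steps_per_chunk: int = STEPS_PER_CHUNK) -> int:
    """Cumulative step for a frame, given a zero-based chunk index and the
    step counter of that chunk (which restarts from 0 in every chunk)."""
    return chunk_index * steps_per_chunk + local_step


def global_time_ns(chunk_index: int, local_step: int,
                   steps_per_chunk: int = STEPS_PER_CHUNK,
                   timestep_fs: float = TIMESTEP_FS) -> float:
    """Cumulative simulation time in nanoseconds (1 ns = 1e6 fs)."""
    return global_step(chunk_index, local_step, steps_per_chunk) * timestep_fs / 1e6


# Frame written at local step 250 of the 22nd chunk (zero-based index 21):
print(global_step(21, 250))     # 2100000250
print(global_time_ns(21, 250))  # about 2100.00025 ns
```

With this scheme each run's own step counter never exceeds 100000000, far below the 32-bit limit of 2147483647.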

Giacomo

On Thu, Mar 10, 2022 at 2:46 AM Nicole Richardson <RCHNIC009_at_myuct.ac.za>
wrote:

> Hi all,
>
> I've been running MD simulations of solvated carbohydrate-based molecules
> with NAMD3 on A100 Nvidia GPU cards. I have successfully run about 2100ns
> of simulation (broken into 100ns chunks restarting from the last timestep
> each time). I am trying to extend this simulation to 2500ns; however, when
> I try to run the next 100ns (2100-2200ns), my job completes in about 40s
> (it usually takes four days) with no error message.
>
> When I look in the log file everything appears to start up and load
> normally from the last step (2100000000fs), however, when it comes to
> starting the simulation, the XST file doesn't load normally and the
> timestep jumps to -2094967296fs before completing with no error messages.
>
> What I see in the failed run log files:
>
> *TCL: Running for 100000000 steps*
> *PRESSURE: 2100000000 179.257 -48.3392 14.7691 -48.3391 -23.1117 26.2091
> 14.7699 26.2087 23.974*
> *GPRESSURE: 2100000000 58.5932 12.5161 -7.17997 -23.4519 -15.1422 7.11123
> 7.1262 -73.2121 80.6467*
> *ETITLE: TS BOND ANGLE DIHED
> IMPRP ELECT VDW BOUNDARY MISC
> KINETIC TOTAL TEMP POTENTIAL TOTAL3
> TEMPAVG PRESSURE GPRESSURE VOLUME
> PRESSAVG GPRESSAVG*
>
> *ENERGY: 2100000000 91254.0209 51736.6118 1220.2276
> 3.7103 -1126784.1720 124629.9070 0.0000 0.0000
> 239952.8038 -617986.8906 300.7945 -857939.6943
> -615711.6734 300.7945 60.0397 41.3659
> 2572854.6858 60.0397 41.3659*
>
> *OPENING EXTENDED SYSTEM TRAJECTORY FILE*
> *WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP -2094967296*
> *CLOSING EXTENDED SYSTEM TRAJECTORY FILE*
> *WRITING COORDINATES TO OUTPUT FILE AT STEP -2094967296*
> *COORDINATE DCD FILE
> /scratch/rchnic009/pn10b_6RU/run22/pn10b_6RU_run22.dcd WAS NOT CREATED*
> *The last position output (seq=-2) takes 0.019 seconds, 0.000 MB of memory
> in use*
> *WRITING VELOCITIES TO OUTPUT FILE AT STEP -2094967296*
> *The last velocity output (seq=-2) takes 0.015 seconds, 0.000 MB of memory
> in use*
> *====================================================*
>
> *WallClock: 39.436554 CPUTime: 38.389771 Memory: 0.000000 MB*
> *[Partition 0][Node 0] End of program*
>
> What I expect from successful runs:
>
> *TCL: Running for 100000000 steps*
> *PRESSURE: 2000000000 -220.025 89.416 -156.942 89.4157 -79.0945 16.1138
> -156.942 16.1138 -18.1251*
> *GPRESSURE: 2000000000 -75.8928 40.0892 -57.6707 66.6383 38.9224 8.28696
> -36.6139 42.0393 -89.5388*
> *ETITLE: TS BOND ANGLE DIHED
> IMPRP ELECT VDW BOUNDARY MISC
> KINETIC TOTAL TEMP POTENTIAL TOTAL3
> TEMPAVG PRESSURE GPRESSURE VOLUME
> PRESSAVG GPRESSAVG*
>
> *ENERGY: 2000000000 91331.3257 51973.8169 1205.2632
> 2.9099 -1126485.6683 124403.0619 0.0000 0.0000
> 239372.8908 -618196.4000 300.0676 -857569.2907
> -615924.8763 300.0676 -105.7481 -42.1697
> 2579962.2125 -105.7481 -42.1697*
>
> *OPENING EXTENDED SYSTEM TRAJECTORY FILE*
> *Info: Initial time: 1 CPUs 0.00554295 s/step 15.5874 ns/day 0 MB memory*
> *Info: Initial time: 1 CPUs 0.00311856 s/step 27.7051 ns/day 0 MB memory*
> *Info: Initial time: 1 CPUs 0.00311102 s/step 27.7722 ns/day 0 MB memory*
> *Info: Benchmark time: 1 CPUs 0.00305723 s/step 28.2609 ns/day 0 MB memory*
> *Info: Benchmark time: 1 CPUs 0.00309188 s/step 27.9442 ns/day 0 MB memory*
> *OPENING COORDINATE DCD FILE*
> *WRITING COORDINATES TO DCD FILE
> /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000000250*
> *The last position output (seq=2000000250) takes 0.021 seconds, 0.000 MB of
> memory in use*
> *Info: Benchmark time: 1 CPUs 0.00367223 s/step 23.5279 ns/day 0 MB memory*
> *Info: Benchmark time: 1 CPUs 0.00301884 s/step 28.6203 ns/day 0 MB memory*
> *Info: Benchmark time: 1 CPUs 0.00303448 s/step 28.4727 ns/day 0 MB memory*
> *Info: Benchmark time: 1 CPUs 0.00303912 s/step 28.4293 ns/day 0 MB memory*
> *WRITING COORDINATES TO DCD FILE
> /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000000500*
> *The last position output (seq=2000000500) takes 0.017 seconds, 0.000 MB of
> memory in use*
> *WRITING COORDINATES TO DCD FILE
> /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000000750*
> *The last position output (seq=2000000750) takes 0.017 seconds, 0.000 MB of
> memory in use*
> *WRITING COORDINATES TO DCD FILE
> /scratch/rchnic009/pn10b_6RU/run21/pn10b_6RU_run21.dcd AT STEP 2000001000*
> *The last position output (seq=2000001000) takes 0.014 seconds, 0.000 MB of
> memory in use*
>
> I have looked through the relevant user manuals and the mailing list and
> haven't been able to shed any light on the issue. I have also experienced
> this exact same problem across four different MD simulations on four
> different A100 cards. I have also tried re-running my simulation from
> 2000ns, which runs fine to 2100ns, but it fails again when trying to run the
> next 100ns. Each time it jumps to the same negative timestep.
>
>
> Has anyone else experienced and solved this issue or have any idea of what
> may fix this problem?
>
>
>
> Please reach out if there is any information I may have omitted and thanks
> in advance for your time!
> Regards
> Nicole Richardson
>

This archive was generated by hypermail 2.1.6 : Tue Dec 13 2022 - 14:32:44 CST