From: Leandro Martínez
Date: Tue May 22 2007 - 12:55:22 CDT
It is true that the memory paranoid binary fails to run on two processors
on the same machine; however, we can run the simulations stably on a
single node. I now think that this may be a problem with the memory
paranoid binary in particular.
Brian, sorry for not telling you this before. I have now run all the charm++
tests. There is one charm test (with both charmrun and mpirun)
that clearly has problems. It is the "checkpoint" test in the
directory tests/charm++/chkpt
The log file of one of the compilations is at
There are some errors related to fortran 90 files, but
it says that charm++ was built successfully.
The error does not occur if we run locally with two processors (charmrun
++local +p2).
The error persists if we use binaries that run fine on other machines.
We have tested this with two different charm++ compilations, one for
net-linux-amd64 and another for net-linux-amd64-smp-tcp, and we have
also tried "mpirun". The errors are the following, and could be related
to our problem:
Using net-linux-amd64 (from the tests/charm++/chkpt directory):
The command is: ./charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh
Charm++: scheduler running in netpoll mode.
Running Hello on 4 processors for 8 elements
myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
myClient. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
[0] Checkpoint starting in log
Main's PUPer. a=123(0x77a95c), b[0]=456(0x77a960), b[1]=789.
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Failed to create checkpoint file for group table!
Stack Traceback:
[0] CmiAbort+0x55 [0x4cc66b]
[1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xc0 [0x4a7674]
[3] CkDeliverMessageFree+0x30 [0x46e818]
[8] _Z15_processHandlerPvP11CkCoreState+0x130 [0x4733c0]
[9] CmiHandleMessage+0xa5 [0x4d36c1]
[10] CsdScheduleForever+0x75 [0x4d3a82]
[11] CsdScheduler+0x16 [0x4d39e5]
[13] ConverseInit+0x2f6 [0x4d1cba]
[14] main+0x2d [0x476b61]
[15] __libc_start_main+0xf4 [0x2b94b4c03134]
[16] __gxx_personality_v0+0x91 [0x45d5a9]
Fatal error on PE 1> Failed to create checkpoint file for group table!
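The C++ frames in the traceback above are mangled, which makes them harder to read. The following is a minimal sketch, not part of charm++ or NAMD, that recovers only the qualified function name from the simple mangled symbols seen in these logs. The helper name `qualified_name` is hypothetical, and the parser deliberately ignores argument types, templates and substitutions; a full demangler such as `c++filt` handles those cases.

```python
import re

def qualified_name(symbol):
    """Return the 'A::B' name of a simple Itanium-ABI mangled symbol, or None.

    Handles only plain names (_Z<len><name>...) and plain nested names
    (_ZN<len><name><len><name>...E...). Anything else returns None.
    """
    if not symbol.startswith("_Z"):
        return None          # not a mangled C++ symbol (e.g. CmiAbort)
    pos = 2
    nested = symbol.startswith("N", pos)
    if nested:
        pos += 1             # skip the 'N' that opens a nested name
    parts = []
    while True:
        m = re.match(r"\d+", symbol[pos:])
        if m is None:
            break            # no further <length><identifier> component
        length = int(m.group())
        start = pos + m.end()
        parts.append(symbol[start:start + length])
        pos = start + length
        if not nested:
            break            # a non-nested name has a single component
    return "::".join(parts) if parts else None

# Frame [1] of the traceback, including its +offset suffix.
print(qualified_name("_ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xc0"))
```

Applied to frame [1], this gives the class and method where the abort was raised, namely the Checkpoint method of CkCheckpointMgr.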
Using mpirun (mpirun n0-1 -np 4 ./hello) from the corresponding
MPI-compiled charm++ test directory:
Running Hello on 4 processors for 8 elements
myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
myClient. a=123(0x70cb7c), b[0]=456(0x70cb80), b[1]=789.
[0] Checkpoint starting in log
------------- Processor 3 Exiting: Called CmiAbort ------------
Reason: Failed to create checkpoint file for group table!
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Failed to create checkpoint file for group table!
Stack Traceback:
[0] CmiAbort+0x2f [0x4c1710]
[1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xba [0x49d8f2]
[3] CkDeliverMessageFree+0x2e [0x467a3c]
[4] ./hello [0x467a99]
[5] ./hello [0x467b04]
[6] ./hello [0x46aa87]
[7] ./hello [0x46adee]
[8] _Z15_processHandlerPvP11CkCoreState+0x118 [0x46bc18]
[9] CmiHandleMessage+0x7a [0x4c294b]
[10] CsdScheduleForever+0x5f [0x4c2ba6]
[11] CsdScheduler+0x16 [0x4c2b1f]
[12] ./hello [0x4c13f6]
[13] ConverseInit+0x2dd [0x4c16df]
[14] main+0x2b [0x46f1d3]
[15] __libc_start_main+0xf4 [0x2ba4999ac134]
[16] __gxx_personality_v0+0x79 [0x457c99]
Stack Traceback:
[0] CmiAbort+0x2f [0x4c1710]
[1] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallback+0xba [0x49d8f2]
[3] CkDeliverMessageFree+0x2e [0x467a3c]
[4] ./hello [0x467a99]
[5] ./hello [0x467b04]
[6] ./hello [0x46aa87]
[7] ./hello [0x46adee]
[8] _Z15_processHandlerPvP11CkCoreState+0x118 [0x46bc18]
[9] CmiHandleMessage+0x7a [0x4c294b]
[10] CsdScheduleForever+0x5f [0x4c2ba6]
[11] CsdScheduler+0x16 [0x4c2b1f]
[12] ./hello [0x4c13f6]
[13] ConverseInit+0x2dd [0x4c16df]
[14] main+0x2b [0x46f1d3]
[15] __libc_start_main+0xf4 [0x2b147a85d134]
[16] __gxx_personality_v0+0x79 [0x457c99]
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 30636 failed on node n0 ( with exit status 1.
Additional information:
Just to illustrate another kind of problem we have observed, which has
now occurred in an mpi run of the apoa benchmark: the simulation stops because
some atom is moving too fast. However, check the velocities:
TIMING: 15840 CPU: 9285.78, 0.509619/step Wall: 9285.78, 0.509619/step,
705.562 hours remaining, 126605 kB of memory in use.
ERROR: Atom 48403 velocity is -11.8617 -2.45979e+87 8.48634 (limit is 10000)
ERROR: Atoms moving too fast; simulation has become unstable.
ERROR: Exiting prematurely.
WallClock: 9291.767578 CPUTime: 9291.767578 Memory: 126585 kB
End of program
Clearly this is data corruption, not a "physical" problem.
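To make the reasoning concrete: one velocity component is about 2.5e+87, roughly 83 orders of magnitude above the limit of 10000, while the other two components are ordinary. The sketch below parses an error line in the format quoted above and separates "above the limit" from "implausibly far above the limit". The function name `check_velocity` and the threshold `corruption_factor` are assumptions made for illustration; they are not part of NAMD.

```python
import math
import re

# Matches the NAMD error line format quoted in this message.
VELOCITY_LINE = re.compile(
    r"ERROR: Atom (\d+) velocity is (\S+) (\S+) (\S+) \(limit is (\S+)\)"
)

def check_velocity(line, corruption_factor=1e6):
    """Parse a velocity error line and classify it.

    corruption_factor is an arbitrary, hypothetical threshold: a component
    larger than corruption_factor * limit (or not finite) is treated as
    likely corrupted data rather than a physically unstable simulation.
    """
    m = VELOCITY_LINE.search(line)
    if m is None:
        return None
    components = [float(m.group(i)) for i in (2, 3, 4)]
    limit = float(m.group(5))
    finite = [abs(c) for c in components if math.isfinite(c)]
    worst = max(finite) if finite else 0.0
    non_finite = len(finite) < len(components)
    return {
        "atom": int(m.group(1)),
        "components": components,
        "limit": limit,
        "exceeds_limit": non_finite or worst > limit,
        "looks_corrupted": non_finite or worst > corruption_factor * limit,
    }

line = "ERROR: Atom 48403 velocity is -11.8617 -2.45979e+87 8.48634 (limit is 10000)"
report = check_velocity(line)
print(report["atom"], report["exceeds_limit"], report["looks_corrupted"])
```

For the line above, both flags are set. A velocity only slightly over the limit would set `exceeds_limit` alone, which is the pattern one would expect from a genuinely unstable simulation.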
On 5/22/07, Brian Bennion wrote:
> Hello Leandro,
> I sent several messages last week asking about the charm compilation.
> Did you get the charm++ test to work?
> The fact that memory paranoid caught a bad memory access leads me to
> believe the charm++ underlayer is not compiled correctly.
> Brian
> At 06:17 AM 5/22/2007, Leandro Martínez wrote:
> Hi Gengbin, Brian and others,
> I have compiled namd for mpi, and the simulation also crashed, with
> the message given at the end of the email. The same simulation has been
> running on our opteron cluster for more than four days now (more than two
> million steps).
> The apoa benchmark also crashed using mpi, this message
> was observed after step 41460 and was only
> (command line: mpirun n0-1 -np 4 ./namd2 apoa1.namd):
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
> PID 8711 failed on node n0 ( ) due to signal 11.
> Using a binary compiled with "-memory os" and "-thread context"
> (running with +netpoll) the simulation (the apoa benchmark) crashes
> at the first timestep, with (same thing with our simulation):
> Info: Finished startup with 22184 kB of memory in use.
> ------------- Processor 0 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory
> Stack Traceback:
> [0] /lib/ [0x2b24659505c0]
> [1] _ZN10Controller9threadRunEPS_+0 [0x5d4520]
> Fatal error on PE 0> segmentation violation
> The best insight we have had, I think, is the fact that the "memory
> paranoid" executable running our simulation
> does not crash on dual-processor opteron
> machines, but crashes on dual-core machines when
> running with more than one process per node, before
> the first time step of the simulation. The apoa simulation
> does not crash before the first time step, but we haven't
> run it for long.
> I suspect that there
> is some problem with memory sharing on dual-core machines.
> Does anybody else have clusters running with this
> kind of architecture? If so, what is the amount
> of memory per node?
> Clearly we cannot rule out some low-level communication problem.
> However, as I said before, we have already changed every
> piece of hardware and software (except the power supplies of the
> cpus, I think; could that be the cause, for some odd reason?).
> Any clue?
> Leandro.
> ------------------
> Crash of our benchmark using the mpi compiled namd with:
> mpirun n0-1 -np 4 ./namd2 test.namd
> ENERGY: 6100 20313.1442 13204.7818 1344.4563
> -253294.1741 24260.9686 0.0000 0.0000
> 53056.0154 -140975.8980 298.9425 -140488.7395 -
> 299.1465 -1706.3395 -1586.1351 636056.0000
> -1691.7028 -1691.3721
> FATAL ERROR: pairlist i_upper mismatch!
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: pairlist i_upper mismatch!
> Stack Traceback:
> [0] CmiAbort+0x2f [0x734220]
> [1] _Z8NAMD_bugPKc+0x4f [0x4b7aaf]
> [2] _ZN20ComputeNonbondedUtil9calc_pairEP9nonbonded+0x52c [0x54755c]
> [3] _ZN20ComputeNonbondedPair7doForceEPP8CompAtomPP7Results+0x580 [0x50a8f0]
> [4] _ZN16ComputePatchPair6doWorkEv+0xca [0x5bc5da]
> [5] [0x683a5d]
> [6] CkDeliverMessageFree+0x2e [0x6d2aa8]
> [7] ./namd2 [0x6d2b05]
> [8] ./namd2 [0x6d2b70]
> [9] ./namd2 [0x6d5af3]
> [10] ./namd2 [0x6d5e5a]
> [11] _Z15_processHandlerPvP11CkCoreState+0x118 [0x6d6c84]
> [12] CmiHandleMessage+0x7a [0x73545b]
> [13] CsdScheduleForever+0x5f [0x7356b6]
> [14] CsdScheduler+0x16 [0x73562f]
> [15] _ZN9ScriptTcl3runEPc+0x11a [0x664c4a]
> [16] main+0x125 [0x4b9ef5]
> [17] __libc_start_main+0xf4 [0x2ab5f4b1f134]
> [18] __gxx_personality_v0+0xf1 [0x4b72e9]
