From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Nov 17 2022 - 10:19:43 CST
Hello,
my computer (GA-X79-UD3 with two 680 GPUs) can no longer run NAMD-CUDA.
The system is:
Debian10 Linux,
$ uname -r
5.10.0-19-amd64
NAMD_Git-2022-07-21_Linux-x86_64-multicore-CUDA
Driver Version: 470.141.03 CUDA Version: 11.4
The runs below were preceded by:
nvidia-smi -pm 1
Error with both devices:
namd2 +idlepoll +p12 +devices 0,1 min.conf
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
[Partition 0][Node 0] End of program
Error with device 0:
namd2 +idlepoll +p12 +devices 0 min.conf
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was
encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal
memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal
memory access was encountered
Error with device 1:
namd2 +idlepoll +p12 +devices 1 min.conf
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was
encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was
encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
memory access was encountered
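To compare the three runs at a glance, the FATAL ERROR lines can be grouped by device and failing call. The following is a minimal Python sketch, not part of NAMD: the function name summarize_fatal_errors and the regular expression are hypothetical, written only against the message format quoted above, and may need adjusting for other NAMD builds.

```python
import re
from collections import Counter

# Assumption: FATAL ERROR messages have the shape quoted in this post,
# "... CUDA error <what> on Pe <n> (<host> device <d> pci <x:y:z>): <reason>".
FATAL_RE = re.compile(
    r"FATAL ERROR: CUDA error (?P<what>.+?) on Pe (?P<pe>\d+) "
    r"\((?P<host>\S+) device (?P<device>\d+) pci (?P<pci>[0-9a-fA-F:]+)\): "
    r"(?P<reason>.+?)(?= FATAL ERROR:| \[Partition|$)"
)

def summarize_fatal_errors(log_text):
    """Count FATAL ERROR messages per (device, failing call, reason)."""
    # The log is wrapped across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in FATAL_RE.finditer(flat):
        # Drop run-specific polling counts and timings so repeats group together.
        what = re.sub(r" after polling \d+ times over [\d.]+ s", "", m.group("what"))
        counts[(int(m.group("device")), what, m.group("reason"))] += 1
    return counts
```

Calling summarize_fatal_errors on the saved output of each run returns a Counter keyed by device number, failing call and reason, which makes it easy to see whether both devices report the same failure.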
This error first appeared months ago, with previous versions of the CUDA
driver and Linux kernel, and it persists with the new driver and kernel.
My question here is whether these errors may arise from wrong usage of namd
(I am using the same commands that used to work long ago).
Computer engineers say that these can't be hardware errors. Indeed, if my
namd commands above really selected one GPU or the other, a memory failure
is unlikely, since both devices fail in the same way.
Thanks for any advice
francesco pietra