From: Joseph Farran (jfarran_at_uci.edu)
Date: Wed Oct 23 2013 - 13:46:00 CDT
Greetings.
We have been running NAMD successfully for many moons on our campus cluster.
We recently added the BLCR (Berkeley Lab Checkpoint/Restart) checkpoint facility.
I know that NAMD has its own restart files, but for our user base, using BLCR with NAMD
would make things a lot easier.
NAMD appears to checkpoint under BLCR just fine, but fails on restart. We checked with the BLCR
support group, and they suspect that it may be a "bug" in NAMD, namely a file descriptor leak (see
email below).
The error we get when restarting NAMD with BLCR is:
- Failed to open file '/proc/58743/task/58743/stat'
- cr_restore_all_files [6446]: Unable to restore fd 3 (type=1,err=-2)
- cr_rstrt_child [6446]: Unable to restore files! (err=-2)
Restart failed: No such file or directory
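The failing descriptor (fd 3) refers to a file under /proc, so one way to confirm this before checkpointing is to list the running process's open descriptors and flag any that point under /proc. The following is only a minimal sketch, assuming a Linux system where /proc/<pid>/fd holds one symlink per open descriptor. The function name proc_backed_fds is hypothetical and is not part of NAMD or BLCR.

```python
import os


def proc_backed_fds(pid="self"):
    """Return {fd: target} for descriptors of `pid` whose target lies under /proc.

    Hypothetical helper: relies on the Linux /proc/<pid>/fd layout, where each
    entry is a symlink to the file behind that descriptor.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("/proc/"):
            found[int(name)] = target
    return found


if __name__ == "__main__":
    # Inspect the current process; pass a PID to inspect another process
    # that you own (for example the NAMD process about to be checkpointed).
    for fd, target in sorted(proc_backed_fds().items()):
        print(f"fd {fd} -> {target}")
```

Any descriptor reported for the NAMD process would be a candidate for the restore failure shown above, since the /proc path it names belongs to the original PID and no longer exists at restart time.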
Is anyone on the NAMD support staff able to verify whether this is a bug and whether it can be fixed?
Thank you,
Joseph A. Farran
University of California, Irvine
Office of Information Technology
209 Multipurpose Science & Technology
Irvine, CA 92697-2225
-------- Original Message --------
Subject: Re: [Checkpoint] BLCR and NAMD
Date: Sun, 13 Oct 2013 15:20:34 -0700
From: Paul Hargrove <phhargrove_at_lbl.gov>
To: Joseph Farran <jfarran_at_uci.edu>
CC: checkpoint <checkpoint_at_lbl.gov>
Joseph,
I am fairly certain this *is* a BLCR limitation, because to the best of my recollection we don't do anything exceptional for the case that an application has a file open under /proc.
In principle, it might be a "bug" in NAMD if this file is not open intentionally (a "file descriptor leak"). However, the inability to restore this open descriptor is still an unexpected/unintended limitation in BLCR. Since NAMD is a very real application, having it as a motivating case for fixing this limitation would be valuable.
-Paul
On Sun, Oct 13, 2013 at 3:09 PM, Joseph Farran <jfarran_at_uci.edu> wrote:
Thanks again Paul.
Let me check with NAMD folks before I open a bug report as it's probably not BLCR.
--
Paul H. Hargrove                          PHHargrove_at_lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:52 CST