NOTCam temperature monitoring
Because NOTCam must not be allowed to warm up unintentionally, a
temperature monitor runs whenever the instrument is not mounted. While
NOTCam is mounted, the monitor must be stopped, because the monitor and
NOTCam's user interfaces rely on a common resource that cannot be shared
(i.e., one won't work while the other is running).
As of 3/Jul/2006 the procedure to start and stop the monitor has changed,
and some internal checks have been introduced to make sure that the monitor
is running when it should be and stopped when it is not needed. Previously
it was started more or less "by hand" every time it was found stopped.
This is no longer the case.
Starting the monitor
- Log into marissa as obs user
- Type the command 'start-notcam-temp-mon'
Stopping the monitor
- Log into marissa as obs user
- Type the command 'stop-notcam-temp-mon'
Normal operation
When the monitor is operating normally, you can check NOTCam's status
in the file ~notcam/NOTCAM_temp, for example with
the command 'tail -f ~notcam/NOTCAM_temp'. Every few
minutes the monitor appends a line of measurements to this file. If
everything is working, you should see something like this at the end
of the file:
Mon Jul 3 10:43:29 2006 1151923409 -201.2 12.2 -199.7 -199.0 1.11e-04
Mon Jul 3 10:47:44 2006 1151923664 -201.2 12.1 -199.7 -199.0 1.10e-04
Mon Jul 3 11:17:00 2006 1151925420 -201.2 12.4 -199.7 -199.0 1.08e-04
Mon Jul 3 11:22:03 2006 1151925723 -201.1 12.5 -199.6 -199.0 1.08e-04
Every line starts with a timestamp, given both in human-readable form
(Mon Jul 3 10:43:29 2006) and as a Unix timestamp (1151923409),
followed by temperature data (table, outer vessel, center wheel and detector).
The last value is the internal pressure.
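If you want to read these lines from your own program, the field layout
described above can be parsed along these lines. This is only a sketch: the
names Sample and parse_line are made up for this example and are not part of
the monitor scripts, and no units are assumed for the values.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement line (hypothetical container, not part of the scripts)."""
    unix_time: int
    table: float
    outer_vessel: float
    center_wheel: float
    detector: float
    pressure: float


def parse_line(line: str) -> Sample:
    """Parse one log line into a Sample.

    Fields 0-4 are the human-readable date (weekday, month, day, time, year),
    field 5 is the Unix timestamp, fields 6-9 are the four temperatures and
    field 10 is the internal pressure.
    """
    fields = line.split()
    if len(fields) < 11:
        raise ValueError("not a measurement line: %r" % line)
    table, outer, wheel, detector, pressure = (float(f) for f in fields[6:11])
    return Sample(int(fields[5]), table, outer, wheel, detector, pressure)


if __name__ == "__main__":
    example = "Mon Jul 3 10:43:29 2006 1151923409 -201.2 12.2 -199.7 -199.0 1.11e-04"
    print(parse_line(example))
```

The Unix timestamp is the convenient field to use for any time arithmetic,
since it avoids parsing the human-readable date.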
When you stop the monitor, the stop script appends a line saying so. This
way an unusual gap in the log (e.g., no updates in the last few hours) is
not mistaken for a failure:
Mon Jul 3 10:04:59 2006 1151921099 -201.2 11.8 -199.8 -199.0 1.12e-04
Mon Jul 3 10:10:00 2006 1151921400 -201.2 12.0 -199.9 -199.0 1.11e-04
Mon Jul 3 10:15:02 2006 1151921702 -201.2 11.9 -199.8 -199.0 1.12e-04
Mon Jul 03 10:15:33 2006 1151921733 0.0 0.0 0.0 0.0 0.0 # STOP
In the same fashion, the start script appends a line to the file:
Mon Jul 03 10:20:19 2006 1151922019 0.0 0.0 0.0 0.0 0.0 # START
Mon Jul 3 10:21:01 2006 1151922061 -201.2 11.9 -199.8 -199.0 1.12e-04
Mon Jul 3 10:28:01 2006 1151922481 -201.2 11.9 -199.8 -199.0 1.12e-04
Mon Jul 3 10:34:46 2006 1151922886 -201.2 12.0 -199.8 -199.0 1.10e-04
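The marker lines make it possible to tell a deliberate stop from a silent
failure when scanning the log. A minimal sketch of that idea follows. The
function names and the max_gap_s threshold are assumptions made for this
example; they do not come from the monitor scripts.

```python
from typing import Iterable, List, Optional, Tuple


def read_entry(line: str) -> Tuple[int, Optional[str]]:
    """Return (unix_time, marker); marker is the text after '#', or None."""
    data, _, comment = line.partition("#")
    unix_time = int(data.split()[5])
    marker = comment.strip() or None
    return unix_time, marker


def unexplained_gaps(lines: Iterable[str], max_gap_s: int = 1800) -> List[Tuple[int, int]]:
    """List (start, end) Unix times of gaps longer than max_gap_s seconds
    that do not begin at a STOP marker line. max_gap_s is a hypothetical
    threshold chosen for illustration."""
    gaps = []
    prev = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        t, marker = read_entry(line)
        if prev is not None:
            prev_t, prev_marker = prev
            if t - prev_t > max_gap_s and prev_marker != "STOP":
                gaps.append((prev_t, t))
        prev = (t, marker)
    return gaps


if __name__ == "__main__":
    log = [
        "Mon Jul 3 10:15:02 2006 1151921702 -201.2 11.9 -199.8 -199.0 1.12e-04",
        "Mon Jul 03 10:15:33 2006 1151921733 0.0 0.0 0.0 0.0 0.0 # STOP",
        "Mon Jul 03 10:20:19 2006 1151922019 0.0 0.0 0.0 0.0 0.0 # START",
        "Mon Jul 3 10:21:01 2006 1151922061 -201.2 11.9 -199.8 -199.0 1.12e-04",
    ]
    print(unexplained_gaps(log, max_gap_s=120))
```

Note that the marker lines carry 0.0 in every measurement field, so a program
that plots or averages the data should skip lines that have a marker.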
Scripts, checks and alarms
A cron job run by the obs user every 15 minutes checks the
last 3 entries in the LOGFILE (see below), that is, it
scans three samples spanning about 10 minutes. It takes the detector
temperature and checks whether:
- The temperature is below -80 (if it's over that, we're probably warming
NOTCam and we shouldn't issue alarm messages);
- The temperature is over -197 (our alarm point); and,
- The temperature is higher than 10 minutes ago (if it's lower, someone
has already filled NOTCam)
If all three conditions are met, the script will start sending messages
(every 15 minutes, with every check) until someone fills NOTCam. The actual
message and recipient are defined in the /usr/local/bin/mailcmd
script (right now, lstaff will receive those messages).
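The three conditions can be expressed as a single test. The sketch below is
not the actual cron script. It assumes that "10 minutes ago" means the oldest
of the last three samples, and the function name should_alarm is made up for
this example.

```python
from typing import Sequence

WARMING_LIMIT = -80.0   # above this, NOTCam is probably being warmed on purpose
ALARM_POINT = -197.0    # alarm point for the detector temperature


def should_alarm(detector_temps: Sequence[float]) -> bool:
    """Decide whether to send an alarm message.

    detector_temps holds the detector temperature of the most recent log
    entries, oldest first. Only the last three are used.
    """
    if len(detector_temps) < 3:
        return False  # not enough samples to compare against
    oldest, latest = detector_temps[-3], detector_temps[-1]
    return (
        latest < WARMING_LIMIT      # not a deliberate warm-up
        and latest > ALARM_POINT    # warmer than the alarm point
        and latest > oldest         # still rising, so nobody has filled yet
    )


if __name__ == "__main__":
    print(should_alarm([-197.5, -196.8, -196.0]))  # rising past the alarm point
    print(should_alarm([-199.0, -199.0, -199.0]))  # normal operation
```

With the sample log lines shown earlier (detector at -199.0), none of the
checks would trigger a message.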
What follows is mostly technical information about the way the scripts and
checks work, so you don't really need to read this if you're not going to do
maintenance work on the system. You can use this information as an "official"
protocol if you want to interact with the system from your own programs,
instead of running the scripts.
The system scripts use a number of common environment variables, defined
in the file ~notcam/tcl-uif/common-vars. These variables give the
locations of several control files and related settings, namely:
- STOPFILE: Location of the "stop file". The very existence of this file
tells the system: "please, don't run the monitor if it's stopped". It is
created by stop-notcam-temp-mon and removed by start-notcam-temp-mon.
- PIDFILE: If it exists, it contains the process ID of the monitoring
script. The system uses it as a first, convenient way to find the running
script, for example in order to kill it.
- LOGFILE: The file where the monitoring program logs its data.
- DAEMON: The name of the monitoring process.
- DAEMONCMD: The command used to start the process.
- DEADLIMIT: Amount of time (in seconds) that the system waits before it
starts issuing error messages when DAEMON is not running and
STOPFILE is not there.
A cron job (run on behalf of user obs) checks a number of things every
minute, in this exact order (the order affects the way the check
works):
- Existence of PIDFILE. If it exists, use the value it contains
to find the running process. If there's no such process, or it is not
DAEMON, then remove PIDFILE.
- If there's no PIDFILE (it may have been removed in the previous step),
check whether DAEMON is running anyway, as it could have been
started manually without using the canonical scripts. If it is,
create PIDFILE containing the detected process's PID.
- If there is neither a STOPFILE nor a PIDFILE,
this is understood as a crash of the monitor process. Restart
the process.
After spawning the process, run a timestamp check, comparing the last
entry on LOGFILE with the current time. If the difference is
bigger than DEADLIMIT, send a mail warning about it.
- If STOPFILE exists and there's also a PIDFILE,
the process is killed and the PIDFILE removed.
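The order of these checks matters, which is easier to see when they are
written as one function. The sketch below models the steps as a pure
decision function that returns a list of actions instead of touching files or
processes. It is not the actual cron script: the function name, its
parameters and the action strings are all made up for this example.

```python
from typing import List, Optional


def watchdog_actions(
    pid_in_file: Optional[int],         # PID read from PIDFILE, None if no PIDFILE
    pid_is_daemon: bool,                # True if that PID is a live DAEMON process
    running_daemon_pid: Optional[int],  # PID of a running DAEMON found by search
    stopfile_exists: bool,
    last_log_time: int,                 # Unix time of the last LOGFILE entry
    now: int,                           # current Unix time
    deadlimit: int,                     # DEADLIMIT, in seconds
) -> List[str]:
    actions = []
    have_pidfile = pid_in_file is not None

    # Step 1: a PIDFILE that does not point at a live DAEMON is stale.
    if have_pidfile and not pid_is_daemon:
        actions.append("remove PIDFILE")
        have_pidfile = False

    # Step 2: adopt a DAEMON that was started without the canonical scripts.
    if not have_pidfile and running_daemon_pid is not None:
        actions.append("write PIDFILE %d" % running_daemon_pid)
        have_pidfile = True

    # Step 3: no STOPFILE and no PIDFILE is treated as a crash.
    if not stopfile_exists and not have_pidfile:
        actions.append("restart DAEMON")
        if now - last_log_time > deadlimit:
            actions.append("send warning mail")

    # Step 4: a STOPFILE means the monitor must not be running.
    if stopfile_exists and have_pidfile:
        actions.append("kill DAEMON")
        actions.append("remove PIDFILE")

    return actions


if __name__ == "__main__":
    # A monitor started by hand while the STOPFILE is present gets adopted
    # in step 2 and then killed in step 4.
    print(watchdog_actions(None, False, 42, True, 0, 0, 600))
```

Because step 2 runs before step 4, a monitor started by hand while the
STOPFILE exists is first recorded in the PIDFILE and then killed within
the same run of the check.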