NOTCam temperature monitoring

Because we must take special care not to let NOTCam warm up when it should not, a temperature monitor runs whenever the instrument is not mounted. When NOTCam is mounted, the monitor must be stopped, because it and NOTCam's user interfaces both use a common resource that cannot be shared (i.e. one won't work while the other is running).

As of 3/Jul/2006 the procedure to start and stop the monitor has changed, and some internal checks have been introduced to make sure that the monitor is running when it should be and stopped when it is not needed. Previously it was restarted more or less "by hand" every time it was found stopped. This is no longer the case.

Starting the monitor

Log into marissa as obs user
Type the command 'start-notcam-temp-mon'

Stopping the monitor

Log into marissa as obs user
Type the command 'stop-notcam-temp-mon'

Normal operation

When the monitor is operating normally, you can check NOTCam's status in the file ~notcam/NOTCAM_temp, for example with the command 'tail -f ~notcam/NOTCAM_temp'. Every few minutes the monitor appends a line of measurements to this file. If everything is going right, you should see something like this at the end of the file:

    Mon Jul  3 10:43:29 2006 1151923409  -201.2 12.2 -199.7 -199.0 1.11e-04
    Mon Jul  3 10:47:44 2006 1151923664  -201.2 12.1 -199.7 -199.0 1.10e-04
    Mon Jul  3 11:17:00 2006 1151925420  -201.2 12.4 -199.7 -199.0 1.08e-04
    Mon Jul  3 11:22:03 2006 1151925723  -201.1 12.5 -199.6 -199.0 1.08e-04

Every line starts with a timestamp, given both in human-readable form (Mon Jul 3 10:43:29 2006) and as a Unix timestamp (1151923409), followed by four temperatures (table, outer vessel, center wheel and detector, in that order). The last value is the internal pressure.
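
If you want to read the log from your own programs, the line layout described above can be parsed along these lines. This is only a minimal sketch: the names Sample and parse_sample are made up for illustration, and decoding the Unix timestamp as UTC is an assumption (it happens to agree with the human-readable time in the sample lines).

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Sample(NamedTuple):
    """One measurement line (illustrative type, not part of the monitor)."""
    when: datetime        # decoded from the Unix timestamp column
    table: float
    outer_vessel: float
    center_wheel: float
    detector: float
    pressure: float


def parse_sample(line: str) -> Sample:
    """Parse one line of the temperature log into a Sample."""
    fields = line.split()
    # Fields 0-4 hold the human-readable date; field 5 is the Unix timestamp.
    unix_time = int(fields[5])
    # Fields 6-10: four temperatures followed by the pressure.
    table, outer_vessel, center_wheel, detector, pressure = map(float, fields[6:11])
    # Assumption: the Unix timestamp is interpreted as UTC.
    when = datetime.fromtimestamp(unix_time, tz=timezone.utc)
    return Sample(when, table, outer_vessel, center_wheel, detector, pressure)


sample = parse_sample(
    "Mon Jul  3 10:43:29 2006 1151923409  -201.2 12.2 -199.7 -199.0 1.11e-04")
print(sample.detector, sample.pressure)
```

Using the Unix timestamp rather than the human-readable date avoids having to cope with the two slightly different date spellings that appear in the file.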

When you stop the monitor, the stop script appends a line saying so. This way an unusual gap in the log (e.g. no updates in the last few hours) is not mistaken for a failure:

    Mon Jul  3 10:04:59 2006 1151921099  -201.2 11.8 -199.8 -199.0 1.12e-04
    Mon Jul  3 10:10:00 2006 1151921400  -201.2 12.0 -199.9 -199.0 1.11e-04
    Mon Jul  3 10:15:02 2006 1151921702  -201.2 11.9 -199.8 -199.0 1.12e-04
    Mon Jul 03 10:15:33 2006 1151921733 0.0 0.0 0.0 0.0 0.0 # STOP

In the same fashion, the start script appends a line to the file:

    Mon Jul 03 10:20:19 2006 1151922019 0.0 0.0 0.0 0.0 0.0 # START
    Mon Jul  3 10:21:01 2006 1151922061  -201.2 11.9 -199.8 -199.0 1.12e-04
    Mon Jul  3 10:28:01 2006 1151922481  -201.2 11.9 -199.8 -199.0 1.12e-04
    Mon Jul  3 10:34:46 2006 1151922886  -201.2 12.0 -199.8 -199.0 1.10e-04
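
A program reading the log can use these marker lines to tell a deliberate stop from a silent failure. The following is a rough sketch, assuming the markers always appear as a trailing "# START" or "# STOP" comment as in the samples above; the function names are hypothetical.

```python
def marker(line: str):
    """Return 'START' or 'STOP' for a marker line, or None for a measurement line."""
    _, sep, comment = line.partition("#")
    tag = comment.strip()
    return tag if sep and tag in ("START", "STOP") else None


def stopped_on_purpose(lines) -> bool:
    """True if the most recent non-empty log line is a STOP marker."""
    for line in reversed(lines):
        if line.strip():
            return marker(line) == "STOP"
    return False  # empty log: nothing tells us the monitor was stopped


log_tail = [
    "Mon Jul  3 10:15:02 2006 1151921702  -201.2 11.9 -199.8 -199.0 1.12e-04",
    "Mon Jul 03 10:15:33 2006 1151921733 0.0 0.0 0.0 0.0 0.0 # STOP",
]
print(stopped_on_purpose(log_tail))
```

Note that the marker lines carry 0.0 in every measurement column, so they should be filtered out before the temperatures are plotted or averaged.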

Scripts, checks and alarms

A cron job run by user obs every 15 minutes checks the last three entries in LOGFILE (see below), that is, three samples spanning roughly 10 minutes. It takes the detector temperature and checks whether:

The temperature is below -80 (if it's above that, we're probably warming NOTCam on purpose and shouldn't issue alarm messages);
The temperature is above -197 (our alarm point); and,
The temperature is higher than it was 10 minutes ago (if it's lower, someone has already filled NOTCam)

If all three conditions are met, the script sends a message with every check (that is, every 15 minutes) until someone fills NOTCam. The actual message and recipient are defined in the /usr/local/bin/mailcmd script (right now, lstaff receives those messages).
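
The three conditions can be written down as a small function. This is a sketch rather than the actual cron script: it assumes the check compares the newest of the three detector readings against the oldest one, and the names used here are invented for the example.

```python
WARMING_LIMIT = -80.0   # above this, NOTCam is presumably being warmed on purpose
ALARM_POINT = -197.0    # alarm point quoted in the text


def should_alarm(detector_temps) -> bool:
    """Decide whether to alarm, given the last three detector readings, oldest first."""
    oldest, latest = detector_temps[0], detector_temps[-1]
    return (latest < WARMING_LIMIT      # not a deliberate warm-up
            and latest > ALARM_POINT    # warmer than the alarm point
            and latest > oldest)        # still rising compared with about 10 minutes ago


print(should_alarm([-196.8, -196.5, -196.2]))  # warm and rising
print(should_alarm([-190.0, -193.0, -196.0]))  # warm but cooling again
```

The third condition is what silences the alarm as soon as a refill starts to take effect, even though the detector may still be warmer than the alarm point.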

What follows is mostly technical information about the way the scripts and checks work, so you don't really need to read this if you're not going to do maintenance work on the system. You can use this information as an "official" protocol if you want to interact with the system from your own programs, instead of running the scripts.

The system scripts share a number of environment variables, defined in the file ~notcam/tcl-uif/common-vars. These variables give the locations of several control files and a few related settings. Namely:

STOPFILE: Location of the "stop file". The very existence of this file tells the system: "please, don't run the monitor if it's stopped". It's created by stop-notcam-temp-mon. start-notcam-temp-mon removes it.
PIDFILE: If it exists, it contains the process ID of the monitoring script. The system uses it as a first, convenient way to find the running script, for example in order to kill it.
LOGFILE: The file where the monitoring program logs its data
DAEMON: The name of the monitoring process
DAEMONCMD: Command you have to type to start the process
DEADLIMIT: Amount of time (in seconds) that the system will wait until it starts issuing error messages if DAEMON is not running, and STOPFILE is not there

There's a cronjob (run on behalf of user obs) that checks every minute a number of things, in this exact order (which affects the way it works):

Existence of PIDFILE. If it exists, use the value it contains to find the running process. If there's no such process, or it is not DAEMON, then remove PIDFILE.
If there's no PIDFILE (it may have been removed in the previous step), check whether DAEMON is running anyway, as it could have been started manually without using the canonical scripts. If it is, create PIDFILE containing the detected process's PID.
If there's neither a STOPFILE nor a PIDFILE, this is understood as a crash of the monitor process. Restart the process.
After spawning the process, run a timestamp check, comparing the last entry on LOGFILE with the current time. If the difference is bigger than DEADLIMIT, send a mail warning about it.
If STOPFILE exists and there's also a PIDFILE, this process will be killed, and the PIDFILE removed.
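
The order of these checks can be sketched as a pure decision function that, given a snapshot of the system state, returns the actions one pass would take. This is an illustration only, not the real cron script: the State fields and action strings are invented, and it assumes the timestamp check runs only right after a restart, which is one possible reading of the description above.

```python
from dataclasses import dataclass


@dataclass
class State:
    pidfile_exists: bool    # PIDFILE is present
    pid_is_daemon: bool     # the PID stored in PIDFILE is a live DAEMON process
    daemon_running: bool    # some DAEMON process exists, however it was started
    stopfile_exists: bool   # STOPFILE is present
    last_log_time: float    # Unix time of the last LOGFILE entry
    now: float              # current Unix time
    deadlimit: float        # DEADLIMIT, in seconds


def watchdog_actions(s: State):
    """Return the actions one watchdog pass would take, in order."""
    actions = []
    has_pid = s.pidfile_exists
    # Step 1: a PIDFILE that does not point at a live DAEMON is stale.
    if has_pid and not s.pid_is_daemon:
        actions.append("remove PIDFILE")
        has_pid = False
    # Step 2: adopt a DAEMON that was started without the canonical scripts.
    if not has_pid and s.daemon_running:
        actions.append("create PIDFILE")
        has_pid = True
    # Steps 3 and 4: no STOPFILE and no PIDFILE means a crash; restart,
    # then (assumption) check how old the last log entry is.
    if not s.stopfile_exists and not has_pid:
        actions.append("start DAEMON")
        if s.now - s.last_log_time > s.deadlimit:
            actions.append("send warning mail")
    # Step 5: a stop was requested but the monitor is still running.
    if s.stopfile_exists and has_pid:
        actions.append("kill DAEMON")
        actions.append("remove PIDFILE")
    return actions


crashed = State(False, False, False, False, last_log_time=0, now=1000, deadlimit=600)
print(watchdog_actions(crashed))
```

Because the steps run in a fixed order, a stale PIDFILE is cleaned up before the crash test is made, so a monitor that died without removing its PIDFILE is still restarted within the same pass.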