The Check System Condition command checks for critical system
messages and sends break messages to a list of users.
The system sends critical system messages such as 'mirroring has been
suspended' or 'a disk storage capacity threshold has been reached' to
QSYSOPR (see the later discussion for more details). Because many
messages may exist in QSYSOPR, critical messages may be overlooked.
If QSYSOPR is not in break mode, no one may be aware of a critical
system condition. Critical system messages are also sent to the
QSYSMSG message queue if it exists.
CHKSYSCND submits a never-ending batch job that continually monitors
QSYSMSG.
The major advantages of CHKSYSCND are:
** It selects the critical message IDs rather than every message
that arrives in QSYSOPR.
** It uses the SHOUT TAA tool to send messages to a named list of
users so that a critical condition is more likely to be
noticed.
To use CHKSYSCND, you must first create the QSYSMSG message queue in
QSYS if it does not already exist. No other program can be reading
this message queue and it cannot be in break mode to a workstation.
Any message sent to QSYSMSG is removed from the message queue once
CHKSYSCND has received it. If you want to process some of the message
IDs not handled by CHKSYSCND, see the later instructions for modifying
the programs.
The QSYSMSG message queue is described in detail in the CL
Programmer's Guide. To create the queue specify:
CRTMSGQ MSGQ(QSYS/QSYSMSG)
TEXT('Message queue for critical system messages')
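If the setup is scripted, the existence check can be folded into a
small CL program. The following is a minimal sketch and is not part of
the tool. The program name CRTSYSMSGQ is hypothetical, and the sketch
assumes CHKOBJ signals escape message CPF9801 (object not found) when
the queue is missing.

```cl
/* CRTSYSMSGQ - Create QSYSMSG if missing (hypothetical name) */
PGM
CHKOBJ     OBJ(QSYS/QSYSMSG) OBJTYPE(*MSGQ)
/* CPF9801 = object not found, so create the queue */
MONMSG     MSGID(CPF9801) EXEC(DO)
CRTMSGQ    MSGQ(QSYS/QSYSMSG) +
             TEXT('Message queue for critical system messages')
ENDDO
ENDPGM
```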
A typical approach to start CHKSYSCND would be to use the command in
the auto start job for the controlling subsystem. See the later
discussion for how to do this. You must have *JOBCTL special
authority to use CHKSYSCND. A typical command would be:
CHKSYSCND USERS(QSYSOPR QSECOFR JONES *FIRSTUSER)
This would cause the CHKSYSCND job to be submitted which would
continually monitor QSYSMSG.
The value *FIRSTUSER is a special value intended for the case when
none of the users specified are active. If this occurs, the active
users are checked and the user with the highest user class (e.g.
*SYSOPR, *PGMR etc.) is sent a generic message and requested to
inform the system administrator of the critical condition.
The value *ALLACTIVE may be specified (instead of *FIRSTUSER) to send
a similar message to all active users.
Because the CHKSYSCND function is assumed to be a critical job, any
failures found within the job will cause a 'critical system
condition' message to be sent. The CHKSYSCND command can be ended
externally (ENDJOB) without causing this message.
Excess interruptions
--------------------
The intent of CHKSYSCND is to bring to the attention of the system
administrator the fact that a critical system condition exists. Once
this is known and a plan has been made for correction, it is probably
desirable to re-adjust CHKSYSCND to avoid the list of users being
annoyed on an hourly basis until the problem is fixed.
For example, if you are running out of addresses, the normal recovery
action would be to cause an IPL during the evening. Or if mirroring
has been suspended, the Service representative may be scheduled in to
repair the problem later in the day. In the meantime, CHKSYSCND will
continue to send the 'critical system condition' message hourly until
the problem is fixed.
You can avoid this by ending the CHKSYSCND job. You could then submit
CHKSYSCND again with a smaller list of users to be notified until the
known problem is fixed, and afterward revert to the original list.
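As a rough sketch of that sequence, the commands below end the monitor
and resubmit it for a single user. The qualified job name is a
placeholder (this document does not state the name of the submitted
job), so it is assumed here that you look it up with WRKACTJOB first.
The user JONES is taken from the earlier example.

```cl
/* Locate the monitor job submitted by CHKSYSCND */
WRKACTJOB
/* End it. nnnnnn/uuuuuuuuuu/jjjjjjjjjj is a placeholder for the */
/* number/user/name shown by WRKACTJOB.                          */
ENDJOB     JOB(nnnnnn/uuuuuuuuuu/jjjjjjjjjj)
/* Resubmit with a shorter list while the known problem is open */
CHKSYSCND  USERS(JONES)
```

When the problem is fixed, end this job the same way and submit
CHKSYSCND again with the original list.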
Because you may want to use CHKSYSCND independently of the auto start
job, it may be desirable to place the normal CHKSYSCND command in a
CL program and call it from the auto start job or when needed.
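A minimal sketch of such a program follows. The name STRCHKSYS is
hypothetical, and the user list is simply the one from the earlier
example.

```cl
/* STRCHKSYS - Submit the normal CHKSYSCND monitor */
/*             (hypothetical program name)         */
PGM
CHKSYSCND  USERS(QSYSOPR QSECOFR JONES *FIRSTUSER)
ENDPGM
```

The same program can then be run from the auto start job or from a
command line with CALL STRCHKSYS.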
Command parameters *CMD
------------------
USERS This value is passed through to the TAA tool SHOUT. It
is a list of up to 10 user names that will be sent a
message if a critical condition occurs. If the user
is active, the message is sent as a break message to
the workstation message queue where the user is
signed on. If the user is not active, a message is
sent to the user's message queue.
For additional details and a discussion of the
*FIRSTUSER and *ALLACTIVE special values, see the
SHOUT TAA tool.
JOBQ The qualified name of the job queue to submit the
batch job to. The default is QINTER in QGPL. The
intent of QINTER is to allow the job to act like an
interactive job and not use one of the normal batch
job activity levels.
JOBD The qualified name of the job description to use for
the batch job. The default is QBATCH in QGPL. This
allows control of other job attributes for the
CHKSYSCND job.
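For illustration, the following command spells out the documented
defaults for JOBQ and JOBD. It assumes the usual CL library/object
form for the qualified names. Replace either value only if you have
your own job queue or job description for this job.

```cl
CHKSYSCND  USERS(QSYSOPR JONES *ALLACTIVE) +
             JOBQ(QGPL/QINTER) JOBD(QGPL/QBATCH)
```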
System handling of critical conditions
--------------------------------------
You should review the discussion of QSYSMSG in the CL Programmer's
Guide if you are interested in the details of the messages which are
sent. The following describes the highlights of the messages which
are sent to QSYSMSG in QSYS if it exists. For the specific message
IDs, see member TAAMSGMC2 in TAATOOL/QATTCL.
** Address threshold. The system will send a message every hour
if the addresses used percentage (either permanent or
temporary) exceeds 90%.
** Storage capacity threshold. A message will be sent if the
storage used percentage exceeds the threshold value. The
threshold value is specified in SST for each ASP. The default
is 90%.
** Disk errors. Some disk units perform threshold checking of
recoverable error conditions. This allows the system to be
informed when the disk unit is operating, but excessive
recoverable errors are occurring. When an internal threshold
is reached, a critical system message is sent requesting that
Service be informed. In most cases, the system will re-signal
the message on an hourly basis (up to 10 times).
** Mirroring suspended. If a mirroring unit fails, the system
will send a single message when mirroring is suspended. A
similar message will also be sent every hour.
** Parity protection suspended. If parity protection exists and
a parity unit fails, the system will send a single message
when parity protection is suspended. A similar message will
also be sent every hour.
** Battery weak or failed. Some battery protection devices can
be tested to determine if they are weak or have failed. The
system sends a message if it senses either condition.
** Hardware failures. Certain hardware internal units (such as
the bus) are tested regularly and the system sends a message
if errors are found.
** Significant security violations. Certain critical security
errors are also sent to QSYSMSG. For example, a message is
sent if the number of invalid sign-on attempts at a workstation
exceeds the value allowed by the QMAXSIGN system value.
Security messages are ignored by CHKSYSCND. You may modify
the program to process these separately.
All of the system critical messages are sent to QSYSOPR. It is up to
the user to place QSYSOPR in break mode and be sensitive to which
messages are critical and which are not. All the critical system
messages are alertable and will also appear in QHST.
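If you also want QSYSOPR itself to interrupt the operator, the queue
can be placed in break mode from the workstation where the operator
is signed on. This is a sketch of the standard command and is
independent of CHKSYSCND.

```cl
CHGMSGQ    MSGQ(QSYSOPR) DLVRY(*BREAK)
```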
Alerts coming into the system (from a remote system) are sent to
QSYSOPR. The System Management Utility supports an option to allow
you to send the Alerts to a specific queue. You can process the
queue in the same manner as QSYSMSG.
Using CHKSYSCND in an auto start job
------------------------------------
CHKSYSCND can be easily placed into an auto start job for the
controlling subsystem. This is normally the best place to ensure the
job will be active whenever the system is up and is not in the
restricted state.
The source for the system supplied auto start job can be retrieved by
use of the RTVCLSRC command and naming a source file/member where you
want the source placed.
RTVCLSRC PGM(QSTRUP) SRCFILE(yyyyyyy) SRCMBR(zzzz)
It would be normal to place the CHKSYSCND command as one of the last
functions performed (e.g. just prior to the RETURN statement in the
startup program). Create your own version of the startup program.
Change the system value QSTRUPPGM to specify the name of the
library/program you created. You should consider a separate CL
program containing the CHKSYSCND command as described in the previous
section on 'Excess interruptions'.
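The steps above can be sketched as follows. This is an illustration
under stated assumptions, not the system-supplied source: the program
name MYSTRUP and library xxxxxxx are hypothetical, the MONMSG assumes
that a failure of the CHKSYSCND command surfaces as a CPF escape
message, and the CHGSYSVAL value assumes the program name followed by
the library name.

```cl
/* Tail of the modified startup source (sketch only) */
CHKSYSCND  USERS(QSYSOPR QSECOFR JONES *FIRSTUSER)
/* Do not let a failure here stop the rest of startup */
MONMSG     MSGID(CPF0000)
RETURN

/* Entered separately: create the program, then point */
/* the system value at it                             */
CRTCLPGM   PGM(xxxxxxx/MYSTRUP) SRCFILE(yyyyyyy) SRCMBR(zzzz)
CHGSYSVAL  SYSVAL(QSTRUPPGM) VALUE('MYSTRUP xxxxxxx')
```

The change to QSTRUPPGM takes effect the next time the startup
program is run.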
Message text sent
-----------------
There are three forms of messages sent:
** List of users on the command. The list of users specified on
the command will receive the message ID and the actual first
level text of the message that caused the critical system
condition. The message is 'wrapped' with standard text
before and after the actual text. Because a 'break message'
is sent, there is no second level text available with the
message.
For example, the message text supplied in the program
produces:
***** A critical system condition has
occurred. ***** The message ID is xxxxxxx.
The text is- yyyyyyyyyyyyyyyyyyyyyyyyyyy
Contact the system administrator immediately.
** If none of the specified users are active and *FIRSTUSER was
specified, the text sent to some user is:
***** A critical system condition
has been found by CHKSYSCND. ******
No other active user has been
informed. Contact the system
administrator immediately.
** If *ALLACTIVE is specified, specific users may also be named.
Any specifically named users always receive the text described
previously. If the user is active, but is not in the named
list, the following text is sent:
***** A critical system condition
has been found by CHKSYSCND. ******
Contact the system administrator
immediately.
You may modify this text for your own requirements.
Testing CHKSYSCND
-----------------
There are two functions you can consider testing:
** SHOUT command. You can test your list of users by directly
executing the SHOUT command.
** CHKSYSCND command. You can simulate the system sending a
critical message by sending the same message ID directly to
QSYSMSG. The following TSTCHKSYS program can be used (the
source from this text can be directly copied into a CL source
member).
/* TSTCHKSYS - Test the CHKSYSCND command */
PGM PARM(&MSGID)
DCL &MSGID *CHAR LEN(7)
SNDPGMMSG MSGID(&MSGID) MSGF(QCPFMSG) +
TOMSGQ(QSYS/QSYSMSG)
ENDPGM
To test the address threshold condition, you would specify
message CPI0997 as:
CALL TSTCHKSYS PARM(CPI0997)
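Before the CALL can be made, the source has to be compiled. A minimal
sketch, assuming the source was copied to member TSTCHKSYS in a source
file named QCLSRC in your own library (yyyyyyy is a placeholder):

```cl
CRTCLPGM   PGM(yyyyyyy/TSTCHKSYS) +
             SRCFILE(yyyyyyy/QCLSRC) SRCMBR(TSTCHKSYS)
```

The CHKSYSCND job must already be active for the test message to be
picked up and sent to your list of users.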
Modifying the programs
----------------------
You may want to modify the program that is controlling the reading of
QSYSMSG and the message text used for the SHOUT command. The program
to review is TAAMSGMC2 in TAATOOL. If you are going to modify this
program, you should make a copy of the entire CHKSYSCND tool (See
CPYTAA2). Make the changes and then create the tool using CRTTAATOOL
and specify your source library.
The following are typical changes you may want to make:
** Not all of the messages arriving at QSYSMSG are considered
'critical'. For example, security messages are ignored as
well as the fact that mirroring has been resumed. You may
want to have some unique handling of these messages. You
should review the QSYSMSG discussion in the CL Programmer's
Guide. Any message received is removed from QSYSMSG (the
messages also exist in QSYSOPR). If you want to save any of
the messages, you would need to modify the program (e.g.
resend them to a different queue).
** The same message ID received from QSYSMSG is sent via the
SHOUT command for the critical conditions. The first level
message text is wrapped with the standard text as described
previously. You may want to alter either the standard text
that is wrapped around the message or the text for specific
conditions. Because a break message will be sent, only first
level text can be sent. The break message has no second level
text, and the original message ID appears only as part of the
wrapped text.
** The SHOUT command provides for the case of a generic message
to be sent to *FIRSTUSER or *ALLACTIVE. The text does not
identify the specific problem, but requests the user contact
the system administrator. TAAMSGMC2 provides a text message
(described earlier) that is designed to be sent to a typical
end user. You may want to modify this text.
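As an illustration of the first change (saving messages by resending
them to a different queue), the loop below receives each message from
QSYSMSG and sends a copy to a second queue. This is a sketch only and
is not the code in TAAMSGMC2. The program name SAVSYSMSG and the queue
MYLIB/SAVEMSGQ are hypothetical, and the sketch assumes the messages
are defined in QCPFMSG, as in the TSTCHKSYS test program.

```cl
/* SAVSYSMSG - Copy QSYSMSG messages to a second queue */
/*             (hypothetical sketch, not TAAMSGMC2)    */
PGM
DCL        &MSGID  *CHAR LEN(7)
DCL        &MSGDTA *CHAR LEN(256)
/* Wait for the next message and remove it from QSYSMSG */
LOOP:      RCVMSG MSGQ(QSYS/QSYSMSG) WAIT(*MAX) RMV(*YES) +
             MSGDTA(&MSGDTA) MSGID(&MSGID)
/* A blank ID means impromptu text with nothing to resend */
IF         (&MSGID *EQ ' ') GOTO LOOP
/* Resend the same message ID and data to the save queue */
SNDPGMMSG  MSGID(&MSGID) MSGF(QCPFMSG) MSGDTA(&MSGDTA) +
             TOMSGQ(MYLIB/SAVEMSGQ)
GOTO       LOOP
ENDPGM
```

Because only one program may read QSYSMSG, logic like this would have
to be merged into your copy of TAAMSGMC2 rather than run alongside
the CHKSYSCND job.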
Restrictions
------------
The QSYSMSG message queue must exist and it cannot be allocated to
any other program (e.g. be in break mode to a user).
You must have *JOBCTL special authority to use CHKSYSCND.
Prerequisites
-------------
The following TAA Tools must be on your system:
     EXTLST          Extract list
     SHOUT           Shout message
     SNDCOMPMSG      Send completion message
     SNDESCMSG       Send escape message
Implementation
--------------
The tool is ready to use, but you must ensure that the QSYSMSG
message queue exists in QSYS. If it has not been created, see the
earlier discussion.
Objects used by the tool
------------------------
   Object        Type    Attribute      Src member    Src file
   ------        ----    ---------      ----------    ----------
   CHKSYSCND     *CMD                   TAAMSGM       QATTCMD
   TAAMSGMC      *PGM       CLP         TAAMSGMC      QATTCL
   TAAMSGMC2     *PGM       CLP         TAAMSGMC2     QATTCL
Structure
---------
     CHKSYSCND     command
        TAAMSGMC     submits batch pgm TAAMSGMC2
           TAAMSGMC2    CL pgm
              SHOUT       TAA tool