The RPG Alternate Lookup Tool describes a performance oriented
solution for the typical case where a batch job reads a data base
file, loads an array using unique values from the data, and
accumulates amounts. At the end of the program, the array is usually
sorted and a report prepared with one line per unique value.
In general, this is a good technique because:
** The program can process the file in arrival sequence which is
much faster than processing by keys.
** It avoids physically sorting the data.
** It avoids many internal compares caused by the LOKUP
operation.
The sample code allows you to produce a working program in a short
period of time.
If you are only interested in summary data and the number of unique
values is 2000 or less, you should be considering this technique.
Sample code
-----------
The sample RPG code is shown in the member TAARPGFR3 in the file
QATTRPG. The normal solution would be to copy all of the source to a
new member and then begin making modifications. Instructions in the
source will assist you in how to modify the program.
The TAARPGFR3 source is a working program that is compiled when the
RPGALTLKP tool is created. The program works on the outfile created
by the DSPOBJD command. The program creates a report in object type
sequence for each unique object type. A count by object type and
average size is also included along with percentages.
You can try the program by specifying DSPOBJD on a typical library
(the outfile name must be DSPOBJDP).
DSPOBJD OBJ(xxx/*ALL) OBJTYPE(*ALL) +
OUTPUT(*OUTFILE) OUTFILE(DSPOBJDP)
CALL PGM(TAATOOL/TAARPGFR3)
Then look at the spooled output for the QPRINT file.
ANZRPGLKP Command
-----------------
To help you in determining if the Alternate Lookup solution will have
significant payoff, an Analyze command is available that can be used
on any externally described data base file. The Analyze command will
perform the normal LOKUP operation approach as well as the alternate
method and describe both results in a spooled file.
Assume you created the DSPOBJDP file as in the prior instructions.
You want a report that uses 'object type' (the field name is ODOBTP)
as a unique value. You could analyze the file with the command:
ANZRPGLKP FILE(DSPOBJDP) FIELD1(ODOBTP) +
FSTPOS(3) SECPOS(4) THRPOS(5)
The FSTPOS, SECPOS, and THRPOS fields describe the positions within
the field that will be used to provide a random value to do the
lookup. Because the 'object type' field values all start with * and
the second position value has several duplicates, the 3rd, 4th, and
5th positions are used to provide the best random solution.
You can try the ANZRPGLKP command using different positions to
determine the best solution. The closer the average number of
comparisons is to 1.0 per record, the better the solution is.
The Alternate method is only worth pursuing if you can reduce the
number of comparisons by either hundreds of thousands or millions.
The ANZRPGLKP command is general purpose in order to operate on any
externally described data. Because of this, the performance of both
the normal lookup and the alternate method will be slower than a
tailored solution using the sample provided by TAARPGFR3. The
advantage of ANZRPGLKP is that you can easily determine if the
technique will be effective.
Comments in the sample code
---------------------------
The TAARPGFR3 source has two forms of comments:
** Permanent comments. These define the logic of the program.
** Temporary comments. These provide instructions for how to
modify the program. You scan the source for +++Req to find
these temporary comments
You should not need to read or understand all of the code provided.
The temporary comments should tell you what you need to modify to get
a working result.
The best solution is usually to leave the temporary comments in the
source until you have the program working. When you are ready,
delete the temporary comments with the TAA tool RMVSRCCMT. It is
specifically provided to eliminate all of the comments that begin
with +++.
Other uses of the technique
---------------------------
The code provided for the Alternate method sorts the array in
ascending sequence. A simple change allows descending sequence. You
can sequence on the unique value or a count you are keeping of the
unique values or any other summary value. Your code could also
process the array when all the records have been read, determine
percentages or averages and then sort on a calculated field. The
sort function is optional.
You can also use the same alternate lookup method for additional
applications. For example, an array is often used to help validate
data being input.
Assume you have an array of the valid US State abbreviations (an
array of 50+ elements). If you check every transaction against the
array, you would average 25 or so compares against the state array to
determine whether a valid state had been specified.
To lower the number of compares, you could have initialization code
in the program that loads the alternate array from a standard array.
The TAARPGFR3 program supports the use of the ZAEXST field. If this
is set to *ON, the subsequent lookups will return a value of EQ or NE
in the ZASTS field and the array is not updated (it becomes a normal
lookup).
Because the alternate method requires large arrays (it may approach
65,000 bytes), the technique is not appropriate for interactive
programs if your interactive pool is memory constrained.
Alternate method technique
--------------------------
The Alternate Method requires that you extract three characters from
the field that you want for a unique value. These characters are
used to provide a 'random value'.
For example, if you are going to process the outfile of the DSPOBJD
command and want to summarize by the unique value of 'object type',
you would need to specify three characters from the field ODOBTP.
These three characters should be chosen to provide the best
randomness in the data. Normally, the first three bytes of a
character field or the last three bytes in a decimal field provide a
good solution, but it is very data dependent. In the case of ODOBTP,
every value begins with an * so you would probably choose other
characters (the 3rd, 4th, and 5th characters provide good
randomness). You can determine the best approach by using the
ANZRPGLKP command. Your program must extract three characters from
the unique value and place them in standard fields.
The alternate method forces an F zone on each of the three
characters, ensures a valid digit portion, and adds a 1 to the end of
the value to form an index value. For example, ABC would become
1231, DEF would become 4561, JKL would become 1231, etc.
Avoiding Exception Address Overflows (EAOs)
-------------------------------------------
The EAO discussion should only be considered for CISC systems. If
you are on a RISC system, you may ignore the discussion.
There is a limit on how big the second array used in the program (the
ZB array) can be in order to avoid the internal exception caused by
an address overflow (internally these are known as EAOs). The ZB
array contains both the unique value to compare against and any count
or total fields you are accumulating.
You should limit the size of the second array to 40,000 bytes or
less. For example, this could be 2000 elements of 20 bytes each or
500 elements of 80 bytes each. As long as the total is less than
40,000 bytes and you do not have many other fields or arrays in the
program, the program should not cause any EAOs.
Internal approach
-----------------
The following chart shows the general concept of the approach (the
implementation specifics are slightly different). The generated
random value is used as a direct index into the ZA array. The value
in the ZA element is assigned to the next available element in the ZB
array. The ZB array contains both the unique value to compare
against and any count or total fields you are accumulating.
ZA Array ZB Array
---------------- ------------------------------
Array value in DS
-------------------
Unique
Element Value Element Value Count Total
------- ----- ------- ----- ----- -----
1 Index 1
2 to 2
. ZB .
. Up to
9999 2000
Assume the first value read is ABC and you are using the first three
characters of the value. This would randomize to an index of 1231.
The 1231st element in the ZA array would be accessed. If it was
zero, the element has not been used and the next available element in
ZB would be assigned (this is the first record so the first element
of ZB would be assigned). The assigned element of the ZB array would
be moved to a data structure. The unique value ABC would be placed
in one of the sub fields, a count added, and totals accumulated. The
data structure is then written back to the ZB array.
Assume the second value read is DEF. This would randomize to a value
of 4561. The 4561st element in the ZA array would be accessed.
Since it is zero, the next element for the ZB array (Nbr 2) would be
assigned and updated.
Assume the next value read is ABC. This would randomize to 1231.
When the 1231st element was read from the ZA array, it would have a
value of 1 so the 1st element of the ZB array would be moved to the
data structure. A comparison would be made of the unique value.
Since they are both ABC, the count and totals would be updated and
the data structure written back to the ZB array.
Assume the next value read is JKL. This also randomizes to 1231.
When the unique value is compared in the ZB array, they are not equal
so the next element in the ZA array is accessed (Nbr 1232). Since
the value is zero, it would be assigned the next available value in
the ZB array (Nbr 3).
If many unique values randomize to the same value, there would be a
lot of 'clustering' and this would result in more compares trying to
find the correct value. If the data does not form into too many
large clusters, the performance will generally be very good.
At the end of these 4 records, the two arrays would look as follows:
ZA Array ZB Array
---------------- ------------------------------
Array value in DS
-------------------
Unique
Element Value Element Value Count Total
------- ----- ------- ----- ----- -----
1 1 ABC 2 nnn
2 2 DEF 1 nnn
. 3 JKL 1 nnn
1231 1 .
1232 3 2000
.
1241
.
4561 2
.
9998
9999
Note that the ZA array is not used at the end of the program when you
want to print the results. Normally you would want to sort the ZB
array. If you want to sort on the unique value, it must be the first
field in the data structure. The RPG SORTA operation sorts on the
entire element value so you need to define the data structure fields
in the sequence you want the array sorted.
You may want to sort on the unique value, a count field, a total
field, or an average field kept in the array. If you are going to
sort on an 'average' field, you are normally better off from a
performance viewpoint to make an extra pass through the ZB array
before sorting to calculate the average (or percent) rather than
calculating each time the data structure is updated.
Note that if you have a blank value as the unique value, it will
randomize to 0001. Blanks will be written to the unique value in the
data structure. When you print the accumulated values at the end of
the program, you want to avoid printing if the entire element is
blank. A blank unique value will have a count field or other
accumulated values that should be printed.
To reduce the total number of bytes in the first array (ZA), values
-999 to +999 are used (this allows a 2 byte packed value for 9999
elements). The values placed in ZA begin with -999 rather than the 1
shown in the example. A value of 1000 is added to the value to
determine the element number in the second array (ZB).
ANZRPGLKP Command parameters *CMD
----------------------------
FILE The qualified file name of the file. The library
value defaults to *LIBL. *CURLIB may be specified.
FIELD1 The first (or only) field to be used for the unique
value. It must be either character, packed decimal,
or zoned decimal. The length of the field cannot
exceed 20 bytes (this is a restriction of the
ANZRPGLKP command and not of the Alternate Lookup
technique).
The parameters FIELD2 and FIELD3 let you name
additional fields that make up the value to compare.
If different field names are used, the sum of the
lengths cannot exceed 20 bytes (this is a
restriction of the ANZRPGLKP command and not of the
Alternate Lookup technique).
FSTPOS The first (leftmost) position to be used for the
random value. The default is 1. If you have a
typical character field, you should probably take
the default for this parameter and the corresponding
parameters for your first attempt. Then you can try
alternative solutions depending on your data.
If you have a packed decimal field of LEN(5 0) and
you want the low order positions (3-4-5) to be used,
specify 3.
FIELD2 The second field to be used for the unique value.
The default is *FIELD1 meaning it is the same field
as FIELD1. Specifying a different field allows you
to have a combination of fields which make up a
unique value. FIELD2 must be either character,
packed decimal, or zoned decimal.
SECPOS The second position to be used for the random value.
The default is 2. If you have a packed decimal
field of LEN(5 0) and you want the low order
positions (3-4-5) to be used, specify 4. The entry
refers to where in FIELD2 you want the second
character used for the random value.
FIELD3 The third field to be used for the unique value.
The default is *FIELD1 meaning it is the same field
as FIELD1.
THRPOS The third position to be used for the random value.
The default is 3. If you have a packed decimal
field of LEN(5 0) and you want the low order
positions (3-4-5) to be used, specify 5.
NBRRCDS The number of records in the file to be processed.
The default is 10,000. You can normally determine
the value of the RPG Alternate Lookup method without
processing an entire large file. *ALL may be
specified to process all of the records in the file.
MBR The member to be processed. The default is *FIRST.
LOOKUP Whether the normal lookup should be used. The
default is *YES. The normal lookup is done by use
of a separate program to provide a comparison. Once
you determine you want to use the alternate lookup
technique and are using the ANZRPGLKP command to
determine the best random characters to use, you can
avoid the separate program by specifying *NO.
Restrictions
------------
The number of unique values that can be processed has a maximum limit
of 2000. The program may report an error at some point less than
2000 (it is data dependent). Use the ANZRPGLKP command to help you
determine if the technique will be effective on your data.
There are restrictions described for the ANZRPGLKP command relative
to the length of the field to compare (20 bytes or less). This is
not a restriction relative to the Alternate Method.
You should limit the size of the second array (ZB) to avoid EAOs.
See the previous discussion.
Prerequisites
-------------
The following TAA Tools must be on your system:
FILEFDBCK File feedback
HLRMVMSG HLL Remove message
RTVSYSVAL3 Retrieve system value 3
SNDCOMPMSG Send completion message
SNDESCMSG Send escape message
SNDSTSMSG Send status message
Implementation
--------------
None, the tool is ready to use:
Objects used by the tool
------------------------
Object Type Attribute Src member Src file
------ ---- --------- ---------- ----------
ANZRPGLKP *CMD TAARPGF QATTCMD
TAARPGFC *PGM CLP TAARPGFC QATTCL
TAARPGFR *PGM RPG TAARPGFR QATTRPG
TAARPGFR2 *PGM RPG TAARPGFR2 QATTRPG
TAARPGFR3 *PGM RPG TAARPGFR3 QATTRPG
Structure
---------
ANZRPGLKP CMD
TAARPGFC CLP
TAARPGFR RPG
TAARPGFR2 RPG
TAARPGFR3 RPG is the sample code that you should copy
to one of your members and then modify to use the technique.
|