The Scan Duplicate Records command is intended to assist you in
identifying duplicate records in a data base. For example, different
spellings of a name or address may cause multiple records for the
same individual or company. Command options on SCNDUPRCD allow you
to specify how to determine a possible duplicate group. It is
expected that you will need to try various options to determine what
best fits your situation.
A typical command would be:
SCNDUPRCD FILE(xxx)
SELFLD((ADDR *COMPL 5)(CITY *COMPL 6))
PRTFLD((CUSNBR)(CUSNAM)(ADDR) +
(CIYT)(STATE))
The Select fields parameter (SELFLD) identifies the type of selection
to be used. A maximum of 7 fields may be specified, but in the
example only 2 are used. The *COMPL selection type requests a
comparison beginning at the left (position 1) of the ADDR field for 5
bytes. All character data is translated to upper case before making
a comparison. The same type of selection of the CITY field is made
for the first 6 bytes.
Because the default is used for SELMATCH, any records in the data
base that match on all of the selections specified are printed to a
spooled file.
For example, if the following records exist:
Rcd ADDR CITY
--- ------------------ -----------------
1 117 N. Broadway Las Vegas
2 117 N. Broadway New York
3 221 N. Broadway New York
4 117 N. Broadway Las Vegas
keys would be generated as follows:
Rcd Generated key
--- ------------------
1 117 N LAS VE
2 117 N NEW YO
3 221 N NEW YO
4 117 N LAS VE
Records 1 and 4 would be matched on the generated key and would cause
both records to be printed as a possible duplicate group.
The fields listed on the PRTFLD parameter would be listed across the
page (left to right) with each field printing the full value of the
field. You should specify fields for PRTFLD that will assist you in
determining the key to the records and as much as possible to help
determine if a duplicate exists.
Note that the generated key fields are not printed. To see the
generated key fields, see the later discussion of the PRTSCNDUP
command.
Other types of selection are possible:
*COMPR Compares from the right end of the field (trailing
positions).
*NBR Compares only for the first N digits (0-9) found in
the field.
*LTR Compares only for the first N letters (A-Z) found in
the field.
*VOWELS Compares only for the first N vowels (A E I O U)
found in the field.
*CONSONANTS Compares only for the first N consonants
(non-vowels) found in the field. Any characters
that are not in a range of A-Z are bypassed.
*MULTINAME Scans for multiple pieces of a typical name field.
For example, if the name is 'James Jones', both
JAMES and JONES would be compared for. Any single
position characters or values such as period, comma,
etc are bypassed. A set of standard values exists
that may be modified to bypass certain words such as
'Mr', 'Mrs' etc.
*LASTNAME This is a subset of the *MULTINAME function to allow
comparison of the last name found in the value. For
example, 'James Jones', 'Jim Jones', or 'J. Jones'
would all be compared for just JONES. This option
is intended for individual names such as 'Jim Jones'
as opposed to business names such as 'The Jones
Foundation' where FOUNDATION would be used as the
key.
*MULTIADDR Scans for multiple pieces of a typical address
field. For example, an address of '1341 N.
Broadway St.' would cause 1341 and BROADWAY to be
compared for. Any single position characters or
values such as period, comma, etc are bypassed. A
set of standard values exists that may be modified
to bypass words such as 'St', 'Street' etc.
*ADDRTRANS The function of *MULTIADDR is used with the
exception that single character values are
considered as a separate piece. When all of the
pieces of the data have been separated, translation
occurs that will change 'N.' 'No.' 'North' to 'N/'.
Similar translations occur for South, East, and
West. Each piece is then concatenated together to
form a single key.
For example, an address of '1341 N. Broadway St.'
would cause 1341N/BROA to be compared for (assuming
a length of 10 was specified). A set of standard
translation values exists that may be modified.
PRTFLD Parameter
----------------
You may specify up to 7 fields to print to assist you in identifying
the record and whether a duplicate exists.
The default is that the full field length will be printed (maximum of
50 bytes). You may print between 6 and 50 bytes or the full value of
the field. Only one line of information will print per record. You
must decide the fields and the length that should be printed.
The maximum printer width is 200 bytes. This will allow you to
display information, but you may not be able to print hard copy. If
you attempt to print beyond 200, the data is either truncated or a
message will occur stating that you should reduce the number of
fields requested to print.
Upper Case Comparisons
----------------------
Any character fields are translated to upper case (A-Z) before making
a comparison. Thus 'Jones', 'JONES' and 'JOnes' would all be
considered a duplicate group.
Blank Generated Key Values
--------------------------
Blank key values are automatically bypassed. For example, if you
have multiple address fields and request SELMATCH(1), blank key
values could be created. Bypassing these values prevents matches on
all blanks.
Select Match Parameter
----------------------
The SELMATCH parameter defaults to *ALL meaning that all Selection
fields must match before the record is considered a possible
duplicate.
You can specify up to 7 selection fields. You can use a specific
value for the SELMATCH parameter to request that only a subset of the
selection fields be required to match in order to print the possible
duplicates.
For example, if you had 5 selection fields (NAME, ADDR, CITY, STATE,
ZIP), you could request that only 3 have to match to consider the
record a possible duplicate. This would mean that a match on NAME,
STATE, and ZIP would cause the records to print regardless of whether
the ADDR or CITY matched. Another group would print if it matched on
NAME, ADDR, and CITY.
Specifying a SELMATCH value that is less than the full set of
selection fields causes slower performance. For example, if you
specify 7 selection fields and SELMATCH(2), longer processing is
required to determine all the possible duplicates.
Play with it
------------
There is no recommended specification for finding duplicate records.
If you have a large data base, you should play with a subset of the
data (use the MAXRCDS parameter) and try various combinations to see
what produces good results.
It is probable that one specification will not find all the
duplicates. Trying different combinations is usually the best
solution.
Saving your specifications
--------------------------
If you have experimented with SCNDUPRCD and have a good set of SELFLD
statements that satisfy your requirements, you can save the command
with the FAVCMD TAA Tool.
You must first use CRTFAVCMD to create the required data base files.
Then follow your SCNDUPRCD command with the ADDFAVCMD command and
assign an ID such as:
SCNDUPRCD ...
ADDFAVCMD FAVID(xxx)
You can then re-execute the SCNDUPRCD command as:
EXCFAVCMD FAVID(xxx)
Security considerations
-----------------------
No special security considerations are provided. The user of the
command must be authorized to *USE authority of the specified file.
Technique used
--------------
The definition of the specified file is found and checked against the
Select and Print fields. The file is then read and a temporary file
(SCNDUPKP) is output to QTEMP with all of the possible key values
that should be compared.
If you specify SELMATCH(*ALL), one record would be output to the
temporary file for each input record read (see the exception for the
*MULTIxxx options).
If you specify 5 selection fields and SELMATCH(3), 6 different
records are output to the temporary file for each input record read.
The different records are required to contain all of the possible
unique combinations of values out of a total of 6 fields.
The temporary file contains a single large field for the generated
key values and a single large field to be printed to identify the
record. The key values field is specified as the key field to the
file.
If either *MULTINAME or *MULTIADDR is specified, this can increase
the number of key records generated. For example, if two selection
fields are specified with SELMATCH(*ALL), but one of the fields is
specified as *MULTINAME and 'James Earl Jones' exists as the name, 3
key records will be generated for JAMES, EARL, and JONES.
After the file is output, it is read by a second program in keyed
order. Printing occurs if duplicate keys are found.
To attempt to minimize the number of duplicate groups that are
printed, internal record IDs of each record in a duplicate group (if
less than 10 records) are stored in a large internal array. If
another duplicate group is found, it will not be printed if the
internal IDs match a group that was previously printed.
Print SCNDUP Key Records
------------------------
To assist in understanding the generated key records in the SCNDUPKP
temporary file, a separate command may be used to print the records.
When SCNDUPRCD runs, the SCNDUPKP file is created in QTEMP. By
default, the file is deleted at the end of the command as it may have
many records.
In order to see the data, you must specify DLTTMPFILE(*NO) on
SCNDUPRCD. Then you may use the PRTSCNDUP command to print the data.
PRTSCNDUP allows the printing of the generated key records in either
keyed or arrival sequence. The intended use of the command is to
assist you in understanding what keys are generated based on your
selection specifications.
Bypass values used for *MULTINAME and *MULTIADDR
------------------------------------------------
To avoid key generation of such standard values as 'Mr.' 'Ave' etc.
two data areas are provided that contain the standard bypass values.
If you want to modify these values, you should create your own
versions and place them in a library ahead of TAATOOL on the library
list.
The two data areas provided are:
TAADBITY - Bypass name values such as 'Mr' 'Inc'
TAADBITZ - Bypass address values such as 'Ave' 'Blvd'
You can review the data in these data areas with the DSPCONARR
command such as:
DSPCONARR DTAARA(TAADBITY)
To create you own versions, use CRTDUPOBJ to create a duplicate
object in a library of your choice (do not use TAATOOL). The library
must be on the library list at the time of SCNDUPRCD execution.
CRTDUPOBJ OBJ(TAADBITY) FROMLIB(TAATOOL) OBJTYPE(*DTAARA)
TOLIB(xxx) NEWOBJ(MULTINAME)
CRTDUPOBJ OBJ(TAADBITZ) FROMLIB(TAATOOL) OBJTYPE(*DTAARA)
TOLIB(xxx) NEWOBJ(MULTIADDR)
You can edit the data with the EDTCONARR command such as:
EDTCONARR DTAARA(MULTINAME)
Enter the values in upper case. Periods, commas, etc are not needed
because they are automatically bypassed by the *MULTIxxx functions.
The values in TAADBITY (or your version) are also used for the
*LASTNAME selection option.
Address translation
-------------------
The *ADDRTRANS function provides a translation capability of certain
values. The shipped default allows common names such as 'N' 'No'
'North' to translate to the value of N/. Similar translations occur
for South, East, and West.
The translation occurs after the use of the *MULTIADDR function with
the exception that single characters are retained.
After translation occurs, the pieces of the field are concatenated
together without blanks to form a single key value.
The data area provided is:
TAADBITW - Address translation values
You can review the data in these data areas with the DSPCONARR
command such as:
DSPCONARR DTAARA(TAADBITW)
To create you own version, use CRTDUPOBJ to create a duplicate object
in a library of your choice (do not use TAATOOL). The library must
be on the library list at the time of SCNDUPRCD execution.
CRTDUPOBJ OBJ(TAADBITW) FROMLIB(TAATOOL) OBJTYPE(*DTAARA)
TOLIB(xxx) NEWOBJ(ADDRTRANS)
You can edit the data with the EDTCONARR command such as:
EDTCONARR DTAARA(ADDRTRANS)
Enter the values in upper case. Periods, commas, etc are not needed
because they are automatically bypassed by the *MULTIxxx functions.
Limits
------
There are several limits placed on the length and type of fields and
comparison values. For example,
** The maximum record size is 9999 bytes.
** Maximum number of fields in the file is 998.
** The length to be compared for any one of the selection fields
cannot exceed 10.
** Character fields cannot exceed 256 bytes.
** Decimal fields cannot exceed 15 digits.
** A maximum of 25 sub fields can exist in a *MULTIxxx field.
** A numeric type field can only be specified for *COMPL or
*COMPR.
** Only character, variable length, or and decimal field types
are supported for selection or as print fields.
** Up to 200 print positions may be generated.
** The length of any field to be printed has a maximum of 50
bytes.
** A maximum of 9999 duplicate groups and their included IDs can
exist to minimize the amount of duplicate group printing. If
the limit is reached, additional records are bypassed.
SCNDUPRCD Command parameters *CMD
----------------------------
FILE The qualified name of the file to be scanned. The
library value defaults to *LIBL. *CURLIB may also
be used.
MAXRCDS The maximum number of records to be processed from
the file. *ALL may be specified to process all
records in the file.
The intent of requiring a value is because the
probable use of the command is for 'trying it out'
on a subset of a file rather than the entire file.
Large files can require significant processing time
and you should not specify *ALL until you have tried
your specifications on a subset of the file and are
satisfied with the results.
SELFLD The field to select on and how to perform the
selection. Up to 7 selection fields may be
specified. Each field must be identified with a
field name, a selection type, and a compare length.
The selection types are:
*COMPL Compare from the left side of the field
(position 1) for the length specified. Either
a character or decimal field may be specified.
If a character field is specified, all data is
translated to upper case before making a
comparison.
*COMPR Compare from the right side (high order) of the
field for the compare length specified. This
is intended for fields where you know a
complete value exists such as a 5 digit Zip
code and you want to select on the last 3
digits. Either a character or decimal field
may be specified. If a character field is
specified, all data is translated to upper case
before making a comparison.
*NBR Compare from the left side of the field, but
use only digits (0-9) found in the value. The
extraction continues thru the field until the
compare length is met. Only a character field
may be specified.
For example, if you have an address such as
'117 N. Broadway' and request a compare length
of 5, the key generated would be '117 '. A
value of '117 N. Broadway - 1st Floor' would
produce a key of '1171 '.
*LTR Compare from the left side of the field, but
use only letters (A-Z) found in the value.
Only a character field may be specified and all
data is translated to upper case before making
a comparison. The extraction continues thru
the field until the compare length is met. For
example, if you have an address such as '117 N.
Broadway' and request a compare length of 5,
the key generated would be 'NBROA'. A value of
'117 North Broadway' would produce a key of
'NORTH'.
*VOWEL Compare from the left side of the field, but
use only vowels (A, E, I, O, U) found in the
value. Only a character field may be specified
and all data is translated to upper case before
making a comparison. The extraction continues
thru the field until the compare length is met.
For example, if you have an address such as
'117 N. Broadway' and request a compare length
of 5, the key generated would be 'OAA '
*CONSONANT Compare from the left side of the field, but
use only consonants (not A, E, I, O, U vowels).
Only a character field may be specified and all
data is translated to upper case before making
a comparison. Only letters A-Z (minus the
vowels) are considered as consonants. Special
characters such as '.' or ',' are not. The
extraction continues thru the field until the
compare length is met. For example, if you
have an address such as '117 N. Broadway' and
request a compare length of 5, the key
generated would be 'NBRDW'
*MULTINAME Scan thru the field for parts of a name. For
example, 'James Earl Jones' would produce
separate keys of JAMES, EARL, and JONES. Only
a character field may be specified and all data
is translated to upper case before making a
comparison.
A field begins with a non-blank character that
is other than '.' ',' or '-'. The end of the
field is determined by a value of blank, '.'
',' '-' or the end of the data base field.
Single character fields are bypassed.
A Constant Array is provided that includes
standard values that will be bypassed such as
'Mr', 'Mrs', 'of', etc. For example, 'Mr. J.
Jones' would produce a single key of 'JONES'.
See the previous discussion for how to
determine the standard list of values and how
to modify the list with your own values.
*LASTNAME Scan thru the field for parts of a name and use
only the last piece. Only a character field
may be specified and all data is translated to
upper case before making a comparison. For
example, 'Mr. James Earl Jones' would produce
a single key of 'JONES'.
This is a subset function of *MULTINAME and all
rules for *MULTINAME processing are used with
the exception that only the last name is output
as a key.
*MULTIADDR Scan thru the field for parts of an address.
For example, '117 N. Broadway St.' would
produce separate keys of 117 and BROADWAY.
Only a character field may be specified and all
data is translated to upper case before making
a comparison.
The same rules apply for determining a part of
the field as with *MULTINAME. Single character
fields are bypassed.
A Constant Array is provided that includes
standard values that will be bypassed such as
'St', 'Street' 'Ave', etc. See the previous
discussion for how to determine the standard
list of values and how to modify the list with
your own values.
*ADDRTRANS The function of *MULTIADDR is used with the
exception that single character values are
considered as a separate piece. When all of
the pieces of the data have been separated,
translation occurs that by default will change
'N.' 'No.' 'North' to 'N/'. Similar
translations occur for South, East, and West.
Each piece is then concatenated together to
form a single key.
For example, an address of '1341 N. Broadway
St.' would cause 1341N/BROA to be compared for
(assuming a length of 10 was specified).
A Constant Array is provided that includes
standard values that will be translated to a
common value. See the previous discussion for
how to determine the standard list of values
and how to modify the list with your own
values.
SELMATCH The number of the Selection fields specified that
must match to consider the record a duplicate group.
The default is *ALL meaning all selection fields
must match.
Specifying a number that is less than the number of
selection fields will cause an increase in the
number of key records that must be generated and
processed.
For example, if you have 7 selection fields and
require that 6 match to indicate a duplicate record,
7 key records are generated for each input record
(1,2,3,4,5,6 and 1,2,3,4,5,7 ... 2,3,4,5,6,7)
However, if you have 7 selection fields and require
only 3 to match to indicate a duplicate record, 35
key records will be generated for each input record.
One record will contain keys 1,2,3, the next will be
1,2,4, then 1,2,5 ... 3,4,5, 3,4,6, ... 567. Each
combination that would cause a duplicate key
requires a separate generated key record.
PRTFLD A list of up to 7 fields that will print left to
right on each line that indicates a possible
duplicate record. The default is to print the full
value of each field up to a maximum of 50 bytes.
Because only one line will be printed per record,
you may want to print less than the full value of
certain fields.
The maximum print width is 200. This will allow you
to display information, but you may not be able to
print hard copy. If you attempt to print beyond
200, the data is either truncated or a message will
occur stating that you should reduce the number of
fields requested to print.
The minimum width of a field is 1 if you request the
full length to be printed. If you specify a length,
the minimum is 6.
You would normally include a unique key to the
record (if it exists) and fields to help you
identify whether duplicate records exist (such as
Name, address, etc).
MBR The member to be processed. The default is *FIRST.
DLTTMPFILE A *YES/*NO value for whether to delete the SCNDUPKP
file created in QTEMP at the end of the command.
The default is *YES.
*NO should be specified to retain the file which
would normally only be needed if you want to review
the data. The data may assist you in understanding
the keys that are generated by your specifications.
To review the data, use the PRTSCNDUP command.
OUTLIB The name of the library to write duplicate records
to. The default is *NONE meaning no outfile is
created.
A library name that differs from the FILE parameter
library may be specified.
If a library name is entered, the same file name is
used as specified on the FILE parameter. If the
file does not exist, it is created (no keys will
exist).
If the file does exist, the level IDs of the two
files must be the same. The CLROUTFILE parameter
determines whether the file is cleared first or an
escape message is issued.
The records that are written are duplicated from the
file named in the FILE parameter.
CLROUTFILE Whether to clear the file specified in the OUTLIB
parameter. The default is *NO.
If OUTLIB(*NONE) is specified, the CLROUTFILE value
is ignored.
If a library is named for OUTLIB and the file
already exists, the *NO value will cause an escape
message. You must either delete the file or specify
*YES.
*YES may be entered to clear the existing file in
the OUTLIB.
PRTSCNDUP Command parameters *CMD
----------------------------
SEQ The sequence in which to print the data. The
default is *KEYED meaning to print the file in keyed
sequence which is the way the data is processed to
determine whether possible duplicates exist. Each
unique key is identified.
*ARR may be specified to print the data in arrival
sequence which is the way the key records are
generated based on your file. Each group of records
is identified by use of an Internal ID field which
is a consecutive number assigned when processing the
specified file.
Restrictions
------------
Files with null capable fields are not supported.
There are several other restrictions that are discussed previously.
Prerequisites
-------------
The following TAA Tools must be on your system:
CHKOBJ3 Check object 3
EDTVAR Edit variable
EXTLST Extract list
FILEFDBCK File feedback
HLRMVMSG HLL Remove message
RTVFLDA Retrieve field attributes
RTVSYSVAL3 Retrieve system value 3
SNDCOMPMSG Send completion message
SNDESCMSG Send escape message
SNDSTSMSG Send status message
Implementation
--------------
None, the tool is ready to use.
Objects used by the tool
------------------------
Object Type Attribute Src member Src file
------ ---- --------- ---------- ----------
SCNDUPRCD *CMD TAADBIT QATTCMD
PRTSCNDUP *CMD TAADBIT2 QATTCMD
TAADBITC *PGM CLP TAADBITC QATTCL
TAADBITC2 *PGM CLP TAADBITC2 QATTCL
TAADBITR *PGM RPG TAADBITR QATTRPG
TAADBITR2 *PGM RPG TAADBITR2 QATTRPG
TAADBITR11 *PGM RPG TAADBITR11 QATTRPG
TAADBITP *FILE PF TAADBITP QATTDDS
TAADBITL *FILE LF TAADBITL QATTDDS
TAADBITW *DTAARA
TAADBITY *DTAARA
TAADBITZ *DTAARA
The TAADBITW data area contains bypass values for the *ADDRTRANS
function.
The TAADBITY data area contains bypass values for the *MULTINAME and
*LASTNAME selection options.
The TAADBITZ data area contains bypass values for the *MULTIADDR
selection option.
Structure
---------
SCNDUPRCD Cmd
TAADBITC CL pgm
TAADBITR RPG Pgm
TAADBITR11 RPG Pgm
PRTSCNDUP Cmd
TAADBITC2 CL pgm
TAADBITR2 RPG Pgm
|