Database Preparation Services

Duplicate Record Resolution

Duplicate record resolution, more commonly known as "deduping," is a key component of database preparation for libraries performing their cataloging on OCLC. RLIN users, on the other hand, may or may not need deduping depending on whether the library has requested a "transaction" tape or a "snapshot" tape from RLG. A snapshot tape contains consolidated bibliographic and holdings data in a single record and hence does not need deduping. Records derived from WLN do not need to be deduped. Deduping is less important or may not be needed at all for records exported from CD-ROM databases, or those created from single-use processing activities such as retrospective conversion or reclassification projects. Duplicate record resolution sometimes includes the consolidation of holdings and call number data from non-retained record uses into the retained record.

Why OCLC Records Need Deduping

Whenever a record is Produced or Updated on OCLC, it is written to a daily journal tape. These records are archived chronologically. Subsequent transactions do not replace earlier uses of the record. The first time a record is used for cataloging, a copy of the record, along with any editing modifications made by the library, is written to tape. If the item is later withdrawn from the library's collection and the holding symbol canceled, the Cancel/Update transaction is archived to tape. If added copies or volumes of a title are processed as Updates, complete holdings data emerges only after examining a series of transactions.

Deduping spares patrons the confusion of viewing two or more records for the same work, and it reduces the amount of mass storage required to hold the database. Because deduping changes the record count, local system mass storage requirements are best estimated after duplicate record resolution.

How Many Duplicates Are There?

Duplicate records typically account for 10% to 25% of an OCLC library's total records. While the average library database has a duplication rate below 20%, a large public library with many branches may find that one-third to one-half of its records are duplicates. Academic and special libraries tend to have fewer duplicates. Within duplicate groups, the distribution pattern of records is fairly consistent. Typically, 80% of the records fall into groups of two duplicate records. Progressively fewer records appear three times, four times, and so on.
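A library can measure these figures on its own file before choosing a deduping option. The following is a minimal sketch, not a description of any vendor's software. It assumes the duplication rate is the share of records beyond one per control number, which is one possible reading of the percentages above, and the function name and sample numbers are hypothetical.

```python
from collections import Counter

def duplication_stats(control_numbers):
    """Return (duplication_rate, {group_size: number_of_groups})."""
    uses = Counter(control_numbers)           # uses per control number
    total = len(control_numbers)
    surplus = total - len(uses)               # records beyond one per number
    rate = surplus / total if total else 0.0  # assumed definition of the rate
    # Only groups with two or more uses count as duplicate groups.
    sizes = Counter(n for n in uses.values() if n > 1)
    return rate, dict(sizes)

# Ten hypothetical 001 values: one group of two and one group of three.
rate, sizes = duplication_stats(
    ["1", "2", "2", "3", "4", "4", "4", "5", "6", "7"]
)
```

Here `rate` is the fraction of records that deduping would remove, and `sizes` shows how many groups of each size exist.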

Pseudo-Duplicates

Libraries that cataloged on OCLC during the 1970s may discover that the same record has been used hundreds of times. Prior to the availability of separate bibliographic formats for serials, AV, scores, manuscripts, etc., libraries cataloged these materials using a single bibliographic record that was edited over and over again. Bibliographically, these records are not duplicates. To retain them, the database vendor must be able to distinguish these records from genuine duplicates. If the library's local system requires a unique control number in each record, LTI can replace the original OCLC control number in pseudo-duplicate groups with an artificial number sequence.
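The renumbering step can be sketched as follows. This is an illustration only: the text does not say how pseudo-duplicates are recognized, so the sketch assumes that uses of one control number with different normalized titles are separate works. The dictionary layout, the `X` prefix, and the function name are all hypothetical.

```python
from itertools import count

def renumber_pseudo_duplicates(group, counter, prefix="X"):
    """Give each distinct title in a same-001 group its own control number."""
    assigned = {}  # normalized title -> control number used for that title
    out = []
    for rec in group:
        title = " ".join(rec["245"].lower().split())
        if title not in assigned:
            if not assigned:
                # The first title seen keeps the original control number.
                assigned[title] = rec["001"]
            else:
                # Later titles receive the next artificial number.
                assigned[title] = f"{prefix}{next(counter):08d}"
        out.append({**rec, "001": assigned[title]})
    return out

group = [
    {"001": "500", "245": "Annual report"},
    {"001": "500", "245": "Quarterly bulletin"},
    {"001": "500", "245": "Annual  Report"},
]
renumbered = renumber_pseudo_duplicates(group, count(1))
```

After renumbering, the two uses of "Annual report" still share a number and remain a genuine duplicate group, while "Quarterly bulletin" stands alone.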

Deduping Options

There are three primary options for resolving multiple uses of the same record within a database: 1) retain the first use of a record and discard later uses; 2) retain the latest use of a record and discard earlier uses; and 3) retain either the first or latest use of the record and consolidate information from other uses into the retained record. Most OCLC libraries keep the latest use of a record, or the latest use with holdings and/or call number fields merged from prior uses into the latest use.
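The first two options can be sketched in a few lines. This is a simplified illustration, assuming records arrive in the chronological order of the archive tape and are represented as dictionaries keyed by MARC tag. The `used` key and the sample data are hypothetical.

```python
def dedupe(records, keep="latest"):
    """Retain one use per 001 control number from a chronological list."""
    if keep not in ("first", "latest"):
        raise ValueError("keep must be 'first' or 'latest'")
    retained = {}
    for rec in records:
        # 'latest' overwrites on every use; 'first' stores only the first use.
        if keep == "latest" or rec["001"] not in retained:
            retained[rec["001"]] = rec
    return list(retained.values())

journal = [
    {"001": "1001", "used": "1984-03-01"},
    {"001": "1002", "used": "1984-05-10"},
    {"001": "1001", "used": "1986-07-22"},
]
latest = dedupe(journal)                # keeps the 1986 use of record 1001
first = dedupe(journal, keep="first")   # keeps the 1984 use of record 1001
```

The third option adds a consolidation pass over the discarded uses before they are dropped.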

As a rule, OCLC Cancel transactions, as well as all records in duplicate groups ending with a Cancel, are deleted from the library's database. However, the library's database vendor should provide for retention of selected records in duplicate groups ending in a Cancel transaction when the canceled record represents a holding library other than that represented on prior uses. OCLC test records, which are used to verify the library's catalog card print profile options, are also deleted during deduping.
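The Cancel rule and its exception can be sketched as below. This is a simplified reading of the paragraph above, not a vendor's actual logic: the `type` and `holding` keys are hypothetical stand-ins for the transaction type and holding library code, and a Cancel in the middle of a group is simply dropped.

```python
def resolve_cancels(group):
    """Filter the chronological uses of one control number.

    Returns the uses that survive Cancel processing.
    """
    uses = [r for r in group if r["type"] != "cancel"]
    last = group[-1]
    if last["type"] != "cancel":
        return uses
    # The group ends in a Cancel: keep only uses that represent a
    # holding library other than the one that was canceled.
    return [r for r in uses if r["holding"] != last["holding"]]

withdrawn = [
    {"type": "produce", "holding": "MAIN"},
    {"type": "cancel", "holding": "MAIN"},
]
branch_only = [
    {"type": "produce", "holding": "MAIN"},
    {"type": "update", "holding": "BRCH"},
    {"type": "cancel", "holding": "BRCH"},
]
```

`resolve_cancels(withdrawn)` leaves nothing, while `resolve_cancels(branch_only)` keeps the MAIN record because the Cancel applied to a different holding library.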

An alternative approach to duplicate record resolution is to retain records based on a hierarchy of holding library codes. Holding library deduping is necessary when multiple libraries want to share a common database. In other cases a single institution may want a collection or branch to be used as a criterion for record retention, e.g., main holding library record to take precedence over a juvenile holding library record. If necessary, secondary deduping can then be performed using the first or latest record use. LTI's software permits the library to customize deduping based on its own special requirements.
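A hierarchy with a secondary first-or-latest rule might look like the following sketch. The holding codes, the `holding` and `seq` keys, and the function name are hypothetical. Codes missing from the hierarchy are assumed to rank last.

```python
def dedupe_by_holding(group, priority, keep="latest"):
    """Pick one use from a chronological duplicate group.

    priority lists holding library codes from most to least preferred.
    """
    rank = {code: i for i, code in enumerate(priority)}
    unranked = len(priority)  # codes not in the hierarchy sort last

    def level(rec):
        return rank.get(rec["holding"], unranked)

    best = min(level(r) for r in group)
    candidates = [r for r in group if level(r) == best]
    # Secondary deduping: first or latest use among the preferred library.
    return candidates[0] if keep == "first" else candidates[-1]

group = [
    {"holding": "MAIN", "seq": 1},
    {"holding": "JUV", "seq": 2},
    {"holding": "MAIN", "seq": 3},
    {"holding": "JUV", "seq": 4},
]
chosen = dedupe_by_holding(group, ["MAIN", "JUV"])
```

Here the latest MAIN use wins even though a JUV use is more recent.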

Regardless of which record is kept, the library should have the option to consolidate variable field data from unused records into the record of choice. At LTI this consolidation can be performed either on a field-by-field basis (e.g., 049 field) or by tag group, e.g., all 6XX fields. When two or more fields in records to be merged are identical, only one field is retained.
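Consolidation by field or by tag group can be sketched as follows. This is an illustration under simple assumptions: each record is a list of `(tag, value)` pairs, a pattern such as `6XX` matches any tag beginning with 6, and two fields are identical only when tag and value match exactly. The names are hypothetical.

```python
def tag_matches(tag, pattern):
    """'6XX' matches any tag starting with 6; '049' matches only 049."""
    return len(tag) == len(pattern) and all(
        p in ("X", t) for t, p in zip(tag, pattern)
    )

def consolidate(retained, others, patterns):
    """Copy selected fields from unused records into the retained record."""
    merged = list(retained)
    seen = set(retained)
    for rec in others:
        for field in rec:
            tag, _value = field
            wanted = any(tag_matches(tag, p) for p in patterns)
            if wanted and field not in seen:  # identical fields kept once
                merged.append(field)
                seen.add(field)
    return merged

retained = [("001", "1"), ("049", "MAIN"), ("650", "Cats")]
earlier = [("001", "1"), ("049", "MAIN"), ("049", "BRCH"),
           ("650", "Cats"), ("651", "Ohio"), ("500", "Note")]
result = consolidate(retained, [earlier], ["049", "6XX"])
```

The sketch appends merged fields at the end of the record; a production process would presumably restore tag order afterward.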

Other Deduping Considerations

The complexities of deduping account for the various types of deduping offered by vendors. The best approach for your library depends on several factors, including the purpose for which the database is being prepared, the source of the records, and the cataloging practices used in creating them. Deduping for a library that wants only a printed list of titles for insurance purposes need not be complex. On the other hand, if an institution intends to use the database to generate item fields and smart barcode labels, careful attention must be given to duplicate record resolution.

Library cataloging practices also affect deduping. The process is greatly simplified if the library has edited records so that the latest use includes all bibliographic, call number, and holdings data. The majority of OCLC libraries, however, do not update complete bibliographic/holdings information each time a record is reused. Consequently, retaining the latest occurrence can result in the loss of information if the database vendor is unable to offer record consolidation as a processing option.

OCLC libraries must understand that deduping on the 001 field control number does not mean that all duplicate records have been eliminated from the library's database. Anyone who has used OCLC knows that it is not uncommon for the same edition of a work to be cataloged on multiple records. Because these records have different OCLC control numbers, deduping on control number is not going to identify them.

019 Field Deduping

For OCLC libraries, a special area of concern in record control number deduping is caused by the dynamic nature of the Online Union Catalog. OCLC has defined the 019 field to hold OCLC control numbers of duplicate records deleted from the online system. The most frequent cause of OCLC record deletion is the "bumping" of a member-input record by an LC MARC or other national library record. While displaced records are not retained in the online system, they continue to reside on the library's archive tape.

OCLC control numbers of deleted records are found in the 019 field of the record retained in the OCLC Online System. About 5% of the records in your library's database will contain an 019 field. Only a small subset of deleted record numbers appearing in the 019 field will actually match an existing record in the library's database. However, when deduping a large database, even a small overlap can lead to hundreds of duplicate records remaining unidentified. LTI was once asked to dedupe a database that had already been deduped by another agency without reference to 019 fields. From a file of 140,783 "deduped" records, 019 field deduping identified an additional 882 duplicate groups. LTI provides 019 deduping as part of its regular duplicate record resolution service for OCLC libraries.
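One way to picture 019 deduping is as a redirect table from deleted control numbers to the number of the record that replaced them. The sketch below is an assumption about how such matching could work, not a description of LTI's software. Records are dictionaries with an `001` string and an optional `019` list, and all names are hypothetical.

```python
from collections import defaultdict

def group_with_019(records):
    """Group records by 001 after redirecting numbers listed in 019 fields."""
    redirect = {}
    for rec in records:
        for old in rec.get("019", []):
            if old != rec["001"]:
                redirect[old] = rec["001"]

    def canonical(number):
        # Follow redirects, guarding against accidental cycles.
        visited = set()
        while number in redirect and number not in visited:
            visited.add(number)
            number = redirect[number]
        return number

    groups = defaultdict(list)
    for rec in records:
        groups[canonical(rec["001"])].append(rec)
    return dict(groups)

archive = [
    {"001": "100"},                  # member-input record, later bumped
    {"001": "200", "019": ["100"]},  # replacement record that names 100
    {"001": "300"},
]
groups = group_with_019(archive)
```

Plain 001 deduping would treat records 100 and 200 as unrelated. With the redirect they fall into one group, which can then be resolved with any of the retention options described earlier.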

Other Deduping Algorithms

The most effective way to eliminate duplicates from a database composed of OCLC or RLIN records is to use the OCLC or RLIN control number found in the 001 field. The same applies for any database in which the records have a unique control number.

Two other control number deduping keys are sometimes used to eliminate multiple occurrences of the same record. LTI uses enhanced LCCN and ISBN keys to identify and eliminate duplicates from databases lacking unique control numbers. Each key is supplemented with information taken from the title field and date of publication in order to reduce false matches.
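An enhanced key of this kind might be built as in the sketch below. The text does not give the layout of LTI's keys, so the separator, the ten-character title segment, and the normalization rules here are assumptions chosen for illustration.

```python
import re

def enhanced_key(number, title, year, title_chars=10):
    """Combine an LCCN or ISBN with title and date to limit false matches."""
    # Strip hyphens and spaces; uppercase so an ISBN check digit 'x' matches 'X'.
    num = re.sub(r"[^0-9A-Za-z]", "", number).upper()
    # Lowercase the title, drop punctuation, and squeeze out the spaces.
    words = re.sub(r"[^a-z0-9 ]", "", title.lower()).split()
    ttl = "".join(words)[:title_chars]
    return f"{num}|{ttl}|{year}"

key = enhanced_key("0-13-110362-8", "The C Programming Language", "1988")
```

Two records that share an ISBN but carry different titles or dates, as can happen when a number is reused or mis-keyed, produce different keys and are not merged.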

For records lacking a control number, it may be necessary to adopt a non-numeric deduping key. Non-numeric deduping relies on the creation of a composite identification key. The more sophisticated the key, the greater the probability that only duplicate records will be detected and eliminated.

LTI's non-numeric deduping key contains 52 characters. It combines fixed and variable field information, including data extracted from the title, imprint, and physical description. Non-numeric deduping means making trade-offs between precision and recall and is not as effective as control number deduping. It is most useful in merging records created from different sources that have failed to match on one of the standard library control number fields.
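A fixed-width composite key can be sketched as follows. Only the overall length of 52 characters comes from the text above. The segment names and widths are invented for illustration and do not represent LTI's actual key.

```python
import re

# Hypothetical layout: the widths are chosen only so that they total 52.
SEGMENTS = (("title", 30), ("publisher", 10), ("year", 4),
            ("pages", 4), ("size_cm", 4))

def normalize(text):
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def composite_key(rec):
    """Build a fixed-width key from title, imprint, and physical description."""
    parts = []
    for name, width in SEGMENTS:
        value = normalize(str(rec.get(name, "")))
        parts.append(value[:width].ljust(width, "_"))  # truncate or pad
    return "".join(parts)

a = {"title": "Moby Dick; or, The Whale", "publisher": "Harper & Brothers",
     "year": "1851", "pages": "635", "size_cm": "20"}
b = {"title": "MOBY DICK, OR THE WHALE", "publisher": "Harper & Brothers.",
     "year": "1851", "pages": "635", "size_cm": "20"}
```

Records `a` and `b` differ only in case and punctuation, so they yield the same key. Shortening a segment catches more variant records (higher recall) at the cost of more false matches (lower precision), which is the trade-off described above.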
