Toolkit FAQ: Probabilistic Linkage

Latest post 06-15-2007 1:22 PM by administrator. 0 replies.

    What is record linkage?

    Record linkage is the task of identifying records that correspond to the same entity, within a single data source or across several. Entities of interest include individuals, companies, geographic regions, families, and households.

    How is deterministic record linkage different from probabilistic record linkage?

    Record linkage procedures are generally divided into two broad categories: deterministic and probabilistic. When records are linked only if they agree exactly on one or more matching variables, the procedure is called deterministic. Deterministic linkage usually depends on the availability of one or more unique identifiers, such as the Social Security number (SSN), which is often missing or barred from use by regulation. It also requires that the unique identifier (ID) be error free and non-missing.
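    As a rough illustration of the deterministic case, the sketch below links two small files on exact equality of a shared ID. The field names ("ssn", "name"), the function name, and the sample records are all hypothetical and are not taken from any particular linkage package.

```python
# Deterministic linkage sketch: link two files on an exact match of a shared ID.
# Field names, the function name, and the sample records are hypothetical.

def deterministic_link(file_a, file_b, key="ssn"):
    """Return (linked_pairs, unlinked_a) using exact equality on `key`."""
    # Index file B by the identifier, skipping records where it is missing.
    index_b = {}
    for rec in file_b:
        if rec.get(key):
            index_b.setdefault(rec[key], []).append(rec)

    linked, unlinked = [], []
    for rec in file_a:
        # A record with a missing ID can never be linked deterministically.
        matches = index_b.get(rec[key], []) if rec.get(key) else []
        if matches:
            linked.extend((rec, m) for m in matches)
        else:
            unlinked.append(rec)
    return linked, unlinked


file_a = [
    {"ssn": "111-22-3333", "name": "Ann Lee"},
    {"ssn": None, "name": "Bob Ray"},            # missing ID
    {"ssn": "444-55-6666", "name": "Cy Dunn"},
]
file_b = [
    {"ssn": "111-22-3333", "name": "Anne Lee"},  # name differs, ID agrees
    {"ssn": "444-55-6667", "name": "Cy Dunn"},   # one-digit error in the ID
]

linked, unlinked = deterministic_link(file_a, file_b)
```

    In this toy data only the first record is linked. The record with a missing ID and the record whose ID has a one-digit error both stay unlinked, which is the weakness of relying on an error-free, non-missing identifier.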

    An error-free, shared unique ID is unlikely to be present in large datasets, because regulations restrict the use of such IDs, because the IDs were never collected, or because omission rates are high. In such situations, probabilistic linkage can be used to link records. Probabilistic linkage estimates the likelihood that a pair of records is a correct match while allowing for incomplete or erroneous values within the records. It is helpful when values are spelled differently, names are abbreviated differently, nicknames are used in one file and not in the other, two last names are used, the first and last names are swapped, dates or parts of dates are misreported or swapped, data are missing, or the data contain other kinds of errors.

    Probabilistic linkage technology enables linkage of large public health databases at the record level with considerably high accuracy, even when unique identifiers are not available across the datasets being matched or are missing on a considerable proportion of records. Rather than looking for an exact match on a single variable or a combination of variables, probabilistic methodology depends on calculating the probabilities of false matches and false non-matches.
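    One common way to turn these probabilities into a decision, usually attributed to Fellegi and Sunter, is to add up a log-likelihood weight for each compared field and check the total against two cutoffs. The sketch below assumes that formulation. The per-field m and u probabilities, the cutoff values, and the field names are made up for illustration; in practice they would be estimated from the data.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "last_name":  (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
    "sex":        (0.99, 0.50),
}

def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in probs.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value adds no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight (> 0)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (< 0)
    return total

def classify(weight, upper=8.0, lower=0.0):
    """Hypothetical cutoffs: link, non-link, or send to clerical review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

a = {"last_name": "smith", "first_name": "robert", "birth_year": 1970, "sex": "M"}
b = {"last_name": "smith", "first_name": "bob", "birth_year": 1970, "sex": "M"}
c = {"last_name": "smith", "first_name": "bob", "birth_year": None, "sex": "M"}
d = {"last_name": "jones", "first_name": "alan", "birth_year": 1985, "sex": "M"}

labels = [classify(match_weight(a, other)) for other in (b, c, d)]
```

    With these illustrative values, the nickname disagreement between "robert" and "bob" lowers the total but does not rule out a link, and a missing birth year moves the pair into the review band instead of forcing a non-match.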

    How is record linkage useful in public health?

    Linkage of large databases at the individual record level enables healthcare researchers to use existing information to answer questions that could not be answered from any one of those databases alone. Researchers have used record linkage in a variety of areas in public health, including surgical care (Hall et al., 2005), neonatal re-hospitalizations (Liu and Wen, 2000), cancers of various kinds (Miller, Howe, and Sherman, 1989), patient safety (Nash et al., 1995), and injury prevention (Runge, 2000), among others. Knowledge of the steps involved in modern record linkage can help epidemiologists and health services researchers avoid wasting effort on trial and error.

    What are some broader purposes of record linking?

    Record linkage is done for a variety of reasons, such as:

    • Identifying duplicates within a single large data file, for instance to determine complications and readmissions resulting from shorter hospital stays.
    • Bringing together information from two files so that records on the same entity can be analyzed together. In many situations, more sophisticated analyses are only possible with matched records.
    • Cleaning and standardizing files.
    • Disease surveillance.
    • Assessing completion rates of health registries, such as cancer registries, and immunization completion rates.
    • Computing outcome measures.
    • Augmenting data through linkage to reduce the data collection burden, e.g. linking a hospital characteristics file (such as the AHA files) with hospital discharge data, or adding economic or demographic data for the area.
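
    Two of the purposes above, standardizing a file and identifying duplicates within it, can be sketched together. The example below is a minimal illustration, assuming a hypothetical discharge file with "id", "name", and "dob" fields. It groups records on standardized values only, so it would miss the spelling and date errors that probabilistic methods are meant to handle.

```python
from collections import defaultdict
import re

def standardize(value):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", value.lower())
    return " ".join(cleaned.split())

def find_duplicates(records, fields=("name", "dob")):
    """Group record ids that share the same standardized field values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(standardize(rec[f]) for f in fields)
        groups[key].append(rec["id"])
    # Only groups with more than one record are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical sample records.
discharges = [
    {"id": 1, "name": "O'Brien, Mary", "dob": "1950-03-02"},
    {"id": 2, "name": "OBRIEN  MARY",  "dob": "1950-03-02"},
    {"id": 3, "name": "Chan, Li",      "dob": "1981-11-30"},
]

duplicate_groups = find_duplicates(discharges)
```

    Here records 1 and 2 fall into the same group once case, punctuation, and spacing are normalized, while record 3 stays on its own.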

    What software packages are available for record linkage?

    Several dozen record linkage software packages are available on the market. NAHDO is working on a research paper that aims to describe the strengths and limitations of various probabilistic linkage software packages. While this work is in progress, some of the available packages include:

    AutoMatch and AutoStan by Matchware Technologies (now enhanced and repackaged by Vality)
    Integrity by Vality http://www.vality.com
    CHOICEMAKER www.choicemaker.com
    LINK KING http://www.the-link-king.com/
    OX-LINK http://www.oxlink.net/
    LINKAGEWIZ http://www.linkagewiz.com/
    GRLS (CanLink), Statistics Canada; tedhill@statcan.ca
    A description appears in the Austrian Journal of Statistics: http://www.stat.tugraz.at/AJS/ausg041+2/041+2Fair.pdf
    U.S. Census Bureau – GDRIVER
    User documentation at: http://nedinfo.nih.gov/docs/US%20Census%20Bureau%20Record%20Linkage%20SW%20User%20Documentation.pdf 
    FEBRL: the prototype software Febrl (Freely Extensible Biomedical Record Linkage) is hosted on SourceForge and can be downloaded from:
    https://sourceforge.net/projects/febrl/

    Ryley Fogg NAHDO IT
