Matching records that correspond to the same entities across different data sets has been used intensively in various areas, including health, national censuses, crime and fraud prevention, and national elections. The following are the main application areas that use ER to improve their operations:
• Health services: Health services is one of the areas that made an early start in matching data between multiple sources [33]. In health services, personal and medical information is collected each time a person comes into contact with a health service (such as a doctor, a clinic, or the emergency department of a hospital). If such information is matched between the various health services, it can be analyzed to improve the health system [148].
Matching records from different health services provides researchers with better quality data, patients with improved services, and policy makers with more reliable evidence to support their decision making [124]. Moreover, matching health data improves the ability to detect adverse health trends, identify disease outbreaks, detect health service problems (such as procedural or managerial issues), and improve clinical practice [125].
Australia is one of the world leaders in health data matching. It established the Health Services Research Linked Database program in Western Australia in 1995 [47]. Up to 2003, this program had supplied data for 258 projects, which produced 708 research outputs, including 172 journal articles. Moreover, the matched data from the Western Australian data matching program was directly responsible for various changes to policies and clinical practices [25]. In 2008, a study was conducted to evaluate the research output based on matched health data from the Western Australian data matching program. It concluded that matching data between different services in the health domain can make a substantial and quantifiable contribution to population health and policy development [24]. Similar health data matching programs were founded in other countries as well, including the UK, Canada, the US, and New Zealand [148].
• National statistical agencies: National statistical agencies (NSAs) are responsible for collecting and publishing data related to various areas such as population, economy, health, education, culture, or politics [33]. Statistical data generated by NSAs are provided to governments, organizations, or communities to improve decision making, evaluation, and assessment procedures [33]. NSAs have a long history of matching records when conducting statistical surveys and developing data collections [163].
Matching records allows the reuse of existing data. For example, in 2006 the Australian Bureau of Statistics (http://www.abs.gov.au/) was required to estimate the number of interstate migrations. However, there were no direct data sources that could be used to measure interstate migrations, and it was expensive to conduct a survey for that purpose [147]. Therefore, the Australian Bureau of Statistics used a number of indirect administrative data sources to estimate the number of interstate migrations, including electoral roll registration, family allowance payments, and Medicare registration data from Medicare Australia [147]. Matching existing data reduces costs, improves data quality, and reduces the burden of conducting surveys [163].
The US Census Bureau was one of the first to adopt data matching techniques, and it also played a key role in ER research [126, 161, 164]. Currently, most census bureaus, including the Australian Bureau of Statistics, apply data matching techniques to improve the quality of their collected data.
• Fraud detection: Fraud is defined in the Oxford Dictionary as "criminal deception; the use of false representations to gain an unjust advantage" [116]. With the rapid growth of modern technology and the fast means of communication currently available, fraud is increasing dramatically, leading to the loss of billions of dollars worldwide [21]. Many fraud detection problems involve very large data sets. For example, in the period from November 2011 to November 2012, around 1.9 billion credit card transactions were carried out in Australia [132]. This large number of transactions shows the importance of fraud detection: if only 0.1% of these transactions were fraudulent, 1.9 million transactions would have been affected by fraudsters in that year alone.
Currently, various information systems, statistics applications, and data mining techniques are used for fraud detection in numerous businesses and organizations [30, 121]. ER techniques can improve the quality of the data these data mining and statistical tools rely on, which in turn enhances their performance. Moreover, ER can be used to match known fraudsters to other individuals in the data. This should improve fraud detection, since fraudsters rarely work in isolation from each other [21].
• Other applications: ER has been applied in various other application areas. Search engines [69] use ER techniques to identify documents that cover similar topics. Many comparison shopping sites also take advantage of ER to identify similar products so that they can provide price comparisons [17]. Digital libraries, on the other hand, identify similar articles to improve their search facilities and to de-duplicate their data sets [170]. De-duplication can also be used in businesses to remove duplicate records from customer mailing lists, which reduces the cost of sending advertising mail to customers [113]. Moreover, business partners can benefit from matching records across their data sets to support their shared business activities.
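As an illustration of the mailing-list de-duplication mentioned above, the following is a minimal sketch rather than a full ER method. It assumes that duplicates differ only in letter case, punctuation, and spacing, so that exact matching on a normalized key is sufficient; real ER systems typically rely on approximate comparisons instead. The function names (normalize, deduplicate), the record fields (name, address), and the sample data are hypothetical and chosen only for this example.

```python
import re


def normalize(value: str) -> str:
    """Lower-case the value, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(cleaned.split())


def deduplicate(records):
    """Keep the first record seen for each normalized (name, address) key."""
    seen = set()
    unique = []
    for record in records:
        key = (normalize(record["name"]), normalize(record["address"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical customer mailing list; the first two entries refer to the same person.
mailing_list = [
    {"name": "Jane Smith", "address": "12 High St., Canberra"},
    {"name": "jane  smith", "address": "12 High St Canberra"},
    {"name": "John Doe", "address": "5 Lake Rd, Perth"},
]

unique_customers = deduplicate(mailing_list)
print(len(unique_customers))  # 2
```

Exact matching on a normalized key is cheap but misses typographical variations such as "Smith" versus "Smyth"; handling those would require approximate string comparison, which this sketch deliberately leaves out.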