This section describes the process of record linkage in semistructured data sets on GPGPU or GPGPU-like parallel hardware. We adopt a data-shaping-based approach, applying appropriate transformations to the data and to the algorithm(s) involved so that the computations can be parallelized. Parallelizing the computations lets us exploit the structured and rigid architecture of SIMT/SIMD-like parallel machines efficiently.
The proposed record-linking solution operates in three stages: (1) preprocessing, (2) identification of candidate sets, and (3) linking records. Figure 4.9 illustrates the process flow.
4.6.1 Stage 1: Preprocessing
Terms that occur too frequently in the data sets do not help in identifying linked records, because they increase the probability of false positives. Similarly, terms that occur too infrequently can increase the probability of false negatives. Hence, it is prudent to use terms that are neither too frequent nor too rare when looking for linked records.
Document frequency (DF), a robust technique from information retrieval, helps define the relative occurrence of terms. We use DF to select useful terms, which later helps prune unlikely record candidates. Specifically, we use only terms whose DF falls in the pre-specified range (<MIN DF, MAX DF>) for further search. The input to the preprocessing stage is an XML-encoded data set (comprising tree model objects). The output of the preprocessing stage is a record-wise signature set. Briefly, the preprocessing stage comprises the following.
[Figure 4.9 (diagram not reproduced). Stage 1, Preprocessing: input Record Set1 and Record Set2; output SignatureSet1 and SignatureSet2. Stage 2, Candidate Set Identification: input SignatureSet1 and SignatureSet2; output Candidate Set. Stage 3, Detection of Linked Records: input Candidate Set; output Linked Records.]
Figure 4.9: Framework for record linkage process. The figure shows the process using two sets of records: Record Set1 and Record Set2.
4.6.1.1 Data Encoding and DF Update
The input data set is parsed using an appropriate parser (e.g., the expat parser for an XML-encoded data set). On encountering a leaf node, we update its DF. Intermediate nodes (their tags and textual values) are not considered when developing the signature set of an object.
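The DF update can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes that the text value of each leaf element counts as one term, that DF is the number of records in which a term appears, and that records are identified by a known tag. The names `leaf_terms`, `update_df`, and the sample data are hypothetical, and Python's `xml.etree.ElementTree` stands in for a direct expat-based parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def leaf_terms(record):
    """Return the set of text values found at the leaf elements of one record."""
    terms = set()
    for elem in record.iter():
        # A leaf element has no child elements; intermediate nodes are skipped.
        if len(elem) == 0 and elem.text and elem.text.strip():
            terms.add(elem.text.strip())
    return terms

def update_df(xml_text, record_tag):
    """Count, for every leaf term, the number of records in which it occurs."""
    root = ET.fromstring(xml_text)
    df = Counter()
    for record in root.iter(record_tag):
        # Using a set per record means a term is counted once per record.
        df.update(leaf_terms(record))
    return df

# Hypothetical sample data, for illustration only.
sample = """<movies>
  <movie><title>The Matrix</title><actor>L. Fishburne</actor><year>1999</year></movie>
  <movie><title>Matrix</title><actor>Keanu Reeves</actor><year>1999</year></movie>
</movies>"""

df = update_df(sample, "movie")
print(df["1999"], df["The Matrix"])  # prints "2 1"
```

In this toy input the year appears in both records, so it would be the first term excluded once an upper DF bound is applied.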
4.6.1.2 Forming Signature Sets
For every object, we select the text terms whose DF falls in the pre-specified range (<MIN DF, MAX DF>) and refer to them as the signature set of that object. Each tuple of a signature set comprises a ContentID.
ID    Signature Set
U1    (The Matrix), (L. Fishburne)
U2    (Matrix), (Keanu Reeves)
U3    (Lord Of The Rings), (Peter Jackson)
Figure 4.10: Example of signature sets for three tree objects, namely U1, U2, and U3. (Assume U2 and U3 are similar to U1 shown above.) Note that the actual signature set contains hashed values instead of text values.
For example, the signature set of the i-th object, with tuples indexed 0 through k, is represented as si = [cid0, cid1, ..., cidk]i. See Figure 4.10.
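A minimal sketch of signature-set formation is given below. It assumes that the DF range is inclusive at both ends and that each record is already available as a list of leaf terms; the bounds, the function names, and the third term of every record are hypothetical. Text values are kept instead of hashed ContentIDs only to make the result readable.

```python
from collections import Counter

# Hypothetical DF bounds (inclusive); real values depend on the data set.
MIN_DF, MAX_DF = 1, 1

def document_frequency(record_terms):
    """Number of records in which each term occurs."""
    df = Counter()
    for terms in record_terms.values():
        df.update(set(terms))
    return df

def build_signature_sets(record_terms, min_df=MIN_DF, max_df=MAX_DF):
    """Keep, per record, only the terms whose DF lies in [min_df, max_df]."""
    df = document_frequency(record_terms)
    return {
        rid: [t for t in dict.fromkeys(terms) if min_df <= df[t] <= max_df]
        for rid, terms in record_terms.items()
    }

# Toy records modelled on Figure 4.10; the year values are made up.
records = {
    "U1": ["The Matrix", "L. Fishburne", "1999"],
    "U2": ["Matrix", "Keanu Reeves", "1999"],
    "U3": ["Lord Of The Rings", "Peter Jackson", "2001"],
}

signatures = build_signature_sets(records)
print(signatures["U1"])  # prints "['The Matrix', 'L. Fishburne']"
```

With these bounds the term "1999" is dropped from U1 and U2 because it occurs in two records, which is above the assumed MAX DF.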
We implement Stage 1, the preprocessing stage, using a hashing approach, which helps avoid a serious processing bottleneck. For example, during data encoding there is no need for costly string-comparison-based searches through the table that stores the string terms and the numeric codes assigned to them. Moreover, using numeric codes instead of string values also helps with storage and access. How the preprocessing stage leads to efficient overall processing is the focus of Section 4.7 and Section 4.8.
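The benefit of hashing during data encoding can be illustrated with a small sketch. It is not the hashing scheme used in this work: a Python dictionary (itself a hash table) stands in for it, and the class `TermEncoder` and its methods are hypothetical names. The point is that looking up or assigning a numeric code takes one hashed access on average, instead of a string-by-string scan of the term table.

```python
class TermEncoder:
    """Assigns a stable numeric code (ContentID) to every distinct term."""

    def __init__(self):
        self._codes = {}  # term -> numeric code, backed by a hash table

    def encode(self, term):
        # One hashed lookup; a new code is assigned only for unseen terms.
        return self._codes.setdefault(term, len(self._codes))

    def encode_signature(self, terms):
        """Replace every term of a signature set by its numeric code."""
        return [self.encode(t) for t in terms]

encoder = TermEncoder()
s1 = encoder.encode_signature(["The Matrix", "L. Fishburne"])
s2 = encoder.encode_signature(["Matrix", "L. Fishburne"])
print(s1, s2)  # prints "[0, 1] [2, 1]"
```

The resulting integer lists are compact and can be compared with integer equality, which suits fixed-width, data-parallel processing better than variable-length strings.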
4.6.2 Stage 2: Identifying Candidate Records
This is the data reduction stage. The input to this stage is the list of object signature sets, and the output is, for each object, a set of candidate linked records. The algorithm used for identifying the set of candidate linked records is listed as Algorithm 5.
Algorithm 5 takes a list of signature sets (S) as input. It compares the signature sets of all objects with each other (Line 6). Every tuple of set si is compared with every tuple of set sj, and Cand Sim Score (the candidate similarity score) is calculated (Line 7). A match at any stage results in assigning a new Cand Sim Score, and no further matching operations are carried out for the matched tuples. If Cand Sim Score exceeds the candidate similarity threshold (θcand), object oj is added to Ci, the candidate set of the i-th object (Lines 8-9). Stage 2 is amenable to parallel processing: candidate sets must be identified for all the objects of Set1, and this identification is single-program, multiple-data (SPMD) in nature. Stage 2 is therefore executed on the GPU.
Algorithm 5 IdentifyCandidateSet(S)
Input: S = (s0, s1, ..., sN−1), signature sets of N objects.
Output: C = (C0, C1, ..., CN−1), duplicate candidates for N objects. Ci: Candidate set for the i-th object (0 <= i < N).
Cand Sim Score: Candidate Similarity Score. θcand: Candidate Similarity Threshold.
1: for every si in S s.t. i ∈ [0, N−1] do
2:   Ci ← ∅
3:   for every sj in S s.t. j ∈ [0, N−1] do
4:     if i ≠ j then
5:       Cand Sim Score ← 0
6:       Compare si and sj tuple-wise
7:       Assign Cand Sim Score
8:       if Cand Sim Score > θcand then
9:         Ci ← Ci ∪ {oj}
10:       end if
11:     end if
12:   end for
13: end for
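A sequential Python sketch of Algorithm 5 is given below. The text does not specify how Cand Sim Score is computed, so the sketch assumes it is the fraction of tuples of si that also occur in sj; this scoring rule, the function names, and the sample data are assumptions made for illustration. The candidate set is initialized once per object, before the inner loop.

```python
def cand_sim_score(si, sj):
    """Assumed score: fraction of ContentIDs of si that also occur in sj."""
    if not si:
        return 0.0
    sj_ids = set(sj)
    matched = sum(1 for cid in si if cid in sj_ids)
    return matched / len(si)

def identify_candidate_set(S, theta_cand):
    """Return C, where C[i] lists the indices j of candidates for object i."""
    N = len(S)
    C = [[] for _ in range(N)]
    # Every iteration of the outer loop is independent of the others,
    # which is what makes it suitable for one GPU thread (or block) per i.
    for i in range(N):
        for j in range(N):
            if i != j and cand_sim_score(S[i], S[j]) > theta_cand:
                C[i].append(j)
    return C

# Signature sets as lists of numeric ContentIDs (made-up values).
S = [[1, 2, 3], [2, 3, 4], [7, 8, 9]]
print(identify_candidate_set(S, theta_cand=0.5))  # prints "[[1], [0], []]"
```

Objects 0 and 1 share two of three ContentIDs (score 2/3), so each becomes a candidate of the other; object 2 shares none and has an empty candidate set.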
4.6.3 Stage 3: Linking Records
After Stage 2, for every object in Set1, we have identified a set of candidates belonging to Set2. In this stage, we refine the set of candidates for every object. Specifically, we carry out intensive pairwise comparison and identify the linked records. The algorithm takes the list of candidate sets C as input and produces a list of linked records for each object. We compare all nodes of oi with all nodes of oj and calculate LR Sim Score. If LR Sim Score exceeds the linked record similarity threshold (θLR), object oj is added to LRi, the linked record set of the i-th object. Like Stage 2, Stage 3 is amenable to parallel processing because its task is essentially single-program, multiple-data (SPMD) in nature, and it is also executed on the GPU.
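The refinement step can be sketched as follows. The text does not define the node-level similarity, so this sketch assumes that each object is a list of node text values, that node similarity is the `difflib.SequenceMatcher` ratio, and that LR Sim Score is the average, over the nodes of oi, of the best match found among the nodes of oj. All names and sample values are hypothetical.

```python
from difflib import SequenceMatcher

def lr_sim_score(nodes_i, nodes_j):
    """Assumed score: mean best-match string similarity of nodes_i against nodes_j."""
    if not nodes_i or not nodes_j:
        return 0.0
    total = 0.0
    for a in nodes_i:
        # Compare this node of o_i with all nodes of o_j and keep the best match.
        total += max(SequenceMatcher(None, a, b).ratio() for b in nodes_j)
    return total / len(nodes_i)

def link_records(set1, set2, C, theta_lr):
    """C[i] holds indices into set2 that are candidates for set1[i]."""
    LR = [[] for _ in set1]
    for i, candidates in enumerate(C):  # independent per i, hence parallelizable
        for j in candidates:
            if lr_sim_score(set1[i], set2[j]) > theta_lr:
                LR[i].append(j)
    return LR

set1 = [["The Matrix", "L. Fishburne"], ["Lord Of The Rings", "Peter Jackson"]]
set2 = [["Matrix, The", "Laurence Fishburne"], ["The Matrix", "L. Fishburne"]]
C = [[0, 1], [0]]  # candidate sets as they might come out of Stage 2

LR = link_records(set1, set2, C, theta_lr=0.9)
print(LR)  # prints "[[1], []]"
```

Only the pairs listed in C are compared, so the cost of the intensive node-by-node comparison is limited to the candidates that survived Stage 2.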