In this section, we first review the hash table data structure. We then discuss a naive approach based on a hash table, followed by our efficient hash-based approach for data encoding and candidate selection, its design rationale, and its operation. We close with a discussion of the proposed hashing-based data structure.
4.7.1 Baseline Hashing-based Approach
A hash table is a popular data structure that typically supports the basic dictionary operations insert, find, and delete. An entry is inserted as follows: compute a hash value of the entry using an appropriate hash function, and write the entry at the address given by that hash value. Three design decisions are important for a hash table: the size of the table, the hash function, and the collision resolution scheme. The literature suggests several approaches to handling collisions, which are classified into open addressing and closed addressing [28], [85]. An example of closed addressing is separate chaining. Examples of open addressing are linear probing, quadratic probing, double hashing, and cuckoo hashing [107]. Perfect hashing [81] instead avoids collisions altogether for a fixed set of keys.
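The dictionary operations and one collision resolution scheme can be sketched as follows. This is a minimal illustration of separate chaining only, not the structure proposed later in this section; the class name ChainedHashTable and the CRC32 default hash are assumptions made for the example.

```python
import zlib


class ChainedHashTable:
    """Toy hash table that resolves collisions by separate chaining."""

    def __init__(self, size=8, hash_fn=None):
        self.size = size
        # Hypothetical default: CRC32 of the UTF-8 bytes (not from the source).
        self.hash_fn = hash_fn or (lambda key: zlib.crc32(key.encode("utf-8")))
        self.slots = [[] for _ in range(size)]

    def _slot(self, key):
        # Fold the hash value into the table range.
        return self.slots[self.hash_fn(key) % self.size]

    def insert(self, key, value):
        slot = self._slot(key)
        for i, (k, _) in enumerate(slot):
            if k == key:
                slot[i] = (key, value)  # overwrite an existing entry
                return
        slot.append((key, value))  # colliding keys share the slot's chain

    def find(self, key):
        for k, v in self._slot(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        slot = self._slot(key)
        for i, (k, _) in enumerate(slot):
            if k == key:
                del slot[i]
                return True
        return False
```

Open addressing schemes would instead store every entry in the slot array itself and probe for another slot on a collision.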
A naive way to apply hashing in the pre-processing stage, that is, Stage 1, is shown in Algorithm 6. The basic idea is as follows. Given a term t, obtain the hash h of t. If h is outside the range of the hash table, take the MOD of h with respect to the table size and assign the result to h. Update the DF count at address h in the hash table, and update the document ID list.
The baseline approach discussed above has a serious limitation: the memory footprint of the hash table is prohibitively large and a potential memory bottleneck. Recall that record linkage in the era of Big Data must leverage the compute power of SIMT machines to meet the processing-time challenges. However, SIMT machines have limited memory. Reducing the memory requirements of the data structures and algorithms involved is therefore critical to taking advantage of SIMT machines.
Reducing the size of the hash table is an option, but it results in false collisions, which artificially raise the DF count of some terms. This artificial rise can adversely affect the record linkage task by reducing the accuracy of the linked records.
Algorithm 6 BaselineApproach(t)
T: A hash table of size S
1: h ← Hash(t)
2: if (h ≥ S) then
3:     h ← h % S
4: end if
5: Update the DF at address h in T
6: Update the document ID list
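Algorithm 6 can be sketched in Python as follows. This is an illustrative reading, not the original implementation: the source does not name Hash(), so CRC32 is used as a stand-in, the class name BaselineTable is hypothetical, and the sketch assumes update is called once per distinct term per document so that the counter is a document frequency.

```python
import zlib


def crc32_hash(term):
    # Hypothetical stand-in for Hash(); the source does not name a hash function.
    return zlib.crc32(term.encode("utf-8"))


class BaselineTable:
    """Direct-addressed table holding a DF count and a document ID list per slot."""

    def __init__(self, size, hash_fn=crc32_hash):
        self.size = size
        self.hash_fn = hash_fn
        self.df = [0] * size
        self.doc_ids = [[] for _ in range(size)]

    def update(self, term, doc_id):
        h = self.hash_fn(term)
        if h >= self.size:      # hash falls outside the table
            h = h % self.size   # fold it back into range
        self.df[h] += 1
        self.doc_ids[h].append(doc_id)
        return h
```

With a 16-slot table, for instance, two terms whose (made-up) hash values are 2 and 18 both land on slot 2, so the DF stored there counts both terms.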
Figure 4.11: Signature Selector data structure. (a) A document and the hashing of its leaf node terms. (b) The mapping of hash values to the Signature Selector data structure (without a hash table). (c) The organization of the Signature Selector into sets Set0 and Set1.
4.7.2 Signature Selector Data Structure: A Novel Approach
While a hash table is appropriate for dictionary operations (find, insert, and delete), the requirements of data encoding and candidate selection are different. For example, the Signature Selector data structure (SignatureSel) never needs to perform deletions; it only carries out find and insert operations. Hash value collisions (or true collisions) affect the candidate selection process by artificially inflating the DF count of a few terms. We provision an auxiliary data structure that records the IDs of the documents (trees) in which the terms occur.
Figure 4.11 depicts the design of our Signature Selector and its operation. Figure 4.11(a) shows that the terms accompanying leaf nodes are considered for the signatures, and that a hash value of each such term is generated using the Hash() function. The hash value h thus obtained is used as an entry in the Signature Selector data structure. Since the Signature Selector is much smaller than the range of hash values produced by Hash(), we need to map h to an appropriate line (and hash bucket) in the Signature Selector. The mapping scheme is shown in Figure 4.11(b).
Figure 4.11(c) shows the Signature Selector data structure. It has three fields: (1) Hash, which holds the hash of a given term; (2) DF, which is the DF of that term; and (3) List, which contains the IDs of all documents in which the term occurs. Note that the list of document IDs can become very large if the term occurs in a large number of documents, which is a potential problem. We address this issue by exploiting the DF itself. We conjecture that if a term occurs in too many documents, its discerning power is limited, and hence so is its effectiveness in helping identify potential candidate records (and potential linked records). It is therefore prudent not to use such terms when identifying candidate records. Consequently, such terms have no relevance in the Signature Selector data structure, and there is no need to maintain a list of documents for them. Formally, if the DF of a given term exceeds the MAX DF threshold, we do not maintain the list of documents in which the term occurs.
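The three fields and the MAX DF rule can be illustrated with a small sketch. The names SelectorEntry and record, the threshold value, and the choice to represent a dropped list as None are all assumptions made for this example; the source only states that the list is not maintained once the DF exceeds MAX DF.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_DF = 3  # hypothetical threshold, chosen only for illustration


@dataclass
class SelectorEntry:
    hash_value: int
    df: int = 0
    # None means the list was dropped because the term is too frequent.
    doc_ids: Optional[List[int]] = field(default_factory=list)

    def record(self, doc_id: int, max_df: int = MAX_DF) -> None:
        self.df += 1
        if self.df > max_df:
            self.doc_ids = None  # stop tracking documents for this term
        elif self.doc_ids is not None:
            self.doc_ids.append(doc_id)
```

The DF keeps counting after the list is dropped, so a later pass can still recognize the term as too frequent.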
Algorithm 7 describes the mapping and update procedure. For a given term t, we obtain the hash h of t in Line 1. In Line 2, we derive the ID of the hash bucket, bucketID, into which h is written; Lines 3 to 5 wrap bucketID into the valid range. In Line 6, we derive the address (line) within bucketID at which h is to be written, and Line 7 writes h there. In Line 8, the DF of h is incremented. If the DF is below the MAX DF threshold, the document ID list is updated (Lines 9 to 11).
4.7.3 Forming Signature Sets
For every object, we select the text terms whose DF falls in the pre-specified range (<MIN DF, MAX DF>), and we refer to the selected terms as the signature set of that object. Each tuple in a signature set comprises a ContentID. For example, the signature set of the ith object comprising k tuples is represented as s_i = [cid_0, cid_1, ..., cid_k]_i. Algorithm 8 depicts this process, which operates sequentially on the CPU. We discuss a parallel implementation of this process on the GPU in Section 4.8.
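The selection step can be sketched sequentially as follows. This is a simplified illustration under stated assumptions: the function name build_signature_sets is hypothetical, the term string stands in for the ContentID, DFs are kept in an ordinary dictionary rather than in the Signature Selector, and the range is treated as inclusive at both ends.

```python
def build_signature_sets(documents, min_df, max_df):
    """Return one signature set per document.

    `documents` maps a document ID to its list of terms; the term string
    itself stands in for the ContentID.
    """
    # Pass 1: document frequency of each term (count a term once per document).
    df = {}
    for terms in documents.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1

    # Pass 2: keep the terms whose DF lies inside the range.
    signatures = {}
    for doc_id, terms in documents.items():
        kept = []
        for term in terms:
            if min_df <= df[term] <= max_df and term not in kept:
                kept.append(term)
        signatures[doc_id] = kept
    return signatures
```

Terms that are too rare or too frequent fall outside the range and contribute nothing to any signature set.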
Algorithm 7 updateSignatureSel(t)
1: h ← Hash(t)
2: bucketID ← FLOOR(h / numBuckets)
3: if (bucketID ≥ numBuckets) then
4:     bucketID ← bucketID % numBuckets
5: end if
6: lineID ← h % sizeBucket
7: Write h at line lineID in bucket bucketID
8: Increment DF
9: if (DF < MAX DF) then
10:     Update document ID list
11: end if
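A Python sketch of Algorithm 7 is given below, taking Line 2 to mean FLOOR(h / numBuckets) and following the strict comparison DF < MAX DF used in the algorithm. The class name SignatureSelector, the list-based line layout, and the CRC32 default hash are assumptions made for the example, not details from the source.

```python
import zlib


class SignatureSelector:
    def __init__(self, num_buckets, size_bucket, max_df, hash_fn=None):
        self.num_buckets = num_buckets
        self.size_bucket = size_bucket
        self.max_df = max_df
        # Hypothetical default hash; the source does not specify Hash().
        self.hash_fn = hash_fn or (lambda t: zlib.crc32(t.encode("utf-8")))
        # Each line holds [hash, DF, document ID list]; None marks an empty line.
        self.buckets = [[None] * size_bucket for _ in range(num_buckets)]

    def update(self, term, doc_id):
        h = self.hash_fn(term)
        bucket_id = h // self.num_buckets
        if bucket_id >= self.num_buckets:
            bucket_id %= self.num_buckets
        line_id = h % self.size_bucket
        line = self.buckets[bucket_id][line_id]
        if line is None:
            line = [h, 0, []]
            self.buckets[bucket_id][line_id] = line
        line[0] = h            # write h (a colliding term overwrites it)
        line[1] += 1           # increment DF
        if line[1] < self.max_df:
            line[2].append(doc_id)
        return bucket_id, line_id
```

With two buckets of eight lines each, a made-up hash value of 13 maps to bucket 0 (13 // 2 = 6, wrapped to 6 % 2 = 0) and line 5 (13 % 8).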
Algorithm 8 buildSignaturesSeq(t)
T: A hash table of size S
1: h ← Hash(t)
2: if (h ≥ S) then
3:     h ← h % S
4: end if
5: Update the DF at address h in T
6: Update the document ID list
4.7.4 Discussion
In designing the Signature Selector data structure, our goal is to enable a memory-efficient implementation. We do not consider a linked-list-based implementation because such an implementation is not efficient on GPGPUs. Moreover, our design favors efficient memory access while performing operations on the Signature Selector data structure. Specifically, our design considers the following aspects: (1) parallelization, (2) scalability, and (3) collision management.
4.7.4.1 Parallelization
The Signature Selector data structure is amenable to parallel processing. The task of identifying relevant critical terms, that is, terms whose DF belongs to the range (<MIN DF, MAX DF>), can be done in parallel. All the hash buckets can be processed concurrently, thus utilizing the parallel processing capabilities of modern computing systems such as GPGPUs and MICs.
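The independence of the buckets can be illustrated on the CPU with a thread pool. This sketch only demonstrates that each bucket can be scanned without touching the others; it is not the GPU implementation, and the function names, the (hash, DF) pair layout, and the inclusive range are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def select_in_bucket(bucket, min_df, max_df):
    """Return the hashes in one bucket whose DF lies in [min_df, max_df]."""
    return [h for (h, df) in bucket if min_df <= df <= max_df]


def select_parallel(buckets, min_df, max_df, workers=4):
    # Buckets share no state, so each one can be scanned independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_bucket = pool.map(lambda b: select_in_bucket(b, min_df, max_df), buckets)
        return [h for hashes in per_bucket for h in hashes]
```

On a GPGPU, each bucket (or each line within a bucket) would instead be assigned to its own thread or thread block.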
4.7.4.2 Scalability
Our design is scalable in terms of memory footprint. The hash-based candidate selector data structure reduces the memory footprint to a factor of 1/k of that of the hash table. Hence, the data structure remains k times smaller than the hash table as the hash table grows.
4.7.4.3 Collision Management
Note that hash tables are subject to collisions. A collision occurs when the hash function produces the same hash value for two distinct terms. We refer to this type of collision as a true collision. We surmise that such collisions do not adversely affect the record linkage process. A collision can also result from the folding (or aliasing) effect, which occurs when the hash value of a term lies outside the range of the hash table and a MOD of the hash value must be used. We refer to such a collision as a false collision. Because the hash range we use is large enough, we avoid aliasing effects.
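The distinction between the two kinds of collision can be made concrete with a small counting sketch. The function name classify_collisions and the hash values used with it are made up for illustration.

```python
def classify_collisions(term_hashes, table_size):
    """Count true and false collisions among distinct terms.

    `term_hashes` maps each term to its full hash value.
    """
    terms = list(term_hashes)
    true_collisions = 0
    false_collisions = 0
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            hi, hj = term_hashes[terms[i]], term_hashes[terms[j]]
            if hi == hj:
                true_collisions += 1          # identical hash values
            elif hi % table_size == hj % table_size:
                false_collisions += 1         # distinct hashes folded together
    return true_collisions, false_collisions
```

Enlarging the table removes false collisions once every hash value fits in range, while true collisions remain regardless of the table size.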