

The exact-match lookup over the proper name dictionary used in [8, 19, 62] may return incorrect results in personal name matching, because personal names have multiple representations. This study improves the searcher's performance by adding two steps before a set of candidate entities is generated: PNTM and a candidate generator.

PNTM

Personal name variation, or referential ambiguity, is a fundamental problem in personal name matching. Furthermore, this problem reduces the performance of text similarity metrics (e.g. cosine similarity, edit distance) [3]. To solve this problem, we introduced PNTM,

104 Personal Name Entity Linking

which was explained in Chapter 4 to transform personal name variations into a uniform representation before matching using a text similarity metric.

The PNTM step prepares the set of mentioned names for candidate generation by reducing numerous personal name formats to a uniform representation. The scope of our module is to transform an alternative name, a nickname, or a name that starts with a last name into a uniform format.

For example, a set of personal names mentioned in a web document is shown below:

X = {Timberlake, Veronica Finn, Diaz, Britney Spears, Lou Pearlman, Star}

will be transformed into:

X =

x1 = Timberlake = {Craig Timberlake, Justin Timberlake}

x2 = Veronica Finn = {Veronica Finn}

x3 = Diaz = {Cameron Diaz}

x4 = Britney Spears = {Britney Spears}

x5 = Lou Pearlman = {Lou Pearlman, Lucille Pearlman, Lucinda Pearlman, Louis Pearlman, Louise Pearlman}

x6 = Star = {Jeffree Star, Sunshine Dizon}

The output of personal name transformation shows that the module can generate a list of personal names from one input. For example, the two personal names Craig Timberlake and Justin Timberlake are generated from the short name Timberlake, and the list of given names {Lucille, Lucinda, Louis, Louise} is generated from the nickname Lou.

Definition 5.2. Let G_x = {g1, g2, · · · , gm} be the set of standard names generated from a mentioned name x. For example, the standard names for the mentioned name Timberlake are

G_Timberlake = {Craig Timberlake, Justin Timberlake}.

The personal name transformation can boost the performance in a text similarity function because it provides a uniform pattern of mentioned names.
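The transformation step can be illustrated with a minimal Python sketch. It is not the module from Chapter 4: the lookup tables `NICKNAMES`, `ALIASES` and `KNOWN_NAMES`, the function name `transform`, and the "Last, First" comma convention are all assumptions introduced here, populated only with the names from the example above.

```python
# Hypothetical lookup tables for illustration only; the thesis builds its own
# resources in Chapter 4.
NICKNAMES = {"Lou": ["Lucille", "Lucinda", "Louis", "Louise"]}
ALIASES = {"Star": ["Sunshine Dizon"]}  # assumed alternative-name table
KNOWN_NAMES = ["Craig Timberlake", "Justin Timberlake", "Cameron Diaz", "Jeffree Star"]


def transform(mention):
    """Return the list of standard names generated from one mentioned name."""
    # Assumed convention: "Last, First" marks a name that starts with a last name.
    if "," in mention:
        last, first = (part.strip() for part in mention.split(",", 1))
        mention = f"{first} {last}"

    tokens = mention.split()
    if len(tokens) == 1:
        # A single token is treated as a last name or an alternative name.
        by_surname = [n for n in KNOWN_NAMES if n.split()[-1] == mention]
        by_alias = [a for a in ALIASES.get(mention, []) if a not in by_surname]
        return (by_surname + by_alias) or [mention]

    # Otherwise expand a possible nickname in the first position,
    # keeping the original form as the first entry.
    first, rest = tokens[0], " ".join(tokens[1:])
    return [mention] + [f"{given} {rest}" for given in NICKNAMES.get(first, [])]


standard_names = {m: transform(m) for m in ["Timberlake", "Lou Pearlman", "Star"]}
```

With these toy tables, `transform("Timberlake")` yields the two Timberlake entries and `transform("Lou Pearlman")` yields the five given-name variants listed in the example.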

Candidate Generator

The candidate generator step is based on a text similarity measurement between the standard name and personal name surface forms that we described in Chapter 3. The Jaro-Winkler

5.3 Personal Name Entity Linking Framework (PNELF) 105

function is used to calculate the similarity score, and any person whose matching score is higher than 97% is considered a candidate entity. The 97% threshold proved acceptable in our implementation because the candidate generator is designed to balance precision and recall when generating a set of candidate entities: a name that differs from the mentioned name by a single typographical error can still become a candidate entity.

Jaro-Winkler is suitable for first name and last name matching [25, 74] and is faster than other basic edit distance algorithms [60]. Furthermore, the experimental results in Chapter 4 showed that personal name transformation combined with Jaro-Winkler provides the highest accuracy in text similarity matching. As a result, we used Jaro-Winkler in our module to search for a set of candidate entities for each mentioned name.
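The thresholded matching described above can be sketched in plain Python as follows. This is an illustrative implementation of the standard Jaro-Winkler formula (prefix scale 0.1, prefix length capped at 4), not the code used in this study; the function names and the `threshold` parameter are assumptions introduced here.

```python
def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)


def generate_candidates(standard_names, catalogue, threshold=0.97):
    """Keep catalogue names scoring above the threshold against any standard name."""
    return [p for p in catalogue
            if any(jaro_winkler(g, p) > threshold for g in standard_names)]
```

For instance, `jaro_winkler("Tony Scott", "Tony Scotti")` is about 0.98, so under this sketch a single extra character still passes the 97% threshold.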

Definition 5.3. Let P(x) = {p1, p2, · · · , pm} be the set of candidate entities for a mentioned name x, where each p ∈ P(x) is a personal name in our data catalogue. For example, the set of candidate entities for the mentioned name Timberlake is:

P(Timberlake) = {Craig Timberlake, Justin Timberlake}.

Finally, the NIL value prediction is generated by returning a matching value 1 or 0 to x. It returns 1 if a mentioned name x can be linked to a personal name in our catalogue or it returns 0 if a mentioned name x cannot be linked to any personal name in our catalogue.

\[
x =
\begin{cases}
0 & \text{if } P(x) = \emptyset \\
1 & \text{otherwise}
\end{cases}
\]
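As a small illustration, the NIL rule can be written as a one-line check on the candidate set. The function name `nil_indicator` and the dictionary layout are assumptions made for this sketch; the two mentions are taken from the running example.

```python
def nil_indicator(candidates):
    """Return 0 when the candidate set is empty (NIL), 1 otherwise."""
    return 0 if not candidates else 1


# Candidate sets for two mentions from the running example.
candidate_sets = {
    "Timberlake": ["Craig Timberlake", "Justin Timberlake"],
    "Veronica Finn": [],  # no catalogue entry, so the mention is NIL
}
linkable = {mention: nil_indicator(p) for mention, p in candidate_sets.items()}
```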

From the standard names of the personal names mentioned in the document w, the candidate generator module then returns:

X=                         

P(x1) = P(Timberlake) = {Craig Timberlake, Justin Timberlake}

P(x2) = P(Veronica Finn) = {0}

P(x3) = P(Diaz) = {Cameron Diaz}

P(x4) = P(Britney Spears) = {Britney Spears}

P(x5) = P(Lou Pearlman) = {0}

P(x6) = P(Star) = {Jeffree Star, Sunshine Dizon}

As a result, the number of candidate entities in each set is reduced (it returns a set of specific candidate entities). In this step, the two mentioned names x3 = Diaz and x4 = Britney Spears are unambiguous because each returns exactly one personal name, the two mentioned names x1 = Timberlake and x6 = Star are lexically ambiguous because each returns more than one personal name, and the two mentioned names x2 = Veronica Finn and x5 = Lou Pearlman are absent from our catalogue. In the next section we propose a new technique to handle the lexical ambiguity problem.

To reduce the workload and boost the disambiguator's performance, we add a function that proves a set of candidate entities before it is passed to the disambiguator process. This function removes closely similar entities if one of them exactly matches the mentioned name.

For example, consider the two mentioned names Christian Bale and Tony Scott. In the personal name transformation process, the first names Christian and Tony can be transformed into Christopher and Anthony, because both can be nicknames. This process returns the following results:

x1 = Christian Bale = {Christian Bale, Christopher Bale}

x2 = Tony Scott = {Tony Scott, Anthony Scott}

The outputs above become the inputs to the candidate detection process. Because this process uses the Jaro-Winkler similarity function, names with minor spelling differences are also accepted. The two sets of candidate entities for Christian Bale and Tony Scott are shown below:

P(Christian Bale) = {Christian Bale, Christopher Hale, Christian Abel, Christopher Blake, Christopher Gable}

P(Tony Scott) = {Tony Scott, Tony Scotti, Tony Scott (footballer)}

These sets of candidate entities then go to the candidate proving function. After this process, the sets of candidate entities are:

P(Christian Bale) = {Christian Bale}

P(Tony Scott) = {Tony Scott, Tony Scott (footballer)}
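The candidate proving step can be sketched as a simple filter. The sketch assumes that a trailing parenthetical such as "(footballer)" is a catalogue qualifier and is ignored when testing for an exact match; that assumption, and the names `base_name` and `prove_candidates`, are introduced here for illustration and are not taken from the implementation in this study.

```python
import re


def base_name(entity):
    """Strip a trailing parenthetical qualifier, e.g. 'Tony Scott (footballer)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", entity)


def prove_candidates(mention, candidates):
    """Keep only exact matches if any exist; otherwise keep all candidates."""
    exact = [c for c in candidates if base_name(c) == mention]
    return exact if exact else list(candidates)


bale = prove_candidates(
    "Christian Bale",
    ["Christian Bale", "Christopher Hale", "Christian Abel",
     "Christopher Blake", "Christopher Gable"],
)
scott = prove_candidates(
    "Tony Scott",
    ["Tony Scott", "Tony Scotti", "Tony Scott (footballer)"],
)
```

Under these assumptions the filter reproduces the two sets above: only Christian Bale survives in the first set, while both Tony Scott entries remain in the second and are left to the disambiguator.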