
The personal name concept is an abstract occupation category that represents the character of a person. In this thesis, the occupation taxonomy is used to describe the character of a personal name entity. Occupation is the primary feature used to resolve lexical ambiguity [48]. This subsection describes how to construct the conceptualisation of each personal name, which consists of two steps:

• Building the occupation taxonomy based on the architecture of the occupation taxonomy.

• Building individual entity concept trees that are derived from the occupation taxonomy.

Occupation Taxonomy Tree

The occupation taxonomy is a typical tree that represents the relationship between a superclass and its subclasses. A class in the occupation taxonomy is a single occupation; it becomes a node in the hierarchical structure of the tree.

The main components of an occupation tree are nodes and edges. The node without a parent is the root node; it is an ancestor of all other nodes in the tree. The root node of the occupation tree is Person. A node without children is a leaf node; most of these are

50 Personal Name Surface Form and OAPnDis

derived from Wikipedia categories. Siblings are child nodes that have the same parent. The connections between nodes are called edges.
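The terminology above can be illustrated with a small sketch. The taxonomy fragment below is hypothetical (only the root Person and the occupation names come from this section; the exact parent of each node is an assumption), and the helper names are introduced purely for illustration.

```python
# Hypothetical fragment of an occupation taxonomy, stored as child -> parent.
# Only the root node "Person" has no parent.
PARENT = {
    "Person": None,
    "Entertainer and Artist": "Person",
    "Politicians": "Person",
    "performer": "Entertainer and Artist",
    "creator": "Entertainer and Artist",
    "actor": "performer",
}


def root(parent):
    """Return the single node that has no parent."""
    roots = [node for node, p in parent.items() if p is None]
    assert len(roots) == 1, "a tree has exactly one root"
    return roots[0]


def children(parent, node):
    """Return the child nodes of `node`."""
    return [n for n, p in parent.items() if p == node]


def leaves(parent):
    """Return the nodes that have no children."""
    has_child = {p for p in parent.values() if p is not None}
    return [n for n in parent if n not in has_child]


def siblings(parent, node):
    """Return the other nodes that share the parent of `node`."""
    p = parent[node]
    if p is None:
        return []
    return [n for n in children(parent, p) if n != node]


def edges(parent):
    """Return every (parent, child) connection in the tree."""
    return [(p, n) for n, p in parent.items() if p is not None]
```

In this fragment, `root(PARENT)` is Person, `siblings(PARENT, "performer")` contains only creator, and the tree has one edge fewer than it has nodes.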

We use the Modified Preorder Tree Traversal (MPTT) algorithm [39, 68] to store hierarchical data in a database. The difficulty is that a database stores data in a flat structure. MPTT uses the attributes "lft" and "rght" (because "left" and "right" are reserved keywords in SQL) to store the relationships between parent and child nodes. The MPTT algorithm is shown in Figure 3.6.

This thesis uses the example in Figure 3.6 to describe the MPTT algorithm. The algorithm starts at the root node A and travels from left to right, one level at a time, descending along the edges of the tree and assigning a left and a right value to every node in the tree. The final value is assigned to the right side of the root node.

Fig. 3.6 A Modified Preorder Tree Traversal (MPTT) algorithm
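The numbering procedure can be sketched as a recursive traversal. This is a minimal illustration, not the implementation used in this thesis: the six-node tree and the function name `assign_mptt` are assumptions, and the tree is not claimed to match Figure 3.6.

```python
# Hypothetical tree, stored as node -> ordered list of children.
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
}


def assign_mptt(tree, node, counter=1, out=None):
    """Assign (lft, rght) values to `node` and its whole subtree.

    Returns the next unused counter value and the dict of assigned values.
    """
    if out is None:
        out = {}
    lft = counter            # left value is taken on the way down
    counter += 1
    for child in tree.get(node, []):
        counter, _ = assign_mptt(tree, child, counter, out)
    out[node] = (lft, counter)   # right value is taken on the way back up
    return counter + 1, out


_, numbering = assign_mptt(TREE, "A")
```

The root receives the last value to be assigned, which is twice the number of nodes (12 for this six-node tree), and every leaf has a right value exactly one greater than its left value.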

A major advantage of the MPTT algorithm is that the "lft" and "rght" values allow the path to a node to be retrieved with a single query. For example, if we want to display the path of node E, the SQL query could be:

SELECT class FROM tree WHERE lft < 5 AND rgt > 6 ORDER BY lft ASC;

The return values of this SQL query are A, B and E. We therefore adopt the MPTT algorithm in our work to construct the entity concepts of each individual instance.
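A path query of this kind can be tried out with an in-memory SQLite database. The sketch below makes several assumptions: the table and column names follow the query above, the six-node tree and its numbering are hypothetical rather than taken from Figure 3.6, and the comparisons are inclusive so that the queried node itself appears in its own path.

```python
import sqlite3

# Hypothetical MPTT numbering for a six-node tree: (class, lft, rgt).
ROWS = [
    ("A", 1, 12),
    ("B", 2, 7),
    ("D", 3, 4),
    ("E", 5, 6),
    ("C", 8, 11),
    ("F", 9, 10),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (class TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
conn.executemany("INSERT INTO tree VALUES (?, ?, ?)", ROWS)


def path_to(conn, name):
    """Return the classes on the path from the root down to `name`."""
    lft, rgt = conn.execute(
        "SELECT lft, rgt FROM tree WHERE class = ?", (name,)
    ).fetchone()
    # A node lies on the path when its interval encloses the target interval.
    rows = conn.execute(
        "SELECT class FROM tree WHERE lft <= ? AND rgt >= ? ORDER BY lft ASC",
        (lft, rgt),
    ).fetchall()
    return [row[0] for row in rows]
```

With this numbering, `path_to(conn, "E")` yields A, B and E. Replacing the inclusive comparisons with strict ones would return only the proper ancestors of the node.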

We now describe how to create the personal name concept. The concept is based on OAPnDis, as described in Section 3.3.1. The algorithm creates the personal name concepts of each instance in two steps.

1. Building the occupation trees of each instance. Let O(p) = {o1, o2, ..., on} be the set of occupation categories of an entity p. For example, the instance Arnold Schwarzenegger has five occupations:

3.3 Occupation Architecture for Personal Name Disambiguation(OAPnDis) 51

O(Arnold Schwarzenegger) = {Politicians, actor, American film producers, American film actors, American film directors}.

The algorithm uses the layer 1 classes as the root nodes of the personal name concept because they classify people at the most general level. The occupation categories in O(p) are used to generate an occupation tree for each instance using the SQL query in Section 3.3.1. Each occupation node oi is the leaf node of its tree, and it inherits from all of its ancestors up to and including the root node. For example, Arnold Schwarzenegger is an actor, so he is also a performer and an Entertainer and Artist. As a result, once the occupation trees have been generated, the number of trees equals the number of occupation categories in O(p).

Let O(p) = {o1, o2, ..., on} be the set of occupation trees of a personal name entity, where each oi is the hierarchy of occupation categories for one occupation category. For example, T(Arnold Schwarzenegger) consists of the five trees below:

o1 = {Politicians}

o2 = {Entertainer and Artist, performer, actor}

o3 = {Entertainer and Artist, creator, producer, film maker, American film producers}

o4 = {Entertainer and Artist, performer, actor, American film actors}

o5 = {Entertainer and Artist, creator, producer, film maker, film director, American film producers, American film directors}
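The construction of one occupation tree per category can be sketched by walking from each category up to its layer 1 root. This is an illustration only: the thesis generates the trees with an SQL query over the MPTT table, whereas the sketch uses a hypothetical child-to-parent map (the name `PARENT` and both functions are assumptions) that covers four of the five occupations in the example.

```python
# Hypothetical child -> parent map; layer 1 classes act as roots (parent None).
PARENT = {
    "Politicians": None,
    "Entertainer and Artist": None,
    "performer": "Entertainer and Artist",
    "actor": "performer",
    "American film actors": "actor",
    "creator": "Entertainer and Artist",
    "producer": "creator",
    "film maker": "producer",
    "American film producers": "film maker",
}


def occupation_tree(parent, category):
    """Return the hierarchy from the root node down to `category`."""
    path = []
    node = category
    while node is not None:      # climb until the root has been recorded
        path.append(node)
        node = parent[node]
    path.reverse()               # root first, occupation category last
    return path


def occupation_trees(parent, categories):
    """Build one occupation tree for every category in O(p)."""
    return [occupation_tree(parent, c) for c in categories]


categories = ["Politicians", "actor", "American film producers", "American film actors"]
trees = occupation_trees(PARENT, categories)
```

Each category yields exactly one tree, so `len(trees)` equals `len(categories)`, and the tree for actor is Entertainer and Artist, performer, actor.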

2. Building the personal name concepts. After all the occupation trees of a personal name entity have been created, the personal name concepts for that person are generated from these trees. The root node of each occupation tree is used to identify whether or not the occupation trees belong to the same concept.

All occupation trees that have the same root node are merged, and duplicate nodes are removed so that each node is unique within a concept. Let O(p) = {o1, o2, ..., on} be the set of occupation trees of a personal name entity. Any trees oi in O(p) are considered similarity consistent, that is, they belong to the same concept, if their root nodes are equal. Let C(p) = {c1, c2, ..., cn} be the set of personal name concepts of the entity.

$c_i = \bigcup_{i=1}^{n} o_i$

where the roots of all $o_i$ are equal.

We use the occupation trees of the personal name entity Arnold Schwarzenegger to explain how personal name concepts are generated. In general, a conceptual tree is produced as follows:

• Matching the root nodes in O(p). This step produces the set of root nodes in O(p). For example, T(Arnold Schwarzenegger) has two root nodes: {Politicians, Entertainer and Artist}.

• Merging the trees oi that share the same root node to create the personal name concepts. The process loops until the final oi has been merged. For example, C(Arnold Schwarzenegger) = {c1, c2}, where c1 and c2 are as follows:

c1 = {o1}

c2 = {o2 ∪ o3 ∪ o4 ∪ o5}

As a result, the personal name entity Arnold Schwarzenegger has two personal name concepts. In the first one he is a Politician, and in the second one he is an Entertainer and Artist. The details of the concepts which are generated in these steps are shown in Figure 3.7.

Fig. 3.7 Examples of personal name concepts of Arnold Schwarzenegger: (a) Politician and (b) Entertainer and Artist
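The merging step can be sketched as grouping the occupation trees by root node and taking the duplicate-free union of their nodes. This is a minimal illustration under stated assumptions: each tree is represented as a root-first list of node names copied from the example above, and the function name `build_concepts` is hypothetical.

```python
# The five occupation trees of the Arnold Schwarzenegger example, root first.
TREES = [
    ["Politicians"],
    ["Entertainer and Artist", "performer", "actor"],
    ["Entertainer and Artist", "creator", "producer", "film maker",
     "American film producers"],
    ["Entertainer and Artist", "performer", "actor", "American film actors"],
    ["Entertainer and Artist", "creator", "producer", "film maker",
     "film director", "American film producers", "American film directors"],
]


def build_concepts(trees):
    """Merge trees that share a root node into one concept per root.

    Returns a dict mapping each root node to the duplicate-free list of
    nodes in the union of all trees with that root.
    """
    concepts = {}
    for tree in trees:
        root = tree[0]                          # trees are stored root first
        merged = concepts.setdefault(root, [])
        for node in tree:
            if node not in merged:              # drop duplicate nodes
                merged.append(node)
    return concepts


concepts = build_concepts(TREES)
```

For this input the result has two concepts, one rooted at Politicians and one rooted at Entertainer and Artist, which corresponds to c1 = {o1} and c2 = {o2 ∪ o3 ∪ o4 ∪ o5}.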