
For each Wikipedia entry obtained in the entity disambiguation step, the following information extraction process is performed. The process extracts information from the entry itself (description, infoboxes, and categories) and from entries linking to that entry, as found in the “What links here” section. For each entry, there are two lists to fill with as many values as possible: Names, where all official names, nicknames, and aliases are included; and Properties, indicating the most descriptive aspects or qualities of the entity. Every piece of extracted information is added to one of these two lists. The extraction performed on each part of the entry is described below.

Description. The first paragraph of a Wikipedia entry is considered the description of the entry, excluding any elements, such as tables of contents and infoboxes, that may be placed before it. The description typically starts with the complete name of the entity, followed by some aliases and the most descriptive properties of the entity. Figures 6.3 and 6.4 show the first paragraphs of the entries for UPC and Bill Clinton.

The Technical University of Catalonia, sometimes called UPC-Barcelona Tech, is the largest engineering university in Catalonia, Spain. The objectives of the UPC are based on internationalization, as it is Spain’s technical university with the highest number of international PhD students and Spain’s university with the highest number of international master’s degree students. The UPC-Barcelona Tech is a university aiming at achieving the highest degree of engineering excellence and has bilateral agreements with several top-ranked European universities.

Figure 6.3: Description of UPC in Wikipedia.

William Jefferson “Bill” Clinton (born William Jefferson Blythe III; August 19, 1946) is an American politician who served as the 42nd President of the United States from 1993 to 2001. Inaugurated at age 46, he was the third-youngest president. He took office at the end of the Cold War, and was the first president of the baby boomer generation. Clinton has been described as a New Democrat. Many of his policies have been attributed to a centrist Third Way philosophy of governance, while on other issues his stance was center-left.

Figure 6.4: Description of Bill Clinton in Wikipedia.

The description is preprocessed to obtain tokenization, part-of-speech tags, named entities (NEs), and a dependency parse. The first named entity is then extracted as the official name; this name is usually also boldfaced in the entry. Next, a set of patterns combining strings and parts of speech extracts the aliases that are typically found just after the official name. For example, in Figure 6.3 the pattern is “sometimes called <alias>.” Official names and aliases are added to the Names list. After names and aliases, a second set of patterns extracts the most descriptive qualities or aspects of the entity. These patterns are essentially the verbs “be” and “become” followed by a noun phrase (NP), which is extracted as a descriptive NP. In addition, the head and the term of each descriptive NP are also extracted. All three (NP, term, and head) are added to the Properties list. Figures 6.5 and 6.6 show the properties extracted for UPC and Bill Clinton, respectively.
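The two description patterns can be illustrated with a minimal Python sketch. It is not the implementation described here: the function names, the verb list, and the hand-tagged input are assumptions made for illustration, and the rule that extends the descriptive NP up to the next punctuation token is a crude stand-in for the dependency parse.

```python
import re

# Alias pattern from the text: "sometimes called <alias>".
ALIAS_PATTERN = re.compile(r"sometimes called ([^,.;]+)")

# Surface forms of "be" and "become" (an assumed, non-exhaustive list).
BE_BECOME = {"is", "was", "are", "were", "be", "been",
             "become", "becomes", "became"}
BASE_NP_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
PUNCT_TAGS = {",", ".", ":"}


def extract_aliases(description):
    """Return the strings that follow 'sometimes called'."""
    return [alias.strip() for alias in ALIAS_PATTERN.findall(description)]


def extract_properties(tagged):
    """Return (descriptive NP, term, head) triples from (word, POS) pairs."""
    rows = []
    for i, (word, _) in enumerate(tagged):
        if word.lower() not in BE_BECOME:
            continue
        start = j = i + 1
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1  # skip an optional determiner
        base_start = j
        while j < len(tagged) and tagged[j][1] in BASE_NP_TAGS:
            j += 1  # adjectives and nouns of the base NP
        base = tagged[base_start:j]
        if not base or not base[-1][1].startswith("NN"):
            continue  # no noun-headed NP after the verb
        k = j
        while k < len(tagged) and tagged[k][1] not in PUNCT_TAGS:
            k += 1  # assumption: postmodifiers run to the next punctuation
        np = " ".join(w for w, _ in tagged[start:k])
        term = " ".join(w for w, _ in base)  # base NP without determiner
        head = base[-1][0]                   # last noun of the base NP
        rows.append((np, term, head))
    return rows


UPC_TEXT = ("The Technical University of Catalonia, sometimes called "
            "UPC-Barcelona Tech, is the largest engineering university "
            "in Catalonia, Spain.")

# Hand-assigned Penn Treebank style tags, for illustration only.
UPC_TAGGED = [
    ("The", "DT"), ("Technical", "NNP"), ("University", "NNP"),
    ("of", "IN"), ("Catalonia", "NNP"), (",", ","), ("sometimes", "RB"),
    ("called", "VBN"), ("UPC-Barcelona", "NNP"), ("Tech", "NNP"),
    (",", ","), ("is", "VBZ"), ("the", "DT"), ("largest", "JJS"),
    ("engineering", "NN"), ("university", "NN"), ("in", "IN"),
    ("Catalonia", "NNP"), (",", ","), ("Spain", "NNP"), (".", "."),
]

aliases = extract_aliases(UPC_TEXT)
properties = extract_properties(UPC_TAGGED)
```

In a real pipeline the tagged tokens would come from the preprocessing step rather than being written by hand.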

Descriptive NP | Term | Head
the largest engineering university in Catalonia | largest engineering university | university
Spain’s technical university with the highest number of international PhD students | technical university | university
a university aiming at achieving the highest degree of engineering excellence | university | university

Figure 6.5: Properties extracted from the description of UPC.

Descriptive NP | Term | Head
an American politician who served as the 42nd President of the United States from 1993 to 2001 | American politician | politician
the third-youngest president | third-youngest president | president
the first president of the baby boomer generation | first president | president
center-left | center-left | center-left

Figure 6.6: Properties extracted from the description of Bill Clinton.

Infoboxes and categories are the most structured part of Wikipedia’s content, and therefore the easiest from which to extract information. From infoboxes, all the contents of the following fields are extracted: fullname, name, office, title, profession, company name, playername, occupation, nickname, official name, native name, settlement type, type. The values of the fields related to names and aliases are added to the Names list, while the others are added to the Properties list. All categories are also added into Properties.
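The routing of infobox fields can be sketched as follows. The text does not state which of the listed fields count as name fields, so the split below is an assumption, as are the function name, the dictionary representation of a parsed infobox, and the sample values.

```python
# Assumed split of the listed infobox fields into name-like and other fields.
NAME_FIELDS = {"fullname", "name", "company name", "playername",
               "nickname", "official name", "native name"}
PROPERTY_FIELDS = {"office", "title", "profession", "occupation",
                   "settlement type", "type"}


def route_infobox(infobox, categories):
    """Split infobox values into Names and Properties; add all categories."""
    names, properties = [], []
    for field, value in infobox.items():
        # Assumption: field names may use underscores instead of spaces.
        key = field.strip().lower().replace("_", " ")
        if key in NAME_FIELDS:
            names.append(value)
        elif key in PROPERTY_FIELDS:
            properties.append(value)
        # Any other field is ignored.
    properties.extend(categories)
    return names, properties


# Hand-made sample input, not copied from an actual infobox.
sample_infobox = {
    "name": "Bill Clinton",
    "office": "42nd President of the United States",
    "birth_place": "Hope, Arkansas",
}
sample_categories = ["Presidents of the United States"]

names, properties = route_infobox(sample_infobox, sample_categories)
```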

What links here is a special page of Wikipedia that lists the entries that link to the current entry. The information gathered from the entry itself is the official information, but it is not always the description that people most commonly use to refer to that entity. For instance, from the description, infoboxes, and categories of the entry “Samsung,” we find that Samsung is “a South Korean [multinational conglomerate [corporation]]” according to the description and a “company” according to the categories. However, the entries that link to Samsung reveal new properties such as manufacturer, competitor, and electronics company.

The methodology to extract information from the entries linking to the current entry is as follows. First, sentences including a link to the current entry are selected and the rest of the document is discarded. Then, each sentence is matched against a set of patterns in order to extract new information. The patterns are as follows:

• Anchor text. The text used to link to the entry, which is typically the name or an alias. All the anchor texts used to link to the entry are added to the Names of the entity. The pattern takes advantage of the wiki format, where links are annotated inside double square brackets. Pattern: [[entry|<anchor text>]].

• Left term. The nouns and adjectives to the left of the anchor text are added to the Properties list. Pattern: (NNP?|JJ)* [[entry(|*)?]].

• Such as. In some cases, the entry is linked in the middle of a comma-separated list of other similar or related entries. In many cases, these lists are introduced by a sentence including some information about the listed entries and an expression such as include, such as, or like. The pattern is then defined as follows: “<property> such as entry1, entry2,..., entryN” where one of the listed entries is the current entry.

• Appositions. As in coreference resolution, a document linking to an entry that has an NP in apposition is probably describing some property of the linked entry. So, NPs in apposition to the link are also added to the Properties list. Pattern: [[entry(|*)?]], <noun phrase>.

Nouns and adjectives at the left of the anchor text:

...the logo of electronics company Samsung, and the logo of the engineering consultancy Atkins...

include and such as patterns:

Cash register manufacturers include CHD, ELCOM, SAM4S, Casio, NCR, IBM, Panasonic, Samsung, Wincor-Nixdorf, Uniwell, RCH S.p.A., United Bank Card, Sharp, ...

Major competitors today include, in the main business, Alcatel-Lucent, Huawei, Nokia Siemens Networks and ZTE, with Cisco, IBM, EDS, Accenture, Nokia, Motorola, Samsung, LG Electronics, NEC, Sharp and most recently Apple Inc., competing with aspects of the business.

Korean companies such as LG, Hyundai and Samsung have established...

Figure 6.7: Properties of Samsung extracted from Wikipedia entries linking to Samsung.

Figure 6.7 shows some examples of the extraction patterns for the entry Samsung. These sentences have been taken from entries in Wikipedia linking to the entry Samsung.
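Three of the link patterns (anchor text, left term, and the list cue) can be approximated with the Python sketch below. The function names, the regular expressions, the wiki markup added to the sample sentences, and the hand-assigned tags are all assumptions made for illustration. Treating an unpiped link as having the entry title as its anchor text is likewise an assumption, and the apposition pattern is omitted because it needs NP chunking.

```python
import re


def link_regex(entry):
    """Match [[entry]] or [[entry|anchor text]] in wiki markup."""
    return re.compile(r"\[\[" + re.escape(entry) + r"(?:\|([^\]]*))?\]\]")


def anchor_texts(sentence, entry):
    """Anchor text of each link to `entry`; unpiped links show the title."""
    return [m.group(1) or entry for m in link_regex(entry).finditer(sentence)]


LEFT_TAGS = {"NN", "NNP", "JJ"}  # the tags accepted by (NNP?|JJ)*


def left_term(tagged, link_index):
    """Nouns and adjectives immediately to the left of the link token."""
    i = link_index
    while i > 0 and tagged[i - 1][1] in LEFT_TAGS:
        i -= 1
    return " ".join(word for word, _ in tagged[i:link_index])


# "<property> such as|include(s)|like <list of entries>"
LIST_CUE = re.compile(
    r"(?P<prop>[^,.;]+?)\s+(?:such as|includes?|like)\b(?P<items>.+)")


def list_property(sentence, entry):
    """Text introducing a list that contains a link to `entry`, else None."""
    m = LIST_CUE.search(sentence)
    if m and link_regex(entry).search(m.group("items")):
        return m.group("prop").strip()
    return None


# Sample sentences based on Figure 6.7; the link markup is assumed.
LOGO = "the logo of electronics company [[Samsung|Samsung Group]]"
KOREAN = ("Korean companies such as LG, Hyundai and [[Samsung]] "
          "have established...")

# Hand-assigned tags; the link is represented by a single LINK token.
LOGO_TAGGED = [("the", "DT"), ("logo", "NN"), ("of", "IN"),
               ("electronics", "NN"), ("company", "NN"), ("Samsung", "LINK")]

found_anchors = anchor_texts(LOGO, "Samsung")
found_left = left_term(LOGO_TAGGED, 5)
found_property = list_property(KOREAN, "Samsung")
```

Reducing the extracted property text to its term and head (for example, “Korean companies” to “companies”) would reuse the NP analysis applied to descriptions.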

All the NPs, terms, and heads extracted are added to the Names or Properties list with an associated counter. If an expression is already in the list, its counter is incremented. This value is analogous to a confidence value associated with each expression: the more often an expression is repeated, the more reliable it is considered. In order to avoid incorrect information as much as possible, we define a threshold below which all the Names and Properties are discarded.
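The counting and thresholding step can be sketched with Python's collections.Counter. The threshold value of 2 and the sample expressions are illustrative assumptions; no threshold value is given here.

```python
from collections import Counter


def add_expressions(counter, expressions):
    """Add expressions to the list; repeated ones increment their counter."""
    counter.update(expressions)


def apply_threshold(counter, threshold):
    """Keep only expressions whose count is not below the threshold."""
    return {expr: n for expr, n in counter.items() if n >= threshold}


properties = Counter()
# Hypothetical extractions gathered from several linking sentences.
add_expressions(properties, ["company", "manufacturer", "company"])
add_expressions(properties, ["competitor", "electronics company",
                             "company", "manufacturer"])

kept = apply_threshold(properties, threshold=2)  # assumed threshold
```

The same functions would serve for the Names list, with one counter per list.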