We perform our corpus study on a sample of math resources from our math corpus (as described in Section 4.3.1), from which we identify concepts and expressions as well as the possible types of relations between them.

We have randomly selected 10 Wikipedia pages from our corpus on 10 different math concepts, such as absolute value and Fourier transform. The complete list of the Wikipedia pages is presented in Table 5.1.

For each Wikipedia page, we identify the contained concepts and expressions through the following semi-automatic process:

• First, we automatically tokenize all the text (including the alternative text for images), assign POS tags to the tokens, and perform text chunking.

• Afterwards, for each phrase detected through the chunking process, we automatically check whether it contains a sub-phrase which appears in the math concept list (the same one described in Section 4.3.1). If so, this phrase is marked as a candidate concept that can be linked to the expressions. For example, after chunking, noun phrases such as “a degree 0 polynomial” and “ancient history” may be detected. Since “polynomial” appears in the math concept list while neither “history” nor “ancient history” does, we mark the former as a candidate concept but not the latter.

• We then automatically mark all the non-word and non-punctuation text tokens, as well as the LaTeX expressions in the alternative texts of the expression images on the pages, as math expressions.

• Lastly, we manually go through the pages to identify the concepts and expressions that have been missed in the automatic marking process and correct the errors in marking as necessary.

Table 5.2: Types of semantic relations between concepts and expressions.

| Name | Definition | Example | Count |
|---|---|---|---|
| Representation | The expression denotes the math representation of the concept. | A complex number is a number which can be put in the form z = a + bi. | 906 (59%) |
| Property | The property of the expression is specified by the concept. | For any real numbers x and y, ... | 294 (19%) |
| Argument | The expression serves as the argument of the concept. | Divide 3 by 4... Substitute y with x^2 + 1... | 50 (3%) |
| Context | The expression sets the context of the concept. | The absolute value of x... | 176 (11%) |
| Co-reference | The expression is referred to by the concept. | ...3^2 + 4^2 = 5^2. The previous equation... | 128 (8%) |
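The two automatic marking steps in the list above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: `CONCEPT_LIST` is a hypothetical stand-in for the math concept list, phrases are assumed to come from an upstream chunker, and the test for "non-word and non-punctuation" tokens (purely alphabetic means word, no letters or digits means punctuation) is our assumption.

```python
# Hypothetical stand-in for the math concept list (Section 4.3.1).
CONCEPT_LIST = {"polynomial", "absolute value", "hypotenuse"}


def is_candidate_concept(phrase, concept_list=CONCEPT_LIST):
    """Return True if any contiguous sub-phrase of `phrase` is in the concept list."""
    words = phrase.lower().split()
    return any(
        " ".join(words[i:j]) in concept_list
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    )


def is_expression_token(token):
    """Return True for tokens that are neither words nor punctuation.

    Assumption: a "word" is purely alphabetic and "punctuation" contains
    no letter or digit; everything else is marked as a math expression.
    """
    if token.isalpha():
        return False  # ordinary word
    if not any(ch.isalnum() for ch in token):
        return False  # pure punctuation
    return True


phrases = ["a degree 0 polynomial", "ancient history"]
candidates = [p for p in phrases if is_candidate_concept(p)]
```

Under these assumptions a single-letter variable such as x counts as a word and is not marked automatically, which is the kind of miss the manual pass in the last step is meant to catch.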

An example of the identified concepts and expressions on a page is as follows. (The concepts are in bold while the expressions are in italics.)

If we let c be the length of the hypotenuse and a and b be the lengths of the other two sides, the theorem can be expressed as the equation: a^2 + b^2 = c^2.

In total, we have identified 8,121 concepts and 2,434 expressions from the selected pages.

After the identification step, we examine how the concepts are semantically related (i.e., linked) to the expressions, when applicable.

In our domain study of math, we have coded five distinct types of semantic relations altogether, as summarized in Table 5.2.

Of these, the representation relation is the relation most important for domain-specific IR. It can be used to resolve concepts to their representations and implement the features mentioned at the beginning of this chapter. Therefore, the extraction of this relation is the focus of this chapter and forms the basis of the problem of Text-to-Construct Linking.

The other relations are not directly relevant to domain-specific IR; however, they are still useful in other contexts, such as document understanding and expression analysis. The property relation keeps track of the properties of the variables (e.g., whether a particular variable is positive or negative). When these variables are used later in other expressions, these properties may serve as descriptions of individual variables for the users, or as clues for deciding whether two variables are of the same nature during indexing. As another example, while the argument relation is not very common, it provides information about how one expression can be transformed into another. By consolidating and analyzing such information, we will be able to know whether two seemingly different expressions are equivalent (up to a few steps of transformation) or indeed different. Such knowledge would allow search systems to cluster related expressions together during indexing and improve the recall of retrieval. The context relation may seem uninformative in isolation, but when coupled with the representation relation, it can assist in connecting related expressions. For example, in the sentence “The absolute value of x is denoted as |x|.”, we would be able to correctly establish the fact that |x| is related to x as a way to express its absolute value through the context relation “The absolute value ↔ x” and the representation relation “The absolute value ↔ |x|”. Last but not least, the main objective of the co-reference relation is not to relate a concept to an expression or vice versa. Instead, it is meant to introduce an expression into another part of a resource so that more relations can be established for it. Therefore, the detection of this relation can be done as a preprocessing step to facilitate the detection of other relations.
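The way a context relation and a representation relation combine through a shared concept can be illustrated with a small join. This is a sketch under our own assumptions: the `(relation_type, concept, expression)` triple format and the function name `connect_expressions` are hypothetical, and concepts are assumed to be mention-level so that identical strings in different sentences are not conflated.

```python
from collections import defaultdict


def connect_expressions(relations):
    """Link expressions that share a concept via context + representation.

    `relations` is a list of (relation_type, concept, expression) triples.
    Returns (representation_expr, concept, context_expr) triples.
    """
    by_concept = defaultdict(dict)
    for rel_type, concept, expr in relations:
        by_concept[concept].setdefault(rel_type, []).append(expr)

    links = []
    for concept, rels in by_concept.items():
        for rep in rels.get("representation", []):
            for ctx in rels.get("context", []):
                links.append((rep, concept, ctx))
    return links


relations = [
    ("context", "The absolute value", "x"),
    ("representation", "The absolute value", "|x|"),
]
links = connect_expressions(relations)
```

For the example sentence, the single resulting link reads as "|x| expresses the absolute value of x".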

Aside from identifying and coding the possible semantic relation types, we have also surveyed our dataset for two sets of statistics to characterize the nature of the representation relation.

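Table 5.3: Multiplicity of concepts and expressions under the representation relation.

| | To None | To One | To Many |
|---|---|---|---|
| Concept | 7,368 (91%) | 652 (8%) | 105 (1%) |
| Expression | 1,554 (64%) | 854 (35%) | 27 (1%) |

Table 5.4: Distance between related concepts and constructs.

| Adjacent | One to three words apart | Four or more words apart |
|---|---|---|
| 396 (45%) | 300 (34%) | 189 (21%) |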

We term the first statistic multiplicity. It specifies how many expressions are related to one concept in a sentence through the representation relation, and vice versa.

As shown in Table 5.3, most of the concepts (91%) are not related to any constructs through the representation relation. This is expected since concepts are often mentioned in text without their representations in expressions. In contrast, more than one third of math expressions are related to exactly one concept. This indicates that whenever an expression appears, there is a good chance that the concept it represents can be found in the same sentence. Moreover, it is possible (although unlikely) for one expression to be related to multiple concepts. This happens when multiple names are introduced as different ways to refer to the same expression.
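To make the multiplicity statistic concrete, the sketch below tallies the "to none / to one / to many" buckets from a set of representation links. The identifiers and the pair-based input format are hypothetical, chosen for illustration only; the annotation format used in the study is not described here.

```python
from collections import Counter


def bucket(n):
    """Map a partner count to a multiplicity bucket."""
    if n == 0:
        return "none"
    return "one" if n == 1 else "many"


def multiplicity(concepts, expressions, pairs):
    """Tally multiplicity buckets for concepts and for expressions.

    `pairs` holds (concept_id, expression_id) representation links between
    mentions in the same sentence; duplicate pairs are ignored.
    """
    pairs = set(pairs)
    concept_degree = Counter(c for c, _ in pairs)
    expression_degree = Counter(e for _, e in pairs)
    concept_row = Counter(bucket(concept_degree[c]) for c in concepts)
    expression_row = Counter(bucket(expression_degree[e]) for e in expressions)
    return concept_row, expression_row


# Toy data: two names (c1, c2) introduced for the same expression e1.
concept_row, expression_row = multiplicity(
    concepts=["c1", "c2", "c3"],
    expressions=["e1", "e2"],
    pairs=[("c1", "e1"), ("c2", "e1")],
)
```

In the toy data, e1 falls in the "many" bucket because two names refer to it, which mirrors the case described above.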

We have also analyzed representation distance, which measures how far apart a related concept and construct are, in number of words.
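One way to compute this distance is sketched below. It assumes that the distance is the number of words lying strictly between the two mentions and that mentions are given as word-offset spans within one sentence; both are our assumptions, since the exact counting convention is not spelled out. The bucket names follow Table 5.4.

```python
def words_apart(concept_span, expression_span):
    """Number of words between two mentions in the same sentence.

    Spans are (start, end) word offsets with `end` exclusive.
    Overlapping spans are treated as distance 0.
    """
    (c_start, c_end), (e_start, e_end) = concept_span, expression_span
    if c_end <= e_start:
        return e_start - c_end
    if e_end <= c_start:
        return c_start - e_end
    return 0


def distance_bucket(gap):
    """Map a word gap to the buckets used in Table 5.4."""
    if gap == 0:
        return "adjacent"
    if gap <= 3:
        return "one to three words apart"
    return "four or more words apart"


# "The absolute value of x is denoted as |x| ."
# Concept "The absolute value" covers words 0..2; "|x|" is word 8.
gap = words_apart((0, 3), (8, 9))
label = distance_bucket(gap)
```

Here five words ("of x is denoted as") separate the concept from |x|, so the pair falls into the last bucket.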

As shown in Table 5.4, when a concept is related to an expression through the representation relation, the two are usually (79% of the time) either adjacent or one to three words apart. Given this close proximity of a concept and its related expression, distance information is likely to be useful in extracting the representation relation. Note that we do not consider a concept and an expression to be related by the representation relation if they are not in the same sentence. In such cases, a text phrase would have been used to introduce the concept or the expression into the sentence of the other. Therefore, co-reference resolution is required to resolve the text phrase to the concept or expression it refers to before relation extraction can be performed on the resulting pair of concept and
