The task of deriving a DOP grammar from a treebank by extracting sets of fragments from the treebank and computing their frequencies, as well as storing and compiling these grammars, is computationally expensive. Fortunately, as the two-phase algorithm used to compute the parse space for each input string requires only an indication as to which fragments each underlying CFG rule appears in, it is not necessary to explicitly extract and store the fragment set. Rather, we store only the treebank trees themselves and establish the fragment set on the fly.

This is accomplished by first explicitly applying the root operation to the treebank trees, yielding a set of ‘intermediate’ fragments the size of which is linear in the number of nodes in the treebank. The frontier operation is then applied by assigning to each node n in each intermediate fragment a set of fragment identifiers such that if its left and right child nodes nl and nr are present in a fragment, then the corresponding fragment identifier appears in the node’s identifier set. Either both nl and nr are present in the fragment or neither is present, in which case node n is itself either a substitution site or not in the fragment. Thus, the presence of fragment identifier fid at node n signifies that the CFG rule n → nl nr occurs in fragment fid.

If n’s left and right child nodes nl and nr are present in a fragment, each of these child nodes can be either internal to that fragment (nli, nri) or a substitution site of that fragment (nls, nrs). Thus, we can partition the set of identifiers at node n into four sets representing the four possible combinations of internal and external child nodes: <nls,nrs>, <nls,nri>, <nli,nrs> and <nli,nri>.³ Extracting these partitioned sets of fragment identifiers along with each CFG rule extracted gives us the correspondence between the fragment set and the CFG underlying it which is required to perform the transition from parse phase 1, in which the CFG parse space is constructed, to parse phase 2, in which the corresponding DOP parse space is constructed.

representations are to be constructed. Examples in this thesis are the paired tree representations used for translation and the trees associated with f-structures for LFG parsing. The relative merits of each algorithm for these types of representations are discussed in the relevant chapters.

³ We can also partition according to whether n is a root or internal node, creating eight partitions rather than four.

(A) Root-generated ‘intermediate’ fragment which has been converted to ECNF (through which node Bx has been inserted), with each node annotated with the number of different subtrees it yields when the frontier operation is applied:

    [A [B b] [C [E e] [F f]] [D d]]  ⇒  [A(20) [B(1) b] [Bx(10) [C(4) [E(1) e] [F(1) f]] [D(1) d]]]

(B) Node annotations representing all possible frontier operations, where the total number of frontier operations possible is 20 and the fragments corresponding to each of these frontier operations have been allocated identifiers from the set of integers 1–20:

    A(20)    <Bs,Bxs>:{}           <Bs,Bxi>:{1-10}        <Bi,Bxs>:{}             <Bi,Bxi>:{11-20}
    Bx(10)   <Cs,Ds>:{1,11}        <Cs,Di>:{2,12}         <Ci,Ds>:{3-6,13-16}     <Ci,Di>:{7-10,17-20}
    C(4)     <Es,Fs>:{3,7,13,17}   <Es,Fi>:{4,8,14,18}    <Ei,Fs>:{5,9,15,19}     <Ei,Fi>:{6,10,16,20}
    B(1)     <b>:{11-20}
    E(1)     <e>:{5-6,9-10,15-16,19-20}
    F(1)     <f>:{4,6,8,10,14,16,18,20}
    D(1)     <d>:{2,7-10,12,17-20}

Figure 3.1: The ‘intermediate’ fragment in (A) was generated by the root operation. (B) gives the node annotations representing all possible frontier operations, where the total number of frontier operations possible is 20 and the fragments corresponding to each of these frontier operations have been allocated identifiers from the set of integers 1–20.
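The partitioned identifier sets of Figure 3.1(B) can be reproduced with a short sketch. It enumerates the fragments of one small intermediate tree explicitly, which the compact representation is designed to avoid at treebank scale, so it is meant only to make the annotation scheme concrete. The dictionary encoding, the names `TREE`, `ALWAYS_INTERNAL`, `internal_configs` and `annotate`, and the numbering order (partitions taken in the order ss, si, is, ii, left child varying slowest) are assumptions chosen so that the identifiers agree with the figure.

```python
# Hypothetical encoding of the ECNF tree in Figure 3.1: binary nodes map to
# (left, right); preterminals map to a 1-tuple holding their terminal.
TREE = {
    "A": ("B", "Bx"), "Bx": ("C", "D"), "C": ("E", "F"),
    "B": ("b",), "D": ("d",), "E": ("e",), "F": ("f",),
}
ALWAYS_INTERNAL = {"Bx"}   # nodes inserted by ECNF conversion are never substitution sites

def child_options(child):
    """A child is either a substitution site ('s') or internal ('i') in some configuration."""
    opts = [] if child in ALWAYS_INTERNAL else [("s", {})]
    return opts + [("i", cfg) for cfg in internal_configs(child)]

def internal_configs(node):
    """All fragments rooted at node; each maps an internal node to its children's states."""
    kids = TREE[node]
    if len(kids) == 1:                       # preterminal: the terminal is simply present
        return [{node: kids[0]}]
    left_opts, right_opts = child_options(kids[0]), child_options(kids[1])
    configs = []
    for want_l in "si":                      # partitions in the order ss, si, is, ii
        for want_r in "si":
            for l_state, l_cfg in left_opts:
                for r_state, r_cfg in right_opts:
                    if (l_state, r_state) == (want_l, want_r):
                        configs.append({node: l_state + r_state, **l_cfg, **r_cfg})
    return configs

def annotate(root):
    """Map node -> partition -> set of fragment identifiers (numbered from 1)."""
    annotations = {}
    for fid, cfg in enumerate(internal_configs(root), start=1):
        for node, state in cfg.items():
            annotations.setdefault(node, {}).setdefault(state, set()).add(fid)
    return annotations

ann = annotate("A")
print(sorted(ann["Bx"]["is"]))   # identifiers where C is internal and D is a substitution site
```

Empty partitions such as <Bs,Bxs> are simply absent from the resulting dictionaries.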

The process of building compact fragment representations is illustrated in Figure 3.1. Firstly, node A is selected by the root operation and all nodes not equal to or dominated by A are deleted; this yields the ‘intermediate’ fragment given in Figure 3.1(A). This intermediate fragment is converted to ECNF as described in section 2.4.4 through the insertion of the new node Bx, and the number of frontier operations which can be carried out at each of its nodes is calculated. For example, 10 different sets of frontier nodes can be selected at node Bx; note, however, that as Bx must always be an internal node and, therefore, never a substitution site, the number of different frontier node sets which can be selected at node A is 20.⁴ Fragment identifiers are then assigned to each node in the intermediate tree as shown in Figure 3.1(B): an identifier appears at a given node if both that node and all of its child nodes appear in the corresponding fragment. These sets of identifiers are further partitioned as described above. For example, the sets corresponding to node A indicate that node Bx is internal to all fragments 1–20 but node B is a substitution site in fragments 1–10 and internal to fragments 11–20. Similarly, the sets corresponding to node Bx indicate that nodes C and D are, for example, both substitution sites in fragments 1 and 11 (where node B is a substitution site in 1 and internal to 11) and both are internal to fragments 7–10 and 17–20 (where node B is a substitution site in 7–10 and internal to 17–20). As terminal symbols can only be frontier nodes, they are either present or absent in each fragment and, thus, no partitions are imposed on their fragment identifier sets. For example, terminal symbol b is present in all fragments to which its parent node B is internal, i.e. fragments 11–20.
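The node counts quoted above (4 at C, 10 at Bx, 20 at A) follow from a simple recurrence: a preterminal yields one subtree, and a binary node multiplies the options of its two children, where each child contributes its own count plus one extra option if it may be a substitution site. The sketch below is illustrative only; the dictionary encoding and the names `TREE` and `frontier_count` are assumptions made for the example.

```python
from math import prod

# Hypothetical encoding of the ECNF tree in Figure 3.1(A).
TREE = {
    "A": ("B", "Bx"), "Bx": ("C", "D"), "C": ("E", "F"),
    "B": ("b",), "D": ("d",), "E": ("e",), "F": ("f",),
}

def frontier_count(node, always_internal=frozenset({"Bx"})):
    """Number of distinct subtrees rooted at node under the frontier operation."""
    kids = TREE[node]
    if len(kids) == 1:                  # preterminal dominating a terminal
        return 1
    return prod(
        frontier_count(k, always_internal) + (0 if k in always_internal else 1)
        for k in kids
    )

print({n: frontier_count(n) for n in ("C", "Bx", "A")})
```

Calling `frontier_count("A", always_internal=frozenset())` gives 22, i.e. the two further fragments that would exist if node Bx were allowed to be a substitution site.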

This method of representing fragments allows us, when extracting the CFG underlying the treebank, to also extract for each rule the (partitioned sets of) identifiers of fragments in which that rule occurs. This annotated grammar can be used during the two-phase analysis process to (i) establish the CFG parse space for the input string and (ii) transition to the DOP parse space for that string as described in section 2.4.2. In addition, it allows us to read off the fragment corresponding to any identifier by simply checking for its absence or presence (as an internal node or substitution site) at each node in the intermediate tree.⁵ For example, consider the situation where we wish to extract the fragment whose identifier is 13. The sets corresponding to node A indicate that nodes B and Bx are both internal to fragment 13. Trivially, the annotation at node B indicates that the terminal b is a frontier node of fragment 13. The sets corresponding to node Bx indicate that while node C is internal to fragment 13, node D is a substitution site. Finally, the sets corresponding to node C tell us that nodes E and F are both substitution sites in fragment 13. Thus, fragment 13 corresponds to the fragment shown in (3.1):

    [A [B b] [Bx [C E F] D]]    (3.1)

⁴ If node Bx were allowed to be a substitution site, two further fragments would be possible; both these fragments would have node Bx as a substitution site, but one would also have node B as a substitution site whereas B would be internal to the other.

⁵ Sima’an’s two-phase parsing algorithm does not require this facility as fragments are rebuilt using the CFG rules which characterise them. However, this facet of the compact fragment representation process will prove important when tree-based representations encoding more information than simple context-free phrase-structure trees are considered. This issue is discussed further in sections 5.2.3, 7.5.2 and 8.3.
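Reading a fragment off the annotated intermediate tree, as in the walk-through for identifier 13, can be sketched as a top-down traversal that looks up which partition contains the identifier at each internal node. The identifier sets below are copied from Figure 3.1(B); the data layout, the helper `ids`, the function name `read_off` and the bracketed output format are assumptions made for this illustration.

```python
def ids(*ranges):
    """Expand inclusive (lo, hi) ranges into a set of identifiers."""
    return {i for lo, hi in ranges for i in range(lo, hi + 1)}

# node -> (left child, right child, {partition: identifiers}); 's' = substitution
# site, 'i' = internal, written left child first. Empty partitions are omitted.
BINARY = {
    "A":  ("B", "Bx", {"si": ids((1, 10)), "ii": ids((11, 20))}),
    "Bx": ("C", "D", {"ss": {1, 11}, "si": {2, 12},
                      "is": ids((3, 6), (13, 16)), "ii": ids((7, 10), (17, 20))}),
    "C":  ("E", "F", {"ss": {3, 7, 13, 17}, "si": {4, 8, 14, 18},
                      "is": {5, 9, 15, 19}, "ii": {6, 10, 16, 20}}),
}
PRETERMINAL = {"B": "b", "D": "d", "E": "e", "F": "f"}

def read_off(node, fid):
    """Return fragment fid, rooted at node, as a bracketed string."""
    if node in PRETERMINAL:                      # internal preterminal: terminal present
        return f"[{node} {PRETERMINAL[node]}]"
    left, right, partitions = BINARY[node]
    for states, members in partitions.items():
        if fid in members:
            parts = [read_off(child, fid) if state == "i" else child
                     for child, state in zip((left, right), states)]
            return f"[{node} {' '.join(parts)}]"
    raise ValueError(f"fragment {fid} does not contain node {node}")

print(read_off("A", 13))   # [A [B b] [Bx [C E F] D]]
```

A bare category label in the output marks a substitution site, so the printed string matches the fragment in (3.1).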

Calculating relative frequencies from compact fragment representations

Calculation of relative frequencies (and the removal of identifiers corresponding to duplicate fragments) over these compact fragment representations is straightforward. Two intermediate trees Ix and Iy encode duplicate DOP fragments if connected portions of those trees which start at their root nodes are identical. Minimally, these connected portions must comprise the intermediate tree root nodes and their daughter nodes. Additionally, for two minimal portions to be identical, all node categories must be the same and, in the case of the daughter nodes, appear in the same order. In example (3.2), we see that intermediate trees I1 and I2 have the same minimal portions, i.e. their root nodes are of the same category and the children of those root nodes are of the same categories and in the same order. In contrast, the minimal portion of tree I3 has the same root node category and daughter node categories as I1 and I2 but those daughter node categories do not match with respect to order and so the depth 1 fragment extracted from I3 is not the same as the depth 1 fragment extracted from both I1 and I2.

    I1:  A            I2:  A                  I3:  A
         B C D             B C D                   C B D
         b c d             E F G H I J             c b d
                           e f g h i j                        (3.2)

Extending the portions of intermediate trees which yield identical DOP fragments is a recursive process: for each node already in the identical portion of each tree, we simply check whether its daughter nodes match those of the corresponding node in the intermediate tree to which it is being compared. In example (3.2), the identical portions of I1 and I2 can be extended no further as none of the daughter nodes of B, C and D in I1 match the corresponding daughter nodes in I2.⁶ In contrast, in example (3.3) we see that intermediate trees Ix and Iy have root node category A, and A’s daughter nodes in both trees are (from left to right) B and C. Accordingly, those depth 1 fragments with root node A and substitution sites B and C are duplicates of each other. In addition, the identical portions of Ix and Iy can be extended to include nodes D and E as the daughter nodes of C also correspond. However, the identical portions can be extended no further as the children of D and E do not correspond.

    Ix: [A [B b] [C [D [F f] [G g]] [E e]]]
    Iy: [A [B b] [C [D d] [E [H h] [I i]]]]    (3.3)
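The recursive extension of identical portions can be sketched as below. This is an illustration under stated assumptions, not the thesis code: trees are encoded as `(label, children)` tuples, the names `node`, `labels` and `identical_portion` are hypothetical, and the nested structure given to Ix and Iy is one reading of example (3.3) in which D dominates F and G in Ix and E dominates H and I in Iy.

```python
def node(label, *children):
    return (label, list(children))

def labels(children):
    return [child[0] for child in children]

def identical_portion(x, y):
    """Return (included, boundary) node labels, or None if the minimal portions differ."""
    if x[0] != y[0] or not x[1] or labels(x[1]) != labels(y[1]):
        return None                       # roots or ordered daughters do not match
    included, boundary = [x[0]], []

    def extend(cx, cy):
        included.append(cx[0])
        if not cx[1] and not cy[1]:
            return                        # terminal in both trees
        if labels(cx[1]) == labels(cy[1]):
            for kx, ky in zip(cx[1], cy[1]):
                extend(kx, ky)
        else:
            boundary.append(cx[0])        # included, but its children are not

    for kx, ky in zip(x[1], y[1]):
        extend(kx, ky)
    return included, boundary

Ix = node("A", node("B", node("b")),
          node("C", node("D", node("F", node("f")), node("G", node("g"))),
                    node("E", node("e"))))
Iy = node("A", node("B", node("b")),
          node("C", node("D", node("d")),
                    node("E", node("H", node("h")), node("I", node("i")))))
I1 = node("A", node("B", node("b")), node("C", node("c")), node("D", node("d")))
I3 = node("A", node("C", node("c")), node("B", node("b")), node("D", node("d")))

print(identical_portion(Ix, Iy))
print(identical_portion(I1, I3))
```

For Ix and Iy the portion stops at D and E, which become the boundary nodes; for I1 and I3 the daughter order differs, so not even the depth 1 fragments coincide.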

Once we have identified the tree nodes included in the identical tree portions, we have established exactly which fragments are duplicates of each other: all boundary identical nodes (i.e. those nodes which are included in the identical tree portions but whose children are not) are either substitution sites of those fragments which are duplicates, or not contained in duplicate fragments. When we have identified these fragments, we simply increment their counts in one intermediate tree and delete their identifiers from the other.
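The final bookkeeping step can be sketched as follows. The section does not spell out the frequency formula, so this assumes the usual DOP relative-frequency estimate (a fragment's count divided by the total count of fragments sharing its root category). The identifiers, the duplicate pair and the names `merge_duplicates` and `relative_frequencies` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def merge_duplicates(counts, duplicate_pairs):
    """counts maps (tree, fid) -> count; each pair is (kept, dropped).
    The kept fragment absorbs the count and the dropped identifier is deleted."""
    for kept, dropped in duplicate_pairs:
        counts[kept] += counts.pop(dropped)

def relative_frequencies(counts, root_of):
    """Assumed estimate: count divided by the total count for the same root category."""
    totals = defaultdict(int)
    for key, count in counts.items():
        totals[root_of[key]] += count
    return {key: count / totals[root_of[key]] for key, count in counts.items()}

# Hypothetical data: two fragments from each of two intermediate trees, all rooted in A.
counts = {("Ix", 1): 1, ("Ix", 2): 1, ("Iy", 1): 1, ("Iy", 2): 1}
root_of = {key: "A" for key in counts}

merge_duplicates(counts, [(("Ix", 1), ("Iy", 1))])
freqs = relative_frequencies(counts, root_of)
print(freqs)
```

After merging, the duplicated fragment carries a count of 2 out of a total of 4 for root category A, and the deleted identifier no longer appears.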
