The main aim of this experimental system is to perform automatic concept classification of documents and to compare the results of this system with the manual classification system.

A test collection consisting of 54 documents, ranging over a wide variety of topics, is used for the experiment. A list of over 3200 keywords, taken from the Licensing Support System Thesaurus [10], served as a dictionary for this experiment. Additionally, two subject experts¹ served as a source of input for this experiment.

The experiment mainly deals with the generation of document vectors, the generation of concept vectors, computing the similarity between these vectors, classifying the documents under the concepts, and comparing this automatic document classification with the manual classification for performance evaluation. In the following sections each of these topics is described in more detail.

3.1.1 Document Vector Generation

In a vector space a document is viewed as a collection of term vectors. The terms which describe the document content are known as keywords or lead-terms. In order to find a list of terms that represent a document vector, the natural language text of each document is analyzed for their occurrence. There are basically two types of terms: stop words and go words. Stop words such as 'and', 'not', 'of', 'but' have a high frequency of occurrence in the documents. On the other hand, the go words that actually represent document content occur with varying frequencies in the document. In fact, the frequency of occurrence is used to assign weights to the keywords. The process of document vector generation can be best described as follows:

• the document text is read for content analysis and to find the occurrence of the 3200 lead-terms in the text.

• the stop words are eliminated by consulting a list of stop words².

• each of the remaining go words is reduced to its word stem in order to calculate the correct frequency of occurrence of each term. Reducing the words to their word stems reduces the anomalies that occur due to the various forms in which a word appears in the document. For example, the lead-term shaft and the term shafts in the document do not match exactly. The stemming algorithm is adapted from Chris Paice [7]. This algorithm is iterative in nature and uses a table of rules that specify what has to be done if the word ends in a particular form. The rules are grouped into sections depending on the final letter of the suffix. This makes the table search much easier because a rule can be accessed by looking at the final letter of the word. A typical rule in the table would look like "sei3y", which implies that if a word ends in "-ies" then the last three letters are replaced by "-y". For example, supplies is changed to supply. The algorithm for the stemmer is given below:

1. inspect the final letter of the form; consider the relevant section and select the first rule; if no section corresponds to that letter then terminate.

2. if the final letters of the form do not match the reversed ending in the rule then go to 4; if the word is not intact go to 4; if the word does not satisfy a set of predefined acceptability conditions then go to 4.

3. delete from the right end of the form the number of characters specified in the rule; if the rule's continuation string indicates termination then terminate; otherwise go to 1.

4. move to the next rule in the table; if the section letter has changed then terminate; otherwise go to 2.

• the term frequency ($freq_{ij}$) of a keyword j in a document i is calculated.

• the document frequency ($docfreq_j$) of each term j is calculated, that is, the number of documents in which it occurs. The process of calculating the term frequency in a document and in the whole collection can be described as follows:

- define a block as the number of lines contained in the document when computing $freq_{ij}$.

- read each line from the document text, delete the stop words and reduce the go words to their word stems using the stemming algorithm.

- for each word in the list of lead-terms, match it with the stemmed word. For all matches keep track of the count in a counter.

- if a particular matched stemmed word reappears, either in the same line or in a different line, increment the count in its counter.

- if the block size is declared as the number of lines in a document then output $freq_{ij}$, which is the value contained in the counter of each matched stemmed word.

- if the block size is declared as the number of lines of all the documents in the collection then output $docfreq_j$, which is the value contained in the counter for each matched stemmed word.

• using the collection weighting algorithm, weights are assigned to the keywords identifying each document. That is, the weight $wt_{ij}$ for a term j in a document i is calculated as

$$wt_{ij} = freq_{ij} \cdot [\log(N) - \log(docfreq_j) + 1]$$

where $N$ is the number of documents in the collection.

• The weight values generated by using the above formula are between 0 and 1. The document vectors are binarized by using a threshold value. The threshold considered here is the average of all the nonzero weights of the terms. Anything above this threshold value is assigned a value of 1 and anything below is assigned a value of 0 in the document vector. Binarization of document vectors is done mainly because the concept vectors generated, which will be discussed in the next section, are also binary.
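The weighting and binarization steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the use of the natural logarithm, and the sample counts are not from the text, and the sketch does not rescale weights into the 0 to 1 range mentioned above. Weights exactly equal to the threshold are mapped to 0, which is also an assumption.

```python
import math

def collection_weights(freq, doc_freq, n_docs):
    """wt_ij = freq_ij * [log(N) - log(docfreq_j) + 1] for one document i.

    freq:     {term: occurrences of the term in this document}
    doc_freq: {term: number of documents containing the term}
    n_docs:   N, the number of documents in the collection
    The log base is an assumption (natural log is used here).
    """
    return {
        term: f * (math.log(n_docs) - math.log(doc_freq[term]) + 1)
        for term, f in freq.items()
        if f > 0
    }

def binarize(weights, lead_terms):
    """Threshold at the mean of the nonzero weights: 1 above it, 0 otherwise."""
    nonzero = [w for w in weights.values() if w > 0]
    threshold = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return [1 if weights.get(term, 0.0) > threshold else 0 for term in lead_terms]

# Hypothetical stemmed lead-terms and counts for a 54-document collection.
lead_terms = ["shaft", "pollut", "wast"]
freq = {"shaft": 3, "wast": 1}
doc_freq = {"shaft": 2, "pollut": 10, "wast": 40}

weights = collection_weights(freq, doc_freq, n_docs=54)
print(binarize(weights, lead_terms))  # [1, 0, 0]
```

In this toy case the rare, repeated term receives a weight well above the mean and is kept, while the common term falls below the threshold and is dropped.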

A typical document vector generated using the above procedure looks like

$Doc_i = (0, 1, 0, \ldots, 0, 1)$

The length of each document vector is over 3200 because the list of lead-terms taken from [10] contains over 3200 terms. A total of 54 document vectors are generated for this test collection.
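The four-step, rule-table stemming procedure described earlier in this section can be sketched as below. This is a toy illustration, not the stemmer of [7]: the three-rule table, the tuple layout of a rule, the `acceptable` test, and the reading of "intact" as a per-rule flag are all assumptions made for the example.

```python
VOWELS = set("aeiou")

# Toy rule table keyed by the final letter of the ending (the "section").
# Each rule: (reversed_ending, intact_only, remove_count, append, continue_flag).
# The rule "sei3y" from the text corresponds to ("sei", False, 3, "y", True).
RULES = {
    "s": [
        ("sei", False, 3, "y", True),  # -ies -> -y
        ("s", False, 1, "", True),     # -s   -> removed
    ],
    "g": [
        ("gni", False, 3, "", True),   # -ing -> removed
    ],
}

def acceptable(candidate):
    """Assumed acceptability test that keeps stems from getting too short."""
    if not candidate:
        return False
    if candidate[0] in VOWELS:
        return len(candidate) >= 2
    return len(candidate) >= 3 and any(c in VOWELS or c == "y" for c in candidate)

def stem_word(word, rules=RULES):
    form = word.lower()
    intact = True  # becomes False once any rule has been applied
    while True:
        # Step 1: pick the section for the final letter, or terminate.
        section = rules.get(form[-1]) if form else None
        if not section:
            return form
        # Steps 2 and 4: try each rule of the section in order.
        for reversed_ending, intact_only, remove, append, cont in section:
            if not form.endswith(reversed_ending[::-1]):
                continue
            if intact_only and not intact:
                continue
            candidate = form[: len(form) - remove] + append
            if not acceptable(candidate):
                continue
            # Step 3: apply the rule, then terminate or start again.
            form, intact = candidate, False
            if not cont:
                return form
            break
        else:
            return form  # no rule in this section applied

print(stem_word("supplies"))  # supply
print(stem_word("shafts"))    # shaft
```

With this table the document term shafts reduces to shaft and therefore matches the lead-term, while a short word such as gas is left unchanged because the candidate stem fails the acceptability test.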

3.1.2 Concept Vector Generation

The list of lead-terms contains over 3200 terms. Associated with each term there is a broader term, a narrower term, a related term, a used-for term and a category field under which the term falls. A typical excerpt from the list of lead-terms would look like:

Air Pollution
    BT  Pollution
    NT  Air Emission
        Air Pollution Control
        Air Releases
    RT  Air Quality
        Clean Air Act
        Gaseous Waste
        Particulates
        Plumes
        Pollutants
    CA  Atmosphere
Air Quality
    UF  Radiological Air Quality
    BT  Air
    RT  Air Pollution
        Clean Air Act
    CA  Atmosphere

The acronyms BT, NT, RT, UF and CA stand for broader term, narrower term, related term, used-for term and category, respectively. The category field (CA) is the predefined concept into which the lead-term is categorized. There are 16 concepts under which all the 3200 lead-terms appear. These 16 concepts are given below:

• atmosphere
• biology
• engineering
• equipment
• geography
• geology
• government
• hydrology
• management
• materials
• modeling
• processes-properties
• safety
• socio-economic-factors
• transportation
• waste

Concept files are built by accumulating all the lead-terms that appear under each concept. Sixteen such files are built by pulling out the terms from the category field and grouping them into a concept file. Each of the concept files built in this fashion has a varying number of lead-terms associated with it.

The process of generating the concept vectors may be best described as follows: for each lead-term in the list of lead-terms, a search is made for its presence or absence in the concept file. A zero or a one is placed in the concept vector depending on whether the lead-term is absent from or present in the concept file. By repeating the search process for all the terms in the list we obtain a binary vector containing 0's and 1's. The size of this vector is equal to the number of lead-terms in the list. That is, the size of the concept vector generated is over 3200. A typical concept vector would look like

$Con_i = (0, 1, 0, \ldots, 1, 0)$

A total of 16 concept vectors are generated for the 16 predefined concepts.
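As a rough sketch of the procedure above, the concept files can be modeled as sets of lead-terms grouped by their CA field, and each concept vector as a presence/absence list over the full lead-term list. The dictionary layout, the function names, and the "Shaft" entry with its category are hypothetical and only serve the example; the two air-related entries follow the excerpt shown earlier.

```python
def build_concept_files(thesaurus):
    """Group lead-terms by their CA (category) field, one set per concept."""
    concept_files = {}
    for lead_term, fields in thesaurus.items():
        concept = fields["CA"].lower()
        concept_files.setdefault(concept, set()).add(lead_term)
    return concept_files

def concept_vector(concept, concept_files, lead_terms):
    """1 if the lead-term appears in the concept file, else 0."""
    members = concept_files.get(concept, set())
    return [1 if term in members else 0 for term in lead_terms]

# Tiny stand-in for the 3200-term list; "Shaft" and its category are made up.
thesaurus = {
    "Air Pollution": {"BT": ["Pollution"], "CA": "Atmosphere"},
    "Air Quality": {"BT": ["Air"], "CA": "Atmosphere"},
    "Shaft": {"CA": "Engineering"},
}
lead_terms = sorted(thesaurus)  # fixed term order shared by all vectors

concept_files = build_concept_files(thesaurus)
print(concept_vector("atmosphere", concept_files, lead_terms))   # [1, 1, 0]
print(concept_vector("engineering", concept_files, lead_terms))  # [0, 0, 1]
```

Using one fixed ordering of the lead-terms for every vector is what makes the concept vectors directly comparable, position by position, with the document vectors.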

3.1.3 Selection of a Similarity Coefficient

A similarity coefficient is a resemblance coefficient for which the larger the value, the more similar the two objects being compared are. A dissimilarity coefficient is also a resemblance coefficient, for which the smaller the value, the more similar the two objects are. The objects in this case are document and concept vectors.

A dissimilarity coefficient is selected for this experiment. If the dissimilarity is considered as the distance between objects to be clustered then it satisfies the Euclidean properties: the distance between two objects should be greater than zero, the distance between an object and itself should be equal to zero, and the distance between object A and object B should be the same as the distance between object B and object A. The dissimilarity coefficient used in the experiment satisfies all the properties mentioned above, is closely related to the Dice coefficient, and is monotone with respect to (1 - Jaccard coefficient). If the concept vector is represented as $Con_i = (x_1, x_2, x_3, \ldots, x_n)$ and the document vector as $Doc_i = (y_1, y_2, y_3, \ldots, y_n)$ then the dissimilarity coefficient ($DC$) for $Con_i$ and $Doc_i$ can be computed as

$$DC = \frac{\sum x_i (1 - y_i) + \sum y_i (1 - x_i)}{\sum x_i + \sum y_i}$$

The dissimilarity coefficient value obtained by using the above formula can be used when the vectors considered are binary, and it can be converted into a similarity coefficient ($SC$).

Using the similarity coefficient, the similarity values are calculated between each document vector and the 16 concept vectors.
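A minimal sketch of the coefficient computation for binary vectors is given below. The expression for $DC$ follows the formula for concept and document vectors above; the conversion $SC = 1 - DC$ is an assumption made here for illustration (for binary vectors it equals the Dice coefficient), as are the function names and the handling of two all-zero vectors.

```python
def dissimilarity(con, doc):
    """DC = [sum x_i(1 - y_i) + sum y_i(1 - x_i)] / [sum x_i + sum y_i]."""
    if len(con) != len(doc):
        raise ValueError("vectors must have the same length")
    mismatches = sum(x * (1 - y) + y * (1 - x) for x, y in zip(con, doc))
    total = sum(con) + sum(doc)
    if total == 0:
        # Both vectors are all zeros, so DC is undefined; treating the pair
        # as maximally dissimilar is an assumption of this sketch.
        return 1.0
    return mismatches / total

def similarity(con, doc):
    """Assumed conversion: SC = 1 - DC."""
    return 1.0 - dissimilarity(con, doc)

con = [0, 1, 0, 1, 0]  # concept vector with two lead-terms
doc = [0, 1, 1, 1, 0]  # document vector with three lead-terms
print(dissimilarity(con, doc))  # one mismatch over five set bits
print(similarity(con, doc))
```

In this example the vectors share two terms and differ in one, so $DC = 1/5$ and, under the assumed conversion, $SC = 4/5$, which matches the Dice value $2 \cdot 2 / (2 + 3)$.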

3.1.4 Classification Phase

In this section the two classification schemes, automatic classification and manual classification, are described. This phase is the most important part of the experiment because the performance of the system depends on it. Automatic classification is an important notion by which logically related documents can be grouped together so that they can be retrieved efficiently.

Automatic Concept Classification of Documents

The process of automatic concept classification of documents is described in this section. From the previous section we have the similarity coefficient values between each document and the 16 concepts. In this classification scheme, the concepts are divided into three groups: strong, related and weak. The three concept groups are generated for each document based on the similarity values between the document and the concepts. A strong concept group identifies all those concepts, out of the 16 available concepts, which are strongly related to a particular document. Similarly, the related and weak concept groups identify the related and weak concepts for that document. The process of categorizing the 16 concepts into 3 groups can be best described by the following algorithm:

Algorithm to classify the concepts into STRONG, RELATED, and WEAK groups

1. For each document, sort in descending order the 16 similarity coefficient values obtained by comparing it with the 16 concepts.

2. Find the pairwise differences between adjacent sorted similarity values.

3. Pick the two highest difference values.

(a) Group the concepts above the first difference value into strong concepts.

(b) Group the concepts between the first and second difference values into related