

The formalisation presented in section 5.1 above prescribed the construction of feature vectors from a concatenation of values corresponding to a set of chosen features. It was also noted that the set of features comprises one or more groupings (subsets) of particular types of feature. A number of different types of feature can be included in the representation. The categories considered in this thesis are as follows:

Hyper links: Hyper links are an obvious candidate for inclusion in any feature space describing a collection of WWW pages. The theory is that related web pages may share many of the same hyper links. The shared links may point to other pages in the same website (e.g. the website home page) or to significant external pages (most links point to a related set of pages with some common topic). The hyper link based features were constructed by extracting all of the hyper links from the collection of web pages. Each hyper link was then considered to be an individual binary-valued feature.
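The construction of binary-valued hyper link features described above can be illustrated with a minimal sketch. It assumes the pages are available as HTML strings keyed by a page identifier and uses Python's standard `html.parser` module; the names `LinkExtractor`, `extract_links` and `binary_link_features` are hypothetical and are not taken from the thesis. For simplicity the sketch treats every `href` value as a hyper link and does not separate out mailto or page anchor links.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in one page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    """Return the set of hyper links found in a single HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def binary_link_features(pages):
    """pages: dict mapping page id -> HTML string (an assumed input format).

    Returns the sorted list of all links in the collection and, for each
    page, a 0/1 vector with one dimension per link.
    """
    links_per_page = {pid: extract_links(html) for pid, html in pages.items()}
    vocabulary = sorted(set().union(*links_per_page.values()))
    vectors = {
        pid: [1 if link in links else 0 for link in vocabulary]
        for pid, links in links_per_page.items()
    }
    return vocabulary, vectors
```

Two pages that both link to the same home page would then share a 1 in the dimension for that link, which is the kind of overlap the hyper link feature group is intended to capture.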

Image links: The use of image links was prompted by the observation that web pages that link to the same images were likely to be related; for example, a common set of logos or navigation images could imply a relationship. The image links were processed in a similar fashion to the hyper links, as described above.

Mailto links: If a set of web pages includes identical mail links, this might indicate a relationship between these web pages. The links were extracted from the HTML code using the same method as described above, but by searching for the mailto tags.

Page anchor links: Page anchors are used to navigate to certain places on the same page. They can be helpful for a user and very often have meaningful names. It was conjectured that if the same or related names are used on a set of web pages, this could imply related content. The page anchor links were extracted by parsing the HTML code as above and identifying the number of possible occurrences.

Resource links: The motivation for using resource links was that the styling of a page is often controlled by a common Cascading Style Sheet (CSS), which could imply that a collection of pages using the same style sheet are related. In this case the feature subset was obtained by extracting the appropriate resource links from the HTML code.

Script links: It was observed that some scripting functions used in web pages can be written using some form of common script file; if pages have common script links then they could be related. The script links vector was constructed by extracting all of the script links (for example JavaScript links) from each of the pages in W.

Title text: It is conjectured that the title text used within a collection of web pages belonging to a common web site is a good indicator of “relatedness”. The title group of features was constructed by extracting the title from each of the given web pages. The individual words in each title were then processed to produce a “bag of words” (a common representation used in text mining). Each word represented a binary-valued dimension in the feature space. Note that when the textual information was extracted from the title tag, non-textual characters were removed, along with words contained in a standard “stop list”. This produced a group of feature values comprising only what were deemed to be the most significant title words.
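The title processing described above can be sketched as follows. This is only an illustration under stated assumptions: the stop list shown is a small placeholder (the thesis does not list the words in its standard stop list), lower-casing of words is an added assumption, and the names `TitleExtractor`, `title_words` and `title_vectors` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Placeholder stop list for illustration only; the actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on"}


class TitleExtractor(HTMLParser):
    """Accumulates the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_words(html):
    """Return the set of significant title words for one page."""
    parser = TitleExtractor()
    parser.feed(html)
    # Keep alphabetic runs only, which drops non-textual characters.
    tokens = re.findall(r"[a-z]+", parser.title.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def title_vectors(pages):
    """pages: dict mapping page id -> HTML string (an assumed input format).

    Returns the sorted word list and a 0/1 vector per page, one dimension
    per title word.
    """
    words_per_page = {pid: title_words(html) for pid, html in pages.items()}
    vocabulary = sorted(set().union(*words_per_page.values()))
    vectors = {
        pid: [1 if word in words else 0 for word in vocabulary]
        for pid, words in words_per_page.items()
    }
    return vocabulary, vectors
```

The same tokenisation and stop word removal could be applied to extracted body text to obtain the body text feature group.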

Body text: Another key indicator of web page “relatedness” was considered to be content, as reflected by the text contained in WWW pages. Textual content was extracted from each web page using an HTML text parser/extractor (http://htmlparser.sourceforge.net/). This type of tool extracts the text that a user would use to judge a page's topic or subject. Stop words (the same list as used to process the title text, as described above) were then removed and a bag of words produced, similar to that used in the case of the title feature subset.

URL: Web page URLs are likely to be an important factor in establishing whether subsets of web pages are related or not. However, URLs should not be considered to be the sole indicator for establishing a web site boundary (referring to URL filtering methods, see section 2.2.1). In each case the page URL was split into “words” using the standard delimiters found in URLs. For example, the URL http://news.bbc.co.uk would produce the subset of features {news, bbc, co, uk}. Non-textual characters were removed (no stop word removal was undertaken).

The individual feature subsets listed above could be used, either in isolation or in various combinations, to produce a feature space which in turn could be used to describe individual web pages. It should be noted that the combination of, for example, the hyper link and image link features could create a very large vector space in which potentially useful information could be swamped. Only certain elements of these features may provide useful information; a hyper link to the home page, and image links contained on the home page, could be the most important sub-features of this concatenation. Although the further processing of sub-elements of features could be explored with the use of feature extraction techniques, such techniques are beyond the remit of this thesis.
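The URL splitting and the concatenation of feature groups described above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the exact delimiter set is not specified in the text, so the sketch assumes that the scheme is dropped (consistent with the http://news.bbc.co.uk example) and that every non-alphabetic character acts as a delimiter. Binary values for the URL words are also an assumption, and the names `url_words` and `build_feature_space` are hypothetical.

```python
import re


def url_words(url):
    """Split a URL into lower-case alphabetic 'words'.

    Assumption: the scheme (e.g. 'http://') is dropped and any
    non-alphabetic character is treated as a delimiter.
    """
    without_scheme = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
    return [t.lower() for t in re.split(r"[^A-Za-z]+", without_scheme) if t]


def build_feature_space(page_groups, selected):
    """Concatenate the selected feature groups into one binary vector per page.

    page_groups: page id -> {group name -> set of feature values}
    selected:    group names to concatenate, in the order given
    Returns the list of (group, value) dimensions and the page vectors.
    """
    dimensions = []
    for group in selected:
        values = set()
        for groups in page_groups.values():
            values |= set(groups.get(group, ()))
        dimensions.extend((group, value) for value in sorted(values))
    vectors = {
        pid: [1 if value in groups.get(group, ()) else 0
              for group, value in dimensions]
        for pid, groups in page_groups.items()
    }
    return dimensions, vectors
```

Because each dimension is tagged with its group name, the same word occurring in a URL and in a title is kept as two separate features. The length of every vector equals the total number of distinct values across the selected groups, which shows how combining large groups such as hyper links and image links can quickly inflate the vector space.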
