ANALYSIS OF RESULTS - THE IMPLEMENTATION REDUCTION

3. ANALYSIS OF RESULTS

Learning resources often contain text, and this text is often unstructured, so finding and retrieving relevant documents can be challenging. Document retrieval usually relies on the content of a document to predict its relevance. One way of addressing this challenge is to create suitable representations of the text that enable the retrieval of relevant documents. Some commonly used approaches to representing text are presented here.

Bag of Words representation

Bag of Words (BOW) refers to a collection of the words extracted from a document. The order in which the words appear is not considered. The Vector Space Model (VSM) can be applied to the extracted words to create a feature vector for the document, where each word is a feature that describes the document, and the value of the word is its weight. There are three commonly used weighting schemes in the VSM. First, the binary value, which captures the presence or absence of a word as 1 or 0 respectively. Second, the Term Frequency (TF), which captures how many times a term appears in a document. Finally, the Term Frequency - Inverse Document Frequency (TF-IDF) weight, which combines TF and IDF (Sparck Jones 1972). The IDF is derived from the number of documents in a collection that contain the given word: the more documents contain the word, the lower its IDF. The IDF thus captures the importance of a word in a document collection by reducing the weight of common terms and emphasising the weight of rare terms. This is why TF-IDF weighting is used in many Information Retrieval systems.
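The three weighting schemes can be illustrated with a minimal sketch. The function name build_vectors, the toy documents, and the IDF variant log(N / df) are assumptions made for this illustration; practical systems often use smoothed variants of IDF.

```python
import math
from collections import Counter

def build_vectors(docs):
    """Return the vocabulary plus binary, TF and TF-IDF vectors.

    `docs` is a list of already tokenised documents (lists of words).
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    # Assumed textbook IDF variant: log(N / df).
    idf = {w: math.log(n_docs / df[w]) for w in vocab}

    binary, tf, tfidf = [], [], []
    for doc in docs:
        counts = Counter(doc)
        binary.append([1 if counts[w] > 0 else 0 for w in vocab])
        tf.append([counts[w] for w in vocab])
        tfidf.append([counts[w] * idf[w] for w in vocab])
    return vocab, binary, tf, tfidf

# Toy collection, invented for illustration.
docs = [
    ["cat", "sat", "down"],
    ["cat", "cat", "ran"],
    ["dog", "sat"],
]
vocab, binary, tf, tfidf = build_vectors(docs)
```

In the second document, "cat" occurs twice and "ran" once, yet "ran" receives the higher TF-IDF weight because it appears in only one document of the collection.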

A key step before creating a representation for text is pre-processing, which prepares a document for indexing. Pre-processing often involves stages such as stop word removal, stemming or lemmatisation, and tokenisation. Stopwords are common words such as “a”, “is” and “the” that are found across all documents but contribute little information, so they are often removed during pre-processing. The English stopwords and SMART stopwords (Salton 1971) are two sets of stopwords that are often used.

Stemming and lemmatisation aim to reduce words to their base form. Stemming takes a harsh approach to this task, while lemmatisation tries to ensure that its output is still a meaningful word in its dictionary form. A commonly used stemming algorithm is the Porter stemmer (Porter 1980), and a commonly used lemmatiser is the WordNet Lemmatiser. Applying Porter stemming to the word “organising” produces “organis”, while applying the WordNet Lemmatiser produces “organise”. Stemming is often harmful for precision but can increase recall. Manning, Raghavan & Schütze (2008) suggest that the advantage of performing lemmatisation is minimal for retrieval. However, the choice of technique depends on the task being performed. Tokenisation entails splitting a sequence of text from a document into individual words referred to as tokens. For example, the sentence “cat sat down” is split into the three tokens “cat”, “sat” and “down”. The tokens can then be used for indexing the document.
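Tokenisation and stopword removal can be sketched in a few lines. The stopword set here is a deliberately tiny, hypothetical stand-in for a real list such as the SMART stopwords, and the function names tokenise and remove_stopwords are assumptions of this sketch.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "is", "the"}

def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = tokenise("The cat sat down")
content_tokens = remove_stopwords(tokens)
```

Stemming or lemmatisation would normally follow as a further stage on content_tokens. It is omitted here to keep the sketch free of external dependencies.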

Apache Lucene (Hatcher, Gospodnetic & McCandless 2004) is a commonly used framework for indexing text documents. A more recent framework developed using Lucene is Elasticsearch (Kuć & Rogoziński 2015). The relevance scoring for matching documents is computed by applying Lucene’s practical scoring function, which uses ideas from the Boolean model, VSM and TF-IDF weighting. Lucene’s scoring function as implemented in Elasticsearch is shown in Equation 2.1. Given a query q, the relevance score of a document d to the query is given as:

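\[
\mathrm{Score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(d) \tag{2.1}
\]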

The following aspects can be used to describe the scoring function:

Term importance: tf(t, d) is the term frequency, measuring how often a term t occurs in a document d. Raw tf tends to give a lot of importance to long documents, so in Elasticsearch the square root of tf is taken as a scaling measure to cope with this effect. idf(t) is the inverse document frequency, reflecting how common term t is across the document collection. The weight of a frequently occurring term is reduced using idf. The idf for a term is given as the logarithm of the number of documents in the collection divided by the number of documents the term occurs in.

Overlap of terms: coord(q, d) is the coordination factor, based on the number of terms from the query q that appear in a document d. Documents containing a higher percentage of the query terms are rewarded, as such documents have a higher chance of being a good match for the query.

Weighting: boost(t) is the boost factor used to increase the importance of each term t in a field. queryNorm(q) is a factor used for normalizing the query q. norm(d) is the field-length normalization factor, based on the number of terms in a field of a document d. The norm considers the length of the fields in a document, so shorter fields such as “title” are given higher weights than longer fields such as “description”. The norm is computed as the inverse square root of the number of terms in the field. The position of a term does not affect the computation of the norm.

The function is designed so that using the default values of the parameters is still effective for scoring.
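The aspects described above can be put together in a simplified sketch of Equation 2.1. This is not Lucene's or Elasticsearch's actual code: it treats each document as a single field of tokens, assumes the query terms are distinct, and assumes queryNorm is the inverse square root of the sum of squared idf values, since the text does not define that factor. The function name score and the toy collection are also assumptions.

```python
import math

def score(query, doc, collection, boost=None):
    """Simplified sketch of Equation 2.1 for single-field documents.

    `query` is a list of distinct terms; `doc` and every item of
    `collection` are lists of tokens.
    """
    boost = boost or {}
    n_docs = len(collection)

    def idf(term):
        # log(number of documents / documents containing the term).
        df = sum(1 for other in collection if term in other)
        return math.log(n_docs / df) if df else 0.0

    matched = [t for t in query if t in doc]
    if not matched:
        return 0.0

    # coord: fraction of the query terms found in the document.
    coord = len(matched) / len(query)
    # queryNorm: ASSUMED here to be 1 / sqrt(sum of squared idf values).
    sum_sq = sum(idf(t) ** 2 for t in query)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq else 1.0
    # norm: inverse square root of the number of terms in the field.
    norm = 1.0 / math.sqrt(len(doc))

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc.count(t))  # square root dampens raw counts
        total += tf * idf(t) ** 2 * boost.get(t, 1.0) * norm
    return coord * query_norm * total

# Toy collection, invented for illustration.
collection = [
    ["cat", "sat", "down"],
    ["cat", "cat", "ran"],
    ["dog", "sat"],
]
scores = [score(["cat", "ran"], d, collection) for d in collection]
```

With the query terms “cat” and “ran”, the second document ranks highest because it contains both terms, the first document matches only “cat”, and the third matches neither and scores zero.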
