From Law 1448 of 1997 to Law 1421 of 2010


Named Entity Recognition (NER) is a popular information extraction approach that extracts predefined categories of real-world entities from text, such as the names or titles of persons, locations, and organisations. Identifying the entities in a sentence or phrase helps readers understand the topic the author is discussing.

NER follows two main approaches: rule-based and statistical. The rule-based approach applies hand-written, grammar-based linguistic rules supplied by experts or linguists. The statistical approach applies Machine Learning (ML) techniques to extract named entities. The following subsections review previous work on both approaches.

Rule-based Named Entity Recognition

Rule-based Named Entity Recognition is also called the linguistic approach, because it applies manually written linguistic rules to extract named entities. The rule-based approach was traditionally proposed to obtain higher prediction accuracy. Grishman developed one of the first successful rule-based NER systems in 1995. The system relies on a predefined named-entity dictionary, which includes the titles or names of persons, cities, countries, organizations, and places. The rules in this system match those predefined named entities against the text. Various rule-based NER systems were developed and used for almost 20 years. However, rule-based entity recognition has serious disadvantages. Most rule-based NER systems require a manually written named-entity dictionary compiled by human experts (for this kind of system, linguists). Finding highly educated and experienced linguists and covering the associated cost is very difficult. Moreover, it is almost impossible for one or two linguists to define all the required grammatical knowledge of a language and convert it into a computational form.
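The dictionary-driven matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reconstruction of Grishman's system: the `GAZETTEER` contents, the `rule_based_ner` function, and the longest-match rule are all hypothetical choices made for the example.

```python
import re

# Hypothetical mini-dictionary; a real system would need far larger,
# expert-curated lists, which is exactly the cost discussed above.
GAZETTEER = {
    "PERSON": ["Ralph Grishman"],
    "LOCATION": ["New York", "Sydney"],
    "ORGANIZATION": ["New York University"],
}

def rule_based_ner(text):
    """Return (start, end, label, surface) tuples for dictionary matches."""
    candidates = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                candidates.append((m.start(), m.end(), label, m.group()))
    # Assumed rule: prefer the longest match, so that
    # "New York University" wins over the overlapping "New York".
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for cand in candidates:
        if all(cand[1] <= c[0] or cand[0] >= c[1] for c in chosen):
            chosen.append(cand)
    return sorted(chosen)

entities = rule_based_ner(
    "Ralph Grishman worked at New York University in New York."
)
for start, end, label, surface in entities:
    print(label, surface)
```

Any name missing from the dictionary is silently skipped, which illustrates why coverage depends entirely on the manual effort invested in the lists.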

Statistical-based Named Entity Recognition

The statistical NER approach usually applies Machine Learning techniques, including the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM), and the Support Vector Machine (SVM). It also requires a large amount of annotated training data.

Hidden Markov Model: The Hidden Markov Model (HMM) is a statistical model that describes patterns in terms of hidden parameters; pattern recognition is a typical application. In a regular Markov model, the state is directly visible to the observer, so the state transition probabilities are the only parameters. In a Hidden Markov Model, by contrast, only the outputs are visible. Each state has a probability distribution over the possible output tokens, so the sequence of generated tokens gives only indirect information about the sequence of states. Because the outputs can be observed but the state sequence cannot, the model is called a ‘Hidden Markov Model’. HMM has been used in various research fields, such as Natural Language Processing, Speech Recognition, and Optical Character Recognition. For the last 15 years, HMM has dominated Natural Language Processing and Speech Recognition. As Named Entity Recognition is a research field within both Natural Language Processing and Information Extraction, it is worth reviewing how HMM has been used. HMM is trained on observation sequences paired with label sequences, so a large training set must be prepared. The basic idea of HMM is very simple, so it is easy to implement and understand. Moreover, it uses positive data only, so it scales easily.
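To make the idea of recovering a hidden state sequence from observed tokens concrete, here is a minimal Viterbi decoding sketch in Python. All probabilities in `START`, `TRANS`, and `EMIT` are invented for illustration; in practice they would be estimated from the annotated training data mentioned above, and unseen words would need smoothing, which this sketch omits.

```python
import math

STATES = ["O", "PER"]  # "O" = outside any entity, "PER" = person name

# Hypothetical parameters, chosen by hand for this example only.
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.6, "PER": 0.4}}
EMIT = {
    "O": {"met": 0.4, "yesterday": 0.4, "john": 0.1, "smith": 0.1},
    "PER": {"john": 0.5, "smith": 0.4, "met": 0.05, "yesterday": 0.05},
}

def viterbi(words):
    """Return the most probable hidden state sequence for the words."""
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{
        s: (math.log(START[s]) + math.log(EMIT[s][words[0]]), None)
        for s in STATES
    }]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(EMIT[s][word]), prev)
        best.append(column)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

tags = viterbi(["john", "smith", "met", "john"])
print(tags)  # ['PER', 'PER', 'O', 'PER']
```

Only the words are given to `viterbi`; the tags are inferred from the transition and emission probabilities, which is the sense in which the states are hidden.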

Maximum Entropy Markov Model: The Maximum Entropy Markov Model (MEMM) is a well-known conditional statistical sequence model, also called the ‘Conditional Markov Model’. It is a graphical model that combines the advantages of the Hidden Markov Model and the Maximum Entropy model. It has been regarded as one of the most convenient models for extracting named entities. As mentioned above, HMM was the most successful approach over the last 15 years, but MEMM has recently received a lot of attention. Compared to HMM, MEMM offers more freedom in choosing the features that represent the observations, which makes it well suited to using domain knowledge to extract the required tokens. Moreover, while HMM training requires the forward-backward algorithm, MEMM estimates the parameters of its transition probabilities directly. In terms of cost and time, MEMM is therefore an efficient approach for training on the data and tagging. The model has been reported to provide higher recall and precision than other NER approaches. However, several researchers have pointed out that it suffers from the label bias problem.
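The freedom in choosing features can be illustrated with a small Python sketch of a single MEMM transition distribution. It assumes the usual maximum entropy form, a softmax over weighted features normalized separately for each previous label; the feature names and the values in `WEIGHTS` are hypothetical and would normally be learned from annotated data.

```python
import math

LABELS = ["O", "PER"]

# Hypothetical feature weights keyed by (feature, label).
WEIGHTS = {
    ("capitalized", "PER"): 2.0,
    ("lowercase", "O"): 1.5,
    ("prev=PER", "PER"): 1.0,
    ("prev_word=mr.", "PER"): 2.5,  # domain knowledge: a title precedes a name
}

def features(words, t, prev_label):
    """Arbitrary, possibly overlapping features of the observation at position t."""
    word = words[t]
    feats = ["capitalized" if word[0].isupper() else "lowercase",
             "prev=" + prev_label]
    if t > 0:
        feats.append("prev_word=" + words[t - 1].lower())
    return feats

def transition_distribution(words, t, prev_label):
    """P(label | prev_label, observation), normalized locally at position t."""
    scores = {
        label: sum(WEIGHTS.get((f, label), 0.0)
                   for f in features(words, t, prev_label))
        for label in LABELS
    }
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}

dist = transition_distribution(["Mr.", "Smith", "left"], 1, "O")
print(dist)
```

Because each distribution is normalized on its own at every position, the probability mass leaving a state always sums to one regardless of the observation. That local normalization is the source of the label bias problem noted above.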

Conditional Random Field: The Conditional Random Field (CRF) is a statistical modeling approach used mainly in pattern recognition and natural language processing, usually for predicting structured data. While most classifiers predict the label of a single sample in isolation, CRF takes the context of the sample into account. In natural language processing, including named-entity recognition, linear-chain CRFs are used to predict label sequences for input sequences. CRF has the benefits of MEMM without the label bias problem. Since CRF is an undirected graphical model that computes the conditional probability of the label sequence given the input, it can be used as an alternative to HMM. Even though it is costly in computation time, higher-order CRFs can model long-range dependencies.
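As a rough illustration of how a linear-chain CRF assigns a conditional probability to a whole label sequence, the Python sketch below scores every possible labeling and normalizes once over all of them. The weights inside `score` are invented, and the brute-force enumeration is an assumption made for clarity; practical implementations compute the normalizer with dynamic programming instead.

```python
import math
from itertools import product

LABELS = ["O", "PER"]

def score(words, labels):
    """Unnormalized log-score of a labeling, using hypothetical weights."""
    total = 0.0
    for t, (word, label) in enumerate(zip(words, labels)):
        if word[0].isupper() and label == "PER":
            total += 2.0  # capitalized words tend to be names
        if word[0].islower() and label == "O":
            total += 1.5  # lowercase words tend to be outside entities
        if t > 0 and labels[t - 1] == "PER" and label == "PER":
            total += 1.0  # names often span consecutive tokens
    return total

def sequence_probability(words, labels):
    """P(labels | words), normalized over all label sequences at once."""
    z = sum(math.exp(score(words, ys))
            for ys in product(LABELS, repeat=len(words)))
    return math.exp(score(words, labels)) / z

words = ["John", "Smith", "left"]
best = max(product(LABELS, repeat=len(words)),
           key=lambda ys: sequence_probability(words, ys))
print(best)  # ('PER', 'PER', 'O')
```

The single global normalizer is what distinguishes this from the per-position normalization of MEMM. It is also why training and inference cost more, since the sum ranges over every label sequence.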
