Instituto Tecnológico y de Estudios Superiores de Monterrey

Monterrey Campus

School of Engineering and Sciences

Intent Discovery from Conversational Logs to Prepare a Student Admission Chatbot for Tecnológico de Monterrey

A thesis presented by

Rolando Treviño Lozano

Submitted to the

School of Engineering and Sciences

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

Monterrey, Nuevo León, June 2021

Instituto Tecnológico y de Estudios Superiores de Monterrey

Campus Monterrey

The committee members hereby certify that they have read the thesis presented by Rolando Treviño Lozano and that it is fully adequate in scope and quality as a partial requirement for the degree of Master of Science in Computer Sciences.

Neil Hernández Gress, Ph.D.

Tecnológico de Monterrey, Principal Advisor

Héctor Gibrán Ceballos Cancino, Ph.D.

Tecnológico de Monterrey, Co-Advisor

Joanna Alvarado Uribe, Ph.D.

Tecnológico de Monterrey, Committee Member

Noé Alejandro Castro Sánchez, Ph.D.

Centro Nacional de Investigación y Desarrollo Tecnológico (CENIDET), Committee Member

Rubén Morales Menendez, Ph.D.

Associate Dean of Graduate Studies, School of Engineering and Sciences

Monterrey, Nuevo León, June 2021

Declaration of Authorship

I, Rolando Treviño Lozano, declare that this thesis, titled Intent Discovery from Conversational Logs to Prepare a Student Admission Chatbot for Tecnológico de Monterrey, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Rolando Treviño Lozano

Monterrey, Nuevo León, June 2021

Dedication

To my family, who support me in my decisions, motivate me, and always show me there are no limits in life but our imagination itself.

To my friends, who make difficult times go by as easily as possible.

To past and future me, for never surrendering when pursuing my dreams.

Acknowledgements

This research would not have been possible without the guidance and support of outstanding people at Tecnológico de Monterrey.

Firstly, I want to express my gratitude to Dr. Neil Hernández for allowing me to undertake this research and for introducing me to the area of natural language processing, of which I had no knowledge when the research began.

I wish to express special thanks to Dr. Héctor Ceballos for his outstanding guidance and deepest support throughout this research. Thank you for your patience and for the lessons you provided during this time.

Also, I would not have completed this research without the support of Dr. Joanna Alvarado. Thank you for demonstrating that everything is possible, for your unique friendship, and for dedicating your time to assisting me during difficult tasks.

I also wish to express my deepest appreciation to my closest friends, who are always there to guide and listen to me, and to the colleagues I met during this master’s program: Emmanuel Vázquez, Andree Vela, Raúl Martínez, and Miguel Lara. Thank you for all the laughs, lessons, and motivation; we were able to get through each of our projects during the unprecedented time of the Covid-19 quarantine. I am proud of each one of my friends.

Special recognition goes to my family, who showed me their support, motivation, and patience during this research. Not every moment was easy, but without their love and belief in me I would not be where I am at each stage of my life. Thank you for always being there for me.

Lastly, I would like to express deep gratitude to Tecnológico de Monterrey and CONACyT for their financial aid on tuition and support, which allowed me to grow academically and contribute to the scientific community.

Intent Discovery from Conversational Logs to Prepare a Student Admission Chatbot for Tecnológico de Monterrey

by

Rolando Treviño Lozano

Abstract

Online chat services allow companies to serve their customers and resolve their problems or doubts about a specific topic. Lately, conversational bots have been adopted in this domain, broadening attention capacity, easing interactions between users and the company, and reducing the workload of agents, which increases productivity and service quality. Designing a chatbot is a time-consuming task, as the designer has to provide the core concepts the conversational bot will respond to, known as intents, along with example sentences and their respective answers. We propose a framework that receives as input conversational transcripts between prospects and agents and transforms them, through regular expressions, into a tabular dataset of the conversations in log format, which eases their analysis and representation. The texts are then converted into a TF-IDF word representation, which serves as input for unsupervised machine learning algorithms: Non-Negative Matrix Factorization for topic modeling and K-Means for utterance clustering. These algorithms discover possible intents, which can then be passed on to the design of a knowledge base. Intent discovery is iterative: new conversations can be processed to identify changes in the intents or the addition of new ones. Results demonstrate that it is possible to cluster the utterances and find clusters that align with an intent from a list of possible intents, a list that is subject to change over time as intent discovery is continuously improved. A cosine similarity threshold was set at 0.47 to differentiate correctly aligned clusters from those not aligned; 18 intents out of 55 were correctly aligned with an initial intents list, and a total of 35 different intents were captured by the clustering process.
No closely comparable research was found in the literature: other works in the domain assume an already curated and labeled dataset and classify the intents rather than discovering them during the knowledge base design, and they do not take into account the whole process of transforming the raw conversations into a tabular, processed dataset.

List of Figures

2.1 General text-based chatbot architecture [46]

3.1 General framework for discovering intents

3.2 Research methodology

3.3 Top 25 countries from users

3.4 Top 25 Mexican states from users

3.5 Messages by department

3.6 Messages by month

3.7 Messages by day of week

3.8 Messages by hour

3.9 Tecbot conversation duration in contrast to all conversations

3.10 Average messages sent by an agent and Tecbot per conversation

3.11 Non-Negative Matrix Factorization Visualized

3.12 K-means algorithm extracted from [48]

4.1 Message count by month SOAD department only

4.2 Message length box plot (raw)

4.3 Message length box plot (cleaned)

4.4 Word clouds before and after text preprocessing

4.5 Word frequency raw text without stopwords filter (Top 25)

4.6 Word frequency lemmatized text with stopwords filter (Top 25)

4.7 BoW - NMF (Frobenius norm)

4.8 BoW - NMF (Kullback-Leibler divergence)

4.9 TF-IDF - NMF (Frobenius norm)

4.10 TF-IDF - NMF (Kullback-Leibler divergence)

4.11 Silhouette Score k=50-1000 (BoW)

4.12 Silhouette Score k=50-1000 (TF-IDF)

4.13 Histogram of similarities between utterances and intents

4.14 Similarity thresholds and their respective percentage of utterances that are able to be found

4.15 Clustering to intent alignment for 100 clusters with TF-IDF

4.16 Clustering to intent alignment for 300 clusters with TF-IDF

4.17 Histogram for the similarity values of the 300 clusters to the intents

4.18 Process to reproduce the proposed methodology

List of Tables

2.1 Comparison between text classification models, as mentioned in [34]

3.1 Features of reports regarding conversations

3.2 Features of log format conversations

3.3 Example Bag-of-Words representation

3.4 Example TF representation

3.5 Example IDF representation

3.6 Example TF-IDF representation

4.1 Example conversation in log-format

4.2 Example of texts and their lemmatized results

4.3 Statistical information of message length in filtered data (raw)

4.4 Statistical information of message length in filtered data (cleaned)

4.5 Statistical information of message length in filtered data (cleaned messages with length greater than one)

4.6 Statistical information of BoW NMF topics (Frobenius norm)

4.7 Statistical information of BoW NMF topics (Kullback-Leibler divergence)

4.8 Statistical information of TF-IDF NMF topics (Frobenius norm)

4.9 Statistical information of TF-IDF NMF topics (Kullback-Leibler divergence)

4.10 Intents found in correctly aligned clusters (k=100)

4.11 Cluster #4 examples for alignment validation

4.12 Cluster #6 examples alignment validation

4.17 Intents found in correctly aligned clusters (k=300)

4.18 Conversational Bots Services Comparison

Chapter 1 Introduction

Artificial Intelligence (AI) has been growing at an incredible rate in the past years [16, 54]. Text mining is the process of discovering, by computer, information that was previously unknown due to the unstructured representation of raw text [20]. The use of text mining in business has transitioned from novelty to common practice; in fact, its most popular application in customer service is chatbots [32]. A chatbot system is a software program that interacts with users via conversations in natural language [49].

The very first chatbot, ELIZA, dates to 1966 and was created by Joseph Weizenbaum at the Massachusetts Institute of Technology (MIT) Artificial Intelligence Laboratory. It simulated a psychotherapist character named “Eliza Doolittle” and maintained conversations using pattern matching and substitution; that is, it recognized certain keywords, which allowed it to answer in the form of a question that simulated an understanding of context. ELIZA gave the illusion of understanding context while having no built-in framework for contextualizing events [50].

The next major chatbot was A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), brought to life in 1995. It won the Loebner Prize, which is awarded to the computer programs considered to be the most human-like, three times (2000, 2001, and 2004). Even though it was not able to pass the Turing Test, it served as the basis for the creation of other chatbots thanks to its AIML (Artificial Intelligence Markup Language) implementation [49].

One of the latest innovations in artificial intelligence and conversational systems is IBM Watson, a program that is able to answer questions posed in natural language. It became popular in 2011 after participating in the TV quiz show Jeopardy! [30] and winning against the show’s champions. Watson has also been applied to healthcare, where research analyzed the critical issues in which an AI like Watson could help physicians and the best way to achieve a successful human-machine interaction in order to provide optimal assistance [3].

Bots have also taken on large-scale maintenance work: in 2014, bots completed about 15% of all the edits in the Wikipedia encyclopedia. They performed tasks such as reverting malicious modifications to documents, enforcing bans, adding inter-language links, importing content automatically, and identifying copyright violations, among others [52].

As an example of the impact that chatbots may have on society, Microsoft launched Tay in 2016. It was deployed on Twitter and intended to simulate a teenage girl. Tay was set to interact with other users on the platform and learn from those interactions. After a few hours, Tay began to post derogatory and racist comments [16], forcing Microsoft to shut the bot down.

Nowadays, messaging services like Facebook Messenger [16], Telegram, and WhatsApp host thousands of chatbots on their platforms, allowing many businesses to increase their sales and improve their customer service.

Customer support is an intimate experience, as the customer’s loyalty to the company depends on how well his or her needs are treated. The customer must always be at the center of any decision made by a company, and every implementation of an AI should benefit the customer first and the company second. All businesses should take the opportunity to deliver a better experience to their customers, and one of the best ways to convey such a caring corporate image is through an automated chatbot service [23]. Online customer support has become a more demanding department for enterprises as the Internet has become more accessible to users [8]: it is more practical for a user to log into a chat room with the company’s support team than to visit the offices in person to request help.

Human agent-to-user interaction requires an investment in each of the agents attending to the users’ inquiries. A conversational chatbot service could help the industry reduce the number of agents needed, which would minimize the investment in such a department. A modern AI-based chatbot’s process consists of receiving data, giving meaning to it by applying natural language understanding techniques, and then acting according to its knowledge base [46], which provides a proper response following the rules set by the chatbot designers. The knowledge base of a modern AI-based chatbot can be seen in terms of entities and intents. Entities are the subjects that a user is referring to, and intents are the actions that the user is seeking to perform with such entities [46]. For example, in the sentence “tell me the weather for Monterrey city”, Monterrey city is the entity, and knowing the weather for that city is the intent. The whole sentence can be reduced to weather(Monterrey city). Designing such entities and intents can require considerable effort. A poorly designed knowledge base will result in the chatbot not being able to understand user utterances or, even worse, giving wrong answers to such utterances.
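The reduction of an utterance to an intent applied to an entity can be sketched in a few lines of Python. This is only an illustration and not the approach developed in this thesis: the `ParsedUtterance` type, the keyword table, the entity list, and the `parse` function are hypothetical names, and the keyword lookup stands in for a trained natural language understanding component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedUtterance:
    intent: Optional[str]  # the action the user is seeking, e.g. "weather"
    entity: Optional[str]  # the subject the action applies to


# Hypothetical hand-written knowledge base: intent name -> trigger keywords.
INTENT_KEYWORDS = {
    "weather": ("weather", "forecast"),
    "opening_hours": ("open", "hours"),
}
# Hypothetical list of entities the bot knows about.
KNOWN_ENTITIES = ("Monterrey city", "Mexico City")


def parse(utterance: str) -> ParsedUtterance:
    """Return the first intent and entity whose surface form occurs in the text."""
    text = utterance.lower()
    intent = next(
        (name for name, keywords in INTENT_KEYWORDS.items()
         if any(keyword in text for keyword in keywords)),
        None,
    )
    entity = next((e for e in KNOWN_ENTITIES if e.lower() in text), None)
    return ParsedUtterance(intent, entity)


result = parse("tell me the weather for Monterrey city")
print(f"{result.intent}({result.entity})")  # prints: weather(Monterrey city)
```

An utterance that matches no keyword yields `intent=None`, which corresponds to the case in which a poorly designed knowledge base leaves the chatbot unable to understand the user.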

1.1 Problem Statement and Motivation

This research represents an internal use case for Tecnológico de Monterrey. The goal of this work is to allow a smooth transition from an online chat service driven by human agents to a conversational agent set up as a chatbot service that incorporates natural language processing and machine learning.

Tecnológico de Monterrey currently has 93,168 students in total, of whom 27,402 are high school students, 58,782 are undergraduate students, and 6,984 are graduate students. These figures mean that a large number of students can consult and make use of the admissions chat service [51].

A total of 12 agents working in the chat service can only attend to a limited number of users. An automated chatbot service will help in such a scenario by providing service twenty-four hours a day, seven days a week to support the users asking for assistance in the chat service [32]. At the moment, the admissions department provides a decision-based chatbot, TecBot, as a front line for online support: the most common queries are answered by the bot, and users are transferred to human agents for more specific problems or questions. However, a decision-based bot limits the domain of questions that can be asked to the conversational system while also providing a computerized experience [36] that may not fare well in user satisfaction evaluations, as mentioned in [44, 49]. These limitations are addressed by implementing an AI-based bot, which creates a natural communication flow with the user.

We aim to analyze, clean, process, and apply text mining methods to conversations carried out from January 2019 until the end of December 2019, in order to identify sets of similar dialogues that, when grouped together, will help to identify the set of intents to be added to the knowledge base.

One challenge we face is language. Natural language processing research for Spanish is less common than it is for English. To analyze and process the data thoroughly, we need to pay closer attention to each processing step and add or create custom functionality where needed to reach a specific goal.

1.2 Hypothesis and Research Questions

It is possible to group a text corpus of conversations between users and agents of a student admission online chat service into a set of clusters to discover intents through unsupervised machine learning algorithms to aid in the design of an AI-based chatbot.

The research questions are:

• What is the process to transform the unstructured data into a representation to be used by machine learning algorithms?

• How can we identify similar utterances and cluster them?

• Which machine learning models are appropriate to gather the sets of similar utterances?

• How can we evaluate the results and determine if a cluster is associated with an existing intent? How can we resolve new intents?

• What is the similarity threshold to distinguish clusters aligned to an intent? How can such value be established?

1.3 Objectives

Conversational agents have become a key focus for customer services in many industries. The objective of this thesis is to help construct a knowledge base consisting of intents extracted from log data in an unsupervised manner. Natural language conversations from the student admission online chat services of Tecnológico de Monterrey served as input for our methodology to extract the information needed to set up a conversational chatbot.

To fulfill this thesis’ main objective, several specific objectives are proposed:

• Collect data from the student admissions department containing conversations between users and agents during a determined period.

• Analyze the collected data and create a general report of insights of the conversations.

• Transform the unstructured text data into a convenient representation to be processed by machine learning algorithms.

• Apply natural language processing techniques to the new representation of the text data to find clusters of similar utterances.

• From the sets of similar utterances, compare them against existing intents and gather their similarity measures.

• Evaluate the alignment of the clusters with their intent similarity, determine which clusters are correctly aligned with their intent and which are not, and establish the similarity threshold that differentiates them.
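The last two objectives, comparing clusters against existing intents and applying a similarity threshold, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the sparse dictionary vectors, the toy weights, the intent names, and the function names are all hypothetical. Only the 0.47 threshold is taken from the thesis, where it is reported in the abstract.

```python
import math
from typing import Dict, Optional, Tuple

Vector = Dict[str, float]  # sparse representation: term -> weight

SIMILARITY_THRESHOLD = 0.47  # cosine similarity threshold reported in the abstract


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_cluster(
    centroid: Vector, intents: Dict[str, Vector], threshold: float
) -> Tuple[Optional[str], float]:
    """Return the most similar intent, or None when no intent reaches the threshold."""
    best_name, best_similarity = None, 0.0
    for name, intent_vector in intents.items():
        similarity = cosine_similarity(centroid, intent_vector)
        if similarity > best_similarity:
            best_name, best_similarity = name, similarity
    if best_similarity >= threshold:
        return best_name, best_similarity
    return None, best_similarity


# Toy data (hypothetical): two known intents and two cluster centroids.
intents = {
    "scholarships": {"scholarship": 1.0, "apply": 0.5},
    "admission_exam": {"exam": 1.0, "date": 0.8},
}
aligned_centroid = {"scholarship": 0.9, "requirements": 0.3}
unaligned_centroid = {"parking": 1.0}

print(align_cluster(aligned_centroid, intents, SIMILARITY_THRESHOLD))
print(align_cluster(unaligned_centroid, intents, SIMILARITY_THRESHOLD))
```

A cluster for which the function returns `None` is a candidate for a new intent and would be inspected manually before being added to the intent list.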

1.4 Main Contributions

The contributions of this research to the scientific community are as follows:

• Our research shows practical approaches to transform unstructured text data belonging to conversations into a log format that clearly presents the structure of the conversations. It also provides the steps necessary to clean, standardize, and process the text so that machine learning algorithms can be applied to the resulting texts in a convenient word representation.

• We apply two different techniques, topic modeling and clustering, to gain insight into the intents found in the texts. The former serves as a guide to discover texts that are not yet identified as related to an intent, and the latter captures groups of similar questions given a distance metric.

• Our solution aids the manual labeling of conversational logs, a time-consuming activity, by providing the groups that explain a possible user intent. This contributes to the task of implementing conversational bots using cloud services such as Microsoft Azure, Google Dialogflow, or IBM Watson, which request a set of intents and their respective examples; these intents and examples are the output of our framework.
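To give an idea of the first contribution, the sketch below turns a raw transcript into log-format rows using a regular expression. The transcript layout (`[timestamp] sender: message`), the column names, and the function name are assumptions made purely for illustration; the actual transcripts and fields used in this work are described in later chapters and may differ.

```python
import re
from typing import Dict, List

# Hypothetical transcript line: "[2019-01-15 10:32:05] User: message text"
LINE_PATTERN = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<sender>[^:]+):\s*(?P<message>.*)$"
)


def transcript_to_log(conversation_id: str, transcript: str) -> List[Dict[str, str]]:
    """Convert a raw transcript into one row (dict) per message."""
    rows: List[Dict[str, str]] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        match = LINE_PATTERN.match(line)
        if match:
            rows.append({"conversation_id": conversation_id, **match.groupdict()})
        elif rows:
            # A line without a timestamp continues the previous message.
            rows[-1]["message"] += " " + line
    return rows


sample = (
    "[2019-01-15 10:32:05] User: Hello, I would like information about scholarships\n"
    "[2019-01-15 10:32:40] Agent: Of course,\n"
    "here are the requirements."
)
for row in transcript_to_log("conv-001", sample):
    print(row)
```

Each resulting row has the same fields, so a list of such rows can be loaded directly into a tabular structure for the analysis steps that follow.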

1.5 Summary

For this work, we will develop a framework that will output a well-designed knowledge base for a conversational chatbot. The input for such a framework will be text files containing logs from real conversations, with specific fields as columns to maintain a standardized format. Such files will be pre-processed and analyzed, and then algorithms from text mining, machine learning, and natural language processing will be applied to them in order to output the intents that will make up the chatbot knowledge base.

The scope of this research is as follows: to work with chat logs from an online customer support system, which for the purposes of this research is that of the admissions department of Tecnológico de Monterrey. Another aspect of the scope is that the technologies applied to such input will be text mining, machine learning, and natural language processing. We will seek to group similar dialogues in order to construct the knowledge base of intents.

Chapter 2 State of the Art

This chapter provides the concepts that will support the presented work. It is divided into different sections of theory. First, machine learning general concepts are shown. Next, a more specific theory is presented about text mining and natural language processing (NLP). Then, the theory about the processing of text is described, later showing the procedure that is used for the completion of this research. Finally, concepts regarding chatbots are presented.

2.1 Machine Learning

Aurélien Géron [19] states that machine learning is the science of programming computers to learn from data. In his work Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, he supports the concept of machine learning with two more definitions:

• “Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.”, Arthur Samuel, 1959.

• “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”, Tom Mitchell, 1997.

The importance of machine learning lies in its broad potential to aid humans in decision processes that would normally consume a considerable amount of time if done manually [35].

Machine learning systems can be classified according to the following criteria, as mentioned in [19]:

• Whether the system is trained with human supervision.

– Supervised learning: refers to when the training set being fed to the algorithm includes the outcome of each data point, also known as labels [19].

Two main tasks on this supervised approach are:

* Classification: given a training dataset whose data points carry known classes (labels), a classification model learns to assign new data points to those classes.

* Regression: a task in which the training dataset contains features along with the target variable to be predicted (the label). A regression system needs to be fed many examples of the data in context in order to predict the target accurately.

Some of the most important examples of supervised learning algorithms are:

* k-Nearest Neighbors

* Linear Regression

* Logistic Regression

* Support Vector Machines (SVMs)

* Decision Trees and Random Forests

* Neural Networks (some architectures can also be used in unsupervised and semisupervised contexts)

– Unsupervised learning: contrary to supervised learning, the training set being fed to the algorithm does not include the labels. Unsupervised learning algorithms try to learn without guidance on the data [19].

The most important unsupervised approaches along with their algorithms are:

* Clustering: Helps detect groups with similar features within the data.

· K-Means

· DBSCAN

· Hierarchical Clustering Analysis (HCA)

* Anomaly detection and novelty detection: Helps determine data points that do not follow a pattern present in the rest of the dataset.

· One-class SVM

· Isolation Forest

* Visualization and dimensionality reduction: Helps to reduce the dimensions of a dataset (commonly to 2D or 3D representations) while maintaining structure and integrity in order to be able to visualize them in a plot and appreciate how the data is separated within space.

· Principal Component Analysis (PCA)

· Kernel PCA

· Locally Linear Embedding (LLE)

· t-Distributed Stochastic Neighbor Embedding (t-SNE)

* Association rule learning: Helps find specific patterns or associations within the data, revealing insightful information which may explain the problem in context.

· Apriori

· Eclat

– Semisupervised learning: located between supervised and unsupervised learning, semisupervised learning deals with training datasets that are partially labeled. Given a few correctly labeled data points, a semisupervised algorithm propagates labels across similar data points, saving the time-consuming task of manually labeling many examples [19].

– Reinforcement learning: an agent in an environment selects actions that lead to rewards or penalties, depending on whether each action improves or worsens a given performance measure. As it acts, the agent updates its internal strategy for selecting actions in a given context, called a policy. The goal is to learn the policy that maximizes the reward obtained over time [19].

• Whether the system can learn incrementally while it is running in deployment.

– Batch learning: also known as offline learning, the system is unable to learn incrementally from new incoming data. The workflow is: first, the system is trained on all available data, then it is launched or deployed into a production environment and keeps running without continuous learning.

– Online learning: the system is able to continue learning as data streams keep incoming, allowing the system to learn from new data incrementally.

• Whether the system works by comparing new data to previously known data, or by finding patterns in the available data in order to build a predictive model.

– Instance-based learning: the system labels new data points with the labels of the training examples that are closest to them according to a similarity measure.

– Model-based learning: the system is trained on a data set, fitting the parameters of a specific algorithm so that it best fits the data. When new data points arrive, the system applies the model with the parameters learned during training to predict their labels.
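The contrast between a supervised, instance-based learner and an unsupervised clustering algorithm from the taxonomy above can be illustrated on a toy two-dimensional dataset. The sketch below is a simplified standard-library version of k-Nearest Neighbors and K-Means written for illustration only. The data points, labels, and function names are made up, and the initial centroids are passed in explicitly (an assumption made to keep the result deterministic, whereas typical implementations choose them randomly).

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def knn_predict(train: Sequence[Tuple[Point, str]], query: Point, k: int = 3) -> str:
    """Supervised, instance-based: majority label among the k nearest labeled points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def k_means(
    points: Sequence[Point], initial_centroids: Sequence[Point], iterations: int = 10
) -> List[int]:
    """Unsupervised: assign each unlabeled point to one of k clusters."""
    centroids = list(initial_centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

# Supervised: the labels are part of the training data.
print(knn_predict(list(zip(points, labels)), (1.1, 1.0)))

# Unsupervised: only the points are given; groups are discovered.
print(k_means(points, [(0.0, 0.0), (6.0, 6.0)]))
```

The classifier needs the labels to make a prediction, while the clustering function recovers the two groups from the points alone, which is the setting of the intent discovery task addressed in this thesis.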

2.2 Text Mining and Natural Language Processing

As stated previously, text mining allows users to discover information by transforming unstructured raw text into a structured representation, easing the extraction of valuable information that could not be obtained from the unstructured form [20]. Text mining uses Natural Language Processing (NLP) to transform the unstructured data into a structured form. NLP is the area of computer science that deals with methods to analyze, model, and understand human language [53].

The main difference between both concepts is that text mining (or text analytics) focuses on obtaining insights from textual data to aid decision-making processes, while natural language processing works on the data to extract key information or create a model for a specific task, be it classification, generation, or clustering of texts.

2.2.1 Text Mining Scopes

Kowsari et al. [34] mention the following scopes of text mining:

• Document-level: Defines categories present in a full-text document.

• Paragraph-level: Defines categories present per paragraph in a document.

• Sentence-level: Defines categories present per sentence found in a paragraph.

• Sub-sentence-level: Defines categories present per each of the expressions found in a sentence.

2.2.2 Natural Language Processing Pipeline

Vajjala et al. [53] define a step-by-step text processing pipeline, which consists of the following stages:

• Data acquisition: some strategies for acquiring data are: using public datasets, scraping data from the web, data augmentation, among others.

• Text cleaning: the process of cleaning up the data, such as HTML parsing, Unicode normalization, spelling correction, and system-specific error correction.

• Pre-processing: the procedure to clean the data at a sentence or word level, such as sentence and word tokenization, stop word removal, stemming and lemmatization, removing digits, removing punctuation, and lowercasing, as well as further steps such as normalization, parsing, and Part-of-Speech (POS) tagging.

• Feature engineering: captures the characteristics of text in a numeric vector representation that can be understood by ML algorithms. This can be statistical measures of text in terms of word presence, such as One-Hot Encoding, Bag-of-Words, Bag-of-N-Grams, and term frequency-inverse document frequency (TF-IDF), or distributed representations such as word embeddings.

• Modeling: refers to the application of ML algorithms in terms of the task that is to be solved given the context.

• Evaluation: in order to determine the “goodness” of a model, we can apply metrics such as accuracy, precision, recall, F1 score, and AUC, among others, which are to be chosen according to the algorithm and task at hand.

• Deployment: most machine learning deployments are part of a larger system, and serve as web services, taking the specific input for the task and returning the result to serve as part of a broader pipeline in a system.

• Monitoring and model updating: it is important to keep monitoring the behavior of the model given its modality (either offline or online learning), in order to ensure that the performance of the model on the new data given as input is as desired and update the model if necessary.
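The pre-processing and feature engineering stages of the pipeline above can be sketched with the standard library alone. The code is illustrative: the stop word list is a tiny hand-written sample, the example sentences are invented, and the TF-IDF weighting shown (term count divided by document length, times the logarithm of the number of documents over the document frequency) is one common textbook variant that may differ from library defaults and from the weighting used later in this thesis.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "are", "do", "for", "how", "i", "in", "is", "the", "what", "when"}


def preprocess(text: str) -> List[str]:
    """Lowercase, drop digits and punctuation, tokenize, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[\W\d_]+", " ", text)  # keeps letters, including accented ones
    return [token for token in text.split() if token not in STOP_WORDS]


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


raw_documents = [
    "What are the requirements for the admission exam?",
    "When is the admission exam in 2021?",
    "How do I apply for a scholarship?",
]
tokenized = [preprocess(doc) for doc in raw_documents]
vectors = tf_idf(tokenized)

print(tokenized)
print(vectors[0])
```

In the first document, "requirements" receives a higher weight than "admission" because it occurs in only one of the three documents, which is the behavior that makes TF-IDF useful for separating utterances by topic.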

2.2.3 Applications of Text Mining

Text mining is an interdisciplinary field as it covers many areas of computing in order to process textual data and output expected results. Some of its areas are mentioned in [55]:

Text Classification

This area consists of training a model that learns to classify documents according to their training labels, so that new records can be classified when they are given to the trained model. Several classification algorithms are mentioned in [34]. Some of them are listed in Table 2.1.
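As an illustration of this train-then-classify workflow, the following sketch implements a small multinomial naive Bayes classifier with add-one (Laplace) smoothing, one of the model families compared in [34]. It is a simplified standard-library version written for this example, and the class name, training sentences, and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class NaiveBayesTextClassifier:
    """Minimal multinomial naive Bayes over whitespace-separated tokens."""

    def __init__(self) -> None:
        self.class_counts: Counter = Counter()                       # documents per label
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)  # tokens per label
        self.vocabulary: Set[str] = set()

    def fit(self, examples: List[Tuple[str, str]]) -> None:
        """Count documents and tokens for every (text, label) training pair."""
        for text, label in examples:
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text: str) -> str:
        """Return the label with the highest log-posterior for the given text."""
        if not self.class_counts:
            raise ValueError("fit must be called before predict")
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        scores: Dict[str, float] = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing avoids zero probabilities for unseen tokens.
                likelihood = (self.word_counts[label][token] + 1) / (
                    total_words + len(self.vocabulary)
                )
                score += math.log(likelihood)
            scores[label] = score
        return max(scores, key=scores.get)


training_data = [
    ("when is the admission exam", "exam"),
    ("admission exam date and place", "exam"),
    ("how to apply for a scholarship", "scholarship"),
    ("scholarship requirements and percentage", "scholarship"),
]
classifier = NaiveBayesTextClassifier()
classifier.fit(training_data)

print(classifier.predict("exam date"))
print(classifier.predict("scholarship percentage"))
```

Note that this workflow requires labeled examples up front, which is precisely what is unavailable when intents still have to be discovered from raw conversations.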

Logistic Regression

Advantages:

• Easy to implement

• It does not require too many computational resources

• It does not require input features to be scaled

• It does not require any tuning

Disadvantages:

• It cannot solve non-linear problems

• Prediction requires that each data point be independent

• Attempting to predict outcomes based on a set of independent variables

Naïve Bayes Classifier

Advantages:

• It works very well with text data

• Easy to implement

• Fast in comparison to other algorithms

Disadvantages:

• A strong assumption about the shape of the data distribution

• Limited by data scarcity for which any possible value in feature space, a likelihood value must be estimated by a frequentist

• Attempting to predict outcomes based on a set of independent variables

K-Nearest Neighbor

Advantages:

• Effective for text data sets

• Non-parametric

• More local characteristics of text or document are considered

• Naturally handles multi-class data sets

Disadvantages:

• Computation of this model is very expensive

• Difficult to find the optimal value of k

• A constraint for large search problems to find nearest neighbors

• Finding a meaningful distance function is difficult for text data sets

Support Vector Machines

Advantages:

• SVM can model non-linear decision boundaries

• Performs similarly to logistic regression when linear separation

• Robust against overfitting problems (especially for text data set due to high-dimensional space)

Disadvantages:

• Lack of transparency in results caused by a high number of dimensions (especially for text data)

• Choosing an efficient kernel function is difficult (susceptible to overfitting/training issues depending on the kernel)

• Memory complexity

Decision Tree

Advantages:

• Can easily handle qualitative (categorical) features

• Works well with decision boundaries parallel to the feature axis

• A decision tree is a very fast algorithm for both learning and prediction

Disadvantages:

• Issues with diagonal decision boundaries

• Can be easily overfit

• Extremely sensitive to small perturbations in the data

• Problems with out-of-sample prediction

Random Forest

• Ensembles of decision trees are very fast to train in comparison to other techniques

• Reduced variance (relative to regular trees)

• It does not require preparation and pre-processing of the input data

• Quite slow to create predictions once trained

• More trees in forest increases time complexity in the prediction step

• Not as easy to visually interpret

• Overfitting can easily occur

• Need to choose the number of trees at the forest

Deep Learning

• Flexible with features design (re- duces the need for feature engineering, one of the most time-consuming parts of the machine learning prac- tice)

• Architecture that can be adapted to new problems

• Can deal with complex input-output mappings

• Can easily handle online learning (It makes it very easy to re-train the model when newer data becomes available)

• Parallel processing capability (It can perform more than one job at the same time)

• Requires a large amount of data (if you only have small sample text data, deep learning is unlikely to outper- form other approaches

• It is extremely computationally expensive to train

• Model interpretability is the most important problem of deep learning (deep learning most of the time is a black box)

• Finding an efficient architecture and structure is still the main challenge of this technique

Table 2.1: Comparison between text classification models, as mentioned in [34]

Text Clustering

Similar to document classification, the task of document clustering is to organize the documents into groups of similar ones. In contrast to document classification, in this task the respective document label is not provided.

• K-Means: partitions the data into k groups by choosing k initial centroids, assigning each data point to its closest centroid, recomputing each centroid as the average of the points assigned to it, and repeating these steps until the cluster assignments no longer change [4].


• Hierarchical: can be agglomerative (AGNES), where every instance starts as its own cluster and the most similar clusters are merged iteratively until a single group is reached, or divisive (DIANA), where the data starts as one single set that is split repeatedly until each data point forms its own group [4].

• Density-Based (DBSCAN): is able to handle noise and find clusters of arbitrary shape in the data. It also does not require the number of clusters to be specified in advance, unlike the k parameter of the K-Means algorithm [9].
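The K-Means procedure in the list above can be sketched in a few lines. This is a minimal illustration with hypothetical two-dimensional points and caller-supplied initial centroids; a practical implementation would typically use random or k-means++ initialization and operate on document vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-Means: assign points to the nearest centroid, then re-average."""
    centroids = [list(c) for c in initial_centroids]  # copy, do not mutate input
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical toy data with two well-separated groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans(points, initial_centroids=[[0, 0], [10, 10]])
```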

Another approach to organizing documents by their contents is through the technique known as Topic Modeling. It is an unsupervised learning technique to represent topics present in a set of documents according to their representative contents [4]. Algorithms for topic modeling are described as follows:

• Latent Semantic Analysis (LSA) [13]: in contrast to the statistics behind LDA, LSA returns groups of documents that contain the same words.

• Probabilistic Latent Semantic Analysis (PLSA) [24]: in contrast to LSA, makes use of a mixture decomposition derived from a latent class model.

• Latent Dirichlet Allocation (LDA) [6]: topics are expressed as probability distributions in which a given set of terms may occur. It allows for flexibility of topics as non-distinct words may be encountered in different themes.

• Non-Negative Matrix Factorization (NMF or NNMF) [40, 37]: states that a term frequency-inverse document frequency matrix can be decomposed into two factors: terms per topic and documents per topic.

The general recommendation stated in [4] is that LSA works better when dealing with larger corpora for learning descriptive topics, and LDA, as well as NMF, can be used when dealing with shorter and more compact textual data. Chen et al. [10] performed a set of experiments comparing LDA and NMF for short texts such as tweets, news headlines, and forum questions and concluded that NMF is able to produce higher-quality topics than LDA.

Information Retrieval

Given a set of documents and a query, the task of information retrieval is to search through the documents and return those matching the query. One basic example is document similarity within the set of documents. Schütze et al. [48] describe it as finding material of an unstructured nature (referring to the raw text form) that satisfies an information need from within a large collection stored on computers.

Information Extraction

Information Extraction refers to extracting specific information and transforming the textual representation into a corresponding structured model, such as translating a sales document into a spreadsheet reporting the sales information stated in the text. Jurafsky et al. [31] define information extraction as turning unstructured data embedded in raw text into structured information, allowing storage, for example, in a relational database. Common tasks include relation extraction, temporal data extraction, event extraction, and template filling.

2.3 Conversational bots

A chatbot is a software program that interacts with users via conversations using natural language [49]. Users are able to access data and services by exchanging natural language messages with the chatbot [16].

Humans are able to communicate by signs, text, and voice. Human-machine interactions are currently limited to either text-based or voice-based conversations. The most popular way for a human to interact with a machine is through text; examples of this channel are IBM Watson and QA platforms. Voice-based dialogue platforms, such as Apple Siri, Amazon Alexa, and Google Assistant, have been developed in the previous decade [17].

Chatbots have become more popular in sectors such as automotive, customer support, education, entertainment, finance, healthcare, marketing, manufacturing, and retail, in systems that help workers interact with virtual assistants in order to automate complex workflows with support 24 hours a day, 7 days a week [17, 32].

2.3.1 Chatbot Architecture

Sánchez-Díaz et al. [46] show the general architecture of a text-based chatbot, as seen in Figure 2.1. A description of each step of the architecture is as follows:

Figure 2.1: General text-based chatbot architecture [46]


• User input: an utterance in the form of text is received from the user, stating the message to be sent to the conversational bot. This is done through an interface that allows the connection between a user and the conversational bot service.

• Raw data: refers to extracting the information from the utterance focusing on its contents and also decomposing features such as context.

• Language understanding: specific to the branch known as natural language understanding, refers to extracting the information that the utterance presents, such as a possible intent (an action desired to be done) or an entity adding to the context (such as a place or time). Information such as grammar, semantics, pragmatics, and logic present in the text is processed in this step in order to consult the corresponding information in the knowledge base, mentioned next.

• Knowledge base: The chatbot knowledge base is the core of the whole bot. It is the database in which all the information that is going to be handled for the responses is stored. Knowledge base designers are responsible for the correct management and interconnection of information [46] that is going to be used by the bot in order to produce a set of answers to given user inputs.

Definition of intents and entities, as described in [7]:

– Intents: Represent the actions that the user intends to accomplish with the chatbot.

– Entities: Domain-specific information items found in the utterances associated with an intent.

• Response: once the knowledge base has been consulted, a score is assigned to the most relevant intent found in the utterance, and its related answer is provided back to the user via the same channel through which the input text was received.
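The flow described above (user input, language understanding, knowledge base lookup, and response) can be sketched as follows. The intents, keywords, answers, campus list, and score threshold are all hypothetical placeholders chosen for illustration; they are not taken from the architecture in [46] nor from the admissions chatbot.

```python
import re

# Hypothetical knowledge base: each intent has trigger keywords and one answer.
KNOWLEDGE_BASE = {
    "ask_scholarship": {
        "keywords": {"scholarship", "financial", "aid"},
        "answer": "Scholarship information is available from admissions.",
    },
    "ask_exam_date": {
        "keywords": {"exam", "admission", "date", "test"},
        "answer": "Admission exam dates are published each semester.",
    },
}

# Hypothetical entity type: a campus name mentioned in the utterance.
CAMPUS = re.compile(r"\b(monterrey|guadalajara|puebla)\b")


def understand(utterance):
    """Language understanding: score every intent and extract entities."""
    text = utterance.lower()
    tokens = set(re.findall(r"[a-z]+", text))
    scores = {
        intent: len(tokens & entry["keywords"]) / len(entry["keywords"])
        for intent, entry in KNOWLEDGE_BASE.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best], CAMPUS.findall(text)


def respond(utterance, min_score=0.3):
    """Response: answer with the best intent, or fall back if the score is low."""
    intent, score, _entities = understand(utterance)
    if score < min_score:
        return "Sorry, I did not understand that."
    return KNOWLEDGE_BASE[intent]["answer"]


intent, score, entities = understand("When is the admission exam in Monterrey?")
```

Keyword overlap is used here only to keep the sketch self-contained; production systems generally rely on trained intent classifiers.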

2.4 Related Work

Works related to the scope of this research are described next. The literature analysis covers clustering techniques applied to textual domains as well as their relation to conversational bot functionality, such as the creation of a knowledge base or natural language understanding for interpreting conversational data.

Intent classification has been applied to the web domain [28, 39, 38, 56], where keywords in query data and users' clicks through a website are analyzed in order to predict the intent of the user. A similar task in the commercial domain is online commercial intent (OCI) detection [11, 27, 25, 21], where the main objective is to identify the user's commercial intent (find, buy, among others).

Jizhou Huang [29] proposed the automatic extraction of knowledge from online discussion forums by extracting <question, answer> pairs using Support Vector Machines (SVM), with an evaluation that required human intervention to score how related an answer is to a topic before training. Kim et al. [33] explored different classifiers (Support Vector Machines for Sequence Tagging, Naïve Bayes, and Conditional Random Fields), along with varying word representations (Bag-of-Words, TF-IDF), and additionally tested extra information on one-on-one conversational data. The results showed that using dialogue structure and inter-utterance dependency increased performance, and that using lemmas rather than raw words increases the accuracy of classifiers. Haponchyk et al. [22] make use of labeled datasets to propose a supervised clustering methodology to identify user intents, reporting improved results when comparing their approach against a semantic classifier.

Deepak et al. [12] also work on supervised clustering, proposing to cluster pairs of questions and their respective answers in order to create and further curate a question archive.

Zhao et al. [58] as well as Aggarwal et al. [2] analyze clustering algorithms for the domain of textual documents. Zhao et al. also compare partitional and agglomerative clustering algorithms and state that partitional algorithms always lead to better solutions, providing higher-quality results while requiring fewer computational resources than agglomerative ones, which makes them the recommended choice for large collections of documents. As for cluster validation, Dudoit et al. [14] make use of the silhouette index in order to estimate the optimal number of clusters. More validation metrics are shown in [48].

Additional tools used in industry exist to identify intents. Microsoft LUIS [57] and Wit.ai [15] are capable of processing conversations and identifying the intents present in a set of utterances, giving the user broader insight into how a conversational bot should be configured to respond to the desired intents.

In the literature reviewed on this domain, there is no clear procedure for transforming and standardizing data as part of a text preprocessing pipeline. In addition, conversational tasks often assume an existing labeled and curated dataset: the target class is removed, a machine learning algorithm is applied to generate a prediction, and the prediction is compared with the actual target to measure the performance of the classification or clustering model.

The scope of our research does not assume a label for each utterance. Instead, it proposes a framework to discover the possible intents contained in the utterances, starting from the transformation of unstructured data into a tabular dataset and ending with the application of an unsupervised machine learning technique to establish clusters of similar utterances. This helps lower the cost of manually labeling data and provides insights into the discovery of intents in conversational data in the domain of academic online chat services, specifically from the admissions department.


Chapter 3 Intent Discovery from Conversational Logs

This chapter comprises the methodology followed during the development of this research, which covers its definition, data collection, data preprocessing, data analysis, data preparation, modeling and evaluation.

3.1 Proposed Solution

The proposed framework allows the content of the knowledge base to be maintained so that it stays up to date as new, unknown questions are asked of the chatbot. The framework is shown in Figure 3.1, which represents its general concepts as they apply to intents.

Figure 3.1: General framework for discovering intents

1. Data Preprocessing: the conversation dataset goes through a data preprocessing pipeline consisting of two main steps: transforming the data to a log format (which allows each dialog of the conversations to be separated), and then cleaning the text in order to prepare it by removing noise and standardizing its format so that it can be converted to a viable word representation.

2. Word Representation: after the conversations have gone through the data preprocessing pipeline, a viable word representation is chosen for use in machine learning algorithms.

3. Intent Discovery: the selected word representation is then used for the application of a machine learning method. In this case we make use of unsupervised techniques such as clustering.

4. Intent Mapping: each of the clusters is then mapped against existing intents by using a similarity measure. In this research, the collection of options offered to the user in the menu-based chatbot implementation is taken to represent the possible existing intents in the dialogues that preceded that implementation.

5. Intent Evaluation: in this step, the clusters and their similarities to existing intents are evaluated. A sample of dialogues from each cluster is extracted and the intent mapping is validated, stating whether the mapping is correctly aligned with the intent or not. For clusters that are not aligned, this step determines whether the cluster can be categorized as a new intent; if so, it is added to the intent collection. It is also possible to define a similarity threshold to automate the decision of whether an intent is aligned, by calculating the mean similarity of those clusters correctly aligned with their intents.
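Steps 4 and 5 can be sketched as follows. This is a minimal illustration that assumes cosine similarity over raw token counts as the similarity measure, which this section does not specify; the clusters, intent names, and validation flags are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_clusters(clusters, intents):
    """Step 4: map each cluster to its most similar existing intent."""
    mapping = {}
    for cluster_id, utterances in clusters.items():
        cluster_vec = Counter(tok for utt in utterances for tok in utt)
        scored = {name: cosine(cluster_vec, Counter(tokens))
                  for name, tokens in intents.items()}
        best = max(scored, key=scored.get)
        mapping[cluster_id] = (best, scored[best])
    return mapping


def similarity_threshold(mapping, validated):
    """Step 5: mean similarity of the clusters manually validated as aligned."""
    scores = [mapping[cid][1] for cid, aligned in validated.items() if aligned]
    return sum(scores) / len(scores) if scores else None


# Hypothetical tokenized clusters and menu-derived intents.
clusters = {
    0: [["costo", "colegiatura"], ["costo", "beca"]],
    1: [["fecha", "examen"], ["examen", "admision"]],
}
intents = {
    "costs": ["costo", "colegiatura", "beca"],
    "exam": ["examen", "admision", "fecha"],
}
mapping = map_clusters(clusters, intents)
threshold = similarity_threshold(mapping, validated={0: True, 1: True})
```

Clusters whose best similarity falls below the resulting threshold would be the candidates to review as possible new intents.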

Steps 4 and 5 are two-way since updating the structure of the intents modifies the similarities between the utterances and the intents themselves. If an intent could be defined in a more suitable way, it may describe better the utterances that are to be similar to it.

The process is also iterative in the sense that new conversations are to be analyzed further in time. Intents are prone to be modified as users have new questions about services that integrate into the organization, and also intents may change in their own structure in order to specify a variant of an intent that would no longer be of use.

The final result of each iteration over our proposed framework is a set of intents accompanied by the examples that trigger them. These serve to configure and train a conversational bot, which receives a set of intents along with the examples associated with each intent.

3.2 Methodology

In this section, we explain the research methodology carried out to develop this research. A methodology helps us define a road map in order to fulfill a goal. The steps of the research methodology are explained in this section.


3.2.1 Methodology Definition

In order to reach the completion of this research, the methodology, which is shown in Figure 3.2, is composed as follows:

Figure 3.2: Research methodology

• Data Collection: chat conversations are retrieved from the admissions department of ITESM for them to be treated.

• Data Preprocessing: textual data needs to be parsed to a structural form. We follow the pre-processing data pipeline stated in Section 2.2.

• Data Analysis: before modeling the data with natural language processing methods, we first need to understand the behavior of the resulting cleansed dataset by visualizing descriptive statistics and defining a story that explains what the data represent.

• Data Preparation: after the analysis, we can move further into changing the structured form of the data into a representation that can serve as input to a natural language processing model.

• Application of Machine Learning Methods: during this phase, natural language processing methods are applied to the chat logs in their respective representation form in order to obtain the desired results, which are the intents represented in the texts.

• Evaluation: this phase consists of manually verifying a sample of the results from the modeling phase. This allows the performance of the model to be generalized and supports the explanation of the model.

3.2.2 Data Collection

The dataset utilized for this research was provided by the admissions department of Tecnológico de Monterrey. The data was acquired in two batches. At the very beginning of this research, we had access to the first batch, covering January 2019 to August 2019. Later, in 2020, we gained access to the second batch of conversations, covering September 2019 to December 2019, thus completing the full history of conversations that took place in 2019. First insights into the data revealed that in July 2019 the admissions department started deploying a menu-based chatbot in its online chat service. The bot initially provided intermittent service alongside the regular work of the agents, until December 2019, when the conversations were attended only by the bot.

The original data consisted of monthly reports, each in a comma-separated values (CSV) file. The columns of these files are described in Table 3.1, covering technical specifications of the user device as well as date and time, along with the transcript of the conversation in a single string. The transcript format varied over time; thus, data preprocessing was a challenging task.

Feature Name | Data Type | Brief Description
Case number | Numeric | Non-unique identification number for specific prospect's tickets
Chat transcript name | Numeric | Unique identification number for the conversation
Chat visitor number | Numeric | Unique identification number of a prospect given the login-to-chat information
Visitor IP address | String | Hashed IP address of prospect when connected to chat
Maximum response time agent | Numeric | Maximum time in seconds which an agent took to respond
Maximum response time visitor | Numeric | Maximum time in seconds which a prospect took to respond
Messages total visitor | Numeric | Number of messages sent by the prospect
Messages total agent | Numeric | Number of messages sent by the agent
Ended by | String | Indicates who ended the conversation (sent the last message)
Chat button | String | Indicates from which website the prospect accessed the chat, representing the respective department
Location | String | Prospect's location when connected to chat
Average response time agent | Numeric | Average response time by an agent
Site reference | String | Unused field
Screen resolution | String | Screen resolution of prospect's device when connected to chat
ISP | String | Name of prospect's Internet Service Provider (ISP) when connected to chat
Platform | String | The operating system used in the prospect's device when connected to chat
Browser | String | Web browser used by the prospect when connected to chat
Chat duration | Numeric | Amount in seconds of the conversation total duration
Abandoned after | Numeric | Unused field
Creation date | Date | Date in which the conversation took place
Init date | Time | Time of the day in which the conversation took place
Last modification date | Date | Date in which if any field in conversation was modified
Body | String | Whole conversation transcript
Owner | Numeric | Name of agent attending the conversation (states if attended by bot)

Table 3.1: Features of reports regarding conversations

We also had access to the chatbot implemented during mid-year of 2019. The structure of the menus provided by the bot to the user extends up to 5 levels in depth. Each group leads to more specific information about a subject.

3.2.3 Data Preprocessing

This step can be broken down into two major phases: first, the transformation from raw texts into a log format, and second, the cleaning of the text itself.

Transformation to Log Format

An anonymized example of the original raw string covering the whole transcript of a conversation:

Chat has begun: Friday, January 1, 2019, 12:12:12 (-0500) Chat origin: SOAD Agent One ( 0s ) Agent One: Hi, how can I help you? ( 42s ) Prospect Two: Hi, thanks for your support. ( 1m 7s ) Agent One: Bye.

As can be seen, a series of regular expressions can help identify the groups into which a transcript can be broken down. Once the regular expressions are set, we can convert the transcript into a data frame with the columns shown in Table 3.2:

Feature Name | Data Type | Brief Description
CaseID | Numeric | Non-unique representation for specific prospect's tickets
ConversationID | Numeric | A unique number indicating the transcript identifier
Sequence | Numeric | A sequence number of the message corresponding to a conversation
Department | String | Department from admissions agent
DateAdjusted | Datetime | Date and time in which the message was sent
Timezone | String | The timezone of the conversation
Emitter | String | Represents who sends a specific message. It can be SYS (System), AGENTE (Agent), PROSPECTO (Prospect), or TECBOT
Body | String | Message sent by the sender
BeginsTecBotBool | Boolean | Indicates if the chat was first attended by bot
TransferBool | Boolean | Indicates if the corresponding message indicates a transfer (either from agent to an agent, or from bot to an agent)
TransferTecBotBool | Boolean | Indicates false until a conversation with TecBot requests for a transfer to an agent, all messages after such transfer will be marked as true
MessageNotUnderstoodTecBotBool | Boolean | Indicates if such message will be such that TecBot will return an automated message indicating it was not able to understand a prompt (natural language)
MessageNotUnderstoodPromptTecBotBool | Boolean | Indicates that TecBot responded with automated message indicating it was not able to understand a prompt (natural language)

Table 3.2: Features of log format conversations
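The transformation to log format can be sketched with regular expressions applied to the anonymized example transcript. This is only an illustration under stated assumptions: the patterns are inferred from that single example and are not the expressions used in this work, the hour unit in the time offset is assumed, the emitter classification by name prefix is hypothetical (real transcripts contain actual names), and only a subset of the Table 3.2 columns is produced.

```python
import re

# Patterns inferred from the single anonymized example; real transcripts varied.
HEADER = re.compile(
    r"Chat has begun:\s*(?P<date>.+?)\s*\((?P<tz>[+-]\d{4})\)\s*"
    r"Chat origin:\s*(?P<dept>\S+)"
)
OFFSET = r"(?:\d+h\s*)?(?:\d+m\s*)?\d+s"  # e.g. "0s", "42s", "1m 7s" (hours assumed)
MESSAGE = re.compile(
    r"\(\s*(?P<offset>" + OFFSET + r")\s*\)\s*"
    r"(?P<sender>[^:()]+?):\s*"
    r"(?P<body>.*?)(?=\s*\(\s*" + OFFSET + r"\s*\)|$)",
    re.DOTALL,
)
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def offset_to_seconds(offset):
    """Convert an offset such as '1m 7s' into a number of seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", offset)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def classify_emitter(sender):
    """Hypothetical rule: classify the sender by the anonymized name prefix."""
    name = sender.lower()
    if name.startswith("agent"):
        return "AGENTE"
    if name.startswith("prospect"):
        return "PROSPECTO"
    if "tecbot" in name:
        return "TECBOT"
    return "SYS"


def transcript_to_rows(transcript, conversation_id):
    """Split one raw transcript string into one row (dict) per message."""
    header = HEADER.search(transcript)
    if header is None:
        raise ValueError("transcript header not recognized")
    rows = []
    for sequence, match in enumerate(MESSAGE.finditer(transcript), start=1):
        rows.append({
            "ConversationID": conversation_id,
            "Sequence": sequence,
            "Department": header.group("dept"),
            "Timezone": header.group("tz"),
            "OffsetSeconds": offset_to_seconds(match.group("offset")),
            "Emitter": classify_emitter(match.group("sender")),
            "Body": match.group("body").strip(),
        })
    return rows


RAW = ("Chat has begun: Friday, January 1, 2019, 12:12:12 (-0500) "
       "Chat origin: SOAD Agent One ( 0s ) Agent One: Hi, how can I help you? "
       "( 42s ) Prospect Two: Hi, thanks for your support. "
       "( 1m 7s ) Agent One: Bye.")
rows = transcript_to_rows(RAW, conversation_id=1)
```

The resulting list of dictionaries can be loaded directly into a data frame, with one row per message.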

Text Preprocessing

In order to process raw textual data into a machine readable format, the following are the steps carried out during the preprocessing phase for the presented work:

• Load unstructured data: text data can be found in many formats, such as comma separated values (CSVs), JSON, or distributed in folders containing text files. The reading of such data can be made through direct read functions or by creating custom ones that follow a specific structure depending on how the documents are stored [47].

• Creating the corpus: a corpus is the collection of related documents containing natural language [4]. In order to clean and produce a structured dataset from the documents, it is important to work with a corpus object that contains the whole set of documents to process.

• Pre-processing Pipeline: in Practical Natural Language Processing [53], Vajjala et al. denote the following pre-processing steps:

– Sentence segmentation and tokenization: every document is presented as a vector. Each element represents a word in the sentences found in such a document.


– Text normalization: characters must be standardized into a specific format. In this work, characters were set to UTF-8, accents were removed, and each word was transformed to lowercase. Further processing and replacements help by reducing many words or character sequences to a single idea. For example, differently written URLs can be standardized to indicate that the word in question is a URL. The same applies to phone numbers and identification numbers. The benefit can be seen in the creation of a document-term matrix, where, e.g., multiple phone numbers can be represented by the single term phone number. Regular expressions are an important component as well, as they can aid in identifying key patterns, such as dates, specific input messages, and places, among other principal aspects of the given corpus.

– Stemming and lemmatization: the former refers to removing suffixes and reducing a word to a single base form that can represent all variants of the word. The latter replaces each token with its equivalent lexeme [18]. This removes noise and reduces the variety of words that belong to a single root as found in a dictionary.

In order to clean the respective text column (Cuerpo), we followed this text preprocessing pipeline:

1. Lowercasing: change all textual data to lowercase

2. Word segmentation: also called tokenization, split each of the messages into lists, where each word is an element of such list

3. Encoding standardization: remove all characters that fall outside the alphabet, transforming them into their equivalent alphabetical representation where possible, e.g., turning é into e

4. String normalization: replace a variety of strings that can be generalized by a single form.

• Names of Agents

• Names of Prospects

• Numbers

• Student IDs

• General structure (start, transfer)

• Emails

• URLs

5. Stop word removal: removing words that do not provide any information nor alter the context if removed

6. Lemmatization: changing the form of a word to its equivalent lexeme

This allowed creating a dataset containing the clean texts ready to be passed later to more advanced techniques to be used in machine learning algorithms.
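The cleaning pipeline above can be sketched with the Python standard library. Several parts are assumptions made for illustration: the stop-word list is a tiny hypothetical sample, the placeholder tokens (email, url, studentid, number) are invented names, and the student ID format (the letter "a" followed by eight digits) is assumed. Normalization of agent and prospect names and lemmatization are omitted because they require the name lists and a lemmatizer for the language, which are not shown here.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny Spanish stop-word list.
STOP_WORDS = {"el", "la", "de", "que", "y", "a", "al", "en", "es", "mi", "me"}

EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL = re.compile(r"^(https?://|www\.)\S+$")
STUDENT_ID = re.compile(r"^a\d{8}$")  # assumed ID format, after lowercasing
NUMBER = re.compile(r"^\d+$")
PUNCTUATION = ".,;:!?¿¡()\"'"


def strip_accents(token):
    """Encoding standardization: turn accented letters into plain ones."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token):
    """String normalization: replace varied strings with one placeholder."""
    if EMAIL.match(token):
        return "email"
    if URL.match(token):
        return "url"
    token = token.strip(PUNCTUATION)
    if STUDENT_ID.match(token):
        return "studentid"
    if NUMBER.match(token):
        return "number"
    return token


def preprocess(message):
    """Lowercase, tokenize, standardize, normalize, and drop stop words."""
    tokens = message.lower().split()                  # steps 1 and 2
    tokens = [strip_accents(t) for t in tokens]       # step 3
    tokens = [normalize_token(t) for t in tokens]     # step 4
    return [t for t in tokens if t and t not in STOP_WORDS]  # step 5


clean = preprocess(
    "Hola, ¿cuál es el costo de la colegiatura? Mi correo es ana@example.com"
)
```

Whitespace tokenization is used only to keep the sketch short; a tokenizer from an NLP library would handle punctuation and contractions more robustly.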


3.2.4 Data Analysis

Before continuing with machine learning processes, we first need to understand the data. So far, we have prepared the dataset to be used in text mining and natural language processing modules. We now present a graphical analysis of the conversations.

The analysis consists of exploring the data found in the conversational logs to discover the contents of the writing and the length of the messages sent by the prospects. Graphical visualizations are shown along with their respective statistics for a deeper understanding.

Understanding the data helps in realizing what tasks and preparation are to be done in order to move a step further into the modeling section. The conversational nature of the data collected requires a deeper understanding and a tighter scope in order to align its structure with the stated objectives.

General Analysis

The total number of conversations covered in the year 2019 was 65,200, representing 1,281,222 interactions between users, agents, and the menu-based bot.

The conversation data included features such as the location from which the user connected to the online chat service. Figure 3.3 shows the 25 countries with the most visits, excluding Mexico (91%) and the United States (3.13%) due to their disproportionately high number of users. The graph highlights a predominant presence of Latin American countries, which account for 14 of the 25 countries shown.

Figure 3.3: Top 25 countries from users

As previously stated, Mexico accounts for the majority of conversations. Figure 3.4 highlights the Mexican states from which most users contact the online admissions chat service of Tecnológico de Monterrey. Distrito Federal (currently Mexico City) (22.42%), Nuevo León (17.74%), Jalisco (5.43%), and Chihuahua (4.45%) cover half of the conversations attended by the chat service.

Figure 3.4: Top 25 Mexican states from users

Turning to the conversations themselves, Figure 3.5 shows the difference in the number of conversations by department. SOAD (Solicitud de Admisión) accounts for the majority of the online chat service's activity with 42,079 conversations, followed by Admision Profesional (10,553) and Admision Preparatoria (6,334).


Figure 3.5: Messages by department

Figure 3.6 reveals a seasonality. The year begins with January containing 6,341 messages, and maintains a decreasing trend until June, July, and August, which hold similar message volumes, with August being the lowest. From then until November, there is an increasing trend, peaking at 8,966 conversations in November, and the year ends with December having 5,840 conversations.

Figure 3.6: Messages by month

The graph shown in Figure 3.7 reveals a decreasing trend from the beginning of the week until the end of it, with a lower demand for conversations in the service beginning on Fridays with a further decrease on weekends.


Figure 3.7: Messages by day of week

Analyzing the conversations by hour of the day, as shown in Figure 3.8, it is possible to note the work schedule from 8 AM to 9 PM, with the busiest hours at 4 and 5 PM. The number of conversations then decreases sharply outside the working schedule (10 PM to 7 AM).

Figure 3.8: Messages by hour

Tecbot was first deployed in July. Figure 3.9 shows Tecbot's participation in terms of average chat duration. In November, the peak month seen in Figure 3.6, Tecbot accounted for 25% of the total duration of chats.

Figure 3.9: Tecbot conversation duration in contrast to all conversations

Figure 3.10 shows the participation of agents and Tecbot in terms of average message counts. Tecbot consistently sends about 10 to 11 messages per conversation, while the number of messages that agents send to users shows a decreasing trend.

Figure 3.10: Average messages sent by an agent and Tecbot per conversation


3.2.5 Data Preparation

The conversational data is heterogeneous. During the first part of the year, the chats took place between agents and users. At mid-year, the admissions department began deploying the menu-based bot, but it was not until December that the bot covered all incoming conversations.

In addition, the mix of multiple departments meant different chatbot designs and slightly different structures, so the way the transcript was represented changed constantly.

To make the data workable, it had to be narrowed down to a set of conversations that would serve the purpose of this research.

3.2.6 Model

Once the desired collection of messages has been cleaned, we can move on to modeling: applying a machine learning algorithm whose output fulfills this research's objective. In this phase, we use the techniques described in Section 3.3.

For word representations, we implemented two kinds, Bag-of-Words and TF-IDF, the latter with an extra parameter for more specificity:

• Bag-of-Words: minimum word appearance (min_df) = 2, maximum word appearance (max_df) = 0.95 (a decimal number between 0 and 1 is interpreted as a proportion of documents), and maximum number of features to extract (max_features) = 500.

• TF-IDF: minimum word appearance (min_df) = 2, maximum word appearance (max_df) = 0.95 (interpreted as above), and maximum number of features to extract (max_features) = 500, plus another parameter indicating the n-gram sizes to consider in the analysis (ngram_range) = (1, 3).

After building the word representations, and because both approaches suffer from sparsity, we apply a dimensionality reduction technique known as Singular Value Decomposition (SVD). Given a target percentage of variance to retain, we iterate over the number of components, starting from n − 1 (where n is the number of features) and decreasing by one, computing the explained variance at each iteration until the desired percentage is reached. A target of 90% was set for this step. The step was included because the number of features drives the computational cost of a machine learning algorithm operating on such a sparse matrix. Since computational resources are limited, the 90% variance target was chosen to retain as many words and as much information as possible while significantly reducing the size of the matrix.
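The downward search over the number of components can be sketched as follows. This is an illustration under stated assumptions, not the exact code used in this work: it relies on scikit-learn's `TruncatedSVD`, the helper name `components_for_variance` is hypothetical, and the input is a small synthetic matrix rather than the real term matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD


def components_for_variance(matrix, target=0.90, random_state=0):
    """Search downward from n_features - 1 components and return (k, explained)
    for the smallest k whose summed explained variance ratio is still >= target.
    Returns None if even n_features - 1 components fall short of the target."""
    n_features = matrix.shape[1]
    best = None
    for k in range(n_features - 1, 0, -1):
        svd = TruncatedSVD(n_components=k, random_state=random_state)
        svd.fit(matrix)
        explained = float(svd.explained_variance_ratio_.sum())
        if explained < target:
            break  # one step too far: keep the previous k
        best = (k, explained)
    return best


# Synthetic non-negative matrix (60 documents x 12 features) with low-rank
# structure plus a little noise; it only stands in for a real term matrix.
rng = np.random.default_rng(0)
dense = rng.random((60, 3)) @ rng.random((3, 12)) + 0.01 * rng.random((60, 12))
X = csr_matrix(dense)

result = components_for_variance(X, target=0.90)
print(result)
```

Refitting the decomposition at every step is costly on a large matrix. A cheaper variant would fit once with n − 1 components and accumulate `explained_variance_ratio_`, at the price of departing from the iteration described above.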

For topic modeling, we implemented Non-Negative Matrix Factorization (NMF) with both of the variants that the scikit-learn library offers: Frobenius norm and Kullback-Leibler divergence. We analyzed these variants for each of the two word representations with the number of topics set to ten. This value was selected arbitrarily after experiments aimed at understanding the contents of the topics, since no measurement is available for NMF to evaluate how good or bad a selected number of topics is; ten topics gave an understandable and broad insight into the contents, but nothing beyond the surface.
