Instituto Tecnológico y de Estudios Superiores de Monterrey

Monterrey Campus

School of Engineering and Sciences

Comparing Databases for the Prediction of Student’s Academic Performance using Data Science on the Novel Educational Model Tec21

at Tecnológico de Monterrey

A thesis presented by

Miguel Andrés Lara Castor

Submitted to the

School of Engineering and Sciences

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

Monterrey, Nuevo León, June, 2021


Instituto Tecnológico y de Estudios Superiores de Monterrey

Campus Monterrey

The committee members hereby certify that they have read the thesis presented by Miguel Andrés Lara Castor and that it is fully adequate in scope and quality as a partial requirement for the degree of Master of Science in Computer Sciences.

PhD. Neil Hernández Gress, Tecnológico de Monterrey, Principal Advisor

PhD. Héctor G. Ceballos Cancino, Tecnológico de Monterrey, Co-Advisor

PhD. Sara Elena Garza Villarreal, Universidad Autónoma de Nuevo León, Committee Member

PhD. Rafael Batres-Prieto, Tecnológico de Monterrey, Committee Member

PhD. Rubén Morales Menéndez, Associate Dean of Graduate Studies, School of Engineering and Sciences

Monterrey, Nuevo León, June, 2021


Declaration of Authorship

I, Miguel Andrés Lara Castor, declare that this thesis titled, Comparing Databases for the Prediction of Student's Academic Performance using Data Science on the Novel Educational Model Tec21 at Tecnológico de Monterrey, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Miguel Andrés Lara Castor
Monterrey, Nuevo León, June, 2021

©2021 by Miguel Andrés Lara Castor. All Rights Reserved.


Dedication

To my mother and father, who supported me since the beginning of this journey. I remember the day I shared with them my decision to go to Monterrey to pursue my master's degree. At the beginning they were shocked because I would be far from home, but they accepted my decision and told me that they were very happy. In the end, I came back home to finish my studies.

To my sister and brother, who supported me when I needed help while I was struggling to live in a new city and learning things from scratch. I knew that if I had a problem, I could make a video call and they would listen to me and make me laugh.

Finally, I would like to dedicate this thesis work to Tec de Monterrey. I believe that this work could help the students of this magnificent institution to improve their academic performance.


Acknowledgements

I would like to express my gratitude to my advisors, PhD. Neil Hernández Gress and PhD. Héctor G. Ceballos Cancino, for the feedback they provided during the time I worked on this thesis.

Also, I would like to thank PhD. Rafael Batres-Prieto and José Enrique Montemayor Gallegos, who kindly agreed to listen to my thesis presentation and provided feedback.

In addition, I would like to thank Tec de Monterrey's Admissions department for sharing the data used in this thesis.

I extend my gratitude to my friends and colleagues Andrée Vela, Rolando Treviño, Emmanuel Vázquez and Raúl Martínez for their help, advice, explanations, laughs and deep conversations about our future after the master's degree.

I owe a deep sense of gratitude to my parents, Pedro and Laura, and my sister and brother, Laura and Mauricio, for their constant encouragement and strong support throughout this great period of my life.

Finally, I am extremely thankful to Tecnológico de Monterrey for its support with tuition and to CONACyT for its living support.


Comparing Databases for the Prediction of Student's Academic Performance using Data Science on the Novel Educational Model Tec21 at Tecnológico de Monterrey

by Miguel Andrés Lara Castor

Abstract

Many studies have been carried out on the prediction of student academic performance using Data Science. Students with poor academic performance, as well as dropout students, have a huge impact on the graduation rates, reputation, and finances of an educational institution.

These studies take advantage of the digitization of the students' admission and academic data and of increasing computational power. Since August 2019, however, Tecnológico de Monterrey has predicted academic performance using entrance tests called Initial Evaluations. Unfortunately, the Initial Evaluations did not provide useful predictions for the students of the fall semester of 2019. Therefore, this study aimed to compare the Initial Evaluations and the admissions data, using Data Science models to predict student academic performance. The admission data was composed of five databases: Initial Evaluations, Emotions, Curriculum, Admission Exam and Grades of the first semester. A methodology similar to the Cross Industry Standard Process for Data Mining was used to compare the models based on admission data against the models based only on Initial Evaluations. A large number of experiments were carried out combining different admissions data, feature reduction techniques and classification models.

The experiments showed that the models based on admission data predict student academic performance with higher accuracy than the models based only on Initial Evaluations.

Nevertheless, some variables of the Initial Evaluations were relevant to the models based on admission data. Moreover, the accuracy of the experiments was in the range of the results reported by the related studies. The results of this study indicate that the Initial Evaluations provide useful information for the prediction of student academic performance in the domain of Data Science.


List of Figures

2.1 List of Objectives
2.2 List of Problems
2.3 List of Variables
2.4 Outliers using boxplot [35]
2.5 Comparison of data distributions [35]
2.6 Decision Boundary [84]
2.7 Logistic regression's estimate of class probability as a function of f(x) [84]
2.8 Maximal margin classifier [84]
2.9 Euclidean distance [84]
2.10 Classification made by KNN [84]
2.11 Decision Tree example [84]
2.12 Feature importance with Information Gain [84]
2.13 Boosting framework [18]
2.14 Example of Conditional Probability [63]
2.15 PCA example with two variables [20]
2.16 Confusion Matrix for two classes [84]
3.1 CRISP Methodology [84]
3.2 Proposed methodology
3.4 Description of the continuous data of the Initial Evaluation's database
3.5 Boxplot of continuous variables of the Initial Evaluation's database
3.6 Correlation between variables of the Initial Evaluation's database
3.7 Summary of the PAA's database
3.8 Description of the continuous data of the PAA's database
3.9 Boxplot of continuous variables of the PAA's database
3.10 Correlation between variables of the PAA's database
3.11 Summary of the Curriculum's database
3.12 Description of the continuous data of the Curriculum's database
3.13 Barplot with CurrSportsord and CurrCultureord
3.14 Barplot with CurrStudentClubord and CurrCommServiceord
3.15 Barplot with CurrLeaderord and CurrWorkord
3.16 Barplot with CurrAcaAwardsord and Internationalord
3.17 Summary of the Personal Essays' database
3.18 Summary of the High school GPA and grades of the first semester courses' database
3.19 Description of the continuous data of the High school GPA and grades of the first semester courses' database
3.20 Boxplot of continuous variables of the High school GPA and grades of the first semester courses' database
3.21 Correlation between variables of the High school GPA and grades of the first semester courses' database
3.22 Summary of Emotions' database
3.23 Description of the continuous data of Emotions' database
3.24 Boxplot of continuous variables of Emotions' database
3.25 Correlation between variables of Emotions' database
3.26 Distribution of average score for each area of study
4.1 Engineering's results averaging databases
4.2 Engineering PCA's results averaging databases
4.3 Business' results averaging databases
4.4 Business PCA's results averaging databases
4.5 Creative Studies' results averaging databases
4.6 Creative Studies PCA's results averaging databases
4.7 Health's results averaging databases
4.8 Health PCA's results averaging databases
4.9 Social Sciences' results averaging databases
4.10 Social Sciences PCA's results averaging databases
4.11 Built Environment's results averaging databases
4.12 Built Environment PCA's results averaging databases
4.13 SHAP values for a class 0 (average score below the median) Built Environment student
4.14 SHAP values for a class 1 (average score above the median) Built Environment student
4.15 SHAP values for a class 0 (average score below the median) Social Sciences student
4.16 SHAP values for a class 1 (average score above the median) Social Sciences student
4.17 3-D plot of Engineering's PCA
4.18 3-D plot of Business' PCA
4.19 3-D plot of Creative Studies' PCA
4.20 3-D plot of Health's PCA
4.21 3-D plot of Social Sciences' PCA
A.1 A student
A.2 Tecnológico de Monterrey
A.3 Admission's department
A.4 Application
A.5 Machine Learning models
A.6 Naive Bayes models
A.7 Positive academic outcome
A.8 Negative academic outcome
A.9 SHAP values interpretation

List of Tables

2.1 Studies related to this research
3.1 Variables used for each area of study
3.2 Variables removed by the correlation process
4.1 Multiplicity of good models for each area of study
4.2 Binary Classification results
4.3 Precision, Recall and Accuracy
4.4 Variable reduction
4.5 Engineering variable's coefficients
4.6 Business variable's coefficients
4.7 Creative Studies variable's coefficients
4.8 Health variable's coefficients
5.1 Type of classes and definition
5.2 Range of Accuracy
5.3 Instances
5.4 Type of Information
5.5 Accuracy using only information of Admissions
5.6 Models with highest accuracy
5.7 Most Relevant Variables

Contents

Abstract

List of Figures

List of Tables

1 Introduction
1.1 Problem Definition
1.2 Hypothesis and Research Questions
1.2.1 Research questions
1.3 Objectives
1.4 Structure of the document
1.5 Summary

2 State of the Art
2.1 Student's Academic Performance
2.1.1 Overview
2.1.2 Related Work
2.2 Theoretical Framework
2.2.1 Statistics
2.2.2 Machine Learning
2.3 Summary

3 Development
3.1 Methodology
3.1.1 Problem Understanding
3.1.2 Data Understanding
3.1.3 Data Preparation
3.1.4 Modeling
3.1.5 Evaluation
3.1.6 Interpretation
3.2 Summary

4 Results
4.1 Selection of the best experiments
4.2 Accuracy and Statistical Evaluation
4.3 Variables
4.3.1 Variable Reduction
4.3.2 Most Relevant Variables
4.4 Principal Component Analysis using 3-D plots
4.5 Summary

5 Discussion
5.1 Analysis of the results
5.2 Comparison to the related work
5.3 Answers to the research questions
5.4 Summary

6 Conclusion
6.1 Robustness of Results
6.2 Contributions
6.3 Limitations
6.4 Future Work
6.5 Summary

A Use case

Bibliography

Chapter 1

Introduction

In the past decades, many studies have been carried out to understand student academic performance. Students with low academic scores and failed courses, as well as dropout students, have a huge impact on the graduation rates, reputation and finances of a university [33]. In addition, the current environment of higher education has become more competitive, and universities have to design strategies to understand their strengths and weaknesses [59] [47].

The current approach to analyzing student academic performance is Data Science. The availability of more computational power and the digitization of information have opened the opportunity to analyze and extract value from data. Moreover, there is a recent field called Educational Data Mining (EDM) which uses statistics, machine learning and data mining algorithms [88]. This field uses different types of data with the objective of addressing educational research issues [88]. Yet despite the considerable advantages of using Data Science, many higher education institutions have not been able to adopt it [73].

The use of e-learning courses, designed to make the learning process more flexible and adaptive, is increasing [112]. More specifically, Learning Management Systems (LMS) host educational material, such as e-learning courses, and pose significant challenges for exploiting learning analytics (a field related to EDM) [100]. Beyond e-learning courses, academic processes and student data are also being digitized into electronic form [13].

One of the biggest challenges in this field of study is to understand and learn from data. The availability of data allows different approaches when analyzing an educational institution, such as analyzing the courses, the students, or the teachers. Moreover, the granularity of approaches for the analysis of students is wide: the student can be analyzed for a particular course, a degree or a particular period of time [92] [13]. In general terms, the most common approach has been the identification of students at risk of dropout or course failure [100].


Miguéis et al. propose to predict the performance of students at the end of the academic degree [73]. Burgos et al. propose the use of binary classification models to identify dropout students in e-learning courses [23]. Sandoval et al. use academic data and information from LMS to predict dropouts in face-to-face courses with large numbers of enrolled students [92]. Hoffait et al. propose to use 3 different binary models to identify students at risk of failing the first year [50].

Furthermore, Delen proposes the use of binary classification models and the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology to identify dropout students at the end of the first year of the bachelor's degree. He found that a balanced dataset and Support Vector Machines provide higher accuracy [33]. Helal et al. also seek to identify student academic failure at the end of the first year, but using enrolment data and activity data from LMS. They found that separating populations of students yields higher accuracy [47]. Although Kabakchieva pursues the same objective, she used an unbalanced dataset with multiple classes and obtained low accuracy [59].

Although the prediction of student academic performance has been analyzed using Data Science models and numerous types of databases, such as admissions data, at different educational institutions, Tecnológico de Monterrey has been doing it using only a set of entrance exams called Initial Evaluations. Hence, this thesis sought to compare the Initial Evaluations and admissions data using Data Science models to predict student academic performance.

This study presents the comparison between the models based on admission data and the models based only on Initial Evaluations to predict the academic performance of a cohort of first-semester students at Tecnológico de Monterrey, using a methodology based on CRISP-DM. The predictions aim to identify out-performing and under-performing students. This thesis is structured as follows: this chapter elaborates on the problem, hypothesis, research questions and objectives. Chapter 2 presents the related work and theoretical framework. Chapter 3 presents the methodology and experiments. Chapter 4 presents the results. Chapter 5 presents the discussion, and Chapter 6 presents the conclusions, contributions, limitations, and future work.

The present study found significant differences between the performance of the models based on admissions data and that of the models based only on Initial Evaluations. Moreover, the range of accuracy of the models using admissions data was similar to that of the related studies. Finally, the study identified the variables that affect the models.

1.1 Problem Definition

In the past few years, Tecnológico de Monterrey developed and deployed a new educational model called Tec21. The model is based on four main pillars: challenge-based learning, flexibility, a memorable university experience and inspiring professors [55].


Currently, every new student admitted to Tecnológico de Monterrey learns under the new teaching model. Students do not start in the bachelor's program they selected; instead, they start in an area of study where different skills and concepts are taught, with the goal of giving them experience across wider areas. The areas of study are Engineering, Business, Creative Studies, Health, Social Sciences and Built Environment.

Consequently, many changes were carried out in different areas and departments of the university. One of these changes was the introduction of the entrance exams called Initial Evaluations. These exams are used to assess the educational level of the students before they select their courses for the semester. The objective of the tests is to predict whether a student will fail any of the courses of the semester. If a student does not pass the Initial Evaluations, the university suggests regularization courses so the student can strengthen the knowledge needed to pass the courses.

Specifically, the Initial Evaluations assess the academic level in Mathematics, Physics, Computer Logic and Chemistry. The evaluations taken depend on the area of study the student chooses. Students from Engineering, Creative Studies and Built Environment take the tests for Mathematics, Physics and Computer Logic. Students from Social Sciences and Business take only the tests for Mathematics and Computer Logic. Students from Health take only the test for Chemistry.

Nevertheless, the Initial Evaluations are not taken by all freshman-year students. The university divides freshman-year students into two groups: the first with students who come from the university's high school system, and the second with students who do not. The analysis of this study focuses on students who do not belong to the high school system, mainly because students within the university's high school system do not go through the same admission process.

Furthermore, the results of the Initial Evaluations carried out in August 2019 showed that a small number of students passed the exams, meaning the students did not have enough academic knowledge to take the first courses, and most of them did not take any regularization courses. Thus, many course failures were expected at the end of the semester. Nevertheless, this did not happen: most of the students passed the semester's courses. In other words, the Initial Evaluations did not provide useful academic information about the students.

This thesis aimed to fill the gap between Tecnológico de Monterrey's methods and the methods presented in studies at other universities to determine student academic performance. Those studies use Data Science models and more information than just a set of exams to predict student academic performance. Thereby, this study compared the Initial Evaluations and admissions data using Data Science models to predict student academic performance.


1.2 Hypothesis and Research Questions

Data Science models based on admissions data (Initial Evaluations, Emotions, Curriculum, Admission Exam and High school GPA) predict student academic performance with higher accuracy, at a 95% confidence level, than Data Science models based only on Initial Evaluations, on the Novel Educational Model Tec21.

1.2.1 Research questions

• How can the admissions data and Initial Evaluations be compared using a methodology based on CRISP-DM?

• What databases, either alone or in combination, provide the highest accuracy?

• What variables affect the experiments with the highest accuracy?

1.3 Objectives

To prove that Data Science models based on admissions data (Initial Evaluations, Emotions, Curriculum, Admission Exam and High school GPA) predict student academic performance with higher accuracy, at a 95% confidence level, than Data Science models based only on Initial Evaluations, on the Novel Educational Model Tec21. The specific objectives are listed below:

1. To compare admissions data and Initial Evaluations using a methodology based on CRISP-DM.

2. To find the databases, either alone or in combination, that provide the highest accuracy.

3. To find the variables that affect the experiments with the highest accuracy.

1.4 Structure of the document

This research study is divided into six chapters. The first chapter presents the introduction, problem, hypothesis, research questions and objectives. The second chapter elaborates on the related work and the theoretical framework. The third chapter shows the development of the thesis, presenting the methodology. The fourth chapter presents the results. The fifth chapter presents the analysis of the results and the discussion. Finally, the sixth chapter presents the conclusion of this research, along with the contributions, limitations, and future work.


1.5 Summary

This first chapter described the foundation of this thesis. The introduction explained the impact of the general problems present in the field of study in which this thesis is situated. It also presented the gap between other studies and Tecnológico de Monterrey regarding the prediction of student academic performance. Finally, the hypothesis, research questions and objectives were presented to state the scope of the study.


Chapter 2

State of the Art

This chapter presents the pillars that support this thesis. The first pillar is the analysis of the studies that have been carried out on the prediction or classification of student academic performance using Data Science models, more accurately called Machine Learning models. The second pillar comprises the concepts of Data Science, which are divided in two: Statistics and Machine Learning.

2.1 Student’s Academic Performance

2.1.1 Overview

Several studies have been carried out on student academic performance in the last decade. The literature research done for this study found 95 articles, all of which were classified based on their stated objective and problem. These studies were retrieved in January 2020 from the Scopus, Science Direct, Web of Science and Google Scholar databases. 80% of the 95 studies were published between 2017 and 2019; the other 20% were published between 2010 and 2016.

Figure 2.1 presents a summary of the objectives found in these research studies with their corresponding frequency. The top three objectives, which together account for 50% of the studies, were the following: "Predict student performance to identify dropout students", "Predict student performance to identify students at risk of not graduating" and "Predict student performance to identify students at risk of failing a course".

Figure 2.1: List of Objectives [3, 81, 6, 33, 64, 47, 13, 92, 73, 58, 59, 93, 31, 46, 16, 4, 83, 85, 19, 11, 34, 105, 116, 2, 41, 51, 50, 12, 60, 86, 94, 23, 24, 115, 43, 15, 114, 72, 14, 5, 22, 65, 71, 38, 32, 67, 62, 106, 107, 70, 80, 56, 66, 68, 111, 7, 40, 49, 99, 29, 1, 97, 87, 91, 42, 76, 37, 27, 28, 54, 77, 52, 61, 79, 78, 98, 44, 53, 8, 101, 10, 25, 30, 39, 90, 89, 96, 69, 57, 66, 108, 100, 103, 104, 102, 110, 113, 9, 26]

Figure 2.2 presents the list of problems for the top 3 objectives previously mentioned. Two of the five problems had the highest frequencies: "Student Retention (financial loss, lower graduation rates, reputation)" and "Weak learning methods & low performance students". The remaining problems were "Student Admission & Selection", "Student Admission & performance" and "Major switching".

Figure 2.2: List of Problems [3, 33, 47, 73, 59, 4, 83, 85, 19, 105, 116, 2, 50, 12, 86, 94, 43, 15, 22, 65, 38, 70, 66, 7, 1, 97, 87, 42, 76, 54, 77, 52, 61, 79, 78, 44, 10, 39, 89, 96, 57, 108, 104, 9, 26]

Consequently, the type of data most used by the studies of the top 3 objectives (50% of the studies) was analyzed. Two types of data were the most repeated: academic scores, either before or during the bachelor's program, and socio-demographic information. The first group of data consists of continuous variables and the second of categorical or labeled variables. The right side of Figure 2.3 presents the list of academic score variables with their corresponding frequency in the studies; the left side presents the socio-demographic variables with their corresponding frequency.

Furthermore, around 15% of the 50% of studies analyzed used only the information obtained during the admission process to make their predictions. This means that most studies sought results using information from the bachelor's program or combining it with admissions data. These two approaches have different meanings: the one that uses only admission data aims to determine student academic performance without knowledge of his or her behavior during the bachelor's semesters, while the other uses information up to a certain point in time to provide a result. The approach selected depends on the objective of the university's management: it could be to detect students prone to fail in the first semester, in the first year, or at another point in time.

Figure 2.3: List of Variables

Additionally, Table 2.1 presents the objectives and algorithms used by some of the studies retrieved. Across the objectives, the main idea found is to identify students at risk of failing a specific year, program or course, or of dropping out. Therefore, one of the implicit objectives of this thesis was to identify students at risk of failing the first semester, which is not very far from the literature. In the case of the algorithms, it can be seen that most of them are for classification, either binary or multi-class; just a few studies used regression algorithms.

The paragraphs above presented the big picture of student academic performance research using Data Science, obtained from 95 research studies. In summary, the studies focus on identifying students who are prone to fail, because they can become dropouts, which has a direct impact on the finances and graduation rates of universities. The data used is obtained from the admission process and the bachelor's semesters, and is mainly divided into two types: academic scores from courses or exams, and socio-demographics. Only a few studies use admissions information alone to provide their results. Finally, the models are diverse, and their output can be multi-class or binary.

2.1.2 Related Work

After the analysis of the big picture, a set of studies was selected to be compared to this study. The selection was based on criteria that allowed this work to make the fairest possible comparison, even though none of the studies has exactly the same characteristics as the approach of this study.

Table 2.1: Studies related to this research.

| Author | Year | Objective | Algorithms* |
|---|---|---|---|
| Helal et al. [47] | 2018 | To identify students at risk of academic failure in the program. | NB, SMO, DT |
| Sandoval et al. [92] | 2018 | To provide a model which uses low-cost variables to predict at-risk students at mid-term. | RF, LR, RLR |
| Miguéis et al. [73] | 2017 | To classify students according to their average and the time taken to conclude the degree. | RF, DT, SVM, NB, BagT, BoosT |
| Asif et al. [13] | 2017 | To predict students' academic achievement at the end of a four-year study programme. | DT |
| Lakkraju et al. [64] | 2015 | To detect students at risk of not finishing High School on time. | RF, AB, LogR, SVM, DT |
| Ahadi et al. [6] | 2015 | To identify high- and low-performing students as early as possible in a programming course to provide better support for them. | NB, BN, AB, DT, RF, DS |
| Jayaprakash et al. [58] | 2014 | To identify students at risk of course failure. | LogR, SVM, DT, NB |
| Kabakchieva [59] | 2013 | To classify university students according to the total university score results based on their pre-university characteristics. | DT, NB, BN, KNN |
| Oyelade et al. [81] | 2010 | To cluster the performance of students based on their average score of one semester. | K-Means |
| Delen [33] | 2010 | To identify the freshmen students who are most likely to drop out after their freshman year. | NN, DT, SVM, LogR |

* RF: Random Forest, AB: Adaboost, LogR: Logistic Regression, SVM: Support Vector Machines, DT: Decision Trees, NB: Naive Bayes, BN: Bayesian Network, LR: Linear Regression, RLR: Robust Linear Regression, BagT: Bagging Trees, BoosT: Boosting Trees, NN: Neural Networks, KNN: K-Nearest Neighbour, RL: Rule Learners, SMO: Sequential Minimal Optimizer.

These criteria were chosen because, as mentioned before, few studies (less than 10% of all the studies reviewed) used only admission data to make their predictions or classifications. Also, the majority of the studies focused on predicting at the end of the first year or at graduation, while this thesis focused on making the predictions at the end of the first semester. In addition, most of the studies aimed to detect dropouts, which yields two classes, though imbalanced ones, because in most cases the number of students who do not drop out is higher than the number who do. In summary, there are multiple ways to frame the prediction of student academic performance, based on the specific approaches defined by the authors. The following list shows the characteristics of the studies selected for comparison with this work.

• Studies focused on making the predictions at the end of the first year, and a few making the predictions at the end of the academic degree.

• Studies using the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology or a similar one.

• Studies using similar models.

• Studies using admission data and data from the semesters.

• Studies making classifications either binary or multiple-class.

The following paragraphs present the main characteristics of these studies: data sources, methodologies, number of instances, models, highest accuracy obtained, most relevant variables and separation of classes.

Delen did a comparative analysis of machine learning techniques to identify dropout students at the end of the first year using the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology. He argued that dropout students usually result in an overall financial loss, lower graduation rates, and an inferior school reputation in the eyes of all stakeholders. With 16,000 students over 5 years, 4 models (Multi-layer Perceptron, Support Vector Machines (SVM), decision trees and logistic regression) and an imbalanced dataset for a binary outcome (either dropout or not), he got a high percentage of correctly predicted YES but a low percentage of correctly predicted NO. He improved the results by using a balanced dataset with 7,000 students, where the best model was SVM with an accuracy of 81%. Furthermore, he was able to slightly improve the results with the use of ensemble models (bagging, boosting and information fusion). He used a combination of admission data and semester grades. Finally, he used a sensitivity analysis of the variables to find the most relevant features: Fall GPA was one of the most relevant, but High school GPA was one of the less important [33].

As well, Kabakchieva aimed to predict student performance with classification algorithms, using J48, JRip, Naive Bayes, BayesNet and K-Nearest Neighbor with CRISP-DM. In this case, she proposed a multi-class classification where the final average score was divided into 5 classes (excellent, very good, good, average and bad). She stated that university management should focus more on the profile of admitted students, becoming aware of the different types and specific characteristics of students based on the received data. In her study she used data from 10,330 students with 5 imbalanced classes based on the university average score of the first year. She obtained a range of accuracy from 52 to 67%, with different performance for each class. She used two types of evaluation methods, the first with 10-fold cross-validation and the second with a data split (2/3 for training and 1/3 for testing). She also used a combination of information from admissions and the semester. In the end she found that the university admission score and the number of failed exams during the first year were some of the most relevant features for the models [59].

Also, Helal et al. present a different work based on the use of subgroups of datasets to provide better accuracy in the predictions. Their methodology included data understanding, data preparation, sub-setting datasets, modeling, and evaluation. This research aimed to predict student academic failure (a binary outcome to detect students failing a course) at the end of the first year. The classes were balanced, but the authors did not mention how. The subdivisions were made from a dataset of 2,648 students containing enrolment data and data from their online learning platform (Moodle). The machine learning methods used were two black-box classifiers (naïve Bayes and the sequential minimal optimizer (SMO) based on SVM) and two white-box classifiers (J48 and JRip). They found that in some cases the subgroups yield better accuracy than the model that uses the complete dataset. They also found that combining the datasets (enrolment and data from the online platform) provides higher accuracy. The range of accuracy obtained was from 61 to 83%, and the model with the highest accuracy was SVM. However, they argued that the black-box techniques were unsuccessful in generating interpretable models for further use, whereas the white-box techniques generate highly comprehensible models [47].

On the other hand, Adekitan also seeks to predict the performance of first-year students in a university using admission data. The methodology used in this study included data description, modeling, and evaluation. He used six classification algorithms (Random Forest, Tree Ensemble, Decision Tree, Naive Bayes, Logistic Regression and Rprop MLP) and two regression models (linear and quadratic regression). He proposed dividing the target variable into quartiles, meaning 4 imbalanced classes. With 1,445 students he tried to predict the cumulative GPA of the first year, getting accuracy between 0.39% and 52%, while the regression models provided R² values of 0.207 and 0.232. He argued that the academic performance of a student in a university is determined by a number of factors, both academic and non-academic [3].

However, Adekitan in a second study focuses on predicting the academic performance at graduation of engineering students, using the same methodology and the same 4 imbalanced classes. He uses a database with 1,841 instances containing the GPA of 3 years of study for each student. He used seven models (Probabilistic Neural Network, Random Forest, Tree Ensemble, Naive Bayes, Decision Tree, Logistic Regression and Linear Regression), where the range of accuracy was from 86 to 89%. The highest accuracy was obtained using Logistic Regression, and the regression model provided an R² of 0.95. The most relevant variables were the GPA of the second and third year. He dramatically improved the accuracy of the models by changing the approach of the prediction, using only information from the bachelor's program instead of from admissions [4].

Finally, Aluko proposed to analyze the performance for a specific major using pre-enrolment requirements and machine learning models. He used a methodology that included data preparation, modeling, and evaluation. He used two classes, where the target variable, the cumulative GPA, was divided in two. He found that prior academic performance in mathematics has the most significant impact on the academic success of architecture students. He used only 102 students and two models (K-Nearest Neighbor (KNN) and Linear Discriminant) with a binary target variable. The best model was KNN, providing an accuracy of 73.3%. However, the small sample size makes it difficult to generalize the outcomes to other students [10].

The studies previously mentioned show both differences and similarities. The main differences are in their objective of predicting student academic performance either at the first year or at the end of the bachelor's degree, in the data sources (information from admissions, from the semesters, or a combination of both) and in the number of instances used. These differences lead to different accuracies in the prediction. On the other hand, the similarities are in the models used and in the fact that the accuracy does not surpass the threshold of 90%.

2.2 Theoretical Framework

The following subsections elaborate on the concepts of Data Science. As mentioned before, they are divided in two: concepts of Statistics and of Machine Learning. The concepts of Statistics are divided into descriptive and inferential. In the case of Machine Learning, the concepts are divided into models, feature reduction techniques, evaluation methods, a model interpretation technique and emotions analysis.

2.2.1 Statistics

The following concepts are divided into two types: those that describe the data and those that aim to draw conclusions about it.

Outliers

An outlier is a value that is far from the main body of the data distribution. These values can have a negative impact on the mean and variance of the data. Depending on the meaning of the data, outliers can be removed to allow the data to approximate a normal distribution.


There are several techniques to identify outliers; this study used the boxplot interquartile range (IQR). A value is classified as an outlier if it lies more than 1.5 IQR beyond the closest quartile. In figure 2.4, some points lie more than 1.5 IQR beyond the upper quartile; these points are outliers [35].

Figure 2.4: Outliers using boxplot [35]
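As a minimal sketch of this rule (the thesis does not show its code; NumPy and the sample scores below are assumptions of this example):

```python
import numpy as np

def iqr_outlier_mask(values):
    """Flag values lying more than 1.5 * IQR beyond the closest quartile."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (values < lower) | (values > upper)

scores = np.array([68, 70, 72, 73, 75, 76, 78, 80, 99])
print(scores[iqr_outlier_mask(scores)])  # -> [99]
```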

Correlation

This is a measure of the relationship between two variables, related to multi-variate analysis. The correlation coefficient ranges from -1 to 1, where 0 means no relationship and values near -1 or 1 mean a strong (negative or positive) relationship. Two variables with strong correlation provide the same variability of information. There are different types of correlation coefficients; in this study only two are used, Pearson's correlation coefficient and Spearman's rank correlation coefficient. The former is used for continuous variables with a normal distribution and the latter for variables without a normal distribution. The formula for Pearson's correlation is given by 2.1 and for Spearman's correlation by 2.2, where x and y are the variables, R(x_i) and R(y_i) are the rankings of the variables and n is the number of instances [48].

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (2.1)

R = 1 - \frac{6 \sum_{i=1}^{n} (R(x_i) - R(y_i))^2}{n(n^2 - 1)} \qquad (2.2)
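A hedged sketch of how both coefficients can be computed, assuming SciPy (not named by the thesis) and synthetic data standing in for two admission variables:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # e.g., a hypothetical admission exam score
y = 0.7 * x + rng.normal(size=200)  # e.g., a hypothetical first-semester grade

r, _ = pearsonr(x, y)     # equation 2.1: assumes continuous, roughly normal data
rho, _ = spearmanr(x, y)  # equation 2.2: rank-based, no normality assumption
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```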

Paired t-student test

This concept is part of inferential statistics. The paired t-student test is used to draw conclusions about the difference in means between two variable distributions. It is paired because of the assumption of dependent distributions, meaning that the subjects analyzed are the same but are observed under two different processes (two different observations). The null hypothesis of the test is that there is no difference between the means; the alternative hypothesis is the opposite. If the p-value of the test is smaller than the significance level, then the null hypothesis is rejected and the alternative accepted. If the p-value is higher than the significance level, then the null hypothesis is not rejected. The significance level is 1 minus the confidence level: for example, if the significance level is 0.05, then the confidence level is 0.95. The confidence level used in this study is 0.95 [35].
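A minimal sketch of the paired test, assuming SciPy's ttest_rel and hypothetical before/after observations of the same eight students:

```python
from scipy.stats import ttest_rel

before = [62, 70, 55, 80, 74, 68, 59, 77]  # first observation per student
after  = [66, 74, 58, 85, 75, 70, 65, 80]  # second observation, same students

t_stat, p_value = ttest_rel(before, after)
if p_value < 0.05:  # significance level = 1 - 0.95 confidence level
    print("Reject the null hypothesis: the means differ.")
else:
    print("Fail to reject the null hypothesis.")
```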

One-way ANOVA

This test is also part of inferential statistics and multi-variate analysis. It is the same as an unpaired t-student test (where the subjects under study are independent) but for more than two distributions. The test is used to determine the difference between populations regarding a certain characteristic; more specifically, the difference in the means is analyzed. The null hypothesis states that there is no difference between the means of the distributions, and the alternative hypothesis states that at least two of the means are different. For this test, the same explanation of the p-value given in the previous subsection applies. Furthermore, to visualize the data distributions, a plot with boxplots is used; see figure 2.5 [35].

Figure 2.5: Comparison of data distributions [35]
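A short sketch, again assuming SciPy; the three score samples are invented for illustration:

```python
from scipy.stats import f_oneway

# Hypothetical average scores for three areas of study
engineering = [78, 82, 75, 90, 85]
business    = [70, 74, 69, 80, 72]
health      = [88, 84, 91, 79, 86]

f_stat, p_value = f_oneway(engineering, business, health)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 -> at least two group means differ
```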

The previously described concepts are used in the data understanding and evaluation phases of the methodology.

2.2.2 Machine Learning

This subsection provides a concrete and concise explanation of the models and methods of Machine Learning used to generate the results of this study. First the models are explained, then the feature reduction techniques, followed by the evaluation methods and finally an interpretation technique.


Logistic Regression

This model uses a linear discriminant. A linear discriminant is based on the use of a line to separate classes. Figure 2.6 presents an example where the x and y axes correspond to variables, and the black dots and white plus symbols represent two classes. The diagonal line is the decision boundary, which helps to identify the classes. A linear discriminant has a linear combination as its decision boundary, where the linear combination is the weighted sum of the variables (equation 2.3). However, for Logistic Regression the output is not a class but the probability of belonging to a class, see equation 2.4, where the probability of not belonging is defined as 1 - p_+(x) and x is the vector of independent variables.

Based on a user-defined threshold, the probabilities generated by the model can be used to produce a binary classification; for example, if the probability is above 0.5 then the class is 1, otherwise the class is 0. The logistic model computes the log-odds of a linear function, where the odds are the probability of occurrence divided by the probability of non-occurrence. The logistic function maps this value into the range from 0 to 1, see figure 2.7. Therefore, the output of Logistic Regression is interpreted as the log-odds of class membership, which can be translated directly into the probability of class membership [84].

Figure 2.6: Decision Boundary [84]

f(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots \qquad (2.3)

p_+(x) = \frac{1}{1 + e^{-f(x)}} \qquad (2.4)


Figure 2.7: Logistic regression’s estimate of class probability as a function of f(x) [84]
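A hedged example of equations 2.3 and 2.4 in practice, assuming scikit-learn (the thesis does not name its library) and synthetic data in place of the admissions variables:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic data standing in for six admissions variables and a binary outcome
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X[:3])[:, 1]  # p+(x): probability of class 1
labels = model.predict(X[:3])             # thresholded at 0.5 by default
print(proba, labels)
print(model.coef_)                        # the weights w1..w6 of equation 2.3
```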

Support Vector Machines

In most cases Support Vector Machines (SVM) are not easy to understand, but in short, SVM is also a linear discriminant whose objective is to separate classes using linear functions. In the best case, two groups of scatter points in a 2-D graph can be separated by a band in the middle. The band (the margin) is maximized, and the line in the middle of the band is the linear discriminant, see figure 2.8. Therefore, the objective function of SVM incorporates the idea that the wider the band the better, and then the line in the center is found [84]. In the case of more than two variables, hyper-planes are used to separate the classes, but these cannot easily be shown in a 2-D graph. Another important point is how SVM handles misclassified points: SVM penalizes misclassified points based on their distance to the linear discriminant and places the line where the penalties of the misclassified points are smallest.

Figure 2.8: Maximal margin classifier [84]
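A minimal sketch of a linear SVM under the same assumptions (scikit-learn, synthetic data); the parameter C controls the penalty for misclassified points:

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Smaller C tolerates more misclassified points and yields a wider margin
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X, y)
print(svm.predict(X[:5]))
```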


K-Nearest Neighbor

This algorithm is based on the similarity of observations. The similarity is measured using various formulas; the most common is the Euclidean distance, normally used for continuous variables (the Hamming distance is its counterpart for categorical variables). For example, the Euclidean distance in a 2-D graph is easy to see and calculate (it is the distance between two points, see figure 2.9), but when there are more than 2 variables the measure requires more operations.

For classification, the model considers that if observations are close to each other (similar), then they should belong to the same class. Hence, once the classes are formed from the training data, a new observation can be classified based on its similarity to the available classes. But there might be cases where the new observation is similar to more than one class. Therefore, this model implements a weighted voting scheme, where the k in k-Nearest Neighbors' name is the odd number of observations used in the voting. The larger the number of observations of one specific class close to the new observation, the higher the probability of belonging to that class. See figure 2.10, where the point next to the question mark is classified as a white plus symbol because there are more instances of this class close to it [84], [63].

On the other side, this algorithm is not efficient, because classifying a new instance requires computing its distance to the stored training instances in order to find the k nearest and carry out the voting. The calculation of distances becomes slower when the number of variables in the model is large. Another disadvantage is that the model does not provide coefficients for the variables, so the features cannot be ranked.

Figure 2.9: Euclidean distance [84]

Figure 2.10: Classification made by KNN [84]
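A brief sketch of the classifier under the same assumptions (scikit-learn, synthetic data):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=1)

# k is kept odd so the vote among neighbours cannot tie for two classes;
# scaling matters because neighbours are found by Euclidean distance
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.predict(X[:5]))
```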

Decision Trees

Decision trees are well-known predictive models. They are easy to understand in graph form because they follow a structured methodology to classify new observations, applying the concept of "divide and conquer". This model is one of the most preferred because the user can know how it does what it does. A tree is drawn upside down and is composed of a root node, decision nodes, branches, and terminal nodes. Each decision node of the tree considers a predictor/feature, the branches connect the nodes, and the terminal nodes represent the classes; see figure 2.11, where the upper image presents the tree and the lower image presents the decision boundaries for the two variables.

Figure 2.11: Decision Tree example [84]


Furthermore, a new observation is classified based on its features, which are used to follow the branches of the tree that lead to a class. The importance of the predictor features runs from top to bottom, where the root node is the most important discriminant. The ranking of the features is in some cases measured with Information Gain; see figure 2.12, where the y axis of the bar plot is the Information Gain, which clearly shows the ranking of the variables.

The feature that can separate the classes the most is used as the root node [84] [63].

Figure 2.12: Feature importance with Information Gain [84]
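A hedged sketch of a tree and its feature ranking, assuming scikit-learn (whose default importances are impurity-based rather than Information Gain proper) and synthetic data:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=2)

tree = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X, y)
# Impurity-based importances play the role of the Information Gain ranking
for i, imp in enumerate(tree.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```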

Random Forest

Random Forest is an ensemble model using bagging; its main feature is that it uses a large number of small, uncorrelated models and then averages their outputs to provide an outcome. It can be seen as a group of experts, each specialized in a portion of the problem, voting to provide a solution. In the case of this model, it uses uncorrelated decision trees, each created with a random sample of observations from the training set. This provides complementary predictions, where the errors of each decision tree are independent and might be cancelled out by the results of other decision trees. In short, Random Forest is a population of trees voting, in the case of classification, for a class. The idea behind ensemble methods is that the whole population provides more information than an individual model [84]. As it is composed of trees, feature importance can also be provided by the model; however, it is not as easy to interpret as a single decision tree because the structure is not straightforward.
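A minimal sketch under the same assumptions (scikit-learn, synthetic data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=3)

# 200 decorrelated trees, each fit on a bootstrap sample; the majority vote wins
forest = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)
print(forest.predict(X[:5]))
print(forest.feature_importances_)  # importance averaged over all trees
```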

Gradient Boosting

This model is another type of ensemble classifier, which uses boosting. The concept of boosting is to use sequential base classifiers, where the second classifier aims to reduce the error generated by the first classifier, and this is repeated by the following classifiers; at the end, the results of all the classifiers are summed to provide the final output. See figure 2.13 [18], where the blue arrows are the weights trained to make the predictions, the green arrows are the performance of the previous classifier, which is used to train the next classifier, and the red arrows represent the combination of the results of all the classifiers.

Figure 2.13: Boosting framework [18]

Moreover, it is called Gradient because the classifiers have an objective function which is optimized to find the global minimum. In most cases the objective function evaluates the error between the predicted values and the actual values. Thus, the error is minimized by optimizing the objective function [109]. In most cases the base classifiers are regression trees.
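A hedged sketch of the technique, assuming scikit-learn's implementation (the thesis does not say which boosting library it used) and synthetic data:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=4)

# Each of 100 shallow regression trees fits the error left by the previous ones
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=4).fit(X, y)
print(gbm.predict(X[:5]))
```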

Naive Bayes

This model is based on Bayes' Theorem, which is also known as conditional probability. In short, the conditional probability is the relative frequency of x_i among the training examples that belong to c_j, where x is an attribute and c is a class. For example, figure 2.14 shows the pies that Johnny likes and the ones that have a thick filling. Not all the pies that have a thick filling are liked by Johnny. Thus, the conditional probability that Johnny likes a pie given that it has a thick filling is expressed by equation 2.5. On the other hand, the probability that Johnny does not like a pie given that it has a thick filling is expressed by equation 2.6.

Figure 2.14: Example of Conditional Probability [63]

P(\text{likes} \mid \text{thick filling}) = \frac{3}{8} \qquad (2.5)

P(\text{does not like} \mid \text{thick filling}) = \frac{5}{8} \qquad (2.6)

This model classifies new instances based on the conditional probabilities of the features given a class. The steps to classify a new instance x = (x_1, ..., x_n) are described below:

1. For each x_i and for each class c_j, calculate the conditional probability P(x_i \mid c_j) as the relative frequency of x_i among those training examples that belong to c_j.

2. For each class c_j, estimate P(c_j) as the relative frequency of the class in the training set.

3. For the same class c_j, calculate the conditional probability P(x \mid c_j) using the naive assumption of mutually independent features (equation 2.7).

P(x \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j) \qquad (2.7)


4. Select the class with the highest value of P(c_j) \cdot \prod_{i=1}^{n} P(x_i \mid c_j).


In the specific case of the model used in this study, the Naive Bayes classifier uses Gaussian distributions (used for continuous variables). Instead of using the conditional probabilities for each feature, a Gaussian distribution is calculated for each feature using maximum likelihood. This means that a mean and a variance are computed for each variable. Then, to classify a new instance, the likelihood of each feature is obtained from its Gaussian distribution (the likelihood is the value of y at a given point x of the function). Finally, P(c_j) is multiplied by all the likelihoods of the features, and the class with the highest result is selected [63], [82].
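A minimal sketch of the Gaussian variant, assuming scikit-learn and synthetic data:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=5)

# Fits one Gaussian (mean, variance) per feature per class via maximum likelihood
nb = GaussianNB().fit(X, y)
print(nb.theta_[0][:3])   # per-feature means for class 0
print(nb.predict(X[:5]))
```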


Principal Component Analysis (PCA)

The main goal of this method is to find the hyper-plane closest to the data and then project the data onto it. This hyper-plane keeps the variance of the data. In summary, an axis is found that represents the largest variance of the data, then another axis orthogonal to the first one is generated, and then a third one orthogonal to the previous two; this is repeated for n Principal Components. The sum of the variance of all Principal Components is 100%. The number of Principal Components is defined by the number of features transformed by the method. One of the main advantages of PCA is that using only the first two PCs, a high percentage of the variance of the data can be explained [20]. Nevertheless, the downside of the technique is that each component is not easy to interpret. See figure 2.15, where the points for x_1 and x_2 are plotted: the largest variance of the data is represented by c_1, the first PC, and c_2 is the second PC, which explains the variance orthogonal to the first.

Figure 2.15: PCA example with two variables [20]
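A hedged sketch, assuming scikit-learn and synthetic data in place of the thesis's variables:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, _ = make_classification(n_samples=500, n_features=6, random_state=6)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))
# Fraction of the total variance each principal component keeps
print(pca.explained_variance_ratio_)
```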

Recursive Feature Elimination

This feature elimination technique requires a model that provides feature importances or coefficients, as well as a scoring parameter, which in the case of this study was Accuracy. In the first iteration the method uses the model with all the features; then the variable with the least weight is eliminated (the number of variables eliminated per iteration is also a parameter). This step is repeated until the desired number of features is reached. For this study, the minimum number of variables to be selected was defined as 1. Furthermore, the method uses k-fold cross-validation to decide how many features to keep [82].
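A sketch mirroring the parameters named above (minimum of 1 feature, accuracy scoring, cross-validation), assuming scikit-learn's RFECV and synthetic data:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=7)

# Drop the weakest-coefficient feature each round; score via 5-fold CV accuracy
selector = RFECV(LogisticRegression(max_iter=1000),
                 min_features_to_select=1, step=1,
                 cv=5, scoring="accuracy").fit(X, y)
print(selector.n_features_)  # number of features kept
print(selector.support_)     # mask of selected features
```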


t-student test

This statistical hypothesis test compares the means of two variable distributions (populations). One of the assumptions of this test is that the populations have a normal distribution with unknown variance. Not all the features used in this study have a normal distribution, but the process still provides useful results in practice. Another assumption made in this study is that the variances are not equal [35]. This test is used to compare distributions from two different subjects, meaning that they are independent of each other. The null hypothesis states that the means are equal, and the alternative that they are different. The p-value given by the test is used to conclude about the hypothesis: if the p-value is smaller than the significance level, the alternative hypothesis is accepted; if not, the null hypothesis is not rejected. The technique was used as follows:

1. The data was already cleansed and the binary target variable was created.

2. Each variable was divided in two distributions, one for the positive class and another for the negative class.

3. The means of the two distributions from the same variable were compared using the statistical test.

4. The p-value provided by the comparison for each variable was stored.

5. The variables with a p-value greater than 0.05 were removed from the database.

In short, the variables that were kept were the ones whose means showed a statistically significant difference between the two classes at a significance level of 0.05.
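As an illustration of steps 3 and 4, a minimal sketch with SciPy (synthetic values stand in for a real feature; equal_var=False corresponds to the unequal-variances assumption, i.e. Welch's t-test):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for one feature split by the binary target.
positive = rng.normal(loc=85, scale=5, size=60)
negative = rng.normal(loc=80, scale=7, size=60)

# equal_var=False gives Welch's t-test (unequal variances assumed).
t_stat, p_value = stats.ttest_ind(positive, negative, equal_var=False)
if p_value < 0.05:
    print("significant difference: keep the variable")
else:
    print("no significant difference: remove the variable")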

Holdout Data Validation

This is a simple technique to validate the model. The main idea is to ensure the model generalizes the data. This is performed by splitting the data into two sets: the training set and the test set. The common practice is to use 80% of the data to train the model and 20% to test it. The error obtained by the model on the test set is called the generalization error, and it shows how well the model will perform with new data. In some cases it is necessary to split the training set into two sets again, called the training and development sets, where the latter is used to compare models and to tune hyper-parameters. This is recommended when the model will go to a production environment [20].
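A minimal sketch of the 80/20 split with scikit-learn (toy data in place of the study's database):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 instances)
y = np.array([0, 1] * 5)          # toy binary target

# 80% of the instances train the model, 20% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2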

K-fold cross-validation

This method is used in the evaluation of the models as well. Cross-validation stands for dividing the dataset into k equally sized subsets. Then k experiments are run, where in each one a different subset is left out of the training data and used to test the model. In the end, the results of the k experiments are averaged. For example, for a 5-fold cross-validation, the dataset would be divided into 5 equally sized subsets; in each experiment one subset is used to test the model and the remaining four are used to train it. The final result would be the mean of the 5 experiments [63].
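A minimal sketch of a 5-fold cross-validation with scikit-learn (synthetic data and a logistic regression used only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

# 5 experiments: in each one, 1 fold tests the model and 4 folds train it.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5))
print(scores)         # one accuracy per fold
print(scores.mean())  # final result: mean of the 5 experiments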

Repeated stratified k-fold

This cross-validation method is a variation of k-fold that returns stratified folds. Stratification means that the folds preserve the same percentage of samples of each class when the data is divided. In addition, the whole k-fold procedure is repeated several times, with a different shuffle of the data in each repetition, and the scores of all repetitions are aggregated [82].
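A sketch of this procedure, assuming scikit-learn's RepeatedStratifiedKFold (the number of repeats here is illustrative, not necessarily the study's setting):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, weights=[0.7, 0.3], random_state=0)

# 5 stratified folds, repeated 3 times with different shuffles: 15 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())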

Accuracy

This is a basic metric that measures the proportion of correctly classified instances (for two or more classes): the number of correct classifications is divided by the total number of classifications. The higher the accuracy, the better the model classifies new observations. The metric is based on the confusion matrix, which for a binary classification is a 2 × 2 matrix where the rows represent the true labels and the columns the labels returned by the model, see figure 2.16.

In addition, if the model correctly classifies a YES label, then the number of True Positives (TP) is increased. If the model correctly classifies a NO label, then the number of True Negatives (TN) is increased. If the model misclassifies an actual YES as NO, then the number of False Negatives (FN) increases, and if the model misclassifies an actual NO as YES, then the number of False Positives (FP) increases [63]. The formula of the accuracy is the following:

Figure 2.16: Confusion Matrix for two classes [84]

$$\text{Accuracy} = \frac{N_{TP} + N_{TN}}{N_{FP} + N_{FN} + N_{TP} + N_{TN}} \tag{2.8}$$
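A small worked example (toy labels, assuming scikit-learn's metrics module) that ties the confusion-matrix counts to equation 2.8:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (1 = YES, 0 = NO)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels returned by the model

# With labels ordered [0, 1], ravel() returns TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print((tp + tn) / (tp + tn + fp + fn))  # 0.75, as in equation 2.8
print(accuracy_score(y_true, y_pred))   # 0.75, the same value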


Precision and Recall

Precision and Recall are two metrics used to evaluate the predictions of classification models. These metrics are based on the confusion matrix, as is the accuracy. Precision is the percentage of True Positives (TP) among all the instances that the model has predicted as positive; the formula is shown in equation 2.9, where N means number and FP means False Positives. Recall is the probability that a positive example will be correctly classified; the formula is shown in equation 2.10, where FN means False Negatives [63].

$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}} \tag{2.9}$$

$$\text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}} \tag{2.10}$$
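Continuing the toy labels from the accuracy example above, both metrics can be computed as follows (a sketch assuming scikit-learn's metrics):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # same toy labels as above
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP) = 3/4; Recall = TP / (TP + FN) = 3/4.
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75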

Local Interpretation Technique (SHAP Values)

When the model does not provide coefficients or feature importances, it becomes difficult to understand how the model does what it does. This type of model is called a black box because there is no interpretation of the variables used to make the predictions. However, Lundberg and Lee presented a study in 2017 showing the usage of a technique called SHAP values, which solves this problem.

The main idea of SHAP values is that the technique measures the contributions of the features for a specific instance. This is done by randomly adding features one by one to the model and measuring the expected output. However, different orderings produce different results. This method uses all the orderings of the variables and averages the contributions over all of them. In the end, the technique provides weights for the contributions of each variable. In this fashion, it helps to understand how the model produces a prediction for a specific instance [17].
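A minimal sketch with the shap library and a tree-based model (synthetic data; the exact model and settings of this study may differ):

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-instance feature contributions
# Each entry is the contribution of one feature to one instance's
# prediction (the exact array layout varies across shap versions).
print(len(shap_values))  # works whether shap returns a list or an array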

The previous paragraphs presented the Data Science concepts used in this study. The concepts were divided into two groups: Statistics and Machine Learning. However, the core of this thesis is based on Machine Learning concepts, and thereby more information was presented about them.

Emotion Analysis

This study used an emotion analysis based on the National Research Council Canada Emotion Lexicon, which is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The number of words is approximately 14,000. This emotion lexicon, called Emolex, was compiled in 2010 by Saif M. Mohammad and Peter D. Turney using manual annotation by people through Amazon's Mechanical Turk service, an online service to obtain a large amount of human annotation in an efficient and inexpensive manner [75], [74]. In other words, Emolex used words tagged individually by internet users.

It was decided to use this method due to its practical implementation and also because it has been cited around 800 times according to Google Scholar. Furthermore, this study only used the emotions instead of using the emotions and the sentiments, in order to give an easier interpretation, and also because there is a correlation between the positive sentiment and emotions such as Joy or Trust; the same applies for the negative sentiment and the emotions of Sadness and Fear. In addition, the 8 emotions are similar to the Ekman scheme of universal emotions (anger, contempt, disgust, enjoyment, fear, sadness and surprise) [36].

More specifically, Emolex is a database with a list of English words as rows and the 8 emotions as columns, where the value of a word for each column is 1 or 0, meaning association or no association respectively. Moreover, a word can have more than one association, such as “abandon”, which is associated with Fear and Sadness. This thesis used Emolex to classify the words of different texts according to the 8 emotions. The texts were first translated into English using a Google Translate application, and then Emolex was used to obtain the frequencies of the emotions present in each text. The result was a database with the texts as rows and the 8 emotions as columns, where the values of the columns were the frequencies. The frequencies were obtained by comparing each word of the text with the words of Emolex.
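The counting procedure can be sketched as follows, with a tiny hypothetical excerpt of the lexicon standing in for the real file:

# Hypothetical excerpt of the lexicon: word -> associated emotions.
# (The real Emolex file contains roughly 14,000 English words.)
EMOLEX = {
    "abandon": {"fear", "sadness"},
    "happy": {"joy", "anticipation"},
    "angry": {"anger", "disgust"},
}
EMOTIONS = ["anger", "fear", "anticipation", "trust",
            "surprise", "sadness", "joy", "disgust"]

def emotion_frequencies(text):
    """Count how often each of the 8 emotions appears in an English text."""
    counts = {emotion: 0 for emotion in EMOTIONS}
    for word in text.lower().split():
        for emotion in EMOLEX.get(word, ()):  # unknown words add nothing
            counts[emotion] += 1
    return counts

print(emotion_frequencies("they abandon the happy plan"))
# {'anger': 0, 'fear': 1, 'anticipation': 1, ..., 'sadness': 1, 'joy': 1, ...}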

2.3 Summary

This chapter presented the basis that supports this study. The basis is divided into two parts. The first part presented an overview of 95 studies that focused on predicting the student's academic performance using Data Science models; 6 of these studies were selected to be compared with this research. The characteristics mentioned for each study were the source data, methodology, number of instances, models, highest accuracy, and most relevant variables. In the second part of the chapter the theoretical concepts of Data Science were explained. The concepts covered Statistics and Machine Learning.


Chapter 3

Development

In this chapter the structure of the methodology is described. The methodology for the model development used in this study was based on the Cross Industry Standard Process for Data Mining (CRISP-DM), with an extension to fulfil all the objectives defined in this work.

3.1 Methodology

As mentioned before, the methodology of this study was based on the Cross Industry Standard Process for Data Mining (CRISP-DM) [84]. CRISP-DM comprises six iterative steps (see figure 3.1). The first step is Business or Problem Understanding, which aims to clarify the purpose of the analysis. The second step is Data Understanding, whose objective is to find and digest all the data available to solve the problem; in some cases it is also called Data Exploration. The third step is Data Preparation, whose main goal is to give structure to all the data that will be used by the models.

Consequently, the fourth step is the Modeling of the data, where the data is used to generate the expected outcomes. The fifth step is the Evaluation, which aims to determine the quality of the model based on metrics previously defined by the researcher. Finally, the last step is the Deployment of the model, when the model is moved to a production environment to allow it to work with real-time data. However, this last step was not in the scope of this study. One of the key advantages of this methodology is that it is not restrictive, meaning that it is possible to jump back to previous steps to improve the results of the work.

Nonetheless, the steps of CRISP-DM have a level of abstraction that leaves room for the researcher to propose techniques to fill the gaps. In other words, the methodology mentions what has to be done but not specifically how. Thus, the extended methodology used in this study takes the first 5 steps of CRISP-DM and adds one more, called Interpretation, see figure 3.2.


Figure 3.1: CRISP Methodology [84]

More specifically, the following list shows the steps and sub-steps of the methodology.

1. Problem Understanding
2. Data Understanding
   (a) Uni-variate and multivariate analysis
3. Data Preparation
   (a) Emotion's database
   (b) Elimination of variables
   (c) Elimination of instances
   (d) Target variable computation
   (e) Separation of areas of study
   (f) Database rearrangement
   (g) Elimination of outliers
   (h) Elimination of correlation
   (i) Binary target variable
   (j) Data normalization
4. Modeling
   (a) Experimentation (databases, feature selection and models)
5. Evaluation
