A data-driven approach to predict student success in an optimization principles course


Diego Andrés Calderón Muñoz Manuel Alejandro Bolívar Vargas

Universidad de los Andes Departamento de Ingeniería Industrial

Abstract

Organizations of all types have found in Data Mining an opportunity to acquire new and useful knowledge to support their mission. As demand for higher education increases, it becomes more important for universities to provide a better educational service and to fully understand how their processes help transform high school graduates into competent professionals. Educational Data Mining is the application of Data Mining to educational systems. In this paper we explore different predictive models to determine which offers the most accurate prediction of students’ results in the Optimization Principles course offered at Universidad de los Andes.

Keywords:

Educational data mining (EDM), Classification, Logistic regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Class Imbalance, Exploratory factor analysis.

Introduction

The need to create useful, interesting and interpretable information from raw data for predictive purposes has existed since the 1950s (or even before). At that time this kind of data analysis was used mainly for credit scoring in the financial industry. A few decades later, however, all sorts of companies, governments and nonprofit organizations found themselves with databases that were too large for people to analyze on their own. Data Mining (DM), defined as the use of standard statistical methods and other automated techniques to interrogate large databases and interpret information (Finlay, 2014), provided a solution to the limited human capacity for manually processing data.

When DM is used to explore data from educational settings in order to get a better understanding of learners’ behaviors, predict their performance or find out educational functionalities and applications, it is called Educational Data Mining (EDM) (Peña-Ayala, 2014). EDM transforms raw data coming from educational systems into useful information that could potentially have a great impact on educational research and practice (Romero & Ventura, 2005). EDM has a very wide set of applications; in this paper we focus on its predictive side.

The main goal of this project is to predict students’ performance in the Industrial Engineering Department course “Optimization Principles”. More precisely, we expect to design a model that can predict whether a student is going to fail, so that instructors can act on the model’s prediction to improve their students’ learning process. This could potentially decrease failure rates without modifying the course’s academic level and also give instructors a better understanding of how their students are learning.

Performance in the course will be predicted as a binary outcome (approved or not approved) from academic performance and demographic variables such as GPA, grades in past courses, age, gender, etc. The data was collected by the Dirección de Servicios de Información y Tecnología (DSIT) and includes the academic information of the students since the first semester of 2006. With these variables we built several predictive models using logistic regression, linear discriminant analysis and quadratic discriminant analysis. For each of these models we calculated performance indicators and compared them in order to select the final predictive model.

Some of the most widely used predictive models are linear models, decision trees, neural networks, support vector machines and cluster models. All of these share a common structure, in which a set of predictor variables tries to explain a set of behavioral (outcome) data. In classification models, where the model’s output represents the probability of someone doing something or not, the most popular model is Logistic Regression (Finlay, 2014); however, the classification problem admits several approaches, including Discriminant Analysis, Genetic Algorithms, Clustering and many others. Next we discuss the classification methods used in prior studies.

In the literature, Elgamal (2013) tried to identify the factors that affect learning to program by predicting student performance in a programming course. The main variables in Elgamal’s study included students’ mathematical background, programming aptitude, problem-solving skills, gender, previous computer programming experience, high school mathematics grade, provenance and e-learning usage. Elgamal used machine-learning rule extraction algorithms to derive predictive rules from the data set. He found that mathematical background strongly affected student performance, as it was part of three of the five resulting predictive rules.

Another approach is the one used by Ayers et al. (2009), who tried to identify the stage of skill mastery (complete/partial/none) in a set of students. They compared the performance of three estimates of student skill knowledge under several clustering methods (hierarchical agglomerative clustering, k-means and model-based clustering) using simulated data (Ayers, Nugent, & Dean, 2009). A very similar problem was considered by Superby et al. (2006). The authors wanted to classify freshmen students into three groups (low risk, medium risk and high risk of failing), but instead of using clustering

Simple statistical methods like linear regression are also used frequently, as in the research led by Hiss and Franks (2014). They tried to identify the relevant variables that affect graduation rates and, more importantly, to find out whether Scholastic Aptitude Test (SAT) scores affected this rate. Using R-squared analyses, they concluded that SAT scores did not affect a student’s chances of graduating (Hiss & Franks, 2014).

Predicting a student’s final grade is also a very common problem in EDM. When the dependent variable is not the grade itself but a binary response that indicates failure, the model becomes a classification model. As seen before, there are several alternatives to tackle this problem. Gedeon et al. (1993) used neural networks, and Delgado et al. predicted pass or fail from Moodle logs using radial basis functions (Delgado, Gibaja, Pegalajar, & Pérez, 2006), while Jishan et al. (2014) focused on improving the prediction by balancing the classes with synthetic minority over-sampling (Jishan, Rashu, Haque, & Rahman, 2014).

In recent years, research on grade prediction models has gained importance and appears more frequently in studies all over the world. Although there are many ways of estimating a predictive model, we focus on logistic regression, linear discriminant analysis and quadratic discriminant analysis.

Methods

Data Selection and Preparation

The data set contains 1703 observations, one for each student enrolled in the course since the first semester of 2008. We considered only the first time each student took the course, so the actual number of enrollments in the course is higher than the size of our sample.

Optimization Principles is a core course in the Operations Research area of the Industrial Engineering program of Universidad de los Andes. It’s the first course that evaluates mathematical modeling and linear programming,


first time, have difficulties fulfilling the courses’ objectives.

The predictor variables can be separated into two groups: an academic performance group that contains GPA, grades in relevant courses taken before, etc., and a group of social variables that includes age, gender, provenance, etc. The complete list of variables is shown in Annex 1.

The 1703 observations were divided into a training set that includes 70% of the total sample and a test set that includes the remaining 30%. Observations were assigned to the two sets at random.
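A random 70/30 partition of this kind can be sketched as follows. This is only a minimal illustration: the function name `split_train_test`, the fixed seed and the use of integer indices in place of the student records are assumptions, not the procedure actually used in the study.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Randomly partition rows into a training list and a test list."""
    shuffled = list(rows)                    # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical stand-in for the 1703 student records (indices only).
students = range(1703)
train, test = split_train_test(students)
print(len(train), len(test))  # 1192 511
```

Fixing the seed makes the partition reproducible, which matters when several models are later compared on the same test set.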

Logistic Regression

Logistic regression is similar to ordinary linear regression in that a set of independent variables explains a response variable. The main difference is that the response variable is not continuous and has only two possible values (Kuhn, 2013).

\[
Y = \begin{cases} 0, & \text{negative outcome} \\ 1, & \text{positive outcome} \end{cases}
\]

This particularity changes the way 𝑌 is estimated and the whole interpretation of 𝑌̂. In this case, 𝑌̂ is not the estimation of 𝑌, but the estimated probability that a specific observation has of being a positive case. As the model is estimating probabilities now, the function to calculate 𝑌̂ needs to be bounded between 0 and 1. This can be accomplished by using any cumulative distribution function. In the logistic regression model, we use the logistic distribution. Then, the estimation of the probability is calculated by:

\[
\hat{Y} = \pi_x = \frac{e^{\beta^T X}}{1 + e^{\beta^T X}}
\]

As the function is no longer linear, the coefficients do not represent the marginal effect of a variable on the probability. Later in this document we discuss how the coefficients are interpreted (Kuhn, 2013).
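To make the bounded output concrete, the logistic function can be evaluated as in the sketch below. The coefficient vector and the observation are hypothetical values chosen for illustration; they are not estimates from this study.

```python
import math

def logistic_probability(beta, x):
    """Return e^(beta.x) / (1 + e^(beta.x)), evaluated in a stable way."""
    z = sum(b * v for b, v in zip(beta, x))
    if z >= 0:
        # Algebraically equal form that avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients (intercept first) and one observation.
beta = [1.0, -0.5]
x = [1.0, 2.0]   # the leading 1.0 multiplies the intercept
print(logistic_probability(beta, x))  # beta.x = 0, so the result is 0.5
```

Whatever the value of the linear predictor, the result stays between 0 and 1, which is what allows it to be read as a probability.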

Saturated Model

For the first logistic regression model we used the whole set of predictor variables. The results of the model let us know which variables were not useful for predicting the failure of a student. Most of the variables were not significant at a 5% significance level. It was interesting that none of the personal variables were significant and only academic variables remained in the reduced model.

Reduced Model

After eliminating the non-useful variables using backward stepwise selection, we obtained our final logistic regression model. Table 1 contains the estimated coefficients.

Table 1

Logistic Regression’s results

All the remaining variables are statistically significant in the predictive model. There are a couple of interesting values, such as the coefficient of the variable GPA_Fresh. The coefficient is positive, which would imply that a student with a higher GPA at the end of his/her first semester has a higher probability of failing Optimization Principles. The marginal effect of each variable would be easier to interpret through the odds. However, the main purpose of this project is to find the best predictive model for our case study, so the interpretation of the coefficients is not the priority for now.


Using the coefficients found in the logistic regression model, we can predict how the test set observations will be classified, and using this it’s possible to test the predictive performance of the model. The results of the classification over the test set are shown in Table 2.

Table 2

Classification Table for the Logistic Regression Model

Over 80% of the test set observations were correctly classified. However, only 36.84% of the positive observations were classified as such, while 96.31% of the negative observations were correctly classified. This difference in predictive ability is due to class imbalance: although Optimization Principles is a course with a high failure rate, most enrolled students pass. There are therefore more negative cases than positive cases, which makes the model better at predicting when a student is not failing than when he/she is. The class imbalance is clearly visible in the classification table shown above, where almost 75% of the observations are negative cases. It is even more evident in the training set, where the ratio of negatives to positives is 4 to 1.
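The three rates discussed above can be computed from actual and predicted labels as in the following sketch. The labels are invented toy data, chosen to show how a high overall accuracy can coexist with a low sensitivity under class imbalance; they are not the values behind Table 2, and the function name is hypothetical.

```python
def classification_rates(actual, predicted):
    """Return accuracy, sensitivity and specificity for 0/1 labels (1 = fail)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # failures detected
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # passes detected
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed failures
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy data: 2 failing students out of 10, only one of them detected.
actual    = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rates = classification_rates(actual, predicted)
print(rates)  # accuracy 0.9, sensitivity 0.5, specificity 1.0
```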

Class imbalance is a serious problem that can cause systematic error in the prediction, as in this case, where almost 90% of the observations are classified as negative cases. Since in this case study we are more interested in correctly predicting failure, we take some actions to balance the predictive ability of the model between classes. According to Kuhn (2013), a simple way of increasing the prediction accuracy of the minority class is to determine alternative cutoffs for the predicted probabilities. By default an observation is classified as a positive case when the calculated probability is greater than 50%. To determine the new cutoff, Figure 1 plots how sensitivity and specificity behave as the cutoff changes. This graph is built with data from the training sample.

Figure 1

Sensitivity/Specificity vs cutoff
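The search for a balancing cutoff can be sketched as a sweep over candidate values, keeping the one where sensitivity and specificity are closest. The probabilities, the candidate grid and the function names below are hypothetical; the criterion of minimizing the gap between the two rates is one reasonable reading of the graphical procedure, not necessarily the exact rule applied in the study.

```python
def sensitivity_specificity(actual, probs, cutoff):
    """Classify as positive when the probability exceeds the cutoff."""
    predicted = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    return tp / n_pos, tn / n_neg

def balanced_cutoff(actual, probs, candidates):
    """Return the candidate with the smallest |sensitivity - specificity|."""
    def gap(cutoff):
        sens, spec = sensitivity_specificity(actual, probs, cutoff)
        return abs(sens - spec)
    return min(candidates, key=gap)

# Hypothetical training-set labels and predicted probabilities.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.62, 0.33, 0.18, 0.41, 0.28, 0.22,
         0.14, 0.11, 0.09, 0.06, 0.04, 0.02]
candidates = [i / 20 for i in range(1, 11)]   # 0.05, 0.10, ..., 0.50
best = balanced_cutoff(actual, probs, candidates)
print(best, sensitivity_specificity(actual, probs, best))
```

On this toy data the default cutoff of 0.5 detects only one of the three positives, while the selected cutoff trades some specificity for a higher sensitivity.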

As seen in Figure 1, with a cutoff of 0.25 sensitivity and specificity have similar values, so there appears to be a balance between the ability to predict positive and negative values. Using this new cutoff we reclassify the observations in the test set; the results are shown in Table 3:

Table 3

Classification Table for the Logistic Regression Model. Cutoff=0.25

As expected, the overall correct classification rate and the specificity decrease when the cutoff is changed; however, the sensitivity increases significantly as the false negative rate decreases. There is much more balance between the prediction of positive and negative values with the modified cutoff.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis is a technique for classifying multivariate observations in k-dimensional space. To classify a new observation, LDA assumes that each class follows a multivariate normal distribution and that all classes share the same covariance matrix. In general terms, a new observation $x^*$ is classified as part of the $i$th class when:

\[
f_i(x^*)\,\pi_i > f_j(x^*)\,\pi_j \quad \forall j \neq i, \; i, j \in \text{Classes}
\]

where $f_i(x^*)$ represents the probability density function of the $i$th class evaluated at $x^*$, and $\pi_i$ represents the a priori (prior) probability of the $i$th class. By default, this prior probability is the proportion of observations in the whole data set that belong to the $i$th class.

The selection of variables performed for the logistic regression (LR) does not carry over to discriminant analysis. When a variable is not significant in LR, the cause may be multicollinearity: if two variables explain the same phenomenon, it is probable that only one of them is significant. In discriminant analysis, multicollinearity is not a problem; in fact, it is desirable for the predictor variables to be highly correlated, otherwise this and other multivariate analyses would not be useful. It is also important to mention that discriminant analysis cannot include categorical variables, as this would cause an artificial separation of the observations in k-dimensional space. With this in mind, only continuous variables were included in LDA. The results of classifying the test set using discriminant analysis are contained in Table 4.

Table 4

Classification Table for LDA

The results show the same problem seen before in LR: the model is much better at classifying negative cases, and it is almost perfect at classifying non-failure outcomes. As in LR, class imbalance is responsible for the poor performance of the model when predicting positive cases. In discriminant analysis there is no cutoff; however, the prior probabilities can be modified to balance the predictive performance of the model between classes. The results of the same discriminant analysis using equal prior probabilities are shown below in Table 5:

Table 5

Classification Table for LDA with equivalent priors

Modifying the prior probabilities we obtain a more balanced model.
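The effect of the priors on the classification rule can be seen in a one-dimensional sketch, where each class density is a normal distribution with a common standard deviation (the LDA assumption). The class means, standard deviation, priors and the new observation are all hypothetical values, and the function names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def classify(x, means, sd, priors):
    """Return the class whose density times prior is largest at x."""
    scores = {c: normal_pdf(x, means[c], sd) * priors[c] for c in means}
    return max(scores, key=scores.get)

# Hypothetical single predictor: class 0 = pass, class 1 = fail.
means = {0: 4.0, 1: 3.4}
sd = 0.4          # common standard deviation shared by both classes
x_new = 3.6

with_sample_priors = classify(x_new, means, sd, {0: 0.8, 1: 0.2})
with_equal_priors = classify(x_new, means, sd, {0: 0.5, 1: 0.5})
print(with_sample_priors, with_equal_priors)  # 0 1
```

The observation lies closer to the mean of the failing class, yet the imbalanced priors pull it toward the majority class; with equal priors the same observation is flagged as a likely failure.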

Table 6 shows the mean comparison between the two discriminant populations. The results are consistent with intuition, because for every academic performance variable the mean of the group classified as zero is greater than the mean of the group classified as one. It is also interesting that older students apparently tend to be classified as positive events.

Table 6

Mean comparison between populations

We could get into a deeper analysis, but as said before, the priority of this paper is not to identify risk factors but to find the most suitable model for our case.

Quadratic Discriminant Analysis (QDA)

The main difference between QDA and LDA is that in QDA the covariance matrices of the populations are not assumed to be equal, which allows the discriminant function to be quadratic. In most cases, QDA gives better results than LDA because the LDA assumption is very strong and rarely holds; however, allowing the discriminant function to be quadratic can lead to over-fitting, which is why both methods are considered.
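For reference, the standard textbook forms of the two discriminant scores make the difference explicit. The symbols $\mu_i$ (class mean), $\Sigma_i$ (class covariance matrix), $\Sigma$ (common covariance matrix) and $\delta_i$ (discriminant score) are notation introduced here and do not appear in the original analysis. Both scores come from taking the logarithm of $f_i(x)\,\pi_i$ under multivariate normality and dropping the terms that are the same for every class; an observation is assigned to the class with the largest score.

```latex
\[
\delta_i^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log\lvert\Sigma_i\rvert
  - \tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \log \pi_i
\]
\[
\delta_i^{\mathrm{LDA}}(x) = x^T \Sigma^{-1} \mu_i
  - \tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \log \pi_i
\]
```

When all classes share $\Sigma$, the quadratic term $x^T \Sigma^{-1} x$ is common to every class and cancels, leaving a score that is linear in $x$. With class-specific $\Sigma_i$ it does not cancel, so the boundary between classes becomes quadratic and more parameters have to be estimated.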

The classification results on the test observations are contained in Table 7. Only the classification with equivalent priors is shown, because we know that class imbalance affects the ability of the model to predict positive values.

Table 7

The classification using QDA has a better specificity but a worse sensitivity than LDA. The criteria that define the best model are discussed later in this document, in the section Model Comparison.

Identification of Latent Variables – Aptitudes

Some of the observed variables included in the models may be the product of hidden or latent variables that could potentially improve the prediction of whether or not a student fails the Optimization Principles course.

Like discriminant analysis, exploratory factor analysis (EFA) is a multivariate technique; unlike discriminant analysis, however, EFA is not a classification method but is used mainly to reduce the dimensions of a data set. The resulting factors are more interpretable when exploratory factor analysis is performed on related variables that have large correlations (whether positive or negative). First, we extracted factors from the variables that represent the grades of the different courses that the students took before they enrolled in the Optimization Principles course.

The strong correlations between courses, such as between APO I and PROG 2, validate the use of exploratory factor analysis in this case.

The solution of the exploratory factor analysis is not unique: several rotations satisfy all the model’s constraints, so it is up to the investigator to choose the rotation that suits best. In this case the chosen rotation is promax, because we do not expect the factors to be orthogonal and because it distributes the loadings of each variable between factors so that the results become easier to interpret. The results using the chosen rotation are included in Table 8.

Table 8

Results from exploratory factor analysis. Aptitudes

The first factor has large loadings on the math courses and on Physics II; this factor is interpreted as the mathematical aptitude of the student. The second factor can easily be interpreted as programming aptitude, as it only includes loadings on APO I and PROG 2. The third factor is an indicator of how strong the student’s theoretical bases are. Finally, the fourth factor includes loadings only on the course Introduction to Industrial Engineering, so it does not need a complex interpretation.

Table 9

Factor correlations matrix. Aptitudes

The correlation matrix between the factors presented in Table 9 shows that the factors are not orthogonal under the selected rotation. In fact, some of them are highly correlated, such as factor two and factor three, which may interfere with the significance of some factors when we run a logistic regression.

Looking at the cumulative variance, the four factors explain only 48% of the variance of the original variables; however, the null hypothesis that four factors are sufficient to explain the original variability is not rejected.

Identification of Latent Variables – General Performance

We could also extract factors regarding the general performance of a student using the average grade variables combined with variables like percentage of approved credits.

As in the first factor analysis, the chosen variables have strong linear correlations that allow us to perform the method. The results of this second factor analysis are shown in Table 10.

Table 10

Results from exploratory factor analysis. General Performance

The maximum number of factors that can be obtained is four. The results show that the null hypothesis is rejected, so we do not interpret the resulting factors because they are not sufficient to describe the original data.

With the latent variables obtained from the first exploratory factor analysis, we rerun the classification methods seen before, and finally all the models are compared.


Model Comparison

As seen across the different models considered in this paper, models with good sensitivity usually achieve it at the expense of specificity, and vice versa. This is why neither sensitivity nor specificity alone is used as a comparison indicator.

To compare the different models we use the area under the ROC curve (AUC). The ROC curve shows graphically how the sensitivity and specificity levels change as the threshold changes. The horizontal axis represents 1 − specificity, while the vertical axis represents sensitivity. With this in mind, the larger the AUC of a ROC curve, the better the model is at predicting both positive and negative cases. For a model that predicts the classes of a data set perfectly, the AUC would be 1. Figure 2 presents the ROC curves of the models that did not include latent variables, while Figure 3 shows the ROC curves of the models that did include the factors.
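The construction of the ROC curve and its area can be sketched directly from these definitions: lower the threshold step by step, record (1 − specificity, sensitivity) at each step, and integrate with the trapezoidal rule. The labels and scores are hypothetical toy values, and the function names are illustrative only.

```python
def roc_points(actual, scores):
    """Return (1 - specificity, sensitivity) pairs as the threshold is lowered."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    points = [(0.0, 0.0)]            # threshold above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points                    # last point is always (1.0, 1.0)

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = fail) and predicted probabilities.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
area = auc(roc_points(actual, scores))
print(round(area, 3))  # 0.889
```

In this toy example one failing student is scored below one passing student, so 8 of the 9 positive-negative pairs are ranked correctly and the area is 8/9.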

In both graphs logistic regression shows better performance as the threshold changes, but the ROC curves on their own are not a comparison method. Table 11 shows the AUC of the different models as well as other performance indicators.

Figure 2

ROC curves of LR, LDA and QDA

Figure 3

ROC curves of LR, LDA and QDA. Including latent variables

Table 11 Comparison table

The results presented in Table 11 show that all three original models improve in AUC when the latent variables are included. This suggests that in some projects unobserved variables can better explain the studied phenomenon, in this case the failure of students.

Regarding the performance criterion (AUC), the model with the best indicator is logistic regression with latent variables. Although this model is not the best on other indicators, we are aiming for a model with balanced predictive power, which is why it is the recommended model for predicting failure.


Further Analysis of the Chosen Model

The results of the chosen model are contained in Table 12. As said before, in logistic regression the coefficients do not represent the marginal effect of the variable on the probability because the function is no longer linear.

Table 12

Logistic Regression’s results with latent variables

With this in mind, the concept of odds is introduced. Recall the logistic cumulative probability function:

\[
\pi_{x_i} = \frac{e^{\beta^T X_i}}{1 + e^{\beta^T X_i}}
\]

The odds would be equivalent to:

\[
\frac{\pi_{x_i}}{1 - \pi_{x_i}} = e^{\beta^T X_i}
\]

The coefficients represent the marginal effect of a variable on the natural logarithm of the odds, which is not directly interpretable. On the other hand, increasing the $j$th variable by one unit multiplies the odds by $e^{\beta_j}$, so $e^{\beta_j}$ represents the effect of the $j$th variable on the odds, which estimate the ratio of positive to negative outcomes. This makes the interpretation easier. The results of the chosen model expressed as odds are shown in Table 13.

Table 13

Logistic Regression’s results (Odds). With latent variables

The results give us some interesting information. For example, for every unit that a student’s GPA increases, the odds of failing Optimization Principles decrease by 84.9%. If a student takes the course during summer school, the odds of failing decrease by 90%. A student also reduces the odds of failing by being enrolled in a second program (by 42%) and by taking extra credits (by 14% per credit). The latent variables do not have a trivial interpretation, because their scores are calculated by how far from the mean the linear combination computed from the factor loadings is. However, the results indicate that a student with a better mathematical and programming background is less likely to fail.
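The percentages quoted above follow from the exponentiated coefficients: a one-unit increase in a variable multiplies the odds by the exponential of its coefficient, so the percentage change is that multiplier minus one. The sketch below illustrates the arithmetic. The coefficient is back-computed from the 84.9% reduction reported for GPA and is used only as an illustration; the fitted value itself belongs to Table 13, which is not reproduced here, and the function name is hypothetical.

```python
import math

def percent_change_in_odds(beta_j, delta=1.0):
    """Percentage change in the odds when variable j increases by delta."""
    return (math.exp(beta_j * delta) - 1.0) * 100.0

# Illustrative coefficient: an odds ratio of 0.151 matches an 84.9% reduction.
beta_gpa = math.log(0.151)
print(round(percent_change_in_odds(beta_gpa), 1))  # -84.9
```

A negative coefficient therefore always reduces the odds, a coefficient of zero leaves them unchanged, and the effect compounds multiplicatively over several units rather than adding up.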

Figure 4 shows how the probability of failing changes with current GPA for a student with average mathematical and programming aptitudes who is not enrolled in a second program, is not taking the course during summer school, is taking 18 credits (the suggested amount for fourth semester) and had a GPA of 4.0 as a freshman.


Figure 4

The slope is quite steep for GPA values lower than 4. This means that students with GPAs greater than 4 have almost the same probability of failing, while for the other students the probability of failing can change considerably between two similar GPAs.

Conclusions and improvement opportunities

The model that best fits the characteristics of the case study is a logistic regression that includes latent variables for mathematical and programming aptitudes. Social variables such as gender, age and provenance were not significant for predicting a student’s results; only academic variables can really be used as predictors of whether a student fails.

The results from the model also showed that students who take on larger academic responsibilities tend to achieve better results. This might be because they have grown accustomed to heavier academic loads and have developed organization and study strategies that yield superior results even though they have larger amounts of work.

The predictive ability of the model could be significantly improved by using boosting or other techniques, which are reserved for future projects.

References

Ayers, E., Nugent, R., & Dean, N. (2009). A Comparison of Student Skill Knowledge Estimates.

Delgado, M., Gibaja, E., Pegalajar, M., & Pérez, O. (2006). Predicting Students' Marks from Moodle Log using Neural Network Models.

ElGamal, A. (2013). An Educational Data Mining Model for Predicting Student Performance in Programming Course. International Journal of Computer Applications.

Finlay, S. (2014). Predictive Analytics, Data Mining and Big Data. Myths, Misconceptions and Methods. MacMillan.

Hiss, W. C., & Franks, V. W. (2014). Defining Promise: Optional standardized testing policies in American college and university admissions.

Jishan, S., Rashu, R., Haque, N., & Rahman, R. (2014). Improving accuracy of students' final grade prediction model using optimal width binning and synthetic minority over-sampling technique.

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.

Peña-Ayala, A. (2014). Educational Data Mining: Application and Trends. Mexico City: Springer.

Romero, C., & Ventura, S. (2005). Educational Data Mining: A Review of the State of Art.

Superby, J., Vandamme, J., & Meskens, N. (2006). Determination of factors influencing the achievement of the first-year university students using data mining methods. International conference on intelligent tutoring systems.


Annex 1.

Label          Possible Values   Description
Sex            0                 Male
               1                 Female
Diff Calc      [1.5;5]           Grade obtained in Differential Calculus.
Int Calc       [1.5;5]           Grade obtained in Integral Calculus.
LinAlg         [1.5;5]           Grade obtained in Linear Algebra.
Physics I      [1.5;5]           Grade obtained in Physics I.
Physics II     [1.5;5]           Grade obtained in Physics II.
Introd         [1.5;5]           Grade obtained in Introduction to Industrial Engineering.
APO I          [1.5;5]           Grade obtained in APO I.
PROG 2         [1.5;5]           Grade obtained in the second programming course.
SAD?           0                 Didn’t take SAD
               1                 Took SAD
Pos SAD1       0                 Didn’t have the chance of taking SAD
               1                 Had the chance of taking SAD
AGE                              Age at the time the student enrolled in the course.
Bogotá         0                 Not from Bogotá.
               1                 From Bogotá.
Semesters                        Number of semesters in the university
Double         0                 Enrolled in a single program
               1                 Enrolled in two programs
GPA            [1.5;5]           GPA
Cred Tkn                         Number of credits taken.
Passed Cred                      Approved credits
% Approved     (0;1]             % of approved credits
GPA_Fresh      [1.5;5]           GPA from first semester
Vacacional     0                 Didn’t take the course during summer school
               1                 Took the course during summer school
Vectorial Ap   0                 Has not taken Vectorial Calculus
               1                 Has taken Vectorial Calculus
VectxAp        [0;5]             Grade obtained in Vectorial Calculus. 0 if N/A
fail           0                 Passed the course