3. RESEARCH METHODOLOGY
3.3 SURVEY TABULATION
3.3.3 GENERAL ANALYSIS OF THE DATA OBTAINED
The cluster analysis can be further developed into a classification model in which each cluster is treated as a class of schools. As the schools' metrics change from year to year, a classification model trained with the cluster numbers as labels can assign schools to clusters automatically. This will assist in automatically tracking the changes happening in the school profiles.
Two classification methods are implemented and compared to see which gives the better results.
5.4.1 Random Forests
The random forests (RF) method, described in Section 4.2.1, was implemented as a multiclass classification model for the school data. The existing data was divided into a training set consisting of 1,200 schools and a testing set of 436 schools. The results, shown in Table 23, show that as a classification model, RF performed extremely well, with an out-of-bag (OOB) error rate of 2.08%, which corresponds to an estimated accuracy of 97.92%. The term “out-of-bag” comes from “bagging” (itself short for “bootstrap aggregation”), a predecessor to the RF method. Each tree in bagging is built from a bootstrap sample containing roughly two-thirds of the observations in the dataset; the remaining one-third are “out of the bag” and are used for prediction, from which the error rate is calculated (James et al., 2013, pp. 317–318). RF is more efficient than bagging, because bagging considers all p predictors at each split, whereas RF considers only a random subset of m ≈ √p predictors (James et al., 2013, p. 320), but the error rate is still called the OOB error rate. The RF for our data constructed 500 trees, with each tree using at least three variables for splits.
Table 23: Classification results from Random Forests, and prediction error (OOB) rate estimate
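The "roughly two-thirds in the bag, one-third out of the bag" split described above can be checked with a short simulation. The following is a minimal sketch, not the procedure used to produce Table 23: the function name `oob_fraction` and the seed are illustrative assumptions, and only the sample size (1,200) and the number of trees (500) are taken from the text.

```python
import random

def oob_fraction(n_obs, n_trees, seed=0):
    """Average share of observations left out of each bootstrap sample."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    total = 0.0
    for _ in range(n_trees):
        # A bootstrap sample draws n_obs row indices with replacement,
        # so some rows appear several times and others not at all.
        in_bag = {rng.randrange(n_obs) for _ in range(n_obs)}
        total += (n_obs - len(in_bag)) / n_obs
    return total / n_trees

frac = oob_fraction(n_obs=1200, n_trees=500)
print(f"average out-of-bag share: {frac:.3f}")
```

The average share should land close to 1/e (about 0.37), which is where the "one-third" rule of thumb comes from.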
Variable importance to each class: Table 24 displays the degree of importance that each variable had in determining the class of all observations.
Table 24: Global variable importance based on an ensemble of 500 trees used for classification
If a significant predictor is removed from the model, some observations will be incorrectly classified by the remaining model. The proportion of observations that will be misclassified when a given predictor is removed is known as the mean decrease in accuracy for that predictor. These values are graphed in Figure 39 and tabulated in Table 25. For example, by removing math valid scores (m_val), on average 0.14, or about one in seven, of the observations in the data set will be misclassified.
The Gini index is a measure of how purely the nodes in the tree represent the classes; it is small when most of the observations at each node belong to just one class (James et al., 2013, p. 312). The mean decrease in Gini indicates the degree of purity contributed to a class by a certain variable. Variables that result in nodes with higher purity have a higher decrease in Gini coefficient. These values are also graphed in Figure 39 and tabulated in Table 25.
Figure 39: Graphs of mean decrease in accuracy and mean decrease in Gini index for each variable used in the model.
Table 25: Mean decrease in accuracy and mean decrease in Gini index for the variables used in the model
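To make the link between node purity and the decrease in Gini index concrete, the sketch below computes the Gini index of a node and the decrease obtained from a single split. It is an illustration only: the class labels are made-up toy values, the helper names `gini` and `gini_decrease` are assumptions, and the figures in Table 25 are averages over all splits in all 500 trees rather than the result of one split.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Parent Gini minus the size-weighted Gini of the two child nodes."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy node with two classes in equal proportion (hypothetical labels).
parent = [1, 1, 1, 1, 2, 2, 2, 2]
pure_split = gini_decrease(parent, [1, 1, 1, 1], [2, 2, 2, 2])
mixed_split = gini_decrease(parent, [1, 1, 2, 2], [1, 1, 2, 2])
print(pure_split, mixed_split)
```

A split that separates the classes completely removes all of the parent node's impurity, while a split that leaves both children as mixed as the parent yields no decrease.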
Finally, Figure 40 graphs the observations on a two-dimensional plane, with the classes color coded. Most of the observations line up neatly along three axes, and the classes are mostly confined to specific segments of specific axes, indicating strong prediction power.
Figure 40: Plot indicates the separation of the four classes on a two-dimensional plane.
Meaning of the results from Random Forests: The results from Random Forests show that, as a classification model, it is highly successful in separating the observations and assigning them to their respective classes with a very low error rate. Thus, if any new school is added to the data, or any update happens to an existing school’s parameters, reclustering of data is not needed. The RF-trained model will be able to classify the new or altered observation to the cluster it would now belong to with a 97.92% accuracy.
5.4.2 Linear Discriminant Analysis
Linear discriminant analysis (LDA; not to be confused with the latent Dirichlet allocation described in Section 4.3.3) was also performed on the schools dataset. The results show that it is also an effective classifier for the given dataset, though not as strong as RF. Tables 26, 27, and 28 show the confusion matrix, total numbers of observations in each class, and the accuracy (diagonal entries) and misclassification rates (off-diagonal entries) for the 1,200 schools in the training dataset. Together they show an average accuracy rate of 94.6%.
Table 26: Confusion matrix for test data set for linear discriminant analysis
Table 27: Number of observations from the training set in each class
Table 28: Accuracy and misclassification rates for each class assignment in LDA for test set
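The relationship between a confusion matrix and the accuracy and misclassification rates derived from it can be sketched as follows. The counts are hypothetical and are not the values of Table 26; the convention that rows hold the true class and columns the predicted class is also an assumption.

```python
def overall_accuracy(cm):
    """Share of observations on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def row_rates(cm):
    """Divide each row by its total: the diagonal gives per-class accuracy,
    the off-diagonal entries give misclassification rates."""
    return [[cell / sum(row) for cell in row] for row in cm]

# Hypothetical 4-class confusion matrix (rows: true class, columns: predicted).
cm = [
    [90, 5, 3, 2],
    [4, 92, 2, 2],
    [1, 3, 95, 1],
    [2, 2, 1, 95],
]
accuracy = overall_accuracy(cm)
rates = row_rates(cm)
print(f"overall accuracy: {accuracy:.3f}")
for row in rates:
    print([round(rate, 2) for rate in row])
```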
Since the accuracy of RF exceeds that of LDA by a considerable margin (about 3.3 percentage points), this test shows that RF performs better as a classifier for this data. LDA performs best when the boundaries between classes are linear and the predictors follow a Gaussian distribution within each class. RF, on the other hand, is distribution agnostic.
5.5. Frequent Patterns and Association Rules
The dataset used for frequent pattern analysis (described in Section 4.4) comprises school-level demographics, teacher education and teaching experience, foster care, graduation numbers, and percent meeting university requirements, merged with the math performance attribute. A total of 51 attributes are used for the pattern analysis. The full list of attributes is given in Appendix 3. Data from the years 2014 and 2015 were used for pattern analysis, and the results were compared to see whether any new patterns emerged in either year. Some examples of the attributes are given in Table 29.
Table 29: Sample attributes at school level used for frequent pattern analysis
EnrollmentC
EnglishLearnersPerC
AmericanIndianPerC
AsianPerC
AfricaAmericanPerC
HispanicPerC
FluentEnglishProficientPerC
FosterYouthNumC
FreeReducedMealsPerC
CohortGraduatesPerC
PerPupilRatioTeacherC
1stYearTeachersC
2YearTeachersC
AvgYearsTeachingC (average years teaching experience)
TeachersFTEC (number of full-time teachers)
Grad_app
Data attributes are discretized and binned based on the mean, median, and quantile values. For example, the attribute Enrollment ranges from 2 to 4,814, with a median of 1,366; its values were binned into 10 categories to form the variable EnrollmentC. Similarly, the attribute AvgYearsTeaching has a minimum of 0, a maximum of 21, and a median of 11. This attribute was binned into four categories as AvgYearsTeachingC.
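One way such a discretization could be carried out is quantile-based binning, sketched below for a four-category variable. This is an assumption about the general approach, not a reconstruction of the exact cut points used in the study: the sample values are hypothetical (chosen only to share the stated minimum of 0, maximum of 21, and median of 11), and `quantile_bins` is an illustrative helper name.

```python
import bisect
import statistics

def quantile_bins(values, n_bins):
    """Assign each value a category code 1..n_bins using quantile cut points."""
    # statistics.quantiles returns the n_bins - 1 interior cut points.
    cuts = statistics.quantiles(values, n=n_bins)
    codes = [bisect.bisect_left(cuts, v) + 1 for v in values]
    return codes, cuts

# Hypothetical average-years-teaching values for twelve schools.
avg_years_teaching = [0, 3, 5, 8, 10, 11, 11, 12, 14, 16, 19, 21]
codes, cuts = quantile_bins(avg_years_teaching, n_bins=4)
for value, code in zip(avg_years_teaching, codes):
    print(f"AvgYearsTeaching={value:>2} -> AvgYearsTeachingC={code}")
```

Quantile cut points give categories of roughly equal size; cut points based on the mean or on fixed thresholds would give a different coding.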
Table 30 shows the topmost frequent patterns and association rules derived out of them:
Table 30: Most frequent patterns and association rules
Frequent Patterns and Associations
{FluentEnglishProficientPerC=4,PerPupilRatioTeacherC=2} => {CohortGraduatesPerC=4}
{EnrollmentC=4,PerPupilRatioTeacherC=2} => {CohortGraduatesPerC=4}
{EnrollmentC=4,X2YearTeachersC=2} => {CohortGraduatesPerC=4}
{FreeReducedMealsPerC=1} => {EnglishLearnersPerC=1}
{TeachersFTEC=3} => {CohortGraduatesPerC=4}
{FreeReducedMealsPerC=3} => {CohortGraduatesPerC=4}
{EnrollmentC=3,TeachersFTEC=2} => {CohortGraduatesPerC=4}
{FosterYouthNumC=3,PerPupilRatioTeacherC=2} => {AvgYearsTeachingC=1}
{EnrollmentC=4,AvgYearsTeachingC=1} => {TeachersFTEC=2}
{EnrollmentC=4,CohortGraduatesPerC=4} => {TeachersFTEC=2}
{EnglishLearnersPerC=2,TeachersFTEC=2} => {AvgYearsTeachingC=1}
{m_pprof=1, MetAttendTarg=3, AmericanIndianPerC=1, WhitePerC=1, FreeReducedMealsPerC=4, AvgYearsTeachingC=1} => {HispanicPerC=5}
Suppose we wish to learn what conditions lead to a school having the highest value of cohort graduation percentage. In the list of frequent patterns and association rules, we look for CohortGraduatesPerC at the highest level, 4, among the consequents, and we find several patterns in the antecedents. For example, the association rule “{FluentEnglishProficientPerC=4,PerPupilRatioTeacherC=2} => {CohortGraduatesPerC=4}” indicates that the cohort graduation percentage is at its highest level (above 75%) when the FluentEnglishProficientPercent is above 75% and the pupil-teacher ratio is around 20.
These patterns contribute to the analysis by revealing interactions and key relationships between variables that regression models fail to identify. The frequent patterns with multiple variables can be projected into a higher-dimensional space to form association rules. These rules give several insights to enhance the decision-making capacity for policy planning. The plot in Figure 41, for example, shows the support and confidence for eight orders of rules.
Figure 41: Support and confidence for multiple-order frequent patterns.
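The support and confidence plotted for each rule follow the standard definitions: support is the share of schools in which all items of a rule occur together, and confidence is that share divided by the support of the antecedent alone. The sketch below applies these definitions to one rule from Table 30. The five school records are hypothetical, and the helper names `support` and `confidence` are illustrative assumptions.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, over antecedent support."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the data")
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical binned records for five schools, one set of items per school.
schools = [
    {"EnrollmentC=4", "PerPupilRatioTeacherC=2", "CohortGraduatesPerC=4"},
    {"EnrollmentC=4", "PerPupilRatioTeacherC=2", "CohortGraduatesPerC=4"},
    {"EnrollmentC=4", "PerPupilRatioTeacherC=2", "CohortGraduatesPerC=3"},
    {"EnrollmentC=3", "PerPupilRatioTeacherC=2", "CohortGraduatesPerC=4"},
    {"EnrollmentC=2", "PerPupilRatioTeacherC=1", "CohortGraduatesPerC=2"},
]
antecedent = {"EnrollmentC=4", "PerPupilRatioTeacherC=2"}
consequent = {"CohortGraduatesPerC=4"}

rule_support = support(schools, antecedent | consequent)
rule_confidence = confidence(schools, antecedent, consequent)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data the rule holds in two of five schools, and in two of the three schools that match the antecedent.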