Reflecting on the definition of ordinal and interval variables, I know that the symmetrized-ranked variables are not ordinal variables. However, the scale property of the reexpressed variables rCLASS_, rCLASS_AGE_, and rCLASS_GENDER_ is not obvious. The variables are not on a ratio scale
Figure 5.9
Histogram and boxplot for CLASS_AGE_. Skewness = 0.8948.
Figure 5.10
Histogram and boxplot for CLASS_GENDER_. Skewness = 1.1935.
Table 5.4
Women and Children by SURVIVED
Frequency (Row Pct)
            No             Yes            Total
Child        52 (47.71)     57 (52.29)    109
Female      109 (25.65)    316 (74.35)    425
Total       161            373            534
because a true zero value has no interpretation. Accordingly, I define the symmetrized-ranked variable as an approximate interval variable.
The preliminary Titanic model is a logistic regression model with the dependent variable SURVIVED, which assumes 1 = yes and 0 = no. The preliminary Titanic model is built by the SAS procedure LOGISTIC, and
Figure 5.11
Histogram and boxplot for rCLASS_. Skewness = 0.3854.
Figure 5.12
Histogram and boxplot for rAGE_. Skewness = –4.1555.
its definition, consisting of the two interaction symmetrized-ranked variables rCLASS_AGE_ and rCLASS_GENDER_, is detailed in Table 5.6.
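As a minimal sketch of how the estimates in Table 5.6 turn the two reexpressed variables into a survival probability, the Python snippet below applies the logistic function to the reported coefficients. It assumes the model is parameterized to predict SURVIVED = 1; the function name and the input values are hypothetical and are not taken from the Titanic data.

```python
import math

# Maximum likelihood estimates as reported in Table 5.6.
INTERCEPT = -0.8690
B_RCLASS_AGE = 0.4037
B_RCLASS_GENDER = 1.0104


def survival_probability(r_class_age, r_class_gender):
    """Estimated P(SURVIVED = 1), assuming the model predicts the event 1 = yes."""
    logit = (INTERCEPT
             + B_RCLASS_AGE * r_class_age
             + B_RCLASS_GENDER * r_class_gender)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical inputs, chosen only to show the direction of the effects.
p_at_zero = survival_probability(0.0, 0.0)
p_at_one = survival_probability(1.0, 1.0)
print(round(p_at_zero, 4), round(p_at_one, 4))
```

Because both slope estimates are positive, larger values of either reexpressed variable raise the estimated probability of survival.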
The preliminary Titanic model results are as follows: 59.1% (= 420/711) of the predicted survivors are actual survivors, that is, correctly classified. I believe the classification matrix for the preliminary model (Table 5.7) yields a clearer display of the predictiveness of a binary (yes/no) classification model.
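The arithmetic behind this figure can be checked from the counts in Table 5.7. The following Python sketch recomputes the 59.1% and, as an addition not reported in the text, the overall share of correctly classified passengers; the variable names are illustrative.

```python
# Counts from Table 5.7, keyed as (actual, predicted).
counts = {
    ("deceased", "deceased"): 1199,
    ("deceased", "survivor"): 291,
    ("survivor", "deceased"): 291,
    ("survivor", "survivor"): 420,
}

total = sum(counts.values())
predicted_survivors = sum(v for (actual, pred), v in counts.items() if pred == "survivor")
correct = counts[("deceased", "deceased")] + counts[("survivor", "survivor")]

# Share of predicted survivors who actually survived (the 420/711 in the text).
survivor_hit_rate = counts[("survivor", "survivor")] / predicted_survivors
# Share of all passengers classified correctly (not reported in the text).
overall_accuracy = correct / total

print(f"{survivor_hit_rate:.1%} of predicted survivors are actual survivors")
print(f"{overall_accuracy:.1%} of all {total} passengers are correctly classified")
```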
I present only a preliminary model because much work remains in testing the three-way interaction variables, which are required because the Titanic data have only two ordinal variables (CLASS and AGE)
Figure 5.13
Histogram and boxplot for rGENDER_. Skewness = 1.3989.
Figure 5.14
Histogram and boxplot for rCLASS_AGE_. Skewness = 0.4592.
and one nominal class (GENDER). Building on the preliminary model goes beyond the objective of introducing the SRD method; the statistical data mining SRD method offers promise in improving the predictive power of the data and provides the data miner with a starting point for more applications of this useful method. Perhaps the data miner will continue from the preliminary model, using the SRD data mining approach, to finalize the Titanic model.
Figure 5.15
Histogram and boxplot for rCLASS_GENDER_. Skewness = 0.0551.
Table 5.5
Comparison of Skewness Values for Original and Symmetrized-Ranked Variables
Variable          Skewness Value   Effect of the Symmetrized-Ranked Method (Yes, No, Null)
CLASS_             1.0593          Yes
rCLASS_            0.3854
AGE_              -4.1555          Null. Binary variables render two parallel lines.
rAGE_             -4.1555
GENDER_            1.3989          Null. Binary variables render two parallel lines.
rGENDER_           1.3989
CLASS_AGE_         0.9848          Yes
rCLASS_AGE_        0.4592
CLASS_GENDER_      1.1935          Yes
rCLASS_GENDER_     0.0551
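The Null entries for the binary variables can be illustrated numerically. The sketch below assumes the adjusted sample skewness formula n/((n-1)(n-2)) * sum(((x - mean)/s)^3) and a hypothetical 0/1 coding of AGE_ with 109 children and 2,092 adults (109 is the Child row total in Table 5.4, and 2,092 is the remainder of the 2,201 passengers). It shows that relabeling a two-valued variable with any other increasing pair of values, here the arbitrary pair -1.0 and 3.0 standing in for a reexpression, leaves the skewness unchanged.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with the n - 1 divisor.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical coding: 0 = child (109 passengers), 1 = adult (2,092 passengers).
age = [0] * 109 + [1] * 2092

# Arbitrary increasing relabeling of the two values (an assumption for illustration).
relabeled = [-1.0 if x == 0 else 3.0 for x in age]

skew_age = sample_skewness(age)
skew_relabeled = sample_skewness(relabeled)
print(round(skew_age, 4), round(skew_relabeled, 4))
```

Under these assumptions the two printed values coincide, and they agree closely with the skewness reported for AGE_ and rAGE_ in Table 5.5: an increasing relabeling of a two-valued variable is a positive linear transformation, which skewness ignores.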
5.6 Summary
I introduced a new statistical data mining method, the SRD method, and added it to the paradigm of simplicity and desirability for good model-building practice. The method uses two basic statistical tools, symmetrizing and ranking variables, yielding new reexpressed variables with likely improved predictive power. First, I detailed Stevens's scales of measurement to provide a framework for the new symmetric reexpressed variables. Accordingly, I defined the newly created SRD variables on an approximate interval scale. Then, I provided a quick review of the simplest of EDA elements, the stem-and-leaf display and the box-and-whiskers plot, as both are needed for presenting the underpinnings of the SRD method. Last, I illustrated the new method with two examples, which showed improved predictive power of the symmetric reexpressed variables over the raw variables. My intent is that the examples serve as a starting point for more applications of the SRD method.
Table 5.6
Preliminary Titanic Model
The LOGISTIC Procedure: Analysis of Maximum Likelihood Estimates
Parameter        DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept         1   -0.8690    0.0533           265.4989          <.0001
rCLASS_AGE_       1    0.4037    0.0581            48.1935          <.0001
rCLASS_GENDER_    1    1.0104    0.0635           252.9202          <.0001

Table 5.7
Classification Table for the Preliminary Titanic Model
                   Predicted Deceased   Predicted Survivors   Total
Actual deceased    1,199                291                   1,490
Actual survivors     291                420                     711
Total              1,490                711                   2,201

References
1. Stevens, S.S., On the theory of scales of measurement, American Association for the Advancement of Science, 103(2684), 677–680, 1946.
2. Simonoff, J.S., The “Unusual Episode” and a Second Statistics Course, Journal of Statistics Education, v.5, n.1, 1997. http://www.amstat.org/publications/jse/v5n1/simonoff.html.
3. Friendly, M., Extending Mosaic Displays: Marginal, Partial, and Conditional Views of Categorical Data, Journal of Computational and Graphical Statistics, 8, 373–395, 1999. http://www.datavis.ca/papers/drew/.
4. http://www.geniqmodel.com/res/TitanicGenIQModel.html, 999.
5. Young, F.W., Visualizing Categorical Data in ViSta, University of North Carolina, 1993. http://forrest.psych.unc.edu/research/vista-frames/pdf/CategoricalDataAnalysis.pdf.
6. SAS Institute, http://support.sas.com/publishing/pubcat/chaps/56571.pdf, 2009.
6 Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment
6.1 Introduction
Principal component analysis (PCA), invented in 1901 by Karl Pearson*,† as a data reduction technique, uncovers the interrelationship among many variables by creating linear combinations of the many original variables into a few new variables such that most of the variation among the many original variables is accounted for or retained by the few new uncorrelated variables. The literature is sparse on PCA used as a reexpression,‡ not a reduction, technique. I posit the latter distinction for repositioning PCA as an EDA (exploratory data analysis) technique. EDA is a force for identifying structure: PCA is a classical data reduction technique of the 1900s; PCA is a reexpression method of 1977, a very good year for the statistics community as Tukey’s seminal EDA book was released then; and PCA is a data mining method of today, as if the voguish term data mining can replace EDA proper. In this chapter, I put PCA in its appropriate place in the EDA reexpression paradigm. I illustrate PCA as a statistical data mining technique capable of serving in a common application with an expected solution, and I illustrate PCA in an uncommon application, yielding a reliable and robust solution. In addition, I provide an original and valuable use of PCA in the construction of quasi-interaction variables, furthering the case for PCA as a powerful data mining method.
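To make the idea of uncorrelated linear combinations concrete, here is a minimal sketch of PCA for just two variables, using the closed-form eigen-decomposition of their 2 x 2 covariance matrix. The data are made up for illustration, the variable names are hypothetical, and the use of the covariance matrix (rather than the correlation matrix of standardized variables) is an assumption of this sketch, not a statement of the chapter's procedure.

```python
import math

# Hypothetical illustrative data: two positively related variables.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
n = len(x)

# Center each variable at its mean.
mean_x, mean_y = sum(x) / n, sum(y) / n
cx = [a - mean_x for a in x]
cy = [b - mean_y for b in y]

# Sample variances and covariance (n - 1 divisor).
sxx = sum(a * a for a in cx) / (n - 1)
syy = sum(b * b for b in cy) / (n - 1)
sxy = sum(a * b for a, b in zip(cx, cy)) / (n - 1)

# Eigenvalues of [[sxx, sxy], [sxy, syy]]: the variances of the two components.
half_trace = (sxx + syy) / 2
root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = half_trace + root, half_trace - root

# Direction of maximum variance, and the direction perpendicular to it.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
w1 = (math.cos(theta), math.sin(theta))
w2 = (-math.sin(theta), math.cos(theta))

# Principal components: linear combinations of the centered original variables.
pc1 = [w1[0] * a + w1[1] * b for a, b in zip(cx, cy)]
pc2 = [w2[0] * a + w2[1] * b for a, b in zip(cx, cy)]

share_pc1 = lam1 / (lam1 + lam2)
print(f"PC1 retains {share_pc1:.1%} of the total variation")
```

In this toy case the first component alone retains most of the variation of the two original variables, and the two components are uncorrelated with each other, which is the data reduction property described above.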
* Pearson, K., On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2(6), 559–572, 1901.
† Harold Hotelling independently developed principal component analysis in 1933.
‡ Tukey coined the term reexpression without defining it. I need a definition of the terms that I use. My definition of reexpression is changing the composition, structure, or scale of original variables by applying functions, such as arithmetic, mathematical, and truncation functions, to produce reexpressed variables of the original variables. The objective is to bring out or uncover reexpressed variables that have more information than the original variables.