1. Introducción
1.6. Escalado de la tecnología de electrodescontaminación
Many possible directions related to this research can be continued for future investigation. Several interesting issues are demonstrated as follows:
1. In this study, the experiments have focused only on the class noise rather than the attribute noise. More complicated problem domains associated with the attribute noise need to be explored such as the class imbalance problem with missing attribute values (attribute noise) in both binary classification and multi- class classification.
2. In the class imbalance problem, the optimal class distribution is still unknown. In this research study, the ratio between the minority class and the majority class after the data balancing applied is 1:1. This assumption is based on the research
studied by Weiss and Provost [32] and Visa and Ralescu [102]. They concluded that the best ratio of class distribution is near the balanced ratio 1:1 but it may not be optimal. The optimal class distribution could be different from data set to data set. However, in order to overcome the class imbalance problem more effectively, further studies are required to discover the optimal class distribution in different problem domains.
3. Lastly, as there are a number of re-sampling techniques proposed these days, a deterministic model based on the class ratio between the minority and the majority classes can be developed in order to select the most effective re- sampling technique suited to the different proportion of imbalanced data.
This study has explored the use of data cleaning and data re-distribution techniques to overcome the issues of noise and imbalanced data in classification problems. It is believed that the experimental studies and results from this research have contributed to the improvement of classification performance in practical applications. Although many other possible research directions are not included in this section, it is hoped that research studies of noise and imbalanced data problems will continue further in order to solve many other complex problems.
Appendixes
Appendix A: Attribute Information of Data Sets
I.
German Credit Data
[96]No. Attribute Type Description
1 Status of existing checking account
qualitative A11 : ... < 0 DM A12 : 0 <= ... < 200 DM A13 : ... >= 200 DM /salary assignments for at least 1 year A14 : no checking account 2 Duration in month numerical
3 Credit history qualitative A30 : no credits taken/all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paidback duly till now
A33 : delay in paying off in the past A34 : critical account/other credits existing (not at this bank)
4 Purpose qualitative A40 : car (new)
A41 : car (used)
A42 : furniture/equipment A43 : radio/television A44 : domestic appliances A45 : repairs
A46 : education A47 : vacation A48 : retraining A49 : business
A410 : others
5 Credit amount numerical
6 Savings qualitative A61 : ... < 100 DM A62 : 100 <= ... < 500 DM A63 : 500 <= ... < 1000 DM A64 : .. >= 1000 DM
A65 : unknown/no savingsaccount
7 Employment
duration
qualitative A71 : unemployed A72 : ... < 1 year A73 : 1 <= ... < 4 years A74 : 4 <= ... < 7 years A75 : .. >= 7 years
8 Instalment rate numerical
9 Personal status qualitative A91 : male : divorced/separated A92 : female : divorced/separated/
married A93 : male : single
A94 : male : married/widowed A95 : female : single
10 Debtors qualitative A101 : none
A102 : co-applicant A103 : guarantor
11 Residence numerical
12 Property qualitative A121 : real estate
A122 : if not A121 : building society savingsagreement/life
insurance
A123 : if not A121/A122 : car or other, not in attribute 6 A124 : unknown/no property
13 Age numerical
14 Instalment plans qualitative A141 : bank A142 : stores
A143 : none
15 Housing qualitative A151 : rent
A152 : own A153 : for free
16 Existing credit numerical
17 Job qualitative A171 : unemployed/unskilled
- non-resident A172 : unskilled - resident A173 : skilled employee/official A174 : management/self-employed/ highlyqualified employee/
officer
18 Liable people numerical
19 Telephone qualitative A191 : none
A192 : yes, registered under the customer’s name 20 Foreign worker qualitative A201 : yes
A202 : no
21 Customer
classification
class 1: good credit risk 2: bad credit risk
II.
BUPA Liver Disorders
[96]No. Attribute Type Description
1 Mean corpuscular volume (mcv) numerical mcv value of blood test 2 Alkaline phosphatase (alkphos) numerical alkphos value of blood
test
3 Alanine aminotransferase (sgpt) numerical sgpt value of blood test 4 Aspartate aminotransferase (sgot) numerical sgot value of blood test 5 Gamma-glutamyl transpeptidase
(gammagt)
numerical gammagt value of blood test
6 Drinks real Number of half-pint
beverages drunk per day.
7 Person classification class 1: healthy person
2: unhealthy persons who have liver disorder problem
III.
Johns Hopkins Ionosphere
[96]No. Attribute Type Description
1-34 Received signal no. 1 - no. 35 real Received signal of the radar system from ionosphere 35 Classification of radar returns class g: good
b: bad
IV.
Pima Indians Diabetes
[96]No. Attribute Type Description
1 Number of times pregnant
numerical
2 Plasma glucose concentration
numerical A two hours in an oral glucose tolerance test
3 Blood pressure numerical Diastolic blood pressure (mm Hg)
4 Triceps numerical Triceps skin fold thickness (mm)
5 Insulin numerical 2-Hour serum insulin (mu U/ml)
6 Body mass index real weight in kg/(height in m)2
7 Diabetes pedigree function
real
8 Age numerical
9 Diabetes classification class 0: negative for diabetes 1: positive for diabetes
V.
Haberman's Survival Data
[96]No. Attribute Type Description
1 Age numerical Age of patient at time of operation
2 Year numerical Patient's year of operation
3 Axillary nodes numerical Number of positive axillary nodes detected 4 Survival status class 1 : the patient survived 5 years or longer
2 : the patient died within 5 year
VI.
SPECT Heart Data
[96]No. Attribute Type Description
1-22 F1 - F22 binary Features that summarise the original
SPECT images
23 Overall diagnosis class 0: abnormal
1: normal
VII.
Balance Scale Data
[96]No. Attribute Type Description
1 Left-Weight numerical Number between 1 to 5
2 Left-Distance numerical Number between 1 to 5
3 Right-Weight numerical Number between 1 to 5
4 Right-Distance numerical Number between 1 to 5
5 Class name class L: scale tip to the left
B: scale is balanced R: scale tip to the right
VIII.
Glass Identification Data
[96]No. Attribute Type Description
1 RI real Refractive index
2 Na real Weight percent of Sodium in corresponding oxide
3 Mg real Weight percent of Magnesium in corresponding
oxide
4 Al real Weight percent of Aluminium in corresponding
oxide
5 Si real Weight percent of Silicon in corresponding oxide
6 K real Weight percent of Potassium in corresponding
oxide
7 Ca real Weight percent of Calcium in corresponding
oxide
8 Ba real Weight percent of Barium in corresponding oxide
9 Fe real Weight percent of Iron in corresponding oxide
10 Type of glass class 1: building windows float processed 2: building windows non-float processed 3: vehicle windows float processed 4: vehicle windows non-float processed (none in this database)
5: containers 6: tableware 7: headlamps
IX.
Yeast Data
[96]No. Attribute Type Description
1 Mcg real McGeoch's method for signal
sequence recognition.
2 Gvh real Von Heijne's method for signal
sequence recognition.
spanning region prediction program.
4 Mit real Score of discriminant analysis of the
amino acid content of the N-terminal region (20 residues long) of
mitochondrial and non-mitochondrial proteins.
5 Erl binary Presence of "HDEL" substring
(thought to act as a signal for retention in the endoplasmic reticulum lumen).
6 Pox real Peroxisomal targeting signal in the C-
terminus.
7 Vac real Score of discriminant analysis of the
amino acid content of vacuolar and extracellular proteins.
8 Nuc real Score of discriminant analysis of
nuclear localisation signals of nuclear and non-nuclear proteins.
9 Sequence Name class CYT: Cytosolic or cytoskeletal
NUC: Nuclear MIT: Mitochondrial ME3: Membrane protein, no N- terminal signal
ME2: Membrane protein, uncleaved signal
ME1: Membrane protein, cleaved signal
EXC: Extracellular VAC: Vacuolar
POX: Peroxisomal
Appendix B: Data Preparation and Normalisation
In order to prepare the data for training learning algorithms, data in each attribute is required to be converted to numeric value and fall in the range zero to one. Firstly, all qualitative data must be converted to numeric value. For example, in German credit data, the data A201 and A202 of the foreign worker attribute can be converted to one and two respectively. After each qualitative data is converted to numeric value, the data normalisation for the required interval range is performed by the following equation [104].
newValue = originalValue – minimumValue maximumValue - mininumValue where
• newValue is the computed value falling in the [0,1] interval range.
• originalValue is the value to be converted.
• minimumValue is the smallest possible value for the attribute.
List of References
[1] G. P. Zhang, "Neural networks for classification: A survey," IEEE Transactions
on Systems, Man, and Cybernetics, vol. 30, pp. 451-462, 2000.
[2] S. B. Kotsiantis, "Supervised machine learning: A review of classification techniques," Informatica, vol. 31, pp. 249-268, 2007.
[3] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Machine learning, neural and
statistical classification. Chichester: Ellis Horwood, 1994.
[4] X. Zhu and X. Wu, "Class noise vs. attribute noise: A quantitative study,"
Artificial Intelligence Review, vol. 22, pp. 177-210, 2004.
[5] D. Guan, W. Yuan, Y. K. Lee, and S. Lee, "Identifying mislabeled training data with the aid of unlabeled data," Applied Intelligence, pp. 1-14, 2010.
[6] C.-M. Teng, "A comparison of noise handling techniques," in Proceedings of
the Fourteenth International Florida Artificial Intelligence Research Society Conference, Florida, USA, 2001, pp. 269-273.
[7] C. E. Brodley and M. A. Friedl, "Identifying mislabeled training data," Journal
of Artificial Intelligence Research, vol. 11, pp. 137-167, 1999.
[8] M. Pechenizkiy, A. Tsymbal, S. Puuronen, and O. Pechenizkiy, "Class noise and supervised learning in medical domains: The effect of feature extraction," in
Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS '06), 2006, pp. 708-713.
[9] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD
Explorations Newsletter, vol. 6, pp. 20-29, 2004.
[10] J. Laurikkala, "Improving identification of difficult small classes by balancing class distribution," in Proceedings of the 8th Conference on AI in Medicine in
Europe: Artificial Intelligence Medicine, London: Springer-Verlag, 2001, pp.
63-66.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence
[12] Q. Gu, Z. Cai, L. Zhu, and B. Huang, "Data mining on imbalanced data sets," in
International Conference on Advanced Computer Theory and Engineering (ICACTE '08), 2008, pp. 1020-1024.
[13] Y. Sun, A. K. C. Wong, and M. S. Kamel, "Classification of imbalanced data: A review," International Journal of Pattern Recognition and Artificial
Intelligence, vol. 23, pp. 687-719, 2009.
[14] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on
Knowledge and Data Engineering, vol. 18, pp. 63-77, 2006.
[15] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81- 106, 1986.
[16] V. Garcia, J. S. Sanchez, R. A. Mollineda, R. Alejo, and J. M. Sotoca, "The class imbalance problem in pattern classification and learning," in Taller de
Mineria de Datos y Aprendizaje (Tamida 2007), 2007, pp. 283-291.
[17] P. Kraipeerapun, C. C. Fung, and S. Nakkrasae, "Porosity prediction using bagging of complementary neural networks," in Proceedings of the 6th
International Symposium on Neural Networks on Advances in Neural Networks (ISNN '09), Wuhan, China, 2009, pp. 175-184.
[18] G. L. Libralon, A. C. P. d. L. F. d. Carvalho, and A. C. Lorena, "Pre-processing for noise detection in gene expression classification data," Journal of Brazilian
Computer Society, vol. 15, pp. 3-11, 2009.
[19] L. Daza and E. Acuna, "An algorithm for detecting noise on supervised classification," in Proceedings of The World Congress on Engineering and
Computer Science (WCECS 2007), San Francisco, USA, 2007, pp. 701-706.
[20] S. Verbaeten and A. Van Assche, "Ensemble methods for noise elimination in classification problems," in Proceedings of the 4th International Conference on
Multiple Classifier Systems, Guildford, UK, 2003, pp. 317-325.
[21] T. Fawcett and P. Foster, "Combining data mining and machine learning for effective user profile," in the 2nd International Conference on Knowledge
Discovery and Data Mining: AAAI Press, 1996, pp. 8-13.
[22] M. Kubat, R. C. Holte, and S. Matwin, "Machine learning for the detection of oil spills in satellite radar images," Machine Learning, vol. 30, pp. 195-215, 1998.
[23] Y.-M. Huang, C.-M. Hung, and H. C. Jiau, "Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem,"
Nonlinear Analysis: Real World Applications, vol. 7, pp. 720-747, 2006.
[24] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: Special issue on learning from imbalanced data sets," SIGKDD Explorations Newsletter, vol. 6, pp. 1-6, 2004.
[25] G. M. Weiss, "Mining with rarity: A unifying framework," SIGKDD
Explorations Newsletter, vol. 6, pp. 7-19, 2004.
[26] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions
on Knowledge and Data Engineering, vol. 21, pp. 1263-1284, 2009.
[27] R. C. Holte, L. E. Acker, and B. W. Porter, "Concept learning and the problem of small disjuncts," in Proceedings of the Eleventh International Joint
Conference on Artificial Intelligence, 1989, pp. 813-818.
[28] J. R. Quinlan, "Improved estimates for the accuracy of small disjuncts,"
Machine Learning, vol. 6, pp. 93-98, 1991.
[29] K. M. Ting, "The problem of small disjuncts: Its remedy in decision trees," in
Proceedings of the Tenth Canadian Conference on Artificial Intelligence, 1994,
pp. 91-97.
[30] K. Yoon and S. Kwek, "A data reduction approach for resolving the imbalanced data issue in functional genomics," Neural Computing and Applications, vol. 16, pp. 295-306, 2007.
[31] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung, "Learning pattern classification tasks with imbalanced data set," in Pattern Recognition, P.-Y. Yin, Ed.: InTech, 2009.
[32] G. M. Weiss and F. Provost, "Learning when training data are costly: The effect of class distribution on tree induction," Journal of Artificial Intelligence
Research, vol. 19, pp. 315-354, 2003.
[33] P. J. G. Lisboa, A. Vellido, and B. Edisbury, Business applications of neural
networks: The state of the art of real world applications: World scientific, 2000.
[34] B. Widrow, D. E. Rumelhart, and M. A. Lehr, "Neural networks: Applications in industry, business and science," Communications of the ACM, vol. 37, pp. 93- 105, 1994.
[35] C. Charalambous, A. Charitou, and F. Kaourou, "Comparative analysis of artificial neural network models: Application in bankruptcy prediction," in
International Joint Conference on Neural Networks (IJCNN '99), 1999, pp.
3888-3893.
[36] L. Gang, B. Verma, and S. Kulkami, "Experimental analysis of neural network based feature extractors for cursive handwriting recognition," in Proceedings of
the 2002 International Joint Conference on Neural Networks (IJCNN '02),
2002, pp. 2837-2842.
[37] D. R. Hush, C. T. Abdallah, G. L. Heileman, and D. Docampo, "Neural networks in fault detection: A case study," in Proceedings of the 1997 American
Control Conference, 1997, pp. 918-921.
[38] Z.-H. Zhou and Y. Jiang, "Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble," IEEE Transactions on Information
Technology in Biomedicine, vol. 7, pp. 37-42, 2003.
[39] V. Garcia, R. A. Mollineda, and J. S. Sanchez, "A new performance evaluation method for two-class imbalanced problems," in Proceedings of the 2008 Joint
IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Orlando, Florida, 2008, pp. 917-925.
[40] S. Tang and S.-p. Chen, "Data cleansing based on mathematic morphology," in
The 2nd International Conference on Bioinformatics and Biomedical Engineering (ICBBE 2008), 2008, pp. 755-758.
[41] A. Miranda, L. Garcia, A. Carvalho, and A. Lorena, "Use of classification algorithms in noise detection and elimination," in Proceedings of the 4th
International Conference on Hybrid Artificial Intelligence Systems (HAIS '09),
Salamanca, Spain, 2009, pp. 417-424.
[42] F. Gargiulo and C. Sansone, "Improving performance of network traffic classification systems by cleaning training data," in Proceedings of the 2010
20th International Conference on Pattern Recognition, Istanbul: IEEE
Computer Society, 2010, pp. 2768-2771.
[43] T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "Supervised neural network modeling: An empirical investigation into learning from imbalanced data with labeling errors," IEEE Transactions on Neural Networks, vol. 21, pp. 813-830, 2010.
[44] J. V. Hulse, T. M. Khoshgoftaar, and A. Napolitano, "Experimental perspectives on learning from imbalanced data," in Proceedings of the 24th International
[45] S. C. Chen, S. W. Lin, T. Y. Tseng, and H. C. Lin, "Optimization of back- propagation network using simulated annealing approach," in IEEE
International Conference on Systems, Man and Cybernetics (SMC '06), 2006,
pp. 2819-2824.
[46] P. Venkatesan and S. Anitha, "Application of a radial basis function neural network for diagnosis of diabetes mellitus," Current Science, vol. 91, pp. 1195- 1999, 2006.
[47] S. G. Wu, F. S. Bao, E. Y. Xu, Y.-X. Wang, Y.-F. Chang, and Q.-L. Xiang, "A leaf recognition algorithm for plant classification using probabilistic neural network," in IEEE International Symposium on Signal Processing and
Information Technology, 2007, pp. 11-16.
[48] C. C. Fung, V. Iyer, W. Brown, and K. W. Wong, "Comparing the performance of different neural networks architectures for the prediction of mineral prospectivity," in Proceedings of 2005 International Conference on Machine
Learning and Cybernetics, 2005, pp. 394-398.
[49] D. Anderson and G. McNeill, "Artificial neural networks technology: A DACS state-of-the-art report," Kaman Sciences Corporation, 1992.
[50] J. Nazari and O. K. Ersoy, "Implementation of back-propagation neural networks with MatLab," School of Electrical Engineering, Purdue University,1992.
[51] R. Ball and P. Tissot, "Demonstration of artificial neural network in Matlab," Division of Nearshore Research, Texas A&M University, 2006.
[52] V. Vapnik and A. Lerner, "Pattern recognition using generalized portrait method," Automation and remote control, vol. 24, pp. 774-780, 1963.
[53] P. Kraipeerapun and C. C. Fung, "Comparing performance of interval neutrosophic sets and neural networks with support vector machines for binary classification problems," in The 2nd IEEE International Conference on Digital
Ecosystems and Technologies (DEST '08), 2008, pp. 34-37.
[54] L. Auria and R. A. Moro, "Support Vector Machines (SVM) as a technique for solvency analysis," DIW Berlin, German Institute for Economic Research, Discussion Papers of DIW Berlin No. 811, 2008.
[55] W. S. Noble, "What is a support vector machine?" Nature Biotechnology, vol. 24, pp. 1565-1567, 2006.
[56] P. Soda, "A hybrid approach handling imbalanced datasets," in Proceedings of
the 15th International Conference on Image Analysis and Processing (ICIAP '09), Vietri sul Mare, Italy, 2009, pp. 209-218.
[57] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, "Safe-level- SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem," in Lecture Notes in Computer Science. vol. 5476/2009, 2009, pp. 475-482.
[58] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser, "SVMs modeling for highly imbalanced classification," IEEE Transactions on Systems, Man, and
Cybernetics, Part B: Cybernetics, vol. 39, pp. 281-288, 2009.
[59] X.-Y. Wang, X.-X. Liang, and J.-Z. Sun, "An integrated approach of neural network and decision tree to classification," in Proceedings of 2005
International Conference on Machine Learning and Cybernetics, 2005, pp.
2055-2058.
[60] L. Rokach and O. Maimon, "Top-down induction of decision trees classifiers - a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C:
Applications and Reviews, vol. 35, pp. 476-487, 2005.
[61] S. K. Murthy, "Automatic construction of decision trees from data: A multidisciplinary survey," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.
[62] J. Fan and P. Wen, "Application of C4.5 algorithm in web-based learning assessment system," in International Conference on Machine Learning and
Cybernetics, 2007, pp. 4139-4143.
[63] C.-H. Cheng and Y.-S. Chen, "Fundamental analysis of stock trading systems using classification techniques," in International Conference on Machine
Learning and Cybernetics, 2007, pp. 1377-1382.
[64] J. Gu, Y. Zhou, and X. Zuo, "Making class bias useful: A strategy of learning from imbalanced data," in Proceedings of the 8th International Conference on
Intelligent Data Engineering and Automated Learning (IDEAL '07),
Birmingham, UK, 2007, pp. 287-295.
[65] X. Zhu, X. Wu, and Q. Chen, "Eliminating class noise in large datasets," in
Proceedings of the Twentieth International Conference on Machine Learning (20th ICML), Washington D.C., 2003, pp. 920-927.
[66] T. N. Phyu, "Survey of classification techniques in data mining," in Proceedings
of the International MultiConference of Engineers and Computer Scientists