• No se han encontrado resultados

Escalado de la tecnología de electrodescontaminación

1. Introducción

1.6. Escalado de la tecnología de electrodescontaminación

Many possible directions related to this research can be continued for future investigation. Several interesting issues are demonstrated as follows:

1. In this study, the experiments have focused only on the class noise rather than the attribute noise. More complicated problem domains associated with the attribute noise need to be explored such as the class imbalance problem with missing attribute values (attribute noise) in both binary classification and multi- class classification.

2. In the class imbalance problem, the optimal class distribution is still unknown. In this research study, the ratio between the minority class and the majority class after the data balancing applied is 1:1. This assumption is based on the research

studied by Weiss and Provost [32] and Visa and Ralescu [102]. They concluded that the best ratio of class distribution is near the balanced ratio 1:1 but it may not be optimal. The optimal class distribution could be different from data set to data set. However, in order to overcome the class imbalance problem more effectively, further studies are required to discover the optimal class distribution in different problem domains.

3. Lastly, as there are a number of re-sampling techniques proposed these days, a deterministic model based on the class ratio between the minority and the majority classes can be developed in order to select the most effective re- sampling technique suited to the different proportion of imbalanced data.

This study has explored the use of data cleaning and data re-distribution techniques to overcome the issues of noise and imbalanced data in classification problems. It is believed that the experimental studies and results from this research have contributed to the improvement of classification performance in practical applications. Although many other possible research directions are not included in this section, it is hoped that research studies of noise and imbalanced data problems will continue further in order to solve many other complex problems.

Appendixes

Appendix A: Attribute Information of Data Sets

I.

German Credit Data

[96]

No. Attribute Type Description

1 Status of existing checking account

qualitative A11 : ... < 0 DM A12 : 0 <= ... < 200 DM A13 : ... >= 200 DM /salary assignments for at least 1 year A14 : no checking account 2 Duration in month numerical

3 Credit history qualitative A30 : no credits taken/all credits paid back duly

A31 : all credits at this bank paid back duly

A32 : existing credits paidback duly till now

A33 : delay in paying off in the past A34 : critical account/other credits existing (not at this bank)

4 Purpose qualitative A40 : car (new)

A41 : car (used)

A42 : furniture/equipment A43 : radio/television A44 : domestic appliances A45 : repairs

A46 : education A47 : vacation A48 : retraining A49 : business

A410 : others

5 Credit amount numerical

6 Savings qualitative A61 : ... < 100 DM A62 : 100 <= ... < 500 DM A63 : 500 <= ... < 1000 DM A64 : .. >= 1000 DM

A65 : unknown/no savingsaccount

7 Employment

duration

qualitative A71 : unemployed A72 : ... < 1 year A73 : 1 <= ... < 4 years A74 : 4 <= ... < 7 years A75 : .. >= 7 years

8 Instalment rate numerical

9 Personal status qualitative A91 : male : divorced/separated A92 : female : divorced/separated/

married A93 : male : single

A94 : male : married/widowed A95 : female : single

10 Debtors qualitative A101 : none

A102 : co-applicant A103 : guarantor

11 Residence numerical

12 Property qualitative A121 : real estate

A122 : if not A121 : building society savingsagreement/life

insurance

A123 : if not A121/A122 : car or other, not in attribute 6 A124 : unknown/no property

13 Age numerical

14 Instalment plans qualitative A141 : bank A142 : stores

A143 : none

15 Housing qualitative A151 : rent

A152 : own A153 : for free

16 Existing credit numerical

17 Job qualitative A171 : unemployed/unskilled

- non-resident A172 : unskilled - resident A173 : skilled employee/official A174 : management/self-employed/ highlyqualified employee/

officer

18 Liable people numerical

19 Telephone qualitative A191 : none

A192 : yes, registered under the customer’s name 20 Foreign worker qualitative A201 : yes

A202 : no

21 Customer

classification

class 1: good credit risk 2: bad credit risk

II.

BUPA Liver Disorders

[96]

No. Attribute Type Description

1 Mean corpuscular volume (mcv) numerical mcv value of blood test 2 Alkaline phosphatase (alkphos) numerical alkphos value of blood

test

3 Alanine aminotransferase (sgpt) numerical sgpt value of blood test 4 Aspartate aminotransferase (sgot) numerical sgot value of blood test 5 Gamma-glutamyl transpeptidase

(gammagt)

numerical gammagt value of blood test

6 Drinks real Number of half-pint

beverages drunk per day.

7 Person classification class 1: healthy person

2: unhealthy persons who have liver disorder problem

III.

Johns Hopkins Ionosphere

[96]

No. Attribute Type Description

1-34 Received signal no. 1 - no. 35 real Received signal of the radar system from ionosphere 35 Classification of radar returns class g: good

b: bad

IV.

Pima Indians Diabetes

[96]

No. Attribute Type Description

1 Number of times pregnant

numerical

2 Plasma glucose concentration

numerical A two hours in an oral glucose tolerance test

3 Blood pressure numerical Diastolic blood pressure (mm Hg)

4 Triceps numerical Triceps skin fold thickness (mm)

5 Insulin numerical 2-Hour serum insulin (mu U/ml)

6 Body mass index real weight in kg/(height in m)2

7 Diabetes pedigree function

real

8 Age numerical

9 Diabetes classification class 0: negative for diabetes 1: positive for diabetes

V.

Haberman's Survival Data

[96]

No. Attribute Type Description

1 Age numerical Age of patient at time of operation

2 Year numerical Patient's year of operation

3 Axillary nodes numerical Number of positive axillary nodes detected 4 Survival status class 1 : the patient survived 5 years or longer

2 : the patient died within 5 year

VI.

SPECT Heart Data

[96]

No. Attribute Type Description

1-22 F1 - F22 binary Features that summarise the original

SPECT images

23 Overall diagnosis class 0: abnormal

1: normal

VII.

Balance Scale Data

[96]

No. Attribute Type Description

1 Left-Weight numerical Number between 1 to 5

2 Left-Distance numerical Number between 1 to 5

3 Right-Weight numerical Number between 1 to 5

4 Right-Distance numerical Number between 1 to 5

5 Class name class L: scale tip to the left

B: scale is balanced R: scale tip to the right

VIII.

Glass Identification Data

[96]

No. Attribute Type Description

1 RI real Refractive index

2 Na real Weight percent of Sodium in corresponding oxide

3 Mg real Weight percent of Magnesium in corresponding

oxide

4 Al real Weight percent of Aluminium in corresponding

oxide

5 Si real Weight percent of Silicon in corresponding oxide

6 K real Weight percent of Potassium in corresponding

oxide

7 Ca real Weight percent of Calcium in corresponding

oxide

8 Ba real Weight percent of Barium in corresponding oxide

9 Fe real Weight percent of Iron in corresponding oxide

10 Type of glass class 1: building windows float processed 2: building windows non-float processed 3: vehicle windows float processed 4: vehicle windows non-float processed (none in this database)

5: containers 6: tableware 7: headlamps

IX.

Yeast Data

[96]

No. Attribute Type Description

1 Mcg real McGeoch's method for signal

sequence recognition.

2 Gvh real Von Heijne's method for signal

sequence recognition.

spanning region prediction program.

4 Mit real Score of discriminant analysis of the

amino acid content of the N-terminal region (20 residues long) of

mitochondrial and non-mitochondrial proteins.

5 Erl binary Presence of "HDEL" substring

(thought to act as a signal for retention in the endoplasmic reticulum lumen).

6 Pox real Peroxisomal targeting signal in the C-

terminus.

7 Vac real Score of discriminant analysis of the

amino acid content of vacuolar and extracellular proteins.

8 Nuc real Score of discriminant analysis of

nuclear localisation signals of nuclear and non-nuclear proteins.

9 Sequence Name class CYT: Cytosolic or cytoskeletal

NUC: Nuclear MIT: Mitochondrial ME3: Membrane protein, no N- terminal signal

ME2: Membrane protein, uncleaved signal

ME1: Membrane protein, cleaved signal

EXC: Extracellular VAC: Vacuolar

POX: Peroxisomal

Appendix B: Data Preparation and Normalisation

In order to prepare the data for training learning algorithms, data in each attribute is required to be converted to numeric value and fall in the range zero to one. Firstly, all qualitative data must be converted to numeric value. For example, in German credit data, the data A201 and A202 of the foreign worker attribute can be converted to one and two respectively. After each qualitative data is converted to numeric value, the data normalisation for the required interval range is performed by the following equation [104].

newValue = originalValue – minimumValue maximumValue - mininumValue where

• newValue is the computed value falling in the [0,1] interval range.

• originalValue is the value to be converted.

• minimumValue is the smallest possible value for the attribute.

List of References

[1] G. P. Zhang, "Neural networks for classification: A survey," IEEE Transactions

on Systems, Man, and Cybernetics, vol. 30, pp. 451-462, 2000.

[2] S. B. Kotsiantis, "Supervised machine learning: A review of classification techniques," Informatica, vol. 31, pp. 249-268, 2007.

[3] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Machine learning, neural and

statistical classification. Chichester: Ellis Horwood, 1994.

[4] X. Zhu and X. Wu, "Class noise vs. attribute noise: A quantitative study,"

Artificial Intelligence Review, vol. 22, pp. 177-210, 2004.

[5] D. Guan, W. Yuan, Y. K. Lee, and S. Lee, "Identifying mislabeled training data with the aid of unlabeled data," Applied Intelligence, pp. 1-14, 2010.

[6] C.-M. Teng, "A comparison of noise handling techniques," in Proceedings of

the Fourteenth International Florida Artificial Intelligence Research Society Conference, Florida, USA, 2001, pp. 269-273.

[7] C. E. Brodley and M. A. Friedl, "Identifying mislabeled training data," Journal

of Artificial Intelligence Research, vol. 11, pp. 137-167, 1999.

[8] M. Pechenizkiy, A. Tsymbal, S. Puuronen, and O. Pechenizkiy, "Class noise and supervised learning in medical domains: The effect of feature extraction," in

Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS '06), 2006, pp. 708-713.

[9] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD

Explorations Newsletter, vol. 6, pp. 20-29, 2004.

[10] J. Laurikkala, "Improving identification of difficult small classes by balancing class distribution," in Proceedings of the 8th Conference on AI in Medicine in

Europe: Artificial Intelligence Medicine, London: Springer-Verlag, 2001, pp.

63-66.

[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence

[12] Q. Gu, Z. Cai, L. Zhu, and B. Huang, "Data mining on imbalanced data sets," in

International Conference on Advanced Computer Theory and Engineering (ICACTE '08), 2008, pp. 1020-1024.

[13] Y. Sun, A. K. C. Wong, and M. S. Kamel, "Classification of imbalanced data: A review," International Journal of Pattern Recognition and Artificial

Intelligence, vol. 23, pp. 687-719, 2009.

[14] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on

Knowledge and Data Engineering, vol. 18, pp. 63-77, 2006.

[15] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81- 106, 1986.

[16] V. Garcia, J. S. Sanchez, R. A. Mollineda, R. Alejo, and J. M. Sotoca, "The class imbalance problem in pattern classification and learning," in Taller de

Mineria de Datos y Aprendizaje (Tamida 2007), 2007, pp. 283-291.

[17] P. Kraipeerapun, C. C. Fung, and S. Nakkrasae, "Porosity prediction using bagging of complementary neural networks," in Proceedings of the 6th

International Symposium on Neural Networks on Advances in Neural Networks (ISNN '09), Wuhan, China, 2009, pp. 175-184.

[18] G. L. Libralon, A. C. P. d. L. F. d. Carvalho, and A. C. Lorena, "Pre-processing for noise detection in gene expression classification data," Journal of Brazilian

Computer Society, vol. 15, pp. 3-11, 2009.

[19] L. Daza and E. Acuna, "An algorithm for detecting noise on supervised classification," in Proceedings of The World Congress on Engineering and

Computer Science (WCECS 2007), San Francisco, USA, 2007, pp. 701-706.

[20] S. Verbaeten and A. Van Assche, "Ensemble methods for noise elimination in classification problems," in Proceedings of the 4th International Conference on

Multiple Classifier Systems, Guildford, UK, 2003, pp. 317-325.

[21] T. Fawcett and P. Foster, "Combining data mining and machine learning for effective user profile," in the 2nd International Conference on Knowledge

Discovery and Data Mining: AAAI Press, 1996, pp. 8-13.

[22] M. Kubat, R. C. Holte, and S. Matwin, "Machine learning for the detection of oil spills in satellite radar images," Machine Learning, vol. 30, pp. 195-215, 1998.

[23] Y.-M. Huang, C.-M. Hung, and H. C. Jiau, "Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem,"

Nonlinear Analysis: Real World Applications, vol. 7, pp. 720-747, 2006.

[24] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: Special issue on learning from imbalanced data sets," SIGKDD Explorations Newsletter, vol. 6, pp. 1-6, 2004.

[25] G. M. Weiss, "Mining with rarity: A unifying framework," SIGKDD

Explorations Newsletter, vol. 6, pp. 7-19, 2004.

[26] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions

on Knowledge and Data Engineering, vol. 21, pp. 1263-1284, 2009.

[27] R. C. Holte, L. E. Acker, and B. W. Porter, "Concept learning and the problem of small disjuncts," in Proceedings of the Eleventh International Joint

Conference on Artificial Intelligence, 1989, pp. 813-818.

[28] J. R. Quinlan, "Improved estimates for the accuracy of small disjuncts,"

Machine Learning, vol. 6, pp. 93-98, 1991.

[29] K. M. Ting, "The problem of small disjuncts: Its remedy in decision trees," in

Proceedings of the Tenth Canadian Conference on Artificial Intelligence, 1994,

pp. 91-97.

[30] K. Yoon and S. Kwek, "A data reduction approach for resolving the imbalanced data issue in functional genomics," Neural Computing and Applications, vol. 16, pp. 295-306, 2007.

[31] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung, "Learning pattern classification tasks with imbalanced data set," in Pattern Recognition, P.-Y. Yin, Ed.: InTech, 2009.

[32] G. M. Weiss and F. Provost, "Learning when training data are costly: The effect of class distribution on tree induction," Journal of Artificial Intelligence

Research, vol. 19, pp. 315-354, 2003.

[33] P. J. G. Lisboa, A. Vellido, and B. Edisbury, Business applications of neural

networks: The state of the art of real world applications: World scientific, 2000.

[34] B. Widrow, D. E. Rumelhart, and M. A. Lehr, "Neural networks: Applications in industry, business and science," Communications of the ACM, vol. 37, pp. 93- 105, 1994.

[35] C. Charalambous, A. Charitou, and F. Kaourou, "Comparative analysis of artificial neural network models: Application in bankruptcy prediction," in

International Joint Conference on Neural Networks (IJCNN '99), 1999, pp.

3888-3893.

[36] L. Gang, B. Verma, and S. Kulkami, "Experimental analysis of neural network based feature extractors for cursive handwriting recognition," in Proceedings of

the 2002 International Joint Conference on Neural Networks (IJCNN '02),

2002, pp. 2837-2842.

[37] D. R. Hush, C. T. Abdallah, G. L. Heileman, and D. Docampo, "Neural networks in fault detection: A case study," in Proceedings of the 1997 American

Control Conference, 1997, pp. 918-921.

[38] Z.-H. Zhou and Y. Jiang, "Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble," IEEE Transactions on Information

Technology in Biomedicine, vol. 7, pp. 37-42, 2003.

[39] V. Garcia, R. A. Mollineda, and J. S. Sanchez, "A new performance evaluation method for two-class imbalanced problems," in Proceedings of the 2008 Joint

IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, Orlando, Florida, 2008, pp. 917-925.

[40] S. Tang and S.-p. Chen, "Data cleansing based on mathematic morphology," in

The 2nd International Conference on Bioinformatics and Biomedical Engineering (ICBBE 2008), 2008, pp. 755-758.

[41] A. Miranda, L. Garcia, A. Carvalho, and A. Lorena, "Use of classification algorithms in noise detection and elimination," in Proceedings of the 4th

International Conference on Hybrid Artificial Intelligence Systems (HAIS '09),

Salamanca, Spain, 2009, pp. 417-424.

[42] F. Gargiulo and C. Sansone, "Improving performance of network traffic classification systems by cleaning training data," in Proceedings of the 2010

20th International Conference on Pattern Recognition, Istanbul: IEEE

Computer Society, 2010, pp. 2768-2771.

[43] T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "Supervised neural network modeling: An empirical investigation into learning from imbalanced data with labeling errors," IEEE Transactions on Neural Networks, vol. 21, pp. 813-830, 2010.

[44] J. V. Hulse, T. M. Khoshgoftaar, and A. Napolitano, "Experimental perspectives on learning from imbalanced data," in Proceedings of the 24th International

[45] S. C. Chen, S. W. Lin, T. Y. Tseng, and H. C. Lin, "Optimization of back- propagation network using simulated annealing approach," in IEEE

International Conference on Systems, Man and Cybernetics (SMC '06), 2006,

pp. 2819-2824.

[46] P. Venkatesan and S. Anitha, "Application of a radial basis function neural network for diagnosis of diabetes mellitus," Current Science, vol. 91, pp. 1195- 1999, 2006.

[47] S. G. Wu, F. S. Bao, E. Y. Xu, Y.-X. Wang, Y.-F. Chang, and Q.-L. Xiang, "A leaf recognition algorithm for plant classification using probabilistic neural network," in IEEE International Symposium on Signal Processing and

Information Technology, 2007, pp. 11-16.

[48] C. C. Fung, V. Iyer, W. Brown, and K. W. Wong, "Comparing the performance of different neural networks architectures for the prediction of mineral prospectivity," in Proceedings of 2005 International Conference on Machine

Learning and Cybernetics, 2005, pp. 394-398.

[49] D. Anderson and G. McNeill, "Artificial neural networks technology: A DACS state-of-the-art report," Kaman Sciences Corporation, 1992.

[50] J. Nazari and O. K. Ersoy, "Implementation of back-propagation neural networks with MatLab," School of Electrical Engineering, Purdue University,1992.

[51] R. Ball and P. Tissot, "Demonstration of artificial neural network in Matlab," Division of Nearshore Research, Texas A&M University, 2006.

[52] V. Vapnik and A. Lerner, "Pattern recognition using generalized portrait method," Automation and remote control, vol. 24, pp. 774-780, 1963.

[53] P. Kraipeerapun and C. C. Fung, "Comparing performance of interval neutrosophic sets and neural networks with support vector machines for binary classification problems," in The 2nd IEEE International Conference on Digital

Ecosystems and Technologies (DEST '08), 2008, pp. 34-37.

[54] L. Auria and R. A. Moro, "Support Vector Machines (SVM) as a technique for solvency analysis," DIW Berlin, German Institute for Economic Research, Discussion Papers of DIW Berlin No. 811, 2008.

[55] W. S. Noble, "What is a support vector machine?" Nature Biotechnology, vol. 24, pp. 1565-1567, 2006.

[56] P. Soda, "A hybrid approach handling imbalanced datasets," in Proceedings of

the 15th International Conference on Image Analysis and Processing (ICIAP '09), Vietri sul Mare, Italy, 2009, pp. 209-218.

[57] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, "Safe-level- SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem," in Lecture Notes in Computer Science. vol. 5476/2009, 2009, pp. 475-482.

[58] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser, "SVMs modeling for highly imbalanced classification," IEEE Transactions on Systems, Man, and

Cybernetics, Part B: Cybernetics, vol. 39, pp. 281-288, 2009.

[59] X.-Y. Wang, X.-X. Liang, and J.-Z. Sun, "An integrated approach of neural network and decision tree to classification," in Proceedings of 2005

International Conference on Machine Learning and Cybernetics, 2005, pp.

2055-2058.

[60] L. Rokach and O. Maimon, "Top-down induction of decision trees classifiers - a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C:

Applications and Reviews, vol. 35, pp. 476-487, 2005.

[61] S. K. Murthy, "Automatic construction of decision trees from data: A multidisciplinary survey," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.

[62] J. Fan and P. Wen, "Application of C4.5 algorithm in web-based learning assessment system," in International Conference on Machine Learning and

Cybernetics, 2007, pp. 4139-4143.

[63] C.-H. Cheng and Y.-S. Chen, "Fundamental analysis of stock trading systems using classification techniques," in International Conference on Machine

Learning and Cybernetics, 2007, pp. 1377-1382.

[64] J. Gu, Y. Zhou, and X. Zuo, "Making class bias useful: A strategy of learning from imbalanced data," in Proceedings of the 8th International Conference on

Intelligent Data Engineering and Automated Learning (IDEAL '07),

Birmingham, UK, 2007, pp. 287-295.

[65] X. Zhu, X. Wu, and Q. Chen, "Eliminating class noise in large datasets," in

Proceedings of the Twentieth International Conference on Machine Learning (20th ICML), Washington D.C., 2003, pp. 920-927.

[66] T. N. Phyu, "Survey of classification techniques in data mining," in Proceedings

of the International MultiConference of Engineers and Computer Scientists