In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information

(1)

Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party

websites are prohibited.

In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information

regarding Elsevier’s archiving and manuscript policies are encouraged to visit:

http://www.elsevier.com/copyright

(2)

ContentslistsavailableatSciVerseScienceDirect

Applied Soft Computing

jo u r n al h om epa g e : w w w . e l s e v i e r . c o m / lo c a t e / a s o c

MicroCBR: A case-based reasoning architecture for the classiﬁcation of microarray data

Juan F. De Paz

^a

, Javier Bajo

^a,∗

, Vicente Vera

^b

, Juan M. Corchado

^a

aDepartamentoInformáticayAutomática,UniversityofSalamanca,PlazadelaMerceds/n,37008Salamanca,Spain

bDepartamentodeEstomatologiaII,FacultaddeOdontologia,UniversidadComplutensedeMadrid,PlazadeRamonyCajal,s/n,CiudadUniversitaria,28040Madrid,Spain

a rt i c l e i n f o

Articlehistory:

Received13July2009

Receivedinrevisedform25July2011 Accepted14August2011

Availableonline22August2011

Keywords:

Case-basedreasoning HGU133

ESOINN

CLLleukemiaclassiﬁcation Decisiontree

a b s t r a c t

Microarraytechnologyprovidesagreatamountofdatarequiringanalysis,andtheuseofnewintelligent algorithmsthatidentifytherelevantinformationfortheclassificationprocesshasbecomeessential.This studypresentsaclassificationtoolcalledMicroCBRthatusesthecase-basedreasoningparadigmand incorporatesanovelfilteringtechniquebasedonstatisticalmethods,anewclusteringtechniquethat usesESOINN(EnhancedSelf-OrganizingIncrementalNeuronalNetwork),andaknowledgeextraction techniquebasedontheJ48algorithm.MicroCBRhasbeenappliedtoclassify91CLLpatientsandthe resultsobtainedareshowninthispaper.

1. Introduction

Theuseofmicroarrays,andmorespeciﬁcallyexpressionarrays, enablestheanalysisofdifferentsequencesofoligonucleotides[1,2].

Microarrayscanbeusedintheﬁeldofmedicinetodetectchanges atthegeneticlevel.Simplyput,amicroarrayisanarrayofprobes thatcontainsgeneticmaterialwithapredeterminedsequence.Pre- determinedsequencescanvaryaccordingtotheﬁeldinwhichthe studyisapplied;inthecaseofhumans,thesequenceswillcontain humangeneticmaterial.Thesesequencesarehybridizedwiththe geneticmaterialofpatients,thusallowingthedetectionofgenetic mutationsthroughtheanalysisofthepresenceorabsenceofcertain sequencesofgeneticmaterial.Thisstudywasconductedtoanalyze theintensitiesproducedduringthehybridization.Theanalysisof thisinformationallowsthedetectionofpatternsthatcharacterize certaindiseasesand,mostimportantly,thegenesassociatedwith thesedifferentdiseases.

Theanalysisofexpressionarraysiscalledexpressionanalysis.

Anexpressionanalysisbasicallyconsistsofthreestages:normal- izationandﬁltering;clusteringandclassiﬁcation;andextraction ofknowledge.Thesestagesarecarriedoutfromtheluminescence valuesfoundintheprobes.Presently,thenumberofprobescon-

∗Correspondingauthor.Address:Compa ˜nía5,37002Salamanca,Spain.

Tel.:+34639771985;fax:+34923277101.

E-mailaddresses:[email protected](J.F.DePaz),[email protected](J.Bajo), [email protected](V.Vera),[email protected](J.M.Corchado).

tainingexpressionarrayshasincreasedconsiderablytotheextent thatithasbecomenecessarytousenewmethodsandtechniques toanalyzetheinformationmoreefficiently[3].Itisnecessaryto developnewtechniquestoanalyzelargevolumesofdata,extract therelevantinformation,anddeletetheinformationwhichhasno relevancetotheclassificationprocess.Moreover,theknowledge obtainedduringtheclassificationprocessisofgreatimportance forsubsequentclassifications.Therearevariousartificialintelli- gencetechniquessuchasartificialneuralnetworks[4,5],Bayesian networks [6], and fuzzy logic [7] which have been applied to microarrayanalysis.Whilethesetechniquescanbeappliedatvari- ousstagesofexpressionanalysis,theirintegrationintosystemsthat combinevariousartificialintelligencetechniques,suchasCBR,is ofparticularinterest.Previousresearch[47]hasstudiedthecapac- ityofdifferentclassifiersinthefieldofsequencing;however,the systemonlyusesnewclassificationsinthelearningphase,andcon- sequentlydoesnottakeintoaccounttherelevanceofthechanges madebyexpertanalysis.ThispaperpresentsMicroCBR,asystem basedonCBR(case-basedreasoning)whichusespastexperiences tosolvenewproblems[8].Assuch,itisperfectlysuitedforsolving theproblemathand.Inaddition,CBRmakesitpossibletoincor- poratethevariousstagesofexpressionanalysisintothereasoning cycleoftheCBR,thusfacilitatingthecreationofstrategiessimi- lartotheprocessesfollowedinmedicallaboratories.Therecovery ofinformationfrompreviousanalysessimplifiestheclassification processbydetectingandeliminatingrelevantandirrelevantprobes detectedinpreviousanalyses.Thisallowsforalearningprocess thatcannotonlyincorporatenewcases,butalsoselecttherelevant informationfromcasesthathavebeenreviewedbyanexpert.

doi:10.1016/j.asoc.2011.08.021

(3)

Forsometimenow,wehavebeenworkingontheidentification oftechniquestoautomatethereasoningcycleofseveralCBRsys- temsasappliedtocomplexdomains[9].TheobjectiveofMicroCBR istodevelopaCBRforclassifyingpatientsbasedoninformation obtainedfrommicroarrays.Thissystemisappliedtotheclassi- ficationof subtypes ofleukemia,specifically,todetectpatterns andextractsubgroupswithintheCLLtypeofleukemia.MicroCBR incorporatesvarioustechniquesofartificialintelligence(filtering, clustering,artificialneuralnetworksandextractionofknowledge) atdifferentstages ofthereasoningcycle.In theretrievephase, newpre-processing andfilteringtechniquesareincorporatedin ordertoselecttheprobeswithrelevantinformationforclassifying patients.Thisinnovativemethodiscapableofobtainingafiltering processthatincorporatesmultiplestatisticaltechniquesbasedon hypothesiscontrasts,whichnotablyreducesthedimensionalityof thedata.Reducingthedimensionalityofthedatamakesitpossible tousetechniqueswithgreatercomputationalcomplexityinlater stagesoftheCBRcycle,whichwouldotherwisebeunviable.The reusestageincorporatesaclassificationtechniquebasedonneural networkESOINN[10].Thistechniqueproposesanovelmethodfor generatingclustersandforidentifyingthenearestclusterforthe finalclassification.ESOINNperforms clusteringbyincorporating boththedistributionprocessoftheentiresurfaceofclassification, andtheseparationbetweengroupswithlowdensityamongthem.

AnadditionalgroupingtechniqueknownasPAM[11](partition aroundmedoids),isalsoused,resultinginamoreaccurateclas- sification,sincetheresultssuggestedbytheESOINNnetworkare comparedtothoseobtainedusingthePAMtechnique.Oncethe classificationprocessisfinished,therevisephaseinitiatesatech- niqueforextractingknowledgeJ48[12]thatcanextractrulesto explainthedecisionstakenintheretrieveandreusestages.More- over,therevisestageincludesaMDS(MultidimensionalScaling) technique[13–15]forpresentinginformationinlowdimension- ality.Ahumanexpertanalyzesthisinformationandevaluatesthe proposedclassificationaswellasthevalidityoftherulesgenerated.

Finally,intheretainstage,ifthehumanexpertconsidersthepro- posedsolutionvalid,thesystemstoresthecaseinformationand therulesthat havebeen obtained.In this way,MicroCBRtakes advantageoftheinformationfrompreviouslyclassiﬁedpatients, andfromtherelevantandirrelevantprobes,andrulesfromthe previouslyappliedknowledgeextraction.

Thepaperisstructuredasfollows:thenextsectionbrieflyintro- ducestheproblemthatmotivatesthisresearch.Section3presents MicroCBRanddescribesthenovelstrategiesincorporatedinthe stagesoftheCBRcycle.Section4describesacasestudyspecifically developedtoevaluatetheCBRsystempresentedwithinthisstudy, consistingofaclassificationofCLLleukemiapatients.Finally,Sec- tion5presentstheresultsandconclusionsobtainedaftertesting themodel.

2. Microarraydataanalysis

Microarrayhasbecomeanessentialtoolingenomicresearch, making it possible to investigate global gene expression in all aspects of humandisease [16]. Microarray technologyis based onadatabaseofgenefragmentscalledESTs(ExpressedSequence Tags), which are used to measure target abundance using the scannedﬂuorescenceintensitiesfromtaggedmoleculeshybridized toESTs[17].Speciﬁcally,theHGU133plus2.0[3]arechipsused forexpressionanalysis.Thesechipsanalyzetheexpressionlevel of over 47,000 transcripts and variants, including 38,500 well- characterizedhumangenes.Itiscomprisedofmorethan54,000 probesetsand1,300,000distinctoligonucleotidefeatures.TheHG U133plus2.0providesmultiple,independentmeasurementsfor eachtranscript.Theuseofmultipleprobesprovidesacompletedata

setwithaccurate,reliable,reproducibleresultsfromeveryexper- iment. Microarray technology is a critical element for genomic analysisandallowsanin-depthstudyofmolecularcharacteriza- tionofRNAexpression,genomicchanges,epigeneticmodiﬁcations orprotein/DNAunions.

Expressionarrays[3]areatypeofmicroarraythathavebeen usedindifferentapproachestoidentifythegenesthatcharacter- izecertaindiseases[7,18,19].Inallcases,thedataanalysisprocess isessentiallycomposedofthreestages:normalizationandfilter- ing,clustering,andclassification.Thefirststepiscriticaltoachieve bothagoodnormalizationofdataandaninitialfilteringtoreduce thedimensionalityofthedatasetwithwhichtowork[20].Since theproblemathandisworkingwithhigh-dimensionalarrays,itis importanttohaveagoodpre-processingtechniquethatcanfacil- itateautomaticdecision-makingaboutthevariablesthatwillbe vitalfortheclassificationprocess.Inlightofthesedecisionsitwill bepossibletoreducetheoriginaldataset.Moreover,thechoice ofaclusteringtechniqueallowsdatatobegroupedaccordingto certainvariablesthatdominatethebehaviourofthegroup.After organizing thedataintogroups it ispossibletoextract knowl- edgeandclassifypatientswithinthegroupthatpresentsthemost similarities.

Case-based reasoning [9] is particularly applicable to this problemdomainbecauseit(i)supportsarichandevolvablerep- resentationofexperiences,problems,solutionsandfeedback;(ii) providesefﬁcientandﬂexiblewaystoretrievetheseexperiences;

and(iii)appliesanalogicalreasoningtosolvenewproblems[21].

CBRsystems canbeusedtoproposenewsolutions orevaluate solutions toavoidpotentialproblems.Theresearchin[22]sug- gests that analogical reasoning is particularlyapplicable tothe biological domain, in partbecausebiological systems are often homologous(rootedinevolution).Moreover,biologistsoftenusea formofreasoningsimilartoCBR,whereexperimentsaredesigned andperformedbasedonthesimilaritybetweenfeaturesofanew systemandthoseofknownsystems.In[23]amixtureofexperts forcase-basedreasoning(MOE4CBR)isproposed.Itisamethod thatcombinesanensembleofCBRclassifierswithspectralclus- teringandlogisticregression,butdoesnotincorporateextraction of knowledgetechniquesand doesnot focusondimensionality reduction.Someapproachessuchas[9]provideCBRsolutionsand knowledgeextractiontechniques,facilitatingthecomprehension oftheclassificationprocess.ThispaperpresentsMicroCBR,aCBR solutionwhichalsoincorporatesnewknowledgeextractiontech- niques, but additionallyfocuses onthe definitionof innovative strategiesfordimensionalityreductionandclustering.Thefollow- ingsectionpresentsadetailedaccountoftheCBRsystemproposed inthiswork.

3. CBRsystemforclassifyingmicroarraydata

ThissectionpresentstheCBRsystemproposedinthecontext ofthisresearchandprovidesaclassiﬁcationtechniquebasedon previousexperiencesfordatafrommicroarrays.TheCBRdeveloped systemimitatesthebehaviourofhumanexpertsinthelaboratory andincorporatesinnovativeknowledgediscoverytechniques.The systemreceivesdatafromtheanalysisofchipsandisresponsible forclassifyingindividualsbasedonevidenceandexistingdata.

ThepurposeofCBRistosolvenewproblemsbyadaptingsolu- tionsthathavebeenusedtosolvesimilarproblemsinthepast[8].

TheprimaryconceptwhenworkingwithCBRsistheconceptof case.Acasecanbedeﬁnedasapastexperience,andiscomposedof threeelements:aproblemdescriptionwhichdescribestheinitial problem;asolutionwhichprovidesthesequenceofactionscar- riedoutinordertosolvetheproblem;andtheﬁnalstagewhich describesthestateachievedoncethesolutionwasapplied.CBR

(4)

managescases(pastexperiences)tosolvenewproblems.Theway casesaremanagedisknownastheCBRcycle,andconsistsoffour sequentialstepswhicharerecalledeverytimeaproblemneedsto besolved:retrieve,reuse,reviseandretain.EachstepoftheCBRlife cyclerequiresamodelormethodinordertoperformitsmission.

IntheCBRsystemproposedwithinthisstudy,theretrievephase filtersvariables,andrecoversimportantvariablesfromthecasesto determinethemostinfluentialfortheclassification.Oncethemost importantvariableshavebeenretrieved,thereusephasebegins adaptingthesolutionsfortheretrievedcasestoobtainthecluster- ing.Oncethisgroupingisaccomplished,thenextstepisknowledge extraction.Therevisephaseconsistsofanexpertrevisionforthe proposedsolution,andfinally,theretainphaseallowsthesystemto learnfromtheexperiencesobtainedinthethreepreviousphases, consequentlyupdatingthecasesmemory.

AkeyelementinaCBRsystemisacase,whichcanbedefined asapastexperience[8]andiscomposedofthreeelements:prob- lemdescription,problemsolution,andthefinalstateobtainedafter applyingthesolution.AcaseinMicroCBR contains information relatedtothepatient,therules,theproposedclassification,and theprobesmarkedasirrelevantorimportant.Thecaseisdefined bythefollowingexpression:

i_j=(id,S=(A₁,...,A_n),C^p,C^r)

wherei_j∈IandI={i₁,...,i_S}isthesetofindividuals/cases,Aisthe setofalltheprobes,A_irepresentstheprobei,C^pisthepredicted classandC^rtheactualclass.

In addition tothecases memory, oursystemincorporates a memoryofrulesthatcontainstheinformationextractedthrough theknowledgeextractiontechniques.Thememoryofrulesisstruc- turedasfollows:

R={r1,...,rl}, with r_i=(l1∧...∧lm)→c_j

where ls=(dts,os,)/dts∈Dt,os∈O,S_Irr⊆A whereRisthesetofrulesfromthedecisiontree,l_scontainsasetof discretizedprobes,anoperatorandarealvalue,Dtisthediscretiza- tionvaluefortheprobeAt,O={=, =/,>,<,≤,≥}operator,andSIrr

isthesetofprobesmarkedasirrelevant,cj∈C^p.

Whenanewcaseisclassiﬁed,anewdecisiontreeisgenerated intherevisestage.Asetofrulesareextractedfromthedecision treewhichprovideknowledgeabouttherelevanceoftheprobes intheclusteringandclassiﬁcationprocess.Fig.1showsascheme ofthetechniquesappliedinthedifferentstagesoftheCBRcycle.

AsseeninFig.1,theimportantprobesthatallowtheclassifica- tionofpatientsarerecoveredintheretrievephase.Theretrieve phaseisdividedinto6sub-phases:pre-processingthroughRMA, removalofirrelevantvariables,uniformdistribution,probeswith- outmeaningfulcut-offpoints,andcorrelatedvariables.Inthereuse phasethepatientsaregroupedbymeansofanESOINNneuralnet- work.Then,thepatientswithnopriorclassificationareassigned toagroupusingthenearestcluster.IntherevisephasetheJ48 [12]techniqueisappliedforextractingknowledgeaboutthemost importantprobes forthe classification,and theMDS technique [13–15]isusedtomakearepresentationinlowdimensionality.

Finally,intheretainphase,theknowledgeisupdated.Thisknowl- edgeincludesthecaseclassiﬁcation,thedecisionrulesobtained, andtheinformationassociatedwiththeimportanceorirrelevance ofcertainprobesextractedfromtherules.

Fig.1showstheschemeoftheCBRsystemproposedwithinthis study.Fig.2detailsthestepsfollowedineachofthestagesofthe CBRcycle.ThestructureoftheMicroCBRwillnowbeexplainedin detail,presentinginnovativetechniquesmodelledineachofthe stagesoftheCBR.

Fig.1.StructureoftheproposedMicroCBRmodel.

3.1. Retrieve

Traditionally,onlythecasessimilartothecurrentproblemare recovered,oftenbecauseofperformance,andthenadapted.With regardtoexpressionarray,thenumber ofcasesisnotacritical factor,rather thenumberofvariables. Forthis reason,wehave incorporatedaninnovativestrategywherevariablesareretrieved atthisstageandthen,dependingontheidentifiedvariables,therest ofthestagesoftheCBRarecarriedout.Thenewstrategyallowsa notablereductioninthedimensionalityofthedata.Thefiltering phaseiscarriedoutonItogetherwiththenewcaseis.Thefiltering isonlyappliedtothoseprobesnotassociatedwithanyoftherules.

First,apre-processingofthedataisconductedusingRMA.Then, the5ﬁlteringsub-phasesareexecuted:removalofcontrolprobes, removaloferroneousprobes,removaloflow variabilityprobes, removalofprobeswithauniformdistribution,andremovalofcor- relatedprobes.Theseﬁvesub-phasesareoutlinedinthefollowing paragraphs.

3.1.1. RMA

Thisphasebeginsoncethelaboratoryexperimentwithmicroar- rayshasbeencompleted.Theresearcherobtainsvariousﬁlesthat contain grossintensityvalues. Prior toanalyzing thedata, itis important to complete the pre-processing phase, which elimi- nates defective samples and standardizes the data. As seen in Fig. 2, this phaseis normally divided into 3 sub-phases:back- groundcorrection,standardization,andsummarization.Thereis currently a limited group of algorithms that investigators use for performingthesesteps.ThemostcommonareMAS5.0[24]

(MicroarrayAffymetrixSuite5.0),PLIER[25](ProbeLogarithmic IntensityError),andRMA)[26](RobustMulti-arrayAverage).

The RMA [26] algorithm is a method for normalizing and summarizingprobe-levelintensitymeasurements.Itanalyzesthe valuesforthePM(Perfect-Match):intheﬁrststep,aBackground Correctioniscarriedouttoremovethenoisefromtheaveragesof thePM;inthesecondstep,thedataisquantilenormalizedinorder tocomparedatafromdifferentmicroarrays;ﬁnally,asummariza- tionismadeandthevaluesforeachprobe-setaregenerated.

(5)

Fig.2. DetailedarchitecturefortheMicroCBRmodel.

3.1.2. Irrelevantprobes

Oncethecontrol andtheerroneous probeshave beenelim- inated, the filtering process begins. The first step consists of eliminatingtheprobesmarkedasirrelevantinpreviousexecutions oftheCBRcycle.Thisway,allprobesthatcanpassthefiltering phase,butarepronetocauseerroneousresultsduringthereuse phase,areremoved.

3.1.3. Variability

Thesecondstageistoremovetheprobesthathavelowvariabil- ity.Thisworkiscarriedoutaccordingtothefollowingsteps:

1.Calculatethestandarddeviationforeachoftheprobesj

_·j=+

¹

n

j=1

(¯_·j−x_ij)² (1)

wherenisthetotalnumberofcases,¯_·_jistheaveragepopula- tionforthevariablej,andx_ijisthevalueoftheprobejforthe individuali.

2.Standardizetheabovevalues z_i= _·j−

(2)

where =(1/n)

n

j=1_·_j and _·_j=+

(1/n)

n

j=1(¯_·_j−x_ij)² wherez_i≡N(0,1)

3.Discardprobesforwhichthevalueofzmeetsthefollowingcon- dition:z<−1.0.Thiswillachievetheremovalofabout16%ofthe probesifthevariablefollowsanormaldistribution.

3.1.4. Uniformdistribution

Finally,allremainingvariablesthatfollowauniformdistribu- tionareeliminated.Thevariablesthatfollowauniformdistribution willnotallowtheseparationofindividuals.Therefore,thevariables thatdonotfollowthisdistributionwillbereallyusefulvariables in theclassiﬁcationof thecases.Thecontrastofassumptionsis explainedbelow,usingtheKolmogorov–Smirnov[27]testasan example.H0:thedatafollowauniformdistribution;H1:theana- lyzeddatadonotfollowauniformdistribution.Statisticalcontrast:

D=max{D⁺,D⁻} (3)

where D⁺=max

1≤i≤n{(i/n)−F₀(x_i)}, D⁻=max

1≤i≤n{F₀(x_i)−((i−1)/n)} withiasthepatternofentry,nthenumberofitemsandF₀(x_i)the probabilityofobservingvalueslessthaniwithH₀beingtrue.The valueofstatisticalcontrastiscomparedtothenextvalue:

D˛= C˛

k(n) (4)

Inthespecialcaseofuniformdistributionk(n)=√

n+0.12+ (0.11/√

n)andalevelofsigniﬁcance˛=0.05,C˛=1.358.

3.1.5. Cut-offpoints

Thisstepremovestheprobesthat,despitenotfollowingauni- formdistribution,havenoseparationbetweenelements,anddo notallowtheelementstobepartitioned.Thewaytoremovethe probesistodetectchangesinthedensitiesofthedata,andtoselect theﬁnalprobes.Theprobesinwhichcut-offsorhighdensitiesare notdetectedareeliminated,astheydonotprovideusefulinfor- mationtotheclassiﬁcationprocess.Thiswillkeeptheprobesthat allowtheseparationofindividuals.Thedetectionoftheseparation intervalsisperformedbycalculatingthedistancebetweenadjacent

(6)

Select a probe

Sort the values

Remove elemets

< threshold Uniform distribution

Confidence

interval Percentil

Cluster Cluster

More probes

yes no

No

yes 1, 2

3, 4

5

6, 7

8 10

10

11

12

Fig.3.Cut-offpointsoverview.

individuals.Oncethedistanceiscalculated,itispossibletodeter- minethepotentiallyrelevantvalues.Theselectioniscarriedoutby applyingconﬁdenceintervalsforthevaluesofthesedifferencesif thevaluesfollowauniformdistribution,orbyselectingthevalues aboveacertainpercentileifthevaluesdonotfollowanormaldis- tribution.ThediagraminFig.3illustratesthedetectionprocessfor cut-offpoints.

Thisprocessisformalizedasfollows:

1.LetIbethesetofindividualswithﬁlteredprobestogetherwith thenewindividual,wherex_·_jrepresentstheprobejforallthe individuals,andx_ijtheindividualifortheprobej

2.Selecttheprobej=1,x_·j

3.Sortinincreasingordervaluesx_·j

4.Calculatethevalueforx_ij=x_i₊_1j−x_ij

5.Determineifthevariablex_ijfollowsauniformdistributionby meansoftheShapiro-Wilktest[28],otherwisegotostep10.

6.Calculatethevaluefor ¯x_.j

7.Establish the conﬁdence interval for the variance, which is established as ²_·j∈[((n−1)·S²)/(²_n

−1,1−˛/2),((n−1)· S²)/²_n

−1,˛/2] with ˛=0.05 and n=#x_·j and the number of elementsforx_·j,S²isthesamplingvariance.

8.Establishthesetofelementsformx_ijnotbelongingtotheset Q_j={x_ij/x_ij >I⁺

·j}whereI⁺

·j

isthevalueoftheupperlimitof theconﬁdenceinterval.

9.Gotostep11.

10.SelectthosevaluesuptothepercentileP_˛fromeveryx_·jand establishthesetQ_j={x_ij/x_ij>P˛}

11. Selecttheprobej+1inthecaseofmoreprobesneedingrevision andgotostep2.

12.CreatethenewsetofprobesI=∪x_·_j/∃^x_ij∈Q_j/i>#x·u∧i<

#x−#x·u.Whereuisaconstantbetween[0,0.5].

13.Finalizeandreturnthenewsetofindividualswiththeﬁltered probesI.

3.1.6. Correlations

Atthelaststageoftheﬁlteringprocess,correlatedvariablesare eliminatedsothatonlytheindependentvariablesremain.Tothis end,thelinearcorrelationindexofPearsoniscalculatedandthe probesmeetingthefollowingconditionareeliminated.

rxiy_·j<˛ (5)

given ˛=0.95, rx_·_iy_·_j=(x_·_ix_.j/x_·_ix_·_j), _x_·i_x_·_j=(1/N)

n s=1(¯_.i− x_si)( ¯u_.j−x_sj)wherex_·ix.jisthecovariancebetweenprobesiand j.

3.2. Reuse

Inthisphasetheclusteringofindividualsiscarriedout,along withtheclassiﬁcationofnewindividualstooneoftheclusters.

Thereareseveralalgorithmsforclustering,butthemostcommon arethehierarchicalalgorithms[29]andthosebasedonpartition- ing[11].Withinthehierarchicalalgorithmsthemostcommonis the dendrogram[29]. Among thepartition-basedmethods it is possibletoﬁndalternativesbasedonRNAssuchasSOM[30](Self- OrganizingMap),GNG[31](GrowingNeuralGas)orSOINN[32]

(Self-Organizing IncrementalNeuronalNetwork).Otheralterna- tivesincludek-means[33]orPAM[11].Thesetwomethodsdeﬁne aseriesofinitialclusterswhichcorrespondtoanewindividualor toexistingindividuals,andaremarkedastheclusterrepresenta- tives,whiletheremainingindividualsareallocatedtothenearest cluster.Theproblemwitheachofthesemethodsisthattheydonot considerchangesinthedistributionofdensitiesofindividuals,and usuallydonotdetectclusterswithatypicalforms,suchaselongated clusters.Analternativetoclustermethodscouldbethedirectappli- cationofclassiﬁers,includingoptimization-basedfunctionssuchas SVM[48],decisiontrees,decisionrules,ordiffusemethodssuchas thek-nearestneighbour.

3.2.1. ESOINNneuralnetwork

NeuralNetworksbasedonGNGcandetectclusterswithatyp- icalforms,adjustiterativelytothedistributionoftheindividuals, anddetectlowdensityzones.TherearevariantsoftheGNG,such as the GCS [34] (Growing Cell Structure) or SOINN [32] (Self- OrganizingIncrementalNeuronalNetwork).Unlikeself-organizing mapsbasedonmeshes,GrowingGridorGCSdonotsetthenum- berofneurons,orthedegreeofconnectivity,buttheydoestablish thedimensionalityofeachmesh.Thiscomplicatestheseparation phasebetweengroups oncetheneurons aredistributed evenly acrossthe surface.The ESOINNneuralnetwork [10] (Enhanced Self-OrganizingIncrementalNeuronalNetwork)isavariation of theSOINNneuralnetwork [32],which allowsthecreation ofa singlelayer,whileESOINNisabletoincorporateboththedistri- butionprocessalongthesurfaceandtheseparationbetweenlow densitygroups.Thelearningprocessofthenetworkisdistributed intotwostages:theﬁrststageofcompetitionCHL[34](Competi- tiveHebbianLearning)wheretheclosestnodetotheinputpattern isselected;andthesecondadaptation/growingstagesimilartoa GNG.Thetrainingphaseandthevariousalgorithmsappliedatevery modiﬁedstageareoutlinedbelow:

1.Updatetheweightsofneuronsbyfollowingaprocesssimilarto theSOINN,butintroducinganewdeﬁnitionforthelearningrate inordertoprovidegreaterstabilityforthemodel.Thislearning

(7)

ratehasproducedgoodresultsinothernetworkssuchasSOM [36].

W_a₁=n₁(M_a₁)(−W_a₁)

Wa_i=n₂(Ma₁)(−Wa_i) witha_i∈Na₁ (6) wheren1(x)=(1/√

x),n2(x)=(1/

2+x²),a_iisneuroni,is theinputpattern,Ma1isthenumberofwinningsofneuronai, Na₁isthesetofneighboursofa_i.

2.Deletetheconnectionswithhigherage.Theagesarestandard- izedandthosewhosevaluesareintheregionofrejectionwith k>0areremoved.Theassignedvalueof˛is0.05,therefore z_i= e_i−

,z≡N(0,1) thenf(z)= 1

√2exp −z² 2

(7) where P(z<k)=˛/2→P(z<k)=0.975→(z)=0.975, k=1.96.

Thereforeallzvaluesthataregreaterthan1.96aredeleted.

3.OnceallinputpatternshavebeenintroducedthenaKS-Test[27]

iscarriedoutinordertodetermineifthedensitydistributionfor theneuronsineachgroupfollowsanormaldistribution.Ifso, thelearningprocedureisﬁnished;otherwisethenextpatternis processed.Thevalueof˛chosenis0.05.

Oncethecaseshavebeendistributedinthemeshes,itisnec- essarytoassigneach ofthemeshestoa class accordingtothe followingprocedure:LetIbethesetofindividualsoncetheprobes havebeen filteredand GÊ theset ofclusterscreatedby means oftheESOINNneuralnetwork,definedasGÊ={gÊ/gÊ⊆I},where g_iÊ∩gÊ_j = ,∀ⁱ=/j with g_iÊ,g_jÊ∈GÊ. Let C be the set of existing classesfortheindividualswherec_j∈Cistheclassjintheset.Wecan saythatthemeshi,gÊ_i belongstoaclassj,gÊ_i∈c_jandisrepresented asg_iÊ

cjwhen

max_c_j_∈_C#Ic_j/g_i^E

#Ic_j ·#I_c_j

#I (8)

whereI_c_j ={s∈I/s∈c_j}andI_c_j/g^E_i isthesetofindividualsfromc_j restrictedtothegroupg_i^E.

ThesetofmeshesbelongingtotheclassisdenotedasG^E_c_jandis deﬁnedbytheexpression(9).

G^E_c_j=

g^E icj∈G^E

g^E_i

cj (9)

3.2.2. PAM

ThePAMalgorithm[11]isexecutedinparalleltotheclustering inordertofacilitateacomparisonoftheresultsobtained.Theclas- siﬁcationmadebybothmethods,PAMandESOINN,generatesan equivalenceindexbetweenthetwomethodsthatdeterminesthe consistencyofthereusephase.ThealgorithmusedforPAMisas follows:

1.Selectthenumberofclustersdependingon#C.

2. Themetricusedforthedistanceisthesameastheoneusedin theESOINNnetwork

3.Classifythepatientstakingallofthevariablesintoaccount,with- outanyﬁlteringG^P={g^P/g^P⊆I}withg^P_i ∩g_j^P= ,∀ⁱ=/j 4. OncethegroupsG^Parecreated,anassignationismadefollowing

theprocedureindicatedin(8).

3.2.3. Equivalenceindex

OncetheindividualshavebeenclassiﬁedusingboththePAM andtheESOINNneuralnetworks,theequivalenceindexforboth

Table1 Numberofprobes.

Uniform Cut-off Correlations

CBR 1311 618 541

methodseqiscalculated,andtheerrorratefortheESOINNnet- workisdeterminedasafunctionofthepre-classiﬁedcases.The equivalenceindexisdeﬁnedasindicatedin(10):

eq= #{i∈I/i∈c^E_j ∧i∈c^P_j}

#I (10)

wherec_j^Erepresentsthesetofmeshesbelongingtoclassjthrough the ESOINN network and c_j^P represents the set of individuals belongingtoclassjthroughthePAMalgorithm.

3.2.4. Classiﬁcation

Oncethemeshesaregeneratedbytheclusteringprocess,pre- viously unclassified individuals are now classified by selecting thenearestmesh.Whenthemeshhasbeenselected,thecaseis assignedtotheclassofthemeshselected.Theassignmentisdefined as(11).

is∈G^E_k

cj→iu∈C_j (11)

whereiuistheunclassiﬁedindividual,isistheindividualclosestto theindividualandiucalculatedusingtheEuclideandistance.

Fig.2describesthefunctioningofthereusephase.Asshown inFig.2,thereusestagereceivesthefilteredandnot-filtereddata resultingfromtheretrievestageasinputs.Theinputisusedforboth theESOINNneuralnetworkandthePAMtechnique.TheESOINN neural network generates a setof groups assigned todifferent classes.Thesegroupsarecomposedofmeshescontainingdifferent elementstogetherwiththeinformationofthepreviousclassifica- tion.ThePAMtechniquerepeatsthesameprocessconcurrentlyand generatesthegroupsforeachoftheclasses.Thegroupsgenerated byPAMcontaintheindividualsandtheirpreviousclassification, butdonotconsidersub-groups.Finally,theequivalenceindexis calculatedandthenewpatientisclassified.Theerrorrateforthe ESOINNnetworkismadethrough(8) todeterminetheclassfor eachofthegroups.

3.3. Revise

AsshowninFig.1,therevisioniscarriedoutbyahumanexpert whodeterminesifthegroupassignedbythesystemiscorrect.To facilitatethehumanexperttask,theequivalenceindexandthe errorratewerecalculatedinthereusestage.It isimportantfor themedicalhumanexperttounderstandtheclassificationprocess performedinthetwopreviousstages.Forthisreason,MicroCBR providesaknowledgeextractionmethodintheRevisephase.This methodanalysesthestepsfollowedintheretrieveandreusestages, and extractsknowledge whichisthenformalizedin rules.Con- sequentlythehumanexpertcaneasilyevaluatetheclassification andextractconclusionsconcerningtheefficiencyoftheclassifica- tionprocess.Theknowledgeextractionphasedetectsanomalous classifications,sinceitaccountsfortheexistenceofprobeswith irrelevantinformation,orthosethatweredecisiveinthemisclassi- fication.Forexample,thereareprobeswhichseparateindividuals aseithermenorwomen.Similarly,thehumanexpertcanmark importantprobesthatarenoteliminatedintheretrievestagedur- ingthefiltering.

Theextractionofknowledgethatis presentedtothehuman expertiscarriedoutusingtheJ48algorithm[12].TheJ48algo- rithmistheJavaimplementationoftheC4.5algorithm,anevolution

(8)

Fig.4. Densityplotandhistogramforprobes1552280 atand219703atthatdonotmeettheuniformitytestandinwhichcut-offpointsweredetected.

Table2

Classiﬁcationerrors.

Type1 Type2 Type3

Total 46 12 33

Predicted/Erroneous 47/5 9/0 35/4

oftheoriginalID3[37],whosemainadvantageisthatitincorpo- ratesnumericalattributesintothelogicaloperationscarriedoutin thetestnodes.Thereareotheralternativesforgeneratingdecision ruleswhichoperatesimilartothedecisiontrees,includingRIPPER [38]andPART[39].Thesemethodsextractsimilarinformationto classifyindividualsaccordingtodecisionrules.Whiletheresults aresimilar,thetreesallowtheexperttoapplyamoreintuitive classiﬁcationofthecases.BeforeapplyingJ48,itisnecessaryto performadiscretizationofthevaluessothatthealgorithmiscapa- bleofworkingwithlargevolumesofinformation.Thereasonfor thisisthesearchforthecut-offpointsinthevariablesusesastrat- egythatdependsmainlyonthenumberofdifferentvaluesthata variablecontains.

Table3

Comparisonofmethods.*,differentmedian;=,equal;(−),medianofcolumnless thanmedianofrow.

CBR Dendrogram PAM SVM KNN

CBR

Dendrogram *(−)

PAM *(−) *(−)

SVM =(−) *(+) *(+)

KNN *(−) *(+) *(+) *(−)

Thegeneralobjectiveofextractionofknowledgetechniquesis toprovideahumanexpertwithinformationaboutthesystem- generatedclassiﬁcationbymeansof asetof rules thatsupport thedecision-makingprocess.Itshouldbenotedthatknowledge extractiontechniquesarenotintendedtosubstitutetherationale andexperienceofa humanexpert duringadiagnosis,ratherto complementtheprocessandserveasanadditionalmethodology orguidelineforcommonproceduresinanalysis.

Theprocessisdescribedinthefollowingsteps.LetIbetheset ofindividualsandsthenumberofprobesoncetheﬁlteringprocess hasﬁnished:

fr: A1×...×As→A₁×...×A_s

(a₁,...,as)→fr(a₁,...,as)=(a₁,...,a_s)

whereA_i∈[0,1]isthevalueofthetermiusingthefunctionfrand isobtainedinthefollowingway:

a_i= a_i−min(A_i) max(A_i)−min(A_i)

Finallythevaluesarediscretizedbymeansoffuinaseriesof predeﬁnedlevelst.

fu: A₁×...×A_s→

s

D×...×D (a₁,...,a_s)→fu(a₁,...,a_s)=(d₁,...,d_s) whereD=

i∈{0,...,t−1}i·(1/(t−1)),wecansaythatd_i=d_jifapply- ingthefunctionfu,withd_j∈Difd_i∈[d_j−1/(2t), d_j+1/(2t)].

Oncethetransformationis ﬁnished,thesetof individualsis determinedbythesubsetI⊆

s

D×...×Dofthedata,andJ48[12]

(9)

Fig.5. Decisiontree.

isusedtogeneratetherulesthatclassifytheindividuals.Theuse ofJ48,allowsrulestobeobtainedforclassifyinganindividuali_k∈I totheclassc_jbymeansofrulessimilarto:

r_i=(l₁∧...∧lm)→c_j

wheredpisthevaluefortheattributepfortheindividuali_k.Inthis waythesetisdeﬁnedforrulesRthatclassifytheindividualsfor eachoftheclasses.Theserulescanberepresentedasadecision tree.

Fig.2representstherevisestageforMicroCBR.Theinputcorre- spondstothediscretizationofthevalues(ifthereusephasehas beensuccessful). Subsequently,knowledgeextractionisapplied through the J48. Finally, the relevant information extracted is stored(inconsequentialprobes,importantprobes,andresultsof the classiﬁcation). At this stage, in addition to therepresenta- tionofthedecisiontree,a3Drepresentationwiththeinformation retrievedisdisplayed.ThedimensionalityisreducedbyusingMDS [13–15]

3.4. Retain

Ifthehumanexpertidentiﬁesrelevantinformationattherevise stage,theknowledgeisacquiredandtheinformationobtainedis stored.Theinformationthatisstoredcorrespondstotheclassiﬁ- cationsthatareconsideredcorrect,thedecisionrulesgenerated thatareconsideredrelevant,andtheprobesmarkedasirrelevant.

Fig.6.Illustrationscomparingprobesfoundintherulesfortheknowledgeextractionstage.Withinthediagonal,theillustrationrepresentsthedensitiesforeachprobe.For theillustrationsoneithersideofthediagonal,thegraphisobtainedusingtheprobescorrespondingtoitsrowandcolumn+class3individuals,class2,andoclass1.

(10)

Fig.7.MainprobesrecoveredusingJ48.

TheinformationstoredisdividedintothecasesmemoryIandthe memoryofrulesRandS_Irr.

Fig.2showsthestructureoftheretainstage.Takingtherevision oftheexpertintoaccount,thesystemlearnsfromthenewexpe- rienceandstorestheinformationthattheexpertestablishedas relevant.Thestoredinformationcanincludeprobes,classiﬁcations andrules.

4. Casestudy:classiﬁcationofCLLleukemiapatients

Microarray analysishasmade it possibletocharacterizethe molecularmechanismsthat cause severalcancers. With regard to leukemia, microarray analysis has facilitated the identiﬁca- tion of certain characteristic genes in the different variants of leukemia[7,35,40].Cancerexpertsremarkontheimportanceofthe

identiﬁcationofthegenesassociatedtoeachtypeofcancerinorder toestablishthemostefﬁcienttreatmentsforthepatients[41,42].

TheCancerInstituteinthecityofSalamancawasinterestedinnovel toolsfordecisionsupportintheprocessofCLL(ChronicLympho- cyticLeukemia)patientclassiﬁcation.

TheInstituteprovideduswith91samplesofpatientdataand askedforatooltoprovidedecisionsupportintheexpressionarray analysisprocessandtoincorporateinnovativetechniquestoreduce thedimensionalityofthedataandidentifythevariables witha higherinﬂuenceintheclassiﬁcationofpatients.Thesamplescorre- spondedtopatientsaffectedbychroniclymphocyticleukemia.CLL isadiseaseoflymphocytesthatappeartobematurebutarebiolog- icallyimmature.TheseBlymphocytesarisefromasubsetofCD5-B cellsthatappeartohavearoleinautoimmunity.Thepathogene- sisofchroniclymphocyticleukemiaislikelyamultistepprocess,

(11)

Table4

Classiﬁedcasesforeachtypedetectedaccordingtothetotalnumberofexisting elements/classiﬁedelements/erroneouselements.

Type1 Type2 Type3 Total

46/48/5 12/9/0 33/34/3 91

42/44/5 10/8/0 29/29/3 81

37/38/4 8/8/1 26/25/2 71

31/30/3 7/8/3 23/23/2 61

26/24/2 5/7/3 20/19/1 51

24/24/3 5/4/0 17/18/3 46

initiallyinvolvingapolyclonalexpansionofCD5-Bcellsfollowedby thetransformationofasinglecell[43].CLLisoneofthefourmain typesofleukemia.About15.110newcasesofCLLwillbediag- nosedin2008.Approximately90.179peoplearecurrentlyliving withCLL,morethanthenumberofpeoplelivingwithanyother typeofleukemia.MostpeoplewithCLLareatleast50yearsold [44].CLLstartswithachangetoasinglecellcalledalymphocyte.

Overtime,theCLLcellsmultiplyandreplacenormallymphocytes inthemarrowandlymphnodes.ThehighnumberofCLLcellsin themarrowmaycrowdoutnormalblood-formingcells,andCLL cellsarenot abletoﬁghtoffinfectionlikenormal lymphocytes do[44].Theaimofthetestsperformedinthisstudyistodeter- minewhetheroursystemisabletoclassifynewpatientsbasedon previouslyanalyzedandstoredcases.

5. Resultsobtained

Thispaper has presented MicroCBR, a case-based reasoning systemspecifically designed to analyze data from microarrays, facilitatingthegroupingandclassificationofindividuals.Moreover, MicroCBRprovidesaninnovativemethodforexploringtheclassifi- cationprocessandextractingknowledgeintheformofruleswhich helpthehumanexpertstounderstandtheclassificationprocess andobtainconclusionsabouttherelevanceoftheprobes.MicroCBR wasappliedtoa realcasescenarioforclassifyingCLLleukemia patientsattheCancerInstituteinthecityofSalamanca.Thehuman expertsinthelaboratoryhaveremarkedontheadvantagesofusing MicroCBRasadecisionsupportsystemforCLLclassification,and haveespeciallynotedthefacilityinacquiringknowledgeandexpla- nations.

Section4presentedthecasestudyconsideredinthis report, whichclassified91CLLleukemiapatientsintogroups.Theaimof thecasestudywastoidentifytheprobesthatallowclassifyingthe CLLleukemiapatientsintosubgroups.Inaninitialtest,datafrom 90patients,wherethepreviousclassificationwasnottakeninto account,wereusedintheMicroCBRsystem.Thepre-processing phasebeganwith54.675probesandtheRMAwasappliedtoobtain theluminescencevaluesforeachoftheprobesandtohomogenize thevalues fromdifferentchips.Afterthepre-processing phase, thefilteringprocesswasapplied,notablyreducingtheprobesto 618,withoutincreasingtheerrorrate.Thecut-offphasereduced theprobescomingfromtheuniformfilteringfrom1.311to618 probes.Table1showsthenumberofprobesgeneratedaftereach ofthefilteringstages.Fig.4a–dpresentstwoprobesthatpassed theuniformitytestandcontaincut-offpoints.Fig.4aandcshowsa densityplotandFig.4banddpresentthehistogramfortheprobes 1552280atand219703at.AsseeninFig.4,neitheroftheprobes followsauniformdistribution,andbothofthemhavelowdensity pointswhichallowaseparationofindividuals.Thisstagetakesthe irrelevantprobesfilteringintoaccount,asprobe201909at,which isassociatedtotheYchromosome.

Thereusephasebeginswhentheprobeshavebeenﬁltered,and generatesthemeshesforthegroupsaswellasthedistributionof theindividualsalongthespace.Themeshclosesttothenewcase wasthenselectedandtheclassiﬁcationwasmade.Toevaluatethe

Fig.8. RepresentationoflowdimensionalityprobeswithMDS.

proposedmodel,thesystemclassified90individualsandonenew individual,andtheresultsobtainedwerecomparedtotheprevi- ousexistingclassifications.Table2showsthenumberofelements withanerroneousclassification.Thefinalclassificationwascom- paredwiththedataobtainedusingaclusteranalysisdendrogram [45],PAM[11],theSVMclassificationtechnique[48]andKNN[49]

onthefiltereddata,andtheresultsofthefinalclassificationwere showntohavealowerrateoferror.Theproportionoferrorsin everygroupwascalculatedandtheKurskal–Wallistest[46]was applied,asshowninTable3,todeterminateifthemedianofthese proportionswasequal.Thistestwasperformedbyapplyinga5×2 crossvalidationfollowedbytheKurskal–Wallistest.Weoptedfor theKurskal–Wallistestbecauseitisanon-parametrictechnique thatdoesnotrequireanysuppositionsaboutthedata;asanalter- nativetheMann–Whitneycouldhavebeenused.Thedifference obtainedbetweenCBRandSVM wasnot consideredsignificant, althoughthemedianwaslower.

Intherevisephase,MicroCBRextractedtheknowledgeobtained duringtheclassificationprocess,asshowninFig.5.Fig.5presents atreewherethenodesareprobesandthebranchesaredecision rules.Theleafnodescontainthegroupassignedandthenumberof elementsclassifiedcorrectlyandincorrectly.Thetreerepresented inFig.5istheresultoftheclassificationofthe90patientsand thesinglenewpatient.Fig.6representssomegraphicswherethe valuesoftheretrievedprobesarecomparedtwobytwowithJ48 [12],andthediagonalshowsthedensitygraph.AsseeninFig.6,

(12)

thereareindividualsbelongingtoclass3whichcanbeeasilysep- aratedformtherest.Fig.7a–dshowsthedatacorrespondingto theﬁrstthreeprobesrecoveredwithJ48.Fig.7aandbshowsthe valuesfortheprobesoftheindividuals.Theindividualsforeach ofthegroupsarerepresentedindifferentcolours,andaregres- sionplaneisobtainedforeachofthegroups.Thegraphshowshow theindividualscanbeseparatedinthespaceusingtheseprobes.

Fig.7canddshowsthesameinformationbutellipsoidshavebeen addedinordertogroup50%ofthedataforeachofthegroups,thus facilitatingthespatialseparationofthedata.Toobtainavisualrep- resentationofthepatient’sclassiﬁcation,weusetheMDS[13–15], andthedimensionalityofthedataisreducedtothree.Fig.8aand brepresentstheinformationonceMDShasbeenappliedand,as shown,theindividualsofthedifferentclustersareseparatedinthe space.

Aftercarryingoutexperimentalstudiesofthedifferentproce- duresappliedtothereasoningcycle,thesystemperformancewas evaluated.Inordertoevaluatetheglobalfunctioningofthesys- tem,weincludedanadditionaltest.InthistestMicroCBRcontains 45previouslyclassiﬁedindividuals,andaimstoclassifytheremain- ing46casesusingpreviousknowledge.Table4showstheresults obtainedafterthetest.EachcolumninTable4representsthesub- typesdetected.EachrowinTable4representsthetotalnumber ofelementsbelongingtoeachoftheclasses,thetotalnumberof elementsassignedtothatclass,andthenumberofmisclassiﬁed elements.Wecanseethatthenumberoferrorsdecreasesasmore casesareincorporatedintothesystem.

6. Conclusions

MicroCBRisaspecializedandnovelsystemthatintegratesthe stepsofanexpressionanalysiswithinthestagesofaCBRcycle.

MicroCBRisabletoincorporatetheknowledgeacquiredinprevious classificationsanduseittoperformnewclassifications,providinga muchappreciateddecisionsupporttoolfordoctors.Theuseofthe CBRsystemfacilitatestheselectionandeliminationofprobesthat provideanomalousresults.Havingbeendiscardedintheretrieve phase,theseprobesarenotconsideredinsubsequentiterationsof thereasoningcycle.Asdemonstrated,theproposedsystemreduces thedimensionalitybasedonthefilteringofgeneswithlittlevari- abilityandthosethatdonotallowaseparationofindividualsdue tothedistributionofdata.Italsopresentsaclusteringtechnique basedontheneuronalnetworkESOINN,whichisvalidatedwith aPAMtechnique.Finally,MicroCBRincorporatesatechniquefor knowledgeextractionandpresentsittothehumanexpertsina veryintuitiveformat.

Withtheresultsobtainedfromempiricalstudieswecancon- cludethatMicroCBRprovidesatoolthatdetectsgenesandprobes, whicharethemostimportantfactorforthedetectionofpathol- ogy,andfacilitatesaclassificationandreliablediagnosis,asshown bytheresultspresentedinthispaper.MicroCBRhasbeenapplied toclassifyCLLleukemiapatientsandallowsthehumanexpertto obtaininformationabouttheclassificationprocess andtoiden- tifytheprobesconsideredasimportantorirrelevantforfurther classifications.

Acknowledgment

SpecialthankstotheInstituteofCancerofSalamancaforthe informationandtechnologyprovided.

References

[1]K.S.Lina,C.F.Chien,Clusteranalysisofgenome-wideexpressiondataforfeature extraction,ExpertSystemswithApplications36(2–2)(2009)3327–3335.

[2]Z.K.Stadlera,S.E.Come,Reviewofgene-expressionproﬁlinganditsclinicaluse inbreastcancer,CriticalReviewsinOncology/Hematology69(1)(2009)1–11.

[3]Affymetrix.GeneChip^®HumanGenomeU133Arrayshttp://www.affymetrix.

com/support/technical/datasheets/hgu133arraysdatasheet.pdf.

[4]T.Sawa,L.Ohno-Machado,Aneuralnetworkbasedsimilarityindexforclus- teringDNAmicroarraydata,ComputersinBiologyandMedicine33(1)(2003) 1–15.

[5] D.Bianchia,R.Calogero,B.Tirozzi,Kohonenneuralnetworksandgeneticclas- siﬁcation,MathematicalandComputerModelling45(1–2)(2007)34–60.

[6] V.Baladandayuthapani,S.Ray,B.K.Mallick,BayesianmethodsforDNAmicroar- raydataanalysis,HandbookofStatistics25(1)(2005)713–742.

[7] R.Avogadri,G.Valentini,Fuzzyensembleclusteringbasedonrandomprojec- tionsforDNAmicroarraydataanalysis,ArtiﬁcialIntelligenceinMedicine45 (2–3)(2009)173–183.

[8]J.Kolodner,Case-BasedReasoning,MorganKaufmann,SanMateo,1993.

[9] F.Riverola,F.Díaz,J.M.Corchado,Gene-CBR:acase-basedreasoningtoolfor cancerdiagnosisusingmicroarraydatasets,ComputationalIntelligence22 (3–4)(2006)254–268.

[10]S.Furao,T.Ogura,O.Hasegawa,Anenhancedself-organizingincrementalneu- ralnetworkforonlineunsupervisedlearning,NeuralNetworks20(8)(2007) 893–903.

[11]L.Kaufman,P.J.Rousseeuw,FindingGroupsinData:AnIntroductiontoCluster Analysis,Wiley,NewYork,1990.

[12]N.Saravanan,S.Cholairajana,K.I.Ramachandran,Vibration-basedfaultdiag- nosisofspurbevelgearboxusingfuzzytechnique,ExpertSystemswith Applications36(2–2)(2009)3119–3135.

[13]I.Borg,P.Groenen,ModernMultidimensionalScalingTheoryandApplications, Springer-Verlag,NewYork,1997.

[14]J.B.Kruskal,MultidimensionalScalingbyoptimizinggoodnessofﬁttonon- metrichypothesis,Psychometrika29(1)(1964)1–27(Springer,NewYork).

[15]M.Turea,F.Tokatlib,I.Kurtc,UsingKaplan–Meieranalysistogetherwith decisiontreemethods(C&RT,CHAID,QUEST,C4.5andID3)indetermining recurrence-freesurvivalofbreastcancerpatients,ExpertSystemswithAppli- cations36(2)(2009)2017–2026.

[16]J.Quackenbush,Computationalanalysisofmicroarraydata,NatureReview Genetics2(6)(2001)418–427.

[17]R.J.Lipshutz,S.P.A.Fodor,T.R.Gingeras,D.H.Lockhart,Highdensitysynthetic oligonucleotidearrays,NatureGenetics21(1)(1999)20–24.

[18]M.Taniguchia,L.L.Guana,J.A.Basarabb,M.V.Dodsonc,S.S.Moorea,Compar- ativeanalysisongeneexpressionproﬁlesincattlesubcutaneousfattissues, ComparativeBiochemistryandPhysiologyPartD:GenomicsandProteomics3 (4)(2008)251–256.

[19]O. Margalit, R. Somech, N.Amariglio, G. Rechav, Microarray basedgene expressionproﬁlingofhematologicmalignancies:basicconceptsandclinical applications,BloodReviews4(4)(2005)223–234.

[20]N.J.Armstrong,M.A.VandeWiel,Microarraydataanalysis:fromhypotheses toconclusionsusinggeneexpressiondata,CellularOncology26(5–6)(2004) 279–290.

[21] I.Jurisica,J.Glasgow,Applicationsofcase-basedreasoninginmolecularbiol- ogy,ArtiﬁcialIntelligenceMagazine,SpecialIssueonBioinformatics25(1) (2004)85–95.

[22] J.S.Aaronson,H.Juergen,G.C.Overton,KnowledgeDiscoveryinGENBANK,in:

ProceedingsoftheFirstInternationalConferenceonIntelligentSystemsfor MolecularBiology,AAAIPress,1993,pp.3–11.

[23]N. Arshadi, I. Jurisica, Data mining for case-based reasoning in high- dimensionalbiologicaldomains,IEEETransactionsonKnowledgeandData Engineering17(8)(2005)1127–1137.

[24]Affymetrix. Statistical Algorithms Description Document, http://www.

affymetrix.com/support/technical/whitepapers/saddwhitepaper.pdf.

[25]Affymetrix.Guideto ProbeLogarithmicIntensityError (PLIER)Estimation http://www.affymetrix.com/support/technical/technotes/pliertechnote.pdf.

[26]R.A.Irizarry,B.Hobbs,F.Collin,Y.D.Beazer-Barclay,K.J.Antonellis,U.Scherf, T.P.Speed,Exploration,normalization,andsummariesofhighdensityoligonu- cleotidearrayprobeleveldata,Biostatistics4(2003)249–264.

[27]R.Brunelli,Histogramanalysisforimageretrieval,PatternRecognition34 (2001)1625–1637.

[28]J.Jureˇckováa,J.Picek,Shapiro–Wilktypetestofnormalityundernuisance regressionandscale,ComputationalStatistics&DataAnalysis51(10)(2007) 5184–5191.

[29]N.Saitou,M.Nie,Theneighbor-joiningmethod:anewmethodforreconstruct- ingphylogenetictrees,MolecularBiologyandEvolution4(1987)406–425.

[30]T.Kohonen,Self-organizedformationoftopologicallycorrectfeaturemaps, BiologicalCybernetics(1982)59–69.

[31]B.Fritzke,Agrowingneuralgasnetworklearnstopologies,AdvancesinNeural InformationProcessingSystems7(1995)625–632.

[32]F.Shen,Analgorithmforincrementalunsupervisedlearningandtopologyrep- resentation,Tokyo:Ph.D.Thesis,TokyoInstituteofTechnology,2006.

[33]S.J.Redmond,C.Heneghan,AmethodforinitialisingtheK-meansclustering algorithmusingkd-trees,PatternRecognitionLetters28(8)(2007)965–973.

[34]T.Martinetz,CompetitiveHebbianlearningruleformsperfectlytopologypre- servingmaps, in:ICANN’93:InternationalConferenceonArtiﬁcial Neural Networks,Amsterdam,1993,pp.427–434.

[35]B.Guinn,A.F.Gilkes,E.Woodward,N.B.Westwood,G.J.Muftia,D.Linchc, A.K.Burnett,K.I.Mills,Microarrayanalysisoftumourantigenexpressionin presentationacutemyeloidleukaemia,BiochemicalandBiophysicalResearch Communication333(5)(2005)703–713.