Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party
websites are prohibited.
In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information
regarding Elsevier’s archiving and manuscript policies are encouraged to visit:
http://www.elsevier.com/copyright
ContentslistsavailableatSciVerseScienceDirect
Applied Soft Computing
jo u r n al h om epa g e : w w w . e l s e v i e r . c o m / lo c a t e / a s o c
MicroCBR: A case-based reasoning architecture for the classification of microarray data
Juan F. De Paz
a, Javier Bajo
a,∗, Vicente Vera
b, Juan M. Corchado
aaDepartamentoInformáticayAutomática,UniversityofSalamanca,PlazadelaMerceds/n,37008Salamanca,Spain
bDepartamentodeEstomatologiaII,FacultaddeOdontologia,UniversidadComplutensedeMadrid,PlazadeRamonyCajal,s/n,CiudadUniversitaria,28040Madrid,Spain
a rt i c l e i n f o
Articlehistory:
Received13July2009
Receivedinrevisedform25July2011 Accepted14August2011
Availableonline22August2011
Keywords:
Case-basedreasoning HGU133
ESOINN
CLLleukemiaclassification Decisiontree
a b s t r a c t
Microarraytechnologyprovidesagreatamountofdatarequiringanalysis,andtheuseofnewintelligent algorithmsthatidentifytherelevantinformationfortheclassificationprocesshasbecomeessential.This studypresentsaclassificationtoolcalledMicroCBRthatusesthecase-basedreasoningparadigmand incorporatesanovelfilteringtechniquebasedonstatisticalmethods,anewclusteringtechniquethat usesESOINN(EnhancedSelf-OrganizingIncrementalNeuronalNetwork),andaknowledgeextraction techniquebasedontheJ48algorithm.MicroCBRhasbeenappliedtoclassify91CLLpatientsandthe resultsobtainedareshowninthispaper.
©2011ElsevierB.V.Allrightsreserved.
1. Introduction
Theuseofmicroarrays,andmorespecificallyexpressionarrays, enablestheanalysisofdifferentsequencesofoligonucleotides[1,2].
Microarrayscanbeusedinthefieldofmedicinetodetectchanges atthegeneticlevel.Simplyput,amicroarrayisanarrayofprobes thatcontainsgeneticmaterialwithapredeterminedsequence.Pre- determinedsequencescanvaryaccordingtothefieldinwhichthe studyisapplied;inthecaseofhumans,thesequenceswillcontain humangeneticmaterial.Thesesequencesarehybridizedwiththe geneticmaterialofpatients,thusallowingthedetectionofgenetic mutationsthroughtheanalysisofthepresenceorabsenceofcertain sequencesofgeneticmaterial.Thisstudywasconductedtoanalyze theintensitiesproducedduringthehybridization.Theanalysisof thisinformationallowsthedetectionofpatternsthatcharacterize certaindiseasesand,mostimportantly,thegenesassociatedwith thesedifferentdiseases.
Theanalysisofexpressionarraysiscalledexpressionanalysis.
Anexpressionanalysisbasicallyconsistsofthreestages:normal- izationandfiltering;clusteringandclassification;andextraction ofknowledge.Thesestagesarecarriedoutfromtheluminescence valuesfoundintheprobes.Presently,thenumberofprobescon-
∗Correspondingauthor.Address:Compa ˜nía5,37002Salamanca,Spain.
Tel.:+34639771985;fax:+34923277101.
E-mailaddresses:[email protected](J.F.DePaz),[email protected](J.Bajo), [email protected](V.Vera),[email protected](J.M.Corchado).
tainingexpressionarrayshasincreasedconsiderablytotheextent thatithasbecomenecessarytousenewmethodsandtechniques toanalyzetheinformationmoreefficiently[3].Itisnecessaryto developnewtechniquestoanalyzelargevolumesofdata,extract therelevantinformation,anddeletetheinformationwhichhasno relevancetotheclassificationprocess.Moreover,theknowledge obtainedduringtheclassificationprocessisofgreatimportance forsubsequentclassifications.Therearevariousartificialintelli- gencetechniquessuchasartificialneuralnetworks[4,5],Bayesian networks [6], and fuzzy logic [7] which have been applied to microarrayanalysis.Whilethesetechniquescanbeappliedatvari- ousstagesofexpressionanalysis,theirintegrationintosystemsthat combinevariousartificialintelligencetechniques,suchasCBR,is ofparticularinterest.Previousresearch[47]hasstudiedthecapac- ityofdifferentclassifiersinthefieldofsequencing;however,the systemonlyusesnewclassificationsinthelearningphase,andcon- sequentlydoesnottakeintoaccounttherelevanceofthechanges madebyexpertanalysis.ThispaperpresentsMicroCBR,asystem basedonCBR(case-basedreasoning)whichusespastexperiences tosolvenewproblems[8].Assuch,itisperfectlysuitedforsolving theproblemathand.Inaddition,CBRmakesitpossibletoincor- poratethevariousstagesofexpressionanalysisintothereasoning cycleoftheCBR,thusfacilitatingthecreationofstrategiessimi- lartotheprocessesfollowedinmedicallaboratories.Therecovery ofinformationfrompreviousanalysessimplifiestheclassification processbydetectingandeliminatingrelevantandirrelevantprobes detectedinpreviousanalyses.Thisallowsforalearningprocess thatcannotonlyincorporatenewcases,butalsoselecttherelevant informationfromcasesthathavebeenreviewedbyanexpert.
1568-4946/$–seefrontmatter©2011ElsevierB.V.Allrightsreserved.
doi:10.1016/j.asoc.2011.08.021
Forsometimenow,wehavebeenworkingontheidentification oftechniquestoautomatethereasoningcycleofseveralCBRsys- temsasappliedtocomplexdomains[9].TheobjectiveofMicroCBR istodevelopaCBRforclassifyingpatientsbasedoninformation obtainedfrommicroarrays.Thissystemisappliedtotheclassi- ficationof subtypes ofleukemia,specifically,todetectpatterns andextractsubgroupswithintheCLLtypeofleukemia.MicroCBR incorporatesvarioustechniquesofartificialintelligence(filtering, clustering,artificialneuralnetworksandextractionofknowledge) atdifferentstages ofthereasoningcycle.In theretrievephase, newpre-processing andfilteringtechniquesareincorporatedin ordertoselecttheprobeswithrelevantinformationforclassifying patients.Thisinnovativemethodiscapableofobtainingafiltering processthatincorporatesmultiplestatisticaltechniquesbasedon hypothesiscontrasts,whichnotablyreducesthedimensionalityof thedata.Reducingthedimensionalityofthedatamakesitpossible tousetechniqueswithgreatercomputationalcomplexityinlater stagesoftheCBRcycle,whichwouldotherwisebeunviable.The reusestageincorporatesaclassificationtechniquebasedonneural networkESOINN[10].Thistechniqueproposesanovelmethodfor generatingclustersandforidentifyingthenearestclusterforthe finalclassification.ESOINNperforms clusteringbyincorporating boththedistributionprocessoftheentiresurfaceofclassification, andtheseparationbetweengroupswithlowdensityamongthem.
AnadditionalgroupingtechniqueknownasPAM[11](partition aroundmedoids),isalsoused,resultinginamoreaccurateclas- sification,sincetheresultssuggestedbytheESOINNnetworkare comparedtothoseobtainedusingthePAMtechnique.Oncethe classificationprocessisfinished,therevisephaseinitiatesatech- niqueforextractingknowledgeJ48[12]thatcanextractrulesto explainthedecisionstakenintheretrieveandreusestages.More- over,therevisestageincludesaMDS(MultidimensionalScaling) technique[13–15]forpresentinginformationinlowdimension- ality.Ahumanexpertanalyzesthisinformationandevaluatesthe proposedclassificationaswellasthevalidityoftherulesgenerated.
Finally,intheretainstage,ifthehumanexpertconsidersthepro- posedsolutionvalid,thesystemstoresthecaseinformationand therulesthat havebeen obtained.In this way,MicroCBRtakes advantageoftheinformationfrompreviouslyclassifiedpatients, andfromtherelevantandirrelevantprobes,andrulesfromthe previouslyappliedknowledgeextraction.
Thepaperisstructuredasfollows:thenextsectionbrieflyintro- ducestheproblemthatmotivatesthisresearch.Section3presents MicroCBRanddescribesthenovelstrategiesincorporatedinthe stagesoftheCBRcycle.Section4describesacasestudyspecifically developedtoevaluatetheCBRsystempresentedwithinthisstudy, consistingofaclassificationofCLLleukemiapatients.Finally,Sec- tion5presentstheresultsandconclusionsobtainedaftertesting themodel.
2. Microarraydataanalysis
Microarrayhasbecomeanessentialtoolingenomicresearch, making it possible to investigate global gene expression in all aspects of humandisease [16]. Microarray technologyis based onadatabaseofgenefragmentscalledESTs(ExpressedSequence Tags), which are used to measure target abundance using the scannedfluorescenceintensitiesfromtaggedmoleculeshybridized toESTs[17].Specifically,theHGU133plus2.0[3]arechipsused forexpressionanalysis.Thesechipsanalyzetheexpressionlevel of over 47,000 transcripts and variants, including 38,500 well- characterizedhumangenes.Itiscomprisedofmorethan54,000 probesetsand1,300,000distinctoligonucleotidefeatures.TheHG U133plus2.0providesmultiple,independentmeasurementsfor eachtranscript.Theuseofmultipleprobesprovidesacompletedata
setwithaccurate,reliable,reproducibleresultsfromeveryexper- iment. Microarray technology is a critical element for genomic analysisandallowsanin-depthstudyofmolecularcharacteriza- tionofRNAexpression,genomicchanges,epigeneticmodifications orprotein/DNAunions.
Expressionarrays[3]areatypeofmicroarraythathavebeen usedindifferentapproachestoidentifythegenesthatcharacter- izecertaindiseases[7,18,19].Inallcases,thedataanalysisprocess isessentiallycomposedofthreestages:normalizationandfilter- ing,clustering,andclassification.Thefirststepiscriticaltoachieve bothagoodnormalizationofdataandaninitialfilteringtoreduce thedimensionalityofthedatasetwithwhichtowork[20].Since theproblemathandisworkingwithhigh-dimensionalarrays,itis importanttohaveagoodpre-processingtechniquethatcanfacil- itateautomaticdecision-makingaboutthevariablesthatwillbe vitalfortheclassificationprocess.Inlightofthesedecisionsitwill bepossibletoreducetheoriginaldataset.Moreover,thechoice ofaclusteringtechniqueallowsdatatobegroupedaccordingto certainvariablesthatdominatethebehaviourofthegroup.After organizing thedataintogroups it ispossibletoextract knowl- edgeandclassifypatientswithinthegroupthatpresentsthemost similarities.
Case-based reasoning [9] is particularly applicable to this problemdomainbecauseit(i)supportsarichandevolvablerep- resentationofexperiences,problems,solutionsandfeedback;(ii) providesefficientandflexiblewaystoretrievetheseexperiences;
and(iii)appliesanalogicalreasoningtosolvenewproblems[21].
CBRsystems canbeusedtoproposenewsolutions orevaluate solutions toavoidpotentialproblems.Theresearchin[22]sug- gests that analogical reasoning is particularlyapplicable tothe biological domain, in partbecausebiological systems are often homologous(rootedinevolution).Moreover,biologistsoftenusea formofreasoningsimilartoCBR,whereexperimentsaredesigned andperformedbasedonthesimilaritybetweenfeaturesofanew systemandthoseofknownsystems.In[23]amixtureofexperts forcase-basedreasoning(MOE4CBR)isproposed.Itisamethod thatcombinesanensembleofCBRclassifierswithspectralclus- teringandlogisticregression,butdoesnotincorporateextraction of knowledgetechniquesand doesnot focusondimensionality reduction.Someapproachessuchas[9]provideCBRsolutionsand knowledgeextractiontechniques,facilitatingthecomprehension oftheclassificationprocess.ThispaperpresentsMicroCBR,aCBR solutionwhichalsoincorporatesnewknowledgeextractiontech- niques, but additionallyfocuses onthe definitionof innovative strategiesfordimensionalityreductionandclustering.Thefollow- ingsectionpresentsadetailedaccountoftheCBRsystemproposed inthiswork.
3. CBRsystemforclassifyingmicroarraydata
ThissectionpresentstheCBRsystemproposedinthecontext ofthisresearchandprovidesaclassificationtechniquebasedon previousexperiencesfordatafrommicroarrays.TheCBRdeveloped systemimitatesthebehaviourofhumanexpertsinthelaboratory andincorporatesinnovativeknowledgediscoverytechniques.The systemreceivesdatafromtheanalysisofchipsandisresponsible forclassifyingindividualsbasedonevidenceandexistingdata.
ThepurposeofCBRistosolvenewproblemsbyadaptingsolu- tionsthathavebeenusedtosolvesimilarproblemsinthepast[8].
TheprimaryconceptwhenworkingwithCBRsistheconceptof case.Acasecanbedefinedasapastexperience,andiscomposedof threeelements:aproblemdescriptionwhichdescribestheinitial problem;asolutionwhichprovidesthesequenceofactionscar- riedoutinordertosolvetheproblem;andthefinalstagewhich describesthestateachievedoncethesolutionwasapplied.CBR
managescases(pastexperiences)tosolvenewproblems.Theway casesaremanagedisknownastheCBRcycle,andconsistsoffour sequentialstepswhicharerecalledeverytimeaproblemneedsto besolved:retrieve,reuse,reviseandretain.EachstepoftheCBRlife cyclerequiresamodelormethodinordertoperformitsmission.
IntheCBRsystemproposedwithinthisstudy,theretrievephase filtersvariables,andrecoversimportantvariablesfromthecasesto determinethemostinfluentialfortheclassification.Oncethemost importantvariableshavebeenretrieved,thereusephasebegins adaptingthesolutionsfortheretrievedcasestoobtainthecluster- ing.Oncethisgroupingisaccomplished,thenextstepisknowledge extraction.Therevisephaseconsistsofanexpertrevisionforthe proposedsolution,andfinally,theretainphaseallowsthesystemto learnfromtheexperiencesobtainedinthethreepreviousphases, consequentlyupdatingthecasesmemory.
AkeyelementinaCBRsystemisacase,whichcanbedefined asapastexperience[8]andiscomposedofthreeelements:prob- lemdescription,problemsolution,andthefinalstateobtainedafter applyingthesolution.AcaseinMicroCBR contains information relatedtothepatient,therules,theproposedclassification,and theprobesmarkedasirrelevantorimportant.Thecaseisdefined bythefollowingexpression:
ij=(id,S=(A1,...,An),Cp,Cr)
whereij∈IandI={i1,...,iS}isthesetofindividuals/cases,Aisthe setofalltheprobes,Airepresentstheprobei,Cpisthepredicted classandCrtheactualclass.
In addition tothecases memory, oursystemincorporates a memoryofrulesthatcontainstheinformationextractedthrough theknowledgeextractiontechniques.Thememoryofrulesisstruc- turedasfollows:
R={r1,...,rl}, with ri=(l1∧...∧lm)→cj
where ls=(dts,os,)/dts∈Dt,os∈O,SIrr⊆A whereRisthesetofrulesfromthedecisiontree,lscontainsasetof discretizedprobes,anoperatorandarealvalue,Dtisthediscretiza- tionvaluefortheprobeAt,O={=, =/,>,<,≤,≥}operator,andSIrr
isthesetofprobesmarkedasirrelevant,cj∈Cp.
Whenanewcaseisclassified,anewdecisiontreeisgenerated intherevisestage.Asetofrulesareextractedfromthedecision treewhichprovideknowledgeabouttherelevanceoftheprobes intheclusteringandclassificationprocess.Fig.1showsascheme ofthetechniquesappliedinthedifferentstagesoftheCBRcycle.
AsseeninFig.1,theimportantprobesthatallowtheclassifica- tionofpatientsarerecoveredintheretrievephase.Theretrieve phaseisdividedinto6sub-phases:pre-processingthroughRMA, removalofirrelevantvariables,uniformdistribution,probeswith- outmeaningfulcut-offpoints,andcorrelatedvariables.Inthereuse phasethepatientsaregroupedbymeansofanESOINNneuralnet- work.Then,thepatientswithnopriorclassificationareassigned toagroupusingthenearestcluster.IntherevisephasetheJ48 [12]techniqueisappliedforextractingknowledgeaboutthemost importantprobes forthe classification,and theMDS technique [13–15]isusedtomakearepresentationinlowdimensionality.
Finally,intheretainphase,theknowledgeisupdated.Thisknowl- edgeincludesthecaseclassification,thedecisionrulesobtained, andtheinformationassociatedwiththeimportanceorirrelevance ofcertainprobesextractedfromtherules.
Fig.1showstheschemeoftheCBRsystemproposedwithinthis study.Fig.2detailsthestepsfollowedineachofthestagesofthe CBRcycle.ThestructureoftheMicroCBRwillnowbeexplainedin detail,presentinginnovativetechniquesmodelledineachofthe stagesoftheCBR.
Fig.1.StructureoftheproposedMicroCBRmodel.
3.1. Retrieve
Traditionally,onlythecasessimilartothecurrentproblemare recovered,oftenbecauseofperformance,andthenadapted.With regardtoexpressionarray,thenumber ofcasesisnotacritical factor,rather thenumberofvariables. Forthis reason,wehave incorporatedaninnovativestrategywherevariablesareretrieved atthisstageandthen,dependingontheidentifiedvariables,therest ofthestagesoftheCBRarecarriedout.Thenewstrategyallowsa notablereductioninthedimensionalityofthedata.Thefiltering phaseiscarriedoutonItogetherwiththenewcaseis.Thefiltering isonlyappliedtothoseprobesnotassociatedwithanyoftherules.
First,apre-processingofthedataisconductedusingRMA.Then, the5filteringsub-phasesareexecuted:removalofcontrolprobes, removaloferroneousprobes,removaloflow variabilityprobes, removalofprobeswithauniformdistribution,andremovalofcor- relatedprobes.Thesefivesub-phasesareoutlinedinthefollowing paragraphs.
3.1.1. RMA
Thisphasebeginsoncethelaboratoryexperimentwithmicroar- rayshasbeencompleted.Theresearcherobtainsvariousfilesthat contain grossintensityvalues. Prior toanalyzing thedata, itis important to complete the pre-processing phase, which elimi- nates defective samples and standardizes the data. As seen in Fig. 2, this phaseis normally divided into 3 sub-phases:back- groundcorrection,standardization,andsummarization.Thereis currently a limited group of algorithms that investigators use for performingthesesteps.ThemostcommonareMAS5.0[24]
(MicroarrayAffymetrixSuite5.0),PLIER[25](ProbeLogarithmic IntensityError),andRMA)[26](RobustMulti-arrayAverage).
The RMA [26] algorithm is a method for normalizing and summarizingprobe-levelintensitymeasurements.Itanalyzesthe valuesforthePM(Perfect-Match):inthefirststep,aBackground Correctioniscarriedouttoremovethenoisefromtheaveragesof thePM;inthesecondstep,thedataisquantilenormalizedinorder tocomparedatafromdifferentmicroarrays;finally,asummariza- tionismadeandthevaluesforeachprobe-setaregenerated.
Fig.2. DetailedarchitecturefortheMicroCBRmodel.
3.1.2. Irrelevantprobes
Oncethecontrol andtheerroneous probeshave beenelim- inated, the filtering process begins. The first step consists of eliminatingtheprobesmarkedasirrelevantinpreviousexecutions oftheCBRcycle.Thisway,allprobesthatcanpassthefiltering phase,butarepronetocauseerroneousresultsduringthereuse phase,areremoved.
3.1.3. Variability
Thesecondstageistoremovetheprobesthathavelowvariabil- ity.Thisworkiscarriedoutaccordingtothefollowingsteps:
1.Calculatethestandarddeviationforeachoftheprobesj
·j=+
1
n
nj=1
(¯·j−xij)2 (1)
wherenisthetotalnumberofcases,¯·jistheaveragepopula- tionforthevariablej,andxijisthevalueoftheprobejforthe individuali.
2.Standardizetheabovevalues zi= ·j−
(2)
where =(1/n)
nj=1·j and ·j=+
(1/n)
nj=1(¯·j−xij)2 wherezi≡N(0,1)
3.Discardprobesforwhichthevalueofzmeetsthefollowingcon- dition:z<−1.0.Thiswillachievetheremovalofabout16%ofthe probesifthevariablefollowsanormaldistribution.
3.1.4. Uniformdistribution
Finally,allremainingvariablesthatfollowauniformdistribu- tionareeliminated.Thevariablesthatfollowauniformdistribution willnotallowtheseparationofindividuals.Therefore,thevariables thatdonotfollowthisdistributionwillbereallyusefulvariables in theclassificationof thecases.Thecontrastofassumptionsis explainedbelow,usingtheKolmogorov–Smirnov[27]testasan example.H0:thedatafollowauniformdistribution;H1:theana- lyzeddatadonotfollowauniformdistribution.Statisticalcontrast:
D=max{D+,D−} (3)
where D+=max
1≤i≤n{(i/n)−F0(xi)}, D−=max
1≤i≤n{F0(xi)−((i−1)/n)} withiasthepatternofentry,nthenumberofitemsandF0(xi)the probabilityofobservingvalueslessthaniwithH0beingtrue.The valueofstatisticalcontrastiscomparedtothenextvalue:
D˛= C˛
k(n) (4)
Inthespecialcaseofuniformdistributionk(n)=√
n+0.12+ (0.11/√
n)andalevelofsignificance˛=0.05,C˛=1.358.
3.1.5. Cut-offpoints
Thisstepremovestheprobesthat,despitenotfollowingauni- formdistribution,havenoseparationbetweenelements,anddo notallowtheelementstobepartitioned.Thewaytoremovethe probesistodetectchangesinthedensitiesofthedata,andtoselect thefinalprobes.Theprobesinwhichcut-offsorhighdensitiesare notdetectedareeliminated,astheydonotprovideusefulinfor- mationtotheclassificationprocess.Thiswillkeeptheprobesthat allowtheseparationofindividuals.Thedetectionoftheseparation intervalsisperformedbycalculatingthedistancebetweenadjacent
Select a probe
Sort the values
Remove elemets
< threshold Uniform distribution
Confidence
interval Percentil
Cluster Cluster
More probes
yes no
No
yes 1, 2
3, 4
5
6, 7
8 10
10
11
12
Fig.3.Cut-offpointsoverview.
individuals.Oncethedistanceiscalculated,itispossibletodeter- minethepotentiallyrelevantvalues.Theselectioniscarriedoutby applyingconfidenceintervalsforthevaluesofthesedifferencesif thevaluesfollowauniformdistribution,orbyselectingthevalues aboveacertainpercentileifthevaluesdonotfollowanormaldis- tribution.ThediagraminFig.3illustratesthedetectionprocessfor cut-offpoints.
Thisprocessisformalizedasfollows:
1.LetIbethesetofindividualswithfilteredprobestogetherwith thenewindividual,wherex·jrepresentstheprobejforallthe individuals,andxijtheindividualifortheprobej
2.Selecttheprobej=1,x·j
3.Sortinincreasingordervaluesx·j
4.Calculatethevalueforxij=xi+1j−xij
5.Determineifthevariablexijfollowsauniformdistributionby meansoftheShapiro-Wilktest[28],otherwisegotostep10.
6.Calculatethevaluefor ¯x.j
7.Establish the confidence interval for the variance, which is established as 2·j∈[((n−1)·S2)/(2n
−1,1−˛/2),((n−1)· S2)/2n
−1,˛/2] with ˛=0.05 and n=#x·j and the number of elementsforx·j,S2isthesamplingvariance.
8.Establishthesetofelementsformxijnotbelongingtotheset Qj={xij/xij >I+
·j}whereI+
·j
isthevalueoftheupperlimitof theconfidenceinterval.
9.Gotostep11.
10.SelectthosevaluesuptothepercentileP˛fromeveryx·jand establishthesetQj={xij/xij>P˛}
11. Selecttheprobej+1inthecaseofmoreprobesneedingrevision andgotostep2.
12.CreatethenewsetofprobesI=∪x·j/∃xij∈Qj/i>#x·u∧i<
#x−#x·u.Whereuisaconstantbetween[0,0.5].
13.Finalizeandreturnthenewsetofindividualswiththefiltered probesI.
3.1.6. Correlations
Atthelaststageofthefilteringprocess,correlatedvariablesare eliminatedsothatonlytheindependentvariablesremain.Tothis end,thelinearcorrelationindexofPearsoniscalculatedandthe probesmeetingthefollowingconditionareeliminated.
rxiy·j<˛ (5)
given ˛=0.95, rx·iy·j=(x·ix.j/x·ix·j), x·ix·j=(1/N)
n s=1(¯.i− xsi)( ¯u.j−xsj)wherex·ix.jisthecovariancebetweenprobesiand j.3.2. Reuse
Inthisphasetheclusteringofindividualsiscarriedout,along withtheclassificationofnewindividualstooneoftheclusters.
Thereareseveralalgorithmsforclustering,butthemostcommon arethehierarchicalalgorithms[29]andthosebasedonpartition- ing[11].Withinthehierarchicalalgorithmsthemostcommonis the dendrogram[29]. Among thepartition-basedmethods it is possibletofindalternativesbasedonRNAssuchasSOM[30](Self- OrganizingMap),GNG[31](GrowingNeuralGas)orSOINN[32]
(Self-Organizing IncrementalNeuronalNetwork).Otheralterna- tivesincludek-means[33]orPAM[11].Thesetwomethodsdefine aseriesofinitialclusterswhichcorrespondtoanewindividualor toexistingindividuals,andaremarkedastheclusterrepresenta- tives,whiletheremainingindividualsareallocatedtothenearest cluster.Theproblemwitheachofthesemethodsisthattheydonot considerchangesinthedistributionofdensitiesofindividuals,and usuallydonotdetectclusterswithatypicalforms,suchaselongated clusters.Analternativetoclustermethodscouldbethedirectappli- cationofclassifiers,includingoptimization-basedfunctionssuchas SVM[48],decisiontrees,decisionrules,ordiffusemethodssuchas thek-nearestneighbour.
3.2.1. ESOINNneuralnetwork
NeuralNetworksbasedonGNGcandetectclusterswithatyp- icalforms,adjustiterativelytothedistributionoftheindividuals, anddetectlowdensityzones.TherearevariantsoftheGNG,such as the GCS [34] (Growing Cell Structure) or SOINN [32] (Self- OrganizingIncrementalNeuronalNetwork).Unlikeself-organizing mapsbasedonmeshes,GrowingGridorGCSdonotsetthenum- berofneurons,orthedegreeofconnectivity,buttheydoestablish thedimensionalityofeachmesh.Thiscomplicatestheseparation phasebetweengroups oncetheneurons aredistributed evenly acrossthe surface.The ESOINNneuralnetwork [10] (Enhanced Self-OrganizingIncrementalNeuronalNetwork)isavariation of theSOINNneuralnetwork [32],which allowsthecreation ofa singlelayer,whileESOINNisabletoincorporateboththedistri- butionprocessalongthesurfaceandtheseparationbetweenlow densitygroups.Thelearningprocessofthenetworkisdistributed intotwostages:thefirststageofcompetitionCHL[34](Competi- tiveHebbianLearning)wheretheclosestnodetotheinputpattern isselected;andthesecondadaptation/growingstagesimilartoa GNG.Thetrainingphaseandthevariousalgorithmsappliedatevery modifiedstageareoutlinedbelow:
1.Updatetheweightsofneuronsbyfollowingaprocesssimilarto theSOINN,butintroducinganewdefinitionforthelearningrate inordertoprovidegreaterstabilityforthemodel.Thislearning
ratehasproducedgoodresultsinothernetworkssuchasSOM [36].
Wa1=n1(Ma1)(−Wa1)
Wai=n2(Ma1)(−Wai) withai∈Na1 (6) wheren1(x)=(1/√
x),n2(x)=(1/
2+x2),aiisneuroni,is theinputpattern,Ma1isthenumberofwinningsofneuronai, Na1isthesetofneighboursofai.
2.Deletetheconnectionswithhigherage.Theagesarestandard- izedandthosewhosevaluesareintheregionofrejectionwith k>0areremoved.Theassignedvalueof˛is0.05,therefore zi= ei−
,z≡N(0,1) thenf(z)= 1
√2exp −z2 2
(7) where P(z<k)=˛/2→P(z<k)=0.975→(z)=0.975, k=1.96.
Thereforeallzvaluesthataregreaterthan1.96aredeleted.
3.OnceallinputpatternshavebeenintroducedthenaKS-Test[27]
iscarriedoutinordertodetermineifthedensitydistributionfor theneuronsineachgroupfollowsanormaldistribution.Ifso, thelearningprocedureisfinished;otherwisethenextpatternis processed.Thevalueof˛chosenis0.05.
Oncethecaseshavebeendistributedinthemeshes,itisnec- essarytoassigneach ofthemeshestoa class accordingtothe followingprocedure:LetIbethesetofindividualsoncetheprobes havebeen filteredand GE theset ofclusterscreatedby means oftheESOINNneuralnetwork,definedasGE={gE/gE⊆I},where giE∩gEj = ,∀i=/j with giE,gjE∈GE. Let C be the set of existing classesfortheindividualswherecj∈Cistheclassjintheset.Wecan saythatthemeshi,gEi belongstoaclassj,gEi∈cjandisrepresented asgiE
cjwhen
maxcj∈C#Icj/giE
#Icj ·#Icj
#I (8)
whereIcj ={s∈I/s∈cj}andIcj/gEi isthesetofindividualsfromcj restrictedtothegroupgiE.
ThesetofmeshesbelongingtotheclassisdenotedasGEcjandis definedbytheexpression(9).
GEcj=
gE icj∈GE
gEi
cj (9)
3.2.2. PAM
ThePAMalgorithm[11]isexecutedinparalleltotheclustering inordertofacilitateacomparisonoftheresultsobtained.Theclas- sificationmadebybothmethods,PAMandESOINN,generatesan equivalenceindexbetweenthetwomethodsthatdeterminesthe consistencyofthereusephase.ThealgorithmusedforPAMisas follows:
1.Selectthenumberofclustersdependingon#C.
2. Themetricusedforthedistanceisthesameastheoneusedin theESOINNnetwork
3.Classifythepatientstakingallofthevariablesintoaccount,with- outanyfilteringGP={gP/gP⊆I}withgPi ∩gjP= ,∀i=/j 4. OncethegroupsGParecreated,anassignationismadefollowing
theprocedureindicatedin(8).
3.2.3. Equivalenceindex
OncetheindividualshavebeenclassifiedusingboththePAM andtheESOINNneuralnetworks,theequivalenceindexforboth
Table1 Numberofprobes.
Uniform Cut-off Correlations
CBR 1311 618 541
methodseqiscalculated,andtheerrorratefortheESOINNnet- workisdeterminedasafunctionofthepre-classifiedcases.The equivalenceindexisdefinedasindicatedin(10):
eq= #{i∈I/i∈cEj ∧i∈cPj}
#I (10)
wherecjErepresentsthesetofmeshesbelongingtoclassjthrough the ESOINN network and cjP represents the set of individuals belongingtoclassjthroughthePAMalgorithm.
3.2.4. Classification
Oncethemeshesaregeneratedbytheclusteringprocess,pre- viously unclassified individuals are now classified by selecting thenearestmesh.Whenthemeshhasbeenselected,thecaseis assignedtotheclassofthemeshselected.Theassignmentisdefined as(11).
is∈GEk
cj→iu∈Cj (11)
whereiuistheunclassifiedindividual,isistheindividualclosestto theindividualandiucalculatedusingtheEuclideandistance.
Fig.2describesthefunctioningofthereusephase.Asshown inFig.2,thereusestagereceivesthefilteredandnot-filtereddata resultingfromtheretrievestageasinputs.Theinputisusedforboth theESOINNneuralnetworkandthePAMtechnique.TheESOINN neural network generates a setof groups assigned todifferent classes.Thesegroupsarecomposedofmeshescontainingdifferent elementstogetherwiththeinformationofthepreviousclassifica- tion.ThePAMtechniquerepeatsthesameprocessconcurrentlyand generatesthegroupsforeachoftheclasses.Thegroupsgenerated byPAMcontaintheindividualsandtheirpreviousclassification, butdonotconsidersub-groups.Finally,theequivalenceindexis calculatedandthenewpatientisclassified.Theerrorrateforthe ESOINNnetworkismadethrough(8) todeterminetheclassfor eachofthegroups.
3.3. Revise
AsshowninFig.1,therevisioniscarriedoutbyahumanexpert whodeterminesifthegroupassignedbythesystemiscorrect.To facilitatethehumanexperttask,theequivalenceindexandthe errorratewerecalculatedinthereusestage.It isimportantfor themedicalhumanexperttounderstandtheclassificationprocess performedinthetwopreviousstages.Forthisreason,MicroCBR providesaknowledgeextractionmethodintheRevisephase.This methodanalysesthestepsfollowedintheretrieveandreusestages, and extractsknowledge whichisthenformalizedin rules.Con- sequentlythehumanexpertcaneasilyevaluatetheclassification andextractconclusionsconcerningtheefficiencyoftheclassifica- tionprocess.Theknowledgeextractionphasedetectsanomalous classifications,sinceitaccountsfortheexistenceofprobeswith irrelevantinformation,orthosethatweredecisiveinthemisclassi- fication.Forexample,thereareprobeswhichseparateindividuals aseithermenorwomen.Similarly,thehumanexpertcanmark importantprobesthatarenoteliminatedintheretrievestagedur- ingthefiltering.
Theextractionofknowledgethatis presentedtothehuman expertiscarriedoutusingtheJ48algorithm[12].TheJ48algo- rithmistheJavaimplementationoftheC4.5algorithm,anevolution
Fig.4. Densityplotandhistogramforprobes1552280 atand219703atthatdonotmeettheuniformitytestandinwhichcut-offpointsweredetected.
Table2
Classificationerrors.
Type1 Type2 Type3
Total 46 12 33
Predicted/Erroneous 47/5 9/0 35/4
oftheoriginalID3[37],whosemainadvantageisthatitincorpo- ratesnumericalattributesintothelogicaloperationscarriedoutin thetestnodes.Thereareotheralternativesforgeneratingdecision ruleswhichoperatesimilartothedecisiontrees,includingRIPPER [38]andPART[39].Thesemethodsextractsimilarinformationto classifyindividualsaccordingtodecisionrules.Whiletheresults aresimilar,thetreesallowtheexperttoapplyamoreintuitive classificationofthecases.BeforeapplyingJ48,itisnecessaryto performadiscretizationofthevaluessothatthealgorithmiscapa- bleofworkingwithlargevolumesofinformation.Thereasonfor thisisthesearchforthecut-offpointsinthevariablesusesastrat- egythatdependsmainlyonthenumberofdifferentvaluesthata variablecontains.
Table3
Comparisonofmethods.*,differentmedian;=,equal;(−),medianofcolumnless thanmedianofrow.
CBR Dendrogram PAM SVM KNN
CBR
Dendrogram *(−)
PAM *(−) *(−)
SVM =(−) *(+) *(+)
KNN *(−) *(+) *(+) *(−)
Thegeneralobjectiveofextractionofknowledgetechniquesis toprovideahumanexpertwithinformationaboutthesystem- generatedclassificationbymeansof asetof rules thatsupport thedecision-makingprocess.Itshouldbenotedthatknowledge extractiontechniquesarenotintendedtosubstitutetherationale andexperienceofa humanexpert duringadiagnosis,ratherto complementtheprocessandserveasanadditionalmethodology orguidelineforcommonproceduresinanalysis.
Theprocessisdescribedinthefollowingsteps.LetIbetheset ofindividualsandsthenumberofprobesoncethefilteringprocess hasfinished:
fr: A1×...×As→A1×...×As
(a1,...,as)→fr(a1,...,as)=(a1,...,as)
whereAi∈[0,1]isthevalueofthetermiusingthefunctionfrand isobtainedinthefollowingway:
ai= ai−min(Ai) max(Ai)−min(Ai)
Finallythevaluesarediscretizedbymeansoffuinaseriesof predefinedlevelst.
fu: A1×...×As→
s
D×...×D (a1,...,as)→fu(a1,...,as)=(d1,...,ds) whereD=
i∈{0,...,t−1}i·(1/(t−1)),wecansaythatdi=djifapply- ingthefunctionfu,withdj∈Difdi∈[dj−1/(2t), dj+1/(2t)].
Oncethetransformationis finished,thesetof individualsis determinedbythesubsetI⊆
s
D×...×Dofthedata,andJ48[12]
Fig.5. Decisiontree.
isusedtogeneratetherulesthatclassifytheindividuals.Theuse ofJ48,allowsrulestobeobtainedforclassifyinganindividualik∈I totheclasscjbymeansofrulessimilarto:
ri=(l1∧...∧lm)→cj
wheredpisthevaluefortheattributepfortheindividualik.Inthis waythesetisdefinedforrulesRthatclassifytheindividualsfor eachoftheclasses.Theserulescanberepresentedasadecision tree.
Fig.2representstherevisestageforMicroCBR.Theinputcorre- spondstothediscretizationofthevalues(ifthereusephasehas beensuccessful). Subsequently,knowledgeextractionisapplied through the J48. Finally, the relevant information extracted is stored(inconsequentialprobes,importantprobes,andresultsof the classification). At this stage, in addition to therepresenta- tionofthedecisiontree,a3Drepresentationwiththeinformation retrievedisdisplayed.ThedimensionalityisreducedbyusingMDS [13–15]
3.4. Retain
Ifthehumanexpertidentifiesrelevantinformationattherevise stage,theknowledgeisacquiredandtheinformationobtainedis stored.Theinformationthatisstoredcorrespondstotheclassifi- cationsthatareconsideredcorrect,thedecisionrulesgenerated thatareconsideredrelevant,andtheprobesmarkedasirrelevant.
Fig.6.Illustrationscomparingprobesfoundintherulesfortheknowledgeextractionstage.Withinthediagonal,theillustrationrepresentsthedensitiesforeachprobe.For theillustrationsoneithersideofthediagonal,thegraphisobtainedusingtheprobescorrespondingtoitsrowandcolumn+class3individuals,class2,andoclass1.
Fig.7.MainprobesrecoveredusingJ48.
TheinformationstoredisdividedintothecasesmemoryIandthe memoryofrulesRandSIrr.
Fig.2showsthestructureoftheretainstage.Takingtherevision oftheexpertintoaccount,thesystemlearnsfromthenewexpe- rienceandstorestheinformationthattheexpertestablishedas relevant.Thestoredinformationcanincludeprobes,classifications andrules.
4. Casestudy:classificationofCLLleukemiapatients
Microarray analysishasmade it possibletocharacterizethe molecularmechanismsthat cause severalcancers. With regard to leukemia, microarray analysis has facilitated the identifica- tion of certain characteristic genes in the different variants of leukemia[7,35,40].Cancerexpertsremarkontheimportanceofthe
identificationofthegenesassociatedtoeachtypeofcancerinorder toestablishthemostefficienttreatmentsforthepatients[41,42].
TheCancerInstituteinthecityofSalamancawasinterestedinnovel toolsfordecisionsupportintheprocessofCLL(ChronicLympho- cyticLeukemia)patientclassification.
TheInstituteprovideduswith91samplesofpatientdataand askedforatooltoprovidedecisionsupportintheexpressionarray analysisprocessandtoincorporateinnovativetechniquestoreduce thedimensionalityofthedataandidentifythevariables witha higherinfluenceintheclassificationofpatients.Thesamplescorre- spondedtopatientsaffectedbychroniclymphocyticleukemia.CLL isadiseaseoflymphocytesthatappeartobematurebutarebiolog- icallyimmature.TheseBlymphocytesarisefromasubsetofCD5-B cellsthatappeartohavearoleinautoimmunity.Thepathogene- sisofchroniclymphocyticleukemiaislikelyamultistepprocess,
Table4
Classifiedcasesforeachtypedetectedaccordingtothetotalnumberofexisting elements/classifiedelements/erroneouselements.
Type1 Type2 Type3 Total
46/48/5 12/9/0 33/34/3 91
42/44/5 10/8/0 29/29/3 81
37/38/4 8/8/1 26/25/2 71
31/30/3 7/8/3 23/23/2 61
26/24/2 5/7/3 20/19/1 51
24/24/3 5/4/0 17/18/3 46
initiallyinvolvingapolyclonalexpansionofCD5-Bcellsfollowedby thetransformationofasinglecell[43].CLLisoneofthefourmain typesofleukemia.About15.110newcasesofCLLwillbediag- nosedin2008.Approximately90.179peoplearecurrentlyliving withCLL,morethanthenumberofpeoplelivingwithanyother typeofleukemia.MostpeoplewithCLLareatleast50yearsold [44].CLLstartswithachangetoasinglecellcalledalymphocyte.
Overtime,theCLLcellsmultiplyandreplacenormallymphocytes inthemarrowandlymphnodes.ThehighnumberofCLLcellsin themarrowmaycrowdoutnormalblood-formingcells,andCLL cellsarenot abletofightoffinfectionlikenormal lymphocytes do[44].Theaimofthetestsperformedinthisstudyistodeter- minewhetheroursystemisabletoclassifynewpatientsbasedon previouslyanalyzedandstoredcases.
5. Resultsobtained
Thispaper has presented MicroCBR, a case-based reasoning systemspecifically designed to analyze data from microarrays, facilitatingthegroupingandclassificationofindividuals.Moreover, MicroCBRprovidesaninnovativemethodforexploringtheclassifi- cationprocessandextractingknowledgeintheformofruleswhich helpthehumanexpertstounderstandtheclassificationprocess andobtainconclusionsabouttherelevanceoftheprobes.MicroCBR wasappliedtoa realcasescenarioforclassifyingCLLleukemia patientsattheCancerInstituteinthecityofSalamanca.Thehuman expertsinthelaboratoryhaveremarkedontheadvantagesofusing MicroCBRasadecisionsupportsystemforCLLclassification,and haveespeciallynotedthefacilityinacquiringknowledgeandexpla- nations.
Section4presentedthecasestudyconsideredinthis report, whichclassified91CLLleukemiapatientsintogroups.Theaimof thecasestudywastoidentifytheprobesthatallowclassifyingthe CLLleukemiapatientsintosubgroups.Inaninitialtest,datafrom 90patients,wherethepreviousclassificationwasnottakeninto account,wereusedintheMicroCBRsystem.Thepre-processing phasebeganwith54.675probesandtheRMAwasappliedtoobtain theluminescencevaluesforeachoftheprobesandtohomogenize thevalues fromdifferentchips.Afterthepre-processing phase, thefilteringprocesswasapplied,notablyreducingtheprobesto 618,withoutincreasingtheerrorrate.Thecut-offphasereduced theprobescomingfromtheuniformfilteringfrom1.311to618 probes.Table1showsthenumberofprobesgeneratedaftereach ofthefilteringstages.Fig.4a–dpresentstwoprobesthatpassed theuniformitytestandcontaincut-offpoints.Fig.4aandcshowsa densityplotandFig.4banddpresentthehistogramfortheprobes 1552280atand219703at.AsseeninFig.4,neitheroftheprobes followsauniformdistribution,andbothofthemhavelowdensity pointswhichallowaseparationofindividuals.Thisstagetakesthe irrelevantprobesfilteringintoaccount,asprobe201909at,which isassociatedtotheYchromosome.
Thereusephasebeginswhentheprobeshavebeenfiltered,and generatesthemeshesforthegroupsaswellasthedistributionof theindividualsalongthespace.Themeshclosesttothenewcase wasthenselectedandtheclassificationwasmade.Toevaluatethe
Fig.8. RepresentationoflowdimensionalityprobeswithMDS.
proposedmodel,thesystemclassified90individualsandonenew individual,andtheresultsobtainedwerecomparedtotheprevi- ousexistingclassifications.Table2showsthenumberofelements withanerroneousclassification.Thefinalclassificationwascom- paredwiththedataobtainedusingaclusteranalysisdendrogram [45],PAM[11],theSVMclassificationtechnique[48]andKNN[49]
onthefiltereddata,andtheresultsofthefinalclassificationwere showntohavealowerrateoferror.Theproportionoferrorsin everygroupwascalculatedandtheKurskal–Wallistest[46]was applied,asshowninTable3,todeterminateifthemedianofthese proportionswasequal.Thistestwasperformedbyapplyinga5×2 crossvalidationfollowedbytheKurskal–Wallistest.Weoptedfor theKurskal–Wallistestbecauseitisanon-parametrictechnique thatdoesnotrequireanysuppositionsaboutthedata;asanalter- nativetheMann–Whitneycouldhavebeenused.Thedifference obtainedbetweenCBRandSVM wasnot consideredsignificant, althoughthemedianwaslower.
Intherevisephase,MicroCBRextractedtheknowledgeobtained duringtheclassificationprocess,asshowninFig.5.Fig.5presents atreewherethenodesareprobesandthebranchesaredecision rules.Theleafnodescontainthegroupassignedandthenumberof elementsclassifiedcorrectlyandincorrectly.Thetreerepresented inFig.5istheresultoftheclassificationofthe90patientsand thesinglenewpatient.Fig.6representssomegraphicswherethe valuesoftheretrievedprobesarecomparedtwobytwowithJ48 [12],andthediagonalshowsthedensitygraph.AsseeninFig.6,
thereareindividualsbelongingtoclass3whichcanbeeasilysep- aratedformtherest.Fig.7a–dshowsthedatacorrespondingto thefirstthreeprobesrecoveredwithJ48.Fig.7aandbshowsthe valuesfortheprobesoftheindividuals.Theindividualsforeach ofthegroupsarerepresentedindifferentcolours,andaregres- sionplaneisobtainedforeachofthegroups.Thegraphshowshow theindividualscanbeseparatedinthespaceusingtheseprobes.
Fig.7canddshowsthesameinformationbutellipsoidshavebeen addedinordertogroup50%ofthedataforeachofthegroups,thus facilitatingthespatialseparationofthedata.Toobtainavisualrep- resentationofthepatient’sclassification,weusetheMDS[13–15], andthedimensionalityofthedataisreducedtothree.Fig.8aand brepresentstheinformationonceMDShasbeenappliedand,as shown,theindividualsofthedifferentclustersareseparatedinthe space.
Aftercarryingoutexperimentalstudiesofthedifferentproce- duresappliedtothereasoningcycle,thesystemperformancewas evaluated.Inordertoevaluatetheglobalfunctioningofthesys- tem,weincludedanadditionaltest.InthistestMicroCBRcontains 45previouslyclassifiedindividuals,andaimstoclassifytheremain- ing46casesusingpreviousknowledge.Table4showstheresults obtainedafterthetest.EachcolumninTable4representsthesub- typesdetected.EachrowinTable4representsthetotalnumber ofelementsbelongingtoeachoftheclasses,thetotalnumberof elementsassignedtothatclass,andthenumberofmisclassified elements.Wecanseethatthenumberoferrorsdecreasesasmore casesareincorporatedintothesystem.
6. Conclusions
MicroCBRisaspecializedandnovelsystemthatintegratesthe stepsofanexpressionanalysiswithinthestagesofaCBRcycle.
MicroCBRisabletoincorporatetheknowledgeacquiredinprevious classificationsanduseittoperformnewclassifications,providinga muchappreciateddecisionsupporttoolfordoctors.Theuseofthe CBRsystemfacilitatestheselectionandeliminationofprobesthat provideanomalousresults.Havingbeendiscardedintheretrieve phase,theseprobesarenotconsideredinsubsequentiterationsof thereasoningcycle.Asdemonstrated,theproposedsystemreduces thedimensionalitybasedonthefilteringofgeneswithlittlevari- abilityandthosethatdonotallowaseparationofindividualsdue tothedistributionofdata.Italsopresentsaclusteringtechnique basedontheneuronalnetworkESOINN,whichisvalidatedwith aPAMtechnique.Finally,MicroCBRincorporatesatechniquefor knowledgeextractionandpresentsittothehumanexpertsina veryintuitiveformat.
Withtheresultsobtainedfromempiricalstudieswecancon- cludethatMicroCBRprovidesatoolthatdetectsgenesandprobes, whicharethemostimportantfactorforthedetectionofpathol- ogy,andfacilitatesaclassificationandreliablediagnosis,asshown bytheresultspresentedinthispaper.MicroCBRhasbeenapplied toclassifyCLLleukemiapatientsandallowsthehumanexpertto obtaininformationabouttheclassificationprocess andtoiden- tifytheprobesconsideredasimportantorirrelevantforfurther classifications.
Acknowledgment
SpecialthankstotheInstituteofCancerofSalamancaforthe informationandtechnologyprovided.
References
[1]K.S.Lina,C.F.Chien,Clusteranalysisofgenome-wideexpressiondataforfeature extraction,ExpertSystemswithApplications36(2–2)(2009)3327–3335.
[2]Z.K.Stadlera,S.E.Come,Reviewofgene-expressionprofilinganditsclinicaluse inbreastcancer,CriticalReviewsinOncology/Hematology69(1)(2009)1–11.
[3]Affymetrix.GeneChip®HumanGenomeU133Arrayshttp://www.affymetrix.
com/support/technical/datasheets/hgu133arraysdatasheet.pdf.
[4]T.Sawa,L.Ohno-Machado,Aneuralnetworkbasedsimilarityindexforclus- teringDNAmicroarraydata,ComputersinBiologyandMedicine33(1)(2003) 1–15.
[5] D.Bianchia,R.Calogero,B.Tirozzi,Kohonenneuralnetworksandgeneticclas- sification,MathematicalandComputerModelling45(1–2)(2007)34–60.
[6] V.Baladandayuthapani,S.Ray,B.K.Mallick,BayesianmethodsforDNAmicroar- raydataanalysis,HandbookofStatistics25(1)(2005)713–742.
[7] R.Avogadri,G.Valentini,Fuzzyensembleclusteringbasedonrandomprojec- tionsforDNAmicroarraydataanalysis,ArtificialIntelligenceinMedicine45 (2–3)(2009)173–183.
[8]J.Kolodner,Case-BasedReasoning,MorganKaufmann,SanMateo,1993.
[9] F.Riverola,F.Díaz,J.M.Corchado,Gene-CBR:acase-basedreasoningtoolfor cancerdiagnosisusingmicroarraydatasets,ComputationalIntelligence22 (3–4)(2006)254–268.
[10]S.Furao,T.Ogura,O.Hasegawa,Anenhancedself-organizingincrementalneu- ralnetworkforonlineunsupervisedlearning,NeuralNetworks20(8)(2007) 893–903.
[11]L.Kaufman,P.J.Rousseeuw,FindingGroupsinData:AnIntroductiontoCluster Analysis,Wiley,NewYork,1990.
[12]N.Saravanan,S.Cholairajana,K.I.Ramachandran,Vibration-basedfaultdiag- nosisofspurbevelgearboxusingfuzzytechnique,ExpertSystemswith Applications36(2–2)(2009)3119–3135.
[13]I.Borg,P.Groenen,ModernMultidimensionalScalingTheoryandApplications, Springer-Verlag,NewYork,1997.
[14]J.B.Kruskal,MultidimensionalScalingbyoptimizinggoodnessoffittonon- metrichypothesis,Psychometrika29(1)(1964)1–27(Springer,NewYork).
[15]M.Turea,F.Tokatlib,I.Kurtc,UsingKaplan–Meieranalysistogetherwith decisiontreemethods(C&RT,CHAID,QUEST,C4.5andID3)indetermining recurrence-freesurvivalofbreastcancerpatients,ExpertSystemswithAppli- cations36(2)(2009)2017–2026.
[16]J.Quackenbush,Computationalanalysisofmicroarraydata,NatureReview Genetics2(6)(2001)418–427.
[17]R.J.Lipshutz,S.P.A.Fodor,T.R.Gingeras,D.H.Lockhart,Highdensitysynthetic oligonucleotidearrays,NatureGenetics21(1)(1999)20–24.
[18]M.Taniguchia,L.L.Guana,J.A.Basarabb,M.V.Dodsonc,S.S.Moorea,Compar- ativeanalysisongeneexpressionprofilesincattlesubcutaneousfattissues, ComparativeBiochemistryandPhysiologyPartD:GenomicsandProteomics3 (4)(2008)251–256.
[19]O. Margalit, R. Somech, N.Amariglio, G. Rechav, Microarray basedgene expressionprofilingofhematologicmalignancies:basicconceptsandclinical applications,BloodReviews4(4)(2005)223–234.
[20]N.J.Armstrong,M.A.VandeWiel,Microarraydataanalysis:fromhypotheses toconclusionsusinggeneexpressiondata,CellularOncology26(5–6)(2004) 279–290.
[21] I.Jurisica,J.Glasgow,Applicationsofcase-basedreasoninginmolecularbiol- ogy,ArtificialIntelligenceMagazine,SpecialIssueonBioinformatics25(1) (2004)85–95.
[22] J.S.Aaronson,H.Juergen,G.C.Overton,KnowledgeDiscoveryinGENBANK,in:
ProceedingsoftheFirstInternationalConferenceonIntelligentSystemsfor MolecularBiology,AAAIPress,1993,pp.3–11.
[23]N. Arshadi, I. Jurisica, Data mining for case-based reasoning in high- dimensionalbiologicaldomains,IEEETransactionsonKnowledgeandData Engineering17(8)(2005)1127–1137.
[24]Affymetrix. Statistical Algorithms Description Document, http://www.
affymetrix.com/support/technical/whitepapers/saddwhitepaper.pdf.
[25]Affymetrix.Guideto ProbeLogarithmicIntensityError (PLIER)Estimation http://www.affymetrix.com/support/technical/technotes/pliertechnote.pdf.
[26]R.A.Irizarry,B.Hobbs,F.Collin,Y.D.Beazer-Barclay,K.J.Antonellis,U.Scherf, T.P.Speed,Exploration,normalization,andsummariesofhighdensityoligonu- cleotidearrayprobeleveldata,Biostatistics4(2003)249–264.
[27]R.Brunelli,Histogramanalysisforimageretrieval,PatternRecognition34 (2001)1625–1637.
[28]J.Jureˇckováa,J.Picek,Shapiro–Wilktypetestofnormalityundernuisance regressionandscale,ComputationalStatistics&DataAnalysis51(10)(2007) 5184–5191.
[29]N.Saitou,M.Nie,Theneighbor-joiningmethod:anewmethodforreconstruct- ingphylogenetictrees,MolecularBiologyandEvolution4(1987)406–425.
[30]T.Kohonen,Self-organizedformationoftopologicallycorrectfeaturemaps, BiologicalCybernetics(1982)59–69.
[31]B.Fritzke,Agrowingneuralgasnetworklearnstopologies,AdvancesinNeural InformationProcessingSystems7(1995)625–632.
[32]F.Shen,Analgorithmforincrementalunsupervisedlearningandtopologyrep- resentation,Tokyo:Ph.D.Thesis,TokyoInstituteofTechnology,2006.
[33]S.J.Redmond,C.Heneghan,AmethodforinitialisingtheK-meansclustering algorithmusingkd-trees,PatternRecognitionLetters28(8)(2007)965–973.
[34]T.Martinetz,CompetitiveHebbianlearningruleformsperfectlytopologypre- servingmaps, in:ICANN’93:InternationalConferenceonArtificial Neural Networks,Amsterdam,1993,pp.427–434.
[35]B.Guinn,A.F.Gilkes,E.Woodward,N.B.Westwood,G.J.Muftia,D.Linchc, A.K.Burnett,K.I.Mills,Microarrayanalysisoftumourantigenexpressionin presentationacutemyeloidleukaemia,BiochemicalandBiophysicalResearch Communication333(5)(2005)703–713.