Pattern Recognition
Journal homepage: www.elsevier.com/locate/patcog
Neural network for ordinal classification of imbalanced data by minimizing a Bayesian cost
Marcelino Lázaro∗, Aníbal R. Figueiras-Vidal
Signal Theory and Communications Dept. Universidad Carlos III de Madrid, Spain
Article info
Article history:
Received 2 February 2022
Revised 1 December 2022
Accepted 4 January 2023
Available online 6 January 2023
Keywords:
Bayes cost
Parzen windows
Ordinal classification
Imbalanced
Abstract
Ordinal classification of imbalanced data is a challenging problem that appears in many real-world applications. The challenge is to simultaneously consider the order of the classes and the class imbalance, which can notably improve the performance metrics. The Bayesian formulation makes it possible to deal with these two characteristics jointly: it takes into account the prior probability of each class and the decision costs, which can be used to include the imbalance and the ordinal information, respectively. We propose to use the Bayesian formulation to train neural networks, which have shown excellent results in many classification tasks. A loss function is proposed to train networks with a single neuron in the output layer and a threshold-based decision rule. The loss is an estimate of the Bayesian classification cost, based on the Parzen windows estimator, which is fitted for a thresholded decision. Experiments with several real data sets show that the proposed method provides competitive results in different scenarios, due to its high flexibility to specify the relative importance of the errors in the classification of patterns of different classes, considering the order and independently of the probability of each class.
© 2023 The Author(s). Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
1. Introduction
Ordinal classification is a special form of multi-class classification where the classes exhibit an inherent ordering. It is a frequent problem because a natural ordering occurs in many human tasks, such as those associated with every kind of grading in human-devised scales: satisfaction surveys, medical diagnosis, quality assessment or credit rating are just some examples (a detailed list of applications in different research areas can be found in the literature review [1]). Because of its relevance, although this problem has been studied for more than four decades [2], ordinal classification is still receiving attention in several different directions: ordinal classification methods (see the taxonomy in [1]), appropriate performance metrics [3,4] or the interpretability of neural classifiers for ordinal problems [5,6] are some examples.
∗ Corresponding author at: Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid. Av. Universidad 30, 28911 Leganés, Madrid, Spain.
E-mail address: [email protected] (M. Lázaro).

There is a wide variety of ordinal classification methods: from initial approaches that stated the problem as a conventional regression problem [7], where the ordinal classes were mapped into sorted numeric values, to more recent procedures that update conventional techniques to include the ordinal constraints (such as Gaussian Processes (GP) [8], Support Vector Machines (SVM) [9], Neural Networks (NN) [10], or Learning Vector Quantizers (LVQ) [11]). In the survey [1] these methods are grouped into three categories:
• Naive approaches: The problem is posed as another standard problem, such as regression or nominal classification, without considering the ordering of the classes (with evident limitations), or as a cost-sensitive classification problem, with different costs for different misclassification errors. How to determine the cost matrix without a priori knowledge of the problem is the main limitation of these techniques.
• Ordinal binary decomposition approaches: The ordinal target variable is mapped into several binary variables, which are used to train a single multiple-output model or several models. In many cases, the binary decomposition violates the monotonicity or consistency of the ordinal target, although consistent solutions have also been proposed [12].
• Threshold methods: They are based on the definition of thresholds over a real valued measure (or latent variable) that will define the intervals associated to each class. To force a monotonic behavior in this latent variable according to the ordinal structure of the problem can be helpful to improve the generalization ability of the trained model. A potential problem is that the distances among the ordered classes are not known a priori. Because of this, most methods estimate these distances during the learning process.

https://doi.org/10.1016/j.patcog.2023.109303
Many real ordinal problems present a strong imbalance among classes. Conventional classifiers tend to be biased towards majority classes and to produce more errors for the minority classes, although minority events are often highly relevant in many applications (medical diagnosis, fraud detection, etc.). Imbalanced problems are frequent and relevant, and consequently several methods have been proposed to solve them. The proposed solutions can be grouped into three categories [13]: Data preprocessing methods, such as resampling or data augmentation techniques (SMOTE [14] is probably the best known technique). The effectiveness of sampling techniques depends on the problem and on the nature of the classifier, they can modify the class distributions, and in general there is no guarantee of an improved performance. Cost-sensitive methods modify the decision thresholds or assign weights to instances according to a cost matrix, or are trained with an objective function that is cost sensitive, as active learning versions [15]. The Bayes decision theory has also been used in this context, usually with a learning method used to estimate the class posterior probabilities (although most of these estimates are not principled, which spoils the robustness of these approaches). Ensemble methods, which combine multiple classifiers, are also used to deal with imbalanced problems [16,17], including binarization techniques [18]. Most of these procedures are empirical and, consequently, they entail a risk of performance degradation.
Although imbalance appears in many real ordinal problems, few works have proposed solutions considering the ordinal and the imbalance characteristics simultaneously. In most cases, methods are focused on ordinal classification and imbalance is addressed using sampling techniques [19,20] or ensembles [21]. Also very few works, such as [21], have evaluated the performance with appropriate figures of merit for this scenario.

Our objective is to propose a principled solution for ordinal and imbalanced problems, and to evaluate it using appropriate metrics. The proposed solution is based on the combination of neural networks with the Bayes decision theory. Neural networks have shown excellent performance in many applications, including ordinal [10] and imbalanced problems [22] (further details of the use of NNs for ordinal and imbalanced problems can be found in Section 2). The Bayes classification cost considers the class probabilities and the decision costs, which allow to introduce the imbalance and the ordinal information, respectively, in a solid statistical framework. We propose to combine these two powerful tools to create an ordinal classification method capable of dealing with imbalanced data directly, without using data preprocessing techniques or ensembles. The proposed method can be applied to any neural network architecture with a one-dimensional output and a threshold-based decision (the thresholds will be learned, to deal with the lack of knowledge about the relative distance among classes). Our main contribution is a new loss function that is an estimate of the Bayes classification cost, using the Parzen windows estimator to model the conditional probability of the network output for each class. A similar approach was proposed in [23] for imbalanced binary classification, and later extended in [24] for binary example-dependent-cost (EDC) classification. Here, the formulation is fitted to a multiclass ordinal problem with a thresholded decision rule.

The manuscript is organized as follows. Section 2 reviews some related works. Section 3 formally states an imbalanced ordinal problem. The proposed solution is presented in Section 4. Experiments with real data sets are presented in Section 5 to evaluate the proposed method, and Section 6 closes the manuscript summarizing its characteristics.
2. Related works

There are many works dedicated to ordinal classification problems with imbalanced data, but few works consider the ordinal information and the imbalance jointly. In this section, related works are reviewed to highlight the differences with the proposed solution. Because of this, the review is focused on neural network solutions, cost-sensitive approaches and loss functions used in ordinal classification.

2.1. Preprocessing methods or ensembles to handle imbalance

Many solutions are focused on ordinal classification, and the imbalance is handled using preprocessing techniques, such as sampling or data augmentation, or ensembles. Some examples, [19–21], were mentioned above, and additional examples will be addressed in the following sections.

In our work the imbalance is considered in the loss function to be minimized in the training of a neural network, which also considers the order information.
2.2. Use of ordinal evaluation metrics

Some methods propose to use ordinal evaluation metrics as a validation tool, but the ordinal information is not included in the design of the classifier, such as [25], where ensembles (stacked generalization) were validated using several ordinal metrics. The imbalance was treated by using sampling techniques.

In this work the ordinal information, as well as the imbalance, is considered in the design of the solution. Appropriate evaluation metrics for an imbalanced ordinal classification problem will be used to evaluate the proposed method.

2.3. Use of the Bayes theory

The Bayes classification theory has been used in the context of machine learning. Some cost-sensitive approaches use the output of a learning machine to provide class probability estimates, and a risk function based on these estimates along with a cost matrix is used to make decisions, as in [26]. However, not all learning machines are able to provide consistent estimates of posterior probabilities [27], and in practice some heuristic calibration procedures are used to improve the performance, as in [28], which harms the robustness of these approaches.

There are also previous approaches using the Parzen windows estimator with the Bayes theory, as in the probabilistic neural networks proposed by Specht [29], or in [30]. The proposed approach is notably different: Parzen windows are applied here at the one-dimensional output space of a neural network instead of being applied in the multi-dimensional input space, as in [29] or [30]. Moreover, in these works the Parzen estimator is used in the decision rule, which is based on its estimates. In the proposed approach, the decision is based on thresholds defining decision regions over the one-dimensional output of a neural network. The Parzen window estimator is used in the definition of the loss function used to train the neural network classifier.
2.4. Ordinal loss functions

Common loss functions used to train neural networks, such as cross-entropy, do not consider the relative ordering among classes or the imbalance (although weighted versions of these losses have been proposed to deal with imbalance). For this reason, several types of loss functions have been proposed for ordinal classification.

Some works, as [31], use losses modeling the output probabilities as following a unimodal distribution. In [32] these losses, along with additional quasi-unimodal losses, are evaluated using a convolutional neural network (CNN) in 3 ordinal data sets of RGB images using several metrics. In that work, the imbalance is only considered in the evaluation phase, by using appropriate evaluation metrics.

In [33], an ordinal loss function was proposed to train CNNs. This loss function adds the cross-entropy per output in a multi-output problem defined by decomposing the $M$-class ordinal problem in $M-1$ binary problems. The authors implemented several configurations, including ensembles, to evaluate the performance in imbalanced data sets. Similarly, [12] combines a binary decomposition with a weighted cross-entropy loss for $M-1$ binary classifiers with theoretical guarantees for classifier consistency to train CNNs. The weights in the loss for the $M-1$ tasks can be used to take into account the imbalance, with the limitations given by the binary decomposition. However, all experiments were performed with uniform weights and the evaluation metrics, mean absolute error (MAE) and root mean squared error (RMSE), did not consider imbalance.

The quadratic weighted kappa (QWK) metric has been proposed as a loss function to train CNNs [34]. The kappa index, originally designed as a measure of agreement between observations that allows to discard agreements due to mere chance, has been used to evaluate the performance in ordinal problems because it takes into account the difference between the decision and the correct class. In [34] the imbalance is attacked by using data augmentation techniques. In [35] the QWK loss was combined with cumulative link models (CLM) to significantly improve the ordinal classification metrics. Again, augmentation techniques were used to deal with imbalance.

Our proposed loss function is an estimate of the Bayes classification cost, which allows to include in the training of a neural network the ordinal information along with the imbalance, providing also flexibility to consider the relative importance of every class in the problem at hand. It can be used with any neural network architecture, shallow or deep, having a single neuron in the output layer and a threshold-based decision rule.
3. Ordinal classification of imbalanced data

In this section the problem of ordinal classification of imbalanced data is stated. Standard notation will be used: $p_X(x)$ denotes the probability mass function of a discrete random variable $X$, i.e., $p_X(x) = \Pr(X = x)$, $f_X(x)$ is the probability density function (PDF) of a continuous random variable $X$, $|\cdot|$ is the absolute value operator when applied to scalars or the cardinality of a set, and $\hat{a}$ denotes the estimate or the decision that is made for $a$.

A multiclass classification problem can be stated as follows: the objective is to assign a given pattern of dimension $A$, $\mathbf{x} \in \mathbb{R}^A$, to one class out of a set of $M$ possible classes or hypotheses

$$\mathcal{H} = \{H_1, H_2, \ldots, H_M\}. \tag{1}$$

In an ordinal problem the classes exhibit a natural order, which is denoted as

$$H_1 \prec H_2 \prec \cdots \prec H_M, \tag{2}$$

where $\prec$ expresses a relation of order. To simplify the notation, in the following $H_i = i$ will be assumed. Note that in this case the cumulative distribution function (CDF) of the classes is well defined and naturally contains the order.

The available information about the problem is a set of labeled samples (typically named the training set)

$$\mathcal{T} = \{(\mathbf{x}_k, y_k) \mid k \in \mathbb{Z},\ 1 \le k \le N\}, \tag{3}$$

with $y_k \in \mathcal{H}$ being the label that indicates the class the pattern $\mathbf{x}_k$ belongs to. The set $\mathcal{S}_t$ is the set of indexes of patterns that belong to class $t$

$$\mathcal{S}_t = \{k \mid y_k = t,\ y_k \in \mathcal{T}\}, \quad t \in \mathcal{H}, \tag{4}$$

and $N_t$ is the number of elements in the set $\mathcal{S}_t$, i.e., $N_t = |\mathcal{S}_t|$. Obviously,

$$\sum_{t=1}^{M} N_t = N. \tag{5}$$

If the problem is imbalanced, the numbers of samples of the $M$ classes, $N_t$, $t \in \mathcal{H}$, are significantly different.
4. Proposed method

In this section we will present the proposed solution to ordinal classification of imbalanced data.

4.1. Neural network architecture and decision rule

In conventional classification problems, when a neural network is trained with a labeled data set, multiclass problems are solved typically by using networks with a neuron per class in the output layer with softmax activation (the decision for a pattern is the class associated to the neuron with the highest output for this pattern). To solve the ordinal classification problem we propose to use a neural network with a single neuron in the output layer (one-dimensional output space). Having a single output with a monotonic decision rule (the higher the output, the higher the class label in the ordinal arrangement) makes it possible to match the network output with the ordinal structure of the corresponding problem. For a given input pattern $\mathbf{x}_k$ the output of the network is

$$z_k = g(\mathbf{x}_k, \mathbf{w}) \in \mathbb{R}, \tag{6}$$

where the function $g(\cdot, \mathbf{w})$ depends on the neural network architecture through the set of parameters of the network, which are included in the vector $\mathbf{w}$. From this one-dimensional network output, the classifier has to decide a class for pattern $\mathbf{x}_k$

$$\hat{y}_k = \mathrm{decision}(z_k). \tag{7}$$

In the proposed scheme the decision is based on thresholds defining decision regions. In particular, $M-1$ ordered thresholds, $\{u_1, u_2, \ldots, u_{M-1}\}$, with $u_{d-1} < u_d$, define $M$ decision regions, as shown in Fig. 1

$$I_d = (u_{d-1}, u_d], \quad d \in \mathcal{H}. \tag{8}$$

With these decision regions, the decision rule of the proposed method is

$$\hat{y}_k = d \quad \text{if } z_k \in I_d. \tag{9}$$

The $M-1$ thresholds are included in the vector $\mathbf{u}$. To have an appropriate definition for regions $I_1$ and $I_M$ in (8), we can define $u_0 = -\infty$ and $u_M = +\infty$.
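As an illustration of the decision rule (9), the following minimal sketch maps a scalar network output to a class label. The function name `decide` and the threshold values are hypothetical and chosen only for the example; the sketch assumes the thresholds are already sorted and relies on the standard-library `bisect` module.

```python
from bisect import bisect_left

def decide(z, thresholds):
    """Return the class label d in {1, ..., M} such that u_{d-1} < z <= u_d.

    `thresholds` holds the M-1 sorted values u_1 < ... < u_{M-1};
    u_0 = -inf and u_M = +inf are implicit.
    """
    # bisect_left counts the thresholds strictly below z, so an output equal
    # to u_d still falls in the half-open region I_d = (u_{d-1}, u_d].
    return bisect_left(thresholds, z) + 1

# Illustrative thresholds for M = 4 classes (values are arbitrary).
u = [-1.0, 0.5, 2.0]
labels = [decide(z, u) for z in (-3.0, -1.0, 0.0, 0.5, 1.9, 7.0)]
```

Outputs that coincide with a threshold are assigned to the lower class, in agreement with the right-closed intervals in (8).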
Fig. 1. Thresholds and decision regions for the decision rule (9).

4.2. Loss function: An estimate of the Bayes cost

To simultaneously deal with imbalance and ordinal classification a new loss function, inspired by the Bayes decision theory, is proposed: An estimate of the Bayes cost is fitted to the NN architecture and decision rule presented in the previous section. The Bayesian classification theory is based on the average Bayes cost

$$\mathrm{BC} = \sum_{t \in \mathcal{H}} \pi_t \sum_{d \in \mathcal{H}} c_{d,t}\, p_{\hat{Y}|Y}(d|t), \tag{10}$$

which has to be minimized by the classifier. The variable $Y$ represents the true class, and $\hat{Y}$ represents the decision. This formulation is convenient for imbalanced ordinal classification problems: it permits to include the prior probabilities of each class, $\pi_t \equiv p_Y(t)$, and the costs of deciding class $d$ when the true class is $t$, $c_{d,t}$ (the conditional probability distribution $p_{\hat{Y}|Y}(d|t)$ characterizes the probability of this situation). In this way, it can deal with the imbalance and the relative importance of each class and, at the same time, it allows to specify the impact of the ordinal structure in the classification errors. For instance, making costs proportional to the difference between the decision and the true class, $c_{d,t} \propto |d - t|$, can help to force this ordinal structure in the classifier. For these reasons, in this work we propose a new loss function to train neural networks that is based on the Bayes cost in (10).
In the proposed neural network classifier, with decision rule (9), the conditional probabilities of deciding class $d$ when the true class is $t$ are obtained by integrating the conditional distribution of the output of the network for samples of the true class, $f_{Z|Y}(z|t)$, in the corresponding decision region

$$p_{\hat{Y}|Y}(d|t) = \int_{I_d} f_{Z|Y}(z|t)\, dz. \tag{11}$$

However, in practice the distributions $f_{Z|Y}(z|t)$ are unknown. We propose to use instead the estimate provided by the Parzen windows method [36]

$$\hat{f}_{Z|Y}(z|t) = \frac{1}{N_t} \sum_{k \in \mathcal{S}_t} k(z - z_k), \tag{12}$$

where the function $k(z)$ is the Parzen window, which has to be a valid PDF (nonnegative and with unit area). The most common window is a Gaussian PDF with variance $\sigma^2$

$$k(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{z^2}{2\sigma^2}}, \tag{13}$$

but any other valid PDF can be used. The Parzen windows method has been chosen because it is a non-parametric estimator that allows to obtain a closed form expression for the gradient of the estimated Bayesian cost, as it will be shown later.
By defining the integral of the Parzen window

$$K(z) = \int_{-\infty}^{z} k(x)\, dx, \tag{14}$$

taking into account that

$$\int_{-\infty}^{u_i} k(z - z_k)\, dz = K(u_i - z_k) \tag{15}$$

and introducing the Parzen windows estimator (12) in (11), the estimate of $p_{\hat{Y}|Y}(d|t)$ becomes

$$\hat{p}_{\hat{Y}|Y}(d|t) = \frac{1}{N_t} \int_{I_d} \sum_{k \in \mathcal{S}_t} k(z - z_k)\, dz = \frac{1}{N_t} \sum_{k \in \mathcal{S}_t} \left[ K(u_d - z_k) - K(u_{d-1} - z_k) \right]. \tag{16}$$

This result follows from¹

$$\int_{I_d} k(z - z_k)\, dz = \int_{u_{d-1}}^{u_d} k(z - z_k)\, dz = K(u_d - z_k) - K(u_{d-1} - z_k). \tag{17}$$

¹ Note that $\int_{I_1} k(z - z_k)\, dz = K(u_1 - z_k)$ and $\int_{I_M} k(z - z_k)\, dz = 1 - K(u_{M-1} - z_k)$, given that $K(-\infty) = 0$ and $K(\infty) = 1$ by definition.

Finally, by introducing (16) in (10) the estimate of the Bayes cost is

$$\widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u}) = \sum_{t=1}^{M} \frac{\pi_t}{N_t} \sum_{k \in \mathcal{S}_t} \left[ c_{M,t} + \sum_{d=1}^{M-1} \left( c_{d,t} - c_{d+1,t} \right) K(u_d - z_k) \right]. \tag{18}$$

The dependence of the cost on the thresholds and on the neural network parameters ($z_k$ depends on these parameters, as shown in (6)) has now been made explicit. The Parzen windows estimator makes (18) differentiable ($K(z)$ is the integral of the window $k(z)$). This estimate of the Bayes cost (10) is proposed as the loss function to be minimized during the training of the neural network classifier.
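The estimate (18) can be evaluated directly from the network outputs, the labels and the thresholds. The sketch below is one possible way to do so in plain Python, assuming a Gaussian window; the names `gaussian_cdf` and `bayes_cost_estimate`, the list-based data layout and the default variance (borrowed from the value reported in the experiments) are assumptions made for illustration, not part of the original formulation.

```python
import math

def gaussian_cdf(z, sigma2=0.1507):
    """K(z): integral of a zero-mean Gaussian window with variance sigma2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0 * sigma2)))

def bayes_cost_estimate(z, y, u, priors, costs, K=gaussian_cdf):
    """Evaluate (18) for outputs z, labels y in 1..M and thresholds u.

    priors[t-1] is pi_t and costs[d-1][t-1] is c_{d,t} (assumed layout).
    """
    M = len(priors)
    # N_t: number of training patterns of each class.
    counts = [sum(1 for label in y if label == t) for t in range(1, M + 1)]
    total = 0.0
    for z_k, t in zip(z, y):
        term = costs[M - 1][t - 1]                         # c_{M,t}
        for d in range(1, M):                              # d = 1, ..., M-1
            diff = costs[d - 1][t - 1] - costs[d][t - 1]   # c_{d,t} - c_{d+1,t}
            term += diff * K(u[d - 1] - z_k)
        total += priors[t - 1] / counts[t - 1] * term
    return total

# Example with M = 3 and absolute-difference costs c_{d,t} = |d - t|.
costs = [[abs(d - t) for t in range(1, 4)] for d in range(1, 4)]
well_separated = bayes_cost_estimate(
    [-10.0, 0.0, 10.0], [1, 2, 3], [-5.0, 5.0], [1 / 3] * 3, costs)
```

With outputs lying far inside the correct regions the estimate is essentially zero, whereas a class-1 pattern whose output lies far above $u_{M-1}$ contributes the full cost $c_{M,1}$.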
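4.3. Training algorithm

A gradient descent training algorithm is used to minimize (18). The parameters $\mathbf{w}$ and the thresholds $\mathbf{u}$ are iteratively updated as²

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \mu \left. \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(n),\, \mathbf{u} = \mathbf{u}(n)}, \tag{19}$$

$$\mathbf{u}(n+1) = \mathbf{u}(n) - \mu \left. \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial \mathbf{u}} \right|_{\mathbf{w} = \mathbf{w}(n),\, \mathbf{u} = \mathbf{u}(n)}, \tag{20}$$

where $\mu$ is the step-size parameter. The gradient expression for each iteration can be computed sample-by-sample, batch (using the whole training set) or mini-batch (using a sub-set of the training set in each step). The sample-by-sample expression of the gradient for the network parameters $\mathbf{w}$ is

$$\frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial \mathbf{w}} = \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial z_k}\, \frac{\partial z_k}{\partial \mathbf{w}}. \tag{21}$$

² For the sake of simplicity, the simplest gradient descent approach is shown. Obviously, some other stochastic optimization approaches can also be employed, such as including momentum, or Adam optimization, just to cite some examples.

The second term on the right-hand side of (21) is independent of the cost function: it only depends on the network architecture. The first term, for a pattern of class $t$, is

$$\left. \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial z_k} \right|_{k \in \mathcal{S}_t} = \frac{\pi_t}{N_t} \sum_{d=1}^{M-1} \left( c_{d+1,t} - c_{d,t} \right) k(u_d - z_k). \tag{22}$$

The expression for the gradient of the thresholds is simpler, because it does not depend on the neural network architecture as the thresholds do not depend explicitly on the network output. To simultaneously update weights and thresholds, we will use the sample-by-sample contribution of a pattern $\mathbf{x}_k$ of class $t$ with output $z_k$ to the gradient

$$\left. \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial u_d} \right|_{k \in \mathcal{S}_t} = \frac{\pi_t}{N_t} \left( c_{d,t} - c_{d+1,t} \right) k(u_d - z_k). \tag{23}$$

In many cases, the probabilities of the different classes are estimated from the training set, $\pi_t = \frac{N_t}{N}$. In these cases, by defining the normalized costs $\bar{c}_{d,t} = \frac{c_{d,t}}{N}$ (a normalization of these costs does not affect the solution of the problem) more compact expressions can be obtained for (18), (22) and (23)

$$\widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u}) = \sum_{t=1}^{M} \left[ \bar{c}_{M,t} N_t + \sum_{k \in \mathcal{S}_t} \sum_{d=1}^{M-1} \left( \bar{c}_{d,t} - \bar{c}_{d+1,t} \right) K(u_d - z_k) \right], \tag{24}$$

$$\left. \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial z_k} \right|_{k \in \mathcal{S}_t} = \sum_{d=1}^{M-1} \left( \bar{c}_{d+1,t} - \bar{c}_{d,t} \right) k(u_d - z_k), \tag{25}$$

$$\left. \frac{\partial \widehat{\mathrm{BC}}(\mathbf{w}, \mathbf{u})}{\partial u_d} \right|_{k \in \mathcal{S}_t} = \left( \bar{c}_{d,t} - \bar{c}_{d+1,t} \right) k(u_d - z_k). \tag{26}$$

With respect to the choice of the Parzen window, $k(z)$, it must be remarked that the goal is not to obtain good estimates of the conditional distributions $f_{Z|Y}(z|t)$ by (12), but to obtain a good classification performance³ (a discussion about this subject can be found in [23]).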
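The per-pattern gradient contributions (22) and (23) can be sketched as follows. This is a plain-Python illustration under stated assumptions: a Gaussian window with the variance reported in the experiments, a cost layout `costs[d-1][t-1]` for $c_{d,t}$, and hypothetical helper names (`k`, `sample_gradients`, `update_thresholds`). The chain-rule factor $\partial z_k / \partial \mathbf{w}$ of (21) is left to whatever back-propagation routine the network uses, and the sketch does not enforce the ordering of the thresholds after an update.

```python
import math

SIGMA2 = 0.1507  # Gaussian window variance (value reported in the experiments)

def k(z):
    """Gaussian Parzen window, Eq. (13)."""
    return math.exp(-z * z / (2.0 * SIGMA2)) / math.sqrt(2.0 * math.pi * SIGMA2)

def sample_gradients(z_k, t, u, prior_t, n_t, costs):
    """Contribution of one pattern of class t to the gradients (22) and (23).

    Returns (dBC/dz_k, [dBC/du_1, ..., dBC/du_{M-1}]).
    """
    scale = prior_t / n_t
    grad_u = []
    for d in range(1, len(u) + 1):
        diff = costs[d - 1][t - 1] - costs[d][t - 1]   # c_{d,t} - c_{d+1,t}
        grad_u.append(scale * diff * k(u[d - 1] - z_k))
    # (22) equals minus the sum over d of the threshold terms (23).
    grad_z = -sum(grad_u)
    return grad_z, grad_u

def update_thresholds(u, grad_u, mu=0.1):
    """One plain gradient-descent step (20) on the thresholds."""
    return [u_d - mu * g for u_d, g in zip(u, grad_u)]
```

For a pattern of class 2 in a three-class problem with $c_{d,t} = |d - t|$, the gradient with respect to $u_1$ is positive and the one with respect to $u_2$ is negative, so a descent step pushes both thresholds away from the output of that pattern.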
4.4. Computational burden

To analyze the computational burden of the training algorithm, a multiclass neural network classifier with $M$ neurons in the output layer (one per class), whose burden is well-known if the network architecture is given, will be used as a reference. The updating expressions of the proposed method are compared with those associated to a couple of well-known loss functions, such as the Mean Squared Error (MSE) and the Binary Cross-Entropy (BCE) losses, which are, respectively

$$L_M(\mathbf{w}) = (y_k - z_k)^2, \tag{27}$$

$$L_B(\mathbf{w}) = -y_k \log(z_k) - (1 - y_k) \log(1 - z_k). \tag{28}$$

For every loss function $L(\mathbf{w})$ the contribution of a pattern $\mathbf{x}_k$ with output $z_k$ to the gradient can be written as

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial L(\mathbf{w})}{\partial z_k}\, \frac{\partial z_k}{\partial \mathbf{w}} \tag{29}$$

where the second term on the right-hand side does not depend on the loss but only on the network architecture, as in (21). For MSE and BCE

$$\frac{\partial L_M(\mathbf{w})}{\partial z_k} = -2 (y_k - z_k), \tag{30}$$

$$\frac{\partial L_B(\mathbf{w})}{\partial z_k} = -\frac{y_k}{z_k} + \frac{1 - y_k}{1 - z_k}. \tag{31}$$

In a multiclass neural network, $M$ terms (one per class) such as (30) or (31) must be computed for a pattern. With the proposed cost function, the computation of (22) or (25) for a pattern requires the addition of $M-1$ terms and the evaluation of the window function. Taking into account that $\frac{\partial z_k}{\partial \mathbf{w}}$ is independent of the cost function, and that this term is mainly responsible for the computational burden (especially in deep networks because it is back-propagated), it can be concluded that the computational complexity using the proposed cost function is of the same order of magnitude as that required for a multiclass neural network classifier using MSE or BCE loss functions.⁴ The update of thresholds $\mathbf{u}$ has not been considered in the discussion: The gradient expressions (23) or (26) do not depend on the network architecture, and the number of thresholds in a practical network is much lower than the number of network parameters in $\mathbf{w}$.
³ In fact, these distributions are not estimated during the training procedure, but the formulation by using the Parzen window method allows to obtain closed form expressions for the gradient of the cost function (18), as in (22) and (23).
⁴ In a practical implementation, the small difference is mainly dependent on the cost for evaluating the Parzen windows $k(\cdot)$, which basically depends on the implementation platform.

5. Experiments

In this section, after presenting appropriate figures of merit to measure the performance in imbalanced ordinal classification methods, experiments with a set of real data sets to evaluate the performance of the proposed methods will be presented.
5.1. Figures of merit

The choice of an appropriate figure of merit to compare the performance of different methods is very important in any application. In conventional classification problems, the usual choice is the accuracy (or its complement, the probability of error), which for a labeled data set is defined as

$$\mathrm{Acc} = \frac{1}{N} \sum_{k=1}^{N} I(\hat{y}_k = y_k), \tag{32}$$

where $I(\cdot)$ denotes the indicator function that returns 1 if the argument is true and 0 if it is false. Although accuracy is a reasonable performance measure for a balanced classification problem, in imbalanced problems it over-represents the accuracy in the majority classes and under-represents the accuracy in the minority classes. For this reason, it is frequently replaced by the average or balanced accuracy

$$\mathrm{AAcc} = \frac{1}{M} \sum_{t=1}^{M} \mathrm{Acc}_t, \tag{33}$$

where $\mathrm{Acc}_t$ is the accuracy in the detection of class $t$

$$\mathrm{Acc}_t = \frac{1}{N_t} \sum_{k \in \mathcal{S}_t} I(\hat{y}_k = y_k). \tag{34}$$

For ordinal problems, the figure of merit must include the distance between the true class and the decision. The Mean Absolute Error (MAE), which includes this distance, is a typical figure of merit for this kind of problems

$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} |\hat{y}_k - y_k|. \tag{35}$$

In imbalanced ordinal problems, to avoid the over-representation of the performance for the majority classes, this measure is usually replaced by the average (or balanced) MAE

$$\mathrm{AMAE} = \frac{1}{M} \sum_{t=1}^{M} \mathrm{MAE}_t, \tag{36}$$

where $\mathrm{MAE}_t$ is the MAE for patterns of class $t$

$$\mathrm{MAE}_t = \frac{1}{N_t} \sum_{k \in \mathcal{S}_t} |\hat{y}_k - y_k|. \tag{37}$$

Cohen's kappa coefficient is a measure of agreement between observations that allows to discard agreements due to mere chance. It has been applied in the evaluation of classification algorithms and it has also been used as a loss function in ordinal problems [34]. The quadratic weighted Cohen's kappa coefficient is

$$\kappa_2 = 1 - \frac{\sum_{t=1}^{M} \sum_{d=1}^{M} (d - t)^2\, o_{d,t}}{\sum_{t=1}^{M} \sum_{d=1}^{M} (d - t)^2\, e_{d,t}}, \tag{38}$$

where $o_{d,t}$ is the observed disagreement for decision $d$ and true class $t$ and $e_{d,t}$ is the corresponding expected disagreement due to chance. This coefficient ranges from -1 (total disagreement) through 0 (random classification) to +1 (total agreement).
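The figures of merit (35)-(38) can be computed in a few lines. The sketch below is a plain-Python illustration with hypothetical function names (`mae`, `amae`, `quadratic_weighted_kappa`). It assumes integer labels in $1, \ldots, M$, that every class appears at least once among the true labels (so that each $\mathrm{MAE}_t$ is defined), and the usual convention in which $o_{d,t}$ is the count of patterns with decision $d$ and true class $t$ while $e_{d,t}$ is the product of the corresponding marginal counts divided by $N$.

```python
def mae(y_true, y_pred):
    """Mean absolute error, Eq. (35)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def amae(y_true, y_pred, num_classes):
    """Average (balanced) MAE, Eqs. (36)-(37); every class must be present."""
    per_class = []
    for t in range(1, num_classes + 1):
        errors = [abs(p - y) for y, p in zip(y_true, y_pred) if y == t]
        per_class.append(sum(errors) / len(errors))
    return sum(per_class) / num_classes

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa, Eq. (38), from counts and marginals."""
    M, N = num_classes, len(y_true)
    observed = [[0] * M for _ in range(M)]         # observed[d-1][t-1]
    for t, d in zip(y_true, y_pred):
        observed[d - 1][t - 1] += 1
    decided = [sum(row) for row in observed]        # patterns per decision d
    actual = [sum(observed[d][t] for d in range(M)) for t in range(M)]
    num = den = 0.0
    for d in range(M):
        for t in range(M):
            weight = (d - t) ** 2
            num += weight * observed[d][t]
            den += weight * decided[d] * actual[t] / N   # chance agreement
    return 1.0 - num / den

# A classifier that always decides class 1 on an imbalanced toy sample.
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 1, 1]
toy_mae = mae(y_true, y_pred)
toy_amae = amae(y_true, y_pred, 3)
toy_kappa = quadratic_weighted_kappa(y_true, y_pred, 3)
```

In this toy case MAE looks moderate because the majority class dominates the average, AMAE is twice as large, and the kappa coefficient is zero, as expected for a constant decision.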
In this work, MAE and AMAE will be used to evaluate the performance in ordinal problems. Note that these two measures give a different importance to the misclassifications per class: MAE gives the same importance to every pattern, which in practice over-represents the performance for the majority classes because the contribution of a class is proportional to its probability (to be more precise, to the number of examples in the data set). And AMAE gives the same importance to all classes, independently of their probability. Cohen's kappa coefficient will be used to assess the agreement between decisions and true classes that is obtained with the proposed method.

Table 1
Experimental datasets and their main characteristics.

Dataset           Patterns  Attr.  Classes  Patterns per class
ERA               1000      4      9        (92, 142, 181, 172, 158, 118, 88, 31, 18)
ESL               488       4      9        (2, 12, 38, 100, 116, 135, 62, 19, 4)
LEV               1000      4      5        (93, 280, 403, 197, 27)
SWD               1000      10     4        (32, 352, 399, 217)
automobile        205       71     6        (3, 22, 67, 54, 32, 27)
balance-scale     625       4      3        (288, 49, 288)
bondrate          57        37     5        (6, 33, 12, 5, 1)
eucalyptus        736       91     5        (180, 107, 130, 214, 105)
newthyroid        215       5      3        (30, 150, 35)
pasture           36        25     3        (12, 12, 12)
squash-stored     52        51     3        (23, 21, 8)
squash-unstored   52        52     3        (24, 24, 4)
tae               151       54     3        (49, 50, 52)
toy               300       2      5        (35, 87, 79, 68, 31)
winequality-red   1599      11     6        (10, 53, 681, 638, 199, 18)
5.2. Datasets

Experiments have been performed with the 15 datasets that were used to evaluate several methods in [21], where each dataset was arranged in 30 partitions into training (around 3/4 of the data) and test (around 1/4 of the data) sets. The same partitions⁵ are considered here. The reported results average the performance for these 30 partitions. The basic characteristics of each dataset (number of patterns, number of attributes, number of classes and number of patterns per class) appear in Table 1.

5.3. Benchmark methods

The seven methods tested in [21] will be used as benchmark. These methods include two ordinal variants of class switching [37] to generate ensembles of classifiers that were proposed in [21], Arithmetic Ordinal Class Switching (AOCS), and Geometric Ordinal Class Switching (GOCS), along with the standard (Nominal) Class Switching (NCS) ensemble and a conventional ensemble that does not use class switching, denoted as Original ensemble (Orig). Moreover, results are provided for other three benchmark methods: reduction from ordinal regression to binary support vector machines (REDSVM) [9], the reformulation of Gaussian processes for ordinal regression (GPOR) [8], and the ORBoost method with all margins [38]. The details of the design of these methods can be found in [21].
⁵ Datasets and partitions are available at http://uco.es/grupos/ayrna/orreview (last access, October 2022).

5.4. Results

The proposed method has the flexibility provided by the parameters $\pi_t$ and $c_{d,t}$ for $t, d \in \{1, 2, \ldots, M\}$ to specify the relative importance of misclassifications as a function of the difference between the class label for the true class and the decision and at the same time the relative importance of each class in the objective function. To deal with the relative importance of misclassifications in an ordinal problem, we have considered

$$c_{d,t} = |d - t|. \tag{39}$$

Fig. 2. Parzen windows used in the experiments, $k_X(z)$, $X \in \{G, U, L\}$, with G: Gaussian, U: uniform, L: linear.

To simulate two different situations that can happen in real problems, we have considered two different Bayesian Ordinal Neural Network (BONN) solutions:
• BONN(MAE): the importance of the class is proportional to its probability in the data set. For this kind of situation, the appropriate figure of merit to evaluate the performance of the different methods is MAE. The prior probability parameters of the method are

  $$\pi_t = \frac{N_t}{N}. \tag{40}$$

• BONN(AMAE): the importance is the same for all classes, independently of their probability in the data set. In this scenario, which can happen in many imbalanced problems, the appropriate figure of merit is AMAE. The prior probability parameters for this case are

  $$\pi_t = \frac{1}{M}. \tag{41}$$

  Note that this setup is mathematically equivalent, up to a constant scale factor in the cost, to taking the prior probability of each class as given by the data set while making the misclassification costs proportional to the inverse of the prior probability of the true class, i.e.

  $$\pi_t = \frac{N_t}{N}, \quad c_{d,t} = \frac{1}{\pi_t} |d - t|. \tag{42}$$

  In any case, the result is that all classes have the same weight in the proposed cost function (18) independently of their probability.
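The two configurations above differ only in the priors and costs passed to the loss. The following sketch builds them in plain Python and checks numerically that the products $\pi_t c_{d,t}$ obtained from (41) with (39) and from (42) coincide up to the constant factor $M$. The function names are hypothetical, and the class counts are those listed for the SWD dataset in Table 1 (computed here over the whole dataset for illustration, whereas in the experiments they would come from each training partition).

```python
def ordinal_costs(M):
    """c_{d,t} = |d - t| as in (39); costs[d-1][t-1] holds c_{d,t}."""
    return [[abs(d - t) for t in range(1, M + 1)] for d in range(1, M + 1)]

def bonn_mae_priors(class_counts):
    """Priors proportional to the class frequencies, Eq. (40)."""
    N = sum(class_counts)
    return [n / N for n in class_counts]

def bonn_amae_priors(class_counts):
    """Uniform priors, Eq. (41)."""
    M = len(class_counts)
    return [1.0 / M] * M

def inverse_prior_costs(priors):
    """Costs of Eq. (42): |d - t| divided by the prior of the true class."""
    M = len(priors)
    return [[abs(d - t) / priors[t - 1] for t in range(1, M + 1)]
            for d in range(1, M + 1)]

counts = [32, 352, 399, 217]              # SWD class counts from Table 1
M = len(counts)
flat = bonn_amae_priors(counts)
data = bonn_mae_priors(counts)
c39 = ordinal_costs(M)
c42 = inverse_prior_costs(data)
# Per-class weights pi_t * c_{d,t} under (41)+(39) and under (42).
w41 = [[flat[t] * c39[d][t] for t in range(M)] for d in range(M)]
w42 = [[data[t] * c42[d][t] for t in range(M)] for d in range(M)]
```

Since a constant factor does not change the minimizer of (18), either parameterization leads to the same trained classifier in principle.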
A multilayer perceptron with a single hidden layer with L neurons is used in the experiments. For each dataset, L∈
{
10,20,30,40,50,100}
neurons, withboth tanh andReLU activa-tion functions, have beentested. The neuron of the output layer has a linear activation function. Three different Parzen windows have been tested for each dataset: Gaussian, uniform and linear windows,whichareplottedinFig.2.Theuniformandlinearwin- dows are constrained tothe domain [−1,1], andthe variance of theGaussianwindowis0.1507(99%ofprobabilityin[−1,1]).
kG
(
z)
=√ 1 2π σ
2e−2zσ22, with
σ
2=0.1507 (43)kU
(
z)
=)
12 if
|
z|
≤ 10 if
|
z|
> 1 (44)kL
(
z)
=)
12
(
z+1)
if|
z|
≤ 10 if
|
z|
> 1 (45)The choiceofthe windowweights differentlythe contributionof thesamples tothe updatesduringthetraining. The weight fora samplexk dependson the distance betweenits network output, zk,andthe decisionthresholds ud (seethegradient Eqs.(25)and (26),whichmaketheupdatesproportionaltok(ud− zk)).Theuni- formwindowweightsuniformlyallsamplesatdistancelowerthan 1.Withthe Gaussianwindow theweight decreases exponentially with the distance to the thresholds. The linear window weights
Table 2
MAE (average ± standard deviation in the 30 partitions) for each data set.
Dataset Best in [21] BONN(MAE)
ERA 1.219 ± 0.044 (REDSVM) 1.193 ± 0.044
ESL 0.301 ± 0.035 (GPOR) 0.291 ± 0.031
LEV 0.410 ± 0.023 (REDSVM) 0.389 ± 0.026 SWD 0.440 ± 0.032 (GPOR) 0.428 ± 0.029 automobile 0.263 ± 0.074 (NCS) 0.282 ± 0.063 balance-scale 0.001 ± 0.004 (REDSVM) 0.017 ± 0.008 bondrate 0.531 ± 0.110 (ORBoost) 0.438 ± 0.088 eucalyptus 0.331 ± 0.038 (GPOR) 0.352 ± 0.024 newthyroid 0.029 ± 0.022 (REDSVM) 0.017 ± 0.015 pasture 0.219 ± 0.147 (NCS) 0.229 ± 0.075 squash-stored 0.327 ± 0.122 (AOCS/GOCS) 0.281 ± 0.111 squash-unstored 0.161 ± 0.085 (AOCS/GOCS) 0.096 ± 0.056 tae 0.461 ± 0.060 (REDSVM) 0.494 ± 0.073 toy 0.024 ± 0.013 (REDSVM) 0.049 ± 0.017 winequality-red 0.348 ± 0.019 (GOCS) 0.410 ± 0.014
Table 3
AMAE (average ± standard deviation in the 30 partitions) for each dataset.
Dataset          | Best in [21]               | BONN(AMAE)
ERA              | 1.370 ± 0.099 (AOCS)       | 1.277 ± 0.092
ESL              | 0.459 ± 0.116 (REDSVM)     | 0.432 ± 0.070
LEV              | 0.601 ± 0.051 (AOCS)       | 0.518 ± 0.056
SWD              | 0.576 ± 0.039 (ORBoost)    | 0.461 ± 0.046
automobile       | 0.313 ± 0.120 (GOCS)       | 0.346 ± 0.104
balance-scale    | 0.001 ± 0.003 (REDSVM)     | 0.015 ± 0.013
bondrate         | 0.839 ± 0.260 (ORBoost)    | 0.810 ± 0.243
eucalyptus       | 0.362 ± 0.040 (GPOR)       | 0.381 ± 0.029
newthyroid       | 0.048 ± 0.040 (REDSVM)     | 0.026 ± 0.034
pasture          | 0.219 ± 0.147 (NCS)        | 0.229 ± 0.075
squash-stored    | 0.368 ± 0.129 (ORBoost)    | 0.337 ± 0.163
squash-unstored  | 0.170 ± 0.137 (NCS)        | 0.153 ± 0.084
tae              | 0.459 ± 0.059 (REDSVM)     | 0.494 ± 0.082
toy              | 0.024 ± 0.016 (REDSVM)     | 0.054 ± 0.016
winequality-red  | 0.952 ± 0.076 (AOCS)       | 0.781 ± 0.121
linearly and asymmetrically the samples at distance lower than 1, with a higher weight for samples below the thresholds.
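As an illustration, the three windows in (43)-(45) and the resulting sample weight k(u_d − z_k) can be sketched in a few lines of Python. This is only a minimal sketch: the function names (`k_gauss`, `k_unif`, `k_lin`, `gauss_mass`, `sample_weight`) are ours and are not part of the original implementation.

```python
import math

SIGMA2 = 0.1507  # variance of the Gaussian window, as stated in the text


def k_gauss(z, sigma2=SIGMA2):
    """Gaussian window, Eq. (43)."""
    return math.exp(-z * z / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def k_unif(z):
    """Uniform window on [-1, 1], Eq. (44)."""
    return 0.5 if abs(z) <= 1.0 else 0.0


def k_lin(z):
    """Linear (asymmetric) window on [-1, 1], Eq. (45)."""
    return 0.5 * (z + 1.0) if abs(z) <= 1.0 else 0.0


def gauss_mass(a, b, sigma2=SIGMA2):
    """Probability mass of the Gaussian window in [a, b] (via the error function)."""
    s = math.sqrt(2.0 * sigma2)
    return 0.5 * (math.erf(b / s) - math.erf(a / s))


def sample_weight(window, u_d, z_k):
    """Weight of a sample with network output z_k for threshold u_d: k(u_d - z_k)."""
    return window(u_d - z_k)


if __name__ == "__main__":
    print(gauss_mass(-1.0, 1.0))            # mass of the Gaussian window in [-1, 1]
    print(sample_weight(k_lin, 0.0, -0.5))  # sample 0.5 below the threshold
    print(sample_weight(k_lin, 0.0, 0.5))   # sample 0.5 above the threshold
```

With the linear window, a sample located 0.5 below a threshold receives weight 0.75, whereas a sample 0.5 above it receives 0.25; the uniform window assigns 0.5 to both.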
The best network size, activation function, and Parzen window for each dataset and performance criterion have been selected by cross-validation (3-fold cross-validation with the training set).
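The selection procedure can be outlined as a search over the 6 × 2 × 3 = 36 configurations described above. The sketch below is a hypothetical outline, not the authors' code: `cv_score` is a user-supplied callable (an assumption) that trains a network on the training indices and returns the validation MAE or AMAE, and the plain shuffled split is also an assumption, since the text does not state whether the folds were stratified.

```python
import itertools
import random

HIDDEN_SIZES = [10, 20, 30, 40, 50, 100]
ACTIVATIONS = ["tanh", "relu"]
WINDOWS = ["gaussian", "uniform", "linear"]


def config_grid():
    """All (L, activation, window) combinations tested for each dataset."""
    return list(itertools.product(HIDDEN_SIZES, ACTIVATIONS, WINDOWS))


def three_fold_indices(n_samples, seed=0):
    """Split range(n_samples) into 3 disjoint, shuffled validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::3] for i in range(3)]


def select_config(n_samples, cv_score, seed=0):
    """Return the configuration with the lowest mean validation score.

    cv_score(config, train_idx, val_idx) is a hypothetical callable that
    trains on train_idx and returns the criterion (MAE or AMAE) on val_idx.
    """
    folds = three_fold_indices(n_samples, seed)
    best, best_score = None, float("inf")
    for config in config_grid():
        scores = []
        for k, val_idx in enumerate(folds):
            # Training indices: every fold except the k-th one.
            train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
            scores.append(cv_score(config, train_idx, val_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best, best_score = config, mean_score
    return best, best_score
```

The selection is repeated for each dataset and for each performance criterion, so BONN(MAE) and BONN(AMAE) may end up with different configurations on the same dataset.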
Tables 2 and 3 show the resulting MAE and AMAE, respectively. To simplify the comparison, both tables include only the result of the best benchmark method in [21] for each dataset (the detailed results of each benchmark method for each dataset can be found in [21]).
The best (lowest) average value of MAE/AMAE for each dataset is underlined, and boldface highlights the cases where the difference is significant. We define as significant those differences where the hypothesis that the two means are equivalent can be rejected with a level of significance equal to 5%, i.e., those datasets where the absolute value of the difference between mean values is higher than 1.65σ [39], σ being the estimated deviation of the difference between the MAE/AMAE of the two methods,

\[ \sigma = \sqrt{\frac{\sigma_{\text{bench}}^2}{n} + \frac{\sigma_{\text{prop}}^2}{n}}, \tag{46} \]

where $\sigma_{\text{bench}}$ denotes the standard deviation for the best benchmark method in [21], $\sigma_{\text{prop}}$ is the standard deviation of the proposed method, and n = 30 is the number of averaged partitions for both methods.

It can be seen that BONN(MAE) obtains the best MAE in 8 of the 15 datasets, with 5 significant wins, and with 5 significant losses in the remaining 7 datasets, where the best method in [21] has the best MAE. The best results in AMAE are obtained by
Table 4
Wins/Ties/Losses of the BONN(MAE) method in MAE against each of the benchmark methods (considering wins with a level of significance of 5%), and paired t-test and paired Wilcoxon test, including the p-value and the hypothesis for the 5% level of significance.
Method   | Wins/Ties/Losses of BONN(MAE) | T-Test p-value (H) | Wilcoxon p-value (H)
Orig     | 13/2/0                        | 0.0003 (1)         | 0.0001 (1)
NCS      | 11/3/1                        | 0.0127 (1)         | 0.0128 (1)
AOCS     | 9/5/1                         | 0.0178 (1)         | 0.0067 (1)
GOCS     | 9/5/1                         | 0.0183 (1)         | 0.0063 (1)
GPOR     | 11/3/1                        | 0.0052 (1)         | 0.0014 (1)
ORBoost  | 12/2/1                        | 0.0030 (1)         | 0.0009 (1)
REDSVM   | 11/1/3                        | 0.0164 (1)         | 0.0215 (1)
Table 5
Wins/Ties/Losses of the BONN(AMAE) method in AMAE against each of the benchmark methods (considering wins with a level of significance of 5%), and paired t-test and paired Wilcoxon test, including the p-value and the hypothesis for the 5% level of significance.
Method   | Wins/Ties/Losses of BONN(AMAE) | T-Test p-value (H) | Wilcoxon p-value (H)
Orig     | 12/3/0                         | 0.0018 (1)         | 0.0003 (1)
NCS      | 10/5/1                         | 0.0087 (1)         | 0.0016 (1)
AOCS     | 9/6/0                          | 0.0106 (1)         | 0.0012 (1)
GOCS     | 9/6/0                          | 0.0101 (1)         | 0.0012 (1)
GPOR     | 13/0/2                         | 0.0007 (1)         | 0.0002 (1)
ORBoost  | 11/4/0                         | 0.0003 (1)         | 0.0001 (1)
REDSVM   | 10/2/3                         | 0.0036 (1)         | 0.0026 (1)
BONN(AMAE) in 9 of the 15 datasets, with 5 significant wins, and with 4 significant losses in the other 6 datasets.
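The significance criterion based on (46) is easy to reproduce from the tabulated means and standard deviations. The following sketch applies it to the values of Table 2; the function names `compare` and `tally` are ours, and the code is only an illustration of the criterion as described in the text, not the authors' evaluation script.

```python
import math


def compare(mean_bench, std_bench, mean_prop, std_prop, n=30, factor=1.65):
    """Classify a comparison as 'win', 'tie' or 'loss' for the proposed method.

    Lower MAE/AMAE is better, so a win means mean_prop is significantly lower.
    """
    sigma = math.sqrt(std_bench ** 2 / n + std_prop ** 2 / n)  # Eq. (46)
    diff = mean_bench - mean_prop
    if abs(diff) <= factor * sigma:
        return "tie"
    return "win" if diff > 0 else "loss"


# (dataset, best-in-[21] mean, std, BONN(MAE) mean, std), copied from Table 2.
TABLE2 = [
    ("ERA", 1.219, 0.044, 1.193, 0.044),
    ("ESL", 0.301, 0.035, 0.291, 0.031),
    ("LEV", 0.410, 0.023, 0.389, 0.026),
    ("SWD", 0.440, 0.032, 0.428, 0.029),
    ("automobile", 0.263, 0.074, 0.282, 0.063),
    ("balance-scale", 0.001, 0.004, 0.017, 0.008),
    ("bondrate", 0.531, 0.110, 0.438, 0.088),
    ("eucalyptus", 0.331, 0.038, 0.352, 0.024),
    ("newthyroid", 0.029, 0.022, 0.017, 0.015),
    ("pasture", 0.219, 0.147, 0.229, 0.075),
    ("squash-stored", 0.327, 0.122, 0.281, 0.111),
    ("squash-unstored", 0.161, 0.085, 0.096, 0.056),
    ("tae", 0.461, 0.060, 0.494, 0.073),
    ("toy", 0.024, 0.013, 0.049, 0.017),
    ("winequality-red", 0.348, 0.019, 0.410, 0.014),
]


def tally(rows):
    """Count wins, ties and losses of the proposed method over all datasets."""
    counts = {"win": 0, "tie": 0, "loss": 0}
    for _, mb, sb, mp, sp in rows:
        counts[compare(mb, sb, mp, sp)] += 1
    return counts
```

For instance, on ERA the difference 1.219 − 1.193 = 0.026 exceeds 1.65σ ≈ 0.019, so it counts as a significant win, while on ESL the difference of 0.010 is below 1.65σ ≈ 0.014 and counts as a tie. Applied to all rows of Table 2, `tally(TABLE2)` gives 5 wins, 5 ties and 5 losses, in line with the counts reported above for MAE.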
Note that the best result in [21] for each dataset is obtained by a different method (indicated in Tables 2 and 3), so the proposed procedures are compared on each dataset against the best of several benchmarks, and their performance must be considered very good. To further support this conclusion, Tables 4 and 5 present the number of wins, ties and losses of BONN(MAE) in MAE and BONN(AMAE) in AMAE, respectively, against each benchmark method (wins and losses are defined with the significance level of 5%, and ties are the cases where the hypothesis of the means being equivalent cannot be rejected with this significance level). These tables also contain the p-value and the hypothesis for a paired t-test and a paired Wilcoxon test.

It can be seen that BONN(MAE) has a minimum of 9 wins in MAE (against AOCS and GOCS, with only 1 loss) and a maximum of 3 losses, against REDSVM, but with 11 wins against this method. In AMAE, BONN(AMAE) also has a minimum of 9 wins (against AOCS and GOCS, now with no losses) and a maximum of 3 losses against REDSVM, but with 10 wins against it. In all cases, the null hypothesis that the true mean difference is zero is rejected with both the t-test and the Wilcoxon test. These results support the conclusion that the proposed approach has a remarkably high performance: In a pairwise comparison it provides the best average results and the best wins/losses figure against every benchmark method. The difference in performance is statistically significant in all cases.
To test the differences between more than two models, the Friedman test [40] is a well-known and widely used method. Figure 3 plots the means and ranks obtained in the Friedman test, both for MAE and AMAE. This test also shows a clear advantage in the performance of the proposed method with respect to all the benchmark methods, especially in AMAE, the scenario where imbalance is taken into account.
One of the main characteristics of the proposed method is the flexibility provided by the Bayesian formulation. This feature allows it to provide good results in different scenarios with different requirements. Although the method is designed to improve
Fig. 3. Average means and ranks for Friedman test applied to the 8 methods under comparison for both MAE and AMAE (better models: on the left).
Table 6
Average figures of merit in the 15 datasets.
Method         | Acc    | AAcc   | MAE    | AMAE   | Cohen’s κ2
MLP (Softmax)  | 0.6842 | 0.6132 | 0.3938 | 0.5089 | 0.7180
BONN(Acc)      | 0.6772 | 0.5825 | 0.3990 | 0.5603 | 0.6701
BONN(AAcc)     | 0.6274 | 0.6414 | 0.4813 | 0.4836 | 0.6769
BONN(MAE)      | 0.6841 | 0.5877 | 0.3645 | 0.5137 | 0.7085
BONN(AMAE)     | 0.6286 | 0.6441 | 0.4415 | 0.4270 | 0.7177
the metrics that are related to ordinal problems, including imbalance, the method can also be used to match different requirements. As an example, it can be used to maximize accuracy or balanced accuracy. If the decision costs are now

\[ c_{d,t} = \begin{cases} 0, & d = t \\ 1, & d \neq t, \end{cases} \tag{47} \]

these costs, along with (40) and (41), can be used to maximize (32) and (33), respectively; BONN(Acc) and BONN(AAcc) are the names used for these two new approaches. Table 6 compares the average performance obtained in the 15 datasets using the proposed model with the 4 different configurations. The figures of merit are accuracy, balanced accuracy, MAE and AMAE (for the sake of a more compact presentation, the standard deviations are not included in the table, but they have been used to determine the wins/ties in each figure of merit, as in the previously presented results). Here, the best results for each figure of merit are highlighted in boldface, while those without a significant difference with the best result (ties, using (46) as before) are underlined. A conventional multi-class MLP with softmax activation in the output layer has also been included in the comparison as a benchmark for accuracy. To isolate the effect of the cost function on the results, the same basic architecture has been used with all methods in all datasets: an MLP with 20 neurons in the hidden layer and tanh activation function (note that the output layer is different for the MLP with softmax activation).
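To make the relation between decision costs and figures of merit concrete, the sketch below computes Acc, AAcc, MAE and AMAE from a confusion matrix of counts, using the usual definitions of these metrics (we assume they coincide with (32)-(34) and the MAE/AMAE definitions given earlier in the paper). The function names and the example matrix are hypothetical and serve only as an illustration.

```python
def ordinal_metrics(conf):
    """Metrics from conf[t][d]: samples of true class t decided as class d."""
    q = len(conf)
    n_t = [sum(row) for row in conf]  # samples per true class
    n = sum(n_t)
    acc = sum(conf[t][t] for t in range(q)) / n
    aacc = sum(conf[t][t] / n_t[t] for t in range(q)) / q
    # Per-class mean absolute error between decided and true class index.
    mae_t = [sum(abs(d - t) * conf[t][d] for d in range(q)) / n_t[t]
             for t in range(q)]
    mae = sum(mae_t[t] * n_t[t] for t in range(q)) / n
    amae = sum(mae_t) / q
    return {"Acc": acc, "AAcc": aacc, "MAE": mae, "AMAE": amae}


def zero_one_cost(d, t):
    """Decision costs of Eq. (47)."""
    return 0.0 if d == t else 1.0


def absolute_cost(d, t):
    """Ordinal decision costs |d - t|."""
    return float(abs(d - t))


def average_cost(conf, cost):
    """Empirical average of cost(d, t) over all samples."""
    q = len(conf)
    n = sum(sum(row) for row in conf)
    return sum(conf[t][d] * cost(d, t) for t in range(q) for d in range(q)) / n


if __name__ == "__main__":
    # Hypothetical, imbalanced 3-class confusion matrix (rows: true class).
    conf = [[40, 8, 2],
            [5, 30, 5],
            [1, 3, 6]]
    print(ordinal_metrics(conf))
    print(average_cost(conf, zero_one_cost))   # equals 1 - Acc
    print(average_cost(conf, absolute_cost))   # equals MAE
```

With the costs in (47), the empirical average cost equals 1 − Acc, so minimizing it maximizes accuracy; with costs |d − t| the same average is the MAE. The balanced variants (AAcc, AMAE) are obtained by averaging per class instead of per sample.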
With respect to imbalance, Table 6 shows the expected results: There is a trade-off between the metrics that consider the imbalance and those that do not consider it to measure the performance. The methods designed to improve Acc or MAE have worse values for AAcc or AMAE (and vice versa). The results in Table 6 also show that using the ordinal information in the classification task can be useful. In accuracy, the MLP with softmax provides slightly better results than BONN(Acc), as expected (in general, multi-class approaches with softmax have shown in the literature better results than regression based approaches in terms of accuracy for classification problems). But BONN(MAE), which is designed taking the order into account in the classification costs $c_{d,t}$ to minimize MAE, obtains an equivalent accuracy while also obtaining the best (lowest) MAE. A similar behavior is observed when the imbalance is considered: For balanced accuracy, an equivalent performance is obtained with BONN(AMAE), designed taking into account the order, and BONN(AAcc), which does not include the order in its design but is designed to maximize balanced accuracy. At the same time, BONN(AMAE) obtains the best AMAE, as expected. Therefore, improving the ordinal classification (MAE/AMAE) can help to improve the accuracy (Acc/AAcc). We want to remark that Table 6 shows the average results in the 15 datasets. In some datasets, Acc/AAcc increases when MAE/AMAE is improved. But in other datasets a trade-off between MAE and Acc (or AMAE and AAcc) has been observed: Improving one of them tends to worsen the other one (when close to the best metrics, obviously). Finally, Cohen’s kappa coefficient shows that all methods have a good agreement between decisions and true classes, far away from agreement by chance, with MLP with softmax and BONN(AMAE) having the best values for κ2.

The previous experiments show that the proposed method allows ordinal problems and imbalance to be dealt with simultaneously, as BONN(AMAE) does with the parameters in (42). In this case all classes have the same weight in the loss function independently of their probabilities. But the Bayesian formulation also provides the flexibility to consider the relative importance among classes. This can be useful in applications where the detection of a class, which can be the minority class, is highly relevant to end users. As an example of this flexibility, Table 7 compares the confusion matrices obtained in the LEV dataset with the MLP (Softmax), BONN(MAE), BONN(AMAE) and BONN(Emph-5), which denotes a solution emphasizing class 5, the minority class in LEV (only 2.7%
of samples, compared to 40.3% for class 3, the majority class). This solution uses the following parameters:

\[ \pi_t = \frac{N_t}{N}, \qquad c_{d,t} = \begin{cases} \frac{1}{\pi_t}\,|d - t|, & t \neq 5 \\ \frac{2}{\pi_t}\,|d - t|, & t = 5. \end{cases} \tag{48} \]

To emphasize the importance of class 5, its contribution to the loss is now doubled. The diagonal of the matrices, which contains the per class accuracy, $\text{Acc}_t$ in (34), has been highlighted (boldface) to facilitate the analysis. Additionally, Acc, AAcc (it is equal to the average of the values in the diagonal of the confusion matrix), MAE and AMAE are provided for every solution. The MLP and BONN(MAE) show a similar behavior: Both approaches have a better accuracy for the majority classes, and the minority class (class 5) is almost ignored. The conventional MLP obtains a slightly better accuracy, and BONN(MAE) obtains a slightly better MAE by reducing, in general, the misclassification errors with higher class differences, as expected. In contrast, BONN(AMAE) notably improves AMAE by forcing a more uniform per class accuracy (note that the average accuracy, AAcc, is also notably improved). Finally, BONN(Emph-5) is able to improve the accuracy of class 5, even though it is the minority class, obviously at the price of a lower accuracy for the other classes (especially for the closest class) and a slightly worse AMAE than BONN(AMAE).
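The costs in (48) can be built directly from the class counts. The sketch below is a minimal illustration under stated assumptions: the function name `emphasis_costs` is ours, and the class counts in the example are hypothetical values chosen only so that classes 3 and 5 match the proportions quoted in the text (40.3% and 2.7%); they are not taken from the LEV dataset.

```python
def emphasis_costs(class_counts, emphasized=None, factor=2.0):
    """Cost matrix with entries c[d][t] = w_t * |d - t| / pi_t, pi_t = N_t / N.

    Classes are labeled 1..Q; the matrix is indexed as costs[d - 1][t - 1].
    w_t = factor for the emphasized class and 1 for all other classes.
    """
    n = sum(class_counts)
    q = len(class_counts)
    priors = [n_t / n for n_t in class_counts]
    costs = [[0.0] * q for _ in range(q)]
    for d in range(1, q + 1):
        for t in range(1, q + 1):
            w = factor if t == emphasized else 1.0
            costs[d - 1][t - 1] = w * abs(d - t) / priors[t - 1]
    return costs


if __name__ == "__main__":
    # Hypothetical counts for a 5-class problem with 1000 samples.
    counts = [100, 280, 403, 190, 27]
    base = emphasis_costs(counts)                 # no class emphasized
    emph = emphasis_costs(counts, emphasized=5)   # Eq. (48)
    # Cost of deciding class 4 when the true class is 5, without and with emphasis.
    print(base[3][4], emph[3][4])
```

Only the column of the emphasized class changes: every cost of misclassifying a class 5 sample is doubled, while the costs for the remaining classes are the same as without emphasis.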