Neural network for ordinal classification of imbalanced data by minimizing a Bayesian cost


Pattern Recognition

Journal homepage: www.elsevier.com/locate/patcog


Marcelino Lázaro∗, Aníbal R. Figueiras-Vidal

Signal Theory and Communications Dept. Universidad Carlos III de Madrid, Spain

Article info

Article history: Received 2 February 2022; Revised 1 December 2022; Accepted 4 January 2023; Available online 6 January 2023

Keywords: Bayes cost; Parzen windows; Ordinal classification; Imbalanced

Abstract

Ordinal classification of imbalanced data is a challenging problem that appears in many real-world applications. The challenge is to simultaneously consider the order of the classes and the class imbalance, which can notably improve the performance metrics. The Bayesian formulation makes it possible to deal with these two characteristics jointly: it takes into account the prior probability of each class and the decision costs, which can be used to include the imbalance and the ordinal information, respectively. We propose to use the Bayesian formulation to train neural networks, which have shown excellent results in many classification tasks. A loss function is proposed to train networks with a single neuron in the output layer and a threshold-based decision rule. The loss is an estimate of the Bayesian classification cost, based on the Parzen windows estimator, which is fitted for a thresholded decision. Experiments with several real datasets show that the proposed method provides competitive results in different scenarios, due to its high flexibility to specify the relative importance of the errors in the classification of patterns of different classes, considering the order and independently of the probability of each class.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

1. Introduction

Ordinal classification is a special form of multi-class classification where the classes exhibit an inherent ordering. It is a frequent problem because a natural ordering occurs in many human tasks, such as those associated with every kind of grading in human devised scales: satisfaction surveys, medical diagnosis, quality assessment or credit rating are just some examples (a detailed list of applications in different research areas can be found in the literature review [1]). Because of its relevance, although this problem has been studied for more than four decades [2], ordinal classification is still receiving attention in several different directions: ordinal classification methods (see the taxonomy in [1]), appropriate performance metrics [3,4] or the interpretability of neural classifiers for ordinal problems [5,6] are some examples.

∗ Corresponding author at: Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Av. Universidad 30, 28911 Leganés, Madrid, Spain. E-mail address: [email protected] (M. Lázaro).

There is a wide variety of ordinal classification methods: from initial approaches that stated the problem as a conventional regression problem [7], where the ordinal classes were mapped into sorted numeric values, to more recent procedures that update conventional techniques to include the ordinal constraints (such as Gaussian Processes (GP) [8], Support Vector Machines (SVM) [9], Neural Networks (NN) [10], or Learning Vector Quantizers (LVQ) [11]). In the survey [1] these methods are grouped into three categories:

• Naive approaches: The problem is posed as another standard problem, such as regression or nominal classification, without considering the ordering of the classes (with evident limitations), or as a cost-sensitive classification problem, with different costs for different misclassification errors. How to determine the cost matrix without a priori knowledge of the problem is the main limitation of these techniques.

• Ordinal binary decomposition approaches: The ordinal target variable is mapped into several binary variables, which are used to train a single multiple-output model or several models. In many cases, the binary decomposition violates the monotonicity or consistency of the ordinal target, although consistent solutions have also been proposed [12].

• Threshold methods: They are based on the definition of thresholds over a real valued measure (or latent variable) that will define the intervals associated to each class. Forcing a monotonic behavior in this latent variable according to the ordinal structure of the problem can be helpful to improve the generalization ability of the trained model. A potential problem is that the distances among the ordered classes are not known a priori. Because of this, most methods estimate these distances during the learning process.

https://doi.org/10.1016/j.patcog.2023.109303

Many real ordinal problems present a strong imbalance among classes. Conventional classifiers tend to be biased towards majority classes and to produce more errors for the minority classes, although minority events are often highly relevant in many applications (medical diagnosis, fraud detection, etc.). Imbalanced problems are frequent and relevant, and consequently several methods have been proposed to solve them. The proposed solutions can be grouped into three categories [13]: Data preprocessing methods, such as resampling or data augmentation techniques (SMOTE [14] is probably the best known technique). The effectiveness of sampling techniques depends on the problem and on the nature of the classifier, they can modify the class distributions, and in general there is no guarantee of an improved performance. Cost-sensitive methods modify the decision thresholds or assign weights to instances according to a cost matrix, or are trained with an objective function that is cost sensitive, such as active learning versions [15]. The Bayes decision theory has also been used in this context, usually with a learning method used to estimate the class posterior probabilities (although most of these estimates are not principled, which spoils the robustness of these approaches). Ensemble methods, which combine multiple classifiers, are also used to deal with imbalanced problems [16,17], including binarization techniques [18]. Most of these procedures are empirical and, consequently, they carry a risk of degraded performance.

Although imbalance appears in many real ordinal problems, few works have proposed solutions considering the ordinal and the imbalance characteristics simultaneously. In most cases, methods are focused on ordinal classification and imbalance is addressed using sampling techniques [19,20] or ensembles [21]. Also very few works, such as [21], have evaluated the performance with appropriate figures of merit for this scenario.

Our objective is to propose a principled solution for ordinal and imbalanced problems, and to evaluate it using appropriate metrics. The proposed solution is based on the combination of neural networks with the Bayes decision theory. Neural networks have shown excellent performance in many applications, including ordinal [10] and imbalanced problems [22] (further details of the use of NNs for ordinal and imbalanced problems can be found in Section 2). The Bayes classification cost considers the class probabilities and the decision costs, which make it possible to introduce the imbalance and the ordinal information, respectively, in a solid statistical framework. We propose to combine these two powerful tools to create an ordinal classification method capable of dealing with imbalanced data directly, without using data preprocessing techniques or ensembles. The proposed method can be applied to any neural network architecture with a one-dimensional output and a threshold-based decision (the thresholds will be learned, to deal with the lack of knowledge about the relative distance among classes). Our main contribution is a new loss function that is an estimate of the Bayes classification cost, using the Parzen windows estimator to model the conditional probability of the network output for each class. A similar approach was proposed in [23] for imbalanced binary classification, and later extended in [24] for binary example-dependent-cost (EDC) classification. Here, the formulation is fitted to a multiclass ordinal problem with a thresholded decision rule.

The manuscript is organized as follows. Section 2 reviews some related works. Section 3 formally states an imbalanced ordinal problem. The proposed solution is presented in Section 4. Experiments with real datasets are presented in Section 5 to evaluate the proposed method, and Section 6 closes the manuscript summarizing its characteristics.

2. Related works

There are many works dedicated to ordinal classification problems with imbalanced data, but few works consider the ordinal information and the imbalance jointly. In this section, related works are reviewed to highlight the differences with the proposed solution. Because of this, the review is focused on neural network solutions, cost-sensitive approaches and loss functions used in ordinal classification.

2.1. Preprocessing methods or ensembles to handle imbalance

Many solutions are focused on ordinal classification, and the imbalance is handled using preprocessing techniques, such as sampling or data augmentation, or ensembles. Some examples, [19–21], were mentioned above, and additional examples will be addressed in the following sections.

In our work the imbalance is considered in the loss function to be minimized in the training of a neural network, which also considers the order information.

2.2. Use of ordinal evaluation metrics

Some methods propose to use ordinal evaluation metrics as a validation tool, but the ordinal information is not included in the design of the classifier, such as [25], where ensembles (stacked generalization) were validated using several ordinal metrics. The imbalance was treated by using sampling techniques.

In this work the ordinal information, as well as the imbalance, is considered in the design of the solution. Appropriate evaluation metrics for an imbalanced ordinal classification problem will be used to evaluate the proposed method.

2.3. Use of the Bayes theory

The Bayes classification theory has been used in the context of machine learning. Some cost-sensitive approaches use the output of a learning machine to provide class probability estimates, and a risk function based on these estimates along with a cost matrix is used to make decisions, as in [26]. However, not all learning machines are able to provide consistent estimates of posterior probabilities [27], and in practice some heuristic calibration procedures are used to improve the performance, as in [28], which harms the robustness of these approaches.

There are also previous approaches using the Parzen windows estimator with the Bayes theory, as in the probabilistic neural networks proposed by Specht [29], or in [30]. The proposed approach is notably different: Parzen windows are applied here at the one-dimensional output space of a neural network instead of being applied in the multi-dimensional input space, as in [29] or [30]. Moreover, in these works the Parzen estimator is used in the decision rule, which is based on its estimates. In the proposed approach, the decision is based on thresholds defining decision regions over the one-dimensional output of a neural network. The Parzen window estimator is used in the definition of the loss function used to train the neural network classifier.

2.4. Ordinal loss functions

Common loss functions used to train neural networks, such as cross-entropy, do not consider the relative ordering among classes or the imbalance (although weighted versions of these losses have been proposed to deal with imbalance). For this reason, several types of loss functions have been proposed for ordinal classification.


Some works, such as [31], use losses modeling the output probabilities as following a unimodal distribution. In [32] these losses, along with additional quasi-unimodal losses, are evaluated using a convolutional neural network (CNN) in 3 ordinal datasets of RGB images using several metrics. In that work, the imbalance is only considered in the evaluation phase, by using appropriate evaluation metrics.

In [33], an ordinal loss function was proposed to train CNNs. This loss function adds the cross-entropy per output in a multi-output problem defined by decomposing the M-class ordinal problem into M − 1 binary problems. The authors implemented several configurations, including ensembles to evaluate the performance in imbalanced datasets. Similarly, [12] combines a binary decomposition with a weighted cross-entropy loss for M − 1 binary classifiers with theoretical guarantees for classifier consistency to train CNNs. The weights in the loss for the M − 1 tasks can be used to take into account the imbalance, with the limitations given by the binary decomposition. However, all experiments were performed with uniform weights and the evaluation metrics, mean absolute error (MAE) and root mean squared error (RMSE), did not consider imbalance.

The quadratic weighted kappa (QWK) metric has been proposed as a loss function to train CNNs [34]. The kappa index, originally designed as a measure of agreement between observations that discounts agreements due to mere chance, has been used to evaluate the performance in ordinal problems because it takes into account the difference between the decision and the correct class. In [34] the imbalance is addressed by using data augmentation techniques. In [35] the QWK loss was combined with cumulative link models (CLM) to significantly improve the ordinal classification metrics. Again, augmentation techniques were used to deal with imbalance.

Our proposed loss function is an estimate of the Bayes classification cost, which makes it possible to include in the training of a neural network the ordinal information along with the imbalance, providing also flexibility to consider the relative importance of every class in the problem at hand. It can be used with any neural network architecture, shallow or deep, having a single neuron in the output layer and a threshold-based decision rule.

3. Ordinal classification of imbalanced data

In this section the problem of ordinal classification of imbalanced data is stated. Standard notation will be used: $p_X(x)$ denotes the probability mass function of a discrete random variable $X$, i.e., $p_X(x) = \Pr(X = x)$, $f_X(x)$ is the probability density function (PDF) of a continuous random variable $X$, $|\cdot|$ is the absolute value operator when applied to scalars or the cardinality of a set, and $\hat{a}$ denotes the estimate or the decision that is made for $a$.

A multiclass classification problem can be stated as follows: the objective is to assign a given pattern of dimension $A$, $\mathbf{x} \in \mathbb{R}^A$, to one class out of a set of $M$ possible classes or hypotheses

$$\mathcal{H} = \{H_1, H_2, \ldots, H_M\}. \tag{1}$$

In an ordinal problem the classes exhibit a natural order, which is denoted as

$$H_1 \prec H_2 \prec \cdots \prec H_M, \tag{2}$$

where $\prec$ expresses a relation of order. To simplify the notation, in the following $H_i = i$ will be assumed. Note that in this case the cumulative distribution function (CDF) of the classes is well defined and naturally contains the order.

The available information about the problem is a set of labeled samples (typically named the training set)

$$\mathcal{T} = \{(\mathbf{x}_k, y_k) \mid k \in \mathbb{Z},\ 1 \le k \le N\}, \tag{3}$$

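with $y_k \in \mathcal{H}$ being the label that indicates the class the pattern $\mathbf{x}_k$ belongs to. The set $S_t$ is the set of indexes of patterns that belong to class $t$

$$S_t = \{k \mid y_k = t,\ (\mathbf{x}_k, y_k) \in \mathcal{T}\}, \quad t \in \mathcal{H}, \tag{4}$$

and $N_t$ is the number of elements in the set $S_t$, i.e., $N_t = |S_t|$. Obviously,

$$\sum_{t=1}^{M} N_t = N. \tag{5}$$

If the problem is imbalanced, the numbers of samples of the $M$ classes, $N_t$, $t \in \mathcal{H}$, are significantly different.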
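The index sets and class counts in (4) and (5) can be illustrated with a minimal sketch. The labels below are hypothetical toy data, the names `class_index_sets` and `imbalance_ratio` are introduced only for this illustration, and the indexes are 0-based as is usual in Python (the text uses $k = 1, \ldots, N$).

```python
import numpy as np

def class_index_sets(y, M):
    """Return {t: S_t} with S_t the (0-based) indexes of the samples of class t."""
    y = np.asarray(y)
    return {t: np.flatnonzero(y == t) for t in range(1, M + 1)}

# Hypothetical labels of a 3-class ordinal problem (classes coded as 1..M)
y = np.array([1, 2, 2, 2, 2, 3, 2, 2, 1, 2])
M = 3

S = class_index_sets(y, M)
N_t = np.array([len(S[t]) for t in range(1, M + 1)])   # N_t = |S_t|

# One simple (assumed) way to quantify the imbalance: largest over smallest class
imbalance_ratio = N_t.max() / N_t.min()
print(N_t, imbalance_ratio)
```

The ratio between the largest and the smallest class is only one possible descriptor of imbalance; the text does not prescribe any particular one.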

4. Proposed method

In this section we will present the proposed solution to ordinal classification of imbalanced data.

4.1. Neural network architecture and decision rule

In conventional classification problems, when a neural network is trained with a labeled data set, multiclass problems are typically solved by using networks with a neuron per class in the output layer with softmax activation (the decision for a pattern is the class associated to the neuron with the highest output for this pattern). To solve the ordinal classification problem we propose to use a neural network with a single neuron in the output layer (one-dimensional output space). Having a single output with a monotonic decision rule (the higher the output, the higher the class label in the ordinal arrangement) makes it possible to match the network output with the ordinal structure of the corresponding problem. For a given input pattern $\mathbf{x}_k$ the output of the network is

$$z_k = g(\mathbf{x}_k, \mathbf{w}) \in \mathbb{R}, \tag{6}$$

where the function $g(\cdot, \mathbf{w})$ depends on the neural network architecture through the set of parameters of the network, that are included in the vector $\mathbf{w}$. From this one-dimensional network output, the classifier has to decide a class for pattern $\mathbf{x}_k$

$$\hat{y}_k = \text{decision}(z_k). \tag{7}$$

In the proposed scheme the decision is based on thresholds defining decision regions. In particular, $M - 1$ ordered thresholds, $\{u_1, u_2, \ldots, u_{M-1}\}$, with $u_{d-1} < u_d$, define $M$ decision regions, as shown in Fig. 1

$$I_d = (u_{d-1}, u_d], \quad d \in \mathcal{H}. \tag{8}$$

With these decision regions, the decision rule of the proposed method is

$$\hat{y}_k = d \ \text{ if } \ z_k \in I_d. \tag{9}$$

The $M - 1$ thresholds are included in the vector $\mathbf{u}$. To have an appropriate definition for regions $I_1$ and $I_M$ in (8), we can define $u_0 = -\infty$ and $u_M = +\infty$.
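The decision rule (9) amounts to locating each output $z_k$ among the sorted thresholds. A minimal sketch follows, assuming classes are coded as $1, \ldots, M$; the function name `decide` and the numeric values are hypothetical. It relies on `numpy.searchsorted` with `side="left"`, which reproduces the half-open intervals $(u_{d-1}, u_d]$ of (8).

```python
import numpy as np

def decide(z, u):
    """Decision rule (9): return d such that z lies in I_d = (u_{d-1}, u_d].

    z: array of network outputs; u: the M-1 strictly increasing thresholds.
    """
    u = np.asarray(u, dtype=float)
    if np.any(np.diff(u) <= 0):
        raise ValueError("thresholds must be strictly increasing")
    # side="left" counts the thresholds strictly below z, so z == u_d maps to class d
    return np.searchsorted(u, z, side="left") + 1

# Hypothetical thresholds for M = 4 classes and a few network outputs
u = np.array([-1.0, 0.5, 2.0])
z = np.array([-3.0, -1.0, 0.0, 0.5, 1.9, 2.0, 7.0])
y_hat = decide(z, u)
print(y_hat)
```

Outputs that coincide with a threshold are assigned to the lower class, in agreement with the closed right end of each interval.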

Fig. 1. Thresholds and decision regions for the decision rule (9).

4.2. Loss function: An estimate of the Bayes cost

To simultaneously deal with imbalance and ordinal classification a new loss function, inspired by the Bayes decision theory, is proposed: An estimate of the Bayes cost is fitted to the NN architecture and decision rule presented in the previous section. The Bayesian classification theory is based on the average Bayes cost

$$BC = \sum_{t \in \mathcal{H}} \pi_t \sum_{d \in \mathcal{H}} c_{d,t}\, p_{\hat{Y}|Y}(d|t), \tag{10}$$

which has to be minimized by the classifier. The variable $Y$ represents the true class, and $\hat{Y}$ represents the decision. This formulation is convenient for imbalanced ordinal classification problems: it makes it possible to include the prior probabilities of each class, $\pi_t \equiv p_Y(t)$, and the costs of deciding class $d$ when the true class is $t$, $c_{d,t}$ (the conditional probability distribution $p_{\hat{Y}|Y}(d|t)$ characterizes the probability of this situation). In this way, it can deal with the imbalance and the relative importance of each class and, at the same time, it makes it possible to specify the impact of the ordinal structure in the classification errors. For instance, making costs proportional to the difference between the decision and the true class, $c_{d,t} \propto |d - t|$, can help to force this ordinal structure in the classifier. For these reasons, in this work we propose a new loss function to train neural networks that is based on the Bayes cost in (10).

In the proposed neural network classifier, with decision rule (9), the conditional probabilities of deciding class $d$ when the true class is $t$ are obtained by integrating the conditional distribution of the output of the network for samples of the true class, $f_{Z|Y}(z|t)$, in the corresponding decision region

$$p_{\hat{Y}|Y}(d|t) = \int_{I_d} f_{Z|Y}(z|t)\, dz. \tag{11}$$

However, in practice the distributions $f_{Z|Y}(z|t)$ are unknown. We propose to use instead the estimate provided by the Parzen windows method [36]

$$\hat{f}_{Z|Y}(z|t) = \frac{1}{N_t} \sum_{k \in S_t} k(z - z_k), \tag{12}$$

where the function $k(z)$ is the Parzen window, which has to be a valid PDF (nonnegative and with unit area). The most common window is a Gaussian PDF with variance $\sigma^2$

$$k(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{z^2}{2\sigma^2}}, \tag{13}$$

but any other valid PDF can be used. The Parzen windows method has been chosen because it is a non-parametric estimator that yields a closed form expression for the gradient of the estimated Bayesian cost, as will be shown later.

By defining the integral of the Parzen window

$$K(z) = \int_{-\infty}^{z} k(x)\, dx, \tag{14}$$

taking into account that

$$\int_{-\infty}^{u_i} k(z - z_k)\, dz = K(u_i - z_k) \tag{15}$$

and introducing the Parzen windows estimator (12) in (11), the estimate of $p_{\hat{Y}|Y}(d|t)$ becomes

$$\hat{p}_{\hat{Y}|Y}(d|t) = \frac{1}{N_t} \int_{I_d} \sum_{k \in S_t} k(z - z_k)\, dz = \frac{1}{N_t} \sum_{k \in S_t} \left[ K(u_d - z_k) - K(u_{d-1} - z_k) \right]. \tag{16}$$

This result follows from¹

$$\int_{I_d} k(z - z_k)\, dz = \int_{u_{d-1}}^{u_d} k(z - z_k)\, dz = K(u_d - z_k) - K(u_{d-1} - z_k). \tag{17}$$

¹ Note that $\int_{I_1} k(z - z_k)\, dz = K(u_1 - z_k)$ and $\int_{I_M} k(z - z_k)\, dz = 1 - K(u_{M-1} - z_k)$, given that $K(-\infty) = 0$ and $K(\infty) = 1$ by definition.

Finally, by introducing (16) in (10) the estimate of the Bayes cost is

$$\widehat{BC}(\mathbf{w}, \mathbf{u}) = \sum_{t=1}^{M} \frac{\pi_t}{N_t} \sum_{k \in S_t} \left[ c_{M,t} + \sum_{d=1}^{M-1} \left( c_{d,t} - c_{d+1,t} \right) K(u_d - z_k) \right]. \tag{18}$$
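The step from (16) to (18) is a telescoping sum. As a sketch, using only $K(u_0 - z_k) = K(-\infty) = 0$ and $K(u_M - z_k) = K(\infty) = 1$, the contribution of a single sample $k \in S_t$ can be rearranged as follows:

```latex
\begin{aligned}
\sum_{d=1}^{M} c_{d,t}\left[K(u_d - z_k) - K(u_{d-1} - z_k)\right]
 &= \sum_{d=1}^{M} c_{d,t}\,K(u_d - z_k) \;-\; \sum_{d=0}^{M-1} c_{d+1,t}\,K(u_d - z_k) \\
 &= c_{M,t} \;+\; \sum_{d=1}^{M-1}\left(c_{d,t} - c_{d+1,t}\right)K(u_d - z_k).
\end{aligned}
```

In the second sum of the first line the index has been shifted by one; its $d = 0$ term vanishes and the $d = M$ term of the first sum equals $c_{M,t}$. Weighting by $\pi_t / N_t$ and summing over $k \in S_t$ and over $t$ gives (18).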

The dependence of the cost on the thresholds and on the neural network parameters ($z_k$ depends on these parameters, as shown in (6)) has now been made explicit. The Parzen windows estimator makes (18) differentiable ($K(z)$ is the integral of the window $k(z)$). This estimate of the Bayes cost (10) is proposed as the loss function to be minimized during the training of the neural network classifier.
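A minimal NumPy sketch of the estimate (18) with a Gaussian window may help to fix ideas. It is an illustration under stated assumptions, not the authors' implementation: the names `gaussian_cdf` and `bayes_cost_estimate`, the array layout `C[d-1, t-1]` for $c_{d,t}$, and the toy outputs are all hypothetical; the window variance 0.1507 is the value reported for the experiments.

```python
import math
import numpy as np

def gaussian_cdf(v, sigma2):
    """K(v) in (14) for the Gaussian window (13) with variance sigma2."""
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * (1.0 + erf(np.asarray(v, dtype=float) / math.sqrt(2.0 * sigma2)))

def bayes_cost_estimate(z, y, u, C, priors, sigma2=0.1507):
    """Estimate (18). C[d-1, t-1] = c_{d,t}, priors[t-1] = pi_t, labels y in 1..M."""
    z, y, u = np.asarray(z, float), np.asarray(y), np.asarray(u, float)
    M = C.shape[0]
    total = 0.0
    for t in range(1, M + 1):
        z_t = z[y == t]
        if z_t.size == 0:          # class absent from this batch: nothing to add
            continue
        K = gaussian_cdf(u[None, :] - z_t[:, None], sigma2)   # shape (N_t, M-1)
        coeff = C[:-1, t - 1] - C[1:, t - 1]                   # c_{d,t} - c_{d+1,t}
        per_sample = C[M - 1, t - 1] + K @ coeff               # bracket in (18)
        total += priors[t - 1] * per_sample.mean()             # (pi_t / N_t) * sum over S_t
    return total

# Hypothetical 3-class example with absolute-difference costs c_{d,t} = |d - t|
idx = np.arange(1, 4)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)
priors = np.full(3, 1.0 / 3.0)
u = np.array([-2.5, 2.5])
z = np.array([-5.0, 0.0, 5.0])

cost_good = bayes_cost_estimate(z, np.array([1, 2, 3]), u, C, priors)  # outputs ordered like the labels
cost_bad = bayes_cost_estimate(z, np.array([3, 2, 1]), u, C, priors)   # extreme classes swapped
print(cost_good, cost_bad)
```

When the outputs lie far from the thresholds relative to the window width, the estimate approaches the empirical weighted cost: close to zero for the well-ordered case and close to $4/3$ when the two extreme classes are swapped.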

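4.3. Training algorithm

A gradient descent training algorithm is used to minimize (18). The parameters $\mathbf{w}$ and the thresholds $\mathbf{u}$ are iteratively updated as²

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \mu \left. \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial \mathbf{w}} \right|_{\mathbf{w} = \mathbf{w}(n),\, \mathbf{u} = \mathbf{u}(n)}, \tag{19}$$

$$\mathbf{u}(n+1) = \mathbf{u}(n) - \mu \left. \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial \mathbf{u}} \right|_{\mathbf{w} = \mathbf{w}(n),\, \mathbf{u} = \mathbf{u}(n)}, \tag{20}$$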

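where $\mu$ is the step-size parameter. The gradient expression for each iteration can be computed sample-by-sample, batch (using the whole training set) or mini-batch (using a sub-set of the training set in each step). The sample-by-sample expression of the gradient for the network parameters $\mathbf{w}$ is

$$\frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial \mathbf{w}} = \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial z_k} \frac{\partial z_k}{\partial \mathbf{w}}. \tag{21}$$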

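The second term on the right-hand side of (21) is independent of the cost function: it only depends on the network architecture. The first term, for a pattern of class $t$, is

$$\left. \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial z_k} \right|_{k \in S_t} = \frac{\pi_t}{N_t} \sum_{d=1}^{M-1} \left( c_{d+1,t} - c_{d,t} \right) k(u_d - z_k). \tag{22}$$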

The expression for the gradient with respect to the thresholds is simpler, because it does not depend on the neural network architecture, as the thresholds do not depend explicitly on the network output. To simultaneously update weights and thresholds, we will use the sample-by-sample contribution of a pattern $\mathbf{x}_k$ of class $t$ with output $z_k$ to the gradient

$$\left. \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial u_d} \right|_{k \in S_t} = \frac{\pi_t}{N_t} \left( c_{d,t} - c_{d+1,t} \right) k(u_d - z_k). \tag{23}$$
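The gradient expressions (22) and (23) can be checked numerically against the per-sample term of (18). The sketch below does this with central finite differences for a Gaussian window; the function names, the scalar `weight` standing for $\pi_t / N_t$, and the numeric setting are assumptions made only for this illustration.

```python
import math
import numpy as np

SIGMA2 = 0.1507  # Gaussian window variance (value reported for the experiments)

def k_win(v):
    """Gaussian Parzen window k(v), eq. (13)."""
    v = np.asarray(v, dtype=float)
    return np.exp(-v ** 2 / (2 * SIGMA2)) / math.sqrt(2 * math.pi * SIGMA2)

def K_win(v):
    """Integral of the window, K(v), eq. (14)."""
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * (1.0 + erf(np.asarray(v, dtype=float) / math.sqrt(2 * SIGMA2)))

def sample_cost(z_k, t, u, C, weight):
    """Contribution of one sample of class t to (18); weight stands for pi_t / N_t."""
    coeff = C[:-1, t - 1] - C[1:, t - 1]          # c_{d,t} - c_{d+1,t}, d = 1..M-1
    return weight * (C[-1, t - 1] + coeff @ K_win(u - z_k))

def grad_z(z_k, t, u, C, weight):
    """Eq. (22): derivative with respect to the network output z_k."""
    coeff = C[1:, t - 1] - C[:-1, t - 1]          # c_{d+1,t} - c_{d,t}
    return weight * (coeff @ k_win(u - z_k))

def grad_u(z_k, t, u, C, weight):
    """Eq. (23): derivative with respect to each threshold u_d."""
    coeff = C[:-1, t - 1] - C[1:, t - 1]          # c_{d,t} - c_{d+1,t}
    return weight * coeff * k_win(u - z_k)

# Hypothetical setting: M = 4 classes, costs c_{d,t} = |d - t|
M = 4
idx = np.arange(1, M + 1)
C = np.abs(idx[:, None] - idx[None, :]).astype(float)   # C[d-1, t-1] = c_{d,t}
u = np.array([-1.0, 0.0, 1.0])
z_k, t, weight, eps = 0.3, 2, 0.25, 1e-5

# Central finite differences of the per-sample cost
num_gz = (sample_cost(z_k + eps, t, u, C, weight)
          - sample_cost(z_k - eps, t, u, C, weight)) / (2 * eps)
num_gu = np.array([
    (sample_cost(z_k, t, u + eps * e, C, weight)
     - sample_cost(z_k, t, u - eps * e, C, weight)) / (2 * eps)
    for e in np.eye(M - 1)
])
print(num_gz, grad_z(z_k, t, u, C, weight))
print(num_gu, grad_u(z_k, t, u, C, weight))
```

The opposite signs of the coefficients in (22) and (23) come from the argument $u_d - z_k$ of $K(\cdot)$: moving a threshold up has the same effect on that term as moving the output down.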

In many cases, the probabilities of the different classes are estimated from the training set, $\pi_t = \frac{N_t}{N}$. In these cases, by defining the normalized costs $\bar{c}_{d,t} = \frac{c_{d,t}}{N}$ (a normalization of these costs does not affect the solution of the problem) more compact expressions can be obtained for (18), (22) and (23)

$$\widehat{BC}(\mathbf{w}, \mathbf{u}) = \sum_{t=1}^{M} \left[ \bar{c}_{M,t} N_t + \sum_{k \in S_t} \sum_{d=1}^{M-1} \left( \bar{c}_{d,t} - \bar{c}_{d+1,t} \right) K(u_d - z_k) \right], \tag{24}$$

$$\left. \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial z_k} \right|_{k \in S_t} = \sum_{d=1}^{M-1} \left( \bar{c}_{d+1,t} - \bar{c}_{d,t} \right) k(u_d - z_k), \tag{25}$$

$$\left. \frac{\partial \widehat{BC}(\mathbf{w}, \mathbf{u})}{\partial u_d} \right|_{k \in S_t} = \left( \bar{c}_{d,t} - \bar{c}_{d+1,t} \right) k(u_d - z_k). \tag{26}$$

² For the sake of simplicity, the simplest gradient descent approach is shown. Obviously, some other stochastic optimization approaches can also be employed, such as including momentum, or Adam optimization, just to cite some examples.
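The updates (19) and (20) with the compact gradients (25) and (26) can be sketched on a deliberately tiny model. Everything below is an assumption made for illustration: the "network" is a one-dimensional affine map $z = w x + b$ (so $\partial z_k / \partial w = x_k$ and $\partial z_k / \partial b = 1$), the data are hypothetical, batch gradient descent with a fixed step is used, and no mechanism is included to keep the thresholds ordered, since this section does not specify one.

```python
import math
import numpy as np

SIGMA2 = 0.1507  # Gaussian window variance (value reported for the experiments)

def k_win(v):
    """Gaussian Parzen window k(v), eq. (13)."""
    return np.exp(-v ** 2 / (2 * SIGMA2)) / math.sqrt(2 * math.pi * SIGMA2)

def K_win(v):
    """Integral of the window, K(v), eq. (14)."""
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * (1.0 + erf(v / math.sqrt(2 * SIGMA2)))

# Hypothetical 1-D toy data with three ordered classes
x = np.array([-1.2, -1.0, -0.8, -0.2, 0.0, 0.2, 0.8, 1.0, 1.2])
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
N, M = len(x), 3

classes = np.arange(1, M + 1)
C_bar = np.abs(classes[:, None] - classes[None, :]) / N   # normalized costs |d - t| / N
# A[k, d-1] = cbar_{d,t_k} - cbar_{d+1,t_k}, where t_k is the class of sample k
A = (C_bar[:-1, :] - C_bar[1:, :])[:, y - 1].T

def loss(w, b, u):
    """Compact estimate (24) for the affine model z = w * x + b."""
    z = w * x + b
    return C_bar[-1, y - 1].sum() + np.sum(A * K_win(u[None, :] - z[:, None]))

w, b, u, mu = 0.5, 0.0, np.array([-0.3, 0.3]), 0.1   # assumed initial values and step size
initial_loss = loss(w, b, u)
for _ in range(200):
    z = w * x + b
    kk = k_win(u[None, :] - z[:, None])     # k(u_d - z_k), shape (N, M-1)
    dz = -(A * kk).sum(axis=1)              # eq. (25), one value per sample
    du = (A * kk).sum(axis=0)               # eq. (26), summed over the batch
    w -= mu * np.sum(dz * x)                # chain rule (21) with dz_k/dw = x_k
    b -= mu * np.sum(dz)                    # chain rule (21) with dz_k/db = 1
    u = u - mu * du                         # update (20)
final_loss = loss(w, b, u)
print(initial_loss, final_loss)
```

For a real network the factor $\partial z_k / \partial \mathbf{w}$ would come from backpropagation, and any of the optimizers mentioned in the footnote could replace the plain update.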

With respect to the choice of the Parzen window, $k(z)$, it must be remarked that the goal is not to obtain good estimates of the conditional distributions $f_{Z|Y}(z|t)$ by (12), but to obtain a good classification performance³ (a discussion about this subject can be found in [23]).

4.4. Computational burden

To analyze the computational burden of the training algorithm, a multiclass neural network classifier with $M$ neurons in the output layer (one per class), whose burden is well-known if the network architecture is given, will be used as a reference. The updating expressions of the proposed method are compared with those associated to a couple of well-known loss functions, such as the Mean Squared Error (MSE) and the Binary Cross-Entropy (BCE) losses, which are, respectively

$$L_M(\mathbf{w}) = (y_k - z_k)^2, \tag{27}$$

$$L_B(\mathbf{w}) = -y_k \log(z_k) - (1 - y_k) \log(1 - z_k). \tag{28}$$

For every loss function $L(\mathbf{w})$ the contribution of a pattern $\mathbf{x}_k$ with output $z_k$ to the gradient can be written as

$$\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial L(\mathbf{w})}{\partial z_k} \frac{\partial z_k}{\partial \mathbf{w}}, \tag{29}$$

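where the second term on the right-hand side does not depend on the loss but only on the network architecture, as in (21). For MSE and BCE

$$\frac{\partial L_M(\mathbf{w})}{\partial z_k} = -2 (y_k - z_k), \tag{30}$$

$$\frac{\partial L_B(\mathbf{w})}{\partial z_k} = -\frac{y_k}{z_k} + \frac{1 - y_k}{1 - z_k}. \tag{31}$$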

In a multiclass neural network, $M$ terms (one per class) such as (30) or (31) must be computed for a pattern. With the proposed cost function, the computation of (22) or (25) for a pattern requires the addition of $M - 1$ terms and the evaluation of the window function. Taking into account that $\frac{\partial z_k}{\partial \mathbf{w}}$ is independent of the cost function, and that this term is mainly responsible for the computational burden (especially in deep networks because it is back-propagated), it can be concluded that the computational complexity using the proposed cost function is of the same order of magnitude as that required for a multiclass neural network classifier using MSE or BCE loss functions.⁴ The update of thresholds $\mathbf{u}$ has not been considered in the discussion: The gradient expressions (23) or (26) do not depend on the network architecture, and the number of thresholds in a practical network is much lower than the number of network parameters in $\mathbf{w}$.

5. Experiments

In this section, after presenting appropriate figures of merit to measure the performance of imbalanced ordinal classification methods, experiments with a set of real datasets to evaluate the performance of the proposed methods will be presented.

³ In fact, these distributions are not estimated during the training procedure, but the formulation by using the Parzen window method yields closed form expressions for the gradient of the cost function (18), as in (22) and (23).

⁴ In a practical implementation, the small difference is mainly dependent on the cost for evaluating the Parzen windows $k(\cdot)$, which basically depends on the implementation platform.

5.1. Figures of merit

The choice of an appropriate figure of merit to compare the performance of different methods is very important in any application. In conventional classification problems, the usual choice is the accuracy (or its complement, the probability of error), which for a labeled data set is defined as

$$\mathrm{Acc} = \frac{1}{N} \sum_{k=1}^{N} I(\hat{y}_k = y_k), \tag{32}$$

where $I(\cdot)$ denotes the indicator function that returns 1 if the argument is true and 0 if it is false. Although accuracy is a reasonable performance measure for a balanced classification problem, in imbalanced problems it over-represents the accuracy in the majority classes and under-represents the accuracy in the minority classes. For this reason, it is frequently replaced by the average or balanced accuracy

$$\mathrm{AAcc} = \frac{1}{M} \sum_{t=1}^{M} \mathrm{Acc}_t, \tag{33}$$

where $\mathrm{Acc}_t$ is the accuracy in the detection of class $t$

$$\mathrm{Acc}_t = \frac{1}{N_t} \sum_{k \in S_t} I(\hat{y}_k = y_k). \tag{34}$$

For ordinal problems, the figure of merit must include the distance between the true class and the decision. The Mean Absolute Error (MAE), which includes this distance, is a typical figure of merit for this kind of problems

$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{y}_k - y_k \right|. \tag{35}$$

In imbalanced ordinal problems, to avoid the over-representation of the performance for the majority classes, this measure is usually replaced by the average (or balanced) MAE

$$\mathrm{AMAE} = \frac{1}{M} \sum_{t=1}^{M} \mathrm{MAE}_t, \tag{36}$$

where $\mathrm{MAE}_t$ is the MAE for patterns of class $t$

$$\mathrm{MAE}_t = \frac{1}{N_t} \sum_{k \in S_t} \left| \hat{y}_k - y_k \right|. \tag{37}$$
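The figures of merit (32) to (37) can be sketched in a few lines. The example below is an illustration with hypothetical labels, assuming classes coded as $1, \ldots, M$ and that every class appears at least once in the evaluated set; the function names are not taken from the source.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (32)."""
    return float(np.mean(y_pred == y_true))

def balanced_accuracy(y_true, y_pred, M):
    """Eq. (33): mean of the per-class accuracies (34)."""
    return float(np.mean([np.mean(y_pred[y_true == t] == t) for t in range(1, M + 1)]))

def mae(y_true, y_pred):
    """Eq. (35)."""
    return float(np.mean(np.abs(y_pred - y_true)))

def amae(y_true, y_pred, M):
    """Eq. (36): mean of the per-class MAE values (37)."""
    return float(np.mean([np.mean(np.abs(y_pred[y_true == t] - t)) for t in range(1, M + 1)]))

# Hypothetical imbalanced problem: a classifier that always decides the majority class
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 2, 3])
y_pred = np.ones(10, dtype=int)

print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred, 3))
print(mae(y_true, y_pred), amae(y_true, y_pred, 3))
```

In this toy case the trivial majority classifier looks acceptable under accuracy and MAE (0.8 and 0.3) but poor under the balanced versions (1/3 and 1.0), which is the effect the balanced measures are meant to expose.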

Cohen's kappa coefficient is a measure of agreement between observations that discounts agreements due to mere chance. It has been applied in the evaluation of classification algorithms and it has also been used as a loss function in ordinal problems [34]. The quadratic weighted Cohen's kappa coefficient is

$$\kappa^{2} = 1 - \frac{\sum_{t=1}^{M} \sum_{d=1}^{M} (d - t)^2\, o_{d,t}}{\sum_{t=1}^{M} \sum_{d=1}^{M} (d - t)^2\, e_{d,t}}, \tag{38}$$

where $o_{d,t}$ is the observed disagreement for decision $d$ and true class $t$ and $e_{d,t}$ is the corresponding expected disagreement due to chance. This coefficient ranges from −1 (total disagreement) through 0 (random classification) to +1 (total agreement).
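A minimal sketch of (38) follows. It assumes the usual construction in which $o_{d,t}$ is the normalized confusion matrix and $e_{d,t}$ is the outer product of its marginals; this section does not spell out that construction, so it is stated here as an assumption, and the function name is hypothetical.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, M):
    """Eq. (38) with o_{d,t} taken as the normalized confusion matrix (rows: decision d,
    columns: true class t) and e_{d,t} as the product of its marginals (assumption)."""
    O = np.zeros((M, M))
    for t, d in zip(y_true, y_pred):
        O[d - 1, t - 1] += 1.0
    O /= O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0))    # expected matrix under independence
    idx = np.arange(1, M + 1)
    W = (idx[:, None] - idx[None, :]) ** 2        # quadratic weights (d - t)^2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Hypothetical examples: perfect agreement and a fully reversed ordering
kappa_perfect = quadratic_weighted_kappa([1, 2, 3, 2, 1], [1, 2, 3, 2, 1], M=3)
kappa_reversed = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], M=3)
print(kappa_perfect, kappa_reversed)
```

The two toy cases reach the extremes of the range mentioned in the text; the denominator is zero only in degenerate cases where all true labels and all decisions fall in a single class.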

In this work, MAE and AMAE will be used to evaluate the performance in ordinal problems. Note that these two measures give a different importance to the misclassifications per class: MAE gives the same importance to every pattern, which in practice over-represents the performance for the majority classes because the contribution of a class is proportional to its probability (to be more precise, to the number of examples in the data set). And AMAE gives the same importance to all classes, independently of their probability. Cohen's kappa coefficient will be used to assess the agreement between decisions and true classes that is obtained with the proposed method.

Table 1
Experimental datasets and their main characteristics.

| Dataset | Patterns | Attr. | Classes | Patterns per class |
|---|---|---|---|---|
| ERA | 1000 | 4 | 9 | (92,142,181,172,158,118,88,31,18) |
| ESL | 488 | 4 | 9 | (2,12,38,100,116,135,62,19,4) |
| LEV | 1000 | 4 | 5 | (93,280,403,197,27) |
| SWD | 1000 | 10 | 4 | (32,352,399,217) |
| automobile | 205 | 71 | 6 | (3,22,67,54,32,27) |
| balance-scale | 625 | 4 | 3 | (288,49,288) |
| bondrate | 57 | 37 | 5 | (6,33,12,5,1) |
| eucalyptus | 736 | 91 | 5 | (180,107,130,214,105) |
| newthyroid | 215 | 5 | 3 | (30,150,35) |
| pasture | 36 | 25 | 3 | (12,12,12) |
| squash-stored | 52 | 51 | 3 | (23,21,8) |
| squash-unstored | 52 | 52 | 3 | (24,24,4) |
| tae | 151 | 54 | 3 | (49,50,52) |
| toy | 300 | 2 | 5 | (35,87,79,68,31) |
| winequality-red | 1599 | 11 | 6 | (10,53,681,638,199,18) |

5.2. Datasets

Experiments have been performed with the 15 datasets that were used to evaluate several methods in [21], where each dataset was arranged in 30 partitions into training (around 3/4 of the data) and test (around 1/4 of the data) sets. The same partitions⁵ are considered here. The reported results average the performance for these 30 partitions. The basic characteristics of each dataset (number of patterns, number of attributes, number of classes and number of patterns per class) appear in Table 1.

5.3. Benchmark methods

The seven methods tested in [21] will be used as benchmark. These methods include two ordinal variants of class switching [37] to generate ensembles of classifiers that were proposed in [21], Arithmetic Ordinal Class Switching (AOCS) and Geometric Ordinal Class Switching (GOCS), along with the standard (Nominal) Class Switching (NCS) ensemble and a conventional ensemble that does not use class switching, denoted as Original ensemble (Orig). Moreover, results are provided for three other benchmark methods: reduction from ordinal regression to binary support vector machines (REDSVM) [9], the reformulation of Gaussian processes for ordinal regression (GPOR) [8], and the ORBoost method with all margins [38]. The details of the design of these methods can be found in [21].

5.4. Results

The proposed method has the flexibility provided by the parameters $\pi_t$ and $c_{d,t}$ for $t, d \in \{1, 2, \ldots, M\}$ to specify the relative importance of misclassifications as a function of the difference between the class label for the true class and the decision and at the same time the relative importance of each class in the objective function. To deal with the relative importance of misclassifications in an ordinal problem, we have considered

$$c_{d,t} = |d - t|. \tag{39}$$

⁵ Datasets and partitions are available at http://uco.es/grupos/ayrna/orreview (last access, October 2022).

Fig. 2. Parzen windows used in the experiments, $k^X(z)$, $X \in \{G, U, L\}$, with G: Gaussian, U: uniform, L: linear.

To simulate two different situations that can happen in real problems, we have considered two different Bayesian Ordinal Neural Network (BONN) solutions:

• BONN(MAE): the importance of the class is proportional to its probability in the data set. For this kind of situation, the appropriate figure of merit to evaluate the performance of the different methods is MAE. The prior probability parameters of the method are

$$\pi_t = \frac{N_t}{N}. \tag{40}$$

• BONN(AMAE): the importance is the same for all classes, independently of their probability in the data set. In this scenario, which can happen in many imbalanced problems, the appropriate figure of merit is AMAE. The prior probability parameters for this case are

$$\pi_t = \frac{1}{M}. \tag{41}$$

Note that this setup is mathematically equivalent to considering the prior probability for each class as given by the data set but modifying the misclassification costs to be proportional to the inverse of the prior probability of the true class, i.e.

$$\pi_t = \frac{N_t}{N}, \quad c_{d,t} = \frac{1}{\pi_t} |d - t|. \tag{42}$$

In any case, the result is that all classes have the same weight in the proposed cost function (18) independently of their probability.
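The two configurations and the equivalence stated around (42) can be checked with a short sketch. The class counts are those of the LEV dataset in Table 1; the array layout `C[d-1, t-1]` and all variable names are assumptions for illustration. Since the cost (18) depends on the priors and costs only through the products $\pi_t c_{d,t}$, it is enough to compare those products.

```python
import numpy as np

# Class counts of the LEV dataset (Table 1)
N_t = np.array([93, 280, 403, 197, 27])
M, N = len(N_t), int(N_t.sum())

idx = np.arange(1, M + 1)
C_abs = np.abs(idx[:, None] - idx[None, :]).astype(float)   # eq. (39): C[d-1, t-1] = |d - t|

pi_mae = N_t / N                 # eq. (40), BONN(MAE)
pi_amae = np.full(M, 1.0 / M)    # eq. (41), BONN(AMAE)

# Effective weights pi_t * c_{d,t} that multiply p(d|t) in the Bayes cost
W_mae = C_abs * pi_mae[None, :]
W_amae = C_abs * pi_amae[None, :]

# Alternative setup (42): data-set priors with costs scaled by 1 / pi_t
C_alt = C_abs / pi_mae[None, :]
W_alt = C_alt * pi_mae[None, :]

print(np.round(W_mae, 3))
print(np.allclose(W_alt, M * W_amae))
```

The products for setup (42) equal those of (41) up to the constant factor $M$, which rescales the cost without changing its minimizer.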

A multilayer perceptron with a single hidden layer with $L$ neurons is used in the experiments. For each dataset, $L \in \{10, 20, 30, 40, 50, 100\}$ neurons, with both tanh and ReLU activation functions, have been tested. The neuron of the output layer has a linear activation function. Three different Parzen windows have been tested for each dataset: Gaussian, uniform and linear windows, which are plotted in Fig. 2. The uniform and linear windows are constrained to the domain $[-1, 1]$, and the variance of the Gaussian window is 0.1507 (99% of probability in $[-1, 1]$).

$$k^G(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{z^2}{2\sigma^2}}, \quad \text{with } \sigma^2 = 0.1507 \tag{43}$$

$$k^U(z) = \begin{cases} \frac{1}{2} & \text{if } |z| \le 1 \\ 0 & \text{if } |z| > 1 \end{cases} \tag{44}$$

$$k^L(z) = \begin{cases} \frac{1}{2}(z + 1) & \text{if } |z| \le 1 \\ 0 & \text{if } |z| > 1 \end{cases} \tag{45}$$
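The three windows (43) to (45) can be written down directly and checked for the two properties mentioned in the text: unit area, and 99% of the Gaussian mass inside $[-1, 1]$. The sketch below uses a plain Riemann sum on a fine grid; the function names and the grid are assumptions for illustration.

```python
import math
import numpy as np

SIGMA2 = 0.1507  # variance of the Gaussian window in (43)

def k_gauss(z):
    """Gaussian window, eq. (43)."""
    return np.exp(-z ** 2 / (2 * SIGMA2)) / math.sqrt(2 * math.pi * SIGMA2)

def k_unif(z):
    """Uniform window on [-1, 1], eq. (44)."""
    return np.where(np.abs(z) <= 1, 0.5, 0.0)

def k_lin(z):
    """Linear (asymmetric) window on [-1, 1], eq. (45)."""
    return np.where(np.abs(z) <= 1, 0.5 * (z + 1), 0.0)

# Riemann-sum check that each window integrates to one
z = np.linspace(-3.0, 3.0, 600001)
dz = z[1] - z[0]
areas = {name: float(k(z).sum() * dz)
         for name, k in [("G", k_gauss), ("U", k_unif), ("L", k_lin)]}

# Probability mass of the Gaussian window inside [-1, 1]
mass_in_unit_interval = math.erf(1.0 / math.sqrt(2 * SIGMA2))
print(areas, mass_in_unit_interval)
```

The linear window grows from 0 at $z = -1$ to 1 at $z = 1$, so in the gradients, where the window is evaluated at $u_d - z_k$, outputs below a threshold receive a larger weight than outputs above it.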

The choice of the window changes how the samples contribute to the updates during training. The weight for a sample x_k depends on the distance between its network output, z_k, and the decision thresholds u_d (see the gradient Eqs. (25) and (26), which make the updates proportional to k(u_d − z_k)). The uniform window weights uniformly all samples at distance lower than 1. With the Gaussian window the weight decreases exponentially with the distance to the thresholds. The linear window weights linearly and asymmetrically the samples at distance lower than 1, with a higher weight for samples below the thresholds.

Table 2

MAE (average ± standard deviation in the 30 partitions) for each data set.

| Dataset | Best in [21] | BONN(MAE) |
|---|---|---|
| ERA | 1.219 ± 0.044 (REDSVM) | 1.193 ± 0.044 |
| ESL | 0.301 ± 0.035 (GPOR) | 0.291 ± 0.031 |
| LEV | 0.410 ± 0.023 (REDSVM) | 0.389 ± 0.026 |
| SWD | 0.440 ± 0.032 (GPOR) | 0.428 ± 0.029 |
| automobile | 0.263 ± 0.074 (NCS) | 0.282 ± 0.063 |
| balance-scale | 0.001 ± 0.004 (REDSVM) | 0.017 ± 0.008 |
| bondrate | 0.531 ± 0.110 (ORBoost) | 0.438 ± 0.088 |
| eucalyptus | 0.331 ± 0.038 (GPOR) | 0.352 ± 0.024 |
| newthyroid | 0.029 ± 0.022 (REDSVM) | 0.017 ± 0.015 |
| pasture | 0.219 ± 0.147 (NCS) | 0.229 ± 0.075 |
| squash-stored | 0.327 ± 0.122 (AOCS/GOCS) | 0.281 ± 0.111 |
| squash-unstored | 0.161 ± 0.085 (AOCS/GOCS) | 0.096 ± 0.056 |
| tae | 0.461 ± 0.060 (REDSVM) | 0.494 ± 0.073 |
| toy | 0.024 ± 0.013 (REDSVM) | 0.049 ± 0.017 |
| winequality-red | 0.348 ± 0.019 (GOCS) | 0.410 ± 0.014 |

Table 3

AMAE (average ± standard deviation in the 30 partitions) for each data set.

| Dataset | Best in [21] | BONN(AMAE) |
|---|---|---|
| ERA | 1.370 ± 0.099 (AOCS) | 1.277 ± 0.092 |
| ESL | 0.459 ± 0.116 (REDSVM) | 0.432 ± 0.070 |
| LEV | 0.601 ± 0.051 (AOCS) | 0.518 ± 0.056 |
| SWD | 0.576 ± 0.039 (ORBoost) | 0.461 ± 0.046 |
| automobile | 0.313 ± 0.120 (GOCS) | 0.346 ± 0.104 |
| balance-scale | 0.001 ± 0.003 (REDSVM) | 0.015 ± 0.013 |
| bondrate | 0.839 ± 0.260 (ORBoost) | 0.810 ± 0.243 |
| eucalyptus | 0.362 ± 0.040 (GPOR) | 0.381 ± 0.029 |
| newthyroid | 0.048 ± 0.040 (REDSVM) | 0.026 ± 0.034 |
| pasture | 0.219 ± 0.147 (NCS) | 0.229 ± 0.075 |
| squash-stored | 0.368 ± 0.129 (ORBoost) | 0.337 ± 0.163 |
| squash-unstored | 0.170 ± 0.137 (NCS) | 0.153 ± 0.084 |
| tae | 0.459 ± 0.059 (REDSVM) | 0.494 ± 0.082 |
| toy | 0.024 ± 0.016 (REDSVM) | 0.054 ± 0.016 |
| winequality-red | 0.952 ± 0.076 (AOCS) | 0.781 ± 0.121 |

The best network size, activation function, and Parzen window for each dataset and performance criterion have been selected by cross-validation (3-fold cross-validation with the training set).
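The model selection described above can be sketched as a plain grid search. This is a minimal illustration, not the authors' code: the grid values come from the text, but the contiguous, unshuffled folds and the user-supplied `evaluate` callback (which would train a network and return a validation error) are assumptions made only for the example.

```python
import itertools

HIDDEN_SIZES = (10, 20, 30, 40, 50, 100)
ACTIVATIONS = ("tanh", "relu")
WINDOWS = ("gaussian", "uniform", "linear")

def config_grid():
    """All (hidden size, activation, window) combinations under test."""
    return list(itertools.product(HIDDEN_SIZES, ACTIVATIONS, WINDOWS))

def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) for k contiguous folds (no shuffling)."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def select_config(n_samples, evaluate, k=3):
    """Return the configuration with the lowest mean validation error.

    `evaluate(config, train_idx, val_idx)` is a hypothetical callback that
    trains on train_idx and returns the error (MAE or AMAE) on val_idx.
    """
    best_config, best_score = None, float("inf")
    for config in config_grid():
        scores = [evaluate(config, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_config, best_score = config, mean_score
    return best_config, best_score
```

The grid has 6 × 2 × 3 = 36 configurations, so each dataset and criterion requires 108 training runs with 3 folds.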

Tables 2 and 3 show the resulting MAE and AMAE, respectively. To simplify the comparison, in both tables only the result of the best benchmark method in [21] for each dataset has been included (the detailed results of each benchmark method for each dataset can be found in [21]).

The best (lowest) average value of MAE/AMAE for each dataset is underlined, and boldface highlights the cases where the difference is significant. We define as significant those differences where the hypothesis that the two means are equivalent can be rejected with a level of significance equal to 5%, i.e., those datasets where the absolute value of the difference between mean values is higher than 1.65σ [39], σ being the estimated deviation of the difference between the MAE/AMAE of the two methods

$$\sigma = \sqrt{\frac{\sigma_{\text{bench}}^2}{n} + \frac{\sigma_{\text{prop}}^2}{n}}, \qquad (46)$$

where σ_bench denotes the standard deviation for the best benchmark method in [21], σ_prop is the standard deviation of the proposed method, and n = 30 is the number of averaged partitions for both methods.
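The criterion of Eq. (46) is simple enough to reproduce from the tabulated means and standard deviations. The sketch below applies it to the rows of Table 2; the function names and the win/tie/loss labels are illustrative, and a "win" here means that the proposed method has a significantly lower error than the best benchmark method.

```python
import math

N_PARTITIONS = 30  # n in Eq. (46)
Z_CRIT = 1.65      # threshold factor for the 5% level used in the text

def outcome(mean_bench, std_bench, mean_prop, std_prop, n=N_PARTITIONS, z=Z_CRIT):
    """Classify one dataset as 'win', 'tie' or 'loss' for the proposed method."""
    sigma = math.sqrt(std_bench ** 2 / n + std_prop ** 2 / n)  # Eq. (46)
    if abs(mean_bench - mean_prop) <= z * sigma:
        return "tie"
    return "win" if mean_prop < mean_bench else "loss"

# (dataset, benchmark mean, benchmark std, BONN(MAE) mean, BONN(MAE) std), Table 2.
TABLE2_MAE = [
    ("ERA", 1.219, 0.044, 1.193, 0.044),
    ("ESL", 0.301, 0.035, 0.291, 0.031),
    ("LEV", 0.410, 0.023, 0.389, 0.026),
    ("SWD", 0.440, 0.032, 0.428, 0.029),
    ("automobile", 0.263, 0.074, 0.282, 0.063),
    ("balance-scale", 0.001, 0.004, 0.017, 0.008),
    ("bondrate", 0.531, 0.110, 0.438, 0.088),
    ("eucalyptus", 0.331, 0.038, 0.352, 0.024),
    ("newthyroid", 0.029, 0.022, 0.017, 0.015),
    ("pasture", 0.219, 0.147, 0.229, 0.075),
    ("squash-stored", 0.327, 0.122, 0.281, 0.111),
    ("squash-unstored", 0.161, 0.085, 0.096, 0.056),
    ("tae", 0.461, 0.060, 0.494, 0.073),
    ("toy", 0.024, 0.013, 0.049, 0.017),
    ("winequality-red", 0.348, 0.019, 0.410, 0.014),
]

def tally(rows):
    """Count wins, ties and losses over a list of table rows."""
    counts = {"win": 0, "tie": 0, "loss": 0}
    for _, mb, sb, mp, sp in rows:
        counts[outcome(mb, sb, mp, sp)] += 1
    return counts
```

For instance, for ERA the difference 1.219 − 1.193 = 0.026 exceeds 1.65σ ≈ 0.019, so it counts as a significant win, whereas for ESL the difference 0.010 is below 1.65σ ≈ 0.014 and counts as a tie.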

It can be seen that BONN(MAE) obtains the best MAE in 8 of the 15 data sets, with 5 significant wins, and with 5 significant losses in the remaining 7 data sets where the best method in [21] has the best MAE. The best results in AMAE are obtained by BONN(AMAE) in 9 of the 15 data sets, with 5 significant wins, and with 4 significant losses in the other 6 data sets.

Table 4

Wins/Ties/Losses of the BONN(MAE) method in MAE against each one of the benchmark methods (considering wins with a level of significance of 5%), and paired T and paired Wilcoxon tests, including p-value and the hypothesis for the 5% level of significance.

| Method | Wins/Ties/Losses of BONN(MAE) | T-Test p-value (H) | Wilcoxon p-value (H) |
|---|---|---|---|
| Orig | 13/2/0 | 0.0003 (1) | 0.0001 (1) |
| NCS | 11/3/1 | 0.0127 (1) | 0.0128 (1) |
| AOCS | 9/5/1 | 0.0178 (1) | 0.0067 (1) |
| GOCS | 9/5/1 | 0.0183 (1) | 0.0063 (1) |
| GPOR | 11/3/1 | 0.0052 (1) | 0.0014 (1) |
| ORBoost | 12/2/1 | 0.0030 (1) | 0.0009 (1) |
| REDSVM | 11/1/3 | 0.0164 (1) | 0.0215 (1) |

Table 5

Wins/Ties/Losses of the BONN(AMAE) method in AMAE against each one of the benchmark methods (considering wins with a level of significance of 5%), and paired T and paired Wilcoxon tests, including p-value and the hypothesis for the 5% level of significance.

| Method | Wins/Ties/Losses of BONN(AMAE) | T-Test p-value (H) | Wilcoxon p-value (H) |
|---|---|---|---|
| Orig | 12/3/0 | 0.0018 (1) | 0.0003 (1) |
| NCS | 10/5/1 | 0.0087 (1) | 0.0016 (1) |
| AOCS | 9/6/0 | 0.0106 (1) | 0.0012 (1) |
| GOCS | 9/6/0 | 0.0101 (1) | 0.0012 (1) |
| GPOR | 13/0/2 | 0.0007 (1) | 0.0002 (1) |
| ORBoost | 11/4/0 | 0.0003 (1) | 0.0001 (1) |
| REDSVM | 10/2/3 | 0.0036 (1) | 0.0026 (1) |

It is important to remark that the best result in [21] for each dataset is obtained by different methods. These methods are indicated in Tables 2 and 3. So, the performance of the proposed procedures must be considered very good. To further support this conclusion, Tables 4 and 5 present the number of wins, ties and losses of BONN(MAE) in MAE and BONN(AMAE) in AMAE, respectively, against each benchmark method (wins and losses are defined with the significance level of 5%, and ties are the cases where the hypothesis of the means being equivalent cannot be rejected with this significance level). These tables also contain the p-value and the hypothesis for a paired T test and a paired Wilcoxon test. It can be seen that BONN(MAE) has a minimum of 9 wins in MAE (against AOCS and GOCS, with only 1 loss) and a maximum of 3 losses, against REDSVM, but with 11 wins against this method. In AMAE, BONN(AMAE) also has a minimum of 9 wins (against AOCS and GOCS, now with no losses) and a maximum of 3 losses against REDSVM, but with 10 wins against it. In all cases, the null hypothesis that the true mean difference is zero is rejected by both the T test and the Wilcoxon test. These results allow us to conclude that the proposed approach has a remarkably high performance: in a pairwise comparison it provides the best average results and the best wins/losses figure against every benchmark method. The difference in performance is statistically significant in all cases.

To test the differences between more than two models, the Friedman test [40] is a well-known and widely used method. Figure 3 plots the means and ranks obtained in the Friedman test, both for MAE and AMAE. This test also shows a clear advantage in the performance of the proposed method with respect to all the benchmark methods, especially in AMAE, the scenario where imbalance is taken into account.

Fig. 3. Average means and ranks for Friedman test applied to the 8 methods under comparison for both MAE and AMAE (better models: on the left).

One of the main characteristics of the proposed method is the flexibility provided by the Bayesian formulation. This feature allows it to provide good results in different scenarios with different requirements. Although the method is designed to improve the metrics that are related to ordinal problems, including imbalance, it can also be used to match different requirements. Just as an example, it can be used to maximize accuracy or balanced accuracy. If the decision costs are now

$$c_{d,t} = \begin{cases} 0, & d = t \\ 1, & d \neq t, \end{cases} \qquad (47)$$

these costs along with (40) and (41) can be used to maximize (32) and (33), respectively: BONN(Acc) and BONN(AAcc) are used to name these two new approaches. Table 6 compares the average performance obtained in the 15 datasets using the proposed model with the 4 different configurations. The figures of merit are accuracy, balanced accuracy, MAE and AMAE (for the sake of a more compact presentation, the standard deviations are not included in the table, but have been used to determine the wins/ties in each figure of merit as in the previously presented results). Here, the best results for each figure of merit are highlighted in boldface, while those without a significant difference with the best result (ties, using (46) as before) are underlined. A conventional multi-class MLP with softmax activation in the output layer has also been included in the comparison as a benchmark for accuracy. To isolate the effect of the cost function in the results, the same basic architecture has been used with all methods in all datasets: an MLP with 20 neurons in the hidden layer and tanh activation function (note that the output layer is different for the MLP with softmax activation).

Table 6

Average figures of merit in the 15 datasets.

| Method | Acc | AAcc | MAE | AMAE | Cohen's κ² |
|---|---|---|---|---|---|
| MLP (Softmax) | 0.6842 | 0.6132 | 0.3938 | 0.5089 | 0.7180 |
| BONN(Acc) | 0.6772 | 0.5825 | 0.3990 | 0.5603 | 0.6701 |
| BONN(AAcc) | 0.6274 | 0.6414 | 0.4813 | 0.4836 | 0.6769 |
| BONN(MAE) | 0.6841 | 0.5877 | 0.3645 | 0.5137 | 0.7085 |
| BONN(AMAE) | 0.6286 | 0.6441 | 0.4415 | 0.4270 | 0.7177 |
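The link between the four configurations and the four figures of merit can be made concrete with a small numerical check. The sketch below assumes the standard Bayes risk, Σ_t π_t Σ_d c_{d,t} P(d | t), with P(d | t) estimated from a confusion matrix whose rows are true classes; the confusion matrix itself is hypothetical and the function names are illustrative.

```python
def decision_rates(confusion):
    """rates[t][d]: fraction of class-t samples decided as class d."""
    return [[n / sum(row) for n in row] for row in confusion]

def empirical_priors(confusion):
    """Eq. (40): pi_t = N_t / N, with N_t taken from the row sums."""
    total = sum(sum(row) for row in confusion)
    return [sum(row) / total for row in confusion]

def uniform_priors(confusion):
    """Eq. (41): pi_t = 1 / M."""
    m = len(confusion)
    return [1.0 / m] * m

def zero_one_cost(d, t):
    """Eq. (47)."""
    return 0.0 if d == t else 1.0

def absolute_cost(d, t):
    """Ordinal cost |d - t| (assumed for the MAE/AMAE configurations)."""
    return float(abs(d - t))

def bayes_cost(confusion, priors, cost):
    """Empirical Bayes risk: sum_t pi_t * sum_d c_{d,t} * P(d | t)."""
    rates = decision_rates(confusion)
    m = len(rates)
    return sum(
        priors[t] * sum(cost(d, t) * rates[t][d] for d in range(m))
        for t in range(m)
    )

# Hypothetical 3-class confusion matrix (rows: true class, columns: decision).
CONFUSION = [[7, 2, 1], [10, 70, 20], [1, 4, 5]]

one_minus_acc = bayes_cost(CONFUSION, empirical_priors(CONFUSION), zero_one_cost)
one_minus_aacc = bayes_cost(CONFUSION, uniform_priors(CONFUSION), zero_one_cost)
mae = bayes_cost(CONFUSION, empirical_priors(CONFUSION), absolute_cost)
amae = bayes_cost(CONFUSION, uniform_priors(CONFUSION), absolute_cost)
```

Under these assumptions, each choice of priors and costs makes the risk coincide with 1 − Acc, 1 − AAcc, MAE or AMAE, which is why minimizing the cost with the matching configuration targets the corresponding figure of merit.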

With respect to imbalance, Table 6 shows the expected results: there is a trade-off between the metrics that consider the imbalance and those that do not. The methods designed to improve Acc or MAE have worse values for AAcc or AMAE (and vice versa). The results in Table 6 also show that using the ordinal information in the classification task can be useful. In accuracy, the MLP with softmax provides slightly better results than BONN(Acc), as expected (in general, multi-class approaches with softmax have shown in the literature better results than regression-based approaches in terms of accuracy for classification problems). But BONN(MAE), which is designed taking the order into account in the classification costs c_d,t to minimize MAE, obtains an equivalent accuracy at the same time that it obtains the best (lowest) MAE. A similar behavior is observed when the imbalance is considered: for balanced accuracy, an equivalent performance is obtained with BONN(AMAE), designed taking into account the order, and BONN(AAcc), which does not include the order in its design but is designed to maximize balanced accuracy. At the same time, BONN(AMAE) obtains the best AMAE, as expected. Therefore, improving the ordinal classification (MAE/AMAE) can help to improve the accuracy (Acc/AAcc). We want to remark that Table 6 shows the average results in the 15 datasets. In some datasets, when MAE/AMAE is improved, Acc/AAcc increases. But in other datasets a trade-off between MAE and Acc (or AMAE and AAcc) has been observed: improving one of them tends to degrade the other one (when close to the best metrics, obviously). Finally, Cohen's kappa coefficient shows that all methods have a good agreement between decisions and true classes, far away from agreement by chance, with MLP with softmax and BONN(AMAE) having the best values for κ².

The previous experiments show that the proposed method allows us to deal simultaneously with ordinal problems and with the imbalance, as BONN(AMAE) does with the parameters in (42). In this case all classes have the same weight in the loss function independently of their probabilities. But the Bayesian formulation also provides the flexibility to consider the relative importance among classes. This can be useful in applications where the detection of a class, which can be the minority class, is highly relevant to end users. As an example of this flexibility, Table 7 compares the confusion matrices obtained in the LEV dataset with the MLP (Softmax), BONN(MAE), BONN(AMAE) and BONN(Emph-5), which denotes a solution emphasizing class 5, the minority class in LEV (only 2.7% of samples, compared to 40.3% for class 3, the majority class). This solution uses the following parameters

$$\pi_t = \frac{N_t}{N}, \qquad c_{d,t} = \begin{cases} \frac{1}{\pi_t}\,|d - t|, & t \neq 5 \\ \frac{2}{\pi_t}\,|d - t|, & t = 5. \end{cases} \qquad (48)$$

To emphasize the importance of class 5, its contribution to the loss is now doubled. The diagonal of the matrices, which contains the per-class accuracy, Acc_t in (34), has been highlighted (boldface) to facilitate the analysis. Additionally, Acc, AAcc (which is equal to the average of the values in the diagonal of the confusion matrix), MAE and AMAE are provided for every solution. The MLP and BONN(MAE) show a similar behavior: both approaches have a better accuracy for the majority classes, and the minority class (class 5) is almost ignored. The conventional MLP obtains a slightly better accuracy, and BONN(MAE) obtains a slightly better MAE by reducing in general the misclassification errors with higher class differences, as expected. In contrast, BONN(AMAE) notably improves AMAE by forcing a more uniform per-class accuracy (note that the average accuracy, AAcc, is also notably improved). Finally, BONN(Emph-5) is able to improve the accuracy of class 5, even though it is the minority class, obviously at the price of a lower accuracy for the other classes (especially for the closest class) and a slightly worse AMAE than BONN(AMAE).
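The costs of Eq. (48) can be written as a short function. In this sketch only the priors of classes 3 and 5 (40.3% and 2.7%) come from the text; the values for classes 1, 2 and 4 are placeholders chosen so that the priors sum to one, and the function names are illustrative. As before, the per-term weight is assumed to be the product π_t c_{d,t}.

```python
import math

# Class priors for a 5-class problem. Classes 3 and 5 follow the text;
# classes 1, 2 and 4 are placeholder values, not taken from the source.
PRIORS = {1: 0.093, 2: 0.280, 3: 0.403, 4: 0.197, 5: 0.027}

def emphasized_cost(d, t, priors=PRIORS, emphasized_class=5, factor=2.0):
    """Eq. (48): |d - t| / pi_t, multiplied by `factor` for the emphasized class."""
    base = abs(d - t) / priors[t]
    return factor * base if t == emphasized_class else base

def term_weight(d, t, priors=PRIORS):
    """pi_t * c_{d,t}: assumed weight of a (decision d, true class t) error."""
    return priors[t] * emphasized_cost(d, t, priors)
```

With these costs, an error at distance 1 on a class-5 sample has weight 2, while the same error on any other class has weight 1, regardless of how rare the class is.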
