Unimodal regularisation based on beta distribution for deep ordinal regression
Víctor Manuel Vargas∗, Pedro Antonio Gutiérrez, César Hervás-Martínez
Department of Computer Science and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, “Albert Einstein” building, 3rd floor, 14014 Córdoba, Spain
∗Corresponding author. E-mail address: [email protected] (V. Manuel Vargas).
Article info

Article history: Received 3 March 2021; Revised 6 September 2021; Accepted 7 September 2021; Available online 10 September 2021.

Keywords: Ordinal regression; Unimodal distribution; Convolutional network; Beta distribution; Stick-breaking.
Abstract

Currently, the use of deep learning for solving ordinal classification problems, where categories follow a natural order, has not received much attention. In this paper, we propose a unimodal regularisation based on the beta distribution applied to the cross-entropy loss. This regularisation encourages the distribution of the labels to be a soft unimodal distribution, more appropriate for ordinal problems. Given that the beta distribution has two parameters that must be adjusted, a method to automatically determine them is proposed. The regularised loss function is used to train a deep neural network model with an ordinal scheme in the output layer. The results obtained are statistically analysed and show that the combination of these methods increases the performance in ordinal problems. Moreover, the proposed beta distribution performs better than other distributions proposed in previous works, achieving also a reduced computational cost.
© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
In the last decade, ordinal classification/regression has received increasing interest in the literature [1–3]. The methods focused on solving this kind of problem aim to determine the discrete category or ranking of a pattern on an ordinal scale. This ordinal scale is given by the natural ordering of the categories existing in the problem considered [4]. For instance, in medical problems where we obtain a diagnosis from images, the category is usually on an ordinal scale (e.g. Diabetic Retinopathy (DR) detection [5], with five levels of the disease). Another possible example is the prediction of the age range of people from photographs of their faces [6].
There are many real-world problems where the available data have an underlying ordinal structure. In [7], the authors aim to predict the level of Parkinson's disease based on volumetric images obtained through encephalograms. The patients are classified depending on the state of this pathology (1: healthy patient, 2: slight alteration, 3: more advanced alteration, etc.). In [8], the authors try to predict convective situations at the Madrid-Barajas airport in Spain, which is crucial for this kind of transportation facility, as they can cause a severe impact on flight scheduling and safety. These situations can be present in several degrees, resulting in different classes that follow a natural order. Finally, in [9], the authors try to automatically detect prostate cancer of different degrees based on the Gleason score, which is a standard for measuring the aggressiveness of this type of cancer. According to this metric, prostate cancer is divided into 5 categories, where the first one does not require treatment while the others do. Also, depending on the aggressiveness degree, the treatment must be different, and it is important to determine this level accurately to avoid excessive or insufficient treatment.
In any of these real-world problems, misclassifying a pattern in an adjacent class is always less important than misclassifying it in a distant class. This is the main reason why taking the ordinal information into account when solving this kind of problem is essential. Moreover, when ordinal classifiers are used, the order of the labels is explicitly considered in the model, which generally accelerates the learning process and reduces the amount of data needed for training.
Machine learning methods based on Deep Learning [10] have been used for a wide variety of tasks. Deep Neural Networks (DNN) have the ability to obtain high-level abstract representations of low-level features: each layer of the network extracts higher-level features from the previous layer. Specifically, Convolutional Neural Networks (CNN) take a grayscale or RGB colour image as input data and extract a set of features that are used to classify the pattern into one of the different categories or rankings. This kind of model has been used in problems related to image
classification [11], speech recognition [12], control problems [13], object detection [14], privacy and security protection [15], recovery of human pose [16], semantic segmentation [17], image retrieval [18,19], visual recognition [20], etc.
However, the resolution of ordinal problems with deep learning models has not received much attention. The most common approach to solve ordinal data problems is to treat them as multi-class problems and optimise the model using the standard cross-entropy (CE) loss [21]. This approach has a major drawback: it does not take into account the order between categories. Some previous works [22,23] have explored different loss functions to address this problem. Another common approach is to convert the ordinal regression problem into a standard regression one [24]. This approach keeps the order between rankings but assumes that the discrete categories are continuous and equispaced.
Another problem when working with ordinal data is found in the way the labels are encoded. Usually, the one-hot encoding is used, which represents the label as a binary vector where the j-th element is 1 when the true class is j. However, this encoding incurs the same penalty for all misclassification errors, without taking into account the distance to the true label. A better approach for ordinal regression problems is to use a smooth label representation, in such a way that classes which are close to the real label produce a smaller error than classes that are far from it. This method, known as label smoothing or unimodal regularisation for the loss function, has been used in previous works, and different distribution functions have been used to model the shape of these smooth labels (Poisson, binomial [25] or exponential [26]).
In this work, we propose to use beta distributions for the label smoothing method, together with an approach to automatically determine the parameters of these distributions. To evaluate the performance of the proposed method, we use the unimodal regularised loss to train a CNN model with three separate image datasets. The unimodal regularisation is combined with the recently proposed stick-breaking scheme for the output layer [27], but it is also tested with the standard softmax function. As shown in the following sections, combining these two elements results in improved performance for ordinal problems with respect to previously published alternatives.
The rest of this paper is organised as follows: previous related works, including the stick-breaking approach and the loss regularisation, are described in Section 2; in Section 3, the new cross-entropy loss regularisation with the beta distribution is explained; in Section 4, the design of the experiments and the datasets used are described; in Section 5, the results of the experiments are shown and compared with previous works; and, finally, in Section 6, the conclusions of this work are presented.
2. Related works

2.1. Stick-breaking
The stick-breaking approach considers the problem of breaking a stick of length 1 into J segments. This methodology is related to non-parametric Bayesian methods and can be considered a subset of the random allocation processes [28]. Also, this method has been applied as a generalisation of the continuation ratio models [29].
In Latent Gaussian Models (LGMs), the latent variables follow a Gaussian distribution where the probability of each categorical variable is parametrised in terms of a linear projection $\eta_1, \ldots, \eta_J$ [30], where $J$ is the number of classes of the problem. The probability of the first category is modelled as $\sigma(\eta_1)$, where $\sigma(x) = 1/(1+e^{-x})$. This represents the first piece of the stick. The length of the remainder of the stick is $1 - \sigma(\eta_1)$. The probability of the second category is a fraction $\sigma(\eta_2)$ of the remainder of the stick. Following this process, we can define the probability of each of the $J$ categories. The probabilities (stick lengths) are all positive and sum to one; they thus define a valid probability distribution. A different function for $\sigma(x)$ can be used (such as the probit function). However, the logit allows us to use efficient variational bounds. The stick-breaking parametrisation can be written compactly as:

$$p(y = C_j \mid \{\eta_q\}) = \sigma(\eta_j) \prod_{i=1}^{j-1} \left(1 - \sigma(\eta_i)\right), \quad j = 1, \ldots, J. \qquad (1)$$

Recently, a stick-breaking approach was presented in [27] as an alternative to the standard softmax for ordinal classification problems where the output distribution should be unimodal. The authors define a stick of unit length and sequentially break off parts of the stick. The lengths of the generated stick fragments represent the discrete probabilities of each class.
In the first stick-breaking step, two parts with lengths $\sigma(\eta_1)$ and $1-\sigma(\eta_1)$ are created. These fragments represent, respectively, the probability of the first class:

$$p(y_i = C_1) = \sigma(\eta_1) = \sigma(f_1(\mathbf{x})) = \frac{1}{1+e^{-f_1(\mathbf{x})}}, \qquad (2)$$

and the probability of the remaining classes in the ordinal scale ($y_i \succ C_1$):

$$p(y_i \succ C_1) = 1 - \sigma(\eta_1) = 1 - \sigma(f_1(\mathbf{x})) = 1 - \frac{1}{1+e^{-f_1(\mathbf{x})}} = \frac{1}{1+e^{f_1(\mathbf{x})}}. \qquad (3)$$

Then, the remaining part $1-\sigma(\eta_1)$ is broken, obtaining two parts of lengths $\sigma(\eta_2)(1-\sigma(\eta_1))$ and $(1-\sigma(\eta_2))(1-\sigma(\eta_1))$. This breaking process can be mathematically written as:

$$p(y = C_1 \mid \eta_1) = l_1 = \sigma(\eta_1), \qquad (4)$$

$$p(y = C_j \mid \{\eta_k\}_{k=1}^{j}) = l_j = \sigma(\eta_j) \prod_{i=1}^{j-1} \left(1-\sigma(\eta_i)\right), \quad j = 2, \ldots, J-1, \qquad (5)$$

$$p(y = C_J \mid \{\eta_k\}_{k=1}^{J-1}) = l_J = \prod_{i=1}^{J-1} \left(1-\sigma(\eta_i)\right) = \prod_{i=1}^{J-1} \frac{1}{1+e^{f_i(\mathbf{x})}}, \qquad (6)$$

where the length of each fragment gives the probability of the corresponding class.
The stick-breaking process is used for training deep ordinal neural networks [27]. To do this, the authors set $J-1$ output neurons for a problem with $J$ ranks or ordinal categories. $f_i(\mathbf{x})$ is a scalar denoting the $i$th output of the neural network and replaces the linear projections ($\eta_i$) of the LGMs. The conventional cross-entropy loss, CE, can be used to train the model.

It can be derived that each output associated with $f_i(\mathbf{x})$ is actually the ratio:

$$\frac{p(y = C_i \mid \mathbf{x})}{p(y \succeq C_i \mid \mathbf{x})} = \frac{p(y = C_i \mid \mathbf{x})}{p(y = C_i \mid \mathbf{x}) + p(y \succ C_i \mid \mathbf{x})} = \frac{\dfrac{e^{f_i(\mathbf{x})}}{1+e^{f_i(\mathbf{x})}} \displaystyle\prod_{l=1}^{i-1} \frac{1}{1+e^{f_l(\mathbf{x})}}}{\dfrac{e^{f_i(\mathbf{x})}}{1+e^{f_i(\mathbf{x})}} \displaystyle\prod_{l=1}^{i-1} \frac{1}{1+e^{f_l(\mathbf{x})}} + \displaystyle\prod_{l=1}^{i} \frac{1}{1+e^{f_l(\mathbf{x})}}} = \frac{e^{f_i(\mathbf{x})}}{1+e^{f_i(\mathbf{x})}} = \frac{1}{1+e^{-f_i(\mathbf{x})}} = \sigma(f_i(\mathbf{x})). \qquad (7)$$

Consequently, $f_i(\mathbf{x})$ can be interpreted as defining decision boundaries that try to separate the $i$th class from all the classes that come after it. By doing so, the prediction is still a discrete probability distribution.
An interesting property of this method is that, unlike other approaches that only output a single distribution value [25,31], it is more expressive, because each boundary of two adjacent classes has its own scalar output $f_i(\mathbf{x})$.
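To make the parametrisation tangible, the following sketch computes the class probabilities of Eqs. (4)-(6) from the $J-1$ raw network outputs; the function name and the example values are ours, not taken from the paper.

```python
import numpy as np

def stick_breaking_probs(f):
    """Map J-1 outputs f_i(x) to J class probabilities following Eqs. (4)-(6):
    p_j = sigma(f_j) * prod_{i<j} (1 - sigma(f_i)), and the last class takes
    whatever is left of the stick."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(f, dtype=float)))      # sigma(f_i)
    remainder = np.concatenate(([1.0], np.cumprod(1.0 - s)))   # stick left before each break
    return np.append(s, 1.0) * remainder

f = [0.3, -0.1, 1.2, 0.5]   # hypothetical outputs for a J = 5 problem
p = stick_breaking_probs(f)
print(p, p.sum())           # all probabilities positive and summing to one
```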
2.2. Unimodal regularisation
Label smoothing is a general regularisation technique to address the noisy label problem, which encourages the model to be less confident [27]. In the case of a one-hot label, the distribution of the label probability is $q(i) = \delta_{i,1}$, where 1 is the ground-truth class and $\delta_{i,1}$ is a Dirac delta, which equals 1 for $i = 1$, and 0 otherwise. Label smoothing can be applied to the cross-entropy loss:

$$\mathcal{L} = \sum_{i=1}^{J} q(i) \left[-\log p(y = C_i \mid \mathbf{x})\right], \qquad (8)$$

by replacing $q(i)$ with a more conservative target distribution:

$$\mathcal{L} = \sum_{i=1}^{J} q'(i) \left[-\log p(y = C_i \mid \mathbf{x})\right], \qquad (9)$$

where $q'(i) = (1-\eta)\delta_{i,1} + \eta \frac{1}{J}$ and $\eta$ is the parameter that controls the linear combination.
In ordinal classification, misclassified patterns are most likely assigned to the classes closest to their real class. Therefore, building unimodal distributions which have their mode in the centre of the interval, for the middle classes, or at the upper or lower bounds, for the extreme classes, should yield a more accurate loss computation. Moreover, it is quite important that the probability distribution has small variance and that the majority of its probability mass is concentrated in the interval associated with the real class. In this way, the probability is drastically reduced as we move further from the correct class.
The distributions proposed in previous works [25–27] to model the targets in a soft manner have improved the performance of ordinal classifiers with respect to the standard one-hot encoding. However, they have high variance or do not offer the required flexibility to position the mode of each distribution in the centre of the class interval while preserving a small variance. Also, some of the proposed methods require different parameters to be adjusted experimentally.
In [25], the authors used Poisson distributions to model the probabilities. The mean and variance of this kind of distribution are both equal to the distribution parameter $\lambda$; therefore, it has limited flexibility to obtain a small variance. For this reason, they also used the binomial distribution, which has two parameters: the number of classes, $J$, and the probability, $p$. Even though the mean ($Jp$) and the variance ($Jp(1-p)$) have different expressions, it is not easy to position the mode at the right point of the interval while obtaining a small variance. Finally, the authors of [27] proposed to sample an exponential function $e^{-|i-l|/\tau}$, where $l$ is the class of the pattern and $i = 1, \ldots, J$, followed by a softmax normalisation. However, the value of $\tau$ must be adjusted experimentally and, in some cases, the probability mass is not sufficiently concentrated in the interval of the correct class.

To overcome these issues related to the Poisson and binomial distributions and the exponential function, we propose in this work a set of probability distributions associated with the beta distribution, given that their variance is small and the domain of the distribution is between 0 and 1. As a graphical example, Fig. 1 illustrates the shape of the distribution associated with each class and type of distribution for a problem with five classes. For the discrete distributions, the class number is represented on the x axis, while the class intervals are used for the continuous distributions.
3. Proposed method

3.1. Beta regularised cross-entropy loss
The main idea behind this work is to use probability distributions to model the targets as unimodal distributions instead of using the one-hot encoding. In this way, we obtain the soft target distributions q(i) discussed in Section 2. To do that, we consider the beta distribution, defined in the range [0, 1]; therefore, there is no need to apply any normalisation and, also, it does not lead to high variance. The beta distribution has been applied to model the behaviour of random variables limited to intervals of finite length in a wide variety of disciplines. Some properties of this distribution are described below.
In its standard form, the beta distribution, $\beta(a,b)$, is a continuous distribution, and its probability density function (pdf) is:

$$f(x, a, b) = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}, \qquad (10)$$

where $0 < x < 1$, $a > 0$ and $b > 0$. This is also known as the classical beta distribution. The function $B(a,b)$ has the form:

$$B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}, \qquad (11)$$

where $\Gamma(a) = (a-1)!$ for integer $a$. When $a, b > 1$, $f(x)$ has a unique mode at $\frac{a-1}{a+b-2}$ and is zero at $x = 0$ and $x = 1$. If $a = 1$ or $b = 1$, then $f(x)$ has a corresponding terminal value $b$ or $a$. Finally, if $a = b = 1$, then $f(x)$ becomes the uniform distribution.

Since the range of $f(x)$ is finite, all its moments exist. Its mean is given by the expression $E(x) = \frac{a}{a+b}$ and its variance is defined as $V(x) = \frac{ab}{(a+b)^2(a+b+1)}$.
In order to analyse the behaviour of this distribution, we consider an ordinal classification problem with five classes ($J = 5$). We assume that the distributions of the labels are beta distributions, where class 1 takes random values in the range [0, 0.2], class 2 in [0.2, 0.4], class 3 in [0.4, 0.6], class 4 in [0.6, 0.8], and class 5 in [0.8, 1.0].

First, we find the values of the parameters $a$ and $b$ which make the beta distribution associated with class 1 have its centre in the middle of the [0, 0.2] interval. In this case, we chose $a = 1$ and $b = 9$, which leads to $E(x) = 0.1$ and $V(x) = 0.008$. The associated density function is given by:

$$f(x) = 9(1-x)^8, \quad 0 \le x \le 1. \qquad (12)$$

In this way, the probability of each class can be calculated as follows:

$$p_1 = p(y = C_1) = \int_0^{0.2} 9(1-x)^8\,dx = 0.8658, \qquad (13)$$

and, therefore, $p_2 = 0.1241$, $p_3 = 0.0098$, $p_4 = 2.6 \times 10^{-4}$ and $p_5 = 5.1 \times 10^{-7}$.
In the same way, we can compute the distributions and the probabilities associated with the other classes, finding the a and b parameters that make the distribution centred in the intervals [0.2, 0.4], [0.4, 0.6], [0.6, 0.8] and [0.8, 1.0].
However, adjusting these parameters by trial and error is not the best method, as it requires additional computational time. Therefore, in the next section, an alternative method to determine a and b is proposed.
Fig. 1. Label distributions shape for N = 5 and different real classes: 0 (red), 1 (green), 2 (blue), 3 (purple), 4 (black). The x axis represents the labels for the discrete distributions and the class intervals for the beta. The y axis shows the probability for the Poisson, binomial and exponential distributions and the value of the probability density function for the beta distribution. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
3.2. Beta distribution parameters based on the number of classes

In this section, we propose a method to find the parameters (a, b) for each class of any ordinal problem based on the number of classes.
First, we define the thresholds of each class based on the number of classes ($J$). For any value of $J$, the centres of the intervals can be obtained as $1/(2J), 3/(2J), \ldots, (2J-1)/(2J)$. The length of the first interval should be $1/J$, and the mean value is $1/(2J)$. Consequently:

$$E(x) = \frac{a}{a+b} = \frac{1}{2J} \;\Rightarrow\; b = a(2J-1). \qquad (14)$$

Then, the variance can be defined as:

$$V(x) = \frac{2J-1}{4J^2(2Ja+1)}, \qquad (15)$$

and the standard deviation:

$$S(x) = \frac{1}{2J}\sqrt{\frac{2J-1}{2Ja+1}}. \qquad (16)$$

We assume that most of the values of the distribution should be in the range $E(x) \pm S(x)$. In this way, we obtain the constraints for the first interval:

$$0 < \frac{1}{2J} \pm \frac{1}{2J}\sqrt{\frac{2J-1}{2Ja+1}} < \frac{1}{J}. \qquad (17)$$

Solving the first inequality, we get:

$$\frac{2J-1}{2Ja+1} < 1 \;\Rightarrow\; a > \frac{J-1}{J}. \qquad (18)$$
As a consequence, for any $J$, we can use $a = 1$ and $b = a(2J-1)$.

In the same way, we obtain the parameters for the second interval. Now, $E(x) = \frac{a}{a+b} = \frac{3}{2J}$ and $b = \frac{a(2J-3)}{3}$. The variance and the standard deviation are:

$$V(x) = \frac{9(2J-3)}{4J^2(2Ja+3)}, \qquad (19)$$

$$S(x) = \frac{3}{2J}\sqrt{\frac{2J-3}{2Ja+3}}. \qquad (20)$$
And the constraints for this interval are given by:

$$\frac{1}{J} < \frac{3}{2J} \pm \frac{3}{2J}\sqrt{\frac{2J-3}{2Ja+3}} < \frac{2}{J}, \qquad (21)$$

which leads to:

$$\frac{2J-3}{2Ja+3} < \frac{1}{9} \;\Rightarrow\; a > \frac{9(2J-3)-3}{2J}. \qquad (22)$$
In the same way, we can obtain the parameters for the rest of the intervals. The parameters of the beta distributions for each class are shown in Table 1 for different numbers of classes ($J \le 8$).
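For illustration, the rule for the first interval (Eqs. (14)-(18)) can be coded directly; the function names are ours, and the integer pairs finally listed in Table 1 are close to, but not always identical to, the values this rule yields.

```python
def first_interval_params(J):
    """Eqs. (14) and (18): any a > (J - 1) / J works, so take a = 1 and b = a(2J - 1)."""
    a = 1
    b = a * (2 * J - 1)
    return a, b

def mean_std(a, b):
    """Closed-form E(x) and S(x) of a beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

a, b = first_interval_params(5)
m, s = mean_std(a, b)
print(a, b, m, s)   # mean = 0.1 = 1/(2J); E(x) +/- S(x) stays inside [0, 0.2]
```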
Finally, the beta regularised cross-entropy loss can be expressed as:

$$\mathcal{L} = \sum_{i=1}^{J} q'(i) \left[-\log p(y = C_i \mid \mathbf{x})\right], \qquad (23)$$

where $q'(i) = (1-\eta)\delta_{i,1} + \eta f(x, a, b)$, and $f(x, a, b)$ is the probability value sampled from a beta distribution that is centred at $x = \frac{2j-1}{2J}$ and uses the $a$ and $b$ parameters obtained using the method described in this section.
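Putting the pieces together, the following sketch builds the soft targets of Eq. (23) from the per-class (a, b) pairs of Table 1, reading f(x, a, b) as the probability mass that the beta distribution of the true class assigns to each class interval (our interpretation of the text; the helper names and the value of η are ours).

```python
import numpy as np
from scipy.stats import beta

BETA_PARAMS_J5 = [(1, 8), (6, 14), (12, 12), (14, 6), (8, 1)]  # Table 1, J = 5

def beta_soft_target(true_class, n_classes, params, eta=1.0):
    """q'(i) = (1 - eta) * delta_{i,l} + eta * f(x, a, b), Eq. (23)."""
    a, b = params[true_class]
    edges = np.linspace(0.0, 1.0, n_classes + 1)
    # Mass that the beta distribution of the true class puts on each interval
    mass = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
    q = eta * mass
    q[true_class] += 1.0 - eta
    return q

for label in range(5):
    print(label, np.round(beta_soft_target(label, 5, BETA_PARAMS_J5), 4))
```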
3.3. Beta distribution properties regarding ordinal class representation

The main benefit of using the beta distribution for modelling the probability distribution is the fact that it has two parameters that allow obtaining different distribution shapes with small
Table 1
Beta parameters (a, b) for each class and each number of classes.

J   C1      C2      C3       C4       C5       C6       C7      C8
3   (1,4)   (4,4)   (4,1)
4   (1,6)   (6,10)  (10,6)   (6,1)
5   (1,8)   (6,14)  (12,12)  (14,6)   (8,1)
6   (1,10)  (7,20)  (15,20)  (20,15)  (20,7)   (10,1)
7   (1,12)  (7,26)  (16,28)  (24,24)  (28,16)  (26,7)   (12,1)
8   (1,14)  (7,31)  (17,37)  (27,35)  (35,27)  (37,17)  (31,7)  (14,1)
Fig. 2. Beta distribution shapes for extreme, middle and intermediate classes.
variance. Thus, it can represent the distribution of the extreme classes along with the symmetric distribution of the middle class. When $a = 1$, the probability density function is given by:

$$f(x, 1, b) = b(1-x)^{b-1}, \quad 0 \le x \le 1, \qquad (24)$$

while the pdf for $b = 1$ is:

$$f(x, a, 1) = ax^{a-1}, \quad 0 \le x \le 1. \qquad (25)$$

As shown in Fig. 2a, these functions can easily represent the distributions associated with the extreme classes.
On the other hand, the beta distribution with $a = b$ and $b \to \infty$ is similar to a normal distribution. Let the random variable $X$ be associated with a $\beta(b,b)$ distribution with probability density function:

$$f_X(x, b) = \frac{\Gamma(2b)}{\Gamma^2(b)}\left(x(1-x)\right)^{b-1}, \quad 0 \le x \le 1, \qquad (26)$$

where $b$ is a real positive parameter. The mean of $X$ is $E[X] = 0.5$ and the variance of $X$ is $V[X] = \frac{1}{4(2b+1)}$.

Proposition. The $\beta(b,b)$ distribution converges to the normal distribution when $b \to \infty$, that is:

$$\beta(b,b) \xrightarrow{d} N\!\left(\frac{1}{2}, \frac{1}{4(2b+1)}\right). \qquad (27)$$

The proof of this proposition is included in Appendix A.
Therefore, the beta distribution can also accurately represent the distribution of the middle class through a symmetric distribution when the problem has an odd number of classes, keeping the variance small (see Fig. 2b). When the value of $b$ increases, the variance of the distribution becomes smaller. The symmetric property of the distribution in the aforementioned cases can be easily checked with the skewness coefficient, which is calculated as:
$$\text{Skewness} = \frac{2(b-a)\sqrt{a+b+1}}{(a+b+2)\sqrt{ab}}, \qquad (28)$$

which is zero due to the fact that $a = b$.
As mentioned before, the rest of the classes have asymmetrical distributions with small variance, as can be observed in Fig. 2c. However, when the number of classes increases, the resulting distributions get closer to a normal distribution with small variance (see the distributions for (17,37) or (37,17) in Fig. 2c).
The properties described in this section make the beta distribution an excellent choice for modelling the probability distribution of each class in an ordinal problem, as it can precisely represent both the extreme and the middle classes.
4. Experiments

4.1. Data
The ordinal classification of images has not been widely explored yet, and, therefore, there are not many ordinal image benchmark datasets that can be used to test our approach. We have evaluated the different proposals using the most well-known ordinal image datasets.
4.1.1. Diabetic Retinopathy
Diabetic Retinopathy (DR) is a dataset consisting of extremely high-resolution fundus image data. It was used in a Kaggle competition¹ and has been used in several previous works [32,33] as a benchmark dataset for ordinal classification. The training set consists of 17563 pairs of images (where a pair includes a left and a right eye image corresponding to a patient). In this dataset, we try to predict the correct category from five levels of DR: no DR (25810 images), mild DR (2443 images), moderate DR (5292 images), severe DR (873 images), or proliferative DR (708 images). The test set contains 26788 pairs of images, which are distributed in the same five classes with the following proportions: 39532, 3762, 7860, 1214 and 1208 images. These images are taken under variable conditions: different cameras, illumination conditions and resolutions. They come from the EyePACS dataset that was used in the DR detection competition hosted on the Kaggle platform. The images are resized to 128×128 pixels, and the value of each pixel is standardised using the mean and the standard deviation of the training set. Some images from the test set are presented in Fig. 3a.
4.1.2. Adience
The Adience² dataset consists of 26580 faces belonging to 2284 subjects. It has been used in previous works [34] for gender and
1 https://www.kaggle.com/c/diabetic-retinopathy-detection/data
2 http://www.openu.ac.il/home/hassner/Adience/data.html
Fig. 3. Examples from different classes taken from the test set of Diabetic Retinopathy and Adience.
Fig. 4. Network architecture.
age classification. The faces that appear in the original images of this dataset have been pre-cropped and aligned in order to ease the training process. Also, images have been resized to 256×256 pixels and contrast-normalised, and the distribution of the pixels was standardised. The original dataset was split into five cross-validation folds. The training set consists of the first four folds merged, which comprise a total of 15554 images. The last fold is used as the test set. Fig. 3b shows some images taken from the test set.
4.1.3. FGNet
FGNet³ is the smallest dataset considered in this work. It consists of 1002 128×128 colour images of faces from 82 different subjects. From these images, we took 80% for training and the remaining 20% for testing. These partitions were done in a stratified way. Each image was labelled with the exact age of the subject at the moment the picture was taken. We grouped these ages into six categories based on age ranges (0-3, 3-11, 11-16, 16-24, 24-40, >40).
4.2. Model
The model considered for this work is a Residual Convolutional Network [27], as it can achieve good generalisation capabilities with a reduced number of parameters. Figure 4 shows more details about the layers that compose the network architecture. Kernel size and stride are specified for each convolutional and pooling layer. The structure of a residual block (ResBlock NxNsS) is shown in Fig. 4 too. The output of each residual block is concatenated with its input. The parameters of every convolutional and batch
3https://yanweifu.github.io/FG _ NET _ data/index.html
normalisation layer are L2 regularised ($10^{-4}$ coefficient). He normal initialisation [35] has been used for the weights and biases of these layers. The global average pooling layer replaces each channel of its input with the mean value of all the pixels of the channel. This layer achieves a high reduction of data dimensionality, significantly reducing the number of parameters at the end of the network while maintaining good performance.
In the output layer of the model, two different alternatives are considered: (1) a dense layer with N units and the standard softmax function; (2) a dense layer with N−1 neurons and sigmoid activation, followed by the stick-breaking layer described in Section 2.1.
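As an illustration, a minimal Keras sketch of the two alternatives (the custom layer follows Eqs. (4)-(6); the layer and variable names are ours, and details may differ from the authors' public implementation):

```python
import tensorflow as tf

class StickBreakingLayer(tf.keras.layers.Layer):
    """Turns N-1 sigmoid outputs into N class probabilities (Eqs. (4)-(6))."""
    def call(self, s):
        # Stick remaining before each break: [1, (1-s_1), (1-s_1)(1-s_2), ...]
        remainder = tf.math.cumprod(1.0 - s, axis=-1, exclusive=True)
        last = tf.reduce_prod(1.0 - s, axis=-1, keepdims=True)  # p(C_N)
        return tf.concat([s * remainder, last], axis=-1)

n_classes = 5
features = tf.keras.Input(shape=(96,))  # hypothetical feature vector from the backbone
softmax_head = tf.keras.layers.Dense(n_classes, activation="softmax")(features)
sigmoids = tf.keras.layers.Dense(n_classes - 1, activation="sigmoid")(features)
stick_head = StickBreakingLayer()(sigmoids)
```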
4.3. Experimental design
The model described above is trained using the three datasets described in Section 4.1. The convolutional network model was optimised using the well-known batch-based first-order optimisation algorithm called Adam [36]. The initial learning rate of the optimiser ($\eta = 10^{-4}$) and the batch size (128) were adjusted by cross-validation. The training process is run for 100 epochs and repeated 10 times following a 10-fold validation scheme, where we take 9 folds for training and the remaining one for validation. To ease further comparison and make the reproducibility of the experiments possible, the folds considered were the same for all the experiments.

Using the aforementioned validation set, an early stopping mechanism is applied in order to stop the training process when the validation loss has not improved for 20 epochs. Also, when the validation loss has not decreased for 8 epochs, the learning rate is multiplied by a 0.5 factor, until it reaches $10^{-6}$.
Data augmentation techniques are applied, as previous works [37] have proved that they avoid model over-fitting and
Table 2
Results for each dataset and method (mean ± standard deviation).

Retinopathy
Method     QWK               CCR               1-off             MS                Time (s)
S CE       0.0839 ± 0.0182   0.7253 ± 0.0046   0.7987 ± 0.0043   0.0012 ± 0.0015   2329.31 ± 107.87
S CE-P     0.1018 ± 0.0161   0.6285 ± 0.0597   0.7398 ± 0.0514   0.0099 ± 0.0166   2415.96 ± 36.91
S CE-B     0.1249 ± 0.0199   0.6653 ± 0.0368   0.7959 ± 0.0204   0.0121 ± 0.0159   2402.81 ± 67.65
S CE-E     0.1082 ± 0.0182   0.6833 ± 0.0458   0.7904 ± 0.0129   0.0108 ± 0.0056   2371.53 ± 203.26
S CE-β     0.1115 ± 0.0152   0.7097 ± 0.0095   0.7846 ± 0.0082   0.0002 ± 0.0004   2140.22 ± 68.35
SB CE      0.0941 ± 0.0179   0.7145 ± 0.0121   0.7914 ± 0.0086   0.0046 ± 0.0034   1394.49 ± 119.91
SB CE-P    0.0937 ± 0.0197   0.6673 ± 0.0534   0.7757 ± 0.0338   0.0126 ± 0.0090   2421.71 ± 41.02
SB CE-B    0.1285 ± 0.0082   0.6762 ± 0.0133   0.7962 ± 0.0032   0.0067 ± 0.0033   2414.11 ± 47.94
SB CE-E    0.0989 ± 0.0202   0.6745 ± 0.0554   0.7993 ± 0.0168   0.0086 ± 0.0057   2425.29 ± 46.06
SB CE-β    0.1093 ± 0.0198   0.7122 ± 0.0120   0.7876 ± 0.0104   0.0002 ± 0.0005   2102.78 ± 108.06

Adience
Method     QWK               CCR               1-off             MS                Time (s)
S CE       0.6604 ± 0.0249   0.3903 ± 0.0267   0.7132 ± 0.0158   0.0296 ± 0.0204   972.12 ± 104.77
S CE-P     0.6558 ± 0.0370   0.3536 ± 0.0480   0.7095 ± 0.0435   0.0172 ± 0.0269   3416.035 ± 300.92
S CE-B     0.7403 ± 0.228    0.4063 ± 0.0290   0.7795 ± 0.0175   0.0296 ± 0.0191   3637.84 ± 19.22
S CE-E     0.7279 ± 0.0213   0.3997 ± 0.0442   0.7691 ± 0.0218   0.0382 ± 0.0287   3644.54 ± 24.84
S CE-β     0.7306 ± 0.0136   0.4160 ± 0.0128   0.7647 ± 0.0135   0.0938 ± 0.0481   2911.61 ± 474.94
SB CE      0.7016 ± 0.0162   0.3975 ± 0.0128   0.7477 ± 0.0072   0.0946 ± 0.0219   2717.54 ± 271.43
SB CE-P    0.5794 ± 0.0881   0.2954 ± 0.0705   0.6884 ± 0.0498   0.0172 ± 0.0233   3789.46 ± 453.72
SB CE-B    0.7133 ± 0.0598   0.3870 ± 0.0514   0.7662 ± 0.0305   0.0240 ± 0.0352   3386.15 ± 14.52
SB CE-E    0.7246 ± 0.0438   0.3915 ± 0.0622   0.7671 ± 0.0246   0.0490 ± 0.0295   3636.84 ± 19.71
SB CE-β    0.7416 ± 0.0078   0.4123 ± 0.0102   0.7640 ± 0.0082   0.0955 ± 0.0252   2643.58 ± 557.93

FG-Net
Method     QWK               CCR               1-off             MS                Time (s)
S CE       0.4855 ± 0.0689   0.3844 ± 0.0318   0.7449 ± 0.0282   0.1275 ± 0.0664   92.27 ± 35.16
S CE-P     0.4621 ± 0.0505   0.3355 ± 0.0318   0.7206 ± 0.0357   0.1267 ± 0.0801   121.63 ± 15.54
S CE-B     0.6452 ± 0.0378   0.3899 ± 0.0267   0.8182 ± 0.0319   0.1500 ± 0.0597   108.90 ± 17.68
S CE-E     0.6118 ± 0.0375   0.3894 ± 0.0262   0.7959 ± 0.0203   0.1764 ± 0.0678   111.27 ± 20.58
S CE-β     0.6037 ± 0.0551   0.3934 ± 0.0302   0.7959 ± 0.0258   0.2071 ± 0.0642   105.27 ± 24.41
SB CE      0.5478 ± 0.1650   0.3768 ± 0.0578   0.7529 ± 0.1035   0.1367 ± 0.0652   86.46 ± 16.16
SB CE-P    0.4907 ± 0.0677   0.3381 ± 0.0248   0.7342 ± 0.0279   0.1153 ± 0.0621   121.18 ± 19.14
SB CE-B    0.6594 ± 0.0394   0.3634 ± 0.0257   0.8212 ± 0.0256   0.1407 ± 0.0601   104.38 ± 17.48
SB CE-E    0.6293 ± 0.0467   0.3723 ± 0.0237   0.8048 ± 0.0259   0.1524 ± 0.0520   106.77 ± 19.26
SB CE-β    0.6416 ± 0.0334   0.3791 ± 0.0275   0.8027 ± 0.0267   0.1744 ± 0.0468   97.88 ± 17.60
reduce the amount of data needed to train a deep learning model. We considered the following transformations: horizontal flipping, random zoom in or out within a [−20%, 20%] range, and random width shifting within a [−10%, 10%] range. They are individually applied to every image in the training set with a certain probability. In this way, more than one transformation can be applied to the same image. Also, for the zoom in/out and the width shifting, the magnitude of the transformation is randomly selected from the ranges described.
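In Keras, these transformations roughly correspond to the following generator configuration (a sketch under the stated ranges; the per-transformation application probabilities of the authors' pipeline are not reproduced here):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,   # random horizontal flipping
    zoom_range=0.2,         # random zoom in/out within [-20%, +20%]
    width_shift_range=0.1,  # random width shift within [-10%, +10%]
)
# train_flow = augmenter.flow(x_train, y_train, batch_size=128)
```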
In terms of the loss function used for the optimisation algorithm, we have considered five different alternatives, all based on the standard cross-entropy loss:
• Standard cross-entropy (CE).
• Cross-entropy loss with Poisson regularisation (CE-P) [26].
• Cross-entropy loss with binomial regularisation (CE-B) [26].
• Cross-entropy loss with exponential regularisation (CE-E) [27].
• Cross-entropy loss with the beta regularisation (CE-β) proposed in this work (Section 3.1). The parameters used for the distribution are obtained using the method described in Section 3.2.

Since the datasets are imbalanced, the loss function is weighted based on the a priori probabilities of the classes (considering the number of instances of each class in the training set), following the method described in [38]. Classes with few samples have a higher weight than classes with many instances.
Considering the different alternatives for the output layer described in Section 4.2 and the separate loss functions described in this section, ten different experiments were run. As mentioned before, each of these experiments was repeated ten times using the described 10-fold cross-validation scheme. These experiments can be reproduced by running the code available in our public repository⁴.

5. Results
The results of the experiments described in Section 4 are presented in this section. The evaluation metrics used are the Quadratic Weighted Kappa (QWK) [33], the Correct Classification Rate (CCR) or accuracy, the Minimum Sensitivity (MS) [39], and the execution time. All the values presented in Table 2 are the mean and the standard deviation over all the executions run for each method on the test set. The experiments with softmax in the output layer are denoted as S, and the experiments using the stick-breaking scheme as SB. All the metrics must be maximised, except the execution time.
5.1. Statistical analysis

In this section, a statistical analysis has been carried out in order to obtain robust conclusions from the experimental results. Each of the metrics presented in Section 5 was analysed separately.
First, the Kolmogorov-Smirnov test for the QWK reported that the values of this metric are normally distributed. Then, an ANOVA
4 https://github.com/victormvy/beta-regularisation-cnn
Table 3
HSD Tukey's test results for QWK (30 samples for each method; columns are homogeneous subsets).

Method     1        2        3        4        5        6
S CE-P     0.4288
SB CE-P    0.4291
SB CE               0.5062
S CE                0.5111   0.5111
S CE-β                       0.5307   0.5307   0.5307
S CE-E                       0.5343   0.5343   0.5343
SB CE-E                               0.5423   0.5423   0.5423
SB CE-β                               0.5551   0.5551   0.5551
S CE-B                                         0.5602   0.5602
SB CE-B                                                 0.5640
p-values   0.1000   0.1450   0.1000   0.0640   0.0000
Table 4
HSD Tukey's test results for CCR (30 samples for each method; columns are homogeneous subsets).

Method     1        2        3        4
SB CE-P    0.3954
S CE-P     0.3977
SB CE-B             0.4307
SB CE-E             0.4366   0.4366
S CE-B              0.4483   0.4483   0.4483
S CE-E              0.4502   0.4502   0.4502
SB CE               0.4510   0.4510   0.4510
SB CE-β                      0.4523   0.4523
S CE                         0.4538   0.4538
S CE-β                                0.4612
p-values   0.0640   0.2130   0.6240
II test [40] with the method and the dataset as factors was performed, in order to check whether these factors had any impact on the value of this metric. The parametric test reported that both factors were significant (p-value < 0.001) and that there is an interaction between them.
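A sketch of how such an analysis could be reproduced with statsmodels, assuming the per-fold metric values are collected in a DataFrame with method and dataset columns (the file and column names are ours):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

results = pd.read_csv("results.csv")   # one row per run: qwk, method, dataset

# Two-way ANOVA with interaction between the two factors
model = ols("qwk ~ C(method) * C(dataset)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))

# Post hoc Tukey HSD test over the method factor
print(pairwise_tukeyhsd(results["qwk"], results["method"]))
```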
Given that the factors considered are significant, a post hoc HSD Tukey's test was performed [41]. The results of this test are shown in Table 3. The stick-breaking with cross-entropy binomial regularised loss (SB CE-B) obtained the best mean results. However, there are no significant differences with S CE-B, SB CE-β and SB CE-E.

The same analysis was carried out for the CCR metric. The Kolmogorov-Smirnov test reported that the values are normally distributed, and the ANOVA II test found a significant influence of the factors considered, as well as an interaction between them. The post hoc test results are shown in Table 4. In this case, the best methodology is the one that uses the softmax output layer combined with the beta regularisation for the cross-entropy loss (S CE-β). However, there are no significant differences with S CE, SB CE-β, SB CE, S CE-E and S CE-B.
In the case of the MS metric, the values are also normally distributed, and there are significant differences based on the two factors considered. The post hoc Tukey's test results are displayed in Table 5. Again, the best method was the one that uses softmax in the output layer and the beta regularised cross-entropy loss. However, there are no significant differences with SB CE-β and S CE-E.

Finally, the experiment time is also normally distributed, and the ANOVA II test reported significant differences based on the factors considered. The post hoc test (Table 6) showed that the method with the best average time is the standard softmax coupled with the cross-entropy, followed by the stick-breaking with cross-entropy. Within the methods that use regularisation, the beta regularised cross-entropy, with softmax or stick-breaking, is the one with the best time.
Table 5
HSD Tukey's test results for MS (30 samples for each method; columns are homogeneous subsets).

Method     1        2        3        4
SB CE-P    0.0752
S CE-P     0.0814
S CE       0.0827
SB CE-B    0.0906   0.0906
S CE-B     0.0983   0.0983
SB CE-E    0.1029   0.1029   0.1029
SB CE      0.1049   0.1049   0.1049
S CE-E              0.1156   0.1156   0.1156
SB CE-β                      0.1238   0.1238
S CE-β                                0.1430
p-values   0.2670   0.2480   0.1590
Table 6
HSD Tukey's test results for Time (30 samples for each method; columns are homogeneous subsets).

Method     1        2        3         4         5
S CE       715.65
SB CE               875.62
SB CE-β                      1007.00
S CE-β                       1073.53
SB CE-B                                1222.68
S CE-P                                 1239.38   1239.38
S CE-E                                 1269.98   1269.98
S CE-B                                 1273.47   1273.47
SB CE-E                                1276.49   1276.49
SB CE-P                                          1314.94
p-values   1.0000   0.3700   0.6580    0.1810
When we analyse the results of all the metrics combined, we find that the method that uses stick-breaking with the beta regularised loss (SB CE-β) achieves the best result for QWK and CCR, and the second best for MS. Also, as mentioned before, it obtains the best time among the methods that use regularisation. These facts turn this method into a competitive alternative that can be applied to solve other ordinal classification problems.
6. Conclusions

In this work, we have proposed the application of a unimodal regularisation based on beta distributions for the cross-entropy loss. The method described improved the performance on problems where classes follow a natural ordering. The proposed regularisation benefits from the fact that, in ordinal regression problems, misclassification tends to occur in adjacent classes and, consequently, slightly modifying the labels considering the ordinal scale should increase the robustness of the model in the presence of noisy targets. Thus, the main advantage of the proposed regularised loss is that it encourages the classification errors to be in the adjacent classes and minimises the number of errors in distant classes, achieving more accurate results for ordinal problems.
The distribution used to regularise the loss function has two parameters. Therefore, a method to automatically determine these parameters has been introduced. This method avoids learning them from the training data, thus improving the computational time with respect to other alternatives with free parameters to be adjusted. The parameters obtained through this method have been used for the label smoothing that has been applied as a regularisation method for the loss function and tested with three datasets and one CNN model. Even though the model used was a deep learning method, the proposal of this work is also suitable for other kinds of modelling techniques.
This regularised loss has been combined, on the one hand, with the standard softmax function in the output layer and, on the other,