Contents lists available at ScienceDirect

Pattern Recognition

Journal homepage: www.elsevier.com/locate/patcog

Unimodal regularisation based on beta distribution for deep ordinal regression

Víctor Manuel Vargas, Pedro Antonio Gutiérrez, César Hervás-Martínez

Department of Computer Science and Numerical Analysis, University of Córdoba, Campus Universitario de Rabanales, “Albert Einstein” building, 3rd floor, 14014 Córdoba, Spain

Article info

Article history: Received 3 March 2021; Revised 6 September 2021; Accepted 7 September 2021; Available online 10 September 2021

Keywords: Ordinal regression; Unimodal distribution; Convolutional network; Beta distribution; Stick-breaking

Abstract

Currently, the use of deep learning for solving ordinal classification problems, where categories follow a natural order, has not received much attention. In this paper, we propose a unimodal regularisation based on the beta distribution applied to the cross-entropy loss. This regularisation encourages the distribution of the labels to be a soft unimodal distribution, more appropriate for ordinal problems. Given that the beta distribution has two parameters that must be adjusted, a method to automatically determine them is proposed. The regularised loss function is used to train a deep neural network model with an ordinal scheme in the output layer. The results obtained are statistically analysed and show that the combination of these methods increases the performance in ordinal problems. Moreover, the proposed beta distribution performs better than other distributions proposed in previous works, also achieving a reduced computational cost.

© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

In the last decade, ordinal classification/regression has received increasing interest in the literature [1–3]. The methods focused on solving this kind of problem aim to determine the discrete category or ranking of a pattern on an ordinal scale. This ordinal scale is given by the natural ordering of the categories existing in the problem considered [4]. For instance, in medical problems where we obtain a diagnosis from images, the category is usually on an ordinal scale (e.g. Diabetic Retinopathy (DR) detection [5] with five levels of the disease). Another possible example is the prediction of the age range of people from photographs of their faces [6].

There are many real-world problems where the available data have an underlying ordinal structure. In [7], the authors aim to predict the level of Parkinson's disease based on volumetric images obtained through encephalograms. The patients are classified depending on the state of this pathology (1: healthy patient, 2: slight alteration, 3: more advanced alteration, etc.). In [8], the authors try to predict convective situations at the Madrid-Barajas airport in Spain, which is crucial for this kind of transportation facility as they can cause a severe impact on flight scheduling and safety. These situations can be present in several degrees, resulting in different classes that follow a natural order. Finally, in [9], the authors try to automatically detect prostate cancer of different degrees based on the Gleason Score, which is a standard for measuring the aggressiveness of this type of cancer. According to this metric, prostate cancer is divided into 5 categories, where the first one does not require treatment while the others do. Also, depending on the aggressiveness degree, the treatment must be different, and it is important to determine this level accurately to avoid excessive or insufficient treatment.

Corresponding author at: Campus Universitario de Rabanales, “Albert Einstein” building, 3rd floor, 14014 Córdoba, Spain.
E-mail address: [email protected] (V. Manuel Vargas).

In any of these real-world problems, misclassifying a pattern in an adjacent class is always less important than misclassifying it in distant classes. This is the main reason why taking the ordinal information into account when solving this kind of problem is essential. Moreover, when ordinal classifiers are used, the order of the labels is explicitly considered in the model, which generally accelerates the learning process and reduces the amount of data needed for training.

Machine learning methods based on Deep Learning [10] have been used for a wide variety of tasks. Deep Neural Networks (DNN) have the ability to obtain high-level abstract representations of low-level features. Each layer of the network extracts higher-level features from the previous layer. Specifically, Convolutional Neural Networks (CNN) take an image in grey-scale or RGB colour as input data and extract a set of features that are used to classify the pattern in one of the different categories or rankings. This kind of model has been used in problems related to image classification [11], speech recognition [12], control problems [13], object detection [14], privacy and security protection [15], recovery of human pose [16], semantic segmentation [17], image retrieval [18,19], visual recognition [20], etc.

https://doi.org/10.1016/j.patcog.2021.108310

However, the resolution of ordinal problems with deep learning models has not received much attention. The most common approach to solve ordinal data problems is to treat them as multi-class problems and optimise the model using the standard cross-entropy (CE) loss [21]. This approach has a major drawback: it does not take into account the order between categories. Some previous works [22,23] have explored different loss functions to address this problem. Another common approach is to convert the ordinal regression problem into a standard regression one [24]. This approach keeps the order between rankings but assumes that the discrete categories are continuous and equispaced.

Another problem when working with ordinal data is found in the way the labels are encoded. Usually, the one-hot encoding is used, which represents the label as a binary vector where the j-th element is 1 when the true class is j. However, this encoding incurs the same penalty for all misclassification errors, without taking into account the distance to the true label. A better approach for ordinal regression problems is to use a smooth label representation, in such a way that classes which are close to the real label produce a smaller error than classes that are far. This method, known as label smoothing or unimodal regularisation for the loss function, has been used in previous works, and different distribution functions have been used to model the shape of these smooth labels (Poisson, binomial [25] or exponential [26]).

In this work, we propose to use beta distributions for the label smoothing method, together with an approach to automatically determine the parameters of these distributions. To evaluate the performance of the proposed method, we use the unimodal regularised loss to train a CNN model with three separate image datasets. The unimodal regularisation is combined with the recently proposed stick-breaking scheme for the output layer [27] but also tested with the standard softmax function. As shown in the following sections, combining these two elements results in improved performance for ordinal problems with respect to previously published alternatives.

The rest of this paper is organised as follows: previous related works, including the stick-breaking and the loss regularisation, are described in Section 2; in Section 3, the new cross-entropy loss regularisation with the beta distribution is explained; in Section 4, the design of the experiments and the datasets used are described; in Section 5, the results of the experiments are shown and compared with previous works; and, finally, in Section 6, the conclusions of this work are presented.

2. Related works

2.1. Stick-breaking

The stick-breaking approach considers the problem of breaking a stick of length 1 into J segments. This methodology is related to non-parametric Bayesian methods and can be considered a subset of the random allocation processes [28]. Also, this method has been applied as a generalisation of the continuation ratio models [29].

In Latent Gaussian Models (LGMs), the latent variables follow a Gaussian distribution where the probability of each categorical variable is parametrised in terms of linear projections $\eta_1, \ldots, \eta_J$ [30], where $J$ is the number of classes of the problem. The probability of the first category is modelled as $\sigma(\eta_1)$, where $\sigma(x) = 1/(1 + e^{-x})$. This represents the first piece of the stick. The length of the remainder of the stick is $(1 - \sigma(\eta_1))$. The probability of the second category is a fraction $\sigma(\eta_2)$ of the remainder of the stick. Following this process, we can define the probability of each of the $J$ categories. The probabilities (stick lengths) are all positive and sum to one; they thus define a valid probability distribution. A different function for $\sigma(x)$ can be used (such as the probit function). However, the logit allows us to use efficient variational bounds. The stick-breaking parametrisation can be written compactly as:

$$p(y = C_j \mid \boldsymbol{\eta}) = \prod_{i=1}^{j-1} \left(1 - \sigma(\eta_i)\right) \sigma(\eta_j), \quad j = 1, \ldots, J. \tag{1}$$

Recently, a stick-breaking approach was presented in [27] as an alternative to the standard softmax for ordinal classification problems where the output distribution should be unimodal. The authors define a stick of unit length and sequentially break off parts of the stick. The lengths of the generated stick fragments can represent the discrete probabilities for each class.

In the first stick-breaking, two parts with lengths $\sigma(\eta_1)$ and $1 - \sigma(\eta_1)$ are created. These fragments represent the probability of the first class:

$$p(y_i = C_1) = \sigma(\eta_1) = \sigma(f_1(x)) = \frac{1}{1 + e^{-f_1(x)}}, \tag{2}$$

and the probability of the remaining classes in the ordinal scale ($y_i \succ C_1$):

$$p(y_i \succ C_1) = 1 - \sigma(\eta_1) = 1 - \sigma(f_1(x)) = 1 - \frac{1}{1 + e^{-f_1(x)}} = \frac{1}{1 + e^{f_1(x)}}. \tag{3}$$

Then, the remaining part $1 - \sigma(\eta_1)$ is broken, obtaining two parts of lengths $\sigma(\eta_2)(1 - \sigma(\eta_1))$ and $(1 - \sigma(\eta_2))(1 - \sigma(\eta_1))$.

This breaking process can be mathematically written as:

$$p(y = C_1 \mid \eta_1) = l_1 = \sigma(\eta_1), \tag{4}$$

$$p(y = C_j \mid \{\eta_k\}_{k=1}^{j}) = l_j = \sigma(\eta_j) \prod_{i=1}^{j-1} \left(1 - \sigma(\eta_i)\right), \quad j = 2, \ldots, J-1, \tag{5}$$

$$p(y = C_J \mid \{\eta_k\}_{k=1}^{J-1}) = l_J = \prod_{i=1}^{J-1} \left(1 - \sigma(\eta_i)\right) = \prod_{i=1}^{J-1} \frac{1}{1 + e^{f_i(x)}}, \tag{6}$$

where the length of each fragment can be used to formulate the probability of each class.
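To make the scheme concrete, the following is a minimal sketch (not taken from the authors' repository) of how Eqs. (4)-(6) turn $J-1$ network outputs into $J$ class probabilities; the example logits are arbitrary illustrative values:

```python
import numpy as np

def stick_breaking_probs(f):
    """Map J-1 outputs f_1, ..., f_{J-1} to J class probabilities
    following Eqs. (4)-(6)."""
    sigma = 1.0 / (1.0 + np.exp(-np.asarray(f, dtype=float)))
    remainder = 1.0          # length of the stick not yet broken off
    probs = []
    for s in sigma:
        probs.append(s * remainder)   # l_j = sigma(eta_j) * prod_i (1 - sigma(eta_i))
        remainder *= 1.0 - s
    probs.append(remainder)           # the last class keeps the rest, Eq. (6)
    return np.array(probs)

p = stick_breaking_probs([0.3, -0.1, 1.2, 0.5])  # J = 5 classes
assert np.isclose(p.sum(), 1.0)                  # a valid distribution
```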

The stick-breaking process is used for training deep ordinal neural networks [27]. To do this, the authors set $J-1$ output neurons for a problem with $J$ ranks or ordinal categories. $f_i(x)$ is a scalar denoting the $i$-th output of the neural network and replaces the linear projections ($\eta_i$) of the LGMs. The conventional cross-entropy loss, CE, can be used to train the model.

It can be derived that each output associated with $f_i(x)$ is actually the ratio:

$$\frac{p(y = C_i \mid x)}{p(y \succeq C_i \mid x)} = \frac{p(y = C_i \mid x)}{p(y = C_i \mid x) + p(y \succ C_i \mid x)} = \frac{\dfrac{e^{f_i(x)}}{1 + e^{f_i(x)}} \prod_{l=1}^{i-1} \dfrac{1}{1 + e^{f_l(x)}}}{\dfrac{e^{f_i(x)}}{1 + e^{f_i(x)}} \prod_{l=1}^{i-1} \dfrac{1}{1 + e^{f_l(x)}} + \prod_{l=1}^{i} \dfrac{1}{1 + e^{f_l(x)}}} = \frac{e^{f_i(x)}}{1 + e^{f_i(x)}} = \frac{1}{1 + e^{-f_i(x)}} = \sigma(f_i(x)). \tag{7}$$


Consequently, $f_i(x)$ can be interpreted as defining decision boundaries that try to separate the $i$-th class from all the classes that come after it. By doing so, the prediction is still a discrete probability.

An interesting property of this method is that, unlike other approaches that only output a single distribution value [25,31], it is more expressive, because each boundary between two adjacent classes has its own scalar output $f_i(x)$.

2.2. Unimodal regularisation

Label smoothing is a general regularisation to address the noisy label problem, which encourages the model to be less confident [27]. In the case of a one-hot label, the distribution of a label probability is $q(i) = \delta_{i,1}$, where 1 is the ground truth class and $\delta_{i,1}$ is a Dirac delta, which equals 1 for $i = 1$, and 0 otherwise.

This label smoothing can be applied to the cross-entropy loss, replacing $q(i)$ in

$$\mathcal{L} = \sum_{i=1}^{J} q(i) \left[ -\log p(y = C_i \mid x) \right] \tag{8}$$

with a more conservative target distribution:

$$\mathcal{L} = \sum_{i=1}^{J} q'(i) \left[ -\log p(y = C_i \mid x) \right], \tag{9}$$

where $q'(i) = (1 - \eta)\,\delta_{i,1} + \eta \frac{1}{J}$ and $\eta$ is the parameter that controls the linear combination.
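As an illustration, a minimal sketch of Eq. (9) with the uniform smoothing term; the probability vector and $\eta = 0.1$ below are arbitrary choices, not values from the paper:

```python
import numpy as np

def smoothed_cross_entropy(p, true_class, eta=0.1):
    """Cross-entropy against the smoothed target of Eq. (9):
    q(i) = (1 - eta) * delta_{i,true} + eta / J."""
    J = len(p)
    q = np.full(J, eta / J)          # uniform smoothing mass
    q[true_class] += 1.0 - eta       # remaining mass on the true class
    return float(-np.sum(q * np.log(p)))

p = np.array([0.7, 0.15, 0.08, 0.05, 0.02])  # model output for one pattern
print(smoothed_cross_entropy(p, true_class=0))
```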

In ordinal classification, errors in classifying a pattern of a given class are more likely to be caused by the classifier assigning it to the closest classes. Therefore, building unimodal distributions which have their mode in the centre of the interval, for the case of middle classes, or in the upper or lower bounds, for extreme classes, should result in a more accurate loss computation. Moreover, it is quite important that the probability distribution has small variance and that the majority of its probability mass is concentrated in the interval associated with the real class. In this way, the probability is drastically reduced as we move further from the correct class.

The distributions proposed in previous works [25–27] to model the targets in a soft manner have improved the performance of ordinal classifiers with respect to the standard one-hot encoding. However, they have high variance or do not offer the required flexibility to position the mode of each distribution in the centre of the class interval while preserving a small variance. Also, some of the proposed methods require different parameters to be adjusted experimentally.

In [25], the authors used Poisson distributions to model the probabilities. The mean and variance of this kind of distribution are both equal to the distribution parameter $\lambda$. Therefore, it has limited flexibility to obtain a small variance. For this reason, they also used the binomial distribution, which has two parameters: the number of classes, $J$, and the probability, $p$. Even though the mean ($Jp$) and the variance ($Jp(1-p)$) have different expressions, it is not easy to position the mode at the right point of the interval while obtaining a small variance. Finally, the authors of [27] proposed to sample from an exponential function $e^{-|i-l|/\tau}$, where $l$ is the class of the pattern and $i = 1, \ldots, J$, followed by a softmax normalisation. However, the value of $\tau$ must be adjusted experimentally and, in some cases, the probability mass is not sufficiently concentrated in the interval of the correct class.

To overcome the issues related to the Poisson and binomial distributions and the exponential function described, we propose in this work a set of probability distributions associated with the beta distribution, given that their variance is small and the domain of the distribution is between 0 and 1. As a graphical example, Fig. 1 illustrates the shape of the distribution associated with each class and type of distribution for a problem with five classes. For the discrete distributions, the class number is represented on the x axis, while the class intervals are used for the continuous distributions.

3. Proposed method

3.1. Beta regularised cross-entropy loss

The main idea behind this work is to use probability distributions to model the targets as unimodal distributions instead of using the one-hot encoding. In this way, we obtain the soft target distributions $q(i)$ discussed in Section 2. To do that, we consider the beta distribution, defined in the range $[0, 1]$, so there is no need to apply any normalisation and, also, it does not lead to high variance. The beta distribution has been applied to model the behaviour of random variables limited to intervals of finite length in a wide variety of disciplines. Some properties of this distribution are described below.

In its standard form, the beta distribution, $\beta(a,b)$, is a continuous distribution, and its probability density function (pdf) is:

$$f(x, a, b) = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}, \tag{10}$$

where $0 < x < 1$, $a > 0$ and $b > 0$. This is also known as the classical beta distribution or the Incomplete Beta function. The function $B(a,b)$ has the form:

$$B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\, dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}, \tag{11}$$

where $\Gamma(a) = (a-1)!$ for integer $a$. When $a, b > 1$, $f(x)$ has a unique mode at $\frac{a-1}{a+b-2}$ and is zero at $x = 0$ and $x = 1$. If $a = 1$ or $b = 1$, then $f(x)$ has a corresponding terminal value $b$ or $a$. Finally, if $a = b = 1$, then $f(x)$ becomes the uniform distribution.

Since the range of $f(x)$ is finite, all its moments exist. Its mean is given by the expression $E(x) = \frac{a}{a+b}$ and its variance is defined as $V(x) = \frac{ab}{(a+b)^2(a+b+1)}$.

In order to analyse the behaviour of this distribution, we consider an ordinal classification problem with five classes ($J = 5$). We assume that the distributions of the labels are beta distributions: class 1 takes random values in the range $[0, 0.2]$, class 2 in the range $[0.2, 0.4]$, class 3 in $[0.4, 0.6]$, class 4 in $[0.6, 0.8]$, and class 5 in $[0.8, 1.0]$.

First, we find the values of the parameters $a$ and $b$ which make the beta distribution associated with class 1 have its centre in the middle of the $[0, 0.2]$ interval. In this case, we chose $a = 1$ and $b = 9$, which leads to $E(x) = 0.1$ and $V(x) \approx 0.008$. The associated density function is given by:

$$f(x) = 9(1-x)^8, \quad 0 \le x \le 1. \tag{12}$$

In this way, the probability of each class can be calculated as follows:

$$p_1 = p(y = C_1) = \int_0^{0.2} 9(1-x)^8\, dx = 0.8658, \tag{13}$$

and, therefore, $p_2 = 0.1241$, $p_3 = 0.0098$, $p_4 = 2.6 \times 10^{-4}$ and $p_5 = 5.1 \times 10^{-7}$.
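These values can be checked numerically by integrating the pdf over each class interval, e.g. with `scipy` (a verification sketch, not part of the original paper):

```python
from scipy.stats import beta

a, b = 1, 9                               # beta(1, 9) centred on class 1
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]    # the five class intervals
probs = [beta.cdf(hi, a, b) - beta.cdf(lo, a, b)
         for lo, hi in zip(edges[:-1], edges[1:])]
print(probs)   # ~[0.8658, 0.1241, 0.0098, 2.6e-04, 5.1e-07]
```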

In the same way, we can compute the distributions and the probabilities associated with the other classes, finding the $a$ and $b$ parameters that make the distribution be centred in the intervals $[0.2, 0.4]$, $[0.4, 0.6]$, $[0.6, 0.8]$ and $[0.8, 1.0]$.

However, adjusting these parameters by trial and error is not the best method, as it requires additional computational time. Therefore, in the next section, an alternative method to determine $a$ and $b$ is proposed.


Fig. 1. Label distributions shape for N = 5 and different real classes: 0 (red), 1 (green), 2 (blue), 3 (purple), 4 (black). The x axis represents the labels for the discrete distributions and the class intervals for the beta. The y axis shows the probability for the Poisson, binomial and exponential distributions and the value of the probability density function for the beta distribution. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.2. Beta distribution parameters based on number of classes

In this section, we propose a method to find the parameters $(a, b)$ for each class of any ordinal problem based on the number of classes.

First, we define the thresholds of each class based on the number of classes ($J$). For any value of $J$, the centres of the intervals can be obtained as $1/(2J), 3/(2J), \ldots, (2J-1)/(2J)$. The length of the first interval should be $1/J$, and its mean value is $1/(2J)$. Consequently:

$$E(x) = \frac{a}{a+b} = \frac{1}{2J} \;\Rightarrow\; b = a(2J-1). \tag{14}$$

Then, the variance can be defined as:

$$V(x) = \frac{2J-1}{4J^2(2Ja+1)}, \tag{15}$$

and the standard deviation:

$$S(x) = \frac{1}{2J}\sqrt{\frac{2J-1}{2Ja+1}}. \tag{16}$$

We assume that most of the values of the distribution should be in the range $E(x) \pm S(x)$. In this way, we obtain the constraints for the first interval:

$$0 < \frac{1}{2J} \pm \frac{1}{2J}\sqrt{\frac{2J-1}{2Ja+1}} < \frac{1}{J}. \tag{17}$$

Solving the first inequality, we get:

$$\frac{2J-1}{2Ja+1} < 1 \;\Rightarrow\; a > \frac{J-1}{J}. \tag{18}$$

As a consequence, for any $J$, we can use $a = 1$ and $b = a(2J-1)$.

In the same way, we obtain the parameters for the second interval. Now, $E(x) = \frac{a}{a+b} = \frac{3}{2J}$ and $b = \frac{a(2J-3)}{3}$. The variance and the standard deviation are:

$$V(x) = \frac{9(2J-3)}{4J^2(2Ja+3)}, \tag{19}$$

$$S(x) = \frac{3}{2J}\sqrt{\frac{2J-3}{2Ja+3}}. \tag{20}$$

And the constraints for this interval are given by:

$$\frac{1}{J} < \frac{3}{2J} \pm \frac{3}{2J}\sqrt{\frac{2J-3}{2Ja+3}} < \frac{2}{J}, \tag{21}$$

which leads to:

$$\frac{2J-3}{2Ja+3} < \frac{1}{9} \;\Rightarrow\; a > \frac{9(2J-3) - 3}{2J}. \tag{22}$$

In the same way, we can obtain the parameters for the rest of the intervals. The parameters of the beta distributions for each class are shown in Table 1 for different numbers of classes ($J \le 8$).

Finally, the beta regularised cross-entropy loss can be expressed as:

$$\mathcal{L} = \sum_{i=1}^{J} q(i)\left[ -\log p(y = C_i \mid x) \right], \tag{23}$$

where $q(i) = (1 - \eta)\,\delta_{i,1} + \eta\, f(x, a, b)$, and $f(x, a, b)$ is the probability value sampled from a beta distribution that is centred in $x = \frac{2J-1}{2J}$ and uses the $a$ and $b$ parameters obtained using the method described in this section.
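As an illustrative sketch of Eq. (23), the soft target can be built by assigning each class the beta probability mass of its interval and mixing it with the one-hot label; the interval-mass interpretation of $f(x, a, b)$ and the choice $\eta = 1$ are assumptions for the example, not the authors' exact implementation:

```python
import numpy as np
from scipy.stats import beta

def beta_soft_targets(true_class, J, a, b, eta=1.0):
    """Soft targets in the spirit of Eq. (23): one-hot label mixed with
    the mass a beta(a, b) distribution puts on each class interval."""
    edges = np.linspace(0.0, 1.0, J + 1)
    mass = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
    one_hot = np.eye(J)[true_class]
    return (1.0 - eta) * one_hot + eta * mass

# class 1 of a 5-class problem with (a, b) = (1, 8) from Table 1
print(beta_soft_targets(true_class=0, J=5, a=1, b=8))
```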

3.3. Beta distribution properties regarding ordinal classes representation

The main benefit of using the beta distribution for modelling the probability distribution is the fact that it has two parameters that allow obtaining different distribution shapes with small


Table 1
Beta parameters $(a, b)$ for each class and each number of classes $J$.

J    C1      C2      C3       C4       C5       C6       C7      C8
3    (1,4)   (4,4)   (4,1)
4    (1,6)   (6,10)  (10,6)   (6,1)
5    (1,8)   (6,14)  (12,12)  (14,6)   (8,1)
6    (1,10)  (7,20)  (15,20)  (20,15)  (20,7)   (10,1)
7    (1,12)  (7,26)  (16,28)  (24,24)  (28,16)  (26,7)   (12,1)
8    (1,14)  (7,31)  (17,37)  (27,35)  (35,27)  (37,17)  (31,7)  (14,1)
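The parameters in Table 1 can be sanity-checked directly: for each class, the mean of the corresponding beta distribution should fall near the centre of its class interval, with a small standard deviation. A short check for $J = 5$ using `scipy` (a verification sketch, not from the original code):

```python
from scipy.stats import beta

params = [(1, 8), (6, 14), (12, 12), (14, 6), (8, 1)]   # row J = 5 of Table 1
J = len(params)
for j, (a, b) in enumerate(params):
    d = beta(a, b)                 # frozen distribution for class j + 1
    lo, hi = j / J, (j + 1) / J    # class interval [j/J, (j+1)/J]
    print(f"class {j + 1}: E(x)={d.mean():.3f}, S(x)={d.std():.3f}, "
          f"interval=[{lo:.1f}, {hi:.1f}]")
```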

Fig. 2. Beta distribution shapes for extreme, middle and intermediate classes.

variance. Thus, it can represent the distribution of extreme classes along with the symmetric distribution of the middle class. When $a = 1$, the probability density function is given by:

$$f(x, 1, b) = b(1-x)^{b-1}, \quad 0 \le x \le 1, \tag{24}$$

while the pdf for $b = 1$ is:

$$f(x, a, 1) = a x^{a-1}, \quad 0 \le x \le 1. \tag{25}$$

As shown in Fig. 2a, these functions can easily represent the distributions associated with the extreme classes.

On the other hand, the beta distribution with $a = b$ and $b \to \infty$ is similar to a normal distribution. Let the random variable $X$ be associated with a $\beta(b,b)$ distribution with probability density function:

$$f_X(x, b) = \frac{\Gamma(2b)\,\bigl(x(1-x)\bigr)^{b-1}}{\Gamma^2(b)}, \quad 0 \le x \le 1, \tag{26}$$

where $b$ is a real positive parameter. The mean of $X$ is $E[X] = 0.5$ and the variance of $X$ is $V[X] = \frac{1}{4(2b+1)}$.

Proposition. The $\beta(b,b)$ distribution converges to the normal distribution when $b \to \infty$, that is:

$$\beta(b,b) \xrightarrow{d} N\!\left(\frac{1}{2},\, \frac{1}{4(2b+1)}\right). \tag{27}$$

The proof of this proposition is included in Appendix A.
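The convergence can also be observed numerically: for a moderately large $b$, the $\beta(b,b)$ pdf is already close to the pdf of $N(1/2, 1/(4(2b+1)))$. An illustrative check, with $b = 50$ chosen arbitrarily:

```python
import numpy as np
from scipy.stats import beta, norm

b = 50
x = np.linspace(0.01, 0.99, 99)
# normal pdf with the mean and variance of beta(b, b), Eq. (27)
normal = norm.pdf(x, loc=0.5, scale=np.sqrt(1.0 / (4 * (2 * b + 1))))
exact = beta.pdf(x, b, b)
print(np.max(np.abs(exact - normal)))   # small, and shrinking as b grows
```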

Therefore, the beta distribution can also accurately represent the distribution of the middle class through a symmetric distribution when the problem has an odd number of classes, keeping the variance small (see Fig. 2b). When the value of $b$ increases, the variance of the distribution becomes smaller. The symmetry of the distribution in the aforementioned cases can be easily checked with the skewness coefficient, which is calculated as:

$$\text{Skewness} = \frac{2(b-a)\sqrt{a+b+1}}{(a+b+2)\sqrt{ab}}, \tag{28}$$

which is zero due to the fact that $a = b$.

As mentioned before, the rest of the classes have asymmetrical distributions with small variance, as can be observed in Fig. 2c. However, when the number of classes increases, the resulting distributions get closer to a normal distribution with small variance (see the distributions for $(17,37)$ or $(37,17)$ in Fig. 2c).

The properties described in this section make the beta distribution an excellent choice for modelling the probability distribution of each class in an ordinal problem, as it can precisely represent both the extreme and the middle classes.

4. Experiments

4.1. Data

The ordinal classification of images has not been widely explored yet and, therefore, there are not many ordinal image benchmark datasets that can be used to test our approach. We have evaluated the different proposals using the most well-known ordinal image datasets.

4.1.1. Diabetic Retinopathy

Diabetic Retinopathy (DR) is a dataset consisting of extremely high-resolution fundus images. It was used in a Kaggle competition¹ and has been used in several previous works [32,33] as a benchmark dataset for ordinal classification. The training set consists of 17563 pairs of images (where a pair includes a left and a right eye image corresponding to a patient). In this dataset, we try to predict the correct category from five levels of DR: no DR (25810 images), mild DR (2443 images), moderate DR (5292 images), severe DR (873 images), or proliferative DR (708 images). The test set contains 26788 pairs of images, which are distributed in the same five classes with the following proportions: 39532, 3762, 7860, 1214 and 1208 images. These images are taken in variable conditions: by different cameras and under different conditions of illumination and resolution. They come from the EyePACS dataset that was used in the DR detection competition hosted on the Kaggle platform. The images are resized to 128 × 128 pixels and the value of each pixel is standardised using the mean and the standard deviation of the training set. Some images from the test set are presented in Fig. 3a.

4.1.2. Adience

The Adience² dataset consists of 26580 faces belonging to 2284 subjects. It has been used in previous works [34] for gender and age classification. The faces that appear in the original images of this dataset have been pre-cropped and aligned in order to ease the training process. Also, the images have been resized to 256 × 256 pixels and contrast-normalised, and the distribution of the pixels was standardised. The original dataset was split into five cross-validation folds. The training set consists of merging the first four folds, which comprise a total of 15554 images. The last fold is used as the test set. Fig. 3b shows some images taken from the test set.

¹ https://www.kaggle.com/c/diabetic-retinopathy-detection/data
² http://www.openu.ac.il/home/hassner/Adience/data.html

Fig. 3. Examples from different classes taken from the test set of Diabetic Retinopathy and Adience.

Fig. 4. Network architecture.

4.1.3. FGNet

FGNet³ is the smallest dataset considered in this work. It consists of 1002 colour images of faces, of size 128 × 128, from 82 different subjects. From these images, we took 80% for training and the remaining 20% for testing. These partitions were done in a stratified way. Each image was labelled with the exact age of the subject at the moment the picture was taken. We grouped these ages into six categories based on age ranges (0-3, 3-11, 11-16, 16-24, 24-40, >40).

4.2. Model

The model considered for this work is a Residual Convolutional Network [27], as it can achieve good generalisation capabilities with a reduced number of parameters. Figure 4 shows more details about the layers that compose the network architecture. Kernel size and stride are specified for each convolutional and pooling layer. The structure of a residual block (ResBlock NxNsS) is shown in Fig. 4 too. The output of each residual block is concatenated with its input. The parameters of every convolutional and batch normalisation layer are L2-regularised ($10^{-4}$). He normal initialisation [35] has been used for the weights and biases of these layers. The global average pooling layer replaces each channel of its input with the mean value of all the pixels of the channel. This layer achieves a high reduction of data dimensionality, significantly reducing the number of parameters at the end of the network while obtaining good performance.

In the output layer of the model, two different alternatives are considered: (1) a dense layer with $N$ units and the standard softmax function; (2) a dense layer with $N-1$ neurons and sigmoid activation, followed by the stick-breaking layer described in Section 2.1.

³ https://yanweifu.github.io/FG_NET_data/index.html

4.3. Experimental design

The model described above is trained using the three datasets described in Section 4.1. The convolutional network model was optimised using the well-known batch-based first-order optimisation algorithm called Adam [36]. The initial learning rate ($\eta = 10^{-4}$) of the optimiser and the batch size (128) were adjusted by cross-validation. The training process is run for 100 epochs and repeated 10 times following a 10-fold validation scheme, where we take 9 folds for training and the remaining one for validation. To ease further comparison and make the reproducibility of the experiments possible, the folds considered were the same for all the experiments.

Using the aforementioned validation set, an early stopping mechanism is applied in order to stop the training process when the validation loss has not improved for 20 epochs. Also, when the validation loss has not decreased for 8 epochs, the learning rate is multiplied by a 0.5 factor, until it reaches $10^{-6}$.
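These settings map directly onto standard Keras components; a minimal sketch under that assumption (the model object, loss and data names are placeholders, and the authors' actual code, available in their repository, may differ):

```python
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=1e-4)   # initial lr from Section 4.3
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=20),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                      patience=8, min_lr=1e-6),
]
# model.compile(optimizer=optimizer, loss=regularised_ce)   # placeholders
# model.fit(x_train, y_train, batch_size=128, epochs=100,
#           validation_data=(x_val, y_val), callbacks=callbacks)
```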

Data augmentation techniques are applied, as previous works [37] have proved that they avoid model over-fitting and reduce the amount of data needed to train a deep learning model.


Table 2
Results for each dataset and method (mean ± standard deviation).

Retinopathy
Method    QWK              CCR              1-off            MS               Time (s)
S CE      0.0839 ± 0.0182  0.7253 ± 0.0046  0.7987 ± 0.0043  0.0012 ± 0.0015  2329.31 ± 107.87
S CE-P    0.1018 ± 0.0161  0.6285 ± 0.0597  0.7398 ± 0.0514  0.0099 ± 0.0166  2415.96 ± 36.91
S CE-B    0.1249 ± 0.0199  0.6653 ± 0.0368  0.7959 ± 0.0204  0.0121 ± 0.0159  2402.81 ± 67.65
S CE-E    0.1082 ± 0.0182  0.6833 ± 0.0458  0.7904 ± 0.0129  0.0108 ± 0.0056  2371.53 ± 203.26
S CE-β    0.1115 ± 0.0152  0.7097 ± 0.0095  0.7846 ± 0.0082  0.0002 ± 0.0004  2140.22 ± 68.35
SB CE     0.0941 ± 0.0179  0.7145 ± 0.0121  0.7914 ± 0.0086  0.0046 ± 0.0034  1394.49 ± 119.91
SB CE-P   0.0937 ± 0.0197  0.6673 ± 0.0534  0.7757 ± 0.0338  0.0126 ± 0.0090  2421.71 ± 41.02
SB CE-B   0.1285 ± 0.0082  0.6762 ± 0.0133  0.7962 ± 0.0032  0.0067 ± 0.0033  2414.11 ± 47.94
SB CE-E   0.0989 ± 0.0202  0.6745 ± 0.0554  0.7993 ± 0.0168  0.0086 ± 0.0057  2425.29 ± 46.06
SB CE-β   0.1093 ± 0.0198  0.7122 ± 0.0120  0.7876 ± 0.0104  0.0002 ± 0.0005  2102.78 ± 108.06

Adience
Method    QWK              CCR              1-off            MS               Time (s)
S CE      0.6604 ± 0.0249  0.3903 ± 0.0267  0.7132 ± 0.0158  0.0296 ± 0.0204  972.12 ± 104.77
S CE-P    0.6558 ± 0.0370  0.3536 ± 0.0480  0.7095 ± 0.0435  0.0172 ± 0.0269  3416.04 ± 300.92
S CE-B    0.7403 ± 0.0228  0.4063 ± 0.0290  0.7795 ± 0.0175  0.0296 ± 0.0191  3637.84 ± 19.22
S CE-E    0.7279 ± 0.0213  0.3997 ± 0.0442  0.7691 ± 0.0218  0.0382 ± 0.0287  3644.54 ± 24.84
S CE-β    0.7306 ± 0.0136  0.4160 ± 0.0128  0.7647 ± 0.0135  0.0938 ± 0.0481  2911.61 ± 474.94
SB CE     0.7016 ± 0.0162  0.3975 ± 0.0128  0.7477 ± 0.0072  0.0946 ± 0.0219  2717.54 ± 271.43
SB CE-P   0.5794 ± 0.0881  0.2954 ± 0.0705  0.6884 ± 0.0498  0.0172 ± 0.0233  3789.46 ± 453.72
SB CE-B   0.7133 ± 0.0598  0.3870 ± 0.0514  0.7662 ± 0.0305  0.0240 ± 0.0352  3386.15 ± 14.52
SB CE-E   0.7246 ± 0.0438  0.3915 ± 0.0622  0.7671 ± 0.0246  0.0490 ± 0.0295  3636.84 ± 19.71
SB CE-β   0.7416 ± 0.0078  0.4123 ± 0.0102  0.7640 ± 0.0082  0.0955 ± 0.0252  2643.58 ± 557.93

FG-Net
Method    QWK              CCR              1-off            MS               Time (s)
S CE      0.4855 ± 0.0689  0.3844 ± 0.0318  0.7449 ± 0.0282  0.1275 ± 0.0664  92.27 ± 35.16
S CE-P    0.4621 ± 0.0505  0.3355 ± 0.0318  0.7206 ± 0.0357  0.1267 ± 0.0801  121.63 ± 15.54
S CE-B    0.6452 ± 0.0378  0.3899 ± 0.0267  0.8182 ± 0.0319  0.1500 ± 0.0597  108.90 ± 17.68
S CE-E    0.6118 ± 0.0375  0.3894 ± 0.0262  0.7959 ± 0.0203  0.1764 ± 0.0678  111.27 ± 20.58
S CE-β    0.6037 ± 0.0551  0.3934 ± 0.0302  0.7959 ± 0.0258  0.2071 ± 0.0642  105.27 ± 24.41
SB CE     0.5478 ± 0.1650  0.3768 ± 0.0578  0.7529 ± 0.1035  0.1367 ± 0.0652  86.46 ± 16.16
SB CE-P   0.4907 ± 0.0677  0.3381 ± 0.0248  0.7342 ± 0.0279  0.1153 ± 0.0621  121.18 ± 19.14
SB CE-B   0.6594 ± 0.0394  0.3634 ± 0.0257  0.8212 ± 0.0256  0.1407 ± 0.0601  104.38 ± 17.48
SB CE-E   0.6293 ± 0.0467  0.3723 ± 0.0237  0.8048 ± 0.0259  0.1524 ± 0.0520  106.77 ± 19.26
SB CE-β   0.6416 ± 0.0334  0.3791 ± 0.0275  0.8027 ± 0.0267  0.1744 ± 0.0468  97.88 ± 17.60

We considered the following transformations: horizontal flipping, random zoom in or out within a [−20%, 20%] range, and random width shifting within a [−10%, 10%] range. They are individually applied to every image in the training set with a certain probability. In this way, more than one transformation can be applied to the same image. Also, for the zoom in/out and the width shifting, the magnitude of the transformation is randomly selected from the ranges described.
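One possible way to express these transformations, assuming a Keras pipeline (`ImageDataGenerator` applies each transform with its own randomness, which approximates but may not exactly match the per-image probabilities used in the paper):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,     # random horizontal flipping
    zoom_range=0.2,           # zoom in/out within [-20%, 20%]
    width_shift_range=0.1,    # width shift within [-10%, 10%]
)
# batches = datagen.flow(x_train, y_train, batch_size=128)  # placeholder data
```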

In terms of the loss function used for the optimisation algorithm, we have considered five different alternatives, all based on the standard cross-entropy loss:

• Standard cross-entropy (CE).
• Cross-entropy loss with Poisson regularisation (CE-P) [26].
• Cross-entropy loss with binomial regularisation (CE-B) [26].
• Cross-entropy loss with exponential regularisation (CE-E) [27].
• Cross-entropy loss with the beta regularisation (CE-β) proposed in this work (Section 3.1). The parameters used for the distribution are obtained using the method described in Section 3.2.

Since the datasets are imbalanced, the loss function is weighted based on the a priori probabilities of the classes (considering the number of instances of each class in the training set), following the method described in [38]. Classes with few samples have a higher weight than classes with many instances; one common formulation of such inverse-frequency weighting is sketched below.
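A minimal sketch, assuming the standard inverse-frequency heuristic (the exact expression in [38] may differ):

```python
import numpy as np

def balanced_class_weights(y, J):
    """Weight each class inversely to its frequency in the training set,
    so minority classes contribute more to the loss."""
    counts = np.bincount(y, minlength=J).astype(float)
    return len(y) / (J * counts)

y_train = np.array([0] * 100 + [1] * 30 + [2] * 10)   # toy imbalanced labels
print(balanced_class_weights(y_train, J=3))           # rare classes weigh more
```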

Considering the different alternatives for the output layer described in Section 4.2 and the separate loss functions described in this section, ten different experiments were run. As mentioned before, each of these experiments was repeated ten times using the described 10-fold cross-validation scheme. These experiments can be reproduced by running the code available in our public repository⁴.

⁴ https://github.com/victormvy/beta-regularisation-cnn

5. Results

The results of the experiments described in Section 4 are presented in this section. The evaluation metrics used are the Quadratic Weighted Kappa (QWK) [33], the Correct Classification Rate (CCR) or accuracy, the Minimum Sensitivity (MS) [39] and the execution time. All the values presented in Table 2 are the mean and the standard deviation of all the executions run for each method on the test set. The experiments with softmax in the output layer are denoted as S and the experiments using the stick-breaking scheme as SB. All the metrics must be maximised, except the execution time.

5.1. Statistical analysis

In this section, a statistical analysis has been carried out in order to obtain robust conclusions from the experimental results. Each of the metrics presented in Section 5 was analysed separately.

First, the Kolmogorov-Smirnov test for the QWK reported that the values of this metric are normally distributed. Then, an ANOVA II test [40] with the method and the dataset as factors was performed, in order to check whether these factors had any impact on the value of this metric. The parametric test reported that both factors were significant (p-value < 0.001) and that there is an interaction between them.


Table 3
HSD Tukey's test results for QWK (30 samples for each method); means listed on the same subset column do not differ significantly.

Method    Subsets (1-6)
S CE-P    0.4288
SB CE-P   0.4291
SB CE     0.5062
S CE      0.5111  0.5111
S CE-β    0.5307  0.5307  0.5307
S CE-E    0.5343  0.5343  0.5343
SB CE-E   0.5423  0.5423  0.5423
SB CE-β   0.5551  0.5551  0.5551
S CE-B    0.5602  0.5602
SB CE-B   0.5640
p-values  0.1000  0.1450  0.1000  0.0640  0.0000

Table 4

HSD Tukey’s test results for CCR (30 samples for each method).

Method Subsets

1 2 3 4

SB CE-P 0.3954

S CE-P 0.3977

SB CE-B 0.4307

SB CE-E 0.4366 0.4366

S CE-B 0.4483 0.4483 0.4483

S CE-E 0.4502 0.4502 0.4502

SB CE 0.4510 0.4510 0.4510

SB CE- β 0.4523 0.4523

S CE 0.4538 0.4538

S CE- β 0.4612

p-values 0.0640 0.2130 0.6240

Given that the factors considered are significant, a post hoc HSD Tukey's test was performed [41]. The results of this test are shown in Table 3. The stick-breaking with cross-entropy binomial regularised loss (SB CE-B) obtained the best mean results. However, there are no significant differences with S CE-B, SB CE-β and SB CE-E.

The same analysis was carried out for the CCR metric. The Kolmogorov-Smirnov test reported that the values are normally distributed, and the ANOVA II test found a significant influence of the factors considered, as well as an interaction between them. The post hoc test results are shown in Table 4. In this case, the best methodology is the one that uses the softmax output layer combined with the beta regularisation for the cross-entropy loss (S CE-β). However, there are no significant differences with S CE, SB CE-β, SB CE, S CE-E and S CE-B.

In the case of the MS metric, the values are also normally distributed, and there are significant differences based on the two factors considered. The post hoc Tukey's test results are displayed in Table 5. Again, the best method was the one that uses softmax in the output layer and the beta regularised cross-entropy loss (S CE-β). However, there are no significant differences with SB CE-β and S CE-E.

Finally, the experiment time is also normally distributed, and the ANOVA II test reported significant differences based on the factors considered. The post hoc test (Table 6) showed that the method with the best average time is the standard softmax coupled with the cross-entropy, followed by the stick-breaking with cross-entropy. Within the methods that use regularisation, the beta regularised cross-entropy, with softmax or stick-breaking, is the one with the best time.

Table 5

HSD Tukey’s test results for MS (30 samples for each method).

Method Subsets

1 2 3 4

SB CE-P 0.0752

S CE-P 0.0814

S CE 0.0827

SB CE-B 0.0906 0.0906

S CE-B 0.0983 0.0983

S CE-E 0.1029 0.1029 0.1029

SB CE 0.1049 0.1049 0.1049

S CE-E 0.1156 0.1156 0.1156

SB CE- β 0.1238 0.1238

S CE- β 0.1430

p-values 0.2670 0.2480 0.1590

Table 6

HSD Tukey’s test results for Time (30 samples for each method).

Method Subsets

1 2 3 4 5

S CE 715.65

SB CE 875.62

SB CE- β 1007.00

S CE- β 1073.53

SB CE-B 1222.68

S CE-P 1239.38 1239.38

S CE-E 1269.98 1269.98

S CE-B 1273.47 1273.47

SB CE-E 1276.49 1276.49

SB CE-P 1314.94

p-values 1.0000 0.3700 0.6580 0.1810

When we analyse the results of all the metrics combined, we find that the method that uses stick-breaking with the beta regularised loss (SB CE-β) achieves the best result for QWK and CCR, and the second best for MS. Also, as mentioned before, it obtains the best time among the methods that use regularisation. These facts turn this method into a competitive alternative that can be applied to solve other ordinal classification problems.

6. Conclusions

In this work, we have proposed the application of a unimodal regularisation based on beta distributions for the cross-entropy loss. The method described improved the performance on problems where classes follow a natural ordering. The proposed regularisation benefits from the fact that, in ordinal regression problems, misclassification tends to occur in adjacent classes, and, consequently, slightly modifying the labels considering the ordinal scale should increase the robustness of the model in the presence of noisy targets. Thus, the main advantage of the proposed regularised loss is that it encourages the classification errors to be in the adjacent classes and minimises the number of errors in distant classes, achieving more accurate results for ordinal problems.

The distribution used to regularise the loss function has two parameters. Therefore, a method to automatically determine these parameters has been introduced. This method avoids learning them from the training data, thus improving the computational time with respect to other alternatives with free parameters to be adjusted. The parameters obtained through this method have been used for the label smoothing that has been applied as a regularisation method for the loss function and tested with three datasets and one CNN model. Even though the model used was a deep learning method, the proposal of this work is also suitable for other kinds of modelling techniques.

This regularised loss has been combined, on the one hand, with the standard softmax function in the output layer and, on the other, with the stick-breaking scheme described in Section 2.1.
