• No se han encontrado resultados

A clinically viable vendor-independent and device-agnostic solution for accelerated cardiac MRI reconstruction

N/A
N/A
Protected

Academic year: 2022

Share "A clinically viable vendor-independent and device-agnostic solution for accelerated cardiac MRI reconstruction"

Copied!
13
0
0

Texto completo

(1)

ContentslistsavailableatScienceDirect

Computer Methods and Programs in Biomedicine

journalhomepage:www.elsevier.com/locate/cmpb

A clinically viable vendor-independent and device-agnostic solution for accelerated cardiac MRI reconstruction

Elena Martín-González

a

, Elisa Moya-Sáez

a

, Rosa-María Menchón-Lara

a

,

Javier Royuela-del-Val

c

, César Palencia-de-Lara

b

, Manuel Rodríguez-Cayetano

a

, Federico Simmross-Wattenberg

a,

, Carlos Alberola-López

a

a Laboratorio de Procesado de Imagen (Image Processing Laboratory), Universidad de Valladolid, Valladolid 47011, Spain

b Departamento de Matemática Aplicada, Universidad de Valladolid, Valladolid, Spain

c Health Time Group, Córdoba, Spain

a rt i c l e i nf o

Article history:

Received 18 December 2020 Accepted 25 April 2021

Keywords:

Magnetic resonance imaging GPU

OpenCLIPER OpenCL

a b s t r a c t

Backgroundandobjective:RecentresearchhasreportedmethodsthatreconstructcardiacMRimagesac- quiredwithaccelerationfactorsashighas15inCartesiancoordinates.However,thecomputationalcost ofthesetechniquesisquitehigh,takingabout40minofCPUtimeinatypicalcurrentmachine.Thisde- laybetweenacquisitionandfinalresultcancompletelyruleouttheuseofMRIinclinicalenvironments infavorofothertechniques,suchasCT.Inspiteofthis,reconstructionmethodsreportedelsewherecan beparallelizedtoahighdegree,afactthatmakesthemsuitableforGPU-typecomputingdevices.This papercontributesavendor-independent,device-agnosticimplementationofsuchamethodtoreconstruct 2Dmotion-compensated,compressed-sensingMRIsequencesinclinicallyviabletimes.

Methods:ByleveragingourOpenCLIPERframework,theproposedsystemworksinanycomputingdevice (CPU,GPU,DSP,FPGA,etc.),aslongasanOpenCLimplementationisavailable,anddevelopmentissig- nificantlysimplifiedversusapureOpenCLimplementation.InOpenCLIPER,theproblemispartitionedin independentblackboxeswhichmaybeconnectedasneeded,whiledeviceinitializationandmaintenance ishandledautomatically.ParallelimplementationsofbothagroupwiseFFD-basedregistrationmethod,as wellasamulticoilextensionoftheNESTAalgorithmhavebeencarriedoutasprocessesofOpenCLIPER.

Ourplatformalsoincludes significantdevelopmentand debugging aids.HIP code andprecompiledli- braries canbe integratedseamlessly as wellsince OpenCLIPER makesdataobjectsshareable between OpenCLand HIP.ThisalsoopensanopportunitytoincludeCUDAsourcecode(viaHIP)inprospective developments.

Results:Theproposedsolutioncanreconstructawhole12–14sliceCINEvolumeacquiredin19–32coils and20phases,withanaccelerationfactorofranging4–8,inafewseconds,withresultscomparableto anotherpopularplatform(BART).Ifmotioncompensationisincluded,reconstructiontimeisintheorder ofoneminute.

Conclusions:Wehaveobtainedclinically-viabletimesinGPUsfromdifferentvendors,withdelaysinsome platforms thatdo not have correspondencewith its priceinthe market.We alsocontribute aparal- lelgroupwiseregistrationsubsystemformotionestimation/compensationandaparallelmulticoilNESTA subsystemforl1− l2-normproblemsolving.

© 2021TheAuthors.PublishedbyElsevierB.V.

ThisisanopenaccessarticleundertheCCBYlicense(http://creativecommons.org/licenses/by/4.0/)

Corresponding author.

E-mail addresses: emargon@lpi.tel.uva.es , marcma@lpi.tel.uva.es (E. Martín- González), emoysae@lpi.tel.uva.es (E. Moya-Sáez), rmenchon@lpi.tel.uva.es (R.-M.

Menchón-Lara), j.royuela.v@htime.org (J. Royuela-del-Val), palencia@mac.uva.es (C. Palencia-de-Lara), manrod@lpi.tel.uva.es (M. Rodríguez-Cayetano),

fedsim@tel.uva.es , fedesim@lpi.tel.uva.es (F. Simmross-Wattenberg), carlos@lpi.tel.uva.es (C. Alberola-López).

https://doi.org/10.1016/j.cmpb.2021.106143

0169-2607/© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ )

(2)

1. Introduction

Magneticresonance(MR) isa high-qualityandhighlyversatile medical imaging modality.Not only anatomical images with dif- ferent contrasts can be acquired, butalso a numberof properly- designedpulsesequences provide additionalinformation,such as diffusion, perfusion, fMRI and a few others. Dynamic imaging is alsopossibleandMRcurrentlyconstitutesthegoldenstandardfor functional studies of the heart [1]. However, its major drawback is its inherent slowness relative to other alternatives such as CT andUS.AcompletestudyinMRImaytake30–45min,whereasCT takesafewminutesandUSisobservedinrealtime.Thishasthe consequenceofpatientdiscomfort,whichleadstomotionartifacts intheimages,theneedofsynchronizationfordynamicstudiesand poses additionaldifficulties for imagingnon-cooperative patients, suchaschildrenortheelderly.

This being the case, the MR community is doing a consider- ableefforttoincreasethevelocityofMRacquisitionandtodimin- ish the need of additional hardware (such asnavigators, ECG or PPGsynchronization, etc.).Apartfromimprovementsinthehard- ware of the scanners —that provide higher and more homoge- neous fields, more intense and faster gradients, etc.—, many dif- ferentsoftwaresolutionshavebeendescribed,whichincludepar- allelimaging(PI)andcompressedsensing(CS).Intermsofrecon- struction[2],differentalgorithmsarecapableofreconstructingim- ages fromundersampledk-spaceinformation; leavingasidemod- ern deeplearning-basedapproaches, forwhichtheirbottleneckis thetrainingstage,classicalalgorithmsareoptimization-basedand theymaybecomputationallydemanding,dependingontheircom- plexity. Hence fast implementations and appropriate computing hardwaremake adifference soastoobtain solutionsinclinically viable times. Here,clinical viable time is interpreted asthe time takentoreconstructanimagethatiscompatiblewithrepeatingan acquisition —wheneverthereconstructionisconsideredofinsuffi- cientquality— withoutcausinganunacceptableoverloadinclinical routine1.The MRreconstruction workloadturns out to be highly parallelizable so proper implementations on natively-parallel de- vices, such as(butnot limitedto) GPUs,are currentlyplayingan outstandingrole.

Manydifferentattemptshavebeenreportedtospeeduprecon- struction algorithms by means of parallel devices.Section 2 pro- vides an overview of the field. As of today, the field of paral- lel computinghasseenatremendousgrowthoftailoredsolutions that workexclusivelyon aparticularvendorGPU,namely,nVidia.

Nevertheless,other alternatives exist.Oneofthem isOpenCL [4]; OpenCL is an open standard for parallel computing, similar in its objectives to the nVidia-exclusive language CUDA, but covers a much wider set of computing devices (including nVidia GPUs) while ensuringthat sourcecodeanddataareunique,thus avoid- ing the need for code and data replication among prospectively supporteddevices.However,itisgenerallyregardedasmorecom- plicated andlessmaturethanCUDA. Asa matterof fact,OpenCL programmers must deal on their own with device selection and initialization, memory management, kernel loading andcompila- tion,host-deviceinteraction,andadministrationoverload.

Anotherexample isHIP [5];thisinitiative helpsdevelopers in convertingCUDAsourcecodeintoamoreportableform,sothatit canrun onbothnVidiaandAMDGPUs.However,asoftoday,the HIP API is not compatible with OpenCL, so data objects defined inOpenCL cannotbeusedasparameterstoHIP kernelsandvice- versa,eventhoughOpenCL dataisundistinguishable fromHIP(or CUDA)dataintheGPUmemory.

1 Issues related to DICOM interoperability are a step further and well-known so- lutions are available [3] . We concentrate on operations on the reconstruction prob- lem once all the necessary data is available.

The framework OpenCLIPER [6] was developed to address all theseshortcomingsofOpenCL.Inthispaper,OpenCLIPERhasbeen used to provide a parallel implementation of our previous pro- posal [7] on 2D cardiac cine imaging. This algorithm combines CS with Groupwise Motion Compensation (MC) to achieve CINE sequences with acceleration factors (AFs) of 8–15x, which show higher quality than other methods with pairwise approaches for motion compensation. Nevertheless, in its non-parallel version, thepost-acquisitionreconstructiontakesasignificantcomputation time (about40minina Corei7-4790CPUforthe fullheartcov- erage),whichis farfrombeingclinicallyviable.In thispaper, we show how OpenCLIPER eases dramatically the process of paral- lelizing a complex algorithm previously programmed in a script- ing language.2 As abyproduct, wehaveenlarged theOpenCLIPER librarywithadditionalparallel functionality,such asa groupwise version of a registration algorithm based on free form deforma- tions(FFD)[8]oramulticoilparalleladaptationoftheNESTAopti- mizationalgorithm[9].Noneofthesealgorithms,tothebestofour knowledge,arecurrentlyavailableonGPUs.Acomparisonwithan- otherpopularframework (BART,seeSection 2)isincluded.Inad- dition,weshowhowtoseamlesslyincorporateHIPcodetoOpen- CLIPERbysolvingthe integrationdifficulty referredto above;this opensupthepossibilityofusingCUDAcodeinourframework—as longasthiscodecanbeconvertedtoHIP— atprogrammerwill.

The rest of this paper is organized as follows: Section 2 re- views the subject of current parallel implementations of under- sampled MRI reconstruction algorithms. Section 3 describes the innards ofour proposed implementation.Section 4 evaluates our proposalperformance invariouscomputingdevicesandcompares it,aspreviouslymentioned,withanotherpopularframework. The discussioncan befound inSections5and6concludesthepaper.

Anumberofappendicesofbothmathematicalandalgorithmicde- tailshavebeenincludedforthesakeofself-completeness.

2. Relatedwork

ThefieldofMRreconstructionisveryactiveandquiteanactiv- ityhasbeenfocused onalgorithm implementationonGPUs.Two recentsurvey papers [10,11]give a generaloverviewof thefield;

the formeris dedicated to GPU-basedmedicalimage reconstruc- tionwhile thelatterisspecificallytargetedto MRreconstruction.

Bothsurveypapers,however,haveasimilar structureasfortheir commontopic; they review FFT-based methods, either forCarte- sian and non-Cartesian data, PI and CS-based applications. The 2018 survey also includes a section on reconstruction based on deep-learning,probablyspurredbytheenormousactivitythathas beenreportedinthelasttwo years.Bothpapersalsoquotesome recentwork on multi-GPUsolutions. Thereader iskindlyinvited toconsultthereferencestherein.

Apartfrom particularimplementations of reconstruction algo- rithms, the field has been enriched with a number of libraries, someexamplesofwhichareAGILE[12],BART[13],Gadgetron[14]

and Impatient [15], to explicitly mention four of them. Addi- tionallibrariesarementionedinourformerpublication onOpen- CLIPER [6]. These libraries are conceived to ease the process of algorithm prototyping and testing; Gadgetron shows additional client-servercharacteristics,wheretheservertakescareofthere- constructionprocesswhiletheclientfocusesondatahandling,i.e., Gadgetroncouldbeconsideredbothasalibraryandasanetwork service.

Thecommongroundofthesereferences,bothintermsofspe- cific algorithms and of reconstruction toolkits, is that they are based on theCUDA language and, consequently, they are vendor

2 OpenCLIPER code available at http://opencliper.lpi.tel.uva.es/ .

(3)

dependent. nVidia, since its onset in 2007 with the release of its CUDAapplicationprogramminginterface[16],hasbecomethe main supplier inthe field ofparallel computing withGPUs,both inacademiaandinindustry.Itsmainstreamconditionisaccompa- nied,however,withsomeunavoidableconsequenceswhichcanbe summarized intwo,namely, theneed ofcodeduplicationshould theapplicationberuninGPUandinCPU(orinanyotherdevice), andtheneedofdatareplicationintheplatformsinwhichtheap- plicationisdeployed.

Alternativestothistechnologyarealsoavailable,mainly,Direct Compute issued by Microsoft and, as previously stated, OpenCL, an open source project led by the Khronos Group [11]. The lat- ter has the benefits of being both open-source and device ag- nostic; the second feature makes it comply with the write-once run-anywhere(WORA)paradigm. Anadditionalsolutionhasbeen proposed, whichisadomainspecific computationalimage recon- struction language, referred to asIndigo [17]. Itprovides a front- endwithanumberofimage reconstructionfacilities,aswell asa back-endthatisresponsibletoevaluatetheoperatorsprovidedby the front-endon aparticular platform.As long asdifferentback- ends are available, thelanguage will followthe WORAparadigm, although asoftodayit seemslimitedtoCUDAasforGPUimple- mentation.

As we mentioned in the Introduction, in this paper we illus- tratehow OpenCLIPERcanbeused tofindasolutiontoa2D dy- namiccardiac imagereconstruction algorithm,that makeswhole- heart singlebreath-hold reconstruction possible [18] in clinically feasibletimes.Ouralgorithmmakesuseofanoptimizer,forwhich we haveusedNESTA[9]aswellasFFD[8]formotioncompensa- tion.Furtheralgorithmicdetailsofthemethodaredescribedelse- where [7,19]. As forthe NESTA algorithm, we are only aware of a GPU version, whichwas recentlydescribedin Dinhetal. [20]; however,inthat implementation,asinglecoilisapparentlyused.

Inourimplementation,multiplecoilsareemployed andthealgo- rithm is fullyparallel in the coilsas well. Asfor theFFDs, some GPU-based implementations were describedin Modat etal. [21], Ruijtersetal.[22],Duetal.[23],Ellingwoodetal.[24],Punithaku- maretal.[25];however,theseproposalsarepairwiseanddevice- specific. We show in Royuela-del Val et al.[7] that a groupwise registrationhasaddedbenefitsintermsofreconstructionquality;

consequently, the implementation we havecarried out is group- wiseandspecificallytargetedforcardiaccineMRI.

3. Theproposedsystem 3.1. Thealgorithm

CINE cardiacMRimagescan bereconstructedfromhighlyun- dersampled data while keeping a very high level of detail from thefully-sampledimages.Astate-of-the-artalgorithmonthistopic was published in Royuela-del Val et al. [7] and later enhanced in Royuela-delValetal.[19].Brieflydescribed,the algorithmde- parts fromthe multi-coil k-space subsampled information b and solves a CS reconstruction problemwhich givestheresulting im- agesequencemi,atiterationstepi,as

mi=argmin

m

||

b− Em

||

2l2+

λ|| 

Tm

||

l1 (1)

mi is a vectorized stack of image frames, one frame per cardiac phase. E istheencodingoperatorthat includesthemultiplication by the coil sensitivities S, the intra-frame spatial Fourier trans- formF andtheapplicationoftheundersamplingmaskA.isthe temporaltotal variation(tTV)operator. T isthegroupwise(GW) transformationformotioncompensation(MC)thatregistersallthe framesinthesequencetoacommonreferenceand

λ

isaregular-

izationparameter.AppendixAdescribesamatrixformalization of thesedatastructuresandoperators.Asform0,thetransformation

T is the identity,since there isno data fromwhich motionin- formationcanbeestimated.Then,theiterativerefinementprocess findsmiwithTobtainedbytheheartmotionestimation(ME)on mi−1.The whole process continues until a predefined numberof iterationsismet.Inthismanuscript,thenumberofME/MCrecon- structioniterationswassettotwo.

TheME/MCiscarriedout bymeansofelasticregistration[26]. Aspreviouslystated,weuseFFDs[8];AppendixBprovidesdetails ontheME/MC process. B-splinesservethepurposeofinterpolat- ingthedeformationfield(seeEq.(B.3))fromagivencontrolpoint configurationtogiverise toadensefield. Specifically,weapply a GWparadigminwhichno particularframeisselectedasa refer- ence, butthereferenceisbuiltalong theoptimization processas theaverageofthetransformedimages.Asforthemetrictobeop- timized,wehaveusedthesumoftheintensitysquaredifferences (SSD)withrespecttothereferenceimage(seeEq.(B.1));thismet- ricisenlargedwithsmoothnesstermssoastoforcearealisticmo- tionfieldsolution(seeEq.(B.2)).

Eq. (1) is minimized with the well-known NESTA algorithm basedonNesterov’smethod[9].AdetailedpseudocodeoftheMRI reconstructionalgorithmcanbe foundinAppendixC.Specifically, Nesterov’smethod iteratively minimizesa function f by estimat- ingthreesequencesxk,yk andzk.Thexk sequencecorrespondsto the sequence we want to estimate (mi in the Eq. (1)), and it is obtainedfroma weightedsum ofthe other two sequences. Nes- terov’smethodcan beused fortheminimization ofbothsmooth andnonsmoothconvexfunctionsifusingtheappropriatesmooth- ing techniques.In [9],thel1-normin Eq.(1) iscomponent-wise approximatedbythewell-knownHuberfunction fμ(x),whichde- pends on asmoothing parameter

μ

; thisparameter isiteratively decreased duringthe optimization process. The startingguess x0 istheresultofapplyingtheadjointencodingoperator(EH)tothe subsampledk-space(x0=EHb).

3.2. Parallelimplementation

Our parallel implementation is coded in OpenCL [4], and all parallel operations are programmed as OpenCL kernels. In the nexttwo subsectionsweprovide detailsontheME/MC stageand the NESTA algorithm. Pseudocode for the latter is included in AppendixC.

3.2.1. Groupwiseregistration

The registrationprocedure isdetermined by the valuesof the parameters

θ

uinEq.(B.3).Theseparametersareobtainedbymini-

mizingthemetricdefinedinEq.(B.2).Inthissectionwewillcon- centrate on theimplementation ofthe mostresource-demanding operation,whichturnsout tobe findingthegradient ofEq.(B.1). Thegradientis,inessence,calculatedasfollows:

V

(

x

)

∂θ

=

V

(

x

)

m ·

m

T

(

x

)

·

T

(

x

)

∂θ

(2)

whereV(x) is the metric, m represents the image sequence and T(x) the transformation.Notice that xwill be a gridpoint (with coordinates givenasrowandcolumnnumbers r, 1≤ r≤ N1,and c,1≤ c≤ N2,respectively).Inaddition,eachparameter

θ

isassoci-

atedtoeachoftheframes(withindexn,1≤ n≤ Nt)aswellasto eachofthecontrolpoints(withcoordinates,say(ru,cu))andeach ofthedirectionsofvariation,namely,horizontal(indexl=1)and vertical(l=2);theseindiceswillbeborneinmindforthederiva- tives.

ThefirstfactorinequationEq.(2)canbewritten:

V

(

x

)

m r,c,n= 2 Nt



m(n)



T(n)

(

x

) 

r,c− 1 Nt

Nt



n=1

m(n)



T(n)

(

x

) 

r,c



(4)

Fig. 1. Functional units of the proposed reconstruction system. Gray boxes represent necessary albeit unproductive work, which is taken care of by OpenCLIPER. White boxes represent actual, productive work. Each white box is abstracted as a process , which may be composed of several other processes. U , U H , E , E H are the internal NESTA operators. tTV: temporal Total Variation; GW: Groupwise Registration.

withm(n)(x)thenthframeinsequencem,evaluatedatpixelx,and T(n)(x) itthe transformed positionofpixel xinthat phase. m(n) willbealsoreferredtoasthenthframe.Thisequationiscomputed inparallelfortheN1× N2× Nt pixels.

The second factor in Eq. (2) is the image gradient with re- specttothetransformation.Theactualoperationsweperformare the interpolation of the gradient of each frame m(n) at position T(n)(x). The interpolationis carriedout within a region ofinter- est, say

χ

m, so the number ofoperations launched in parallel is

N1χm× N2χm× Nt, wherethe two first factorsare the dimensions ofregion

χ

m.

The last factorinEq.(2) isthegradient ofthe transformation withrespecttothevariables

θ

utobeoptimized;sincethesevari- ables enter Eq.(B.3)asfactors, their derivative isstraightforward andcanbeprecomputed.

Hence, the calculation of the gradient of Eq. (2) will be per- formedparallelizingalongtherowsandcolumnsaffectedbytheB- splines(rsandcs),theframedimension(n),therowsandcolumns ofthe controlpoints(ru andcu) andthespatialdimension(l),as canbeseenbelow:

V

(

x

)

∂θ

rs,cs,n,ru,cu,l=

V

(

x

)

m rs,cs,n,l·

m

T

(

x

)

rs,cs,n,l·

T

(

x

)

∂θ

rs,cs,ru,cu.

3.2.2. NESTA

A matrix formalism of data structures andoperators (the en- codingoperatorE andthesparsifyingoperatorU)usedinNESTAis presentedinAppendixA;thisformalismmakesitsimpletoderive the adjointoperators. However, inpractice, highdatadimension- ality impliesthat storageofmatricesassociatedwithoperatorsis notfeasibleduetomemoryrequirements.Consequently,operators are implementedby meansoffunctionsandthelatterhave been thefocusofourparallelizationeffort.Specifically,fortheencoding operatorE thefollowingparallelizationsweremade:1)Multiplica- tionby thecoilsensitivitiesS isperformedparallelizingalongthe spatial dimensions (i.e., N1× N2 operationslaunched in parallel).

2) The by-frame spatialFourier transform F isperformed paral- lelizing along the coils andframesdimensions (i.e.,C× Nt paral- lel operations) using the clFFT [27] library.3) Application of the undersampling mask A is performed parallelizing along the spa- tial dimensions (i.e.,N1× N2 parallel operations). Note that in S, thecoilsandframesdimensionscouldhavealsobeenparallelized;

nevertheless, we empirically verified that this did not constitute aperformance gain,presumablycausedbythelimitednumberof cores in the hardware. As for the adjoint encoding operator EH, thefollowingparallelizationsweremade:1)Multiplicationbycoil conjugatesensitivities SH and, 2)by-frame spatialinverseFourier transformFH,Bothofthemwereparallelizedanalogouslytotheir counterparts S andF,respectively.3) Summationoftheresulting image sequence in each coilis parallelized along the spatialand temporaldimensions(i.e.,N1× N2× Ntoperationslaunchedinpar- allel).

RegardingthesparsifyingoperatorU=T,thetemporaltotal cyclicvariation is performedparallelizing along thespatialdi- mensions(i.e.,N1× N2operationslaunchedinparallel)andtheT groupwisetransformationforMCisperformedparallelizingalong both the spatialandthe frame dimensions(i.e. N1χm× N2χm× Nt

parallel operations).Similarly,the adjointsparsifying operatorUH isparallelizedalongthesamedimensionsasU.

Finally, matrixoperations neededin NESTA, such as scaling a vectorbyaconstantoradditionoftwovectors,areperformedef- ficientlybythe clBLASTlibrary[28].Thus,all thealgorithm steps (seeAppendixCfordetails)canbestraightforwardlycomputedby combiningoperatorsE andU,theiradjointsEH andUH,andafew matrixoperations.

3.3. Algorithmimplementationonagenericparalleldevice

Wehavetranslatedtheoriginal methodintoan actual device- agnostic software solution which (a) finishes in clinically viable times,comparabletootherpopularreconstructionframeworks,(b) issuitableforexecutioninparalleldevices,and(c)complieswith the WORA paradigm (see Section 2) so neither code nor data

(5)

need beduplicatedforCPU/GPUoranyotherdevice. Asstatedin Section1,wemakeextensiveuseofourframeworkOpenCLIPERto providethissolution.

In this section we provide details on this implementation;

Fig. 1 shows a diagram of our system functional units that will guideusthroughthedescriptionthatfollows.

3.3.1. Generalities

Togetherwiththecorealgorithm—enclosedwithinthedashed rectangle in Fig. 1 and its operations represented by rectangles withwhitebackground—, severalother tasksmustbe carriedout forthesystemtobe functionalbyitself.Thesetasks arenotonly part of the initialization process but they are also active while theactualprocessingistakingplace.Alltheseadministrativetasks (shadedboxes inFig.1) aredealtwithby OpenCLIPERwithmin- imal programmer intervention.While thesystemisatwork,data passes throughseveral transformations(called processes inOpen- CLIPER)thatneedtobeconnectedappropriately.Eachprocessmay be seen as a black box with a single input, a single output and anarbitrarysetofparameters.Thisspecificationallowsprospective programmerstointerconnectandcomposeprocessesarbitrarily,as long asoutputs fromoneare compatiblewithinputstothenext, sincetheirinterfacesmustallcomplywiththisspecification.

Anappropriatecomputingdevicemustbeselected(toprowin Fig.1).SinceOpenCLsupportsawiderangeofthem,anadditional workloadistheneedtodetecttheavailableplatformsanddevices (which may be from different vendors), andto choose the most suitable amongthem.Moreover, oncea devicehasbeenselected, kernelsmustbeloadedandcompiledforthechosendevice.Open- CLIPERsimplifiesthesetaskssignificantly:kernelloadingandcom- pilingisdonewithoutuserinterventionwhenaprocessdemands it,caching themasnecessarysotime isdedicatedtocompilation only once per device (and driverversion). Platform/device detec- tionandselectionmaybeeitherspecifiedbytheprogrammer(via hints such as device type, model,supported CL version, etc.) or, alternatively, left to the framework by just a singleline of code.

Inthelattercase—or,intheformer,wheneverthespecifiedhints matchseveralcandidatedevices—,theapriorifastestdeviceisau- tomaticallychosen.

All compute-intensive transformations are performed in the computing device, so the systemtakes full advantage of its par- allel capabilitiesand henceclinically viable execution times may be achieved (see quantificationin Section 4). The algorithms im- plemented in OpenCLIPER, which are themselves processes, are made up from several other processes which may be reused at programmer discretion.Inthissense,OpenCLIPERprovides apool offrequentlyusedprocesses asa ready-to-usetoolbox;forexam- ple,Hadamardmatrixmultiplication,fastFouriertransform,chan- nel integration,image sum, etc. Thesetoolsmake theimplemen- tation ofmoreinvolvedprocesses —suchasthe two describedin Section3.2— easy.

With respect to the four blocks on top of the NESTAProcess inFig.1,aftertheselectionofthemostappropriatedevice(either manually orautomatically), kernelloadingis delayeduntila pro- cess requires it.At that time, a previously compiled (i.e. cached) versionofthekernelissought.Ifitdoesnotexistyet,kernelcom- pilationisautomaticallytriggeredandthekernelcacheisupdated (loading of cached kernels is typically two orders of magnitude faster than on-the-flycompilation). Compilation logs (if any) are showntotheuserfordebuggingpurposes.

3.3.2. Inputandoutput

MRIdataloadingisdoneinasinglecodelinewithOpenCLIPER by a front-endfunction (see leftmost side ofFig. 1).As of today, k-space lines are input by means of a.clf file (and its accompa- nying.hdrfile);Matlabformat fileisalsoallowed. Outputformats

includethe twoformatsjustmentioned aswell asother popular imageformats(JPEG,PNG,etc).

Tomaximize utilization ofthe computingdevice, data objects may be loaded andsaved concurrently while the device is busy processingother objects.SinceMRI datafilesareoftenlarge,this cansaveanoticeableamountofprocessingtimeperpatient.

3.3.3. Datastructures

AcommonburdeninGPUcomputingliesintheneedto keep trackof pointersto dataobjects andtheir properties: numberof dimensions, sizesandstridesalong eachof them,datatypesand sizes(complex,float,integer,etc.), andpassingthem tothecom- puting device. Thisis typically done by addingarguments to the kernels,butkernelargumentspaceisusuallylimitedby thecom- putinghardware(apart formaddingmoreburdenforthe userto calltheirkernels). OpenCLIPERsimplifiesthisburdenby encapsu- latingalldatapropertieswithinthebufferinthecomputingdevice inawaythatistransparenttothekernelprogrammer(i.e.thereis nooffsetbetweentheobjectpointerandtherealdata).Thus,users justhavetopassasinglepointer tothedataobjecttohaveallits propertiesreadily available for thekernel code.OpenCLIPER pro- videssupport fordatawithan arbitrary numberofframes, coils, sensitivitymaps,samplingmasksanddatadimensionalityforthe specialcaseofMRIdata,andarbitrarilycomplexdataforthegen- eralcase. Ourframework also provides methods to traverse data buffersalong any givendimension. This functionalityis provided bygeneratingindexandsizetablesforeachdataobject,asshown inFig.1.

An additionalfeature of OpenCLIPER isthat all sub-objects in a data object (e.g. sensitivity maps for each coil) are mapped contiguously in device memory and properly adjusted to hard- warealignment,sokernelprogrammerscanassumealinearlayout when processingcompound data objects. Data transfers between host and device are driven by the DMA controller to maximize speed.

Listing1 shows a simpleOpenCL kernel whichsums a set of MR imagesalong the coildimension. It can be seen how a ker- nel can accessattributes frominput andoutput buffers (suchas coilorframestrides,sizes,andsoon)justbypassingtheirnatural pointerasargumentswithnoneedtoworryaboutheaderoffsets orthelike.AllattributeaccessfunctionsareO(1)aslongasallND

Listing 1. Accessing attributes of complex data structures. Note how data proper- ties (sizes, strides and number of coils in this particular case) are available from kernel code without passing them explicitly as kernel arguments.

(6)

arrays in the given objecthave the same number of dimensions (O(n)otherwise).

3.3.4. Additionalsupportforresearchers

OpenCLIPER hasalsobeendesignedasadevelopmenttool for researchers. Hence, it includes a number of additional utilities, whichwenowenumerate:

1. 1D,2D and3D objectsindevicememory,eitherstandalone or asamovie,canbegraphicallypresentedtotheuseratanytime by a simple call to

pointerToObject- > show()

method;

scalingandvideovelocityfacilitiesareprovidedatprogrammer convenience.OpenCV[29]iscurrentlyusedtodisplaywindows onthedesktop.

2. Similarly,anyobjectindevicememorycanbesavedforfurther inspectioninMatlaboranyothersupportedformatatanytime.

3. Whencompilingindebugmode,deviceandhostareautomati- callysynchronizedineveryprocesssothatrun-timeerrorstrig- gerexactlyattheresponsiblekernel,sofaultlocationdetection shouldbeimmediate.

4. When OpenCL kernelsare compiled onthe fly(asopposedto loadingapreviouslycachedversion),errorsmayshowupwhen thehostprogramisrun.Ifthisisthecase,kernelcompilation errorsareautomaticallyshowntotheuser.

5. In the case of program abortion, the full stack backtrace is showntotheprogrammer.

6. OpenCLIPERincludesfacilitiestogatherprofilingdataandgen- eratestatisticalreportsaswell.

7. Amemorymap ofdataobjectsindevice memorymaybe ob- tainedatanytime.

3.3.5. IncorporationofHIPandCUDAfunctionality

Asstatedintheintroduction,HIP[5]isan interestingeffortto bridgethegapbetweenGPUsvendorsbyprovidingfacilitiestomi- grateCUDAcodeintoareusableforminotherplatforms.However, handling exchange is still an issue; specifically, data objects de- finedinonelanguage(HIP/OpenCL)cannotbeusedasparameters to kernels written in the other language (OpenCL/HIP).We have solved this problemby providing a HIP handleto every OpenCL data object defined through the OpenCLIPER API. This way, de- velopers may invoke calls to HIP libraries on pure OpenCL data objects and, consequently, OpenCLIPERmaybenefitfromlibraries writteninCUDAforwhichaHIPimplementationisalsoavailable.

Asanexample,Listing2showshowanOpenCLIPERdataobject may be passed seamlessly to the rocFFT library[30] anduse its resultasanOpenCLIPERobjectagain.Afterinitialization(lines11–

15),avectorofcomplexnumbersiscreatedasanOpenCLmemory object(lines 17–20)anditscorresponding HIP handleisobtained (line23).ThenextcodesectionispureHIPcodeinwhichtheob- tainedhandleisusedtofeedtherocFFTlibrary(lines25–31)and, finally,theFFTresultissavedasaMatlabfile(line34).

4. Evaluation

We have executed several reconstructions to test the be- haviour of our platform. Reconstructions have been carried out on 7healthy volunteers courtesyof King’sCollege London. These data are 2D Cartesian, fully sampled dynamic short axis cine breath-hold ECG-triggered acquisitionsin a 1.5T Philips scanner with a bSSFP sequence. Some relevant parameters of the acqui- sitions include flip-angle 60, TR/TE = 3/1.5 ms, spatial resolu- tion 2 × 2 mm2, slice thickness 8 mm, 20 cardiac phases, FOV 320× 320 mm2,12–14 slicesandthenumberofchannelsisbe- tween 19and32,dependingonthesubject.Bothsensitivitymaps andk-spacedatafromallcoilswereprovidedtous.Thesedatasets were retrospectivelysubsampledwithaGaussian variable-density

Listing 2. Passing OpenCLIPER data objects to HIP libraries.

randomundersamplingpatternalongthephaseencodingdirection describedinAsifetal.[31]fordifferentvaluesofaccelerationfac- tor(AF).Hence,fortheexperimentscarriedoutwehavefullysam- pled images whichhave been usedas a referencefor measuring somequalityindices(QIs).

First,we havecomparedourplatformwithBART[13]to solve Eq. (1). As for OpenCLIPER we have used NESTA. As for BART,3 since NESTA is not available in that platform, we have used the ADMMmethod[32].Asforthesecomparisons,TinEq.(1)isthe identity due to the fact that BART doesnot incorporate —to the bestofourknowledge— agroupwiseME/MCimplementation.

Three experiments have been conducted, namely (a) AF = 4 (AF4),(b)AF=4pluscoilcompression(AF4CC)from19–32tohalf ofthechannels(9–16), and(c)AF= 8(AF8).Theparameter

λ

in

Eq.(1)hasbeenchosentomaximizethestructuralsimilarityindex (SSIM)[33]intheBARTreconstruction.Table1showsSSIMvalues forthethreeexperiments.Thus,

λ

=0.01 hasbeenusedforBART andOpenCLIPERreconstructions.Forthesakeoffairness,sincethe twooptimizationalgorithmsaredifferent,wehavechosenthein- ternalNESTAparameters(namely, both

μ

andthestoppingcrite-

rion,seelines5and12inAppendixC)toguaranteethatSSIMval- uesarecomparableforADMMandNESTAreconstructions;thisis graphicallyshown inFig.2a, withboxplots ofSSIM along all the slicesofallthepatients.Boxplotshavebeengroupedinpairs,i.e., AF4 for OpenCLIPER and BART, and the same structure goes for

3 Release downloaded on February 25, 2020, from https://github.com/mrirecon/

bart/releases/tag/v0.5.00

(7)

Fig. 2. Boxplots of SSIM (a), NCC (b), and SER (c) for experiments AF4, AF4CC, and AF8 on both OpenCLIPER which makes use of NESTA, and BART (which uses ADMM). No significant differences have been found.

(8)

Fig. 3. Example of reconstructions using BART and OpenCLIPER for the experiments AF4, AF8, and AF8 with ME/MC; the latest only for OpenCLIPER. Fully sampled images are shown as reference. End-systole and end-diastole frames are shown in each case, as well as the intensities along time of the vertical profile marked with red line in reference images. Last row shows the error images (mean squared error, MSE, between reference and reconstructed image). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1

Mean SSIM values evaluated on undersampled BART reconstruc- tions for a single slice of all patients and the three experiments (AF4, AF4CC, and AF8). Maximal values (bold highlighted) in- dicate the optimal regularization weight ( λin Eq. (1) ) for the experiments.

Reg. Term AF4 AF4CC AF8

λ= 10 −5 0.8776 0.8923 0.8200

λ= 10 −4 0.8832 0.8979 0.8293

λ= 10 −3 0.9037 0.9201 0.8642

λ= 0 . 01 0.9056 0.9229 0.8719

λ= 0 . 1 0.8955 0.9133 0.8592

λ= 1 0.8951 0.9128 0.8589

AF4CC aswell asforAF8,asindicated inthe horizontallabelling andinthelegends onthefigures. Forcompleteness,differentQIs havebeenanalyzed.Fig.2alsoincludesthecorrespondingboxplots forthe normalizedcrosscorrelation (NCC, Fig.2b) andthesignal toerrorratio(SER,Fig.2c).MannWhitneytests,4shownosignifi- cantdifferencesinanyoftheseparametersbetweenthecompared frameworks. Additionally, Figs. 3 and 4 show two reconstruction examples fortwo differentpatients. Inboth cases,the fullysam- pledreconstructedimagesareshownasreferencewiththeimages reconstructedusingBARTandOpenCLIPERfortheexperimentsAF4 and AF8; for the latter, ME/MC reconstruction is also shown for OpenCLIPERWith˙ theseparametersetting,executiontimesiniden- tical computerload situationshave beencompared. Todeal with variability, each experimenthas beenrun one hundredtimes for eachpatient.

Four platforms withfour differentGPUs havebeen employed.

AllofthemarestandardPC-classworkstationsbasedonIntelCore

4 Function wilcox.test in RStudio.

or AMD Ryzen processors. The exact GPU models are GeForce 2080TiandQuadroRTX6000fromnVidia,andRadeonRX480and RadeonRX5700XTfromAMD.Forreference,sometestshavealso been carried out using a CPU asthe computingdevice in a 70- threadIntelXeonserver.NoticethatBARTcanonlyberunonCPU ornVidiaGPUs,sonofurthercomparisoncould bedonebetween AMDandnVidiaasforBARTperformance.

Fig.5showsboxplotsofexecutiontimesontheGeForce2080Ti (a) and RTX6000 (b) for the three experiments, with the same ordering as in Fig. 2. As for a comparison betweeen nVidia and Radeon, Fig. 6 shows boxplots of the AMD Radeon RX 5700XT (GFX1010) and the nVidia QuadroRTX 6000 in the same condi- tionsasinthepreviousexperiment. Intermsofperformance,the FFTimplementationused bythe reconstructionalgorithm plays a prominentrole.BARTusesthenVidia’swell-knownproprietaryim- plementation cuFFT, whereas OpenCLIPER uses the AMD’s open- sourceimplementationclFFT.Whiletheformerishighlyoptimized fornVidia GPUs, thelatter is conceivedto be run on every pos- siblecomputingdevice andhence lacks ofanyspecific optimiza- tion. This results in clFFT being slower than cuFFT, asshown in Table2.OpenCLIPERcompensatesthiswithperformanceenhance- ments in other areas,such asparallel loading/saving of dataand kernelcaching(seeSection3.3).5

Fig.7showsboxplotsofSSIMforAF8onOpenCLIPERwithand without ME/MC; whereas boxplots for NCC and SER are shown in Fig. 8. Results of the unilateral Mann-Whitney test for SSIM are significant(p= 0.008)for ME/MC.Ifa unilateralsigned rank test isrun per patient,differencesfavor ME/MC insixout ofthe seven patientstested.Mann-Whitney testsarealso significantfor NCC and SER (p=0.006 and p=0.007, respectively). The price

5 In our previous work [34] , FFT times reported were overestimated since they incorporated an additional synchronization of the device command queue.

(9)

Fig. 4. Example of reconstructions using BART and OpenCLIPER for the experiments AF4, AF8, and AF8 with ME/MC; the latest only for OpenCLIPER. Fully sampled images are shown as reference. End-systole and end-diastole frames are shown in each case, as well as the intensities along time of the vertical profile marked with red line in reference images. Last row shows the error images (mean squared error, MSE, between reference and reconstructed image). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Boxplots for experiments AF4, AF4CC, and AF8 on (a) GeForce 2080Ti and (b) RTX60 0 0. Times reported are per whole slice stack.

Fig. 6. Boxplots of computing time for AMD Radeon RX 5700XT (GFX1010) and nVidia Quadro RTX 60 0 0 (RTX60 0 0) for the three experiments. Notice that red-shaded boxes coincide with the red-shaded boxes in Fig. 5 . Times reported are per whole slice stack. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

(10)

Table 2

Performance comparison between cuFFT and clFFT libraries on various devices. A single experiment consists in transforming a K-space dataset ( 160 × 160 images, 19 coils, 20 frames, complex floats) 10 0 0 times. Each experiment is run 100 times on each library and device. For each combination we show mean execution times (mean) and their standard deviation (std), both in seconds.

cuFFT clFFT

Device mean std mean std

GeForce 2080Ti 0.6299 0.001 0.6518 0.0227

Quadro RTX 6000 0.5880 0.0016 0.6413 0.0020

Radeon RX 5700XT N/A N/A 0.8276 0.0015

Fig. 7. Boxplots of SSIM for AF8 on OpenCLIPER with (right, green) and without (left, red) ME/MC. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Boxplots of NCC (a) and SER (b) for AF8 on OpenCLIPER with (right, green) and without (left, red) ME/MC. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

to payis theincrease incomputationtime, whichreachesa me- dianequalto62.12sand(q1,q3)= (53,72.13)for2080Ti,withqi

the ith quartile.ForRTX6000the medianis 60.16and(q1,q3)= (50.74,69.24)(timesreportedareperwholeslicestack).

Finally,wehavealsorunanexperimentonCPUwithbothBART and OpenCLIPER; the experiment is AF4CC with a randomly se- lected patient,tenrepetitions.Asfortheformer,themedianexe- cutiontime is24.58 s((q1,q3) = (24.84,24,45))whileforOpen- CLIPERmedianis16.33and(q1,q3)=(16.09,16.59).

5. Discussion

OpenCLIPER revealsitselfasadevice agnosticplatformforre- construction of MR dynamic images.We have shown results for 2D although thecode is preparedfor higherdimensionality pro- vided that sufficient memory is available. Fig. 5 shows that our computing times are comparable to those needed by BART; this hasbeentestedintwo nVidiadeviceswithdifferentmemoryca- pacities.Resultsarenotpoint-by-pointcomparablesinceoptimiza- tionmethodsusedby bothapproachesaredifferent.However,we meant to be fair by assuring that both platforms gave rise to images with similar qualities; optimization parameters were se- lected to this end, a goal that seems accomplished according to

theevidenceshowninFig.2.Asforcomputingtimesthemselves, we observethat our timesare fairly similar tothose fromBART;

hence,no obvious lossesare appraised by using OpenCLIPER de- spitesomeadministrativetasksneedtobehandled,aspointedout inSection3.3.1,duetoitsdeviceagnosticcharacter;thisoverload isrepresented inFig. 1by the shadedblocks located outsidethe dashedboxthatcontainsthecoreofthereconstructionalgorithm.

BARTcannotbetestedinAMDdevices;henceonlyOpenCLIPER enters the comparisons in Fig.6. The figure showsthat a device one order of magnitude more economical can do a remarkable job.Therefore,OpenCLIPERmakesitpossiblethatanaffordablede- viceisusedforimagereconstructioninviableclinicaltimes.When ME/MCentersthealgorithmthecomputingtimeneededreachesa quantityofaboutoneminuteforamultislicereconstruction,ade- laythat seems alsorealisticin aclinicalsetting.WhetherME/MC isworthtakingdependsontheaccelerationfactor;fortheAF=8 casewehaveemployed inourexperiments,statisticaldifferences werefound.

TheexecutiononCPUrevealedthatBARTneededextratimefor image reconstruction withrespect to OpenCLIPER.Since noobvi- ousdifferenceswerefoundonthenVidiadevices,chancesarethat thisisduetothefactthatBART,whenexecuted ontheCPU,par-

(11)

allelizesonlytheFFTpartofthereconstruction(makinguseofthe threading support in the FFTW library), whereas in OpenCLIPER, every process’ kernelsare executed inparallel.Additionally,clFFT (theOpenCLversionofFFTusedbyOpenCLIPER),hasbeenusedin bothCPUandGPUexperimentssincekernelcodeisunique.

The FFT algorithm isexhaustively usedin each reconstruction step. Hence, optimized implementations of thisalgorithm should havean appreciableimpact intheoverallreconstruction comput- ing time. Our HIP interface could be an alternative to use other FFTlibrarieswritteninCUDA(cuFFT,whichisknowntobehighly optimized fornVidiadevices,mightbeone ofsuchalternatives if its source codewere available).Nevertheless,interfacing overload isnon-negligibleandthisrequiresfurtherinvestigation.

6. Conclusionandfurtherwork

Inthispaper,a2DMRIcardiacreconstructionsystemhasbeen presented. This system is a) clinically viable in terms of execu- tion times, andb) suitable for any computing device which has anOpenCLimplementation,includingCPUs,GPUs,FPGAsandDSPs frommainvendors.TheuseofourframeworkOpenCLIPERallowed ustopartitiontheprobleminindependentblackboxes(processes) which are thenconnected asneededandexecuted in parallelon anycapable device, whilethesource coderemains unique forall prospective computing devices. Device initialization and mainte- nance is reduced to a minimum as well, while providing rele- vant development and debugging aids. Administrative but time- consuming tasks such as data loading/saving and kernel compi- lation areparallelized orcached so astominimally affectoverall performance.HIPcode(andprospectivelyCUDAcode)canbeinte- grated in developments whenever needed as OpenCLIPER makes data objects shareable between OpenCL and HIP. We also con- tributeaparallelgroupwiseregistrationsubsystemformotionesti- mation/compensationandaparallelmulticoilNESTAsubsystemfor l1− l2problemsolving.

Further work includes extendingthe 2D reconstruction to the free-breathing 3D problem, which poses additional issues dueto typicaldatavolumesthatextendfarbeyondthecapacityofasin- glecomputingdevicememory,alongwithmuchhigherprocessing times.

DeclarationofCompetingInterest

The entire research has been funded by the Government of Spain;theauthorsdeclarethattheyhavenoconflictofinterest.

Acknowledgements

ThisworkissupportedbyMINECOundergrantTEC2017-82408- R. In addition, the authors acknowledge the Asociación Española ContraelCáncer(AECC)anditsScientific Foundationforitsgrant PRDVL19001MOYA. We also acknowledge European Social Fund, OperationalProgrammeofCastillayLeón,andtheJuntadeCastilla yLeón,throughtheMinisteriodeEducación.Weexpressourgrati- tudetoDr.ClaudiaPrieto,King’sCollegeLondon,forkindlysharing withusthedataintheexperiments.

AppendixA. Notationandmainoperators

We aimto reconstruct a sequence of 2D imagesof size N1× N2 and Nt temporal frames. Data hasbeen acquired using a coil array ofC elements fromM samples ofadiscretizedk-spacegrid of size K=K1K2. The differentterms includedin Eq. (1)can be representedbythefollowingmatrices:

1. Multi-coilk-spacesubsampleddatab isavectorofsizeM× 1.

2. Reconstructed image m is a vector of size N× 1, where N= N1N2Nt.

3. E=AFS isthe encodingoperator,whichcan berepresentedas amatrixofsizeM× N.

(a) S isamatrixofsizeNC× N givenby:

S=

⎜ ⎜

S1 S2

.. . SC

⎟ ⎟

whereSc,1≤ c≤ C,isadiagonalmatrixofsizeN× N whose diagonalelementsrepresentthesensitivitymapsofthecoil c ataspaciallocation.

(b)F isamatrixofsizeM× NC givenby:

F=

⎜ ⎜

F1 0 ... 0 0 F2 ... 0

..

. ... ... ... 0 0 ... FC

⎟ ⎟

where Fc,1≤ c≤ C,is matrixofsize KNt× N representing thecoefficientsofFouriertransform.

(c)AisamatrixofsizeM× M givenby:

A=

⎜ ⎜

A1 0 ... 0 0 A2 ... 0

..

. ... ... ... 0 0 ... AC

⎟ ⎟

where Ac,1≤ c≤ C,isa diagonal matrixofsize KNt× KNt

whosediagonalelementstake thevalue 1ifthe entrycor- respondstoasensedk-spacelocationand0otherwise.Note thatA1=A2=...=AC,sincethesamplingmaskisthesame forallcoils.

4.U=Tisthesparsifyingoperator,whichcanberepresented asamatrixofsizeN× N.Inthiscase,theoperatoriscomposed oftwomatrices:

(a) isamatrixofsizeN× N givenby:



=

⎜ ⎜

⎜ ⎜

I 0 0 ... −I

−I I 0 ... 0 0 −I I ... 0

..

. ... ... ... ... 0 0 ... −I I

⎟ ⎟

⎟ ⎟

where I isa identity matrixof sizeN1N2× N1N2.This ma- trix computes the temporal finite differences, i.e., tTV op- erator. We have added a cyclical extension with respect to[26]basedonthecardiaccycleperiodicity.

(b)T is amatrixof sizeN× N.A detaileddescriptionofthis matrixcan befoundinAppendix B.Recall thatform0,the transformationT istheidentity.

Notethatwiththisformalism, theadjointoperatorrepresentation isstraightforwardastheHermitianofthesematrices(i.e.,EH and UH).

AppendixB.Groupwiseregistrationandmotioncompensation operators

The goal of the groupwiseregistration procedure is to jointly obtain a set of spatial transformations, one for each frame con- tainedinthetemporalsequence,sothat transformedimageside- allycoincide.Noimageistakenasareferencetoavoidanysortof bias.In practice,registered images willnot be exactly coincident buttheywillconstituteasparsesequenceinthetemporaldimen- sion,i.e.,registrationpromotessparsity.Thetargetistheminimiza- tionof acost functionH (see Eq.(B.2)), whichconsistsofa data

(12)

fidelityterm—thesumofsquaresoftheintensitydifferenceswith respecttothereferencethatisbuiltonthefly(seeEq.(B.1))—and a smoothnessterm, whichpreventstheonsetofunrealistictrans- formations. Minimization is achieved by gradient descent. Gradi- entsarecalculatedfollowingstandardprocedures[35].

V

(

x

)

= 1 Nt

Nt



n=1



m(n)



T(n)

(

x

) 

− 1 Nt

Nt



n=1

m(n)



T(n)

(

x

)  

2

(B.1)

H

( τ )

=

χ



V

(

x

)

+ Tc

0

4

p=1

λ

pRpdt



dx (B.2)

Eq. (B.2)showsthatthe metriciscalculatedwithinregion

χ

,

which maybethe wholeimage oronlyan estimatedarea where most ofthe motiontakes place.In theseequationsTc is thecar- diaccycleand

λ

pand1≤ p≤ 4aretheweightsoftheregulariza- tion terms Rp.These termsaccount forderivatives ofthe motion vectorfield;specificallyR1andR2arerespectivelyfirstandsecond orderspatialderivativeswhileR3andR4arefirstandsecondorder temporalderivatives.

ThemotionvectorfieldisapproximatedbymeansofB-Splines functions, thecoefficientsare whichofthe parameters thatenter theoptimization.ThisistheMEstage,wherethedense(forward) transformationiscalculatedasfollows:

T

(

x

)

=x+

C12



u1=C11 C22



u2=C21



2 l=1

B3



x

l− pul

l



·

θ

u (B.3)

whereC isthegridofcontrolpoints,B3representstheuniformB- spline function ofgrade3, pul the controlpoint coordinatealong dimensionl, l isthepixelresolutionofthegridofcontrolpoints ineachdimensionand

θ

uthefreeparameterthatdrivesthedefor-

mation.ThemeshboundariesgivenbyC aresetsothattheregion ofinterestiscompletelycoveredwithacertainmarginforapprox- imation via B-splines to avoidlarge displacement values produc- ing inconsistenciesat theedges ofthe region of interest leading to errorsininterpolation.Theregisteredimage setisobtainedby meansofinterpolation,whichmayberepresentedinmatrixform as[26]:

T=

⎜ ⎜

T1 0 ... 0 0 T2 ... 0

..

. ... ... ... 0 0 ... Tt

⎟ ⎟

whereTtisamatrixofsizeN1× N2associatedtothetransforma- tion ofthe framet in thesequence; it containstheinterpolation coefficientsthatresultasaconsequenceofthetransformation.Its genericshapeforwillbe:

T1=

⎜ ⎝

...

0 ... w1 w2 0 ... w3 w4 0 ... 0 0 ... 0 w1 w2 ... 0 w3 w4 ... 0

...

⎟ ⎠

i.e., the matrix will be sparse, the sparsity degree of which de- pends on the interpolation order. Fora bilineal interpolation, for instance, only four coefficients will be non-null in each row, as indicated in the equation. This is the MC part of the algorithm.

Thesematricesandtheiradjointsarenot explicitlybuilt,butthey areappliedasoperatorsforefficiency.Formoreinformationabout B-Splinefree-formdeformationssee[36].

AppendixC. CSreconstructionusingNESTA

Algorithm1 MRIreconstruction.

Step0:NESTAmulticoilreconstruction

Inputs:multi-coilk-spacesubsampleddatab,encodingoperator E=AFS,sparsifyingoperatorU=,theiradjointsandparame- ters

λ

,

γ

,La∈R+

Initialization:

μ

(0), x0=EHb, thenumber ofsteps maxIterand parameterLμ

fort=0tomaxIterdo

Step0.1:ApplyNesterov’salgorithmwith

μ

=

μ

(t)

fork≥ 0do

a.Computethegradientofthel1-norm:

fμ

(

Uxk

)

=



1

μUHUxk, if

|

Uxk

|

μ

,

UHsgn

(

Uxk

)

, otherwise,

where function sgn() applied to vector

v

means the stack of sign(

v

i/

| v

i

|

),with

v

itheithvectorcomponent.

b.Computethegradientofthecostfunction:

f

(

xk

)

=EH

(

Exk− b

)

+

λ∇

fμ

(

Uxk

)

c.Computeyk: yk=xkλLμ1+La

f

(

xk

)

d.computezk: zk=x0λLμ1+L

a



j≤k

α

j

f

(

xk

)

where

α

k=1/2(k+1). d.Computexk+1: xk+1=

τ

kzk+

(

1

τ

k

)

yk.

where

τ

k=2/(k+3).

Stopwhenagivencriterionismet endfor

Step0.2:Decreasethevalueof

μ

:

μ

(t+1)=

γ μ

(t)

endfor

Outputs:reconstructedimagem0=xk+1 fori=1toMotionItersdo

Step1:ME-GW

Inputs:reconstructedimagemi−1

Initialization: fix the region of interest, create the control pointsmesh,computeB-splinesproducts andcoefficients,asin [7]

for j≥ 0do

Step1.1:Calculatethepixel-wisedisplacementfields Step1.2:Transformimagesusinglinearinterpolation Step1.3:Calculatethemetricandsmoothingterms Step1.4:Calculategradients

Step1.5:Updatethemovementsofthecontrolpoints Stop KN1

|| θ

n−1

θ

n

||

<

T and |χ|1 (Hn−1− Hn)<

H

endfor

Outputs:T

Step2:NESTAreconstructionwithMC

Inputs:multi-coil k-spacesubsampleddatab, encodingoper- atorE=KFS, sparsifyingoperatorU=T,their adjoints,and parameters

λ

,

γ

La∈R+

Initialization:

μ

(0), x0=mi−1, the number of steps maxIter andparameterLμ

Seeabove(step0)forfurtherdetails Outputs:reconstructedimagemi=xk+1 endfor

Referencias

Documento similar

For instance, in a fully distributed model and using PMIPv6 terminology, each access router implements both LMA and MAG functions and for each user the access router could behave as

K) is the Banach space of continuous functions from Vq to K , equipped with the supremum norm.. be a finite or infinite sequence of elements

MD simulations in this and previous work has allowed us to propose a relation between the nature of the interactions at the interface and the observed properties of nanofluids:

In this guide for teachers and education staff we unpack the concept of loneliness; what it is, the different types of loneliness, and explore some ways to support ourselves

 The expansionary monetary policy measures have had a negative impact on net interest margins both via the reduction in interest rates and –less powerfully- the flattening of the

Jointly estimate this entry game with several outcome equations (fees/rates, credit limits) for bank accounts, credit cards and lines of credit. Use simulation methods to

For each dataset the averaged generalization accuracies over the 100 executions are shown for AdaBoost (first column), full bagging with both unweighted and weighted voting (second

We use for our experiments two recognition approaches exploiting information at the global and local levels, and the Biose- curlD database, containing 3,724 signature images and