A clinically viable vendor-independent and device-agnostic solution for accelerated cardiac MRI reconstruction


Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine

Journal homepage: www.elsevier.com/locate/cmpb


Elena Martín-González^a, Elisa Moya-Sáez^a, Rosa-María Menchón-Lara^a, Javier Royuela-del-Val^c, César Palencia-de-Lara^b, Manuel Rodríguez-Cayetano^a, Federico Simmross-Wattenberg^a,∗, Carlos Alberola-López^a

a Laboratorio de Procesado de Imagen (Image Processing Laboratory), Universidad de Valladolid, Valladolid 47011, Spain

b Departamento de Matemática Aplicada, Universidad de Valladolid, Valladolid, Spain

c Health Time Group, Córdoba, Spain

Article info

Article history:

Received 18 December 2020; accepted 25 April 2021

Keywords:

Magnetic resonance imaging; GPU; OpenCLIPER; OpenCL

Abstract

Background and objective: Recent research has reported methods that reconstruct cardiac MR images acquired with acceleration factors as high as 15 in Cartesian coordinates. However, the computational cost of these techniques is quite high, taking about 40 min of CPU time in a typical current machine. This delay between acquisition and final result can completely rule out the use of MRI in clinical environments in favor of other techniques, such as CT. In spite of this, reconstruction methods reported elsewhere can be parallelized to a high degree, a fact that makes them suitable for GPU-type computing devices. This paper contributes a vendor-independent, device-agnostic implementation of such a method to reconstruct 2D motion-compensated, compressed-sensing MRI sequences in clinically viable times.

Methods: By leveraging our OpenCLIPER framework, the proposed system works in any computing device (CPU, GPU, DSP, FPGA, etc.), as long as an OpenCL implementation is available, and development is significantly simplified versus a pure OpenCL implementation. In OpenCLIPER, the problem is partitioned in independent black boxes which may be connected as needed, while device initialization and maintenance is handled automatically. Parallel implementations of both a groupwise FFD-based registration method, as well as a multicoil extension of the NESTA algorithm have been carried out as processes of OpenCLIPER. Our platform also includes significant development and debugging aids. HIP code and precompiled libraries can be integrated seamlessly as well since OpenCLIPER makes data objects shareable between OpenCL and HIP. This also opens an opportunity to include CUDA source code (via HIP) in prospective developments.

Results: The proposed solution can reconstruct a whole 12–14 slice CINE volume acquired in 19–32 coils and 20 phases, with an acceleration factor ranging from 4 to 8, in a few seconds, with results comparable to another popular platform (BART). If motion compensation is included, reconstruction time is in the order of one minute.

Conclusions: We have obtained clinically viable times in GPUs from different vendors, with delays in some platforms that do not correspond with their market price. We also contribute a parallel groupwise registration subsystem for motion estimation/compensation and a parallel multicoil NESTA subsystem for $\ell_1$–$\ell_2$-norm problem solving.

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

∗Corresponding author.

E-mail addresses: [email protected], [email protected] (E. Martín-González), [email protected] (E. Moya-Sáez), [email protected] (R.-M. Menchón-Lara), [email protected] (J. Royuela-del-Val), [email protected] (C. Palencia-de-Lara), [email protected] (M. Rodríguez-Cayetano), [email protected], [email protected] (F. Simmross-Wattenberg), [email protected] (C. Alberola-López).

https://doi.org/10.1016/j.cmpb.2021.106143


1. Introduction

Magnetic resonance (MR) is a high-quality and highly versatile medical imaging modality. Not only can anatomical images with different contrasts be acquired, but a number of properly designed pulse sequences also provide additional information, such as diffusion, perfusion, fMRI and a few others. Dynamic imaging is also possible and MR currently constitutes the gold standard for functional studies of the heart [1]. However, its major drawback is its inherent slowness relative to other alternatives such as CT and US. A complete study in MRI may take 30–45 min, whereas CT takes a few minutes and US is observed in real time. This has the consequence of patient discomfort, which leads to motion artifacts in the images, the need of synchronization for dynamic studies, and additional difficulties for imaging non-cooperative patients, such as children or the elderly.

This being the case, the MR community is making a considerable effort to increase the speed of MR acquisition and to diminish the need of additional hardware (such as navigators, ECG or PPG synchronization, etc.). Apart from improvements in the hardware of the scanners (which provide higher and more homogeneous fields, more intense and faster gradients, etc.), many different software solutions have been described, which include parallel imaging (PI) and compressed sensing (CS). In terms of reconstruction [2], different algorithms are capable of reconstructing images from undersampled k-space information; leaving aside modern deep learning-based approaches, for which the bottleneck is the training stage, classical algorithms are optimization-based and they may be computationally demanding, depending on their complexity. Hence fast implementations and appropriate computing hardware make a difference so as to obtain solutions in clinically viable times. Here, a clinically viable time is interpreted as a reconstruction time that is compatible with repeating an acquisition (whenever the reconstruction is considered of insufficient quality) without causing an unacceptable overload in clinical routine¹. The MR reconstruction workload turns out to be highly parallelizable, so proper implementations on natively parallel devices, such as (but not limited to) GPUs, are currently playing an outstanding role.

Many different attempts have been reported to speed up reconstruction algorithms by means of parallel devices. Section 2 provides an overview of the field. As of today, the field of parallel computing has seen a tremendous growth of tailored solutions that work exclusively on GPUs from a particular vendor, namely, nVidia.

Nevertheless, other alternatives exist. One of them is OpenCL [4]; OpenCL is an open standard for parallel computing, similar in its objectives to the nVidia-exclusive language CUDA, but covers a much wider set of computing devices (including nVidia GPUs) while ensuring that source code and data are unique, thus avoiding the need for code and data replication among prospectively supported devices. However, it is generally regarded as more complicated and less mature than CUDA. As a matter of fact, OpenCL programmers must deal on their own with device selection and initialization, memory management, kernel loading and compilation, host-device interaction, and administration overload.

Another example is HIP [5]; this initiative helps developers in converting CUDA source code into a more portable form, so that it can run on both nVidia and AMD GPUs. However, as of today, the HIP API is not compatible with OpenCL, so data objects defined in OpenCL cannot be used as parameters to HIP kernels and vice versa, even though OpenCL data is indistinguishable from HIP (or CUDA) data in the GPU memory.

1 Issues related to DICOM interoperability are a step further and well-known solutions are available [3]. We concentrate on operations on the reconstruction problem once all the necessary data is available.

The framework OpenCLIPER [6] was developed to address all these shortcomings of OpenCL. In this paper, OpenCLIPER has been used to provide a parallel implementation of our previous proposal [7] on 2D cardiac cine imaging. This algorithm combines CS with groupwise motion compensation (MC) to achieve CINE sequences with acceleration factors (AFs) of 8–15x, which show higher quality than other methods with pairwise approaches for motion compensation. Nevertheless, in its non-parallel version, the post-acquisition reconstruction takes a significant computation time (about 40 min in a Core i7-4790 CPU for the full heart coverage), which is far from being clinically viable. In this paper, we show how OpenCLIPER dramatically eases the process of parallelizing a complex algorithm previously programmed in a scripting language.² As a byproduct, we have enlarged the OpenCLIPER library with additional parallel functionality, such as a groupwise version of a registration algorithm based on free form deformations (FFD) [8] or a multicoil parallel adaptation of the NESTA optimization algorithm [9]. None of these algorithms, to the best of our knowledge, are currently available on GPUs. A comparison with another popular framework (BART, see Section 2) is included. In addition, we show how to seamlessly incorporate HIP code into OpenCLIPER by solving the integration difficulty referred to above; this opens up the possibility of using CUDA code in our framework, as long as this code can be converted to HIP, at the programmer's will.

The rest of this paper is organized as follows: Section 2 reviews the subject of current parallel implementations of undersampled MRI reconstruction algorithms. Section 3 describes the innards of our proposed implementation. Section 4 evaluates the performance of our proposal in various computing devices and compares it, as previously mentioned, with another popular framework. The discussion can be found in Section 5, and Section 6 concludes the paper. A number of appendices of both mathematical and algorithmic details have been included for the sake of self-completeness.

2. Related work

The field of MR reconstruction is very active and considerable activity has been focused on algorithm implementation on GPUs. Two recent survey papers [10,11] give a general overview of the field; the former is dedicated to GPU-based medical image reconstruction while the latter is specifically targeted to MR reconstruction. Both survey papers, however, have a similar structure as for their common topic; they review FFT-based methods, for both Cartesian and non-Cartesian data, PI and CS-based applications. The 2018 survey also includes a section on reconstruction based on deep learning, probably spurred by the enormous activity that has been reported in the last two years. Both papers also quote some recent work on multi-GPU solutions. The reader is kindly invited to consult the references therein.

Apart from particular implementations of reconstruction algorithms, the field has been enriched with a number of libraries, some examples of which are AGILE [12], BART [13], Gadgetron [14] and Impatient [15], to explicitly mention four of them. Additional libraries are mentioned in our former publication on OpenCLIPER [6]. These libraries are conceived to ease the process of algorithm prototyping and testing; Gadgetron shows additional client-server characteristics, where the server takes care of the reconstruction process while the client focuses on data handling, i.e., Gadgetron could be considered both as a library and as a network service.

The common ground of these references, both in terms of specific algorithms and of reconstruction toolkits, is that they are based on the CUDA language and, consequently, they are vendor dependent. nVidia, since its onset in 2007 with the release of its CUDA application programming interface [16], has become the main supplier in the field of parallel computing with GPUs, both in academia and in industry. Its mainstream condition is accompanied, however, with some unavoidable consequences which can be summarized in two, namely, the need of code duplication should the application be run in GPU and in CPU (or in any other device), and the need of data replication in the platforms in which the application is deployed.

2 OpenCLIPER code available at http://opencliper.lpi.tel.uva.es/.

Alternatives to this technology are also available, mainly, Direct Compute issued by Microsoft and, as previously stated, OpenCL, an open source project led by the Khronos Group [11]. The latter has the benefits of being both open-source and device agnostic; the second feature makes it comply with the write-once run-anywhere (WORA) paradigm. An additional solution has been proposed, which is a domain-specific computational image reconstruction language, referred to as Indigo [17]. It provides a front-end with a number of image reconstruction facilities, as well as a back-end that is responsible for evaluating the operators provided by the front-end on a particular platform. As long as different back-ends are available, the language will follow the WORA paradigm, although as of today it seems limited to CUDA as for GPU implementation.

As we mentioned in the Introduction, in this paper we illustrate how OpenCLIPER can be used to find a solution to a 2D dynamic cardiac image reconstruction algorithm that makes whole-heart single breath-hold reconstruction possible [18] in clinically feasible times. Our algorithm makes use of an optimizer, for which we have used NESTA [9], as well as FFD [8] for motion compensation. Further algorithmic details of the method are described elsewhere [7,19]. As for the NESTA algorithm, we are only aware of a GPU version, which was recently described in Dinh et al. [20]; however, in that implementation, a single coil is apparently used. In our implementation, multiple coils are employed and the algorithm is fully parallel in the coils as well. As for the FFDs, some GPU-based implementations were described in Modat et al. [21], Ruijters et al. [22], Du et al. [23], Ellingwood et al. [24], Punithakumar et al. [25]; however, these proposals are pairwise and device-specific. We show in Royuela-del-Val et al. [7] that a groupwise registration has added benefits in terms of reconstruction quality; consequently, the implementation we have carried out is groupwise and specifically targeted for cardiac cine MRI.

3. The proposed system

3.1. The algorithm

CINE cardiac MR images can be reconstructed from highly undersampled data while keeping a very high level of detail from the fully-sampled images. A state-of-the-art algorithm on this topic was published in Royuela-del-Val et al. [7] and later enhanced in Royuela-del-Val et al. [19]. Briefly described, the algorithm starts from the multi-coil k-space subsampled information b and solves a CS reconstruction problem which gives the resulting image sequence $m_i$, at iteration step i, as

$$m_i = \arg\min_{m} \; \|b - Em\|_{\ell_2}^{2} + \lambda \|\Phi T m\|_{\ell_1} \qquad (1)$$

$m_i$ is a vectorized stack of image frames, one frame per cardiac phase. E is the encoding operator that includes the multiplication by the coil sensitivities S, the intra-frame spatial Fourier transform F and the application of the undersampling mask A. $\Phi$ is the temporal total variation (tTV) operator. T is the groupwise (GW) transformation for motion compensation (MC) that registers all the frames in the sequence to a common reference and $\lambda$ is a regularization parameter. Appendix A describes a matrix formalization of these data structures and operators. As for $m_0$, the transformation T is the identity, since there is no data from which motion information can be estimated. Then, the iterative refinement process finds $m_i$ with T obtained by the heart motion estimation (ME) on $m_{i-1}$. The whole process continues until a predefined number of iterations is met. In this manuscript, the number of ME/MC reconstruction iterations was set to two.
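The alternation between CS reconstruction and motion estimation described above can be sketched as a short control loop. This is only an illustrative outline: `solve_cs` and `estimate_motion` are hypothetical callables standing in for the NESTA solver and the groupwise registration, and representing the identity transformation by `None` is an assumption of this sketch, not something taken from the paper's code.

```python
def reconstruct(b, solve_cs, estimate_motion, n_mc_iterations=2):
    """Alternate CS reconstruction and motion estimation (illustrative only).

    b               -- undersampled multi-coil k-space data
    solve_cs        -- hypothetical callable (b, transform) -> image sequence
    estimate_motion -- hypothetical callable (image sequence) -> transform
    """
    transform = None                 # assumed convention: None means T = identity
    m = solve_cs(b, transform)       # m_0: no motion information available yet
    for _ in range(n_mc_iterations):
        transform = estimate_motion(m)   # ME on the previous estimate m_{i-1}
        m = solve_cs(b, transform)       # m_i: motion-compensated reconstruction
    return m
```

With the default of two ME/MC iterations, the solver runs three times in total (once for $m_0$ and once per refinement).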

The ME/MC is carried out by means of elastic registration [26]. As previously stated, we use FFDs [8]; Appendix B provides details on the ME/MC process. B-splines serve the purpose of interpolating the deformation field (see Eq. (B.3)) from a given control point configuration to give rise to a dense field. Specifically, we apply a GW paradigm in which no particular frame is selected as a reference, but the reference is built along the optimization process as the average of the transformed images. As for the metric to be optimized, we have used the sum of squared intensity differences (SSD) with respect to the reference image (see Eq. (B.1)); this metric is augmented with smoothness terms so as to force a realistic motion field solution (see Eq. (B.2)).

Eq. (1) is minimized with the well-known NESTA algorithm based on Nesterov's method [9]. A detailed pseudocode of the MRI reconstruction algorithm can be found in Appendix C. Specifically, Nesterov's method iteratively minimizes a function f by estimating three sequences $x_k$, $y_k$ and $z_k$. The $x_k$ sequence corresponds to the sequence we want to estimate ($m_i$ in Eq. (1)), and it is obtained from a weighted sum of the other two sequences. Nesterov's method can be used for the minimization of both smooth and nonsmooth convex functions if the appropriate smoothing techniques are used. In [9], the $\ell_1$-norm in Eq. (1) is component-wise approximated by the well-known Huber function $f_\mu(x)$, which depends on a smoothing parameter $\mu$; this parameter is iteratively decreased during the optimization process. The starting guess $x_0$ is the result of applying the adjoint encoding operator ($E^H$) to the subsampled k-space ($x_0 = E^H b$).
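To make the smoothing step concrete, the following minimal sketch evaluates the standard Huber function and the smoothed $\ell_1$-norm it induces. The function names are chosen here for illustration and the code is independent of the OpenCLIPER implementation.

```python
def huber(x, mu):
    """Standard Huber function: quadratic below mu, linear (shifted) above it."""
    a = abs(x)                      # abs() also handles complex entries
    if a < mu:
        return a * a / (2.0 * mu)
    return a - mu / 2.0


def smoothed_l1(values, mu):
    """Component-wise Huber approximation of the l1-norm of `values`."""
    return sum(huber(v, mu) for v in values)
```

As $\mu$ decreases, `smoothed_l1` approaches the true $\ell_1$-norm, which is consistent with decreasing the smoothing parameter along the optimization.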

3.2. Parallel implementation

Our parallel implementation is coded in OpenCL [4], and all parallel operations are programmed as OpenCL kernels. In the next two subsections we provide details on the ME/MC stage and the NESTA algorithm. Pseudocode for the latter is included in Appendix C.

3.2.1. Groupwise registration

The registration procedure is determined by the values of the parameters $\theta_u$ in Eq. (B.3). These parameters are obtained by minimizing the metric defined in Eq. (B.2). In this section we will concentrate on the implementation of the most resource-demanding operation, which turns out to be finding the gradient of Eq. (B.1). The gradient is, in essence, calculated as follows:

$$\frac{\partial V(x)}{\partial \theta} = \frac{\partial V(x)}{\partial m} \cdot \frac{\partial m}{\partial T(x)} \cdot \frac{\partial T(x)}{\partial \theta} \qquad (2)$$

where $V(x)$ is the metric, m represents the image sequence and $T(x)$ the transformation. Notice that x will be a grid point (with coordinates given as row and column numbers r, $1 \le r \le N_1$, and c, $1 \le c \le N_2$, respectively). In addition, each parameter $\theta$ is associated to each of the frames (with index n, $1 \le n \le N_t$) as well as to each of the control points (with coordinates, say, $(r_u, c_u)$) and each of the directions of variation, namely, horizontal (index l = 1) and vertical (l = 2); these indices will be borne in mind for the derivatives.

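The first factor in Eq. (2) can be written:

$$\left.\frac{\partial V(x)}{\partial m}\right|_{r,c,n} = \frac{2}{N_t}\left[ m^{(n)}\!\left(T^{(n)}(x)\right)_{r,c} - \frac{1}{N_t}\sum_{n=1}^{N_t} m^{(n)}\!\left(T^{(n)}(x)\right)_{r,c} \right]$$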

Fig. 1. Functional units of the proposed reconstruction system. Gray boxes represent necessary albeit unproductive work, which is taken care of by OpenCLIPER. White boxes represent actual, productive work. Each white box is abstracted as a process, which may be composed of several other processes. U, $U^H$, E, $E^H$ are the internal NESTA operators. tTV: temporal Total Variation; GW: Groupwise Registration.

with $m^{(n)}(x)$ the nth frame in sequence m, evaluated at pixel x, and $T^{(n)}(x)$ is the transformed position of pixel x in that phase. $m^{(n)}$ will be also referred to as the nth frame. This equation is computed in parallel for the $N_1 \times N_2 \times N_t$ pixels.
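The per-pixel expression for the first factor of Eq. (2) can be illustrated with a small serial sketch. It assumes, for simplicity, that the frames have already been resampled at the transformed positions $T^{(n)}(x)$ and are stored as nested lists `warped[n][r][c]`; both the function name and this data layout are hypothetical. In the parallel implementation each (r, c, n) triple would map to one work item, whereas here they are plain loops.

```python
def ssd_gradient_wrt_frames(warped):
    """Return g[n][r][c] = (2 / Nt) * (warped[n][r][c] - temporal mean at (r, c)).

    `warped` holds frames already sampled at T^(n)(x) (assumption of this sketch).
    """
    nt = len(warped)
    rows = len(warped[0])
    cols = len(warped[0][0])
    grad = [[[0.0] * cols for _ in range(rows)] for _ in range(nt)]
    for r in range(rows):
        for c in range(cols):
            # Groupwise reference: average of the transformed frames at this pixel
            mean = sum(warped[n][r][c] for n in range(nt)) / nt
            for n in range(nt):
                grad[n][r][c] = 2.0 / nt * (warped[n][r][c] - mean)
    return grad
```

Because the reference is the temporal average, the values of this factor sum to zero over the frames at every pixel.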

The second factor in Eq. (2) is the image gradient with respect to the transformation. The actual operations we perform are the interpolation of the gradient of each frame $m^{(n)}$ at position $T^{(n)}(x)$. The interpolation is carried out within a region of interest, say $\chi_m$, so the number of operations launched in parallel is $N_1^{\chi_m} \times N_2^{\chi_m} \times N_t$, where the two first factors are the dimensions of region $\chi_m$.

The last factor in Eq. (2) is the gradient of the transformation with respect to the variables $\theta_u$ to be optimized; since these variables enter Eq. (B.3) as factors, their derivative is straightforward and can be precomputed.

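Hence, the calculation of the gradient of Eq. (2) will be performed parallelizing along the rows and columns affected by the B-splines ($r_s$ and $c_s$), the frame dimension (n), the rows and columns of the control points ($r_u$ and $c_u$) and the spatial dimension (l), as can be seen below:

$$\left.\frac{\partial V(x)}{\partial \theta}\right|_{r_s,c_s,n,r_u,c_u,l} = \left.\frac{\partial V(x)}{\partial m}\right|_{r_s,c_s,n,l} \cdot \left.\frac{\partial m}{\partial T(x)}\right|_{r_s,c_s,n,l} \cdot \left.\frac{\partial T(x)}{\partial \theta}\right|_{r_s,c_s,r_u,c_u}.$$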

3.2.2. NESTA

A matrix formalism of data structures and operators (the encoding operator E and the sparsifying operator U) used in NESTA is presented in Appendix A; this formalism makes it simple to derive the adjoint operators. However, in practice, high data dimensionality implies that storage of matrices associated with operators is not feasible due to memory requirements. Consequently, operators are implemented by means of functions and the latter have been the focus of our parallelization effort. Specifically, for the encoding operator E the following parallelizations were made: 1) Multiplication by the coil sensitivities S is performed parallelizing along the spatial dimensions (i.e., $N_1 \times N_2$ operations launched in parallel). 2) The by-frame spatial Fourier transform F is performed parallelizing along the coil and frame dimensions (i.e., $C \times N_t$ parallel operations) using the clFFT [27] library. 3) Application of the undersampling mask A is performed parallelizing along the spatial dimensions (i.e., $N_1 \times N_2$ parallel operations). Note that in S, the coil and frame dimensions could have also been parallelized; nevertheless, we empirically verified that this did not yield a performance gain, presumably because of the limited number of cores in the hardware. As for the adjoint encoding operator $E^H$, the following parallelizations were made: 1) multiplication by the conjugate coil sensitivities $S^H$ and 2) the by-frame spatial inverse Fourier transform $F^H$, both of which were parallelized analogously to their counterparts S and F, respectively. 3) Summation of the resulting image sequence in each coil is parallelized along the spatial and temporal dimensions (i.e., $N_1 \times N_2 \times N_t$ operations launched in parallel).
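The composition of E (coil sensitivities, Fourier transform, mask) and of its adjoint can be checked on a toy problem. The sketch below is a serial, pure-Python illustration for a single frame and a 1D "image"; it uses a naive unitary DFT instead of clFFT, and all function names are assumptions made for this example rather than OpenCLIPER identifiers.

```python
import cmath


def dft(x, inverse=False):
    """Naive unitary DFT (scaled by 1/sqrt(n)), so the inverse is the adjoint."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n ** 0.5
    return [scale * sum(x[k] * cmath.exp(sign * 2j * cmath.pi * q * k / n)
                        for k in range(n))
            for q in range(n)]


def encode(m, sens, mask):
    """E m = A F S m: one masked k-space vector per coil."""
    out = []
    for s in sens:
        coil_image = [si * mi for si, mi in zip(s, m)]            # S
        kspace = dft(coil_image)                                  # F
        out.append([a * v for a, v in zip(mask, kspace)])         # A
    return out


def encode_adjoint(b, sens, mask):
    """E^H b = sum over coils of S^H F^H A b."""
    m = [0j] * len(mask)
    for s, coil_k in zip(sens, b):
        image = dft([a * v for a, v in zip(mask, coil_k)], inverse=True)
        for i, value in enumerate(image):
            m[i] += s[i].conjugate() * value                      # S^H, coil sum
    return m


def inner(u, v):
    """Complex inner product <u, v> = sum(u * conj(v))."""
    return sum(a * b.conjugate() for a, b in zip(u, v))
```

A quick sanity check for any operator/adjoint pair written as functions is the dot-product test: `sum(inner(e, y) for e, y in zip(encode(m, sens, mask), ys))` should match `inner(m, encode_adjoint(ys, sens, mask))` up to rounding, for arbitrary `m` and `ys`.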

Regarding the sparsifying operator $U = \Phi T$, the temporal total cyclic variation $\Phi$ is performed parallelizing along the spatial dimensions (i.e., $N_1 \times N_2$ operations launched in parallel) and the T groupwise transformation for MC is performed parallelizing along both the spatial and the frame dimensions (i.e., $N_1^{\chi_m} \times N_2^{\chi_m} \times N_t$ parallel operations). Similarly, the adjoint sparsifying operator $U^H$ is parallelized along the same dimensions as U.
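As an illustration of why the temporal variation parallelizes over the spatial dimensions, the sketch below applies a cyclic temporal difference to the time series of a single pixel; each pixel can be processed independently in the same way. The forward-difference convention (next frame minus current frame, wrapping from the last frame to the first) is an assumption of this example, since the exact definition is given in the paper's appendices.

```python
def ttv_cyclic(series):
    """Cyclic temporal difference of one pixel's time series (assumed convention)."""
    nt = len(series)
    return [series[(n + 1) % nt] - series[n] for n in range(nt)]


def ttv_cyclic_adjoint(diffs):
    """Adjoint of ttv_cyclic: previous difference minus current difference."""
    nt = len(diffs)
    return [diffs[(n - 1) % nt] - diffs[n] for n in range(nt)]
```

A constant time series maps to all zeros, which is the behaviour expected from a sparsifying transform on motion-compensated frames.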

Finally, matrix operations needed in NESTA, such as scaling a vector by a constant or addition of two vectors, are performed efficiently by the clBLAST library [28]. Thus, all the algorithm steps (see Appendix C for details) can be straightforwardly computed by combining operators E and U, their adjoints $E^H$ and $U^H$, and a few matrix operations.

3.3. Algorithm implementation on a generic parallel device

We have translated the original method into an actual device-agnostic software solution which (a) finishes in clinically viable times, comparable to other popular reconstruction frameworks, (b) is suitable for execution in parallel devices, and (c) complies with the WORA paradigm (see Section 2) so neither code nor data need be duplicated for CPU/GPU or any other device. As stated in Section 1, we make extensive use of our framework OpenCLIPER to provide this solution.

In this section we provide details on this implementation; Fig. 1 shows a diagram of the functional units of our system that will guide us through the description that follows.

3.3.1. Generalities

Together with the core algorithm (enclosed within the dashed rectangle in Fig. 1, with its operations represented by rectangles with white background), several other tasks must be carried out for the system to be functional by itself. These tasks are not only part of the initialization process but they are also active while the actual processing is taking place. All these administrative tasks (shaded boxes in Fig. 1) are dealt with by OpenCLIPER with minimal programmer intervention. While the system is at work, data passes through several transformations (called processes in OpenCLIPER) that need to be connected appropriately. Each process may be seen as a black box with a single input, a single output and an arbitrary set of parameters. This specification allows prospective programmers to interconnect and compose processes arbitrarily, as long as outputs from one are compatible with inputs to the next, since their interfaces must all comply with this specification.
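The "black box" contract (one input, one output, arbitrary parameters) is what makes processes composable. The following sketch models that idea in plain Python; the class and method names (`Process`, `launch`, `Pipeline`, `Scale`, `Offset`) are hypothetical and are not meant to mirror the actual OpenCLIPER C++ API.

```python
class Process:
    """Hypothetical black box: a single input, a single output, free-form parameters."""

    def __init__(self, **params):
        self.params = params

    def launch(self, data):
        raise NotImplementedError


class Scale(Process):
    def launch(self, data):
        return [self.params["factor"] * v for v in data]


class Offset(Process):
    def launch(self, data):
        return [v + self.params["value"] for v in data]


class Pipeline(Process):
    """A process made of other processes: the output of each stage feeds the next."""

    def __init__(self, *stages):
        super().__init__()
        self.stages = stages

    def launch(self, data):
        for stage in self.stages:
            data = stage.launch(data)
        return data
```

Since `Pipeline` is itself a `Process`, pipelines can be nested, which loosely reflects the statement that each white box in Fig. 1 may be composed of several other processes.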

An appropriate computing device must be selected (top row in Fig. 1). Since OpenCL supports a wide range of them, an additional workload is the need to detect the available platforms and devices (which may be from different vendors), and to choose the most suitable among them. Moreover, once a device has been selected, kernels must be loaded and compiled for the chosen device. OpenCLIPER simplifies these tasks significantly: kernel loading and compiling is done without user intervention when a process demands it, caching them as necessary so time is dedicated to compilation only once per device (and driver version). Platform/device detection and selection may be either specified by the programmer (via hints such as device type, model, supported CL version, etc.) or, alternatively, left to the framework by just a single line of code. In the latter case (or, in the former, whenever the specified hints match several candidate devices), the a priori fastest device is automatically chosen.

All compute-intensive transformations are performed in the computing device, so the system takes full advantage of its parallel capabilities and hence clinically viable execution times may be achieved (see quantification in Section 4). The algorithms implemented in OpenCLIPER, which are themselves processes, are made up from several other processes which may be reused at the programmer's discretion. In this sense, OpenCLIPER provides a pool of frequently used processes as a ready-to-use toolbox; for example, Hadamard matrix multiplication, fast Fourier transform, channel integration, image sum, etc. These tools make the implementation of more involved processes (such as the two described in Section 3.2) easy.

With respect to the four blocks on top of the NESTA Process in Fig. 1, after the selection of the most appropriate device (either manually or automatically), kernel loading is delayed until a process requires it. At that time, a previously compiled (i.e. cached) version of the kernel is sought. If it does not exist yet, kernel compilation is automatically triggered and the kernel cache is updated (loading of cached kernels is typically two orders of magnitude faster than on-the-fly compilation). Compilation logs (if any) are shown to the user for debugging purposes.
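The lazy "look up, else compile and store" behaviour can be sketched with an in-memory cache. This is a conceptual model only: the key composition (device, driver version, source hash) follows the description of compiling once per device and driver version, but the class name, the hashing choice and the `compile_fn` callable are assumptions of this example, and a real cache would persist binaries on disk.

```python
import hashlib


class KernelCache:
    """Toy kernel cache: compile on first request, reuse afterwards."""

    def __init__(self, compile_fn):
        self._compile = compile_fn      # hypothetical callable (source, device) -> binary
        self._store = {}
        self.compilations = 0

    def get(self, source, device, driver_version):
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        key = (device, driver_version, digest)
        if key not in self._store:      # cache miss: compile and remember
            self._store[key] = self._compile(source, device)
            self.compilations += 1
        return self._store[key]
```

Requesting the same kernel twice for the same device and driver triggers one compilation; a driver update changes the key and forces a rebuild.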

3.3.2. Input and output

MRI data loading is done in a single code line with OpenCLIPER by a front-end function (see leftmost side of Fig. 1). As of today, k-space lines are input by means of a .clf file (and its accompanying .hdr file); Matlab format files are also allowed. Output formats include the two formats just mentioned as well as other popular image formats (JPEG, PNG, etc.).

To maximize utilization of the computing device, data objects may be loaded and saved concurrently while the device is busy processing other objects. Since MRI data files are often large, this can save a noticeable amount of processing time per patient.

3.3.3. Data structures

A common burden in GPU computing lies in the need to keep track of pointers to data objects and their properties: number of dimensions, sizes and strides along each of them, data types and sizes (complex, float, integer, etc.), and passing them to the computing device. This is typically done by adding arguments to the kernels, but kernel argument space is usually limited by the computing hardware (apart from adding more burden for the users when calling their kernels). OpenCLIPER eases this burden by encapsulating all data properties within the buffer in the computing device in a way that is transparent to the kernel programmer (i.e. there is no offset between the object pointer and the real data). Thus, users just have to pass a single pointer to the data object to have all its properties readily available for the kernel code. OpenCLIPER provides support for data with an arbitrary number of frames, coils, sensitivity maps, sampling masks and data dimensionality for the special case of MRI data, and arbitrarily complex data for the general case. Our framework also provides methods to traverse data buffers along any given dimension. This functionality is provided by generating index and size tables for each data object, as shown in Fig. 1.

An additional feature of OpenCLIPER is that all sub-objects in a data object (e.g. sensitivity maps for each coil) are mapped contiguously in device memory and properly adjusted to hardware alignment, so kernel programmers can assume a linear layout when processing compound data objects. Data transfers between host and device are driven by the DMA controller to maximize speed.

Listing 1 shows a simple OpenCL kernel which sums a set of MR images along the coil dimension. It can be seen how a kernel can access attributes from input and output buffers (such as coil or frame strides, sizes, and so on) just by passing their natural pointers as arguments with no need to worry about header offsets or the like. All attribute access functions are O(1) as long as all ND arrays in the given object have the same number of dimensions (O(n) otherwise).

Listing 1. Accessing attributes of complex data structures. Note how data properties (sizes, strides and number of coils in this particular case) are available from kernel code without passing them explicitly as kernel arguments.
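As a language-neutral illustration of the idea behind Listing 1 (this is not the OpenCLIPER kernel), the sketch below keeps shape and strides together with a flat buffer and uses them to sum along the coil dimension. The `NDBuffer` class, its row-major `(coils, rows, cols)` layout and the `coil_sum` function are assumptions made for this example.

```python
class NDBuffer:
    """Hypothetical flat buffer that carries its own shape and strides."""

    def __init__(self, shape, data=None):
        self.shape = tuple(shape)                 # e.g. (coils, rows, cols)
        strides, step = [], 1
        for size in reversed(self.shape):         # row-major strides, in elements
            strides.append(step)
            step *= size
        self.strides = tuple(reversed(strides))
        self.data = list(data) if data is not None else [0j] * step
        assert len(self.data) == step

    def offset(self, *index):
        """Linear offset of an N-dimensional index, computed from the strides."""
        return sum(i * s for i, s in zip(index, self.strides))


def coil_sum(src):
    """Sum a (coils, rows, cols) buffer along the coil dimension."""
    n_coils, rows, cols = src.shape
    dst = NDBuffer((rows, cols))
    for r in range(rows):                         # one work item per pixel on a device
        for c in range(cols):
            acc = 0j
            for k in range(n_coils):
                acc += src.data[src.offset(k, r, c)]
            dst.data[dst.offset(r, c)] = acc
    return dst
```

The caller passes only the buffer object; sizes and strides travel with it, which is the property the text attributes to OpenCLIPER data objects.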

3.3.4. Additional support for researchers

OpenCLIPER has also been designed as a development tool for researchers. Hence, it includes a number of additional utilities, which we now enumerate:

1. 1D, 2D and 3D objects in device memory, either standalone or as a movie, can be graphically presented to the user at any time by a simple call to the pointerToObject->show() method; scaling and video velocity facilities are provided at the programmer's convenience. OpenCV [29] is currently used to display windows on the desktop.

2. Similarly, any object in device memory can be saved for further inspection in Matlab or any other supported format at any time.

3. When compiling in debug mode, device and host are automatically synchronized in every process so that run-time errors trigger exactly at the responsible kernel, so fault location detection should be immediate.

4. When OpenCL kernels are compiled on the fly (as opposed to loading a previously cached version), errors may show up when the host program is run. If this is the case, kernel compilation errors are automatically shown to the user.

5. In the case of program abortion, the full stack backtrace is shown to the programmer.

6. OpenCLIPER includes facilities to gather profiling data and generate statistical reports as well.

7. A memory map of data objects in device memory may be obtained at any time.

3.3.5. Incorporation of HIP and CUDA functionality

As stated in the introduction, HIP [5] is an interesting effort to bridge the gap between GPU vendors by providing facilities to migrate CUDA code into a reusable form in other platforms. However, handling exchange is still an issue; specifically, data objects defined in one language (HIP/OpenCL) cannot be used as parameters to kernels written in the other language (OpenCL/HIP). We have solved this problem by providing a HIP handle to every OpenCL data object defined through the OpenCLIPER API. This way, developers may invoke calls to HIP libraries on pure OpenCL data objects and, consequently, OpenCLIPER may benefit from libraries written in CUDA for which a HIP implementation is also available.

As an example, Listing 2 shows how an OpenCLIPER data object may be passed seamlessly to the rocFFT library [30] and its result used as an OpenCLIPER object again. After initialization (lines 11–15), a vector of complex numbers is created as an OpenCL memory object (lines 17–20) and its corresponding HIP handle is obtained (line 23). The next code section is pure HIP code in which the obtained handle is used to feed the rocFFT library (lines 25–31) and, finally, the FFT result is saved as a Matlab file (line 34).

Listing 2. Passing OpenCLIPER data objects to HIP libraries.

4. Evaluation

We have executed several reconstructions to test the behaviour of our platform. Reconstructions have been carried out on 7 healthy volunteers, courtesy of King's College London. These data are 2D Cartesian, fully sampled dynamic short-axis cine breath-hold ECG-triggered acquisitions in a 1.5T Philips scanner with a bSSFP sequence. Some relevant parameters of the acquisitions include flip angle 60°, TR/TE = 3/1.5 ms, spatial resolution 2 × 2 mm², slice thickness 8 mm, 20 cardiac phases, FOV 320 × 320 mm², 12–14 slices and between 19 and 32 channels, depending on the subject. Both sensitivity maps and k-space data from all coils were provided to us. These datasets were retrospectively subsampled with the Gaussian variable-density random undersampling pattern along the phase encoding direction described in Asif et al. [31] for different values of acceleration factor (AF). Hence, for the experiments carried out we have fully sampled images which have been used as a reference for measuring some quality indices (QIs).
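To give a feel for retrospective undersampling along the phase-encoding direction, the sketch below draws a 1D line mask whose sampling probability follows a Gaussian centred on the middle of k-space. It is a generic illustration and not the exact pattern of Asif et al. [31]: the function name, the `sigma_frac` width and the draw-without-replacement scheme are all assumptions of this example.

```python
import math
import random


def gaussian_vd_mask(n_lines, af, sigma_frac=0.25, seed=0):
    """Return a 0/1 mask keeping round(n_lines / af) phase-encoding lines."""
    rng = random.Random(seed)
    n_keep = max(1, round(n_lines / af))
    centre = (n_lines - 1) / 2.0
    sigma = sigma_frac * n_lines
    weights = [math.exp(-0.5 * ((k - centre) / sigma) ** 2) for k in range(n_lines)]
    remaining = list(range(n_lines))
    kept = set()
    while len(kept) < n_keep:
        # Weighted draw without replacement: central lines are more likely
        current = [weights[k] for k in remaining]
        pick = rng.choices(remaining, weights=current, k=1)[0]
        kept.add(pick)
        remaining.remove(pick)
    return [1 if k in kept else 0 for k in range(n_lines)]
```

For instance, a 320 mm field of view at 2 mm resolution gives 160 phase-encoding lines, so `gaussian_vd_mask(160, 4)` keeps 40 of them and `gaussian_vd_mask(160, 8)` keeps 20.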

First, we have compared our platform with BART [13] to solve Eq. (1). As for OpenCLIPER we have used NESTA. As for BART,³ since NESTA is not available in that platform, we have used the ADMM method [32]. For these comparisons, T in Eq. (1) is the identity because BART does not incorporate, to the best of our knowledge, a groupwise ME/MC implementation.

Three experiments have been conducted, namely (a) AF = 4 (AF4), (b) AF = 4 plus coil compression (AF4CC) from 19–32 to half of the channels (9–16), and (c) AF = 8 (AF8). The parameter $\lambda$ in Eq. (1) has been chosen to maximize the structural similarity index (SSIM) [33] in the BART reconstruction. Table 1 shows SSIM values for the three experiments. Thus, $\lambda = 0.01$ has been used for BART and OpenCLIPER reconstructions. For the sake of fairness, since the two optimization algorithms are different, we have chosen the internal NESTA parameters (namely, both $\mu$ and the stopping criterion, see lines 5 and 12 in Appendix C) to guarantee that SSIM values are comparable for ADMM and NESTA reconstructions; this is graphically shown in Fig. 2a, with boxplots of SSIM along all the slices of all the patients. Boxplots have been grouped in pairs, i.e., AF4 for OpenCLIPER and BART, and the same structure goes for

3 Release downloaded on February 25, 2020, from https://github.com/mrirecon/

bart/releases/tag/v0.5.00

(7)

Fig. 2. Boxplots of SSIM (a), NCC (b), and SER (c) for experiments AF4, AF4CC, and AF8 on both OpenCLIPER which makes use of NESTA, and BART (which uses ADMM). No signiﬁcant differences have been found.

(8)

Fig. 3. Example of reconstructions using BART and OpenCLIPER for the experiments AF4, AF8, and AF8 with ME/MC; the latest only for OpenCLIPER. Fully sampled images are shown as reference. End-systole and end-diastole frames are shown in each case, as well as the intensities along time of the vertical proﬁle marked with red line in reference images. Last row shows the error images (mean squared error, MSE, between reference and reconstructed image). (For interpretation of the references to colour in this ﬁgure legend, the reader is referred to the web version of this article.)

Table 1

Mean SSIM values evaluated on undersampled BART reconstructions for a single slice of all patients and the three experiments (AF4, AF4CC, and AF8). Maximal values (bold highlighted) in- dicate the optimal regularization weight ( λin Eq. (1) ) for the experiments.

Reg. Term AF4 AF4CC AF8

λ= 10 ⁻⁵ 0.8776 0.8923 0.8200

λ^{= 10}⁻⁴ ^0.8832 ^0.8979 ^0.8293

λ^{= 10}⁻³ ^0.9037 ^0.9201 ^0.8642

λ= 0 . 01 0.9056 0.9229 0.8719

λ= 0 . 1 0.8955 0.9133 0.8592

λ= 1 0.8951 0.9128 0.8589

AF4CC aswell asforAF8,asindicated inthe horizontallabelling andinthelegends ontheﬁgures. Forcompleteness,differentQIs havebeenanalyzed.Fig.2alsoincludesthecorrespondingboxplots forthe normalizedcrosscorrelation (NCC, Fig.2b) andthesignal toerrorratio(SER,Fig.2c).MannWhitneytests,⁴shownosigniﬁ- cantdifferencesinanyoftheseparametersbetweenthecompared frameworks. Additionally, Figs. 3 and 4 show two reconstruction examples fortwo differentpatients. Inboth cases,the fullysam- pledreconstructedimagesareshownasreferencewiththeimages reconstructedusingBARTandOpenCLIPERfortheexperimentsAF4 and AF8; for the latter, ME/MC reconstruction is also shown for OpenCLIPERWith˙ theseparametersetting,executiontimesiniden- tical computerload situationshave beencompared. Todeal with variability, each experimenthas beenrun one hundredtimes for eachpatient.
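The selection of the regularization weight from Table 1 amounts to a per-experiment argmax over the tested values of λ. The short sketch below reproduces that selection from the tabulated means; the dictionary layout and the helper name `best_lambda` are illustrative choices, not part of the original software.

```python
# Mean SSIM values copied from Table 1 (keys: lambda; inner keys: experiment).
table1 = {
    1e-5: {"AF4": 0.8776, "AF4CC": 0.8923, "AF8": 0.8200},
    1e-4: {"AF4": 0.8832, "AF4CC": 0.8979, "AF8": 0.8293},
    1e-3: {"AF4": 0.9037, "AF4CC": 0.9201, "AF8": 0.8642},
    1e-2: {"AF4": 0.9056, "AF4CC": 0.9229, "AF8": 0.8719},
    1e-1: {"AF4": 0.8955, "AF4CC": 0.9133, "AF8": 0.8592},
    1.0: {"AF4": 0.8951, "AF4CC": 0.9128, "AF8": 0.8589},
}


def best_lambda(table, experiment):
    """Return the lambda whose mean SSIM is largest for one experiment."""
    return max(table, key=lambda lam: table[lam][experiment])


best = {exp: best_lambda(table1, exp) for exp in ("AF4", "AF4CC", "AF8")}
print(best)  # {'AF4': 0.01, 'AF4CC': 0.01, 'AF8': 0.01}
```

The same weight is optimal for the three experiments, which is why a single value of λ is used throughout.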

Four platforms with four different GPUs have been employed. All of them are standard PC-class workstations based on Intel Core or AMD Ryzen processors. The exact GPU models are GeForce 2080Ti and Quadro RTX 6000 from nVidia, and Radeon RX 480 and Radeon RX 5700XT from AMD. For reference, some tests have also been carried out using a CPU as the computing device in a 70-thread Intel Xeon server. Notice that BART can only be run on CPUs or nVidia GPUs, so no comparison between AMD and nVidia could be made in terms of BART performance.

4 Function wilcox.test in RStudio.

Fig. 5 shows boxplots of execution times on the GeForce 2080Ti (a) and RTX6000 (b) for the three experiments, with the same ordering as in Fig. 2. As a comparison between nVidia and Radeon, Fig. 6 shows boxplots for the AMD Radeon RX 5700XT (GFX1010) and the nVidia Quadro RTX 6000 under the same conditions as in the previous experiment. In terms of performance, the FFT implementation used by the reconstruction algorithm plays a prominent role. BART uses nVidia's well-known proprietary implementation cuFFT, whereas OpenCLIPER uses AMD's open-source implementation clFFT. While the former is highly optimized for nVidia GPUs, the latter is conceived to run on every possible computing device and hence lacks any specific optimization. This results in clFFT being slower than cuFFT, as shown in Table 2. OpenCLIPER compensates for this with performance enhancements in other areas, such as parallel loading/saving of data and kernel caching (see Section 3.3).⁵
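To put the FFT gap in perspective, the relative slowdown of clFFT with respect to cuFFT can be worked out directly from the mean times in Table 2. The snippet below is only a worked calculation on those published means; the helper name `slowdown_percent` is illustrative.

```python
# Mean execution times in seconds, copied from Table 2 (nVidia devices only,
# since cuFFT is not available on the Radeon RX 5700XT).
table2 = {
    "GeForce 2080Ti": {"cuFFT": 0.6299, "clFFT": 0.6518},
    "Quadro RTX 6000": {"cuFFT": 0.5880, "clFFT": 0.6413},
}


def slowdown_percent(times):
    """Relative extra time of clFFT with respect to cuFFT, in percent."""
    return 100.0 * (times["clFFT"] / times["cuFFT"] - 1.0)


for device, times in table2.items():
    print(f"{device}: clFFT is {slowdown_percent(times):.1f}% slower than cuFFT")
```

With these figures the gap is roughly 3.5% on the GeForce 2080Ti and 9.1% on the Quadro RTX 6000.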

Fig. 7 shows boxplots of SSIM for AF8 on OpenCLIPER with and without ME/MC, whereas boxplots for NCC and SER are shown in Fig. 8. Results of the one-sided Mann-Whitney test for SSIM are significant (p = 0.008) in favor of ME/MC. If a one-sided signed rank test is run per patient, differences favor ME/MC in six out of the seven patients tested. Mann-Whitney tests are also significant for NCC and SER (p = 0.006 and p = 0.007, respectively). The price to pay is the increase in computation time, which reaches a median of 62.12 s and (q₁, q₃) = (53, 72.13) for the 2080Ti, with qᵢ the ith quartile. For the RTX6000 the median is 60.16 and (q₁, q₃) = (50.74, 69.24) (times reported are per whole slice stack).

5 In our previous work [34], FFT times reported were overestimated since they incorporated an additional synchronization of the device command queue.

Fig. 4. Example of reconstructions using BART and OpenCLIPER for the experiments AF4, AF8, and AF8 with ME/MC; the latter only for OpenCLIPER. Fully sampled images are shown as reference. End-systole and end-diastole frames are shown in each case, as well as the intensities along time of the vertical profile marked with a red line in the reference images. The last row shows the error images (mean squared error, MSE, between reference and reconstructed image). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Boxplots for experiments AF4, AF4CC, and AF8 on (a) GeForce 2080Ti and (b) RTX6000. Times reported are per whole slice stack.

Fig. 6. Boxplots of computing time for AMD Radeon RX 5700XT (GFX1010) and nVidia Quadro RTX 6000 (RTX6000) for the three experiments. Notice that red-shaded boxes coincide with the red-shaded boxes in Fig. 5. Times reported are per whole slice stack. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2
Performance comparison between cuFFT and clFFT libraries on various devices. A single experiment consists in transforming a k-space dataset (160 × 160 images, 19 coils, 20 frames, complex floats) 1000 times. Each experiment is run 100 times on each library and device. For each combination we show mean execution times (mean) and their standard deviation (std), both in seconds.

                     cuFFT               clFFT
Device               mean      std       mean      std
GeForce 2080Ti       0.6299    0.001     0.6518    0.0227
Quadro RTX 6000      0.5880    0.0016    0.6413    0.0020
Radeon RX 5700XT     N/A       N/A       0.8276    0.0015

Fig. 7. Boxplots of SSIM for AF8 on OpenCLIPER with (right, green) and without (left, red) ME/MC. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Boxplots of NCC (a) and SER (b) for AF8 on OpenCLIPER with (right, green) and without (left, red) ME/MC. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
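For readers who wish to reproduce quality indices of this kind, the sketch below computes NCC and SER between a reference and a reconstructed image. It assumes the common definitions (zero-mean normalized cross correlation, and signal to error ratio expressed in decibels) on real-valued magnitude images stored as flat lists; the exact formulas used for the figures may differ, and the function names are illustrative.

```python
import math


def ncc(ref, rec):
    """Zero-mean normalized cross correlation (assumed definition)."""
    n = len(ref)
    mean_ref = sum(ref) / n
    mean_rec = sum(rec) / n
    num = sum((a - mean_ref) * (b - mean_rec) for a, b in zip(ref, rec))
    den = math.sqrt(
        sum((a - mean_ref) ** 2 for a in ref) * sum((b - mean_rec) ** 2 for b in rec)
    )
    return num / den


def ser_db(ref, rec):
    """Signal to error ratio in dB: 10 log10(||ref||^2 / ||ref - rec||^2) (assumed)."""
    signal = sum(a * a for a in ref)
    error = sum((a - b) ** 2 for a, b in zip(ref, rec))
    return 10.0 * math.log10(signal / error)


# Toy magnitude images (hypothetical values, flattened to lists).
reference = [0.0, 1.0, 2.0, 3.0]
reconstruction = [0.1, 0.9, 2.1, 2.9]
print("NCC:", ncc(reference, reconstruction))
print("SER (dB):", ser_db(reference, reconstruction))
```

Both indices grow as the reconstruction approaches the reference, so higher values in the boxplots indicate better reconstructions.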

Finally, we have also run an experiment on CPU with both BART and OpenCLIPER; the experiment is AF4CC with a randomly selected patient and ten repetitions. For the former, the median execution time is 24.58 s ((q₁, q₃) = (24.84, 24.45)), while for OpenCLIPER the median is 16.33 and (q₁, q₃) = (16.09, 16.59).
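The summary statistics quoted for the timing experiments (median and first and third quartiles over repeated runs) can be obtained as in the following sketch. The timing values are hypothetical placeholders rather than measurements from the paper, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption, since the text does not state which one was used.

```python
import statistics


def summarize(times):
    """Return (median, (q1, q3)) for a list of execution times in seconds."""
    q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    return median, (q1, q3)


# Ten hypothetical repetitions of one experiment (placeholder values).
runs = [16.1, 16.3, 16.2, 16.6, 16.4, 16.0, 16.5, 16.3, 16.2, 16.4]
median, (q1, q3) = summarize(runs)
print(f"median = {median:.2f} s, (q1, q3) = ({q1:.2f}, {q3:.2f})")
```

Reporting the median together with the quartiles is robust to occasional slow runs caused by background load, which is useful when each experiment is repeated many times.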

5. Discussion

OpenCLIPER reveals itself as a device-agnostic platform for reconstruction of dynamic MR images. We have shown results for 2D, although the code is prepared for higher dimensionality provided that sufficient memory is available. Fig. 5 shows that our computing times are comparable to those needed by BART; this has been tested on two nVidia devices with different memory capacities. Results are not point-by-point comparable since the optimization methods used by the two approaches are different. However, we aimed to be fair by ensuring that both platforms gave rise to images of similar quality; optimization parameters were selected to this end, a goal that seems accomplished according to the evidence shown in Fig. 2. As for the computing times themselves, we observe that our times are fairly similar to those of BART; hence, no obvious losses are incurred by using OpenCLIPER, even though some administrative tasks need to be handled, as pointed out in Section 3.3.1, due to its device-agnostic character; this overhead is represented in Fig. 1 by the shaded blocks located outside the dashed box that contains the core of the reconstruction algorithm.

BART cannot be tested on AMD devices; hence only OpenCLIPER enters the comparisons in Fig. 6. The figure shows that a device one order of magnitude more economical can do a remarkable job. Therefore, OpenCLIPER makes it possible to use an affordable device for image reconstruction in clinically viable times. When ME/MC enters the algorithm, the computing time needed reaches about one minute for a multislice reconstruction, a delay that also seems realistic in a clinical setting. Whether ME/MC is worth using depends on the acceleration factor; for the AF = 8 case employed in our experiments, statistically significant differences were found.

The execution on CPU revealed that BART needed extra time for image reconstruction with respect to OpenCLIPER. Since no obvious differences were found on the nVidia devices, chances are that this is due to the fact that BART, when executed on the CPU, parallelizes only the FFT part of the reconstruction (making use of the threading support in the FFTW library), whereas in OpenCLIPER every process's kernels are executed in parallel. Additionally, clFFT (the OpenCL version of FFT used by OpenCLIPER) has been used in both CPU and GPU experiments since the kernel code is unique.

The FFT algorithm is used extensively in each reconstruction step. Hence, optimized implementations of this algorithm should have an appreciable impact on the overall reconstruction computing time. Our HIP interface could be a way to use other FFT libraries written in CUDA (cuFFT, which is known to be highly optimized for nVidia devices, might be one such alternative if its source code were available). Nevertheless, the interfacing overhead is non-negligible and this requires further investigation.

6. Conclusion and further work

In this paper, a 2D cardiac MRI reconstruction system has been presented. This system is (a) clinically viable in terms of execution times, and (b) suitable for any computing device which has an OpenCL implementation, including CPUs, GPUs, FPGAs and DSPs from main vendors. The use of our framework OpenCLIPER allowed us to partition the problem into independent black boxes (processes) which are then connected as needed and executed in parallel on any capable device, while the source code remains unique for all prospective computing devices. Device initialization and maintenance is reduced to a minimum as well, while providing relevant development and debugging aids. Administrative but time-consuming tasks such as data loading/saving and kernel compilation are parallelized or cached so as to minimally affect overall performance. HIP code (and prospectively CUDA code) can be integrated in developments whenever needed, as OpenCLIPER makes data objects shareable between OpenCL and HIP. We also contribute a parallel groupwise registration subsystem for motion estimation/compensation and a parallel multicoil NESTA subsystem for ℓ1–ℓ2 problem solving.

Further work includes extending the 2D reconstruction to the free-breathing 3D problem, which poses additional issues due to typical data volumes that extend far beyond the capacity of a single computing device's memory, along with much higher processing times.

Declaration of Competing Interest

The entire research has been funded by the Government of Spain; the authors declare that they have no conflict of interest.

Acknowledgements

This work is supported by MINECO under grant TEC2017-82408-R. In addition, the authors acknowledge the Asociación Española Contra el Cáncer (AECC) and its Scientific Foundation for its grant PRDVL19001MOYA. We also acknowledge the European Social Fund, Operational Programme of Castilla y León, and the Junta de Castilla y León, through the Ministerio de Educación. We express our gratitude to Dr. Claudia Prieto, King's College London, for kindly sharing with us the data used in the experiments.

Appendix A. Notation and main operators

We aim to reconstruct a sequence of 2D images of size N₁ × N₂ and N_t temporal frames. Data have been acquired using a coil array of C elements from M samples of a discretized k-space grid of size K = K₁K₂. The different terms included in Eq. (1) can be represented by the following matrices:

1. Multi-coil k-space subsampled data b is a vector of size M × 1.

2. Reconstructed image m is a vector of size N × 1, where N = N₁N₂N_t.

3. E = AFS is the encoding operator, which can be represented as a matrix of size M × N.

(a) S is a matrix of size NC × N given by:

\[
S = \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_C \end{pmatrix}
\]

where S_c, 1 ≤ c ≤ C, is a diagonal matrix of size N × N whose diagonal elements represent the sensitivity map of coil c at each spatial location.

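(b) F is a matrix of size M × NC given by:

\[
F = \begin{pmatrix}
F_1 & 0 & \cdots & 0 \\
0 & F_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & F_C
\end{pmatrix}
\]

where F_c, 1 ≤ c ≤ C, is a matrix of size KN_t × N representing the coefficients of the Fourier transform.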

(c) A is a matrix of size M × M given by:

\[
A = \begin{pmatrix}
A_1 & 0 & \cdots & 0 \\
0 & A_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & A_C
\end{pmatrix}
\]

where A_c, 1 ≤ c ≤ C, is a diagonal matrix of size KN_t × KN_t whose diagonal elements take the value 1 if the entry corresponds to a sensed k-space location and 0 otherwise. Note that A₁ = A₂ = ... = A_C, since the sampling mask is the same for all coils.

4. U = ΦT is the sparsifying operator, which can be represented as a matrix of size N × N. In this case, the operator is composed of two matrices:

(a) Φ is a matrix of size N × N given by:

\[
\Phi = \begin{pmatrix}
I & 0 & 0 & \cdots & -I \\
-I & I & 0 & \cdots & 0 \\
0 & -I & I & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & -I & I
\end{pmatrix}
\]

where I is an identity matrix of size N₁N₂ × N₁N₂. This matrix computes the temporal finite differences, i.e., the tTV operator. We have added a cyclical extension with respect to [26] based on the periodicity of the cardiac cycle.

(b) T is a matrix of size N × N. A detailed description of this matrix can be found in Appendix B. Recall that for m₀, the transformation T is the identity.

Note that with this formalism, the adjoint operator representation is straightforward as the Hermitian transpose of these matrices (i.e., E^H and U^H).
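As an illustration of the operator view, the cyclic temporal finite-difference matrix of item 4(a) and its adjoint can be applied without ever building the matrix. The sketch below is a plain-Python toy (frames stored as lists of complex pixels; the function names are illustrative and unrelated to the OpenCLIPER code) that also checks the adjoint relation numerically.

```python
import random


def phi(frames):
    """Cyclic temporal differences: out[t] = frames[t] - frames[t - 1]."""
    return [
        [a - b for a, b in zip(frames[t], frames[t - 1])]  # t = 0 wraps to the last frame
        for t in range(len(frames))
    ]


def phi_adjoint(frames):
    """Adjoint (transpose) of phi: out[t] = frames[t] - frames[t + 1], cyclically."""
    nt = len(frames)
    return [
        [a - b for a, b in zip(frames[t], frames[(t + 1) % nt])]
        for t in range(nt)
    ]


def inner(u, v):
    """Complex inner product <u, v> = sum(u * conj(v)) over all frames and pixels."""
    return sum(a * b.conjugate() for fu, fv in zip(u, v) for a, b in zip(fu, fv))


rng = random.Random(0)


def random_sequence(nt, npix):
    """Random complex test sequence with nt frames of npix pixels each."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(npix)]
            for _ in range(nt)]


x = random_sequence(5, 4)
y = random_sequence(5, 4)
# Adjoint test: <phi(x), y> must equal <x, phi_adjoint(y)> up to rounding.
gap = abs(inner(phi(x), y) - inner(x, phi_adjoint(y)))
print("adjoint mismatch:", gap)

# A sequence without temporal change is mapped to zero, i.e., it is maximally sparse.
static = [[1 + 2j] * 4 for _ in range(5)]
print(all(v == 0 for frame in phi(static) for v in frame))  # True
```

The same dot-product test is a convenient sanity check for any operator/adjoint pair that is implemented as code rather than stored as a matrix.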

Appendix B. Groupwise registration and motion compensation operators

The goal of the groupwise registration procedure is to jointly obtain a set of spatial transformations, one for each frame contained in the temporal sequence, so that the transformed images ideally coincide. No image is taken as a reference, to avoid any sort of bias. In practice, registered images will not be exactly coincident but they will constitute a sparse sequence in the temporal dimension, i.e., registration promotes sparsity. The target is the minimization of a cost function H (see Eq. (B.2)), which consists of a data fidelity term (the sum of squares of the intensity differences with respect to the reference that is built on the fly, see Eq. (B.1)) and a smoothness term, which prevents the onset of unrealistic transformations. Minimization is achieved by gradient descent. Gradients are calculated following standard procedures [35].

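\[
V(\mathbf{x}) = \frac{1}{N_t} \sum_{n=1}^{N_t} \left( m^{(n)}\!\left(T^{(n)}(\mathbf{x})\right) - \frac{1}{N_t} \sum_{n'=1}^{N_t} m^{(n')}\!\left(T^{(n')}(\mathbf{x})\right) \right)^{2} \tag{B.1}
\]

\[
H(\tau) = \int_{\chi} \left( V(\mathbf{x}) + \int_{0}^{T_c} \sum_{p=1}^{4} \lambda_p R_p \, dt \right) d\mathbf{x} \tag{B.2}
\]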

Eq. (B.2) shows that the metric is calculated within region χ, which may be the whole image or only an estimated area where most of the motion takes place. In these equations T_c is the cardiac cycle and λ_p, 1 ≤ p ≤ 4, are the weights of the regularization terms R_p. These terms account for derivatives of the motion vector field; specifically, R₁ and R₂ are, respectively, first- and second-order spatial derivatives, while R₃ and R₄ are first- and second-order temporal derivatives.
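The data fidelity term of Eq. (B.1) is, at each pixel, the variance of the registered intensities along time with respect to their on-the-fly mean. A minimal sketch follows, assuming the frames have already been transformed and are stored as flat lists of real intensities; the function name `groupwise_metric` is illustrative, and the smoothness terms R_p of Eq. (B.2) are not included.

```python
def groupwise_metric(registered):
    """Per-pixel V(x) as read from Eq. (B.1).

    registered[n][i] is the intensity of pixel i in the n-th registered frame.
    The reference is the temporal mean, built on the fly from the frames.
    """
    nt = len(registered)
    npix = len(registered[0])
    values = []
    for i in range(npix):
        mean_i = sum(frame[i] for frame in registered) / nt
        values.append(sum((frame[i] - mean_i) ** 2 for frame in registered) / nt)
    return values


# Two frames, two pixels: the first pixel changes over time, the second does not.
frames = [[1.0, 2.0], [3.0, 2.0]]
print(groupwise_metric(frames))  # [1.0, 0.0]
```

Summing these values over the pixels inside the region χ gives the discrete counterpart of the first term of H; a perfectly registered sequence of a static object would yield zero.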

The motion vector field is approximated by means of B-spline functions, the coefficients of which are the parameters that enter the optimization. This is the ME stage, where the dense (forward) transformation is calculated as follows:

\[
T(\mathbf{x}) = \mathbf{x} + \sum_{u_1=C_{11}}^{C_{12}} \sum_{u_2=C_{21}}^{C_{22}} \left( \prod_{l=1}^{2} B_3\!\left( \frac{x_l - p_{u_l}}{\Delta_l} \right) \right) \cdot \theta_{\mathbf{u}} \tag{B.3}
\]

where C is the grid of control points, B₃ represents the uniform B-spline function of degree 3, p_{u_l} the control point coordinate along dimension l, Δ_l is the pixel resolution of the grid of control points in each dimension and θ_u the free parameter that drives the deformation. The mesh boundaries given by C are set so that the region of interest is completely covered with a certain margin for the B-spline approximation, which avoids large displacement values that would produce inconsistencies at the edges of the region of interest and lead to interpolation errors. The registered image set is obtained by means of interpolation, which may be represented in matrix form as [26]:

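\[
T = \begin{pmatrix}
T_1 & 0 & \cdots & 0 \\
0 & T_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & T_{N_t}
\end{pmatrix}
\]

where T_t, 1 ≤ t ≤ N_t, is a matrix of size N₁ × N₂ associated with the transformation of frame t in the sequence; it contains the interpolation coefficients that result from the transformation. Its generic shape will be:

\[
T_1 = \begin{pmatrix}
 & & & & & \vdots & & & & & \\
0 & \cdots & w_1 & w_2 & 0 & \cdots & w_3 & w_4 & 0 & \cdots & 0 \\
0 & \cdots & 0 & w_1 & w_2 & \cdots & 0 & w_3 & w_4 & \cdots & 0 \\
 & & & & & \vdots & & & & &
\end{pmatrix}
\]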

i.e., the matrix will be sparse, with a degree of sparsity that depends on the interpolation order. For bilinear interpolation, for instance, only four coefficients will be nonzero in each row, as indicated in the equation. This is the MC part of the algorithm.
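The four nonzero coefficients per row can be made concrete with a small sketch. It assumes a row-major pixel ordering and clamping at the image border; both are illustrative choices (as are the function names), not details taken from the actual implementation.

```python
import math


def bilinear_row(r, c, n1, n2):
    """Nonzero entries of the interpolation-matrix row that samples position (r, c).

    The image has n1 rows and n2 columns (n1, n2 >= 2) stored row-major, so
    pixel (i, j) is column i * n2 + j of the matrix. Returns {column: weight}.
    """
    i0 = min(max(math.floor(r), 0), n1 - 2)
    j0 = min(max(math.floor(c), 0), n2 - 2)
    fr = min(max(r - i0, 0.0), 1.0)  # clamped: positions outside the image use edge values
    fc = min(max(c - j0, 0.0), 1.0)
    return {
        i0 * n2 + j0: (1 - fr) * (1 - fc),
        i0 * n2 + j0 + 1: (1 - fr) * fc,
        (i0 + 1) * n2 + j0: fr * (1 - fc),
        (i0 + 1) * n2 + j0 + 1: fr * fc,
    }


def apply_row(row, image_flat):
    """Interpolated intensity: dot product of one sparse row with the image."""
    return sum(w * image_flat[col] for col, w in row.items())


row = bilinear_row(1.25, 2.5, n1=4, n2=5)
print(row)  # two adjacent columns in one image row, two in the next image row
image = [float(v) for v in range(20)]  # image[i * 5 + j] = 5 * i + j
print(apply_row(row, image))
```

Because the two pairs of neighbouring pixels are n2 columns apart in the flattened image, each matrix row shows the pattern of two adjacent weights, a gap, and two more adjacent weights.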

These matrices and their adjoints are not explicitly built, but they are applied as operators for efficiency. For more information about B-spline free-form deformations see [36].
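A free-form deformation of the kind given in Eq. (B.3) can be sketched in a few lines. The code below uses the standard centred uniform cubic B-spline as B₃ and assumes control points placed at origin + u·Δ_l along each dimension; the function names, the grid layout and the example values are illustrative assumptions rather than the implementation used in this work.

```python
def b3(t):
    """Centred uniform cubic B-spline basis function."""
    a = abs(t)
    if a < 1.0:
        return 2.0 / 3.0 - a * a + 0.5 * a ** 3
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0


def ffd_transform(x, theta, spacing, origin=(0.0, 0.0)):
    """Transform the 2D point x by a tensor-product B-spline deformation.

    theta[u1][u2] is the 2D displacement parameter of control point (u1, u2),
    located at origin + (u1 * spacing[0], u2 * spacing[1]).
    """
    dx = [0.0, 0.0]
    for u1, row in enumerate(theta):
        w1 = b3((x[0] - (origin[0] + u1 * spacing[0])) / spacing[0])
        if w1 == 0.0:
            continue  # control point too far away along the first dimension
        for u2, th in enumerate(row):
            w = w1 * b3((x[1] - (origin[1] + u2 * spacing[1])) / spacing[1])
            dx[0] += w * th[0]
            dx[1] += w * th[1]
    return (x[0] + dx[0], x[1] + dx[1])


# An 8 x 8 grid in which every control point carries the same displacement:
# the B-spline weights add up to one, so interior points move by exactly that amount.
theta = [[(0.5, -0.25) for _ in range(8)] for _ in range(8)]
print(ffd_transform((14.3, 13.1), theta, spacing=(4.0, 4.0)))
```

Only the 4 × 4 control points nearest to a pixel contribute to its displacement, which keeps the evaluation local and therefore easy to parallelize per pixel.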

Appendix C. CS reconstruction using NESTA

Algorithm 1 MRI reconstruction.

Step 0: NESTA multicoil reconstruction
Inputs: multi-coil k-space subsampled data b, encoding operator E = AFS, sparsifying operator U = Φ, their adjoints, and parameters λ, γ, L_a ∈ R⁺
Initialization: μ⁽⁰⁾, x₀ = E^H b, the number of steps maxIter and parameter L_μ
for t = 0 to maxIter do
    Step 0.1: Apply Nesterov's algorithm with μ = μ⁽ᵗ⁾
    for k ≥ 0 do
        a. Compute the gradient of the ℓ1-norm:
           \[ \nabla f_\mu(Ux_k) = \begin{cases} \frac{1}{\mu} U^H U x_k, & \text{if } |Ux_k| \le \mu, \\ U^H \operatorname{sgn}(Ux_k), & \text{otherwise,} \end{cases} \]
           where function sgn() applied to vector v means the stack of sign(v_i/|v_i|), with v_i the ith vector component.
        b. Compute the gradient of the cost function:
           \[ \nabla f(x_k) = E^H (E x_k - b) + \lambda \nabla f_\mu(Ux_k) \]
        c. Compute y_k:
           \[ y_k = x_k - \frac{1}{\lambda L_\mu + L_a} \nabla f(x_k) \]
        d. Compute z_k:
           \[ z_k = x_0 - \frac{1}{\lambda L_\mu + L_a} \sum_{j \le k} \alpha_j \nabla f(x_j), \quad \text{where } \alpha_k = \tfrac{1}{2}(k+1). \]
        e. Compute x_{k+1}:
           \[ x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k, \quad \text{where } \tau_k = 2/(k+3). \]
        Stop when a given criterion is met
    end for
    Step 0.2: Decrease the value of μ: μ⁽ᵗ⁺¹⁾ = γ μ⁽ᵗ⁾
end for
Outputs: reconstructed image m₀ = x_{k+1}

for i = 1 to MotionIters do
    Step 1: ME-GW
    Inputs: reconstructed image m_{i−1}
    Initialization: fix the region of interest, create the control points mesh, compute B-spline products and coefficients, as in [7]
    for j ≥ 0 do
        Step 1.1: Calculate the pixel-wise displacement fields
        Step 1.2: Transform images using linear interpolation
        Step 1.3: Calculate the metric and smoothing terms
        Step 1.4: Calculate gradients
        Step 1.5: Update the movements of the control points
        Stop when
           \[ \frac{1}{KN} \left\| \theta^{n-1} - \theta^{n} \right\| < \epsilon_T \quad \text{and} \quad \frac{1}{|\chi|} \left( H^{n-1} - H^{n} \right) < \epsilon_H \]
    end for
    Outputs: T

    Step 2: NESTA reconstruction with MC
    Inputs: multi-coil k-space subsampled data b, encoding operator E = AFS, sparsifying operator U = ΦT, their adjoints, and parameters λ, γ, L_a ∈ R⁺
    Initialization: μ⁽⁰⁾, x₀ = m_{i−1}, the number of steps maxIter and parameter L_μ
    See above (Step 0) for further details
    Outputs: reconstructed image m_i = x_{k+1}
end for
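Steps a to e of Step 0 can be traced on a toy problem. The sketch below makes strong simplifying assumptions that are not part of the paper: real-valued data, E reduced to a diagonal 0/1 sampling mask (so that L_a = 1), U equal to the identity (so that L_μ = 1/μ), a fixed number of inner iterations in place of the stopping criterion, and the prox centre x₀ reset to the current estimate whenever μ is decreased. All names are illustrative.

```python
import math


def grad_smoothed_l1(u, mu):
    """Step a with U = identity: u_i / mu where |u_i| <= mu, sign(u_i) otherwise."""
    return [ui / mu if abs(ui) <= mu else math.copysign(1.0, ui) for ui in u]


def nesta_toy(b, mask, lam, mu0, gamma, max_iter, inner_iters=50):
    """Toy version of Step 0 with E = diag(mask) and U = identity (assumptions)."""
    x = [m * bi for m, bi in zip(mask, b)]  # x_0 = E^H b
    mu = mu0
    l_a = 1.0  # Lipschitz constant of E^H E for a 0/1 mask
    for _ in range(max_iter):
        step = 1.0 / (lam / mu + l_a)  # 1 / (lam * L_mu + L_a), with L_mu = 1 / mu
        x_start = list(x)              # prox centre for this value of mu
        acc = [0.0] * len(x)           # running sum of alpha_j * grad f(x_j)
        for k in range(inner_iters):   # fixed count stands in for the stopping rule
            g_l1 = grad_smoothed_l1(x, mu)                           # step a
            grad = [m * (m * xi - bi) + lam * g
                    for m, xi, bi, g in zip(mask, x, b, g_l1)]       # step b
            y = [xi - step * gi for xi, gi in zip(x, grad)]          # step c
            alpha = 0.5 * (k + 1)
            acc = [a + alpha * gi for a, gi in zip(acc, grad)]
            z = [xs - step * a for xs, a in zip(x_start, acc)]       # step d
            tau = 2.0 / (k + 3)
            x = [tau * zi + (1 - tau) * yi for zi, yi in zip(z, y)]  # step e
        mu *= gamma                                                  # Step 0.2
    return x


# Three sampled entries and one unsampled entry (mask = 0).
solution = nesta_toy([3.0, -2.0, 0.05, 1.0], [1, 1, 1, 0],
                     lam=0.1, mu0=1.0, gamma=0.1, max_iter=3)
print(solution)
```

In this separable setting the iterates approach a soft-thresholded version of the sampled data as μ decreases: large entries are shrunk by about λ, small ones are driven towards zero, and the unsampled entry stays at zero.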
