Towards an AI endowed visual neuroprosthesis for the blind: development and first-in-human implementation of a deep learning intracortical neural interface




DOCTORAL PROGRAMME IN INFORMATION AND COMMUNICATION TECHNOLOGIES

PhD THESIS

TOWARDS AN AI ENDOWED VISUAL NEUROPROSTHESIS FOR THE BLIND:

DEVELOPMENT AND FIRST-IN-HUMAN IMPLEMENTATION OF A DEEP LEARNING INTRACORTICAL NEURAL INTERFACE

Presented by ANTONIO MANUEL LOZANO ORTEGA to the Technical University of Cartagena in fulfilment of the thesis requirement for the award of PhD

Supervisor:

Dr. José Manuel Ferrández Vicente

Co-supervisors:

Dr. Eduardo Fernández Jover

Dr. Francisco Javier Garrigós Guerrero

Cartagena, 2022


Abstract

In this doctoral thesis, I aim to advance beyond the state of the art towards the development of a cortical neuroprosthesis for the blind. A fully working cortical visual neuroprosthesis able to help blind people recover a form of vision remains a challenging, ongoing project, although research groups all over the world are making strong efforts. This work lies at the intersection of neuroscience, neural engineering, and artificial intelligence.

In order to learn how to write into the brain, we hypothesize and develop a Deep Learning (DL) artificial retina able to mimic ganglion cells' firing rates in response to spatiotemporal light patterns.

We developed and tested a hardware/software neural interface pipeline composed of a camera integrated into the user's glasses and an edge-computing system that hosts the Deep Learning models. This system processes the video feed and transforms the scene's most relevant visual features into commands for a neurostimulator, which delivers electrical pulses to the brain through a Utah array.

To extract the most relevant information from a complex and dynamic visual environment, we used task-oriented DL models, such as object detection and semantic segmentation models. Furthermore, we propose and deploy computational neural encoding models, such as an artificial retina, to drive the electrical stimulation pulse patterns that are delivered into the neural tissue through a Utah array and thereby encode this information into the brain.

Finally, we discuss the evoked visual percepts generated by a multielectrode intracortical implant in a first-in-human trial.

To the best of our knowledge, we have developed and implemented the first AI-driven intracortical visual neuroprosthesis for the blind: Neurolight.

Antonio Manuel Lozano Ortega, Cartagena, February 2022


Chapter 1

Introduction

In this chapter I introduce the concept of, and the need for, a cortical neural interface for the blind that can safely and reliably produce phosphene vision. Following this, the main goals and subgoals of this work are defined within the context of neuroprosthesis research, and our plan of action for meeting them is proposed.


October 22, 2018 9:48 1850043

International Journal of Neural Systems, Vol. 28, No. 10 (2018) 1850043 (16 pages)

© World Scientific Publishing Company

DOI: 10.1142/S0129065718500430

A 3D Convolutional Neural Network to Model Retinal Ganglion Cell’s Responses to Light Patterns in Mice

Antonio Lozano^∗, Cristina Soto-Sánchez^†,‡, Javier Garrigós^∗, J. Javier Martínez^∗, J. Manuel Ferrández^∗ and Eduardo Fernández^†

∗Dpto. Electrónica, Tecnología de Computadoras y Proyectos, Universidad Politécnica de Cartagena, Cartagena, Spain

†Instituto de Bioingeniería, Universidad Miguel Hernández, Alicante, Spain

‡CIBER-BBN, Madrid, Spain

Accepted 5 September 2018

Published Online 24 October 2018

Deep Learning offers flexible, powerful tools that have advanced our understanding of the neural coding of neurosensory systems. In this work, a 3D Convolutional Neural Network (3D CNN) is used to mimic the behavior of a population of mouse retinal ganglion cells in response to different light patterns. For this purpose, we projected onto the mouse retina homogeneous RGB flashes and checkerboard stimuli with variable luminance and wavelength spectrum, to mimic a more naturalistic stimulus environment. We also used white moving bars to localize the spatial position of the recorded cells. The recorded spikes were then smoothed with a Gaussian kernel and used as the output target when training a 3D CNN in a supervised way. To find a suitable model, two hyperparameter search stages were performed. In the first stage, a trial-and-error process allowed us to obtain a system able to fit the neurons' firing rates. In the second stage, a systematic procedure was used to compare several gradient-based optimizers, loss functions and numbers of convolutional layers. We found that a three-layer 3D CNN was able to predict the ganglion cells' firing rates with high correlations and low prediction error, as measured with Mean Squared Error and Dynamic Time Warping on test sets. These models were competitive with, or outperformed, other models already used in neuroscience, such as Feed Forward Neural Networks and Linear-Nonlinear models. This methodology allowed us to capture the temporal dynamic response patterns in a robust way, even for neurons with highly variable trial-to-trial spontaneous firing rates, when providing the peristimulus time histogram as the target output of our model.

Keywords: Retina modeling; neural coding; deep learning; convolutional neural networks.

1. Introduction

Neurosensory systems are still a considerable challenge for scientists, even though great achievements have been made. The combined efforts of many biological and engineering disciplines have contributed to the knowledge and technology that allow us to prevent, treat and hopefully overcome some neurological diseases and their limitations by building the knowledge corpus, diagnostic systems, rehabilitation treatments and prostheses for the disabled.


In the case of the visual system, several retinal ganglion cell spiking models have previously been proposed. Some of them are used as general models of early and cortical neurosensory pathways, such as Linear-Nonlinear models,^6,7 Generalized Linear Models (GLMs),⁴ and Integrate and Fire models.⁵ Some of them offer a convenient tradeoff between retinal functionality reproduction and biological resemblance.^53,54 These models have obtained diverse performance results depending on the metric used (e.g. explained variance, average number of spikes and correlation), as shown in Refs. 4, 8–12. Recent studies have proposed hybrid systems that are able to outperform the preceding models, such as fine tuning physiological models with Genetic Algorithms,¹⁴ or the use of Deep Learning (DL) techniques.^13–15 These DL methodologies have proved useful across the biomedical field, with applications such as Alzheimer's disease diagnosis,⁵⁷ seizure detection,⁶² and discrimination between different forms of Rapidly Progressive Dementia,⁵⁸ to mention a few. This success is derived from recent advances in the field of machine learning and the development of new deep artificial neural networks, layers and architectures (e.g. Convolutional Neural Networks, Long Short Term Memory layers, Gated Recurrent Units, Generative Adversarial Networks, etc.). These new features help to handle the vanishing gradient problem and perform better in some situations.¹⁶ Some of these features include training techniques, such as: dropout¹⁷ for regularization purposes; parameter sharing to decrease the computational cost per layer; and several alternatives to the sigmoid and hyperbolic tangent activation functions, such as ReLU (Rectified Linear Unit), LeakyReLU, PReLU (Parametric Rectified Linear Unit) or ELU (Exponential Linear Unit) functions. In addition, the high computational cost of these machine learning systems has been tackled by the use of GPUs,¹⁸ which reduces the time required for the training and tuning procedures.

Among these advances, Convolutional Neural Networks (CNNs) have been proven to be a great tool for visual recognition problems. They outperform other traditional machine learning techniques,¹⁹ such as Support Vector Machines, and they have the significant advantage of performing end-to-end learning (i.e. avoiding the need for a feature engineering stage).

Moreover, machine learning techniques in general have been shown to outperform traditional techniques in neural decoding tasks.²⁰ This, together with the structural analogy between CNNs and the visual pathway from the lateral geniculate nucleus through V1, V2 and V4 to the inferior temporal cortex,²¹ following the way of Fukushima's Neocognitron,²² makes CNNs a likely suitable solution for the bioinspired vision encoding task.

Convolutional neural networks have been successfully used to model visual information processing.^1,2 In particular, 3D Convolutional Neural Networks (3D CNNs) have demonstrated the capacity to learn spatiotemporal features.³ These results suggested that 3D CNNs could be a promising paradigm to model the dynamic responses of retinal ganglion cells to light pattern stimuli and, therefore, they were selected for use in this work.

Recently, a data-driven methodology that makes use of CNNs has been used to model tiger salamander retinal ganglion cells¹³ in response to both white noise and gray scale natural scenes. In Ref. 13, CNNs were shown to model the retinal responses to both natural and artificial stimuli more accurately than the previous techniques, and to generalize better across stimulus types. Following the same kind of approach, recurrent neural networks have recently been used to predict primate retinal responses to gray scale natural images.²³

In this work, a deep learning data-driven approach is proposed to address one of these early neurosensory challenges: modeling the neural activity of a mammalian retina in response to stimulation with light patterns. In this approach, retinal neural recordings obtained in response to homogeneous and checkerboard flash patterns, built from RGB combinations to stimulate the mouse retina differentially, were processed and fed into a supervised machine learning system, a 3D CNN, that is able to process spatiotemporal visual stimuli and reproduce the retinal behavior. This 3D CNN is trained to mimic the firing rates obtained after stimulating real mouse retinas with different simple RGB stimulus combinations and moving bars.

The ultimate aim is to build a system that is able to model simultaneously the responses of the mouse retinal ganglion cells to a variety of stimuli with different RGB combinations, which mimic a more naturalistic stimulus scenario, and to moving bars, which determine the spatial localization of the recorded cells, in order to explore the capability of the model to yield useful information about the modeled neurons.

The rest of this paper is organized as follows. Section 2 presents the methods for recording the retinal responses and describes the proposed modeling approach. Section 3 shows and discusses the results obtained with our experimental setup and with the implemented 3D CNN model architecture, and compares it with other models used in neuroscience. Finally, Sec. 4 summarizes the main contributions of our research and outlines the remaining challenges.

2. Proposed Methodology

2.1. Ethical approval

All of the experimental procedures were carried out in conformity with directive 2010/63/EU of the European Parliament and of the European Council, and with Spanish regulation RD/53/2013 on the protection of animals in experiments for scientific purposes, and with the approval of the Miguel Hernández University Committee for Animal Use in Laboratories.

2.2. Materials and methods: Retinal recordings

The retinas were extracted following the same procedure as in Ref. 24. Briefly, the animals were dark-adapted for 1 h and then sacrificed by cervical dislocation. The eyes were immediately enucleated, and the retinas were dissected and mounted on an agar plate with the ganglion cell layer facing up. The retinas were then mounted on a recording chamber and superfused with Ringer medium (124 mM NaCl, 2.5 mM KCl, 2 mM CaCl₂, 2 mM MgCl₂, 1.25 mM NaH₂PO₄, 26 mM NaHCO₃ and 22 mM glucose) bubbled with 95% O₂ and 5% CO₂ at physiological temperature. The whole procedure was performed under dim red illumination.

Extracellular recordings were obtained from the retinal ganglion cells with a Utah electrode array consisting of 10 × 10 platinum electrodes with a 400 µm inter-electrode spacing.²⁵ The data recorded from each electrode were digitized with 16-bit resolution and a 30 kHz sampling rate, and were then stored for further analysis. The recorded spike events were sorted with an open source tool, Neural Sorter,²⁶ for off-line spike sorting analysis. Figure 1 summarizes this procedure.

Fig. 1. Data acquisition procedure.

2.2.1. Visual stimulation

Four different patterns of visual stimulation were projected onto the retinas, spanning 4×4 mm on their surface. The first one consisted of a set of six different RGB combination stimuli of 500 ms duration, interspersed with a dark uniform stimulus for 1500 ms, repeated 25 times, for a total duration of 300 s. The irradiance spectra for the monitor RGB flashes, measured with a spectrometer (SM-240 Spectrometer, CVI Spectral Products, US), were: a 612 nm peak for red; a peak at 544 nm for green; three peaks for blue, with a maximum at 544 nm, a second peak at 487 nm and a third peak at 435 nm; four peaks for magenta, with a maximum at 612 nm and three more peaks at 544 nm, 487 nm and 435 nm; two peaks for yellow, with a maximum at 544 nm and a second peak at 612 nm; and two peaks for cyan, with a maximum at 544 nm and a second peak at 487 nm. The second and third visual patterns consisted of two sequences of 24 binary checkerboard RGB combinations (each square spanning 8×8 pixels and 16×16 pixels, respectively) of 500 ms duration, followed by a dark uniform stimulus for 1500 ms, repeated 10 times, for a total stimulus length of 480 s. The light intensity of the stimulus at the surface of the retina ranged from 0.88 to 4.88 cd/m² (ColorCal, Cambridge Research Systems). Finally, we projected a set of moving bars that spanned 0.25 mm when projected onto the retina, with fixed velocity, different orientations (0°, 30°, 45°, 90°, 135°) and two directions for each orientation, making a total of eight moving patterns, with each bar taking one second to cross the whole screen.

The recorded spike trains of a total of 37, 4, 4 and 17 neurons, respectively for each experiment, were used in the analysis to test the proposed modeling approach. The cell types observed fell into the following categories: sustained ON (20), transient ON (5), transient OFF (9) and ON-OFF (3).

2.2.2. Models input and output

Each training sample consisted of 50 frames of input stimulus (the actual image projected onto the retina plus 49 additional image frames accounting for the stimulus history) and the model's target output: the smoothed ganglion cell firing rate corresponding to the presentation of the actual image in each time bin. Thus, the model receives a video representation of the stimulus and the stimulus history as an input, and its goal is to learn to perform the necessary spatiotemporal computations in order to successfully predict the corresponding firing rate for that stimulus. The video frames are represented as images of 140×140×3 pixels (140×140×1 pixels in the case of the moving bars), each frame corresponding to 10 ms of visual stimulus, with a total length of 500 ms per video input to the model. Each training input therefore comprises a new image frame together with 49 frames corresponding to 490 ms of the preceding stimulus. This allows us to incorporate the spatiotemporal information that is needed to model the response of the ganglion cells.
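The pairing of a 50-frame stimulus window with the firing rate of its newest 10 ms bin can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names are hypothetical, frames are ordered oldest first, and the first 49 frames are simply skipped because they lack a full history (how the start of the recording was actually handled is not stated here).

```python
FRAME_MS = 10        # duration of one stimulus frame (from the text)
HISTORY_FRAMES = 50  # current frame plus 49 frames of stimulus history

def window_indices(t):
    """Indices of the 50 frames whose newest frame is frame t (oldest first)."""
    start = t - HISTORY_FRAMES + 1
    if start < 0:
        raise ValueError("not enough stimulus history before frame %d" % t)
    return list(range(start, t + 1))

def make_samples(frames, rates):
    """Pair every full 50-frame window with the firing rate of its newest bin.

    Assumption: windows without a complete history are dropped.
    """
    samples = []
    for t in range(HISTORY_FRAMES - 1, len(frames)):
        video = [frames[i] for i in window_indices(t)]
        samples.append((video, rates[t]))
    return samples
```

With this layout each input spans HISTORY_FRAMES × FRAME_MS = 500 ms of stimulus, matching the video length described above.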

The target output of the model is a spiking probability function for each of the neurons. This function comes from the convolution of the spike trains with a Gaussian function, which transforms the discrete spike events into a virtually continuous target to fit. This turns a classification problem (discrete classes corresponding to the number of spikes in a bin) into a regression problem (a continuous firing rate function).
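As an illustration of this smoothing step, the sketch below convolves binned spike counts with a normalized Gaussian kernel. The kernel width (`sigma_bins`) and the truncation at three standard deviations are assumptions made for the example; the width used in this work is not specified in this section.

```python
import math

def gaussian_kernel(sigma_bins):
    """Normalized Gaussian kernel truncated at 3 sigma (assumed truncation)."""
    radius = int(3 * sigma_bins)
    weights = [math.exp(-0.5 * (k / sigma_bins) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_spike_counts(counts, sigma_bins=2.0):
    """Convolve binned spike counts with the kernel (zero padding at the edges)."""
    kernel = gaussian_kernel(sigma_bins)
    radius = len(kernel) // 2
    smoothed = []
    for t in range(len(counts)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = t + k - radius
            if 0 <= j < len(counts):
                acc += w * counts[j]
        smoothed.append(acc)
    return smoothed
```

Because the kernel sums to one, a single spike far from the edges is spread over neighbouring bins without changing the total spike count.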

Given that the darkness lasted 1500 ms, to allow the retina to recover from consecutive light stimulations, a substantial subset of the training data consisted of spontaneous firing during the recovery time after the light stimulus was removed. Therefore, only the initial 500 ms of darkness after each flash were included in the model, which reduced the training time and produced a more balanced dataset.

2.3. Proposed modeling approach

In this work, a completely data-driven approach is proposed to develop our model, in contrast to the physiologically inspired models of the retina used in Ref. 27, where the retina models are built from the mathematical description of the retinal microcircuits. Our goal is to build a system that is able to learn by itself the computations needed to reproduce the recorded response of the biological retina; that is, the instantaneous firing rates.

The architecture of the proposed model (Fig. 2) consists of a series of 3D convolutions of volumes of data coming from the stimulus video frames, each followed by a nonlinear activation function, a maxpooling operation and a batch normalization operation. These blocks of convolution, activation, maxpooling and batch normalization are concatenated several times. This reduces the dimensionality of the output volume of the convolutional layers, extracting the most relevant descriptive features and compressing the spatiotemporal information. After these convolutional stages, the resulting tensor is flattened and fed into a fully connected layer, followed by another nonlinear activation function (ReLU or PReLU) that outputs the firing probability or firing count prediction for each neuron in the corresponding 10 ms bin.

Fig. 2. Illustration of one of the 3D CNN architectures tested, with 8 (a), 16 (b) and 32 (c) filters, respectively, in the convolutional layers and 3 × 3 × 4 kernels. Each convolution of a filter with the previous layer's activations is followed by a nonlinear activation function, a maxpooling and a batch normalization.

The 3D convolutions performed by this model are linear operations where a three-dimensional filter, also called a kernel, is passed through the entire input volume. At each position, the filter's coefficients are multiplied element-wise with the input values they cover and the products are summed, generating a filtered output volume (see Fig. 3).
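To make the operation concrete, here is a deliberately naive single-channel sketch of a 3D convolution with stride 1 and no padding. It is written for readability, not speed, and is not the implementation used in this work; as in common deep learning frameworks, the kernel is not flipped. The function name and the nested-list layout `volume[h][w][t]` are assumptions of the example.

```python
def conv3d_valid(volume, kernel):
    """Slide kernel[kh][kw][kt] over volume[h][w][t]; stride 1, no padding."""
    H, W, T = len(volume), len(volume[0]), len(volume[0][0])
    kh, kw, kt = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for y in range(H - kh + 1):
        plane = []
        for x in range(W - kw + 1):
            row = []
            for t in range(T - kt + 1):
                # Sum of element-wise products between the kernel and the
                # patch of the input volume it currently covers.
                acc = 0.0
                for a in range(kh):
                    for b in range(kw):
                        for c in range(kt):
                            acc += kernel[a][b][c] * volume[y + a][x + b][t + c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

Each output dimension shrinks to (input size − kernel size + 1), which is why the convolutional stages progressively compress the stimulus volume.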

The nonlinear activation function passes positive inputs through unchanged. For negative inputs, it returns zero in the case of ReLU, or the input value multiplied by a learnable parameter in the case of the PReLU activation.
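The two activations can be written in a couple of lines. In this sketch `alpha` stands for the learnable PReLU slope, which is fixed by hand here purely for illustration.

```python
def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return x if x > 0 else 0.0

def prelu(x, alpha):
    """Parametric ReLU: negative inputs are scaled by the slope alpha."""
    return x if x > 0 else alpha * x
```

For example, `prelu(-2.0, 0.25)` scales the negative input by the slope, whereas `relu(-2.0)` discards it.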

The Batch Normalization operation normalizes each layer's mini-batch input during training, which allows faster convergence by reducing the phenomenon known as internal covariate shift.⁶³

Fig. 3. Illustration of the model's basic operation: a three-dimensional convolution, where a filter (also called a kernel) is passed through the entire input volume, generating a new output volume by multiplying its weights with the input values at each position. Here, H, W and T denote, respectively, the dimensions height, width and time, while the subscripts i, o and k stand for input, output and kernel. Each layer of the model has several of these filters, whose weights need to be learned during training.

Maxpooling is a downsampling strategy that consists of a sliding volume that is passed through its input and yields the maximum of the input values at each position, which in this case are the rectified activations of the convolutional neurons from the previous layer.
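A minimal sketch of max pooling over a single-channel volume is given below. It assumes non-overlapping windows (stride equal to the pool size) and drops incomplete windows at the borders; both are assumptions of this example, and the function name is hypothetical.

```python
def maxpool3d(volume, pool):
    """Max over non-overlapping (ph, pw, pt) windows of volume[h][w][t]."""
    ph, pw, pt = pool
    H, W, T = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for y in range(0, H - ph + 1, ph):
        plane = []
        for x in range(0, W - pw + 1, pw):
            row = []
            for t in range(0, T - pt + 1, pt):
                row.append(max(volume[y + a][x + b][t + c]
                               for a in range(ph)
                               for b in range(pw)
                               for c in range(pt)))
            plane.append(row)
        out.append(plane)
    return out
```

With a pool of (2, 2, 1), the spatial dimensions are roughly halved while the temporal dimension is preserved.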

In the case of the homogeneous flash patterns, three convolutional layers followed by a fully connected layer were enough to fit the ganglion cells' firing rates; when fitting the checkerboard patterns, four convolutional layers were more effective.

The actual spike trains are then generated by means of a Poisson point process, which is preceded by a firing rate normalization in the case of the PSTH (Peri-Stimulus Time Histogram) curve fitting introduced later on. The Poisson point process²⁸ is used to generate stochastic events with a certain probability during a certain time interval. It has been widely used to generate ganglion cell spikes.^4,6,13,23 The probability of a spike occurring during a short time interval is given by Eq. (1).

$$P\{\text{one spike during } dt\} \approx r(t)\,dt, \tag{1}$$

where r(t) is the instantaneous firing rate and dt is a small time window such that r(t)dt ≪ 1 for the approximation to be valid. In our case, the firing rate is provided by the CNN output for the next 10 ms simulation interval, and the time resolution was set to dt = 1 ms to fulfill the approximation requirements.
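Equation (1) translates directly into a Bernoulli draw per time step. The sketch below generates spikes for one 10 ms bin at 1 ms resolution; it assumes that the predicted rate is expressed in spikes per second, and the function name and the `rng` argument are hypothetical.

```python
import random

def spikes_from_rate(rate_hz, bin_ms=10, dt_ms=1, rng=random):
    """Draw 0/1 spikes for one bin; each dt step fires with probability r*dt."""
    p = rate_hz * dt_ms / 1000.0  # r(t) * dt, with dt converted to seconds
    if p > 1.0:
        raise ValueError("r(t)*dt must not exceed 1 for Eq. (1) to make sense")
    return [1 if rng.random() < p else 0 for _ in range(bin_ms // dt_ms)]
```

Passing a seeded `random.Random` instance as `rng` makes the generated spike trains reproducible.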

2.4. Single trial versus PSTH fitting

2.4.1. Single trial responses

Two different fitting approaches, both supervised, were tested in our application. In the first one, the smoothed ganglion cell responses were fed to the CNN in a single-trial way; that is, each training example consisted of 500 ms of stimulus input and 10 ms of the biological neurons' firing probability to be learned, over the whole recording. This led to a multiple-batch training experiment, where each batch corresponds to one entire pass through the light stimulus pattern, which is repeated through time, for a total of 25 batches in the case of the homogeneous RGB flashes and 10 batches for the checkerboard patterns. Given that the ganglion cell responses to each trial presented a certain degree of variability, this training approach led our model to learn a general response shape of the firing probability distribution across the trials for each stimulus presentation to the retina. For the homogeneous RGB flash patterns, convergence between training and validation loss was achieved around the 17th of the 25 batches fed to the network in the first training epoch, meaning that the model achieved its best fit of the retinal behavior before the whole recording of the responses to the pattern had been shown to the neural network.

2.4.2. PSTH as prediction target

In the second training approach, the target function presented to each neuron model to be learned consisted of a smoothed version of the Peri-Stimulus Time Histogram (PSTH) for each neuron in a single batch. The PSTH is one of the most commonly used methods to estimate the firing rate of a neuron over a time interval, and it is usually averaged over the number of trials:

$$\mathrm{PSTH}(t) = \frac{1}{N} \sum_{i=1}^{N} SP_i(t), \tag{2}$$

where N is the number of trials and SP_i(t) is the spike count of the ith trial in the time bin t.
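Equation (2) amounts to a per-bin average across trials, as in the following sketch. It assumes that every trial has already been binned into the same number of time bins; the function name is hypothetical.

```python
def psth(trials):
    """Average spike count per time bin over N equally long trials (Eq. 2)."""
    n_trials = len(trials)
    n_bins = len(trials[0])
    return [sum(trial[t] for trial in trials) / n_trials
            for t in range(n_bins)]
```

For instance, four trials in which the first bin contains a spike twice give a PSTH value of 0.5 for that bin.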

Thus, the 25 repetitions of the uniform RGB flashes, and the 10 repetitions of the checkerboards, were each reduced to a single batch. Hence, the CNN was taught to learn the most characteristic response shapes for each ganglion cell. With this training approach, four epochs through the cells' responses were enough to achieve convergence for both the RGB flashes and the checkerboard patterns. This means that we can shorten the training time by approximately 75% with respect to the previous method, indicating that the PSTH approach was a more efficient training strategy than the single trial approach.
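One way to read this saving for the RGB flashes is as a ratio of passes over the stimulus. As a rough check, assuming that one PSTH epoch costs about as much as one single-trial batch (an assumption made for this illustration, not a measured quantity), four PSTH epochs against the roughly 17 single-trial batches needed for convergence give

$$1 - \frac{4}{17} \approx 0.76,$$

which is in line with the reduction of approximately 75% quoted above.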

These results will be useful in future stimuli design, where the search for a trade-off between the number of different stimuli and the number of repetitions will be considered.

2.5. Model training

Training was performed over four different datasets: homogeneous RGB flashes, two checkerboard datasets of different size (8×8 and 16×16 pixels per square element) with binary RGB combinations, and a moving bars dataset. Here, we define a data sample as an input/output pair of the dataset that will be fitted by the DL model. The flashes training set consisted of 30,000 data samples (obtained from 300 s of stimulus-response recording, as sliding volumes of frames paired with the spike firing probability of the recorded neurons), with a 70%/10%/20% split for training/validation/test. The training data are used to adjust the model weights iteratively by means of a gradient descent optimization algorithm, the validation set is used to stop the training of the model before it overfits, and the test set is used to provide the actual metrics and results. The same ratio was used for the checkerboard and moving bars datasets, with a total duration of 480 s of stimulus-response recording (48,000 data samples) for each of the checkerboard stimuli, and 80,000 data points for the moving bars.
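The dataset sizes follow from the 10 ms bin width, and the split sizes from the stated percentages. The sketch below only reproduces this arithmetic; the helper names are hypothetical, and whether the split was contiguous or shuffled is not specified here.

```python
BIN_MS = 10  # one data sample per 10 ms bin (from the text)

def n_samples(duration_s):
    """Number of data samples obtained from a recording of duration_s seconds."""
    return duration_s * 1000 // BIN_MS

def split_sizes(n, percents=(70, 10, 20)):
    """Training/validation/test sizes for integer percentages summing to 100."""
    train = n * percents[0] // 100
    val = n * percents[1] // 100
    test = n - train - val  # the remainder goes to the test set
    return train, val, test
```

For the flashes dataset, `split_sizes(n_samples(300))` corresponds to 21,000 training, 3,000 validation and 6,000 test samples.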

The network specification was performed using the widely adopted TensorFlow³¹ and Keras³² deep learning frameworks. The training process was performed on CPU+GPU instances of the cloud computing service FloydHub.³³ The CPU nodes had two cores and 7 GB of RAM. The GPU nodes were configured with four cores, 60 GB of RAM and Nvidia Tesla K80 GPUs with 12 GB of memory.

Training and fine tuning a Deep Learning system is a complex process that regularly requires finding optimum values in a huge hyperparameter search space for variables such as the number of layers, number of filters for each layer, filter dimensions, dropout percentage, activation functions, loss function, optimizers, batch size selection and weight initialization, among others.

In this work, a two-stage model customization was performed. In the first stage, a manual search was carried out to find a suitable architecture and a first model that achieved a reasonably good fit of the firing rates. Then, a systematic combination of optimizers, loss functions and numbers of layers was carried out to evaluate the resulting variation in performance and to find the best suboptimal solution among them.

2.5.1. First hyperparameter search stage

In this initial stage, an empirical, trial-and-error process was carried out to adjust the size and several of the parameters of the different layers, including the number of filters per layer, dropout percentage, pooling, activation functions, batch size and others, until a reasonably good fit of the firing rates was achieved.

The number and sizes of the filters of the convolutional layers were varied to achieve the best results. As expected, the temporal dimension of the kernels was the most relevant set of parameters, and the first layer's parameters had a key role in the results, given that this first convolution addresses the RGB values of the monitor's pixels. The relevant operations performed by the model can be seen in Table 1, where we show one of the tested model architectures.

A LeCun uniform function was used for the weight initialization, which takes samples from a uniform distribution parametrized in relation to the number of inputs to that layer.²⁹ Specifically, the weights are randomly drawn from a distribution with mean zero and a standard deviation given by

$$\sigma_w = m^{-1/2}, \tag{3}$$

where m is the number of inputs to the node.
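A uniform distribution on [−a, a] has standard deviation a/√3, so choosing a = √(3/m) yields the standard deviation of Eq. (3). The sketch below draws weights this way; it is an illustration with hypothetical names, not the framework's own initializer code.

```python
import math
import random

def lecun_uniform(fan_in, n_weights, rng=random):
    """Draw weights uniformly in [-limit, limit] with limit = sqrt(3 / fan_in).

    The resulting distribution has mean zero and standard deviation
    fan_in ** -0.5, as in Eq. (3).
    """
    limit = math.sqrt(3.0 / fan_in)
    return [rng.uniform(-limit, limit) for _ in range(n_weights)]
```

Drawing many weights for `fan_in = 100` should therefore give an empirical standard deviation close to 0.1.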

L1 and L2 weight regularizers,³⁰ and eventually activity regularizers, were also included among the parameters for every layer. Adding an L1 cost term, Eq. (4), as a weight regularizer promotes sparse feature selection, forcing the weights associated with irrelevant features to be close to zero, because L1 depends on the sum of the absolute weight values:

$$L_1 = \lambda \sum_{i=1}^{k} |w_i|. \tag{4}$$

Additionally, the L2 term, Eq. (5), penalizes high-value weights:

$$L_2 = \lambda \sum_{i=1}^{k} w_i^2. \tag{5}$$
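Both penalty terms are simple sums over a layer's weights and are added to the training loss. A minimal sketch, with hypothetical function names and `lam` standing for the regularization strength λ:

```python
def l1_penalty(weights, lam):
    """Eq. (4): lambda times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Eq. (5): lambda times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)
```

Because the L2 term grows quadratically, a single large weight dominates it, whereas the L1 term penalizes all non-zero weights in proportion to their magnitude.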

2.5.2. Second hyperparameter search stage

In the second stage, we systematically combined a series of optimizers and loss functions to obtain information about the effect of these hyperparameters on the performance, to track the evolution of the mean correlation coefficient as the number of convolutional layers increased, and to fine tune the model to achieve the best performance. Adam, Adamax and RMSprop were the optimizers tested. The loss functions evaluated were Poisson, Mean Squared Error (MSE) and Mean Squared Logarithmic Error (MSLE), which lead to objective function fittings of slightly different shapes. The number of layers was varied between 1 and 4. The results obtained in this second hyperparameter search stage were categorized by the mean correlation coefficient achieved between our model and the biological ganglion cell responses (Fig. 4).

Table 1. Example of a tested network's architecture.

Layer type      Output dimension         Kernel shape             Kernel #   Param #
                (height, width, time)    (height, width, time)
Conv3D          139 × 139 × 41           2 × 2 × 10               8          968
ReLU            139 × 139 × 41           None                     8          0
MaxPooling3D    69 × 69 × 41             2 × 2 × 1                8          0
BatchNorm       69 × 69 × 41             None                     8          32
Conv3D          68 × 68 × 37             5 × 5 × 3                16         2576
ReLU            68 × 68 × 37             None                     16         0
MaxPooling3D    34 × 34 × 37             2 × 2 × 1                16         0
BatchNorm       68 × 68 × 37             None                     16         64
Conv3D          33 × 33 × 33             4 × 4 × 2                32         10272
ReLU            33 × 33 × 33             None                     32         0
MaxPooling3D    16 × 16 × 33             2 × 2 × 2                32         0
BatchNorm       33 × 33 × 33             None                     32         128
Conv3D          14 × 14 × 29             4 × 4 × 2                32         46112
ReLU            14 × 14 × 29             None                     32         0
BatchNorm       14 × 14 × 29             None                     32         128
MaxPooling3D    7 × 7 × 29               2 × 2 × 1                32         0
Flatten         45472                    None                     1          0
Dense           37                       None                     1          1682501
ReLU            37                       None                     1          0

Fig. 4. Mean correlation coefficient between the biological neurons' responses and the models, versus the number of convolutional layers. Different performance was achieved when changing the number of layers and combining several optimizers and cost functions.
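Some entries of Table 1 can be reproduced with the usual bookkeeping formulas for unpadded, stride-1 convolutions. The sketch below checks the first Conv3D row (a 140 × 140 × 50 input with 3 color channels) and the Flatten and Dense rows; the helper names are hypothetical and the formulas are the standard ones, not code taken from this work.

```python
def conv3d_output_shape(input_shape, kernel_shape):
    """Output (height, width, time) of an unpadded, stride-1 3D convolution."""
    return tuple(i - k + 1 for i, k in zip(input_shape, kernel_shape))

def conv3d_param_count(kernel_shape, in_channels, filters):
    """Weights of all filters plus one bias per filter."""
    kh, kw, kt = kernel_shape
    return kh * kw * kt * in_channels * filters + filters

def dense_param_count(n_inputs, n_outputs):
    """One weight per input/output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

# First Conv3D row: 50 frames of 140 x 140 RGB pixels, 2 x 2 x 10 kernel, 8 filters.
first_shape = conv3d_output_shape((140, 140, 50), (2, 2, 10))
first_params = conv3d_param_count((2, 2, 10), in_channels=3, filters=8)

# Flatten and Dense rows: 7 x 7 x 29 positions with 32 filters, 37 output neurons.
flat_size = 7 * 7 * 29 * 32
dense_params = dense_param_count(flat_size, 37)
```

These values agree with the 139 × 139 × 41 output and 968 parameters of the first layer, and with the 45472 flattened units and 1682501 dense parameters listed in the table.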

3. Results

3.1. Model performance

Two metrics were initially used to measure the goodness of the fit during the development of the model: the Poisson loss between model predictions and unseen data (related to the log-likelihood of two variables under a Poisson distribution assumption), and Pearson's correlation coefficient between PSTHs generated from 25 trials of the same stimulus (six RGB flashes) for each of the neurons at the network's output.
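Both metrics are short to write down. The sketch below uses the common form of the Poisson loss, the mean of ŷ − y·log(ŷ), with a small constant added inside the logarithm for numerical stability; the constant `eps` and the function names are assumptions of this example.

```python
import math

def poisson_loss(y_true, y_pred, eps=1e-7):
    """Mean of y_pred - y_true * log(y_pred + eps) over all bins."""
    return sum(p - t * math.log(p + eps)
               for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

The Poisson loss is smallest when the predicted rate equals the observed count, while the correlation coefficient compares the shapes of the model and biological PSTHs regardless of scale.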

As described in Sec. 2.5, a two-stage hyperparameter search was performed. During the first stage (manual search), the Poisson loss objective function was monitored on the training and validation sets as training was performed on the flashes dataset. In addition to the Poisson loss, other indicators, such as Mean Absolute Error (MAE) and Mean Squared Error (MSE), were used to monitor the procedure. Finally, the training was stopped when the validation set error started to increase, meaning that the model had started to overfit.
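This stopping rule can be sketched as a scan over the validation losses recorded after each epoch. The `patience` parameter (how many non-improving epochs are tolerated) is an assumption added for illustration; with the default of zero, training stops at the first increase, as described above.

```python
def early_stopping_epoch(val_losses, patience=0):
    """Index of the epoch whose weights would be kept under early stopping."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited > patience:
                break  # validation error stopped improving: stop training
    return best_epoch
```

A larger patience tolerates brief fluctuations of the validation error at the cost of a few extra training epochs.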

During the second hyperparameter search (systematic combination of optimizers and loss functions), the results were compared; they are shown in Fig. 4. For the Adam optimizer and its variant Adamax in combination with the Poisson loss, the model's performance decreased when more than three convolutional layers were used, while for the other combinations of optimizers (Adam, Adamax, RMSprop) and loss functions (MSE, MSLE, Poisson) the performance stabilized. Models with one or two layers were only able to fit the behavior of neurons with very well defined and simple response shapes, which lowered the measured overall performance, while three- and four-layer models were more effective at capturing the response of neurons with more complex responses, even for noisy neurons.

Fig. 5. (Color online) Sample of a ganglion cell's PSTH fittings in response to six different homogeneous flash patterns, where the three models evaluated were successful. The biological neuron responses are shown in black, and the models' predictions are shown in blue, red and yellow. Horizontal lines in the upper part of the figure represent the length (500 ms) of the stimulus.

The results showed a significant correlation, given the intrinsic variability of the neurons' responses. For example, Fig. 5 illustrates the PSTHs from both model and biological neuron responses measured within 10 ms bins, showing qualitatively coherent behavior.
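For reference, a PSTH with 10 ms bins can be built from repeated trials as in the sketch below. It assumes that each trial is given as an array of spike times in milliseconds relative to stimulus onset; the function name and the input format are illustrative choices, not those of the original analysis code.

```python
import numpy as np


def psth(trials, duration_ms, bin_ms=10.0):
    """Trial-averaged firing rate (spikes/s) in consecutive time bins."""
    n_bins = int(round(duration_ms / bin_ms))
    edges = np.linspace(0.0, duration_ms, n_bins + 1)
    counts = np.zeros(n_bins)
    for spike_times in trials:
        counts += np.histogram(spike_times, bins=edges)[0]
    # Divide by the number of trials and the bin width in seconds.
    return counts / (len(trials) * bin_ms / 1000.0)


# Two hypothetical trials of a 30 ms recording.
example_trials = [[5.0, 15.0, 16.0], [7.0]]
rates = psth(example_trials, duration_ms=30.0, bin_ms=10.0)
```

Here the first two bins each contain two spikes over two trials of 10 ms, which gives 100 spikes/s, and the last bin is empty.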

Fig. 6 shows that the model responds to each stimulus with a different characteristic firing-probability waveform: the firing rate rises, with a slightly different delay for each neuron, when the light is shown, and the spiking probability drops or rises when the light fades out or after some milliseconds of darkness. In brief, ON, OFF and ON/OFF behaviors were observed in both the real retinal recordings and the model predictions, and the CNN was at the same time able to model a variety of neuronal behaviors in response to the same stimulus, with different qualitative responses and different firing rate baselines.

Fig. 6. (Color online) Dynamic firing function of a real neuron (blue) versus model neuron (black). The model is able to fit different types of neurons at once, with different base firing rates and characteristic shapes. The left column shows seven examples of the modeling results obtained by fitting the PSTH directly for several checkerboard patterns. The right column shows the fit obtained with the single-trial approach for the same neuron. It can be observed that the PSTH training approach offers more robust results (a higher number of neurons was successfully modeled), while with the single-trial approach some of the characteristic responses were missed or less accurately captured.

The CNN was also very sensitive to the Gaussian smoothing of the actual spike train: the shape of the fitted curve changed with the standard deviation of the Gaussian, and when the standard deviation was too low the model predicted the mean firing rate poorly, as it was unable to compensate for the variability of the neuron's response (data not shown).
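The effect of the smoothing width can be illustrated with a minimal NumPy sketch that convolves a binned spike train with a normalized Gaussian kernel. The kernel truncation at four standard deviations, the zero padding at the edges and the function name are assumptions made for this example.

```python
import numpy as np


def gaussian_smooth(binned_spikes, sigma_bins):
    """Smooth a binned spike train with a unit-area Gaussian kernel."""
    signal = np.asarray(binned_spikes, dtype=float)
    radius = int(np.ceil(4 * sigma_bins))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma_bins) ** 2)
    kernel /= kernel.sum()
    # Zero-pad so that the output has the same length as the input.
    padded = np.pad(signal, radius)
    return np.convolve(padded, kernel, mode="valid")


spikes = np.zeros(101)
spikes[50] = 1.0  # a single spike in the middle of the train
narrow = gaussian_smooth(spikes, sigma_bins=1.0)
wide = gaussian_smooth(spikes, sigma_bins=10.0)
```

A small standard deviation keeps each spike as a sharp peak, so trial-to-trial variability dominates the target signal; a larger one spreads the same spike over many bins and yields a smoother estimate of the rate.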

Figure 7 illustrates raster plots for both biological and model ganglion cells' responses to 29 repeated trials of 6 RGB flashes for 10 different neurons. The CNN's neurons predict responses qualitatively similar to those of the real ganglion cells with different behaviors, with firing activity concentrated after the light flashes and similar temporal dynamics. The results in Fig. 7 are shown with a smoothing standard deviation of 10 ms.

Fig. 7. Comparison between recorded and model-generated raster plots for several neurons in response to homogeneous RGB images.

In order to compare our model's performance with other data-driven tools well established in neuroscience, we fitted Linear-Nonlinear (LN) models⁶,⁷ in addition to a three-layer Feed Forward Neural Network (FFNN).³⁵ The Linear-Nonlinear model consists of a single linear filter and a rectification nonlinearity for each neuron. Our alternative baseline model, the FFNN, has already been shown to perform competitively with or better than GLMs,³⁴ and was configured with 30 and 20 neurons in its two hidden layers, respectively, and with as many output neurons as firing rates to fit. Both of these models targeted all the neurons' firing rates simultaneously, so that they could be compared with the proposed 3D CNN model.
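A Linear-Nonlinear model of the kind described above reduces to a linear projection of the stimulus followed by rectification. The sketch below shows the forward pass for one neuron, assuming a flattened stimulus and a half-wave rectifier; the `bias` term and the function name are additions for illustration only.

```python
import numpy as np


def ln_response(stimulus, linear_filter, bias=0.0):
    """Rectified linear projection of each stimulus sample.

    stimulus: array of shape (n_samples, n_features)
    linear_filter: array of shape (n_features,)
    """
    drive = np.asarray(stimulus, dtype=float) @ np.asarray(linear_filter, dtype=float)
    return np.maximum(drive + bias, 0.0)


# Three hypothetical two-pixel stimuli and a filter with an excitatory
# first pixel and an inhibitory second pixel.
stim = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
rate = ln_response(stim, [2.0, -1.0])
```

Only the first stimulus drives the filter above zero; the other two are clipped by the rectification, which is the single nonlinearity available to this model.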

Both kinds of models were fitted with a hyperparameter search similar to the one used in the development of the proposed model: L1 and/or L2 weight regularization, assessment of the effects of weight initialization on the model's convergence, dropout, and overfitting control by early stopping, to mention a few. The dimensions of the dataset provided to the 3D CNN, LN and FFNN models were reduced to 50 × 50 × 3 (50 × 50 × 1 in the case of the moving bars) in order to reduce the number of LN and FFNN parameters to a manageable order of magnitude, since these models lack the weight sharing that CNN architectures offer.²¹ Several metrics were used to measure and compare the models' performance: the correlation coefficient, the Mean Squared Error and Dynamic Time Warping (DTW),³⁶ a metric especially suitable for measuring the similarity of time series that are alike but elastically warped in time and therefore correlate poorly. The base distance for the DTW calculation was the squared Euclidean distance. Our model obtains competitive results under almost all metrics and circumstances on the test sets, where the model's generalization is especially relevant. Table 2 shows model performance with standard errors on the training and test sets for the correlation coefficient and MSE, while the DTW was evaluated on the entire dataset, without randomizing the training samples, thus preserving the temporal structure of the firing rate responses and ensuring a proper time-dependent measure. The computation was performed using a Python implementation of the FastDTW³⁷,³⁸ algorithm, which reduces DTW's high computational load while providing accurate results.
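To make the DTW metric concrete, the sketch below implements the exact dynamic-programming recursion with a squared Euclidean base distance for one-dimensional series. It is not the FastDTW approximation used in this work: it runs in quadratic time and is shown only to illustrate what the measure computes.

```python
import numpy as np


def dtw_distance(a, b):
    """Exact DTW cost between two 1D series, squared Euclidean base distance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


same = dtw_distance([0, 1, 2], [0, 1, 2])
stretched = dtw_distance([0, 1, 1, 2], [0, 1, 2])
offset = dtw_distance([0, 0], [1, 1])
```

A series and a time-stretched copy of it are at distance zero, whereas a constant offset of one unit over two samples costs 2. This is the elastic behavior that makes the measure complementary to the correlation coefficient and the MSE.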

Table 2. Results under several metrics for different models and stimuli: homogeneous flashes, checkerboards (CB) and moving bars.

                     C.Corr.                           MSE                                 DTW
Stimulus    Model    Train            Test             Train             Test              Entire sequence

Flash       LN       0.825 ± 0.031    0.777 ± 0.034    0.997 ± 0.720     1.210 ± 0.844     269.419 ± 35.531
            FFNN     0.890 ± 0.035    0.847 ± 0.036    0.448 ± 0.293     0.701 ± 0.446     182.995 ± 41.090
            CNN      0.939 ± 0.027    0.901 ± 0.030    0.253 ± 0.232     0.445 ± 0.366     97.410 ± 13.94

16 × 16 CB  LN       0.696 ± 0.030    0.593 ± 0.034    0.679 ± 0.37      0.973 ± 0.434     529.834 ± 382.119
            FFNN     0.550 ± 0.035    0.513 ± 0.035    0.475 ± 0.03      0.855 ± 0.202     1101.075 ± 464.148
            CNN      0.632 ± 0.027    0.615 ± 0.031    0.561 ± 0.228     0.889 ± 0.382     572.4 ± 258.830

8 × 8 CB    LN       0.404 ± 0.030    0.397 ± 0.034    0.927 ± 0.379     1.199 ± 0.431     1106.157 ± 459.875
            FFNN     0.576 ± 0.035    0.539 ± 0.036    0.390 ± 0.172     0.858 ± 0.459     1562.719 ± 787.745
            CNN      0.688 ± 0.027    0.696 ± 0.030    0.445 ± 0.143     0.699 ± 0.270     682.920 ± 292.300

Bars        LN       0.589 ± 0.030    0.468 ± 0.038    12.340 ± 2.259    13.376 ± 2.386    5667.365 ± 1486.142
            FFNN     0.590 ± 0.055    0.504 ± 0.052    9.614 ± 1.572     11.690 ± 1.967    4821.189 ± 1079.966
            CNN      0.629 ± 0.030    0.562 ± 0.034    10.210 ± 1.740    11.065 ± 1.843    5137.772 ± 1021.121

Lastly, as a first approximation to explore the capability of the proposed models to provide insights into the computations of biological neurons, we trained a 3D CNN on the ganglion cells' responses to moving bars, without max-pooling operations and with the last fully connected layer replaced by a layer of 17 convolutional filters of the same size as the output of the previous layer, which yielded the firing rate predictions for the 17 neurons modeled in the bars dataset. In this way, the spatiotemporal structure of the input is maintained through the model. We averaged the weights learned by the last convolutional layer of the 3D CNN along the time axis, then visualized them and compared them with the time-averaged filters of a Linear-Nonlinear model trained with gradient descent (GD) optimizers and with the Spike Triggered Average (STA), traditionally used to fit Linear-Nonlinear models, as a baseline. This baseline STA was obtained with the open-source neurophysiology analysis tool pyret.³⁹ The learned filters were coherent with the results yielded by the standard method (STA) used to obtain the most informative spatial information for the neurons, and were also clearer than the results obtained by the LN model in our case. Two of the visualizations are reproduced in Fig. 8, showing that the filters of the obtained model are similar to the results of methods designed specifically for obtaining this kind of information.⁴⁰

Fig. 8. (Color online) Visualization of the learned filters for the developed 3D CNN, in comparison with a gradient descent fitted Linear-Nonlinear model and a STA. Red pixels represent the weights with higher values, while blue pixels represent the weights with lower values.
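The spike-triggered average used as a baseline can be written in a few lines of NumPy. The sketch below is an independent illustration and does not reproduce the pyret interface; it assumes a stimulus array of shape (time, height, width) and a spike-count vector aligned with the stimulus frames.

```python
import numpy as np


def spike_triggered_average(stimulus, spike_counts, history):
    """Average the `history` stimulus frames that precede each spike.

    stimulus: array of shape (T, H, W); spike_counts: array of shape (T,).
    Returns an array of shape (history, H, W).
    """
    stimulus = np.asarray(stimulus, dtype=float)
    spike_counts = np.asarray(spike_counts, dtype=float)
    sta = np.zeros((history,) + stimulus.shape[1:])
    total = 0.0
    for t in range(history - 1, len(spike_counts)):
        if spike_counts[t] > 0:
            # Weight the stimulus window by the number of spikes in the bin.
            sta += spike_counts[t] * stimulus[t - history + 1 : t + 1]
            total += spike_counts[t]
    return sta / total if total > 0 else sta


# Hypothetical one-pixel stimulus with five frames.
frames = np.arange(1.0, 6.0).reshape(5, 1, 1)
counts = [0, 0, 1, 0, 2]
sta = spike_triggered_average(frames, counts, history=2)
spatial_map = sta.mean(axis=0)  # time-averaged filter, as visualized in Fig. 8
```

Averaging the result along the time axis gives a spatial map of the kind compared with the learned convolutional weights.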

4. Discussion and Future Work

In this paper, a data-driven methodology to model retinal ganglion cell responses to different light patterns has been applied to a biological mouse retina with positive results, consistently showing a high correlation with the firing rates obtained from the biological spike train responses, along with low prediction error as measured by Mean Squared Error and Dynamic Time Warping. Two different approaches were successfully tested: fitting single-trial responses and fitting the PSTH, the latter giving better spiking simulation results. In addition, the Peri-Stimulus Time Histogram built from Poisson-simulated spike trains behaved similarly for the model and the real retina. To achieve these results, several structural and parametric decisions were taken, resulting in a 3D CNN model that showed high sensitivity to the activity and parameter regularizers on the dense layers, on the one hand, and to the variance of the Gaussian with which the spike trains were filtered, on the other. These facts indicate that the handling of retinal spiking variability will play an important role in future developments of the model.
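The Poisson simulation of spike trains from a predicted firing rate, and the PSTH rebuilt from those trains, can be sketched as follows. The sketch assumes an inhomogeneous Poisson process with independent bins; the bin width, the number of trials, the seed and the function names are illustrative choices.

```python
import numpy as np


def simulate_poisson_trials(rate_hz, bin_ms, n_trials, seed=0):
    """Draw spike counts per bin for repeated trials of a given rate profile."""
    rng = np.random.default_rng(seed)
    expected = np.asarray(rate_hz, dtype=float) * bin_ms / 1000.0
    return rng.poisson(expected, size=(n_trials, expected.size))


def psth_from_counts(counts, bin_ms):
    """Trial-averaged firing rate (spikes/s) from binned spike counts."""
    return counts.mean(axis=0) / (bin_ms / 1000.0)


model_rate = np.array([0.0, 50.0, 100.0])  # hypothetical predicted rates, in Hz
sim_counts = simulate_poisson_trials(model_rate, bin_ms=10.0, n_trials=2000)
sim_psth = psth_from_counts(sim_counts, bin_ms=10.0)
```

With enough trials the PSTH of the simulated trains converges to the rate profile that generated them, so it can be compared directly with the PSTH of the recorded neuron.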

The model presented here provides a new tool to model the neural encoding of spatiotemporal light patterns by mouse retinal ganglion cells. In particular, we will set up new experiments that will allow us to further extend the ability of our model to mimic a wider and more representative range of retinal behaviors, and to reveal new insights into them.

During the development of this work, several limitations of different nature were encountered, both biological and computational. The first was the neural spiking variability. This characteristic of the ganglion cells' response was partially handled by Gaussian smoothing and by a second training approach: fitting the PSTH instead of the single-trial responses. The results obtained during network training showed that the convolutional model, with the correct hyperparameter settings, is able to discriminate between a neuron's most distinctive response to the stimulus and the neuronal noise. This can be seen either as a positive feature of the model (it is able to fit the behavior of even noisy neurons and extract their characteristic firing) or as a failure at the task of developing a fully biomimetic artificial ganglion cell. However, the matter of chaos in neural population coding⁴¹ and the role of noise and spontaneous firing in neural information⁴²,⁴³ are still open to discussion and study in the scientific community, and are outside the scope of the present work. In general, our model mimicked with higher precision those neurons with a well-shaped response and low noise levels associated with spontaneous firing, even when they present complex, nonlinear dynamics.

The second training approach, PSTH fitting, implies that obtaining a more robust model of the response of neurons with high variability requires several repetitions of the stimulus, so that a characteristic shape can be obtained for the response of each ganglion cell. This leads to a second limitation, the duration of the experiment. Given the finite time interval during which a biological retina can be kept functional in the recording setup, the set of stimuli must be carefully selected to study the desired retinal behavior. This matter will be treated as a fundamental variable in future work, where we will search for a suitable trade-off between number of repetitions and response robustness.

In the development of a retina model, the question of individual versus population image encoding arises. Although the 3D convolutional neural network proposed here is able to simultaneously mimic the activity of a variable number of neurons, there is no intrinsic interdependence between the activities of the artificial neurons created in the model's architecture. Two different approaches can be followed from here. The first is to develop an enhanced model that takes into account not only the stimulus presented to the retina but also the activity of the spatially adjacent neurons. The second is to accept the hypothesis of independent neuronal coding and develop lighter functional models of different ganglion cell types, which could be implemented in a functional artificial retina suitable for a prosthesis. For this purpose, it would be necessary to successfully record and identify a sufficiently varied number of neuron types and then study the role of each one in the retinal coding of an actual image in order to mimic this complex process.

Another key limitation of the process of building any neural model is the measure of the goodness of fit. As mentioned in Ref. 44, the design of performance metrics for neurosensory models is not a trivial task, and widely used metrics such as Pearson's correlation coefficient do not always yield information that helps to determine whether the models perform well when modeling the explainable variance and the response variability. In this work, the correlation coefficient was selected as a simple metric to compare the architectures of different models and the tuning of their parameters. We also used distance-based metrics, more adequate for time-series comparison: the basic MSE and the DTW, an algorithm that finds the optimal alignment between two time series that may be warped in time, measuring a kind of elastic distance. However, more sophisticated metrics could be applied to systematically evaluate several solutions, such as modern nonlinear time-frequency analysis of population coding.⁴⁴,⁴⁵ Other quantitative and indirect techniques could also serve to validate the model, such as the development of linear/nonlinear decoding systems to reconstruct images from the neural activity.²⁴,⁴⁶ In particular, this last approach could reveal how effectively the constructed models have encoded the information in comparison with the biological reference.

On the computational/modeling side, hyperparameter search remains one of the main tasks whose efficiency could be improved. In our work, a two-stage procedure was applied (manual search followed by a systematic combination of optimizers and loss functions). Some results show that a random search can be more efficient than a manual or grid search for several machine learning algorithms and datasets, because not all hyperparameters are equally important for the model's accuracy and, thus, a grid search wastes time on too many trials of hyperparameters that do not significantly improve the model's performance.⁴⁷
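The difference between the two search strategies can be sketched with the standard library alone. The search space below reuses the optimizers, loss functions and layer counts mentioned in this paper, but the dictionary layout, the function names and the sampling budget are assumptions made for the example; neither function trains a model.

```python
import itertools
import random

search_space = {
    "optimizer": ["adam", "adamax", "rmsprop"],
    "loss": ["mse", "msle", "poisson"],
    "n_conv_layers": [1, 2, 3, 4],
}


def grid_configs(space):
    """Every combination of the listed values (exhaustive grid)."""
    keys = list(space)
    value_lists = [space[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def random_configs(space, n_samples, seed=0):
    """A fixed budget of configurations drawn independently at random."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_samples)]


grid = grid_configs(search_space)
sampled = random_configs(search_space, n_samples=10)
```

The grid grows multiplicatively with every added hyperparameter (36 configurations here), whereas the random budget stays fixed regardless of how many dimensions are searched.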

In order to ensure their reliability in real-world situations, deep learning models should be tested in the most diverse situations possible. In this regard, researchers in the neuroscience field face the challenge of generalization, in which the creation of suitable datasets plays a critical role.

In our work, we focus on implementing a methodology to create accurate models of retinal ganglion cells that generalize competitively with, or outperform, other traditional models when evaluated on samples unseen during training, showing that our model generalizes well and can therefore be a competitive and useful tool for the research community. Extending these results to highly complex and realistic stimuli is one of the main natural next steps of our research, and the generation of suitable recorded datasets containing realistic visual stimuli will play a major role in its success.

Among our proposals for future work are: the use of new or enhanced CNN models and techniques, such as recurrent layers, that take the temporal states of each neuron into account; a change in the spike generation to more advanced and flexible models, such as the inhomogeneous Gamma and inverse Gaussian models proposed in Ref. 48; and the visualization of the inner activations and learned filters, as in Ref. 13, which may help in understanding retinal computations and in building more computationally optimized networks, as done in Ref. 49. Finally, it should be noted that deep learning is a rapidly evolving field. New architectures, training procedures and algorithms are emerging quickly, such as GANs⁵⁰ and the more recent Capsule Nets,⁵¹ which bring new capabilities and drawbacks whose usefulness for our modeling purposes should also be explored, in addition to the third generation of neural networks, such as spiking neural network⁵² models.

With regard to the application domain, our efforts will also include working on efficient image reconstruction from both real and simulated retinal ganglion cells, studying the effect of noise on CNN performance⁵⁹ for both encoding and decoding tasks, and using highly complex and realistic light patterns, which would allow us to build a more powerful and better-generalizing retinal model that could be suitable for hardware implementation in devices such as FPGAs.⁶⁰,⁶¹

Finally, although deep learning algorithms are usually thought to be black-box algorithms, this is less accurate in the case of convolutional neural networks, as the knowledge learned by the network is suitable for visualization. Current frameworks⁵⁵ incorporate recent techniques to visualize and interpret these representations of learned knowledge.⁵⁶ It is possible to visualize the intermediate outputs of the network to understand how it transforms the inputs at each stage, to visualize the learned internal filters to understand which pattern each filter is receptive to, or to visualize the class activation map that identifies which parts of an input led to its assignment to a class; this will be another line of future development. Our hope is thus to take a step further in the development of realistic and bioinspired retinal models that are suitable for biomedical applications, such as disease simulation and prosthesis development.

Acknowledgments

This work was supported by the Programa de Ayudas a Grupos de Excelencia de la Región de Murcia, Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia, by the grant MAT2015-69967-C3-1-R from the Spanish Government, and by the project TEC2015-66878-C3-2-R (MINECO/FEDER, UE) from the Spanish Government.

References

1. H. Wen, J. Shi, Y. Zhang, K.-H. Lu and Z. Liu, Neural encoding and decoding with deep learning for dynamic natural vision, arXiv:1608.03425.

2. H. Lee, C. Ekanadham and A. Y. Ng, Sparse deep belief net model for visual area V2, in Adv. Neural Inf. Process. Syst. 20 (2008) 873–880.

3. D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in Proc. IEEE Int. Conf. Comput. Vis. (Santiago, Chile, 2015), pp. 4489–4497.

4. J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky and E. P. Simoncelli, Spatio-temporal correlations and visual signalling in a complete neuronal population, Nature 454 (2008) 995–999.

5. N. Burkitt, A review of the integrate-and-fire neuron model: I. Homogeneous synaptic input, Biol. Cybern. 95(1) (2006) 1–19.

6. E. J. Chichilnisky, A simple white noise analysis of neuronal light responses, Network: Comput. Neural Syst. 12 (2001) 199–213.

7. S. A. Baccus and M. Meister, Fast and slow contrast adaptation in retinal circuitry, Neuron 36(5) (2002) 909–919.

8. J. W. Pillow, L. Paninski, V. J. Uzzell, E. P. Simoncelli and E. J. Chichilnisky, Prediction and decoding of retinal ganglion cell responses with a probabilistic spiking model, J. Neurosci. 25(47) (2005) 11003–11013.

9. N. A. Lesica and G. B. Stanley, Encoding of natural scene movies by tonic and burst spikes in the lateral geniculate nucleus, J. Neurosci. 24(47) (2004) 10731–10740.

10. J. Keat, P. Reinagel, R. C. Reid and M. Meister, Predicting every spike: A model for the responses of visual neurons, Neuron 30(3) (2001) 803–817.

11. S. V. David and J. L. Gallant, Predicting neuronal responses during natural vision, Netw. Comput. Neural Syst. 16(2–3) (2005) 239–260.

12. K. A. Zaghloul, K. Boahen and J. B. Demb, Contrast adaptation in subthreshold and spiking responses of mammalian Y-type retinal ganglion cells, J. Neurosci. 25(4) (2005) 860–868.

13. L. McIntosh, N. Maheswaranathan, A. Nayebi, S. Ganguli and S. Baccus, Deep learning models of the retinal response to natural scenes, in Adv. Neural Inf. Process. Syst. Vol. 29 (Barcelona, Spain, 2016), pp. 1369–1377.

14. R. Crespo-Cano, A. Martínez-Álvarez, A. Díaz-Tahoces, S. Cuenca-Asensi, J. M. Ferrández and E. Fernández, On the automatic tuning of a retina model by using a multi-objective optimization, in Artificial Computation in Biology and Medicine (Elche, Spain, 2015), pp. 108–118.

15. D. Turcsany, A. Bargiela and T. Maul, Modelling retinal feature detection with deep belief networks in a simulated environment, in Proc. of the European Council for Modelling and Simulation (Brescia, Italy, 2014), pp. 364–370.

16. K. He, X. Zhang, S. Ren and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in Proc. IEEE Int. Conf. Comput. Vis. (Santiago, Chile, 2015), pp. 1026–1034.

17. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.

18. S. Shi, Q. Wang, P. Xu and X. Chu, Benchmarking State-of-the-Art deep learning software tools, in 7th Int. Conf. on Cloud Computing and Big Data (Macau, China, 2016), pp. 99–104.

19. A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, in Adv. Neural Inf. Process. Syst. 25 (Lake Tahoe, Nevada, 2012) 1097–1105.

20. J. I. Glaser, R. H. Chowdhury, M. G. Perich, L. E. Miller and K. P. Kording, Machine learning for neural decoding, arXiv:1708.00909.

21. Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521(7553) (2015) 436–444.

22. K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern. 34(4) (1980) 193–202.

23. E. Batty, J. Merel, N. Brackbill, A. Heitman, A. Sher, A. Litke, E. J. Chichilnisky and L. Paninski, Multilayer recurrent network models of primate retinal ganglion cell responses, in Proc. Int. Conf. on Learning Representations (Toulon, France, 2017), pp. 1–12.

24. A. Díaz-Tahoces, A. Martínez-Álvarez, A. Moll, L. Humphreys, J. Bolea and E. Fernández, Towards the reconstruction of moving images by populations of retinal ganglion cells, in Artificial Computation in Biology and Medicine (Elche, Spain, 2015), pp. 220–227.

25. E. Fernández, J. M. Ferrández, J. Ammermüller and R. Normann, Population coding in spike trains of simultaneously recorded retinal ganglion cells, Brain Res. 887(1) (2000) 222–229.

26. NEural SOrter, a tool for offline electrophysiological recording analysis, http://soruceforge.net/projects/neuralsorter.

27. P. Martínez-Cañada, C. Morillas, B. Pino, E. Ros and F. Pelayo, A computational framework for realistic retina modeling, Int. J. Neur. Syst. 26(7) (2016) 16500301.

28. L. F. Abbott and P. Dayan, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems (The MIT Press, Cambridge, MA, 2005).

29. Y. LeCun, L. Bottou, G. B. Orr and K.-R. Müller, Efficient backprop, in Neural Networks: Tricks of the Trade, eds. G. B. Orr and K.-R. Müller (Verlag Berlin Heidelberg, 1998), pp. 9–50.

30. A. Y. Ng, Feature selection, L1 vs. L2 regularization, and rotational invariance, in Proc. of the Twenty-first International Conf. on Machine Learning (Banff, Alberta, Canada, 2004), pp. 78–86.

31. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, S. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, W. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu and X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, in Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Learning (2015), https://www.tensorflow.org/about/bib.

32. F. Chollet, Keras (2015), https://github.com/fchollet/keras (accessed March, 2017).

33. FloydHub, https://www.floydhub.com (accessed March, 2017).

34. A. S. Benjamin, H. L. Fernandes, T. Tomlinson, P. Ramkumar, C. VerSteeg, R. Chowdhury, L. Miller and K. P. Kording, Modern machine learning outperforms GLMs at predicting spikes, bioRxiv 111450 (2017), doi: https://doi.org/10.1101/111450.

35. K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2(5) (1989) 359–366.

36. H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. on Acoustics, Speech, and Signal Processing 26(1) (1978) 43–49.

37. S. Salvador and P. Chan, Toward accurate dynamic time warping in linear time and space, Intelligent Data Analysis 11(5) (2007) 561–580.

38. Slaypni, A Python implementation of FastDTW (2015), https://github.com/slaypni/fastdtw (accessed July, 2017).

39. B. Naecker, N. Maheswaranathan, S. Ganguli and S. Baccus, Pyret: A Python package for analysis of neurophysiology data, J. Open Source Software 2(9) (2017), Miscellaneous Resource.

40. M. Fiorani, J. C. B. Azzi, J. G. M. Soares and R. Gattass, Automatic mapping of visual cortex receptive fields: A fast and precise algorithm, J. Neurosci. Methods 221 (2014) 112–126.


