The experiment runs iteratively: 1) the positions of the ball and of the paddle are updated, 2) the new position of the ball is read out and the actor network receives the corresponding input, 3) the actor network determines the target position of the paddle and we calculate the reward, 4) we update the synapses according to the plasticity rule, 5) the mean expected reward is updated. If the player loses the ongoing game (the ball touches the lower wall), the position of the ball is reset and the ball starts in a random direction. A flowchart of the game loop is shown in figure 4.4.
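The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the side walls and the upper wall are assumed to reflect the ball elastically, the paddle is assumed to sit on the lower wall, and `actor` is a hypothetical stand-in for the spiking actor network. The numerical values are those of table 4.1; steps 4 and 5 are left out.

```python
import random

FIELD = 1.0          # side length of the playing field (m)
BALL_SPEED = 0.025   # L1-norm of the ball velocity (m per iteration)
PADDLE_SPEED = 0.05  # maximum paddle displacement (m per iteration)
PADDLE_LEN = 0.20    # length of the paddle (m)
N_STATES = 32        # number of discretized ball positions


def reset_ball(rng):
    """Put the ball in the middle of the field and pick a random direction."""
    vx = rng.uniform(-BALL_SPEED, BALL_SPEED)
    vy = (BALL_SPEED - abs(vx)) * rng.choice((-1.0, 1.0))
    return [FIELD / 2, FIELD / 2], [vx, vy]


def ball_state(ball):
    """Map the horizontal ball position to one of the N_STATES columns."""
    return min(int(ball[0] / FIELD * N_STATES), N_STATES - 1)


def step(ball, vel, paddle_x, target_x):
    """Advance paddle and ball by one iteration; returns (paddle_x, lost)."""
    # The paddle approaches its target, limited by the paddle speed.
    paddle_x += max(-PADDLE_SPEED, min(PADDLE_SPEED, target_x - paddle_x))
    ball[0] += vel[0]
    ball[1] += vel[1]
    # Assumption: side walls and upper wall reflect the ball elastically.
    if ball[0] < 0.0 or ball[0] > FIELD:
        vel[0] = -vel[0]
        ball[0] = min(max(ball[0], 0.0), FIELD)
    if ball[1] > FIELD:
        vel[1] = -vel[1]
        ball[1] = FIELD
    if ball[1] <= 0.0:
        # Assumption: the paddle sits on the lower wall and reflects the ball
        # if the ball is within half a paddle length of the paddle centre.
        if abs(ball[0] - paddle_x) <= PADDLE_LEN / 2:
            vel[1] = abs(vel[1])
            ball[1] = 0.0
            return paddle_x, False
        return paddle_x, True
    return paddle_x, False


def run(actor, n_iterations, seed=0):
    """Run the loop with a hypothetical actor(state) -> target column."""
    rng = random.Random(seed)
    ball, vel = reset_ball(rng)
    paddle_x, target_x, losses = FIELD / 2, FIELD / 2, 0
    for _ in range(n_iterations):
        paddle_x, lost = step(ball, vel, paddle_x, target_x)  # step 1
        if lost:
            losses += 1
            ball, vel = reset_ball(rng)
        state = ball_state(ball)                               # step 2
        target_x = (actor(state) + 0.5) * FIELD / N_STATES     # step 3
        # Steps 4 and 5 (plasticity, reward baseline) are omitted here.
    return losses
```

Calling `run(lambda state: state, 5000)` uses an idealized actor that always aims at the column of the ball, whereas `run(lambda state: 0, 5000)` keeps the paddle in the leftmost column and therefore loses games.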
For each potential position of the ball, we calculate and save the task-specific (position-specific) mean expected reward $\bar{R}_k$, $k \in [0, 31]$, and estimate it effectively using an exponentially-weighted moving average:
$$\bar{R}_k \rightarrow \bar{R}_k + \gamma (R - \bar{R}_k) , \tag{4.6}$$
where γ is the discount factor for the moving average. We use a task-specific reward for the 32 tasks in our setup because task specificity is required for learning multiple tasks [Frémaux et al., 2010]. In the somewhat surprising terminology of reinforcement learning, the 32 states of the ball are considered as 32 distinct
4.2 Experimental setup
Figure 4.4: Flowchart of the implementation. The experiment runs in a loop, iterating between the PPU and the neuromorphic core. The game is started/reset by setting the ball in the middle of the playing field and releasing it in a random direction. Based on the current position of the ball, the environment sends a regular spike train to the neuromorphic core through the corresponding state unit; the winning neuron is determined as the most active neuron. The reward is determined based on the distance between the ball state and the target position, and the plasticity rule is applied (equation (4.3)). Finally, the PPU updates the environment: the ball moves on and the paddle moves towards the target position, and the loop starts again. If the player loses the game (the ball drops to the lower wall), the environment is reset. Figure taken from Wunderlich et al. [2019].
4. Demonstrating advantages
(multi-armed bandit) tasks, as described, e.g., in Sutton and Barto [2018]. The weights are drawn from a Gaussian distribution, and we update all 1024 synapses based on the R-STDP learning rule (equation (4.3)),
$$\Delta w_{ij} = \eta (R - \bar{R}_k) A^{+}_{mn} , \tag{4.7}$$
where $A^{+}_{mn}$ is a processed version of the accumulator value $a^{+}_{mn}$ (equation (4.1))
on the causal branch. The accumulator value is digitized to 8-bit resolution, calibrated against the offset value, and bit-shifted to the right, which discards the often noisy least-significant bit. Note that we update all synapses on the array, although we would expect only the synapses in a single row to change; all others should receive an update of exactly zero because their STDP trace is zero. This holds for this specific application, but we want to use the results of the experiments as a pilot study for large-scale experiments in which the distinction between used and silent synapses is not obvious. Furthermore, even unused synapses could hold nonzero values on their coincidence accumulators, for example because of a malfunctioning reset. By updating all synapses, and thus refraining from using expert knowledge, we keep the network setup scalable (no elaborate blacklisting of faulty synapses) and preserve the generality of our findings for larger networks. This will also be important when we compare the implementation on neuromorphic hardware to computer simulations.
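The two update rules can be illustrated with a short sketch in plain Python. Several points are assumptions made for illustration: the right shift is taken to be by exactly one bit, the learning rate in equation (4.7) is taken to be the value 0.125 listed as learning rate in table 4.1, and the function names are hypothetical. Weight quantization on the hardware is not modelled. Following the order of the game loop, the weights are updated with the old baseline before the baseline itself is updated.

```python
GAMMA = 0.5   # decay constant of the reward baseline (table 4.1)
ETA = 0.125   # learning rate (assumed: value listed as learning rate in table 4.1)


def process_accumulator(raw, offset):
    """Offset-correct an 8-bit accumulator readout and drop its lowest bit."""
    # Assumption: the right shift is by exactly one bit.
    return (int(raw) - int(offset)) >> 1


def update(weights, baseline, state, reward, raw_acc, offsets):
    """One plasticity step: equation (4.7), then equation (4.6), in place."""
    modulation = ETA * (reward - baseline[state])
    # Every synapse is visited, not only the row of the active input.
    for m, row in enumerate(weights):
        for n in range(len(row)):
            a_plus = process_accumulator(raw_acc[m][n], offsets[m][n])
            row[n] += modulation * a_plus
    # Moving average of the reward, updated for the visited state only.
    baseline[state] += GAMMA * (reward - baseline[state])
```

In this sketch a silent synapse whose accumulator equals its offset receives an update of exactly zero, while a synapse with a faulty accumulator would still be changed, which is the situation the text above deliberately does not exclude.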
Monitoring the learning progress
We monitor the learning with two observables. The accumulated mean expected reward is defined as
$$\langle R \rangle = \frac{1}{32} \sum_{i=0}^{31} \bar{R}_i , \tag{4.8}$$
which is the average expected reward over all 32 positions (inputs) at a given iteration. Because of the graded reward scheme over the length of the paddle, the mean expected reward is only a proxy for the ability of the player to catch the ball; even a slightly misplaced paddle can catch the ball. To assess the playing capability more precisely, we introduce the measure of performance
$$P = \frac{1}{32} \sum_{i=0}^{31} \lceil R_i \rceil , \tag{4.9}$$
where $\lceil \cdot \rceil$ is the ceiling operator and $R_i$ is the last reward received in state $i$. The performance reflects the fraction of states in which the aiming is accurate enough for the paddle to reflect the ball. The parameters used for the environment simulation and the plasticity are given in table 4.1.
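Both observables are simple averages over the 32 states. The sketch below assumes that rewards lie in the interval [0, 1] and are exactly zero when the paddle misses the ball, so that the ceiling maps every caught ball to 1; the function names are hypothetical.

```python
import math


def mean_expected_reward(baseline):
    """Equation (4.8): average of the per-state reward baselines."""
    return sum(baseline) / len(baseline)


def performance(last_reward):
    """Equation (4.9): fraction of states whose last reward was nonzero."""
    return sum(math.ceil(r) for r in last_reward) / len(last_reward)
```

For example, if 8 of the 32 states last received a small reward of 0.25 and all others received zero, the performance is 0.25 although the average reward is only 0.0625, which illustrates why the mean expected reward alone understates the playing ability.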
| Symbol | Description | Value |
|---|---|---|
| | neuromorphic hardware | BrainScaleS 2 (2nd prototype version) |
| $N$ | number of action/output neurons (LIF) | 32 |
| $N_S$ | number of state/input units | 32 |
| $N_\mathrm{syn}$ | number of synapses | 32·32 = 1024 |
| $N_\mathrm{spikes}$ | number of spikes from input unit | 20 |
| $T_\mathrm{ISI}$ | ISI of spikes from input unit | 10 µs |
| $w$ | mean of initial weights (digital value) | 14 |
| $\sigma_w$ | standard deviation of initial weights | 2 |
| $L$ | length and width of quadratic playing field | 1 m |
| $\lVert v_p \rVert_1$ | L1-norm of ball velocity | 0.025 m per iteration |
| $v_p$ | velocity of paddle controlled by BSS2 | 0.05 m per iteration |
| $r_b$ | radius of ball | 0.02 m |
| $r_p$ | length of paddle | 0.20 m |
| $\gamma$ | decay constant of reward | 0.5 |
| $\beta$ | learning rate | 0.125 |
| | NEST version (software simulation) | 2.14.0 |
| | NEST timestep | 0.1 ms |
| | CPU (software simulation, one core used) | Intel i7-4771 |

| Symbol | Description | Set #1 | Set #2 | Set #3 (standard) |
|---|---|---|---|---|
| $\tau_\mathrm{mem}$ | LIF membrane time-constant | 28.5 µs | 18.4 µs | 24.8 µs |
| $\tau_\mathrm{ref}$ | LIF refractory time-constant | 4 µs | 14.3 µs | 13.8 µs |
| $\tau_\mathrm{syn}^\mathrm{exc}$ | LIF excitatory synaptic time-constant | 1.8 µs | 2.4 µs | 1.4 µs |
| $E_\mathrm{leak}$ | LIF leak potential | 0.62 V | 0.56 V | 0.87 V |
| $V_\mathrm{reset}$ | LIF reset potential | 0.36 V | 0.36 V | 0.30 V |
| $V_\mathrm{thresh}$ | LIF threshold potential | 1.28 V | 1.31 V | 1.21 V |
| $\eta_+$ | amplitude of correlation function $a_+$ (digital value) | 72 | 114 | 70 |
| $\tau_+$ | time-constant of correlation function $a_+$ | 64 µs | 80 µs | 60 µs |
Table 4.1: Parameters used in the experiment. Abbreviations in the table: LIF: leaky integrate-and-fire; ISI: inter-spike interval. The three parameter sets for the three chips are the results of the meta-parameter optimization. We describe quantities of the playing field, such as the length of the paddle, in meters to give them a dimension and to distinguish them from dimensionless quantities. If not mentioned otherwise, the experiments were carried out on chip #1. Table adapted from Wunderlich et al. [2019].
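As a small illustration of how the input and initialization parameters of table 4.1 translate into concrete values, the following sketch generates the regular input spike train and the Gaussian initial weights. The rounding of the weights and the clipping range of 0 to 63 (6-bit weights) are assumptions, as are the function names.

```python
import random

N_STATES, N_ACTIONS = 32, 32
N_SPIKES, T_ISI = 20, 10e-6   # spikes per input unit, inter-spike interval (s)
W_MEAN, W_STD = 14.0, 2.0     # initial weights (digital value)
W_MAX = 63                    # assumption: 6-bit synaptic weights


def input_spike_times(t_start=0.0):
    """Regular spike train sent by the active state/input unit."""
    return [t_start + i * T_ISI for i in range(N_SPIKES)]


def initial_weights(seed=0):
    """Gaussian initial weights, rounded and clipped to the digital range."""
    rng = random.Random(seed)
    return [
        [min(max(round(rng.gauss(W_MEAN, W_STD)), 0), W_MAX)
         for _ in range(N_ACTIONS)]
        for _ in range(N_STATES)
    ]
```

With these values one input presentation lasts 190 µs from the first to the last spike.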
Meta-parameter optimization
We performed meta-parameter optimization of the time constants ($\tau_\mathrm{mem}$, $\tau_\mathrm{syn}^\mathrm{exc}$, $\tau_\mathrm{ref}$), the neuron potentials ($E_\mathrm{l}$, $V_\mathrm{reset}$, $V_\mathrm{thresh}$), and the time constant and amplitude of the coincidence detectors ($\tau_+$, $\eta_+$). For the optimization we used the decision-tree-based optimization algorithm FOREST_MINIMIZE from the SCIKIT-OPTIMIZE [Head et al., 2018] software package with default settings (extra trees regressor model, 105 acquisition function samples, maximizing expected improvement, target improvement 0.01). The meta-parameter optimization serves several ends: 1) it helps to explore the parameter space and sets the parameters of the model to a good working point, 2) it ensures that results can be compared between hardware and simulation, and 3) it forms the basis of our transfer experiments, in which we study whether results obtained on one BSS-2 chip can be applied to another chip of the same generation.
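A call to the optimizer might look roughly like the sketch below. It is not the configuration used in the experiments: the search ranges are hypothetical because the actual bounds are not stated here, and `objective` is a toy stand-in for a full learning experiment, which in practice would configure the chip or the simulation, run the game loop, and return a loss such as one minus the performance. The import of scikit-optimize is placed inside `run_search` so that the objective can be inspected without the package installed.

```python
def objective(params):
    """Hypothetical stand-in for one learning experiment (lower is better)."""
    (tau_mem, tau_syn, tau_ref, e_leak,
     v_reset, v_thresh, tau_plus, eta_plus) = params
    # Toy loss so that the sketch is runnable; not a model of the hardware.
    return (tau_mem - 25.0) ** 2 / 100.0 + (v_thresh - e_leak - 0.5) ** 2


# Hypothetical search ranges, one (low, high) pair per meta-parameter.
SEARCH_SPACE = [
    (10.0, 40.0),    # tau_mem (us)
    (1.0, 3.0),      # tau_syn_exc (us)
    (2.0, 20.0),     # tau_ref (us)
    (0.5, 0.9),      # E_leak (V)
    (0.3, 0.4),      # V_reset (V)
    (1.1, 1.4),      # V_thresh (V)
    (40.0, 100.0),   # tau_plus (us)
    (50.0, 120.0),   # eta_plus (digital value)
]


def run_search(n_calls=30, seed=0):
    """Run a short forest-based search and return the result object."""
    from skopt import forest_minimize  # requires scikit-optimize

    return forest_minimize(
        objective,
        SEARCH_SPACE,
        base_estimator="ET",   # extra trees regressor
        acq_func="EI",         # expected improvement
        xi=0.01,               # target improvement
        n_calls=n_calls,
        random_state=seed,
    )
```

The returned object exposes the best parameter vector as `result.x` and the corresponding loss as `result.fun`; such a parameter vector is what would be carried over to another chip in a transfer experiment.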