THEORETICAL AND CONCEPTUAL FRAMEWORK
2.3 RESEARCH BACKGROUND
The authors of [27] show that executing the PF algorithm completely in parallel suffers from accuracy problems caused by the resampling step. Their solution is to exchange a small number of particles that are “fit” enough to introduce variety and “enrich” the local particle population of each sub-filter, without sacrificing much of the computational advantage gained from parallelizing the algorithm. They show that exchanging even a small number of particles, using two different communication topologies, can significantly improve tracking performance. Another important result of their implementation is the comparison of all-to-all and ring topologies: according to their experimental results, a ring topology can at times outperform an all-to-all one. The implication of this finding is that a one-to-one mapping of the PPF implementation on Starburst using a ring topology is not only optimal but also accurate, taking full advantage of the multiprocessor ring NoC. Additionally, the NoC is designed to be hardware cost-efficient, with minimal communication time overhead, as opposed to other shared-memory solutions.
Following this notion, the topology of this implementation is illustrated in Fig. 4.16 using a directed task graph. Each node in the graph represents a real-time task, executed on its own processing core and managed by the Helix real-time kernel through a pthread-compatible API. The edges of the graph represent data precedence relations and unidirectional data channels. In essence, tasks communicate through software circular FIFO buffers, which do not rely on shared memory to transport data from one processor to another, but rather on the ring NoC. The numbering of the tasks in the graph reflects the order in which the processors are interconnected on the ring. Thus, the communication overhead from processor 0 to processor 1, for example, is minimal. If the total number of physical processors is $P_{tot}$, then the communication overhead from processor 0 to processor $P_{tot}-1$ is the highest, but it is minimal in the opposite direction, since the ring NoC is unidirectional.
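The asymmetry described above can be illustrated with a small sketch. It assumes that the communication overhead grows with the number of hops a message travels and that data flows in the direction of increasing processor index; the function name `ring_hops` is hypothetical and is not part of the Helix or Starburst API.

```python
def ring_hops(src: int, dst: int, p_tot: int) -> int:
    """Hops needed to travel from src to dst on a unidirectional ring.

    Assumes data only moves in the direction of increasing processor
    index, wrapping from p_tot - 1 back to 0.
    """
    if not (0 <= src < p_tot and 0 <= dst < p_tot):
        raise ValueError("processor index out of range")
    return (dst - src) % p_tot


p_tot = 8
print(ring_hops(0, 1, p_tot))          # 1: direct successor on the ring
print(ring_hops(0, p_tot - 1, p_tot))  # 7: must traverse almost the whole ring
print(ring_hops(p_tot - 1, 0, p_tot))  # 1: wraps around to processor 0
```

Under this hop-count assumption, the path from processor 0 to processor $P_{tot}-1$ is the longest one on the ring, while the reverse path is a single hop.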
Figure 4.16: Parallel particle filter task graph and communication topology.

Each task in the graph executes the distributed PPF algorithm described in Algorithm 7 on a local particle population. Assuming that the total number of particles used in a non-parallel PF would be $N$, distributed among $P$ processors, the number of particles per task is $N_{local} = N/P$. For consistency and ease of analysis, it is assumed from now on that $N$ is chosen such that $N_{local}$ is the same for every task. Thus, each task executes a “small” version of the PF algorithm with $N_{local}$ particles in a fixed, predetermined amount of time.

Algorithm 7: Distributed SIR Particle Filter Algorithm
1: for $i = 1 : N_{local}$ do
2:     Initialize $\{x_{0,i}, w_{0,i}\}$, such that $x_{0,i} \sim p(x_0)$ and $w_{0,i} = 1/N_{local}$.
3: end for
4: for each system iteration $k > 0$, $k \in \mathbb{N}$ do
5:     Acquire new measurement from processor 0: $y_k$ = read_fifo(0)
6:     for $i = 1 : N_{local}$ do
7:         Draw a sample $x_{k,i} \sim p(x_k \mid x_{k-1})$ using Eq. 3.5
8:         Assign a particle weight, $w_{k,i}$, using Eq. 3.11
9:     end for
10:    Exchange particles: $\{x^*_{k,j}, w^*_{k,j}\}$ = xchg($\{x_{k,i}, w_{k,i}\}$, $D$, $A$)
11:    Resample particles: $\{x_{k,i}, w_{k,i}\}$ = Resample($\{x^*_{k,j}, w^*_{k,j}\}$)
12:    Compute local estimate $\hat{x}_k$ using Eq. 3.12
13:    Send local estimate to processor $P+1$: write_fifo($\hat{x}_k$, $P+1$)
14: end for
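As a rough illustration of what one task does with its $N_{local}$ particles, the following sketch runs the local sample, weight, resample and estimate steps for a single task. Everything model-specific is an assumption made for the example: a one-dimensional random-walk state model and a Gaussian measurement likelihood stand in for Eq. 3.5 and Eq. 3.11, multinomial resampling stands in for the resampling step, and a weighted mean stands in for Eq. 3.12. The exchange step and the FIFO communication are left out, and the function names are hypothetical.

```python
import math
import random


def init_particles(n_local, rng):
    """Draw x_{0,i} from an assumed N(0, 1) prior with weights 1/N_local."""
    states = [rng.gauss(0.0, 1.0) for _ in range(n_local)]
    weights = [1.0 / n_local] * n_local
    return states, weights


def local_sir_step(states, y, rng, q=0.5, r=1.0):
    """One local iteration for a single task (exchange step omitted)."""
    n_local = len(states)
    # Sample: propagate each particle through the assumed random-walk model.
    states = [x + rng.gauss(0.0, q) for x in states]
    # Weight: assumed Gaussian measurement likelihood, then normalize.
    weights = [math.exp(-0.5 * ((y - x) / r) ** 2) for x in states]
    total = sum(weights)
    if total > 0.0:
        weights = [w / total for w in weights]
    else:
        weights = [1.0 / n_local] * n_local  # every likelihood underflowed
    # Resample: multinomial resampling back to N_local equal-weight particles.
    states = rng.choices(states, weights=weights, k=n_local)
    weights = [1.0 / n_local] * n_local
    # Estimate: weighted mean of the resampled local set.
    estimate = sum(w * x for w, x in zip(weights, states))
    return states, weights, estimate


N, P = 1000, 4          # total particles and number of PF tasks (example values)
n_local = N // P        # N_local = N / P, here 250 particles per task
rng = random.Random(0)  # fixed seed so the run is repeatable
states, weights = init_particles(n_local, rng)
for y in [2.0] * 20:    # in the real system y would arrive from processor 0
    states, weights, estimate = local_sir_step(states, y, rng)
```

Because every task holds the same $N_{local}$, the work per iteration is identical across tasks, which is what makes the fixed, predetermined execution time plausible.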
Each of the steps is performed locally, in the same manner as in a non-parallel implementation. However, before any processing can begin, each processor $p > 0$ reads a new measurement from processor 0 through a dedicated FIFO buffer. This measurement can come from a sensor or from simulation data for evaluation purposes, but it is (for now) always relayed by processor 0. In a future version of the PPF implementation, the use of this core as a data distribution gateway will be dropped, and it will be replaced directly by a hardware accelerator, such as the HOG-SVM detector, or just the camera peripheral.
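The relay of a measurement through dedicated FIFO buffers can be sketched as follows. This is only a single-process illustration of a fixed-capacity circular FIFO: the class `CircularFifo` and its methods are hypothetical, the real buffers transport data over the ring NoC rather than through a Python list, and a real reader would presumably block on an empty buffer instead of receiving `None`.

```python
class CircularFifo:
    """Fixed-capacity circular FIFO buffer (illustrative only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = [None] * capacity
        self._head = 0   # index of the next item to read
        self._count = 0  # number of items currently stored

    def write(self, item):
        """Store an item; return False if the buffer is full."""
        if self._count == len(self._buf):
            return False
        tail = (self._head + self._count) % len(self._buf)
        self._buf[tail] = item
        self._count += 1
        return True

    def read(self):
        """Return the oldest item, or None if the buffer is empty."""
        if self._count == 0:
            return None
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._count -= 1
        return item


P = 4  # number of PF tasks (example value)
# One dedicated FIFO from processor 0 to each PF processor p > 0.
fifos = {p: CircularFifo(capacity=4) for p in range(1, P + 1)}
y_k = 1.25  # made-up measurement value
for fifo in fifos.values():
    fifo.write(y_k)  # processor 0 relays the same measurement to every task
received = {p: fifo.read() for p, fifo in fifos.items()}
```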
One may notice a newly introduced “exchange” step, inserted between the update and resampling steps. During this step, the particles generated by the update are first sorted in descending order of weight. Then, each processor $p$ with a task node $\tau_p$ sends $D$ high-weight particles to $A \le P-1$ neighbors $p_i$, such that

$$p_i = \mathrm{mod}(p + i, P), \quad 0 < i \le A,$$

respecting the unidirectional nature of the ring NoC. The $D$ particles are then incorporated into the local population, forming a new set of particles $\{x^*_{k,j}, w^*_{k,j}\}$, $j = 1, \ldots, N_{new}$, which is used to resample the local set. The exchange scheme is discussed in more detail later.
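The sending side of the exchange step can be sketched as below: the neighbor indices follow the formula $p_i = \mathrm{mod}(p + i, P)$, and the $D$ particles to send are the highest-weight ones after sorting. The function names and the representation of a particle as a `(state, weight)` pair are assumptions made for illustration; how the received particles are merged into the local population is not covered here.

```python
def ring_neighbors(p, num_procs, a):
    """Return the neighbors p_i = mod(p + i, P) for 0 < i <= A."""
    if not 0 <= a <= num_procs - 1:
        raise ValueError("A must satisfy 0 <= A <= P - 1")
    return [(p + i) % num_procs for i in range(1, a + 1)]


def select_for_exchange(particles, d):
    """Sort (state, weight) pairs by descending weight and keep the D best."""
    ranked = sorted(particles, key=lambda sw: sw[1], reverse=True)
    return ranked[:d]


# Example: P = 4 processors, processor 3 sends to A = 2 neighbors.
print(ring_neighbors(3, 4, 2))  # [0, 1]: the indices wrap around the ring

local = [(0.1, 0.05), (0.4, 0.50), (0.3, 0.30), (0.9, 0.15)]
print(select_for_exchange(local, 2))  # [(0.4, 0.5), (0.3, 0.3)]
```

With $A \le P-1$, a processor never appears in its own neighbor list, since $\mathrm{mod}(p + i, P) = p$ would require $i$ to be a multiple of $P$.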
After resampling, each core sends its local estimate to core $P+1$, which performs global estimation. However, as noted in [27], it is also sufficient to “pick” the local estimate of a single PF task, as its estimation quality comes close to that of the global estimator. Nevertheless, processor $P+1$ is also used for evaluation purposes.
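The text does not state how core $P+1$ combines the local estimates. One simple possibility, shown here purely as an assumption, is an unweighted average, which is a natural choice when every task holds the same number of equally weighted particles after resampling. The function names are hypothetical, and the input values are made up for the example.

```python
def global_estimate(local_estimates):
    """Average the local estimates (an assumed combination rule)."""
    if not local_estimates:
        raise ValueError("at least one local estimate is required")
    return sum(local_estimates) / len(local_estimates)


def pick_estimate(local_estimates, task=0):
    """Alternative: use the local estimate of a single PF task."""
    return local_estimates[task]


local_estimates = [2.0, 2.5, 1.5, 2.0]  # made-up values from P = 4 tasks
combined = global_estimate(local_estimates)
picked = pick_estimate(local_estimates)
```

Comparing `combined` against `picked` over many iterations is one way the evaluation role of processor $P+1$ could be exercised.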
Figure 4.17: Different exchange strategies: (a) exchange with appending, (b) exchange with replacement, (c) exchange with overlap. Here, the blue block represents the old local set of particles, while the green blocks represent the sets of particles received from neighboring processors.