After this change, the average duration of the application was of 2.73 seconds, a not great but still reasonable improvement.
8.3.5
Adding extra instances
Since the FPGA block for this application is really simple and does not consume a lot of resources, it is worth adding a few extra instances that will be executed in parallel to gain more efficiency. Specifically, the total amount of instances was changed to 4.
8.3.6
Optimization overview
Figure 8.2: BFS optimization chart
The total performance gain from the accumulated optimizations of the BFS application (based on the non-optimized version) is 11.86.
8.3.7
Analysis
A portion of the trace from the non-optimized bfs application is shown in Figure 8.3.
Figure 8.3: BFS execution trace without optimizations
We can observe that the application barely manages to complete an iteration and a half of the main loop during that interval. While the main function seems to be doing all the job in the first half of the timeline, some bfs2 FPGA tasks are hidden in the trace. If the trace is zoomed in by that section we can see more clearly what is happening (shown in Figure 8.4).
Figure 8.4: Zoomed in BFS execution trace without optimizations
It seems that both threads are alternating the preparation work before creating the task and the execution of the tasks themselves.
Looking again at the first image, we can also observe that there is a significantly large computation being done by the main function between the end of the bfs2 FPGA tasks and the start of the bfs1 SMP tasks from the next loop iteration. This could represent the reduction process of the variable ’stop’ that is executed by the main function after all the FPGA tasks have been completed. Since the bandwidth size is only 1024 and each integer represents a single boolean value, the main function has to apply a very large reduction.
The trace from the application with the bandwidth optimization is shown in Figure 8.5.
Figure 8.5: BFS execution trace with bandwidth optimization
Both the time to execute all the bfs2 FPGA tasks and the reduction of the variable ’stop’ after all of them are finished have been reduced to a significant
degree. However, the packing of some of the structures has caused a false
dependence between bfs1 tasks, due to the random access on one of the packed variables which could cause 2 different tasks to modify the same integer at the same time. Because of that, the dependence has serialized the execution of the bfs1 tasks (note the yellow lines).
The trace from the application with SMP task optimization is shown in Figure 8.6.
Figure 8.6: BFS execution trace with SMP optimization
We immediately see that the dependence lines are gone, and the average
duration of the bfs1 task groups is slightly reduced. While the tasks are no
longer done sequentially, some overhead might have been produced due to the atomic write operation (Figure 8.2).
The final optimization, adding additional instances to the FPGA task, resulted in a substantially faster execution (shown in Figure 8.7).
We can see that the bfs2 FPGA tasks are completed faster thanks to being able to send more than one of them concurrently. This can be seen more clearly if we zoom in on the FPGA task creation section.
The application with a single instance is shown in Figure 8.8, and the application with 4 instances is shown Figure 8.9. We can see in the latter trace that the amount of instances in the same amount of time has almost quadrupled.
Figure 8.8: Zoomed in BFS execution trace with SMP optimization
Figure 8.9: Zoomed in BFS execution trace with additional instances
8.3.8
Conclusion
This BFS implementation, as observed, is certainly portable to an FPGA
when it comes to improving its performance.
For one, the fact that one of the tasks could not be easily implemented as an FPGA task (due to a random access problem, explained in section 7.1) already caps the total amount of optimization we can achieve using the FPGA device and the total workload that it manages to take. Secondly, while the booleans being packed in integers had a really solid performance increase at the start, it might have made other optimizations like pipelining and unrolling harder for the synthesizer to implement.
In conclusion, for this application to perform optimally in an FPGA heterogeneous computing environment, it needs not only an FPGA device but also a good performing host processor.
8.4
NN
8.4.1
Use case analysis
The input data in this application consists of a large amount of entries, including several information like year, month, day, hour, name, longitude and latitude. Specifically, these use case entries represent hurricanes with their arrival time at certain locations. Even though there is a lot of information about each entry, the algorithm only uses the longitude and latitude in order to compute the nearest neighbor algorithm.
The original use case for the application was really short and would not be suitable for optimization purposes, so a bigger use case was generated by the already provided ‘inputGen’ tool in the Rodinia Suite.
8.4.2
State of the non-optimized application
The original algorithm was given the entire information in characters, and extracted the longitude and latitude from them using the parsing function ‘atof’ from the ‘stdlib’ library which converted the specific characters into float type. Vivado HLS does not support this function, so the first idea was to simply make a simplified implementation of that parsing function and call it inside the FPGA block.
the execution of the big use case with a reasonable amount of time. This could be because parsing is composed of mostly sequential iterations of a single address which has a limited amount of read ports, while also writing the result on a variable which has a data dependence with itself between iterations.
8.4.3
Data handling optimization
Since the FPGA logic did not seem to handle the parsing properly, that part of the data handling was moved out of the FPGA and placed in the host before starting to send the tasks. This way, not only the FPGA did not need to treat the input before computing ,but the needs of bandwidth were also significantly reduced due to only needing to send two float types instead of the entire information of each entry.
After this change, the duration of the application was of 13.35 seconds.
8.4.4
Adding extra instances
Once the parsing was out of FPGA block, 2 additional instances were added to the logic, upping the number of instances to a total of 3. The duration of the application at this stage was of 9.94 seconds, not a big improvement but certainly not small either.
8.4.5
Block size increase
The block size of the application was originally 1024. However, since the input data consists of 3 single dimension arrays, that number could be increased. The block size was increased to 8192, since increasing the input size further did not seem to improve the application’s performance.
The duration of the application at that point was of 8.62 seconds.
8.4.6
Loop pipelining
The method that was chosen for pipelining is shown in Code 8.3. The original ’for’ loop has been blocked, resulting in loops ’ii’ and ’i’. The reason behind this is that you are able to pipeline based on the inner loop with a certain amount of iterations equal to the block factor. This means that Vivado HLS will only try to pipeline this many iterations at once. In this case, the block factor was decided to be 32.
1
2 v o i d n n b l o c k (f l o a t * t m p l a t s , f l o a t * tmp longs , f l o a t * t a r g e t l a t , f l o a t * t a r g e t l o n g , f l o a t * z ) {
3 #pragma HLS ARRAY PARTITION v a r i a b l e=t m p l a t s c y c l i c f a c t o r=BLOCK FACTOR
4 #pragma HLS ARRAY PARTITION v a r i a b l e=t m p l o n g s c y c l i c f a c t o r=BLOCK FACTOR
5 #pragma HLS ARRAY PARTITION v a r i a b l e=z c y c l i c f a c t o r=BLOCK FACTOR
6 f o r(i n t i i = 0 ; i i < BSIZE ; i i += BLOCK FACTOR) {
7 #pragma HLS PIPELINE I I =1 8 f o r (i n t i = 0 ; i < BLOCK FACTOR ; i ++){ 9 z [ i + i i ] = s q r t ( ( ( t m p l a t s [ i + i i ] − (* t a r g e t l a t ) ) * ( t m p l a t s [ i+ i i ] − (* t a r g e t l a t ) ) ) + ( ( t m p l o n g s [ i + i i ] − (* t a r g e t l o n g ) ) * ( tmp longs [ i+ i i ] − (* t a r g e t l o n g ) ) ) ) ; 10 } 11 } 12 }