5. RESULTS
5.5. SALVAGE SURGERY BY TYPE OF TREATMENT:
The top of the stack is cached, but instead of the large ring buffer suggested in Chapter 3, a four-place linear cache was implemented. This size was chosen based on studies of the code generated by Java compilers. The Sun Java compiler performs practically no optimization on the user code; its only optimization goal is to keep the number of stack locations required by an application to a minimum. This results in code that pumps the stack fast and with small amplitude: it first loads data onto the stack, then operates on the data, and finally saves the result to a local variable. Making the stack cache larger than four places would provide only marginal improvement. Since no instruction pops more than four items from the stack, four places are also sufficient for the most data-hungry instructions.
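The "fast, small amplitude" stack behavior described above can be illustrated with a minimal sketch that tracks the operand stack depth over a bytecode sequence. The table of stack effects covers only a handful of integer bytecodes, and the two sequences are hand-written approximations of what a non-optimizing compiler would emit for `c = a + b` and `d = (a + b) * c`; they are illustrative assumptions, not measurements from the studies mentioned in the text.

```python
# Net stack effect (pushes minus pops) of a few JVM integer bytecodes.
STACK_EFFECT = {
    "iload": +1,   # push a local variable
    "iconst": +1,  # push a constant
    "iadd": -1,    # pop two operands, push one result
    "imul": -1,    # pop two operands, push one result
    "istore": -1,  # pop the result into a local variable
}


def depth_trace(bytecodes):
    """Return the stack depth after each instruction, starting from depth 0."""
    depth, trace = 0, []
    for op in bytecodes:
        depth += STACK_EFFECT[op]
        if depth < 0:
            raise ValueError(f"stack underflow at {op}")
        trace.append(depth)
    return trace


# c = a + b        (assumed encoding: load, load, operate, store)
simple = ["iload", "iload", "iadd", "istore"]
# d = (a + b) * c  (assumed encoding)
nested = ["iload", "iload", "iadd", "iload", "imul", "istore"]

print(depth_trace(simple))
print(max(depth_trace(nested)))
```

In both sequences the depth rises by a couple of entries and returns to zero at the store, which is the pattern that makes a small cache sufficient.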
The linear cache refers to an architecture where the contents of the individual registers are moved up or down along the cache as the cached area moves up or down. This architecture is shown in Figure 5.1. The ring buffer, on the other hand, refers to an architecture where the data items in the cache remain in place, but the access pointers move up or down. The ring buffer, shown in Figure 5.2, is quite efficient in ASIC designs, but since the architecture requires a lot of multiplexers, which are expensive in FPGAs, it is not well suited for FPGA-based designs. Using the linear approach, the number of multiplexers is one per cache location, regardless of the number of output signals. The direction of the movement is fed to all of the multiplexers as the selection signal, and all of the registers share a common clock enable line. In the ring buffer approach the number of multiplexers is highly dependent on the number of outputs, since each output signal requires a multiplexer tree of its own. The outputs also require control signals for selecting the appropriate register, which increases the number of registers. On the input side the incoming data can be mapped directly to all registers, but the clock enable signal needs to be decoded from the pointer showing the location of the next free element. With four places, the number of multiplexers is three per output port, totaling 12 for four outputs. The linear approach also provides outputs faster after a clock edge, because the multiplexing is done before the registers, not after them. The faster outputs leave more time for the ALU to process the data, thus decreasing the minimum clock period.

Figure 5.1: Stack cache with linear architecture. (Schematic signals: DIR, CE, IN, Null, OUT1 to OUT4; registers R1 to R4.)
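The linear architecture and the multiplexer counts quoted above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the actual hardware description: the class and function names are hypothetical, multiplexers are counted as two-input units, and the spill of the bottom register to stack memory on a push is not modeled.

```python
class LinearStackCache:
    """Behavioral model: data shifts through the registers, outputs are fixed."""

    def __init__(self, places=4):
        # regs[0] models R1 (top of stack, OUT1), regs[-1] models the last register.
        self.regs = [None] * places

    def clock(self, ce, push, data_in=None):
        """One clock edge. `push` plays the role of the shared direction signal."""
        if not ce:                      # shared clock enable: nothing moves
            return
        if push:                        # shift down, new data enters at the top
            self.regs = [data_in] + self.regs[:-1]
        else:                           # shift up, a null value enters at the bottom
            self.regs = self.regs[1:] + [None]


def mux_count_linear(places, outputs):
    """One two-input multiplexer per register, independent of the output count."""
    return places


def mux_count_ring(places, outputs):
    """Each output needs a tree of (places - 1) two-input multiplexers."""
    return (places - 1) * outputs


cache = LinearStackCache()
cache.clock(ce=True, push=True, data_in=1)
cache.clock(ce=True, push=True, data_in=2)
print(cache.regs)
print(mux_count_linear(4, 4), mux_count_ring(4, 4))
```

Because the outputs are read straight from the registers, no selection logic sits between the registers and the consumer of the data in this model.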
Figure 5.2: Stack cache with ring buffer architecture. Only the output multiplexers connected to OUT1 are shown; similar structures would be needed for each output, with their own control signals. (Schematic signals: IN, CE<0> to CE<3>, select lines OUT_1<0>, OUT_1<1> through OUT_4<0>, OUT_4<1>, OUT1 to OUT4; registers R1 to R4.)
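For comparison, the ring buffer architecture can be sketched in the same behavioral style. The names are hypothetical and the model is a simplification: the list comprehension stands in for the clock enable decoder, and the index arithmetic in `out` stands in for the per-output multiplexer tree and its select lines.

```python
class RingStackCache:
    """Behavioral model: data stays in place, only the pointer moves."""

    def __init__(self, places=4):
        self.places = places
        self.regs = [None] * places
        self.next_free = 0              # pointer to the next free register

    def push(self, data):
        # The input is wired to every register; a one-hot clock enable
        # decoded from the pointer decides which register captures it.
        ce = [i == self.next_free for i in range(self.places)]
        for i, enabled in enumerate(ce):
            if enabled:
                self.regs[i] = data
        self.next_free = (self.next_free + 1) % self.places

    def pop(self):
        # Nothing is moved or erased; the pointer simply steps back.
        self.next_free = (self.next_free - 1) % self.places

    def out(self, k):
        """Output port k (k=1 is the top of stack), selected through a multiplexer."""
        return self.regs[(self.next_free - k) % self.places]


ring = RingStackCache()
for value in (10, 20, 30):
    ring.push(value)
print(ring.out(1), ring.out(2))
ring.pop()
print(ring.out(1), ring.regs)
```

Every call to `out` depends on the pointer, which is the software counterpart of the selection logic placed after the registers in this architecture.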
The data in the stack cache is hardwired to the ALU. This means that instructions like iadd, which perform an operation on the top stack elements, do not need to actively fetch the data. Rather, the data is provided directly to the inputs of the arithmetic unit. If the stack cache is not valid, a validate request is sent to the stack unit. When the stack unit receives the validate request, it validates as many of the topmost stack locations as the request specifies. Locations that are already valid are skipped. After the validation process is completed, the ALU receives an acknowledgment signal from the stack module and is free to continue processing the data. All writes to the stack go through the cache, and all reads come from the cache.
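The validate handshake described above can be sketched as follows. This is a minimal model under assumptions: the class, attribute, and function names are hypothetical, the backing stack memory is a plain list whose last element is the top of stack, and the acknowledgment signal is modeled as a return value.

```python
class StackUnit:
    """Toy model of the stack unit with per-location valid flags."""

    def __init__(self, memory, places=4):
        self.memory = list(memory)      # backing stack memory, top is memory[-1]
        self.cache = [None] * places    # cache[0] is the top of stack
        self.valid = [False] * places
        self.fetches = 0                # number of memory reads performed

    def validate(self, count):
        """Validate the `count` topmost locations, skipping valid ones."""
        if count > len(self.cache) or count > len(self.memory):
            raise ValueError("validate request exceeds cache or stack size")
        for i in range(count):
            if self.valid[i]:
                continue                # already valid: no memory access needed
            self.cache[i] = self.memory[-1 - i]
            self.valid[i] = True
            self.fetches += 1
        return True                     # acknowledgment to the ALU


def iadd(unit):
    """ALU side: operands come straight from the cache once they are valid."""
    if not all(unit.valid[:2]):
        unit.validate(2)                # wait for the acknowledgment
    return unit.cache[0] + unit.cache[1]


unit = StackUnit(memory=[1, 2, 3, 4, 5])
unit.cache[0], unit.valid[0] = 5, True  # top of stack is already cached
print(iadd(unit), unit.fetches)
```

In the example only the second location is fetched, since the first one was already valid when the request arrived.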
The fact that the topmost stack elements are cached and provided to the ALU gives the REALJava co-processor a clear performance edge over the naive architecture, in which every operand has to be retrieved from memory before an operation can be performed on it. The effect of the caching cannot be demonstrated by the measured results, since the cache was already present in the first version of the prototype. However, the effect of the direct connection to the ALU can be seen by comparing the results of versions 0.06 and 0.08. The connection was introduced in version 0.07 and fine-tuned in version 0.08. The performance increase was measured to be roughly 25% in the byte arithmetic tests and just under 20% in the integer tests. The performance increase is shown in Figure 5.3. The aforementioned tests contain one arithmetic instruction inside the test loop, with two load instructions before it and one store instruction after it. Since the direct connection to the ALU has no effect on the loads and stores, the performance increase for the arithmetic instructions alone is considerably higher.
Figure 5.3: The effect of the direct connection from the stack cache to the ALU on the byte arithmetics.
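The remark that the gain for the arithmetic instructions alone is considerably higher can be made concrete with an Amdahl-style estimate. The sketch below is purely illustrative: the text does not report how the loop time is divided between instructions, so the fraction of 0.25 (four instructions of equal cost, loop overhead ignored) is a hypothetical assumption, as is reading "roughly 25%" as an overall speedup factor of 1.25.

```python
def local_speedup(overall_speedup, fraction):
    """Speedup of the accelerated part, from Amdahl's law.

    overall_speedup: measured speedup of the whole loop (e.g. 1.25)
    fraction: share of the original run time spent in the accelerated part
    """
    # New total time (relative to 1.0) minus the unchanged part of the time.
    remaining = 1.0 / overall_speedup - (1.0 - fraction)
    if remaining <= 0:
        raise ValueError("fraction too small to explain the overall speedup")
    return fraction / remaining


# Hypothetical split: the arithmetic instruction takes a quarter of the loop time.
print(local_speedup(1.25, 0.25))
# A different assumed split gives a very different figure.
print(local_speedup(1.25, 0.50))
```

Under the assumed quarter share the arithmetic instruction alone would have to become about five times faster to explain a 25% overall gain; the estimate is very sensitive to the assumed fraction, so it indicates the direction of the effect rather than a measured value.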