6.3.1. Operative Part Organisation
The operative part of an integrated processor establishes the hardware implementation and the topological structure of the elements defined at the functional level. Several criteria for organising the operative part are available to the machine designer [20]. Here, the design of the operative part is based on the principle of information locality for increased performance, with local memory and registers providing multiple accesses.
The Neural-RISC operative part is organised in five blocks: a two-segment datapath, the local memory, two I/O buffers, and the boot ROM (Figure 6.5).
The datapath occupies approximately 11.3% of the chip area. Its layout forms two perfect rectangles, with metal-1 buses and metal-2 control lines. This rectangular datapath is sliced and assembled from a large number of small cells on a multiple-bus structure: a double bus on the right segment and a triple bus on the left segment. The bus system is interrupted by sets of tri-state drivers, which allow either data transfer between the operative sub-parts or independent processing within each of them. These tri-state drivers are made large enough to drive the capacitive loads of the internal and external bus lines, and to convert the signals of the limited voltage-swing buses to a level accepted outside the datapath by the standard cells and memory blocks. The whole datapath is 16 bits wide, although some registers with fewer bits share horizontal slices in order to reduce the height of the datapath.
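The role of the tri-state drivers can be pictured with a small behavioural sketch. This is only an illustration under stated assumptions: the class `SegmentedBus`, its method names and the left-to-right driver direction are hypothetical and are not taken from the Neural-RISC design; only the 16-bit width comes from the text.

```python
class SegmentedBus:
    """Hypothetical model: two 16-bit bus segments joined by one tri-state driver."""

    WIDTH_MASK = 0xFFFF  # the text states that the datapath is 16 bits wide

    def __init__(self):
        self.left = 0                # value on the left segment
        self.right = 0               # value on the right segment
        self.driver_enabled = False  # tri-state driver, left -> right (assumed direction)

    def write_left(self, value):
        self.left = value & self.WIDTH_MASK
        if self.driver_enabled:
            self.right = self.left   # an enabled driver forwards the value

    def write_right(self, value):
        # Assumes nothing else drives the right segment at the same time.
        self.right = value & self.WIDTH_MASK


bus = SegmentedBus()

# Driver disabled: the two segments work independently.
bus.write_left(0x1234)
bus.write_right(0x00FF)
independent = (bus.left, bus.right)

# Driver enabled: a value placed on the left segment reaches the right one.
bus.driver_enabled = True
bus.write_left(0xABCD)
transferred = (bus.left, bus.right)
```

With the driver disabled the two segments hold different values, which corresponds to independent processing; with it enabled the same value appears on both, which corresponds to a transfer.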
Functionally, the datapath is composed of three operative sub-parts:
• Data processing
• I/O buffer management
• Communication processing
[Figure 6.5 (block diagram): Boot ROM, two I/O Buffer RAMs and Local Memory RAM, each with data and address connections; Input Link and Output Link pairs; BUS A and BUS B; datapath regions labelled Data Processing, I/O Buffer Management and Communication Processing.]
Figure 6.5: Organisation of the Neural-RISC Operative Part
The data processing sub-part includes the execution unit and the timer (Figure 6.5). At the right end of the first datapath segment are the registers and counters used for I/O buffer management. The second datapath segment (right side) contains the hardware (registers, latches, comparators, etc.) that implements the data ports and the input arbiter of the communication unit.
6.3.2. Control Part Organisation
The control part of a sequential machine drives the operative part by activating its control lines at the right time according to the system timing. The control part of the Neural-RISC takes less than 5% of the prototype chip area and is implemented with PLAs. Each PLA synthesizes a "nondeterministic" finite state machine-^9, in the sense that the machine can be in more than one state at the same time. The PLA implementation of nondeterministic finite automata [112] provides a considerable reduction of silicon area compared with conventional implementations [96].
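The idea of a machine that is in more than one state at once can be sketched by tracking a set of active states and advancing all of them on each step. This is a minimal sketch, assuming a purely illustrative controller: the state names (`FETCH`, `EXEC`, `PREFETCH`), the `done` input and the `step` function are hypothetical and do not reproduce the actual Neural-RISC control sequences.

```python
def step(active, inputs, transitions):
    """Advance every active state at once and return the new set of active states."""
    nxt = set()
    for state in active:
        for condition, targets in transitions.get(state, []):
            if condition(inputs):
                nxt.update(targets)
    return frozenset(nxt)


# Hypothetical controller: FETCH forks into EXEC and PREFETCH, which are
# active simultaneously; PREFETCH ends after one step, EXEC waits for `done`.
TRANSITIONS = {
    "FETCH": [(lambda i: True, {"EXEC", "PREFETCH"})],
    "EXEC": [
        (lambda i: i["done"], {"FETCH"}),
        (lambda i: not i["done"], {"EXEC"}),
    ],
    "PREFETCH": [(lambda i: True, set())],
}

start = frozenset({"FETCH"})
after_fork = step(start, {"done": False}, TRANSITIONS)      # two states active
still_busy = step(after_fork, {"done": False}, TRANSITIONS)  # only EXEC remains
finished = step(after_fork, {"done": True}, TRANSITIONS)     # back to FETCH
```

In hardware terms, one way to read this (again an assumption, not a statement about the chip) is one state bit per state, with the PLA computing the next value of every bit from the current bits and the inputs.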
The organisation of the control part, shown in Figure 6.6, is extremely simple and includes three control units:
• Processor control unit
• Channel-R control unit
• Channel-L control unit
[Figure 6.6 (block diagram): Processor Control Unit with one PLA; Communication Unit with one PLA each for Channel-R and Channel-L.]
Figure 6.6: Organisation of Neural-RISC Control Part
The processor control unit commands the functional components of the processor unit, i.e., the execution unit, the timer and the interrupt controller. The processor-unit PLA generates 21 different commands and takes 9 signals from the operative part, including a 4-bit instruction opcode. The simplicity of the instruction set allows instruction decoding and execution to be controlled by a single PLA.
The control units associated with the communication channels have identical PLAs that regulate the operations performed by the two communication engines. Each of these PLAs provides 16 control outputs and takes 8 input signals.
The system timing is based on two-phase nonoverlapping clock signals (Φ1 and Φ2), generated directly by an external oscillator (Figure 6.7). Control operations are executed in a scheme where each clock phase corresponds to a conventional machine cycle. Instructions take an even number of clock phases, and each of these phases holds a stable state which is stored in static flip-flops. In the global control timing, a stable state is used to perform the equivalent of a full machine cycle, initiating and terminating transfers between the transition edges of the clock phase. This strategy accelerates the instruction flow, enabling instructions to execute in fewer clock cycles (Appendix 3 presents the control sequence of instructions). Compared with conventional designs, where states change every clock cycle, this design strategy allows a slower clock to be distributed over the circuit while attaining the same performance. As a consequence of this approach, delays must be introduced to guard asynchronous sequential events from switching hazards. These delays depend on physical circuit parameters and can cause failures under extreme operating conditions unless they are carefully checked during design.
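The relation between phases, clock cycles and clock rate can be made concrete with a little arithmetic. The sketch below assumes only what the paragraph states (one machine cycle per clock phase, two phases per clock cycle); the function names and the example figures (a 4-machine-cycle instruction, a rate of 20 million machine cycles per second) are hypothetical and chosen purely for illustration.

```python
def clock_cycles(machine_cycles, machine_cycles_per_clock):
    """Clock cycles needed for an instruction of `machine_cycles` machine cycles."""
    # Ceiling division, so a partly used clock cycle still counts as a whole one.
    return -(-machine_cycles // machine_cycles_per_clock)


def required_clock_mhz(machine_cycle_rate_mhz, machine_cycles_per_clock):
    """Clock frequency needed to sustain a given machine-cycle rate."""
    return machine_cycle_rate_mhz / machine_cycles_per_clock


# Hypothetical instruction needing 4 machine cycles.
phase_based = clock_cycles(4, 2)    # one machine cycle per clock phase
conventional = clock_cycles(4, 1)   # one machine cycle per clock cycle

# Hypothetical target of 20 million machine cycles per second.
phase_based_clock = required_clock_mhz(20.0, 2)
conventional_clock = required_clock_mhz(20.0, 1)
```

Under these assumptions the phase-based scheme needs half as many clock cycles per instruction, or equivalently half the clock frequency for the same machine-cycle rate.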
To achieve efficient utilisation of the cycle time, a pair of successive signals Φ1x and Φ2x is generated internally from Φ1 and Φ2 (Figure 6.7). They correspond to the two-phase clock signals with the nonoverlap time equal to zero. These extended-phase signals are used to command bus drivers and memory-address registers in order to speed up memory access. In these accesses, address decoding effectively starts before the actual phase in which the read or write operation occurs.
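The difference between the nonoverlapping phases and their extended versions can be sketched as sampled waveforms over one clock cycle. This is a sketch under assumptions: the function `two_phase`, the tick counts and the placement of the nonoverlap gap at the start of each half cycle are hypothetical; the gap is placed there only so that the extended signal rises before the corresponding phase, in line with the remark about address decoding starting early.

```python
def two_phase(period, gap, extended=False):
    """Return (phi1, phi2) as lists of 0/1 samples for one clock cycle.

    Each phase owns one half of the cycle. In the normal signals the first
    `gap` ticks of each half are low (nonoverlap time); in the extended
    signals the gap is zero, so each signal is high for its whole half.
    """
    half = period // 2
    g = 0 if extended else gap
    phi1 = [1 if g <= t < half else 0 for t in range(period)]
    phi2 = [1 if half + g <= t < period else 0 for t in range(period)]
    return phi1, phi2


phi1, phi2 = two_phase(period=8, gap=1)
phi1x, phi2x = two_phase(period=8, gap=1, extended=True)

# The normal phases are never high together and leave gaps between them.
never_overlap = all(not (a and b) for a, b in zip(phi1, phi2))

# The extended phases cover every tick with exactly one signal high.
zero_gap = all(a + b == 1 for a, b in zip(phi1x, phi2x))

# Each extended signal rises earlier than the phase it extends.
rises_early = phi2x.index(1) < phi2.index(1)
```

A bus driver or address register commanded by the extended signal would, in this model, become active `gap` ticks before the phase in which the memory operation itself takes place.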
[Figure 6.7 (timing diagram): clock cycle and machine cycle intervals, with waveforms Φ1, Φ2, Φ1x and Φ2x.]
Figure 6.7: Neural-RISC Clock System
This timing strategy applies to all control units that operate by sharing resources on an alternating half-cycle basis. Resources used by a unit in one phase are released for use by the others in the following phase. Resource alternation is one option for increasing machine performance. Increased performance could also be obtained by doubling resources and by using complex self-synchronised control for resource contention. However, that approach would require far more silicon area. With the approach taken, the silicon area saved can go towards integrating multiple processors on the same chip.
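Resource alternation on a half-cycle basis amounts to a fixed schedule in which ownership of a shared resource changes at every clock phase. The sketch below is illustrative only: the pairing of a "processor" unit with a "communication" unit and the function names are hypothetical, since the text does not say which units share which resources.

```python
def owner(phase_index, units):
    """Unit that owns the shared resource during clock phase `phase_index`."""
    return units[phase_index % len(units)]


def schedule(num_phases, units=("processor", "communication")):
    """Ownership of the shared resource over consecutive clock phases."""
    return [owner(p, units) for p in range(num_phases)]


plan = schedule(4)

# No unit keeps the resource for two consecutive phases, so no arbitration
# logic is needed: the clock phase alone decides who may use it.
alternates = all(a != b for a, b in zip(plan, plan[1:]))
```

Because the owner is a function of the phase alone, this scheme needs neither duplicated resources nor contention control, which is the area saving the paragraph refers to.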