Simtec Hydra ARM Multiprocessor System

Introduction

Hydra is a hardware add-on for Acorn Computers' RiscPC ARM-based desktop computer systems which converts them into affordable asymmetric parallel processing systems.

RiscPC machines have the ability to support more than one processor. As standard they have two processor slots: one is normally occupied by an ARM processor card (the primary processor), and the other is free, allowing the addition of a second ARM, Intel, Motorola or other secondary processor. While the design of the primary processor card may be relatively simple, the second processor card must incorporate a certain amount of arbitration logic to enable it to share the bus with the primary processor. Although there are different design requirements for primary and secondary processor cards, the two processor slots on a standard RiscPC are electrically identical.

The Hydra card interfaces with the RiscPC via one of the processor slots. It duplicates both of the original slots and combines additional slots with the necessary arbitration logic to support a further four ARM processor cards. Because the Hydra design integrates the arbitration logic with the base board, ordinary ARM610 and 710 processor cards can be used. This makes it possible to add up to four off-the-shelf ARM processor cards to any RiscPC system.

Indeed, the Hydra card is not limited to just ARM processor cards: anything which appears to the system to be an ARM card can be used. This opens up the possibility of adding alternative high speed I/O cards which access memory or other expansion cards directly.

The Hydra API (Application program interface)

With four slave processor cards fitted, a RiscPC with Hydra has, in theory, five times the processing power of a standard RiscPC. Unfortunately the operating system, RISC OS, is not a multiprocessor OS and has no way of taking advantage of this increased processing power.
One way to make effective use of Hydra is to switch to an operating system which does support multiprocessing, such as RiscBSD, Helios or Taos. This has the advantage that any applications software which can multithread will automatically take advantage of any available processors.

However, for the ordinary RISC OS user, the easiest way to harness the power of Hydra is to use application software, written to enhance parts of RISC OS, which uses the Hydra API. As the API exists independently on RISC OS, any MP aware applications will make use of the new resources and ordinary applications will run unaffected.

Design philosophy

RISC OS is a robust, compact, efficient ROM based operating system with support for installable filesystems, fast bitmap and graphics operations, and anti-aliased font rendering. It has a desktop environment (the Wimp) which allows multiple co-operating tasks to share the machine.

RISC OS was designed to run on a single processor. As such there is no interface to support the creation of threads or manage their execution.

The Hydra API is designed to provide some of the benefits of multithreading with as little as possible of the overhead. After all, the main reason for using Hydra is to enhance the computer's performance. In this context it is not appropriate for the software to impose a heavy performance burden.

The Hydra API provides calls to:

- Set up the areas of memory containing code and data which a thread will use.
- Move additional areas of memory in and out of the address space of the slave processors.
- Schedule the thread for execution.
- Monitor the progress of a scheduled thread.

Threads are written in ARM assembler in 32 bit mode. They see an operating system interface which is a subset of RISC OS, supporting screen and keyboard I/O, file operations and certain utility functions. In addition there is a generic interface which allows a thread to issue a call to any RISC OS SWI.
SWIs generated on a slave processor are either performed locally or passed to the Master processor for execution. In this way filing operations are performed by only one processor, so filing system consistency is guaranteed.

Architecture

The Hydra API is implemented by a relocatable module which runs on the RISC OS host and a small kernel which is run by each slave. Code (kernel and user) is shared between slaves. Data areas can be shared or unique.

When Hydra starts, the kernel code is loaded into shared memory and the slave processors are reset under control of the host. Memory is then allocated to hold level 1 and level 2 page tables for each installed slave. At the end of the boot sequence the kernel enters a command processing loop.

As an aid to software development, each slave processor can receive keyboard input and send character based output to a virtual terminal which is provided by the HydraTerm application. This allows trace information and notifications of exceptions to be displayed. The kernel also supports a limited command line interface (CLI) allowing memory and registers to be dumped and disassembled and code to be executed.

Each slave inputs and processes commands until a thread is scheduled for it, whereupon it abandons whatever command it was executing and enters the thread code at the specified address. Any calls which the thread makes to the standard character I/O SWIs (OSReadC, OSReadLine, OSWriteC etc.) are routed to the virtual terminal. It is not anticipated that end users will interact with Hydra via this interface.

When a thread signifies that it has terminated (by calling OSExit) the next pending thread is executed. If no thread is waiting, control returns to the interactive command line.

Scheduling Threads

As described above, threads are allocated to processors on a first-come, first-served basis.
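
As an illustration of first-come, first-served allocation, the following sketch estimates how long a batch of threads takes to complete for a given number of slaves. It assumes every thread in a batch runs for the same time and ignores all scheduling and bus overhead. The function name `makespan` and its parameters are hypothetical and are not part of the Hydra API.

```python
import heapq


def makespan(num_threads, num_slaves, thread_time):
    """Return the time at which the last of `num_threads` equal-length
    threads finishes when each thread is handed to whichever slave
    becomes free first (first-come, first-served)."""
    if num_slaves < 1:
        raise ValueError("at least one slave processor is required")
    # Each heap entry is the time at which one slave next becomes free.
    free_at = [0] * num_slaves
    heapq.heapify(free_at)
    finish = 0
    for _ in range(num_threads):
        start = heapq.heappop(free_at)   # earliest available slave
        end = start + thread_time
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# A task of 12 time units split into 4 threads of 3 units each,
# compared with the same task split into 12 threads of 1 unit each.
for slaves in (1, 2, 3, 4):
    coarse = makespan(4, slaves, 3)
    fine = makespan(12, slaves, 1)
    print(f"{slaves} slave(s): 4 threads -> {coarse}, 12 threads -> {fine}")
```

Under these assumptions the finer split is never slower than the coarse one, and it is faster whenever the number of slaves does not divide the number of coarse threads evenly.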
The simple queuing mechanism allows Hydra to be shared between a number of client applications and allows for solutions which scale well whatever number of slave processors are fitted.

Let's assume that a hypothetical application has a time consuming task which can be split to run in parallel on a number of processors. A naive approach might be to split the task into four threads, each of which would take N seconds to execute. On a system with one slave the four threads would execute sequentially, taking a total of 4N seconds. On a four slave system the threads would execute concurrently, taking N seconds. However, on a three slave system the first three threads would execute immediately, leaving the fourth thread to execute on its own after the first three had completed, taking a total of 2N seconds.

A better approach would be to split the task into twelve threads. On a four slave system each processor would execute three of the threads; on a three slave system each processor would handle four of the threads, and so on. This approach also scales better to future systems which may support more than four slave processors.

Memory map

The memory map for a slave processor looks a little like the memory map of a RISC OS machine:

Address               Allocation
00000000 - 00007FFF   Kernel internal use, vector tables, communication queues and
                      stacks (unique to each slave)
00008000 - 037FFFFF   Available to user programs. Memory in this region is allocated
                      by the client application.
03800000 - 0380FFFF   Kernel code (read only, shared between all slaves, may be less
                      than 64k in practice)
03810000 - 03FFFFFF   One to one mapping with I/O space in host's address space
04000000 +            Level 1 and level 2 page tables and other memory management
                      workspace. The size of this area depends on the amount of
                      physical RAM in the system.
80000000 - FFFFFFFF   One to one mapping with physical memory. By default this is not
                      accessible, to prevent a rogue slave from corrupting RISC OS or
                      other processors' workspace.

How it works

The Hydra arbitration logic is used to multiplex processors to the memory bus and ensures that only one processor talks to the memory bus at any one time. Any processor requiring a memory cycle is guaranteed access to the bus by a last-used, least-priority rotational priority encoder. This gives the bus to each processor in turn if it needs it; otherwise the bus stays with the current owner.

When a processor is reset, an external memory modifier unit is enabled to force the processor to execute its reset code from a fixed area of memory; otherwise it would execute the RISC OS reset code and crash the already running RISC OS. Once the processor is initialised and running useful code, the modifier unit is disabled and the processor addresses are output normally.

There is also logic to halt a processor, so when a task is complete a processor can shut itself down and wait in suspended animation until un-halted or reset by the Master processor.

There is an extensive interrupt structure allowing slaves to send IRQs or FIQs to each other and to signal the Master processor through the interrupt structure of the podule bus.

Wherever possible registers have hardware interlocks which prevent one processor from interfering with bits that control the others. In some cases registers are context sensitive and will only set or enable particular bits of a register depending on which processor is accessing them.

A processor can be identified by reading the ID_status register, whose contents reflect the physical socket number that the processor is connected to. This enables the controlling software to compute which register bits belong to that processor. A HardwareVer register holds the current revision number of the arbitration logic.
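
As a sketch of how controlling software might use the socket number, the following decodes a 4 bit ID_status value using the encoding given in the register list (Master X0XX, slaves X100 to X111, reading the X111 entry as P(3)). It assumes D0 is the least significant bit and that bit n of the per-processor registers belongs to P(n). The function names are hypothetical, and no real hardware access is shown.

```python
def decode_id_status(value):
    """Return the slave number (0-3) encoded in an ID_status value,
    or None if the value identifies the Master processor.

    Assumed encoding, D3..D0: Master = X0XX, P(n) = X1nn.
    """
    nibble = value & 0xF
    if nibble & 0b0100 == 0:      # D2 clear: Master
        return None
    return nibble & 0b0011        # D1..D0 hold the slave number


def own_bit_mask(value):
    """Return the mask of the register bit belonging to the slave
    identified by `value` (bit n for P(n))."""
    slave = decode_id_status(value)
    if slave is None:
        raise ValueError("the Master has no per-slave bit")
    return 1 << slave


print(decode_id_status(0b0110))   # slave number for X110
print(bin(own_bit_mask(0b0110)))  # mask of that slave's bit
```

Bit D3 is a don't-care in this encoding, so the decoder masks it out rather than testing it.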
For those who feel a need to access the hardware directly, below is a register list of the Hydra card. Please note that some of these registers and their operation will change, but every effort will be made to make them backwards compatible.

Currently there are 16 write and 8 read registers, each 4 bits wide, addressed physically from &3800000, and a 4MB block of address space set aside at &3C00000 for local Slave memory.

Addr  Register      Settings                                                      Reset State  Flags   R/W
&00   FIQ_set       1 sets bits in reg. 0 no change. 1(n) asserts FIQ P(n)        0000         (-MS)   W
&04   FIQ_clr       1 clears bits in reg. 0 no change.                                         (-MS)   W
&08   ForceFIQ_clr  1(n) clears all FIQs sent to P(n). 0 no change.                            (-M-)   W
&10   MMU_LSN       Writes D[3:0] to A[24:21] of MMU                              0000         (A--)   W
&14   MMU_MSN       Writes D[3:0] to A[28:25] of MMU                              0000         (A--)   W
&18   MMU_set       1 sets bits in reg. 0 no change. 1(n) enables MMU for P(n)    0000         (A--)   W
&1C   MMU_clr       1 clears bits in reg. 0 no change.                                         (-MS)   W
&20   IRQ_set       1 sets bits in reg. 0 no change. 1(n) asserts IRQ P(n)        0000         (-MS)   W
&24   IRQ_clr       1 clears bits in reg. 0 no change.                                         (-MS)   W
&28   ForceIRQ_clr  1(n) clears all IRQs sent to P(n). 0 no change.                            (-M-)   W
&30   Reset         Writes D[3:0] to reg. 1(n) to assert RST(n).                  0000         (-M-)   W
&34   X86_killer    Writes D[3:0] to reg. 0 is disabled, 1111 max (SEQ 15/16ths)  0000         (A--)   W
&38   Halt_set      1 sets bits in reg. 0 no change. 1(n) to halt P(n).           1111         (-MS)   W
&3C   Halt_clr      1 clears bits in reg. 0 no change.                                         (-M-)   W

Status Registers:

Addr  Register      Settings                                                      Reset State  Flags   R/W
&00   FIQ_status    D(n)=1 if P(n) set interrupt. When slave n reads,                          (-MS)   R
                    D(n)=1 means the Master set the interrupt.
&04   FIQ_readback  D(0:3) returns data written to FIQ_set reg.                                (-MS)   R
&08   HardwareVer   D(0:3) with hardware id number (current version returns 1)                 (A--)   R
&18   MMU_status    D(n)=1 then MMU enabled for P(n).                             0000         (A--)   R
&1D   ID_status     D[3:0] Master X0XX, P(0)=X100, P(1)=X101, P(2)=X110,                       (A--)   R
                    P(3)=X111.
&20   IRQ_status    D(n)=1 if P(n) set interrupt. When slave n reads,                          (-MS)   R
                    D(n)=1 means the Master set the interrupt.
&24   IRQ_readback  D(0:3) returns data written to IRQ_set reg.                                (-MS)   R
&30   RST_status    D(n)=1 then P(n) is still under RESET                                      (A--)   R
&38   Halt_status   D(n)=1 then P(n) is halted                                                 (A--)   R

Access Flags: A - Any processor, M - Master only, S - Slave only
*NOTE: MS - Master and Slave have context sensitive access

Obsolete Registers:

Addr  Register      Settings                                                      Reset State  Flags   R/W
&08   PFIQ_set      1 sets bits in reg. 0 no change. 1(n) asserts PFIQ to master  0000         (--S)   W
&0C   PFIQ_clr      1 clears bits in reg. 0 no change.                                         (-M-)   W
&28   PIRQ_set      1 sets bits in reg. 0 no change. 1(n) asserts PIRQ to master  0000         (--S)   W
&2C   PIRQ_clr      1 clears bits in reg. 0 no change.                                         (-M-)   W
&10   PFIQ_status   D(n)=1 if P(n) set interrupt.                                              (A--)   R
&28   PIRQ_status   D(n)=1 if P(n) set interrupt.                                              (A--)   R

Inter-processor interrupts

The Hydra card supports two identical interrupt structures, one for IRQs and the other for FIQs. In each case it is possible for a slave to set the interrupt line of one or more slaves simultaneously by writing to either the IRQ_set or FIQ_set register. Slaves communicate with the Master processor by writing to these same registers, which assert the appropriate podule interrupt lines.

The IRQ mechanism is used by the API for message passing and should not be used by user code. However, FIQs may be freely used. The default owner of the FIQ vector is the register snapshot routine used by the debugger.

Bits are set by writing to a set register with the required bits set to 1, and cleared by writing to the paired clear register with bits set to 1 in the positions where bits are to be cleared. Writing a zero to any bit of these registers has no effect, so if other processors set additional interrupt bits they won't be accidentally cleared by the interrupted processor.
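
The set and clear convention can be illustrated with a small software model. This is a minimal sketch of a single 4 bit register pair, not a description of the actual hardware: it ignores the context sensitive behaviour and the hardware interlocks, and the class and method names are hypothetical.

```python
class SetClearRegister:
    """Software model of a 4 bit register with paired set and clear
    addresses, where writing a 0 bit never changes anything."""

    def __init__(self, reset_state=0b0000):
        self.bits = reset_state & 0xF

    def write_set(self, value):
        # Every 1 bit written sets the matching bit; 0 bits are ignored.
        self.bits |= value & 0xF

    def write_clear(self, value):
        # Every 1 bit written clears the matching bit; 0 bits are ignored.
        self.bits &= ~value & 0xF

    def read(self):
        return self.bits


reg = SetClearRegister()
reg.write_set(0b0001)    # first processor sets bit 0
reg.write_set(0b0100)    # second processor sets bit 2 a moment later
reg.write_clear(0b0001)  # bit 0 is cleared without touching bit 2
print(format(reg.read(), "04b"))
```

Because neither write needs a read-modify-write cycle, two processors can update different bits of the same register without any locking between them.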
Inter slave and master to slave interrupts:

FIQ_set: D(0:3) & IRQ_set: D(0:3)

         Set register:                  Status register:
Bits:    D0     D1     D2     D3        D0     D1     D2     D3
Master:  M>S0   M>S1   M>S2   M>S3      S0>M   S1>M   S2>M   S3>M
Slave0:  S0>M   S0>S1  S0>S2  S0>S3     M>S0   S1>S0  S2>S0  S3>S0
Slave1:  S1>S0  S1>M   S1>S2  S1>S3     S0>S1  M>S1   S2>S1  S3>S1
Slave2:  S2>S0  S2>S1  S2>M   S2>S3     S0>S2  S1>S2  M>S2   S3>S2
Slave3:  S3>S0  S3>S1  S3>S2  S3>M      S0>S3  S1>S3  S2>S3  M>S3

A slave can send an interrupt to other slaves by writing the appropriate bits to the set register. D(n) will send an interrupt to slave processor n. When a slave reads the status register, a vertical slice is read, with bits set for every processor that has posted it an interrupt. Slave 0 would set D(0), slave 1 would set D(1), etc.

As sending an interrupt to oneself has no purpose, the otherwise redundant diagonal bits are used to store the interrupt bits written by the Master processor to the slaves. When writing to the register, the Master sets the flags M>S0, M>S1, M>S2 and M>S3, one for each slave.

FIQ_readback: D(0:3) & IRQ_readback: D(0:3)

A processor can check whether an interrupt has been cleared by the recipient by reading the readback registers. They return the bitfield in the same format as the set registers. Because continuous polling of a register is bus-inefficient, it is expected that an acknowledge interrupt will be returned to the sender after the interrupt is serviced.

Slave to Master interrupts are performed by writing to the 'redundant' bit that corresponds to the slave itself. When a slave interrupts the Master it sets its own flag bit in the 4 bit IRQ or FIQ set register.

Once an interrupt bit is set, it can only be cleared by the recipient of the interrupt or by a system reset. In case interrupts are sent to a processor that is not fitted or not running, the ForceFIQ_clr and ForceIRQ_clr registers allow the Master to clear all interrupts destined for a particular slave. D(0) clears all interrupts to slave0, D(1) to slave1, etc.
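
The bit assignments above can be summarised in a small software model of one interrupt structure (IRQ or FIQ). This is only a sketch of the behaviour as described in this section. It assumes that the clear register takes bits in the same positions as the status register; the class, its method names and the `MASTER` constant are hypothetical.

```python
MASTER = "M"   # slaves are identified by the integers 0 to 3


class InterruptMatrix:
    """Model of pending inter-processor interrupts for one structure."""

    def __init__(self):
        self.pending = set()   # (sender, recipient) pairs

    @staticmethod
    def _peer(owner, bit):
        """Processor that bit `bit` refers to in `owner`'s registers.
        For a slave, its own (diagonal) bit stands for the Master."""
        if owner == MASTER:
            return bit
        return MASTER if bit == owner else bit

    def write_set(self, sender, value):
        for bit in range(4):
            if (value >> bit) & 1:
                self.pending.add((sender, self._peer(sender, bit)))

    def read_status(self, recipient):
        """Bits set for every processor that has interrupted `recipient`."""
        return sum(1 << bit for bit in range(4)
                   if (self._peer(recipient, bit), recipient) in self.pending)

    def read_readback(self, sender):
        """Bits set for every interrupt from `sender` not yet cleared."""
        return sum(1 << bit for bit in range(4)
                   if (sender, self._peer(sender, bit)) in self.pending)

    def write_clear(self, recipient, value):
        """Only the recipient clears an interrupt (assumed status layout)."""
        for bit in range(4):
            if (value >> bit) & 1:
                self.pending.discard((self._peer(recipient, bit), recipient))

    def force_clear(self, value):
        """Master only: D(n) clears every interrupt destined for slave n."""
        targets = {bit for bit in range(4) if (value >> bit) & 1}
        self.pending = {(s, r) for (s, r) in self.pending if r not in targets}


m = InterruptMatrix()
m.write_set(1, 0b0101)        # slave 1 interrupts slaves 0 and 2
m.write_set(MASTER, 0b0001)   # Master interrupts slave 0
print(format(m.read_status(0), "04b"))
```

In this model slave 0 sees D0 (the Master's diagonal bit) and D1 (slave 1) set, matching the Slave0 row of the status table.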
Simtec Hydra Multiprocessor hardware overview Iss B 17th May 1996