Simtec Hydra ARM Multiprocessor System

Introduction

Hydra is a hardware add-on for Acorn Computers' RiscPC ARM-based
desktop computer systems which converts them into affordable
asymmetric parallel processing systems.  RiscPC machines have the
ability to support more than one processor.  As standard they have two
processor slots: one is normally occupied by an ARM processor card
(the primary processor), and the other is free, allowing the addition
of a second ARM, Intel, Motorola or other secondary processor.  While
the design of the primary processor card may be relatively simple the
second processor card must incorporate a certain amount of arbitration
logic to enable it to share the bus with the primary processor.
Although there are different design requirements for primary and
secondary processor cards the two processor slots on a standard RiscPC
are electrically identical.  The Hydra card interfaces with the RiscPC
via one of the processor slots and duplicates both of the original
slots and combines additional slots with the necessary arbitration
logic to support a further four ARM processor cards.  Because the
Hydra design integrates the arbitration logic with the base board,
ordinary ARM610 and 710 processor cards can be used.  This makes it
possible to add up to four off-the-shelf ARM processor cards to any
RiscPC system. Indeed, the Hydra card is not limited to just ARM
processor cards, anything which appears to the system to be an ARM
card can be used.  This opens up the possibility of adding alternative
high speed I/O cards which access memory or other expansion cards
directly.


The Hydra API (Application program interface)

With four slave processor cards fitted a RiscPC with Hydra has, in
theory, five times the processing power of a standard RiscPC.
Unfortunately the operating system RISC OS is not a multiprocessor OS
and has no way of taking advantage of this increased processing power.
One way to make effective use of Hydra is to switch to an operating
system which does support multiprocessing such as RiscBSD, Helios or
Taos.  This has the advantage that any applications software which can
multithread will automatically take advantage of any available
processors.  However, for the ordinary RISC OS user, the easiest way
to harness the power of Hydra is to use application software which
has been written to use the Hydra API to enhance parts of RISC OS.  As
the API exists independently of RISC OS, any MP aware applications
will make use of the new resources and ordinary applications will run
unaffected.


Design philosophy

RISC OS is a robust, compact, efficient ROM based operating system
with support for installable filesystems, fast bitmap and graphics
operations, and anti-aliased font rendering.  It has a desktop environment
(the Wimp) which allows multiple co-operating tasks to share the
machine.  RISC OS was designed to run on a single processor.  As such
there is no interface to support the creation of threads or manage
their execution. The Hydra API is designed to provide some of the
benefits of multithreading with as little as possible of the
overhead. After all, the main reason for using Hydra is to enhance
the computer's performance, so it is not appropriate for the software
to impose a heavy performance burden.

The Hydra API provides calls to:

- Set up the areas of memory containing code and data which a thread
will use.
- Move additional areas of memory in and out of the address space of
the slave processors.
- Schedule the thread for execution.
- Monitor the progress of a scheduled thread.

Threads are written in ARM assembler for 32-bit mode. They see an
operating system interface which is a subset of RISC OS supporting
screen and keyboard I/O, file operations and certain utility
functions. In addition there is a generic interface which allows a
thread to issue a call to any RISC OS SWI.  SWIs generated on a slave
processor are either performed locally or passed to the Master
processor for execution.  In this way, all filing operations are
performed by only one processor so filing system consistency is
guaranteed.


Architecture

The Hydra API is implemented by a relocatable module which runs on the
RISC OS host and a small kernel which is run by each slave.  Code
(kernel and user) is shared between slaves.  Data areas can be shared
or unique.  When Hydra starts the kernel code is loaded into shared
memory and the slave processors are reset under control of the
host. Memory is then allocated to hold level 1 & 2 page tables for
each installed slave. At the end of the boot sequence the kernel
enters a command processing loop.  As an aid to software development
each slave processor can receive keyboard input and send character
based output to a virtual terminal which is provided by the HydraTerm
application. This allows trace information and notifications of
exceptions to be displayed. The kernel also supports a limited command
line interface (CLI) allowing memory and registers to be dumped and
disassembled and code to be executed.  Each slave inputs and processes
commands until a thread is scheduled for it whereupon it abandons
whatever command it was executing and enters the thread code at the
specified address.  Any calls which the thread makes to the standard
character I/O SWIs (OSReadC, OSReadLine, OSWriteC etc.) are routed to
the virtual terminal.  It is not anticipated that end users will
interact with Hydra via this interface.  When a thread signifies that
it has terminated (by calling OSExit) the next pending thread is
executed. If no thread is waiting control returns to the interactive
command line.


Scheduling Threads

As described above, threads are allocated to processors on a
first-come, first-served basis. The simple queuing mechanism allows
Hydra to be shared between a number of client applications and allows
for solutions which scale well whatever number of slave processors are
fitted.  Let's assume that a hypothetical application has a time
consuming task which can be split to run in parallel on a number of
processors.  A naive approach might be to split the task into four
threads each of which would take N seconds to execute.  On a system
with one slave the four threads would execute sequentially taking a
total of 4 N seconds.  On a four slave system the threads would
execute concurrently taking N seconds.  However, on a three processor
system the first three threads would execute immediately, leaving the
fourth thread to execute on its own after the first three had
completed, taking a total of 2 N seconds.  A better approach would be
to split the task into twelve threads. On a four processor system each
processor would execute three of the threads; on a three processor
system each processor would handle four of the threads and so on. This
approach also scales better to future systems which may support more
than four slave processors.
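
The arithmetic above can be checked with a short Python sketch.  It
assumes that all threads take equal time and are handed out first-come,
first-served, so the elapsed time is the number of rounds multiplied by
the time for one thread.  The function name and the value chosen for N
are illustrative only and are not part of the Hydra API.

```python
import math


def total_time(threads: int, slaves: int, seconds_per_thread: float) -> float:
    """Elapsed time for equal-length threads queued first-come, first-served."""
    if threads < 0 or slaves < 1:
        raise ValueError("need a non-negative thread count and at least one slave")
    # Each round runs up to `slaves` threads at once; a part-filled
    # final round still costs a full thread time.
    rounds = math.ceil(threads / slaves)
    return rounds * seconds_per_thread


N = 12.0  # hypothetical time in seconds for one quarter of the task

for slaves in (1, 2, 3, 4):
    four_way = total_time(4, slaves, N)         # task split into 4 threads
    twelve_way = total_time(12, slaves, N / 3)  # same task split into 12 threads
    print(f"{slaves} slave(s): 4 threads {four_way}s, 12 threads {twelve_way}s")
```

With N = 12 the three slave case takes 24 seconds when the task is
split four ways and 16 seconds when it is split twelve ways, which is
the saving described above.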


Memory map

The memory map for a slave processor looks a little like the memory
map of a RISC OS machine:

Address Allocation

00000000 - 00007FFF Kernel internal use, vector tables, communication
queues and stacks (unique to each slave)

00008000 - 037FFFFF Available to user programs. Memory in this region
is allocated by the client application.

03800000 - 0380FFFF Kernel code (read only, shared between all slaves,
may be less than 64k in practice)

03810000 - 03FFFFFF One to one mapping with I/O space in host's address
space

04000000 + Level 1 and level 2 page tables and other memory management
workspace. The size of this area depends on the amount of physical RAM
in the system.

80000000 - FFFFFFFF One to one mapping with physical memory which by
default is not accessible to prevent a rogue slave from corrupting
RISC OS or other processors' workspace.
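
As an illustration, the map can be expressed as a small lookup table.
The Python sketch below only restates the ranges listed above; the
function name and region labels are hypothetical, and because the
upper limit of the page table area is not given, addresses from
04000000 up to 7FFFFFFF are reported without a definite extent.

```python
# (start, end, description) for the regions with fixed bounds.
REGIONS = [
    (0x00000000, 0x00007FFF, "kernel internal use (unique to each slave)"),
    (0x00008000, 0x037FFFFF, "user programs"),
    (0x03800000, 0x0380FFFF, "kernel code (read only, shared)"),
    (0x03810000, 0x03FFFFFF, "I/O space (one to one with host)"),
    (0x80000000, 0xFFFFFFFF, "physical memory (one to one, inaccessible by default)"),
]

PAGE_TABLES_START = 0x04000000  # extent depends on the amount of physical RAM


def region_for(address: int) -> str:
    """Return a description of the slave address space region holding `address`."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    for start, end, description in REGIONS:
        if start <= address <= end:
            return description
    # Only 04000000 - 7FFFFFFF can reach this point.
    return "page tables and memory management workspace (extent not specified)"


for sample in (0x00000000, 0x00008000, 0x03800000, 0x03810000, 0x04000000, 0x80000000):
    print(f"{sample:08X}: {region_for(sample)}")
```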


How it works

The Hydra arbitration logic is used to multiplex processors to the
memory bus and ensures that only one processor talks to the memory bus
at any one time.  Any processor requiring a memory cycle is guaranteed
access to the bus by a rotational priority encoder in which the last
processor to use the bus has the lowest priority.  The bus is given to
each requesting processor in turn; if no other processor needs it, it
stays with the current owner.
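
The behaviour of such a rotational priority scheme can be sketched in
Python.  This is a software model only: the function name, the
numbering of processors and the use of five request inputs (Master
plus four slaves) are assumptions, and the real logic may differ in
detail.

```python
def next_owner(current: int, requests: list[bool]) -> int:
    """Pick the next bus owner under last-used, least-priority rotation.

    `current` is the index of the processor that owns the bus now and
    `requests[i]` is True when processor i wants a memory cycle.
    """
    count = len(requests)
    # Search from the processor after the current owner and wrap round,
    # so the current owner is considered last (lowest priority).
    for offset in range(1, count + 1):
        candidate = (current + offset) % count
        if requests[candidate]:
            return candidate
    # Nobody is requesting: the bus stays where it is.
    return current


# Processors 0 and 1 both request continuously; ownership alternates.
owner = 0
history = []
for _ in range(4):
    owner = next_owner(owner, [True, True, False, False, False])
    history.append(owner)
print(history)
```

Two processors that both request the bus continuously are granted it
alternately, so neither can starve the other.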

When a processor is reset, an external memory modifier unit is enabled
to force it to execute its reset code from a fixed area of memory;
otherwise it would execute the RISC OS reset code and crash the
already running RISC OS.  Once the processor is initialised and
running useful code, the modifier unit is disabled and the processor
addresses are output normally.  There is also logic to halt a
processor so when a task is complete, a processor can shut itself down
and wait in suspended animation until un-halted or reset by the Master
processor.  There is an extensive interrupt structure allowing slaves
to send IRQs or FIQs to each other and to signal the Master processor
through the interrupt structure of the podule bus.  Wherever possible
registers have hardware interlocks which prevent one processor from
interfering with bits that control the others.  In some cases,
registers are context sensitive and will only set or enable particular
bits of a register dependent on which processor is accessing them. A
processor can be identified by reading the ID_status register, whose
contents reflect the physical socket number that the processor is
connected to. This enables the controlling software to compute which
register bits belong to that processor.  A HardwareVer register holds
the current revision number of the arbitration logic.
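
A minimal Python sketch of how controlling software might interpret
the ID_status value follows.  It assumes the encoding given in the
status register list (bit 2 clear for the Master, bit 2 set with the
slave number in bits 1 and 0, bit 3 ignored) and assumes that the
X111 entry refers to P(3).  The function names are hypothetical.

```python
from typing import Optional


def decode_id_status(value: int) -> Optional[int]:
    """Return the slave number (0-3), or None when read by the Master."""
    value &= 0xF              # the register is 4 bits wide
    if value & 0b0100 == 0:   # X0XX: Master
        return None
    return value & 0b0011     # X1nn: slave nn


def own_bit_mask(slave: int) -> int:
    """Mask for bit D(n), the bit belonging to slave n in a 4 bit register."""
    if not 0 <= slave <= 3:
        raise ValueError("slave number must be 0-3")
    return 1 << slave


for raw in (0b0000, 0b0100, 0b0101, 0b0110, 0b0111):
    slave = decode_id_status(raw)
    who = "Master" if slave is None else f"slave {slave}, own bit {own_bit_mask(slave):04b}"
    print(f"{raw:04b}: {who}")
```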

For those who feel a need to access the hardware directly, below is a
register list of the Hydra card.  Please note that some of these
registers and their operation will change but every effort will be
made to make them backwards compatible.  Currently there are 16 write
and 8 read registers, each 4 bits wide, addressed physically from
&3800000 and a 4Mb block of address space set aside at &3C00000 for
local Slave memory.

Addr	Register	Settings			          Reset State  Flags R/W

&00	FIQ_set	1 sets bits in reg. 0 no change. 1(n) asserts FIQ P(n)		0000	(-MS) W
&04	FIQ_clr	1 clears bits in reg. 0 no change.				(-MS) W
&08	ForceFIQ_clr	1 clears bits every slave FIQ reg. 0 no change. 1(n)			(-M-) W
&10	MMU_LSN	Writes D[3:0] to A[24:21] of MMU			0000	(A--) W
&14	MMU_MSN	Writes D[3:0] to A[28:25] of MMU			0000	(A--) W
&18	MMU_set	1 sets bits in reg. 0 no change. 1(n) enables MMU for P(n)	0000	(A--) W
&1C	MMU_clr	1 clears bits in reg. 0 no change.				(-MS) W
&20	IRQ_set	1 sets bits in reg. 0 no change. 1(n) asserts IRQ P(n)		0000	(-MS) W
&24	IRQ_clr	1 clears bits in reg. 0 no change.				(-MS) W
&28	ForceIRQ_clr	1 clears bits every slave IRQ reg. 0 no change. 1(n)			(-M-) W
&30	Reset	Writes D[3:0] to reg. 1(n) to assert RST(n).		0000	(-M-) W
&34	X86_killer	Writes D[3:0] to reg. 0 is disabled 1111 max (SEQ 15/16ths)	0000	(A--) W
&38	Halt_set	1 sets bits in reg. 0 no change. 1(n) to halt P(n).		1111	(-MS) W
&3C	Halt_clr	1 clears bits in reg. 0 no change.				(-M-) W

Status Registers:

&00	FIQ_status	D(n)=1 if P(n) set interrupt. For D(n) <self> =1 then Master set interrupt.	(-MS) R
&04	FIQ_readback	D(0:3) returns data written to FIQ_set reg.			(-MS) R
&08	HardwareVer	D(0:3) with hardware id number (current version returns 1)		(A--) R
&18	MMU_status	D(n)=1 then MMU enabled for P(n).			0000	(A--) R
&1D	ID_status	D[3:0]  Master X0XX, P(0)=X100, P(1)=X101, P(2)=X110, P(3)=X111.		(A--) R
&20	IRQ_status	D(n)=1 if P(n) set interrupt. For D(n) <self> =1 then Master set interrupt.	(-MS) R
&24	IRQ_readback	D(0:3) returns data written to IRQ_set reg.			(-MS) R
&30	RST_status	D(n)=1 then P(n) is still under RESET				(A--) R
&38	Halt_status	D(n)=1 then P(n) is halted				(A--) R

Access Flags:	A - Any processor, M - Master only,  S - Slave only
*NOTE:	MS - Master and Slave have context sensitive access

Obsolete Registers:

&08	PFIQ_set	1 sets bits in reg. 0 no change. 1(n) asserts PFIQ to master	0000	(--S) W
&0C	PFIQ_clr	1 clears bits in reg. 0 no change.				(-M-) W
&28	PIRQ_set	1 sets bits in reg. 0 no change. 1(n) asserts PIRQ to master.	0000	(--S) W
&2C	PIRQ_clr	1 clears bits in reg. 0 no change.				(-M-) W
&10	PFIQ_status	D(n)=1 if P(n) set interrupt.				(A--) R
&28	PIRQ_status	D(n)=1 if P(n) set interrupt.				(A--) R
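
The MMU_LSN and MMU_MSN entries can be read as two 4 bit writes which
together supply address bits A[28:21] to the memory modifier unit.
The Python sketch below splits an address into those two nibbles and
rebuilds it again.  This reading of the register descriptions is an
assumption, the function names are hypothetical, and the address used
is sample input only.

```python
def mmu_nibbles(address: int) -> tuple[int, int]:
    """Return (lsn, msn): values for MMU_LSN (A[24:21]) and MMU_MSN (A[28:25])."""
    lsn = (address >> 21) & 0xF
    msn = (address >> 25) & 0xF
    return lsn, msn


def address_from_nibbles(lsn: int, msn: int) -> int:
    """Rebuild address bits A[28:21] from the two nibbles (other bits zero)."""
    return ((msn & 0xF) << 25) | ((lsn & 0xF) << 21)


sample = 0x03C00000  # sample input only
lsn, msn = mmu_nibbles(sample)
print(f"MMU_LSN = {lsn:04b}, MMU_MSN = {msn:04b}")
print(f"rebuilt = &{address_from_nibbles(lsn, msn):07X}")
```

Because the low nibble starts at A21, only addresses aligned to a 2Mb
boundary survive the round trip unchanged.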


Inter-processor interrupts

The Hydra card supports two identical interrupt structures, one for
IRQs and the other for FIQs.  In each case it is possible for a slave
to set the interrupt line of one or more slaves simultaneously by
writing to either the IRQ_set or FIQ_set registers.  Slaves
communicate to the Master processor by writing to these registers
which assert the appropriate podule interrupt lines.  The IRQ
mechanism is used by the API for message passing and should not be
used by user code.  However, FIQs may be freely used.  The default
owner of the FIQ vector is the register snapshot routine used by the
debugger.

Bits are set by writing to a set register with the required bits set
to 1 and cleared by writing to the paired clear register with bits set
in the positions where bits are to be cleared.  In this way, if other
processors set additional interrupt bits, they won't be accidentally
cleared by the interrupted processor, as writing a zero to any of the
registers has no effect.
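
The set and clear behaviour can be modelled in a few lines of Python.
This is a sketch of the write semantics only; it ignores the context
sensitive view that each processor has of the real registers, and the
class and method names are hypothetical.

```python
class SetClearRegister:
    """A 4 bit register written through separate set and clear addresses."""

    def __init__(self, reset_state: int = 0) -> None:
        self.bits = reset_state & 0xF

    def write_set(self, value: int) -> None:
        # A 1 sets the corresponding bit; a 0 leaves it unchanged.
        self.bits |= value & 0xF

    def write_clr(self, value: int) -> None:
        # A 1 clears the corresponding bit; a 0 leaves it unchanged.
        self.bits &= ~value & 0xF


reg = SetClearRegister()
reg.write_set(0b0001)   # first sender posts an interrupt
reg.write_set(0b0100)   # a second sender posts another meanwhile
reg.write_clr(0b0001)   # recipient acknowledges only the first
print(f"{reg.bits:04b}")
```

The second interrupt bit survives the acknowledge because the
recipient never has to write back a whole register value.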

Inter slave and master to slave interrupts:  FIQ_set: D(0:3) & IRQ_set: D(0:3)

	Set register:		Status register:

Bits:	D0    D1    D2    D3		D0    D1    D2    D3

Master:	M>S0  M>S1  M>S2  M>S3		S0>M  S1>M  S2>M  S3>M

Slave0:	S0>M  S0>S1 S0>S2 S0>S3		M>S0  S1>S0 S2>S0 S3>S0
Slave1:	S1>S0 S1>M  S1>S2 S1>S3		S0>S1 M>S1  S2>S1 S3>S1
Slave2:	S2>S0 S2>S1 S2>M  S2>S3		S0>S2 S1>S2 M>S2  S3>S2
Slave3:	S3>S0 S3>S1 S3>S2 S3>M		S0>S3 S1>S3 S2>S3 M>S3


A slave can send an interrupt to other slaves by writing the
appropriate bits to the set register. D(n) will send an interrupt to
slave processor n.  When a slave reads the register, a vertical slice
is read, with bits set for every processor that has posted it an
interrupt.  Slave 0 would set D(0), slave 1 set D(1) etc.  As sending
an interrupt to oneself has no purpose, the otherwise redundant
diagonal bits are used to store the interrupt bits written by the
Master processor to the slaves.  When writing to the register, the
master sets the flags of M(S0) M(S1) M(S2) M(S3), one for each slave.
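
The bit assignments in the table can be generated from one rule: bit
D(n) refers to slave n, except that a slave's own bit stands for the
Master.  The Python sketch below applies that rule to reproduce the
rows of the table.  The function names are hypothetical and the code
is a model of the table, not of the hardware.

```python
MASTER = "M"


def label(who) -> str:
    """Format a processor as it appears in the table: M, S0, S1, ..."""
    return "M" if who == MASTER else f"S{who}"


def set_bit_target(sender, bit: int):
    """Who is interrupted when `sender` writes a 1 to D(bit) of a set register."""
    if sender != MASTER and bit == sender:
        return MASTER   # a slave's own (diagonal) bit signals the Master
    return bit          # otherwise D(n) interrupts slave n


def status_bit_source(reader, bit: int):
    """Who posted the interrupt when `reader` sees D(bit) set in a status register."""
    if reader != MASTER and bit == reader:
        return MASTER   # the diagonal bit holds the Master's interrupt
    return bit


def set_row(sender) -> list[str]:
    return [f"{label(sender)}>{label(set_bit_target(sender, b))}" for b in range(4)]


def status_row(reader) -> list[str]:
    return [f"{label(status_bit_source(reader, b))}>{label(reader)}" for b in range(4)]


for who in (MASTER, 0, 1, 2, 3):
    print(label(who), set_row(who), status_row(who))
```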

FIQ_readback: D(0:3) & IRQ_readback: D(0:3)

It is possible for a processor to examine whether an interrupt has
been cleared by the recipient by reading the readback registers.  They
return the bitfield in the same format as the set registers.  Because
continuous polling of a register is bus-inefficient, it is expected
that an acknowledge interrupt will be returned to the sender after the
interrupt is serviced.

Slave to Master interrupts are performed by writing to the 'redundant'
bit that corresponds to the slave itself.  When a slave interrupts the
master it sets its flag bit in the 4 bit IRQ or FIQ set register.

Once an interrupt bit is set, it can only be cleared by the recipient
of the interrupt or by a system reset.  In case interrupts are sent to
a processor that is not fitted or running, the ForceFIQ_clr and
ForceIRQ_clr registers allow the master to clear all interrupts
destined for a particular slave.  D(0) clears all interrupts to
slave0, D(1) to slave1 etc.
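
The ownership rule for clearing interrupts can be sketched as a small
Python model.  It tracks pending interrupts per slave, lets only the
recipient clear them, and lets only the Master use a force clear with
D(n) selecting slave n.  The class, method names and the use of
PermissionError are hypothetical; in the hardware the restriction is
enforced by the register interlocks.

```python
MASTER = "M"


class InterruptModel:
    """Pending interrupts for four slaves, keyed by recipient."""

    def __init__(self) -> None:
        self.pending = {slave: set() for slave in range(4)}

    def post(self, sender, recipient: int) -> None:
        self.pending[recipient].add(sender)

    def clear(self, actor, recipient: int, sender) -> None:
        # Only the recipient may clear an interrupt posted to it.
        if actor != recipient:
            raise PermissionError("only the recipient may clear its interrupts")
        self.pending[recipient].discard(sender)

    def force_clear(self, actor, mask: int) -> None:
        # Only the Master may force clear; D(n) clears everything for slave n.
        if actor != MASTER:
            raise PermissionError("force clear is Master only")
        for slave in range(4):
            if mask & (1 << slave):
                self.pending[slave].clear()


model = InterruptModel()
model.post(0, 3)         # slave 0 interrupts slave 3, which is not fitted
model.post(MASTER, 3)    # so does the Master
model.force_clear(MASTER, 0b1000)
print(model.pending[3])
```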


Simtec Hydra Multiprocessor hardware overview				Iss B 17th May 1996