Saturday 31 July 2021

 Transputer: Stack Based With OS in Hardware

Picoputer: RP Pico Hardware Transputer

 

The transputer was and still is an odd beast. It has hardware support for processes (hence OS in hardware, well, sort of), and its assembly language is such a pain that Occam is a much better way to program it. Occam is a language that is close to the hardware and allows parallel processing to be built in at a basic level. Transputers have four fast (for the time) 10Mbps (later 20Mbps) links that are used to communicate with other devices and systems. As the processors improved in capabilities and speed, the links remained compatible.

The later transputers have floating point in hardware, which makes them useful for computationally intense work, especially when configured in networks.

The only real downside is that the chips were expensive, so they never really made it into common usage in embedded applications. Quite a few parallel processing systems using transputers were made, though.

I read all about transputers when they came out but never had a chance to use any. Recently, though, I was looking at vintage processors and transputers came up, which triggered some memories. Unfortunately, the downside of high cost seems to still exist, and vintage transputers are quite expensive. Not as expensive as when they were new, but costly enough that I didn't just buy a few. 

If you want to run some meaningful code on a transputer then you also need to add some RAM, as the chip itself only comes with about 2K onboard. A common 'unit of computing' using a transputer is the TRAM (TRAnsputer Module) which is a transputer plus some RAM. These are very expensive to buy, to the extent that creating a system with more than one processor in it is just not economically sensible.

Raspberry Pi Pico

At around the same time, the Raspberry Pi Pico came to my attention. This is a modern micro-controller board that uses the RP2040 device, which is interesting for me as it has programmable I/O (PIO) hardware: two PIO blocks, each with four state machines, that act as small intelligent GPIO processors. I think that these are very useful for interfacing to old hardware buses, such as the FX702P display bus I sniffed with a Blue Pill, or the FX502P external interface bus. When I implemented these projects I used firmware to interface to the bus, which was just about possible using the Blue Pill as it has a high clock rate relative to the bus. Interrupts were necessary in the case of the FX502P bus. The RP2040, though, has programmable hardware that can operate at frequencies of tens to a few hundreds of MHz. This, hopefully, should make it possible to interface to some devices that have higher clock rates.


 

While I was looking at the RP2040, it suddenly occurred to me that the four links on a transputer could be implemented using the eight PIO state machines on an RP2040. Each state machine handles data in one direction, leaving the processor(s) free for other work. What other work? Well, how about running an emulator of a transputer on the core? That would give you a hardware emulation of a transputer. How fast would it be? Well, the original transputers were running at about 20MHz, and the Pico runs at up to 133MHz. So it probably wouldn't run at the same speed as an original, but it would only be about an order of magnitude slower, maybe. And you can, of course, just add more transputers (real or emulated) to speed things up...

The links that the Pico provides can easily run at the standard 10MHz link speed (10Mbps) and running at the faster 20MHz shouldn't be a problem either. In fact, if only emulated transputers are talking then a faster link rate could maybe be used.

Host Communication

The transputer links can't be attached directly to a modern PC, but INMOS made some link adapter ICs (the IMSC011). These are fairly easy to buy, and provide two 8 bit data buses, one for the LinkIn direction and one for LinkOut. Adding one of these to an Arduino would give a way to interface a transputer to a PC.

 

As these are devices that run off 5V I decided to use an Arduino Mega Embedded, partly because I had one. The parallel buses can be wired up to the Mega, together with the Valid and Ack signals. These are used to signal that the data is valid (when Valid is active) and also allow data to be acknowledged (by Ack). The Arduino can then do whatever is needed with the data. I decided to send the data over USB to a host PC as that is the arrangement that the transputer originally used. The host PC then runs a server that handles the 'SP Protocol' which allows input and output on a terminal and keyboard and also allows access to files in the file system.

Booting

A transputer can be booted either from ROM or from a link. I didn't want to boot from ROM, although a program can easily be stored in flash and executed at startup. It's more flexible to boot from a link as the host can then supply the code, which can be compiled Occam, or C, or any of the other languages that can generate transputer object code. I'm particularly interested in Occam.

Booting from a link is built in to hardware and involves sending a small (up to 255 bytes) bootstrap loader. This then executes and loads further data (the boot loading phase). That boot-loader then loads more chunks of code over the link.

For the host, this is all rather simple: all it does is send the boot file to the transputer link. The format of the data is set up to drive the three stage boot process.
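
As a rough illustration of the host side, here is a minimal Python sketch. It is not the code used in this project. It assumes that the boot file begins with a single length byte followed by that many bytes of first-stage bootstrap (on a real transputer a first byte of 0 or 1 means poke or peek rather than boot), and that the link is any object with a write() method, such as an open serial port. The function names and the chunk size are made up for the example.

```python
def split_first_stage(boot_image: bytes):
    """Return (first_stage_bootstrap, remainder) from a boot image.

    Assumption: byte 0 is the bootstrap length (2..255) and the
    bootstrap code follows immediately after it.
    """
    if not boot_image:
        raise ValueError("empty boot image")
    length = boot_image[0]
    if length < 2:
        raise ValueError("first byte is a peek/poke command, not a boot length")
    bootstrap = boot_image[1:1 + length]
    if len(bootstrap) != length:
        raise ValueError("boot image is shorter than its bootstrap length")
    return bootstrap, boot_image[1 + length:]


def send_boot_image(link, boot_image: bytes, chunk_size: int = 64) -> int:
    """Write the whole boot image to the link unchanged, in small chunks.

    The host does not interpret the later stages; the loaders running
    on the transputer consume them as they arrive.
    """
    sent = 0
    for offset in range(0, len(boot_image), chunk_size):
        chunk = boot_image[offset:offset + chunk_size]
        link.write(chunk)
        sent += len(chunk)
    return sent
```

The split function is only there to show how the first stage is delimited. The host itself never needs to split the file, which is why sending the file as-is is enough.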


Using PIOs As Transputer Links

The transputer links use a protocol that is very similar to asynchronous serial data. You can view data packets as having a start bit, a type bit and eight data bits followed by a stop bit. An ACK packet follows much the same format, except the data bits are missing. The type bit is 1 in a data packet and 0 in an ACK packet. I started with the serial UART PIO code in the Pico examples and adjusted it to use the transputer protocol. I have a bit of work to do concerning the ACK packet, as I treat the ACK packet as a 10 bit frame at the moment, just with trailing zeros. This could possibly lead to problems if serial data is sent within 7 bit times of an ACK packet, but is working OK for now.
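
The framing described above can be sketched in a few lines of Python. This is a model of the bit patterns only, not the PIO code. It assumes the start bit is a 1, the stop bit is a 0, the line idles low, and the data bits go out least significant bit first. Those details are my understanding of the INMOS link format rather than something taken from the PIO program, and the function names are made up.

```python
def data_packet_bits(byte: int) -> list:
    """Bits of a data packet: start (1), type (1), 8 data bits, stop (0)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("data must fit in one byte")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # assumed LSB first
    return [1, 1] + data_bits + [0]


# ACK packet as described above: start (1), type (0), stop (0), no data bits.
ACK_BITS = [1, 0, 0]


def decode_packet(bits):
    """Classify a received frame as ('ack', None) or ('data', value)."""
    bits = list(bits)
    if bits[:2] == [1, 0]:
        return ("ack", None)
    if len(bits) == 11 and bits[:2] == [1, 1] and bits[10] == 0:
        value = sum(bit << i for i, bit in enumerate(bits[2:10]))
        return ("data", value)
    raise ValueError("not a valid link packet")
```

Because the stop bit and the idle line are both low in this model, an ACK padded with trailing zeros looks the same on the wire as a short ACK followed by idle time, which is why treating it as a longer frame works until real data follows too closely.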

I used one PIO for LinkOut and one for LinkIn, and for the prototype I generate a 5MHz clock with a PIO for the IMSC011 ClockIn pin. 
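
For the clock, the arithmetic is simple enough to check by hand. The sketch below assumes a PIO program that toggles the pin in a two-instruction loop (one instruction high, one low), a 125MHz system clock, and the RP2040's fractional clock divider with 8 fractional bits. The loop length and the function name are assumptions for illustration, not a description of my actual PIO program.

```python
def pio_clkdiv(sys_hz: float, out_hz: float, cycles_per_period: int = 2):
    """Return (divider, actual_output_hz) for a PIO-generated square wave.

    The state machine must run at out_hz * cycles_per_period, so the
    divider is the system clock divided by that rate, rounded to the
    nearest 1/256 step of the fractional divider.
    """
    ideal = sys_hz / (out_hz * cycles_per_period)
    if not 1 <= ideal < 65536:
        raise ValueError("divider out of range for the PIO clock divider")
    divider = round(ideal * 256) / 256
    return divider, sys_hz / (divider * cycles_per_period)


# 125MHz system clock, 5MHz output, two PIO cycles per output period.
divider, actual_hz = pio_clkdiv(125_000_000, 5_000_000)
```

With these numbers the divider comes out at 12.5, which the fractional divider can represent exactly, although a non-integer divider does add some jitter to the output edges.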


 Once fired up and wired up this PIO code was capable of driving the IMSC011 and successfully sending and receiving data.

Host Code

The host code will eventually run an SP protocol which will give the full range of IO and file access. For a first pass, just a simple display of data coming in to the host was implemented, as a test of the links. The Arduino Mega sends the link data over a simple (and inefficient) protocol over USB to the host. Using this setup I was able to run a simple hello world program I found on the internet and have the text appear on the host once the Picoputer was booted.


I toyed with the idea of compiling the INMOS server tools

Real Hardware

Just for interest, I also ran this prototype set up using a real transputer that I managed to source. It's actually a motherboard for a larger system, but it has a transputer on it. I applied power and then reverse engineered some signals (most importantly the BootFromROM signal had to be de-asserted). Once this was done, the host code booted the hello world program which then ran and resulted in the 'Hello World' display.



Running Occam on the Pico

While running precompiled binaries is fine for a test, what I'd like to do is run Occam on the Pico. This turns out to be tricky as I can't find a compiler that runs on Linux and, most importantly, generates transputer object code. There's the KROC and the SPOC compilers, but they generate machine code for other processors. About the only option seems to be the original INMOS compilers. They, however, run on older operating systems, DOS being the one for the PC hardware platform. 

I have found a useful VM image on the geekdot website which allows me to run one of the INMOS compilers, so I can actually compile Occam, and then link and collect it down to transputer machine code. At the moment I'm copying the files on and off the VM using a virtual floppy disk image file. This is not hugely convenient; hopefully I will be able to compile on a transputer, or maybe recompile the compiler for Linux.

I found a simple Occam program, which looks like this:

#INCLUDE "hostio.inc"  -- contains SP protocol
PROC simple (CHAN OF SP fs, ts)
  #USE "hostio.lib"
  [1000]BYTE buffer :
  BYTE result:
  INT length:
  SEQ
    so.write.string    (fs, ts,
                            "Please type your name :")
    so.read.echo.line  (fs, ts, length, buffer, result)
    so.write.nl        (fs, ts)
    so.write.string    (fs, ts, "Hello ")
    so.write.string.nl (fs, ts,
                             [buffer FROM 0 FOR length])
    so.exit            (fs, ts, sps.success)
:


This uses the 'SP Protocol' to perform input and output using the host system. After some fiddling (porting a third emulator and writing some SP protocol functions and various bug fixes), the host system displays this:

 

Port name:/dev/ttyUSB0
Bootfile:SIMPLE.BTL
Serial port OK
Sending boot file
Boot file sent
Please type your name :AndrewwHello Andrew

The key line here is the last one. The Occam program prompted for my name as it should, then I typed my name in and it displayed the result. OK, the newline is a 'w' and it took a few seconds to run, but it ran. The '.btl' file (BooT Link, or object file) for this program is 3935 bytes long, so a sizeable chunk of code that was loaded using the three stage bootloader mechanism. No bad opcodes, either.

The whole arrangement is not optimised for speed at all; it is optimised to get it working, so running this program does take a while. With some changes I should be able to get a binary transfer of code running, and hopefully that will be faster. The emulator could be sped up a little, perhaps, but when single stepping it doesn't seem to be particularly inefficient.

The RP2040 is dual core, which means that I could run an emulator per core and have two transputers on the one board. Due to some excellent design, the transputer link architecture makes no distinction between hardware links and internal communication links, so the code has no idea what it is dealing with. This should make it easy to set up communication between the two cores over links.
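
To illustrate why that uniformity helps, here is a small Python sketch in which the same worker code runs unchanged over two different kinds of channel. It is only a model of the idea. The class names, the use of a queue as a stand-in for a hardware FIFO between cores, and the worker itself are all made up for the example and are not taken from the emulator.

```python
import queue


class InternalChannel:
    """Channel between processes on one emulated transputer (memory copy)."""

    def __init__(self):
        self._buffer = bytearray()

    def send(self, data: bytes) -> None:
        self._buffer.extend(data)

    def receive(self, count: int) -> bytes:
        if len(self._buffer) < count:
            raise RuntimeError("sender has not supplied enough data yet")
        data = bytes(self._buffer[:count])
        del self._buffer[:count]
        return data


class LinkChannel:
    """Channel carried byte by byte over a FIFO (stand-in for a link)."""

    def __init__(self, fifo: queue.Queue):
        self._fifo = fifo

    def send(self, data: bytes) -> None:
        for value in data:
            self._fifo.put(value)

    def receive(self, count: int) -> bytes:
        return bytes(self._fifo.get() for _ in range(count))


def upper_case_worker(chan_in, chan_out, count: int) -> None:
    """Worker that only sees send/receive, never the transport behind them."""
    chan_out.send(chan_in.receive(count).upper())
```

The worker takes whichever channel objects it is given, so moving a process from the same core to the other core would only change how the channels are constructed, not the process code.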

But, it works!

 



20 comments:

Tom Gardner said...

There is a modern commercially available equivalent, from some of the same people now in XMOS (e.g. David May).

The xCORE processors are effectively transputers on steroids: up to 32 cores and 4000MIPS/chip, expandable, FPGA-like IO, guaranteed hard realtime.

The xC language is effectively Occam with a different syntax and hard realtime extensions.

I find them wonderfully easy to program; from a standing start I had my first application running within a day. Application was counting transitions on two 62.5Mb/s inputs and communicating results over USB to a PC. Makes realtime embedded fun again.

Unknown said...

Oh... that nicely fits in my little TRAM adapter project I just finished - still missing the software to make sense ;-)
http://www.geekdot.com/pitram-ahead/

amen said...

I saw your pages while I was doing this. If you are after a transputer emulator, there are several out there. I used one called T4, based on the Julian Highfield one.
If you're after an SP server then it's not too complicated to write your own. I was going to port an original but decided to write my own in the end. Not done, but enough to run some code, as you can see here.

amen said...

I've just realised that it was your SDK for VirtualBox that I am using for the compilation of Occam, so thank you for that! It's a bit fiddly getting files in and out, but that's because I'm using a virtual floppy.

Matsche said...

Hi Amen,

very cool project.
Could you try out my modification of iserver for 64bit-linux?
https://gitlab.com/tinkercnc/transputer-iserver-64bit
I'm curious if it works with your setup.
I think you have to use "iserver -sl /dev/ttyUSB0 ...".

And I'm very curious about your source. :)

Matsche

amen said...

Hi Matsche,
Ah, this is interesting, it looks like you have adjusted the original iserver sources to build on Linux. That is something I looked at briefly but didn't continue with. I can build the code apparently, so that is a huge step forward. Unfortunately I can't run the code with my Picoputer as the IMSC011 is on the other end of an Arduino Mega using my rather inefficient protocol. So it doesn't look like any of the well-known boards that iserver supports. I'm very interested in running this server though, as it is the original code so will support all of the SP protocol out of the box. The code will appear on my GitHub, I just need to knock it into shape.

amen said...

I have created a github repo with the picoputer code in it. It's here:

https://github.com/blackjetrock/picoputer

It's work in progress and messy.

Liam Proven said...

Fascinating project, and amazing work! Congratulations!

It may interest you to know that Helios, the original Transputer OS, is open-source now. I recently wrote about it.
https://www.theregister.com/2021/12/06/heliosng/

aclassifier said...

Thank you, this was really fun!

And equally (or more!) impressive!

I did work with transputers and occam, and it of course changed my life!

(SPoC generates beautiful C code, which you probably know. But your text wasn't 100% concise on it)

I have linked you up at https://www.teigfam.net/oyvind/home/technology/212-notes-from-the-vault-0x02/

Øyvind Teig
Trondheim
Norway

Roger Shepherd said...

Although a couple of people forwarded me your post, I've only just got round to reading it.

I'm interested in how you've implemented the links. I can't tell from your write-up how you are handling things at the message level? Clearly bit-to/from-byte is handled in the PIO, but what about the byte-to/from-memory? Does the PIO have enough DMA capability to handle this? If not, perhaps the second processor could be used for this?

Roger

amen said...

Hi Roger,

The PIO handles the serial data and shifting the bits into bytes as you mention. The PIOs have queues in hardware that the bytes (actually words, but that's not important here) can be pulled from or pushed into. The main emulator then collects or sends data through these queues.

The code is here:
https://github.com/blackjetrock/picoputer

if you want to have a look.

Andrew

Roger Shepherd said...

Andrew,

My reading of the code is that the link is only dealt with when server() is called, and server() is only called when rescheduling can occur. This would seem to make the links slow. Of course, I may have misunderstood.

Roger

amen said...

It's been a little while since I looked at the code but yes, that sounds correct. I think the emulator I used has copied the way the transputer operates with regard to the links. There's no handshaking on the links so all transfers have to be paced using the code running on the transputer. I think this means the emulator code is the same as the real transputers; in fact I think the emulator tries to be a cycle accurate emulation of the transputer hardware. From my understanding, the links aren't really bulk transport links like we have nowadays, they are more of a control and result transfer link, so are used more for message passing than bulk data transfer. For bulk data something like shared memory would be used, I suppose.
So for example, a Mandelbrot generation application would send messages to transputers to indicate which areas of the set to calculate, the remote nodes would crunch the data and send the results back. That's a very good fit to the architecture as there's a lot of processing but little data being sent back and forth. Probably why they used it as a demo.

Roger Shepherd said...

From my reading the emulator looks like an instruction level (rather than cycle level) emulation of the transputer CPU. It looks like it's been written with access to low-level transputer documentation (for example, the use of SNP (Start Next Process) as an abbreviation). There are some places where the emulator is simplified - the way block moves are implemented appears to be byte-at-a-time; the transputer had a mechanism which meant that the moves were performed word-at-a-time, even when the source and destination had different byte alignments.

In the transputer the 4 links transferred data autonomously. The CPU set up a transfer, the link performed it, and informed the CPU that the transfer had completed. To simplify, the CPU at one end of the link executed an 'in' instruction and the CPU told the link (actually the input side of the link) to input the specified number of bytes of data to the specified location; meanwhile the CPU at the other end of the link executed an 'out' instruction and told the link (the output side of the link) to transfer the specified number of bytes of data from the specified location. When both ends of the link had been instructed to transfer the data, they would transfer the data. Once the data transfer was complete each link signalled its CPU that the communicating process could be rescheduled. So, the link data transfers operated concurrently with the CPU executing instructions.

I'll make comments about "bulk data transfer" etc. in a separate comment.

Roger

Roger Shepherd said...

Part II:

I'm not exactly sure what you mean by "bulk transport links" but I'm guessing you mean something like PCIe? PCIe is a serial link and can be used to implement "remote memory".

"From my understanding, the links aren't really bulk transport links like we have nowadays, they are more of a control and result transfer link, so are used more for message passing than bulk data transfer. For bulk data something like shared memory would be used, I suppose."

It's possible that some people shared physical memory between transputers but that was not how they were designed to be used. The hardware and the programming model were based on multiple transputers with local memory, communicating by passing messages. Remember, the transputer was launched in 1984/85 and the world was very different then. There were shared memory computers around at that time but not available as microprocessors. However, it is interesting to look at some sort of broad comparison between the capabilities of the transputer and modern technologies. I'll consider the 20-MHz T414/T800 for these purposes - I'm really interested in orders of magnitude rather than small integer factors here.

The transputer was clocked at 20 MHz. Its 4k byte internal memory provided 80 Mbyte/s of memory bandwidth at 1 cycle latency; with a small external DRAM you'd get 26 Mbyte/s (3 cycle), and with a larger system 16 Mbyte/s (5 cycle).

For the sake of being concrete, I'll try using the Apple M1 SoC as the basis for comparison. Inside the M1 are two types of processors, Icestorm running at 2.1 GHz (100x T800) and Firestorm running at 3.2 GHz (150x). In terms of memory latency, the processors see 3/4 cycle latency to L1 cache (comparable to the transputer's internal memory) and 100 cycles to external DRAM. The external bandwidth of the M1 is about 60 Gbyte/s which is 2000-4000x the transputer's bandwidth - let's call 20x the transputer. A single Firestorm processor can saturate the memory bandwidth (hats off to the processor designers). Of course, there are 8 processors (and GPUs) which also share that bandwidth.

Now the 4 transputer links (8 wires) provide a total bandwidth of something like 5.6 Mbyte/s, or 33% of the bandwidth of a slow external memory system, 25% of a fast memory system. For a modern comparison, I've taken PCIe 5 which gives 4 GB/s bandwidth for a single lane (4 wires). So if we use 4 lanes to match the 4 transputer links, we get 16 GByte/s or 25% of the memory bandwidth of the M1.

I could go on at length about the fairness of this comparison etc. but to first order we're seeing 4 PCIe lanes providing the same proportion of memory bandwidth (25%) as the 4 transputer links. And the PCIe implementation uses twice the number of wires.

So, I'd conclude that - in context - the transputer links provide the same sort of bulk data transport as we see in today's processors. And they were used in that way. What they don't do is provide a memory model based on shared memory, which modern multiprocessors do. There are pros and cons to this, which I won't go into here, except to say that in some cases the performance of memory multiprocessor systems when doing parallel operations on top of shared memory is dreadful - orders of magnitude worse than you might expect.

Roger Shepherd said...

Part III:

Regarding Mandelbrot:

"So for example, a Mandelbrot generation application would send messages to transputers to indicate which areas of the set to calculate, the remote nodes would crunch the data and send the results back. That's a very good fit to the architecture as there's a lot of processing but little data being sent back and forth. Probably why they used it as a demo."

That's pretty much what it did. There was a coordinator, which handed out work to the workers and assembled the results, and I guess looked after the graphics. A modern version might do the same, although there would probably be the mess of a shared work queue, with the workers fighting for access to the shared structure.

Mandelbrot was great for demoing graphics and parallel processing. It was so hard that you really could see the speed up. It is also compute bound, so the costs of distributing the problem are small compared to the cost of solving the problem - it scaled well. Interestingly, the compute required per pixel varies, so it's not well suited to SIMD-like parallelisation. The problem distributed is tiny (X and Y) with just a colour having to be returned - but the problem is sufficiently hard that even if the workpacket had to be larger it would have scaled well. The other thing that made it suitable is that the code and data for the solution is tiny. I guess the whole thing would have fitted into the internal memory of a transputer, let alone a small (256k ?) external memory.

The issue of problem size is one where, perhaps, shared memory models have a benefit. In practice the limit of 3D rendering was the size of the world model - including textures - which, if you follow the Mandelbrot solution, has to be replicated in each node.

So, in conclusion, I think the big difference is the programming model rather than the performance of links v. memory.

aclassifier said...

Roger wrote: "I'll make comments about "bulk data transfer" etc. in a separate comment."

Please do! There might be a student out there who also might want to read it. Especially, how accurate could one time stamp data coming in? Like how would the transputer's time stamping mechanism work with this? Four simultaneous data streams on the four links, how accurately could they be time stamped to the same time? Is it the occam timer input that would be the basis for this, or is there anything at a lower level?

Øyvind

Unknown said...

Hi!

I've been adding transputer support to TinyCC

https://github.com/agentdavo/tinycc-transputer/commit/29dc25b4e13136db6bfcb7d76167858a3a384ad2

It seems like you are a transputer genius so if you wish to help do let me know.

Kind regards, David.

Roger Shepherd said...

Hi David, I'd be happy to help if I can.


As a start, looking at allocaTransputer.S, I see the lines

61: alloca:
ldl 0
ldl 4

rev
adc 15
ldc -16
and
cj p3

which looks wrong to me. Perhaps line 61 shouldn't be bank? Why use "ldl 0; ldl 4; rev" rather than "ldl 4; ldl 0"?

Roger