
Thursday 14 October 2021

Code the Specification: 6303 Assembler in 2000 Lines of Tcl Using Regular Expressions


I have been interfacing to the Psion Organiser 2 recently, firstly using a Raspberry Pi RP2040 to replace a datapack and provide SD card storage of pack images, and also in the top slot of the organiser, which provides a way to add hardware to it.


When adding hardware in this way, the organiser allows the code that drives the hardware to be held on the hardware as an embedded datapack. This means that you never lose the drivers for a piece of hardware, as they are built in to the device itself. I made a prototyping board using a Raspberry Pi Pico running modified datapack code from the RP2040 project. The drivers for the hardware are usually written in assembly code, as direct access to the hardware is required.

Datapacks that drive hardware usually add commands to the main menu of the organiser, and I wanted to do that for my prototyping hardware. This requires building a datapack image containing the command code plus the extra information that datapacks contain that enables the organiser to load the code.

Psion supplied an assembler in the 80s that was able to perform the assembly of code together with the extra features that provide the datapack image. This is still available but only runs under DOS, so must be run in some kind of emulator, or on a native DOS machine. I don't have a DOS machine and although I do have a DOS emulator and have run this assembler, I really wanted a version of the assembler that runs on Linux. There are a few 6303 assemblers about, but I couldn't find one that did all of the special organiser things, so I decided to write one.

An assembler is quite simple in that it takes assembly language instructions from a text file and builds binary object code. The datasheet for a processor usually provides a list of the instructions and the corresponding object code, so an assembler lends itself well to a method I call 'Code the Spec'.

The idea of coding the specification is to take some core information directly from the specification, copy it (ideally by cut and paste) and build a data structure that the code uses to do whatever it needs to do. In this case the assembly language instructions are copied from the processor datasheet and form an array of data that the code uses to parse input files.

The 6303 datasheet I found lists the instructions in a table. The assembly instructions are in the second column and the machine code is in a column per addressing mode. Addressing modes pose an extra problem, as you have to recognise which addressing mode the instruction is using in order to determine the machine code that must be generated. I decided to use regular expressions to parse the assembly instructions, as they are compact and powerful and can handle this parsing. The corresponding table in the code for the first few instructions is here:


    {ADCA ____ 89   99   A9   B9   ____ ____ ____}
    {ADCB ____ C9   D9   E9   F9   ____ ____ ____}
    {ADDA ____ 8B   9B   AB   BB   ____ ____ ____}
    {ADDB ____ CB   DB   EB   FB   ____ ____ ____}
    {ADDD ____ C3.3 D3   E3   F3   ____ ____ ____}
    {AIM  ____ ____ ____ ____ ____ ____ 71   61  }

Well, most of the instructions are there. As the datasheet is a scan I couldn't cut and paste from it, so I used an alphabetical list of 6303 instructions to create the array. Taking the ADCB instruction as an example:

    {ADCB ____ C9   D9   E9   F9   ____ ____ ____}

the instruction to be matched is shown as ADCB, and the opcode for each addressing mode is in a column, in the same order as the datasheet. You can see that a cut and paste, if you could do it, would quickly create this table. The instruction length is set as a default for each addressing mode, as the mode usually fixes the length. There are always some exceptions: the ADDD instruction has an instruction length override, using C3.3 to specify a length of 3 bytes for this instruction.

The datasheet only shows the addressing modes for the instructions it is describing, so the array has more columns than the datasheet as it has all possible addressing modes.
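As a rough illustration of how such a table can be turned into a lookup structure, here is a minimal sketch in Python (not the Tcl the assembler itself is written in). It assumes the eight columns follow the same order as the addressing-mode list shown further down (REL, IMM, DIR, IDX, EXT, IMP, XIM, XXM) and that a `.N` suffix overrides the default length; the names `parse_table` and `OPCODES` are made up for the example.

```python
# Assumed column order and default lengths, taken from the ADDMODE list in this post.
MODES = ["REL", "IMM", "DIR", "IDX", "EXT", "IMP", "XIM", "XXM"]
DEFAULT_LEN = {"REL": 2, "IMM": 2, "DIR": 2, "IDX": 2,
               "EXT": 3, "IMP": 1, "XIM": 3, "XXM": 3}

TABLE = """
ADCA ____ 89   99   A9   B9   ____ ____ ____
ADCB ____ C9   D9   E9   F9   ____ ____ ____
ADDA ____ 8B   9B   AB   BB   ____ ____ ____
ADDB ____ CB   DB   EB   FB   ____ ____ ____
ADDD ____ C3.3 D3   E3   F3   ____ ____ ____
AIM  ____ ____ ____ ____ ____ ____ 71   61
"""

def parse_table(text):
    """Build {mnemonic: {mode: (opcode, length)}} from the pasted rows."""
    table = {}
    for line in text.strip().splitlines():
        mnemonic, *cells = line.split()
        entry = {}
        for mode, cell in zip(MODES, cells):
            if cell.strip("_") == "":
                continue                      # ____ means the mode is not available
            opcode, _, length = cell.partition(".")
            entry[mode] = (int(opcode, 16),
                           int(length) if length else DEFAULT_LEN[mode])
        table[mnemonic] = entry
    return table

OPCODES = parse_table(TABLE)
```

With this, `OPCODES["ADCB"]["DIR"]` gives the opcode and default length for direct mode, and the ADDD immediate entry carries the overridden length of 3.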

Each addressing mode modifies the syntax of the instructions, so a list of regular expressions is used to both determine the addressing mode being used, and pull out the operands for each instruction.

These variables help with expression complexity:

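# Operand pattern: a symbol or number made up of these characters.
set ::RE_EXPR "\[A-Z0-9a-z_$^\]+"
# This second assignment overrides the first, so any non-empty text is accepted as an operand.
set ::RE_EXPR ".+"
# Immediate-mode pattern; %s is filled in with the instruction name.
set ::IMM_RE "^%s\[ \t\]+#\[ \t\]*(\[A-Z0-9a-z_$^\]+)"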


set ::ADDMODE {
{"REL" 2 "^%s\[ \t\]+($::RE_EXPR)"}

{"IMM" 2 "^%s\[ \t\]+#\[ \t\]*($::RE_EXPR)"}

{"DIR" 2 "^%s\[ \t\]+($::RE_EXPR)"}

{"IDX" 2 "^%s\[ \t\]+($::RE_EXPR)\[ \t\]*,\[ \t\]*(\[Xx\])"}

{"EXT" 3 "^%s\[ \t\]+($::RE_EXPR)"}

{"IMP" 1 "^%s\[ \t\]*$"}

{"XIM" 3 "^%s\[ \t\]+#\[ \t\]*($::RE_EXPR)\[ \t\]*,\[ \t\]*($::RE_EXPR)" }
{"XXM" 3 "^%s\[ \t\]+#\[ \t\]*($::RE_EXPR)\[ \t\]*,\[ \t\]*($::RE_EXPR)\[ \t\]*,\[ \t\]*\[Xx\]"}
}

Each addressing mode has the name of the mode, the default instruction length for instructions of that addressing mode, and the regular expression that is used both to parse the operands and to decide whether an instruction is using that addressing mode.

The assembler can test each assembly instruction against a regular expression made up of the instruction name and then each addressing mode in turn. When it gets a match it can build the object code from the table entries.
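The matching step can be sketched as follows, again in Python and not the actual Tcl. This is an illustration under assumptions: the patterns are rewritten in Python syntax, they are matched against the whole line, and they use the narrower operand character class so that one mode cannot swallow another mode's syntax. The real assembler may resolve overlaps differently. `candidates` and the cut-down `OPCODES` table are made-up names.

```python
import re

EXPR = r"[A-Za-z0-9_$^]+"   # narrower operand class from the post
WS = r"[ \t]"

# (mode name, operand pattern that follows the mnemonic)
ADDMODE = [
    ("REL", rf"{WS}+({EXPR})"),
    ("IMM", rf"{WS}+#{WS}*({EXPR})"),
    ("DIR", rf"{WS}+({EXPR})"),
    ("IDX", rf"{WS}+({EXPR}){WS}*,{WS}*[Xx]"),
    ("EXT", rf"{WS}+({EXPR})"),
    ("IMP", rf"{WS}*"),
    ("XIM", rf"{WS}+#{WS}*({EXPR}){WS}*,{WS}*({EXPR})"),
    ("XXM", rf"{WS}+#{WS}*({EXPR}){WS}*,{WS}*({EXPR}){WS}*,{WS}*[Xx]"),
]

# Cut-down opcode table: only the modes an instruction supports are present.
OPCODES = {
    "ADCB": {"IMM": 0xC9, "DIR": 0xD9, "IDX": 0xE9, "EXT": 0xF9},
    "AIM": {"XIM": 0x71, "XXM": 0x61},
}

def candidates(line):
    """Return [(mode, opcode, operands)] for every supported mode that matches."""
    found = []
    text = line.strip()
    for mnemonic, modes in OPCODES.items():
        for mode, operand_re in ADDMODE:
            if mode not in modes:
                continue          # the table says this mode does not exist
            m = re.fullmatch(re.escape(mnemonic) + operand_re, text, re.IGNORECASE)
            if m:
                found.append((mode, modes[mode], m.groups()))
    return found
```

Note that `ADCB flag` comes back as both DIR and EXT, because the two modes share the same syntax. Choosing between them needs the value of the operand, not just the text.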

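The assembler is a multi-pass assembler, as the addressing mode of an instruction can depend on the value of its operand (direct or extended, for example), and that value may not be known until all the labels have been seen.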

There are some complications, such as macros and code overlays. There are also various directives to define byte, word and string data, and the concept of 'pack address', which is the address that a byte is located at in the 'EPROM' of a datapack and is different to the address defined using ORG directives. The Psion organiser also requires relocatable code, which is done using a 'fixup' list at the end of the object code.

The assembler supports all of these features in just under 2000 lines, and manages to assemble Psion example code with almost exactly the same object code. I say almost exactly as I found that the Psion assembler example I was using used a less efficient addressing mode for one of the instructions, but only in one occurrence of it. My assembler generated a more efficient equivalent instruction and so was one byte shorter than the Psion version.

The Psion assembler list output (this is from the XDICT.SRC example):

 165   20B4 97 E2                           sta    a,flag:

 372   21F2 B7 00E2                         sta    a,flag 

My assembler outputs the first form once it knows that the data is zero-page. The colon at the end of the label may be a reason for the difference, but I haven't found documentation of that syntax.
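The choice between the two encodings can be sketched like this in Python (not the assembler's Tcl). The opcodes 97 and B7 are the ones in the listing above; the `encode` function and its `prefer_direct` flag are made up for the example. A value of `None` stands for a label that has not been resolved yet, in which case the sketch reserves the longer form.

```python
STA_A = {"DIR": 0x97, "EXT": 0xB7}   # opcodes as they appear in the listing

def encode(opcodes, value, prefer_direct=True):
    """Encode a direct/extended instruction; value is None while unresolved."""
    if (prefer_direct and value is not None
            and 0 <= value <= 0xFF and "DIR" in opcodes):
        return bytes([opcodes["DIR"], value])        # 2 bytes: zero-page form
    value = value or 0                               # placeholder until resolved
    return bytes([opcodes["EXT"], (value >> 8) & 0xFF, value & 0xFF])

short_form = encode(STA_A, 0xE2)                      # like listing line 165
long_form = encode(STA_A, 0xE2, prefer_direct=False)  # like listing line 372
```

Because switching an instruction from three bytes to two moves everything after it, an assembler that does this typically needs to repeat its passes until the addresses stop changing.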

The assembler can also embed the object code in a C program as an array, using embedded comments to mark where the data should appear. This is useful as the code needs to be in a C file where it is compiled and flashed to a Raspberry Pi Pico.
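A minimal sketch of that kind of embedding, in Python, is below. The marker comments `/* OBJ-START */` and `/* OBJ-END */` and the function name `embed` are invented for this example; the markers the assembler really uses are not shown in this post.

```python
import re

def embed(c_source, data, start="/* OBJ-START */", end="/* OBJ-END */"):
    """Replace whatever sits between the two marker comments with the bytes."""
    body = ", ".join(f"0x{b:02X}" for b in data)
    pattern = re.escape(start) + r".*?" + re.escape(end)
    replacement = f"{start}\n  {body}\n{end}"
    # A function is used as the replacement so backslashes are never interpreted.
    return re.sub(pattern, lambda m: replacement, c_source,
                  count=1, flags=re.DOTALL)

source = "const unsigned char pack[] = {\n/* OBJ-START */\n/* OBJ-END */\n};\n"
result = embed(source, bytes([0x97, 0xE2]))
```

Keeping the markers in the output means the same C file can be updated again after each reassembly.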

This example of 'Code the Spec' didn't use exactly the same syntax as the datasheet. For another example, have a look at the Z80 Arduino Shield:

 
 
https://trochilidae.blogspot.com/2019/12/z80-arduino-using-mega-as-debugger-ever.html
 
The code is on GitHub:
 
https://github.com/blackjetrock/z80_shield/blob/master/software/disasm/disasm.c  
 
and the disassembler uses a table of this form:
 
"00rrr110 nn :LD r, n:",
"01rrr110 :LD r, (HL):",
"11y11101 01rrr110 dd :LD r, (y+d):",
"01110rrr :LD (HL), r:",
"11y11101 01110rrr dd :LD (y+d), r:",
"01rrrsss :LD r,s:",

If you look in the Z80 programming databook you will see that the instructions are defined in exactly this form, and the table was copied from those definitions.
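To show how a table in this form can drive a disassembler, here is a small Python sketch. It is not the code in disasm.c, which is written in C and may work differently. It assumes that each eight-character token is a bit pattern (most significant bit first), that two-character tokens such as `nn` and `dd` stand for a whole operand byte, and that the first matching entry wins. The register names use the standard Z80 encoding.

```python
import re

TABLE = [
    "00rrr110 nn :LD r, n:",
    "01rrr110 :LD r, (HL):",
    "11y11101 01rrr110 dd :LD r, (y+d):",
    "01110rrr :LD (HL), r:",
    "11y11101 01110rrr dd :LD (y+d), r:",
    "01rrrsss :LD r,s:",
]

REG8 = ["B", "C", "D", "E", "H", "L", "(HL)", "A"]  # standard Z80 register codes
INDEX = ["IX", "IY"]                                 # DD prefix = IX, FD prefix = IY

def match_entry(entry, code):
    """Return (text, length) if the bytes in code fit the entry, else None."""
    pattern, text, _ = entry.split(":")
    tokens = pattern.split()
    if len(code) < len(tokens):
        return None
    fields = {}
    for token, byte in zip(tokens, code):
        if len(token) == 2:                 # "nn" or "dd": whole operand byte
            fields[token[0]] = byte
            continue
        for i, ch in enumerate(token):      # bit pattern, MSB first
            bit = (byte >> (7 - i)) & 1
            if ch in "01":
                if bit != int(ch):
                    return None             # a fixed bit does not match
            else:
                fields[ch] = (fields.get(ch, 0) << 1) | bit
    names = {}
    for ch, value in fields.items():
        if ch in "rs":
            names[ch] = REG8[value]
        elif ch == "y":
            names[ch] = INDEX[value]
        else:
            names[ch] = f"{value:02X}h"
    # Substitute every lower-case field letter in a single pass.
    decoded = re.sub(r"[a-z]", lambda m: names.get(m.group(), m.group()), text)
    return decoded, len(tokens)

def disassemble_one(code):
    """Try the entries in table order and return the first match."""
    for entry in TABLE:
        hit = match_entry(entry, code)
        if hit:
            return hit
    return None
```

Table order matters here: the general `01rrrsss` pattern is last, so the more specific `(HL)` forms are tried before it.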

 
