PA-RISC Architecture
Overview
PA-RISC is Hewlett Packard’s Reduced Instruction Set Computing (RISC) architecture developed in the 1980s and used until the mid-2000s in Unix and industrial HP computers. The computers covered on this site, the HP 9000, are based on the Precision Architecture and PA-RISC processors and used custom HP system designs.
There were three versions of the PA-RISC architecture:
Version | Bits | Instructions | Features | Processors | Years |
---|---|---|---|---|---|
PA-RISC 1.0 | 32-bit | 140 | Original | TS-1, NS-1, NS-2, PCX | 1986-1990 |
PA-RISC 1.1 | 32-bit | 190 | Low-cost | PA-7000, PA-7100, PA-7200, PA-7100LC, PA-7300LC | 1991-1996 |
PA-RISC 2.0 | 64-bit | | | PA-8000, PA-8200, PA-8500, PA-8600, PA-8700, PA-8800, PA-8900 | 1996-2005 |
Precision Architecture RISC
PA-RISC is Hewlett Packard’s Reduced Instruction Set Computing (RISC) architecture from the 1980s and an offspring from active HP research and development undertakings from that time. The aim of the Precision Architecture was to replace 16-bit stack-based CPUs in HP 3000 servers and Motorola 680x0 CPUs in HP’s Unix systems with a common system architecture.
The PA-RISC architecture and instruction set were built from the ground up by HP engineers.
PA-RISC was implemented almost exclusively in HP processors, from early versions in TTL and NMOS in the 1980s to more modern integrated 32-bit (PA-7x00) and 64-bit (PA-8x00) RISC processors.
Overall PA-RISC was a rather conservative RISC design for the 1980s, updated until the early 2000s.
HP described the Precision Architecture as the result of years of studying RISC:
- Reduced instruction set (simple, efficient instructions)
- Instruction set is implemented in hardware (hardwired) and not microcoded.
- Instruction size is fixed length and fixed format: one word (32-bit), which facilitates pipelining
- Only three addressing modes: long/short displacement and indexed.
- Load/store design: Only load and store operations access memory, computational instructions do not.
- Single cycle operation: Many simple and frequently used instructions execute in just one cycle, more complex computations are assigned to assist processors or software algorithms.
- Optimizing compilers
To identify the PA-RISC instruction set, HP scientists tested candidate instructions on a wide range of programs comprising billions of instructions (in the 1980s!); only instructions that could add value were selected.
HP added a few complementary features to increase flexibility and performance: extended addressing, co- and multi-processor support and memory-mapped I/O.
PA-RISC supported very wide addressing from the beginning and was designed as an SMP-capable architecture, with memory-mapped I/O simplifying the overall design.
Compared to other RISC architectures, original PA-RISC was rather simple and unspectacular: it had limited extra features but always remained at competitive speeds, especially in Floating Point and multiprocessing.
Later on, HP was the first to include multimedia extensions in commercially available microprocessors, MAX-1 in the PA-7100LC and the 64-bit MAX-2 in the PA-8000, which allowed vector operations on two or four 16-bit subwords in 32-bit or 64-bit integer registers. HP was also rather slow in bringing PA-RISC to 64-bit, which arrived with the PA-8000 processors, another conservative design.
Spectrum
During development, PA-RISC was called Spectrum at HP Labs in Palo Alto, where it was forged in laboratories based on experimental data to achieve a simple architecture.
A talk given by Joel Birnbaum laid this out in 1986.
Design objectives for Spectrum, and then PA-RISC, were: leadership in price/performance, migration path from current HP products, unified scalable architecture. Spectrum was planned to be scalable from single chips up to mainframe machines with constant compatibility for programs.
In the early 1980s, when HP Labs was developing PA-RISC (Spectrum), RISC had not been commercially successful yet, mainly because of incompatibility with software written for previous CISC architectures.
The advantages of RISC vs. CISC were hotly debated at the time, with HP doing fundamental research on performance factors.
This resulted, in HP’s words, in many innovations in PA-RISC, such as a path length comparable to conventional machines but superior throughput.
Spectrum had 140 instructions, a fraction of then-current CISC designs but more than contemporary RISC platforms, and really specified by the compiler people.
Most instructions were single-cycle with only some multi-cycle instructions required for languages such as COBOL and FORTRAN.
Important for the Precision Architecture were fast control and data paths without microcode, resulting in fewer transistors needed, and high-speed registers, with 32 general-purpose registers identified as optimal.
HP designed a few “beyond RISC” features into PA-RISC: very large address space (48-bit direct addressing for 64-bit virtual), split instruction and data caches, high-bandwidth internal buses, memory-mapped I/O for high-speed devices of the future, precision interrupts for real-time response and optimized I/O bus interfaces to enable easy bus converters.
As HP fabbed VLSI chips in-house at the time (and kept doing so until the 1990s), it touted expansion possibilities of the CPU through support chips such as FPUs, as well as SMP possibilities.
PA-RISC 1.0
PA-RISC 1.0 was the implementation of the HP Spectrum research undertaking on RISC and Precision Architecture. The PA-RISC 1.0 design was 32-bit with a single instruction/data bus and had 140 instructions, more than contemporary RISC designs. PA-RISC later on moved to a Harvard-style architecture with separate instruction and data buses.
PA-RISC 1.0 has thirty-two 32-bit integer general purpose registers (GR0-GR31), seven shadow registers (SHR0-SHR6) for fast interrupts and thirty-two 64-bit Floating Point registers for the FPU, which could also be combined to 64×32-bit and 16×128-bit. The FPU is able to execute a Floating Point instruction in parallel with the ALU.
Processors based on PA-RISC 1.0 were implemented in a multitude of fabrication technologies – TTL, NMOS, CMOS for the early PA-RISC CPUs.
Addressing in PA1.0 was 48-bit wide, later on expanded to 64-bit with the introduction of the PA-8000 line in PA-RISC 2.0.
PA-RISC 1.1
The PA-RISC architecture was extended to version 1.1 with the PA-7000 processor in 1991, now with 190 instructions. The major change in PA-RISC 1.1 was the inclusion of an MMU (memory management unit) that enabled PA-RISC computers to use virtual memory. Starting with the second PA-RISC 1.1 processor, the PA-7100, all PA processors implement superscalar instruction execution, the ability to execute multiple instructions simultaneously.
32-bit PA-RISC 1.1 processors are up to two-way superscalar, later 64-bit processors up to four-way. Other significant developments in PA1.1 include the PA-7100LC and PA-7300LC processors (LC for low cost), which integrated the memory and I/O controller onto the processor die; the PA-7300LC additionally integrated the cache controller and first-level cache.
PA-RISC 2.0
In 1996 the 64-bit redesign of PA-RISC was introduced with the PA-RISC 2.0 PA-8000 processor. The architectural changes were rather intrusive but stayed compatible with 32-bit PA-RISC 1.1. On a side note, the PA-RISC 2.0 and the PA-8000 were introduced before the last 32-bit PA-RISC processor — the PA-7300LC — shipped.
Main changes and features of PA-RISC 2.0 include:
- All registers and functional units extended to 64-bit
- Virtual address space extended to 64-bit
- Physical address space is 40-bit on PA-8000 to PA-8600 (for 1 TB of addressable physical memory) and 44-bit (16 TB memory) on PA-8700 and later
- Out-of-Order (OoO) execution capability with the IRB (Instruction Reorder Buffer), which stores up to 28 computation and 28 load/store instructions and reorders and prepares them for execution on the fly. It tracks interdependencies and branch prediction outcomes as well. The IRB is the key part in the OoO execution capability of PA-RISC 2.0.
- FPMAC (Floating Point Multiply Accumulate) units
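The physical memory sizes quoted in the list above follow directly from the address widths. A minimal arithmetic sketch, assuming byte addressing and binary units (1 TB taken as 2^40 bytes); the function name is made up for illustration:

```python
# Sizes implied by the physical address widths quoted above.
TB = 2**40  # bytes in one terabyte (binary units, an assumption of this sketch)

def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2**address_bits

for cpus, bits in [("PA-8000 to PA-8600", 40), ("PA-8700 and later", 44)]:
    print(f"{cpus}: {bits}-bit physical address -> {addressable_bytes(bits) // TB} TB")
```

Each additional address bit doubles the reachable memory, so the four extra bits on the PA-8700 and later give a factor of 16.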
The later PA-8x00 processors of the 2000s did not introduce significant changes to the architecture or logic, besides higher integration of large L1 caches in the PA-8600 and dual-core PA-8800 and PA-8900. The processors after the PA-8000 were mostly redesigns and extensions of that processor core.
Post-PA-RISC
From the mid-1990s, on a parallel track to PA-RISC 2.0 development, HP joined Intel in developing the VLIW Itanium architecture out of its own R&D projects, called EPIC, which resulted in the Intel/HP IA-64 architecture.
From the early 2000s, HP sold two lines of Unix computers and servers in parallel: PA-RISC 2.0 and Itanium. These competing designs were apparent in the Integrity servers, with the rp servers (PA-RISC) and rx servers (Itanium).
These post-PA-RISC designs were not the success many had hoped for, and after the turn of the century HP switched to standard Intel x86 fare.
Pre-PA-RISC
The predecessor of PA-RISC in the early 1980s was the HP FOCUS architecture from the HP 9000 Series 500. FOCUS was a stack architecture, with 230 instructions both 32 bits and 16 bits wide, a segmented memory model, and no general purpose programmer-visible registers. There are thirty-nine 32-bit registers in the CPU hardware, thirty-one internal 32-bit general purpose registers, two 32-bit ALU registers, and others.
Floating Point Unit (FPU)
The Floating Point Unit is an assist processor logically added to a system to improve the performance of floating-point operations. The processor can be on a separate chip (e.g., PA-7000) or integrated onto the central CPU die (all later PA-RISC CPUs). The FPU executes special floating point instructions to perform arithmetic on its own set of independent registers (register file) and to move data between its own registers and the system’s lower memory hierarchy. The FPU execution stage is pipelined. All PA-RISC FPUs contain thirty-two 64-bit registers, which can also be used as sixty-four 32-bit registers and sixteen 128-bit registers.
Memory and I/O Controller (MIOC)
The Memory and I/O Controller (MIOC) in the PA-7100LC and PA-7300LC processor integrates DRAM, cache and I/O controllers onto the processor die. MIOC is similar on both CPUs, with the PA-7300LC MIOC having wider data paths to L2 cache and RAM and supporting the advanced GSC+ bus over the older GSC on PA-7100LC.
The MIOC’s integrated memory controller requires only buffers and DRAM modules to build up a complete memory subsystem. The PA-7300LC MIOC memory controller includes a Second Level Cache Controller (SLC), which provides an optional L2 cache, ranging from 32 KB to 8 MB. It shares the data bus with the DRAM subsystem, so it has the same width and same optional SEDC error control.
- Execution units and internal caches attach on-chip to the MIOC
- External cache, L1 on PA-7100LC, L2 on PA-7300LC, attaches to MIOC via a 64-bit or 128-bit data path
- Memory attaches to MIOC via a 64-bit data path on PA-7100LC or a 128-bit data path on PA-7300LC
- GSC, the system main bus, attaches to MIOC
- Support for 4, 16, 64 and 256 Mbit modules, FPM and EDO DRAM at 3.3 or 5.0 V
- Up to 16 physical memory slots
- Support for a wide range of core frequencies
Translation Lookaside Buffer (TLB)
The Translation Lookaside Buffer is a hardware structure that performs virtual-to-physical memory address translation: it takes a virtual page number and returns the corresponding physical page number. The PA-7000 is the last PA-RISC processor with separate instruction and data TLBs; all later PA 1.1 and 2.0 CPUs use combined TLBs, while older PA-RISC 1.0 processors use huge TLBs (even by today’s standards):
CPU | TLB entries |
---|---|
PA-7000 | 96 I and 96 D |
PA-7100 | 120 |
PA-7100LC | 64 |
PA-7200 | 120 |
PA-7300LC | 96 |
PA-8000 | 96 |
PA-8200 (PCX-U+) | 120 |
PA-8500 (PCX-W) | 160 |
PA-8600 (PCX-W+) | 160 |
PA-8700 | 240 |
PA-8800 | 2×240 |
PA-8900 | 2×240 |
Hitachi PA/50 | 32 I and 64 D |
Hitachi HARP-1 | 128 I and 128 D |
TS-1 | 4096 I/D |
NS-1 | 4096 I/D |
NS-2 | 16384 I/D |
CMOS26B (PCX) | 8192 I/D |
Translation and miss handling in PA-RISC TLBs is as follows:
- PA 1.1: If a virtual address has to be translated to a physical address, the corresponding TLB is searched for an entry matching the Virtual Page number. If an entry is found, the 20-bit Physical Page number delivered by the TLB is concatenated with the original 12-bit page offset to build up the 32-bit absolute physical address.
- Hardware: If the CPU implementation provides a hardware TLB miss handler, it attempts to find the virtual-to-physical translation in the Page Table. If successful, the translation and protection fields are inserted in the TLB. If not successful, an interruption occurs so the software miss handler can complete the translation.
- Software: If software TLB miss handling is implemented, a TLB miss fault interruption routine performs the translation. It inserts the translation and protection fields in the TLB and afterward restarts the interrupted routine, in which the TLB miss occurred.
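The translation and miss-handling steps above can be sketched in a few lines. This is a simplified model, not HP's implementation: the TLB and the page table are plain dictionaries keyed by an integer virtual page number, protection fields and space identifiers are left out, and all names (`translate`, `software_miss_handler`, `hardware_walker`) are made up for illustration.

```python
PAGE_OFFSET_BITS = 12                       # low 12 bits are the page offset
PAGE_OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1
PPN_MASK = (1 << 20) - 1                    # 20-bit physical page number

def software_miss_handler(vpn, tlb, page_table):
    """Models the TLB miss interruption routine: look up and insert the entry."""
    if vpn not in page_table:
        raise LookupError(f"no translation for virtual page {vpn:#x}")
    tlb[vpn] = page_table[vpn]

def translate(vaddr, tlb, page_table, hardware_walker=True):
    """Return the 32-bit physical address for vaddr, filling the TLB on a miss."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & PAGE_OFFSET_MASK
    if vpn not in tlb:
        if hardware_walker and vpn in page_table:
            # Hardware miss handler found the translation in the page table.
            tlb[vpn] = page_table[vpn]
        else:
            # No hardware handler, or it failed: software completes the job.
            software_miss_handler(vpn, tlb, page_table)
    ppn = tlb[vpn] & PPN_MASK
    # Concatenate the physical page number with the unchanged page offset.
    return (ppn << PAGE_OFFSET_BITS) | offset

tlb = {}
page_table = {0x12345: 0xABCDE}
print(hex(translate(0x12345678, tlb, page_table)))
```

After the first call the translation sits in `tlb`, so a second access to the same page takes the hit path without consulting the page table.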
Block Translation Lookaside Buffer (BTLB)
Similar to the TLB, the BTLB provides virtual-to-physical address translations. The BTLB, however, maps large address ranges rather than single pages as the TLB does. These large address ranges are block translations and therefore stored in the Block Translation Lookaside Buffer. These block translations are useful for virtual address ranges that do not get paged in or out.
BTLBs were only implemented on 32-bit PA-RISC processors (PA-7x00). 64-bit PA-RISC instead implemented variable page sizes, so any TLB entry can map more than 4 KB.
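The difference between a page entry and a block entry can be illustrated with a small sketch. It assumes 4 KB pages and represents each block by a start page and a length in pages; the `BlockEntry` type and `btlb_lookup` function are hypothetical, and size or alignment restrictions that real hardware imposes are not modelled.

```python
from dataclasses import dataclass

PAGE_BITS = 12                      # 4 KB pages (assumption of this sketch)
OFFSET_MASK = (1 << PAGE_BITS) - 1

@dataclass(frozen=True)
class BlockEntry:
    virt_page: int   # first virtual page covered by the block
    phys_page: int   # first physical page the block maps to
    num_pages: int   # length of the block in pages

def btlb_lookup(vaddr, entries):
    """Return the physical address if a block covers vaddr, else None."""
    vpn = vaddr >> PAGE_BITS
    for e in entries:
        if e.virt_page <= vpn < e.virt_page + e.num_pages:
            # Same page distance from the block start on both sides.
            ppn = e.phys_page + (vpn - e.virt_page)
            return (ppn << PAGE_BITS) | (vaddr & OFFSET_MASK)
    return None

# One entry maps 256 pages (1 MB) that would otherwise need 256 TLB entries.
entries = [BlockEntry(virt_page=0x100, phys_page=0x800, num_pages=256)]
print(hex(btlb_lookup(0x00123456, entries)))
```

A single block entry covers a whole range that stays resident, which keeps those translations from competing for ordinary TLB entries.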
Superscalar PA-RISC
A superscalar processor implementation decodes, dispatches and executes multiple instructions per cycle if dependencies between the instructions permit. This is possible if the instruction stream contains independent instructions. Superscalarity can be gained from a decoupled floating point unit (FPU) which executes floating point operations independently from the integer ALU. More complicated variations allow for parallel load/store operations, integer calculations and so on, which need a more complex CPU design that analyzes the instructions/branches.
Every PA-RISC processor from the PA-7100 on implements superscalar execution. Instructions proceed together through the execution pipeline, which is called instruction bundling. The superscalar execution is functionally transparent to the software: the effects of any given instruction are the same whether it was executed as part of a bundle or alone. Bundling rules are applied at run-time by the hardware; optimal performance may only be gained by proper ordering of the instructions so the processor can use its full superscalar potential. Several kinds of restrictions are placed upon the instruction bundling in PA-RISC:
- Functional unit contention
- Data dependency restrictions
- Control flow restrictions
- Special instruction restrictions
For bundling purposes instructions are divided into classes:
Class | Description |
---|---|
FLOP | Floating point operation |
LDST | Loads and stores |
ALU | Integer ALU |
MM | Shifts, extracts, deposits |
NUL | Might nullify successor |
BV | Branch Vectored (BV) local, Branch (BE) external |
BR | Other branches |
FSYS | FTEST and FP status/exception |
SYS | System control instructions |
PA-7100 superscalar capabilities
The PA-7100 is two-way superscalar with one integer ALU and one FPU.
First instruction | Second instruction |
---|---|
ALU | + FLOP |
LDST | + FLOP |
FLOP | + ALU/LDST/Branch |
PA-7100LC/PA-7300LC superscalar capabilities
These are 2-way superscalar processor implementations with two integer ALUs and one FPU. Notably, only one of the two ALUs is capable of handling loads, stores and shifts.
First instruction | Second instruction |
---|---|
FLOP | + LDST/ALU/MM/NUL/BV/BR |
LDST | + FLOP/ALU/MM/NUL/BR |
ALU | + FLOP/LDST/ALU/MM/NUL/BR/FSYS |
MM | + FLOP/LDST/ALU/FSYS |
NUL | + FLOP |
SYS | Never bundled |
Besides these bundles, LDST + LDST bundles are also possible under certain circumstances. These are then called double word load/store. Several kinds of instructions cannot be bundled together because of inter-instruction data dependencies:
- An instruction that modifies a register will not be bundled with another instruction that takes this register as operand. Exception: a FLOP can be bundled with an FP store of the FLOP’s result register.
- An FP load to one word of a doubleword register will not be bundled with a FLOP that uses the other word of this doubleword register.
- A FLOP will not be bundled with an FP load if both instructions have the same target register.
- An instruction that could set the carry/borrow bits will not be bundled with an instruction that uses carry/borrow bits.
- An instruction which is in the delay slot of a branch is never bundled with other instructions.
- An instruction which is at an odd word address and executed as a target of a taken branch is never bundled.
- An instruction which might nullify its successor is never bundled with this successor, unless the successor is a FLOP instruction.
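As a rough illustration, the class table for the PA-7100LC/PA-7300LC and the first data dependency rule can be combined into a simple check. This is only a sketch: the `Instr` type and `can_bundle` function are invented here, register names are arbitrary strings, and the remaining restrictions (the FP store exception, carry/borrow bits, delay slots, odd word addresses, double word load/store) are deliberately left out.

```python
from typing import NamedTuple

# Allowed second-instruction classes per first-instruction class,
# transcribed from the PA-7100LC/PA-7300LC table above.
BUNDLE_TABLE = {
    "FLOP": {"LDST", "ALU", "MM", "NUL", "BV", "BR"},
    "LDST": {"FLOP", "ALU", "MM", "NUL", "BR"},
    "ALU":  {"FLOP", "LDST", "ALU", "MM", "NUL", "BR", "FSYS"},
    "MM":   {"FLOP", "LDST", "ALU", "FSYS"},
    "NUL":  {"FLOP"},
    "SYS":  set(),          # never bundled
}

class Instr(NamedTuple):
    cls: str                        # bundling class, e.g. "ALU"
    writes: frozenset = frozenset() # registers the instruction modifies
    reads: frozenset = frozenset()  # registers it takes as operands

def can_bundle(first: Instr, second: Instr) -> bool:
    """Class check plus the basic 'modifies a register the other reads' rule."""
    if second.cls not in BUNDLE_TABLE.get(first.cls, set()):
        return False
    # Simplification: only first-writes / second-reads is checked here.
    return not (first.writes & second.reads)

add = Instr("ALU", frozenset({"gr3"}), frozenset({"gr1", "gr2"}))
load = Instr("LDST", frozenset({"gr5"}), frozenset({"gr4"}))
use = Instr("ALU", frozenset({"gr6"}), frozenset({"gr3"}))
print(can_bundle(add, load), can_bundle(add, use))
```

The first pair is independent and its classes are compatible, so it may bundle; the second pair fails because the second instruction reads the register the first one writes. This is the kind of ordering a compiler tries to avoid between adjacent instructions.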
PA-7200 superscalar capabilities
This is a 2-way superscalar processor implementation. It has two integer ALUs and one FPU. Similar to the PA-7100LC, shift-merge and test condition units are not duplicated in the second ALU. To support the superscalar capabilities one additional write port and two additional read ports were added to the general registers (GR*).
First instruction | Second instruction |
---|---|
FLOP | + LDST/ALU/MM/NUL/BV/BR |
LDST | + FLOP/ALU/MM/NUL/BR |
ALU | + FLOP/LDST/ALU/MM/NUL/BR/FSYS |
MM | + FLOP/LDST/ALU/FSYS |
NUL | + FLOP |
PA-8x00 superscalar capabilities
To be described.
Multimedia Acceleration MAX-1 and MAX-2
MAX-1 (32-bit)
MAX-1 are the original multimedia extensions from the 1990s introduced with the HP PA-7100LC processor and later also the PA-7300LC. The aim from HP in its design was to enable contemporary workstations with these CPUs to provide real-time MPEG video decompression and playback at a rate of 30 frames/second without the need for a special DSP (digital signal processing) chip, not an easy feat.
The HP design process for the PA-7100LC processor in the early 1990s included for the first time multimedia benchmarks for analyzing optimizations in the instruction set design.
The actual implementation used a small set of SIMD-MIMD instructions to facilitate the application of instructions on bundled subword data. Since these instructions use the same data paths and execution units within the processor as the regular instructions, the design team termed this intrinsic signal processing (ISP).
Sticking to conventional RISC principles, the design team decided against adding complex special-purpose instructions to the design but opted for the elegant use of the existing facilities in the CPU, which were slightly modified to understand new, packed subword data.
In 1994, the MAX-1 extensions made their way into the final PA-7100LC product and as such were the first SIMD instructions found in a general microprocessor. Less than 0.2 percent of the processor silicon area had to be used for MAX-1 additions and modifications, while allowing a very significant performance boost in affected applications.
As an example, the then high-end HP 9000 735/99 workstation with a 99 MHz processor and 512 KB cache achieved 18.7 FPS in MPEG decompression benchmarks, while the new entry-level 712 workstation at 60 MHz with 64 KB cache achieved 26 FPS, an impressive feat for 1990s information technology.
New MAX-1 multimedia instructions include: parallel add, parallel subtract, parallel shift left & add (i.e. multiply with integer), parallel shift right & add (i.e. division), parallel average.
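To show why a parallel shift left & add amounts to a multiplication by an integer, the following sketch emulates the idea on two 16-bit subwords packed into a 32-bit word. It is an illustration only: the function names are invented, results simply wrap modulo 2^16, and the overflow behaviour of the real instructions is not modelled.

```python
LANE_BITS = 16
LANE_MASK = 0xFFFF

def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & LANE_MASK) << LANE_BITS) | (lo & LANE_MASK)

def unpack2(word):
    """Split a 32-bit word back into its two 16-bit subwords."""
    return (word >> LANE_BITS) & LANE_MASK, word & LANE_MASK

def parallel_shift_left_add(a, shift, b):
    """Per 16-bit subword: (a << shift) + b, wrapping modulo 2**16."""
    out = 0
    for lane in range(2):
        pos = lane * LANE_BITS
        x = (a >> pos) & LANE_MASK
        y = (b >> pos) & LANE_MASK
        out |= (((x << shift) + y) & LANE_MASK) << pos
    return out

def times5(word):
    # x * 5 == (x << 2) + x, applied to both subwords at once
    return parallel_shift_left_add(word, 2, word)

print(unpack2(times5(pack2(7, 100))))
```

Multiplication by other small constants can be built the same way from a short sequence of shift-and-add steps, which is how a multiply by an integer is obtained without a dedicated parallel multiplier.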
MAX-2 (64-bit)
With the introduction of the new 64-bit PA-RISC 2.0 architecture in 1996 HP unveiled a new set of multimedia-oriented instructions aimed at using the processor’s resources more effectively for subword data. The basic components of contemporary multimedia data were often represented as 8, 12 or 16-bit integers, for example audio samples and pixel color depth. Doing arithmetic with data of this length would waste a considerable amount of the processor’s execution capacity: a simple addition of 16-bit data would only use one quarter of the 64-bit wide integer unit’s datapath. To remedy this situation, MAX allows for packing of these subword data into larger words near the processor’s natural word width (64-bit on PA-RISC 2.0 processors) and using parallel instructions on them. An example would be four 16-bit additions by the 64-bit adder on four 16-bit packed subwords.
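The four-additions-in-one example can be emulated in software to see what the hardware has to do: perform one wide addition while keeping carries from crossing subword boundaries. The sketch below uses a common masking trick for this; it assumes plain modular (wrap-around) arithmetic per subword and is not meant to describe the actual adder circuit. All names are illustrative.

```python
MASK64 = (1 << 64) - 1
HIGH = 0x8000_8000_8000_8000   # top bit of each 16-bit subword
LOW = 0x7FFF_7FFF_7FFF_7FFF    # lower 15 bits of each subword

def pack4(lanes):
    """Pack four 16-bit values (most significant first) into a 64-bit word."""
    word = 0
    for v in lanes:
        word = (word << 16) | (v & 0xFFFF)
    return word

def unpack4(word):
    """Split a 64-bit word into four 16-bit subwords, most significant first."""
    return [(word >> s) & 0xFFFF for s in (48, 32, 16, 0)]

def parallel_add16(a, b):
    """Four independent 16-bit additions using a single 64-bit addition."""
    # Add the low 15 bits of every subword; the sum cannot spill into a neighbour.
    low = (a & LOW) + (b & LOW)
    # Fold in the top bits with XOR, which drops the carry out of each subword.
    return (low ^ ((a ^ b) & HIGH)) & MASK64

a = pack4([1, 2, 3, 0xFFFF])
b = pack4([10, 20, 30, 1])
print(unpack4(parallel_add16(a, b)))
```

The last subword wraps from 0xFFFF to 0 without disturbing its neighbour, which is the property that lets one 64-bit datapath do the work of four 16-bit adders.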
The basic functionality from the earlier 32-bit MAX-1 was taken over and four more instructions added for MAX-2. Additionally, due to the wider integer registers (now 64-bit), more subwords can be processed in one cycle, doubling the effective speed of these multimedia instructions. The MAX-2 multimedia instructions include (the last four are new in MAX-2): parallel add, parallel subtract, parallel shift left & add (i.e. multiply with integer), parallel shift right & add (i.e. division), parallel average, parallel shift right, parallel shift left, mix and permute.
MAX-2 debuted in 1996 with the PA-8000 processor and featured on all subsequent PA-RISC 2.0 processors (PA-8x00). In contrast to contemporary multimedia extensions, MAX-2 required only very little die space (0.1 percent on the PA-8000).
Further reading
Selected papers and articles for further reading on the PA-RISC architecture and platform:
- Hewlett-Packard Precision Architecture: The Processor (.pdf) M. Mahon et al (August 1986: Hewlett Packard Journal. Accessed May 2009)
- PA-RISC 1.1 Architecture and Instruction Set Reference Manual (.pdf) Hewlett-Packard Company (February 1994, third edition. Accessed May 2009 at PA-RISC Linux FTP)
- PA-RISC 2.0 Instruction Set Architecture (.pdf) Hewlett-Packard Company (1995. Accessed May 2009 at PA-RISC Linux FTP)
- Great Microprocessors of the Past and Present, John Bayko (June 2001/V 12.1.1: BURKS. Accessed 28 Dec 2007)
- Single Instruction Multiple Data, Multiple Instruction Multiple Data (MIMD), see for example the SIMD Wikipedia article and MIMD Wikipedia article
- Accelerating Multimedia with Enhanced Microprocessor (PDF, 2.4 MB) Discussion of the MAX-1 instructions. Ruby Lee, April 1995, IEEE Micro, Volume 15 Number 2.
- 64-bit and Multimedia Extensions in the PA-RISC 2.0 Architecture (PDF, 66 KB) New features of the 64-bit PA-RISC 2.0 architecture and overview on the MAX introduced with it. Ruby Lee and Jerry Huck, 1996, Hewlett-Packard Company.
- Subword Parallelism with MAX-2 (PDF, 1.5 MB) Discussion of the MAX-2 instructions. Ruby Lee, August 1996, IEEE Micro, Volume 16 Number 4.
- HEWLETT-PACKARD FILLS IN PRECISION RISC DETAILS, CBR Online, February 17, 1994
- Intel, HP Ally on New Processor Architecture, MICROPROCESSOR REPORT, June 20, 1994
- HP Precision architecture - A new perspective, Hewlett Packard 1986, archive at 1000bit.it
- Beyond RISC - Spectrum Introduction, Hewlett Packard 1986, archive at 1000bit.it
- PA7100LC ERS (External Reference Specification) (.pdf) Hewlett-Packard Company (1999)
- The PA 7100LC Microprocessor: A Case Study of IC Design Decisions in a Competitive Environment Mick Bass et al (PDF, HP Journal 4/95, archive.org mirror)
- PA7300LC ERS (External Reference Specification) (PDF, 716 KB) Hewlett-Packard Company (1996)
- The PA-7300LC: the first System on a Chip (archive.org mirror) Tom Meyer (1996: Presentation for Microprocessor Forum 1995)