# Precision-Timed (PRET) Machines

Stephen A. Edwards

Columbia University

Joint work with Edward A. Lee, University of California, Berkeley

## A Major Historical Event

In 1980, Patterson and Ditzel did not invent reduced instruction set computers (RISC machines).

D. A. Patterson and D. R. Ditzel, "The case for the reduced instruction set computer," ACM SIGARCH Computer Architecture News, 8(6):25-33, Oct. 1980.

## Another Major Historical Event

In 2006, Lee and Edwards did not invent reduced precision-timed computers (PRET machines).

S. A. Edwards and E. A. Lee, "The Case for the Precision Timed (PRET) Machine," EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-149, November 17, 2006.

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-149.html

#### The World as We Know It

We do not consider how fast a processor runs when we evaluate whether it is "correct."



Salvador Dali, *The Persistence of Memory*, 1931. (detail)

#### This Is Sometimes Useful For

- Programming languages
- Virtual memory
- Caches
- Dynamic dispatch
- Speculative execution
- Power management (voltage scaling)
- Memory management (garbage collection)
- Just-in-time (JIT) compilation
- Multitasking (threads and processes)
- Component technologies (OO design)
- Networking (TCP)

#### **But Time Sometimes Matters**



Kevin Harvick winning the Daytona 500 by 20 ms, February 2007. (Source: Reuters)

## Isn't Real-Time Scheduling Solved?



Fixed-priority (RMA): schedulable if < 69% utilization Variable-priority (EDF): schedulable if < 100% utilization Hinges on knowing task execution times

## Interrupt Latency and Response



Need longest interrupt-disabled time + scheduling time After Labrosse, *MicroC/OS-II: The Real-Time Kernel*, 1999.

## Jitter from Delaying for One Tick



#### Worst-Case Execution Time

Virtually impossible to compute on modern processors.

| Feature           | Nearby<br>instructions | Distant instructions | Memory<br>layout |
|-------------------|------------------------|----------------------|------------------|
| Pipelines         | $\sqrt{}$              |                      |                  |
| Branch Prediction | $1 \qquad \sqrt{}$     | $\sqrt{}$            |                  |
| Caches            | $\sqrt{}$              | $\sqrt{}$            | $\sqrt{}$        |

#### State-of-the-art WCET

- Motorola ColdFire
- Two coupled pipelines (7-stage)
- Shared instruction & data cache
- Artificial example from Airbus
- Twelve independent tasks
- Simple control structures
- Cache/Pipeline interaction leads to large integer linear programming problem



C. Ferdinand et al., "Reliable and precise WCET determination for a real-life processor," EMSOFT 2001

#### Certification in Avionics

- Rather expensive
- Software is *not* certified
- Entire system is certified
- Slight change, e.g., in the microprocessor, requires recertification
- Solution: stockpile parts; trust nobody



(Source: NASA)

#### The Problem

Digital hardware provides extremely precise timing



20.000 MHz ( $\pm$  100 ppm)

and architectural complexity discards it.

### Our Vision: PRET Machines

PREcision-Timed processors: Performance & Predicability



(Image: John Harrison's H4, first clock to solve longitude problem)

#### Our Vision: PRET Machines

Predictable performance, not just good average case

| Current                   | Alternative                     |
|---------------------------|---------------------------------|
| Caches                    | Scratchpads                     |
| Pipelines                 | Thread-interleaved pipelines    |
| Function-only ISAs        | ISAs with timing                |
| Function-only languages   | Languages with timing           |
| Best-effort communication | Fixed-latency communication     |
| Time-sharing              | Multiple independent processors |

## **Application Areas**

#### Hard real-time systems

- Avionics
- Automotive
- Multimedia
- Consumer Electronics
- Simple digital hardware











## Basic Idea

Q: How do you make software run at a precise speed?



### Basic Idea

Q: How do you make software run at a precise speed?

A: Give it access to a clock.



## One Usual Way: Timers

Period timer interrupt triggers scheduler

Large period reduces overhead

Linux uses a 10 ms clock

Result: OS provides 10 ms resolution at best

Higher precision requires more overhead

0 ms 10 ms 20 ms 30 ms 40 ms 50 ms 60 ms

## Or NOPs/cycle counting

Code from Linux arch/i386/kernel/timers/timer none.c

```
delay_none:
```

%ebp 0: push %esp,%ebp 1: mov

3: sub \$0x4,%esp

6: 0x8(%ebp),%eax mov

9: 10 jmp

10: 20 jmp

20: dec %eax

21: jns 20

23: %eax, -4(%ebp) mov

26: leave

27: ret Tricky

Clock speed + cache behavior

+ branch behavior +?

This example worries about

cache alignment

Very much an

assembly-language trick

1000s of lines of code in

Linux needed for busy wait

#### Related Work: Giotto

Giotto [Henzinger, Horowitz, Kirsch: Proc. IEEE 2003]

The RTOS style: specify a collection of tasks and modes. Compiler produces schedule (task priorities).

Precision limited by periodic timer interrupt.

```
mode forward() period 200 {
   actfreq 1 do leftJet(leftMotor);
   actfreq 1 do rightJet(rightMotor);
   exitfreq 1 do point(goPoint);
   exitfreq 1 do idle(goIdle);
   exitfreq 1 do rotate(goRotate);
   taskfreq 2 do errorTask(getPos);
   taskfreq 1 do forwardTask(getErr);
}
```

#### Related Work: STI



Software Thread Integration [Dean: RTSS 1998]

Insert code for a non-real-time thread into a real-time thread.

Pad the rest with NOPs

Often creates code explosion

Requires PRET processors; he uses AVRs

#### Related Work: VISA

VISA [Meuller et al.: ISCA 2003]

Run two processors:

- Slow and predictable
- Fast and unpredictable

Start tasks on both.

If fast completes first, use extra time.

If fast misses a checkpoint, switch over to slow.



## A First Attempt

MIPS-like processor with 16-bit data path as proof of concept

One additional "deadline" instruction:

dead timer, timeout

Wait until *timer* expires, then immediately reload it with *timeout*.

Nicholas Ip and Stephen A. Edwards, "A Processor Extension for Cycle-Accurate Real-Time Software," Proceedings of EUC, Seoul, Korea, August 2006.

## Programmer's Model

#### General-purpose Registers

| 15        | 0 |
|-----------|---|
| \$0 (= 0) |   |
| \$1       |   |
| \$2       | 7 |
| •         |   |
| \$13      | 7 |
| \$14      |   |
| \$15      |   |

#### **Timers**

| 15 |      | 0 |
|----|------|---|
|    | \$t0 |   |
|    | \$t1 |   |
|    | \$t2 |   |
|    | \$t3 |   |

#### **Program counter**



## Instructions

| add   | Rd, Rs, Rt        | or    | Rd, Rs, Rt          |
|-------|-------------------|-------|---------------------|
| addi  | Rd, Rs, imm16     | ori   | Rd, Rs, imm16       |
| and   | Rd, Rs, Rt        | sb    | Rd, (Rt + Rs)       |
| andi  | Rd, Rs, imm16     | sbi   | Rd, $(Rs + offset)$ |
| be    | Rd, Rs, offset    | sll   | Rd, Rs, Rt          |
| bne   | Rd, Rs, offset    | slli  | Rd, Rs, imm16       |
| j     | target            | srl   | Rd, Rs, Rt          |
| lb    | Rd, (Rt + Rs)     | srli  | Rd, Rs, imm16       |
| lbi   | Rd, (Rs + offset) | sub   | Rd, Rs, Rt          |
| mov   | Rd, Rs            | subi  | Rd, Rs, imm16       |
| movi  | Rd, imm16         | dead  | T, Rs               |
| nand  | Rd, Rs, Rt        | deadi | T, imm16            |
| nandi | Rd, Rs, imm16     | xnor  | Rd, Rs, Rt          |
| nop   |                   | xnori | Rd, Rs, imm16       |
| nor   | Rd, Rs, Rt        | xor   | Rd, Rs, Rt          |
| nori  | Rd, Rs, imm16     | xori  | Rd, Rs, imm16       |

## Architecture



## Behavior of Dead

|                                                                        | C             | ycle                | instruction         | \$t0          |          |
|------------------------------------------------------------------------|---------------|---------------------|---------------------|---------------|----------|
|                                                                        |               | <b>-4</b>           | deadi \$t0, 8       | 3             |          |
|                                                                        |               | -3                  | "                   | 2             |          |
| deadi \$t0, 8 add \$r1, \$r2, \$r3 deadi \$t0, 10 add \$r1, \$r2, \$r3 |               | -2 "                | 11                  | \ 1           |          |
|                                                                        |               | -1                  | 11                  | 0             |          |
|                                                                        | 0             | add \$r1, \$r2, \$r | 3 (7)               |               |          |
|                                                                        | 1             | deadi \$t0, 10      | 6                   |               |          |
|                                                                        | φ11, φ12, φ13 | 2                   | "                   | 5             |          |
|                                                                        |               | •                   | •                   | \ :           | 8 cycles |
|                                                                        | 7             | 7                   | 11                  | $\setminus 0$ |          |
|                                                                        |               | 8                   | add \$r1, \$r2, \$r | 3 9           |          |

## Idioms: Straightline Code

```
deadi $t0, 42

First block will take deadi $t0, 58 at least 42 cycles.

Second block: at deadi $t0, 100 least 58 cycles.
```

# Idioms: Loops

```
L1:
deadi $t0, 42 ← Put a deadline in a loop:
Each iteration will take at bne $r1, $r2, L1 least 42 cycles.
```

## Case Study: Video

 $80 \times 30$  text-mode display, 25 MHz pixel clock

Need 40 ns precision

Shift register in hardware; everything else in software



## Case Study: Video

```
Two nested loops:
                        ; reset line address
        $2, 0
 movi
row:
        $7, 0
                        ; reset line in char
 movi
                                                   Active line
line:
 deadi $t1, 96
                        ; h. sync period
        $14, HS+HB
 movi
                                                    Character
        $3, $7, FONT
                        ; font base address
 ori
 deadi $t1, 48
                        ; back porch period
 movi
        $14, HB
                                              Two timers:
 deadi $t1, 640
                        ; active video period
        $1, 0
                        ; column number
 mov
char:
                                                  $t1 for line timing
        $5, ($2+$1)
 1b
                        ; load character
                        ; *16 = lines/char
 shli
        $5, $5, 4

    $t0 for character

 deadi $t0, 8
                        ; wait for next character
 1b
        $14, ($5+$3)
                        ; fetch and emit pixels
 addi
        $1, $1, 1
                        ; next column
        $1, $11, char
 bne
                                              78 lines of assembly
                        ; front porch period
 deadi
        $t1, 16
                                              replaces 450 lines
 movi
        $14, HB
 addi
        $7, $7, 1
                        ; next row in char
                                              of VHDL (1/5th)
        $7, $13, line
                        ; repeat until bottom
 bne
 addi
        $2, $2, 80
                        ; next line
        $2, $12, row
                        ; until at end
 bne
```

# Case Study: Serial Receiver

```
Sampling rate under
                    ; final bit mask (10 bits)
 movi $3, 0x0400
                    ; half bit time for 9600 baud software control
 movi $5, 651
 shli $6, $5, 1
                    ; calculate full bit time
wait for start:
                                              Standard algorithm:
 bne $15, $0, wait for start
got start:
 wait $t1, $5
                    ; sample at center of bit
                                                 1. Find falling edge
                    ; clear received byte
 movi $14, 0
                                                     of start bit
                    ; received bit mask
 movi $2, 1
 movi $4, 0
                    ; clear parity
 dead $t1, $6
                    ; skip start bit
                                                 2. Wait half a bit
receive bit:
 dead $t1, $6
                    ; wait until center of next bit
                                                     time
 mov $1, $15
                    ; sample
 xor $4, $4, $1
                    ; update parity
 and $1, $1, $2
                    ; mask the received bit
                                                 3. Sample
 or $14, $14, $1
                    ; accumulate result
 shli $2, $2, 1
                    ; advance to next bit
 bne $2, $3, receive bit
                                                 4. Wait full bit time
check parity:
      $4, $0, detect baud rate
 andi $14, $14, 0xff; discard parity and stop bits
                                                 5. Repeat 3. and 4.
```

## Implementation

Synthesized on an

Altera Cyclone II FPGA

(DE2 board)

Coded in VHDL

Runs at 50 MHz

Unpipelined

Uses on-chip memory



#### Conclusions

- Embedded applications need timing control
- RTOSes on modern processors too unpredictable
- We need hardware support
- High-performance processors with predictable timing
- Predictable performance our mantra
- A first cut: MIPS-like processor with timers
- 50 MHz on an Altera Cyclone II FPGA
- *Dead* instruction waits for timeout, then reloads
- Video controller 1/5 the size of VHDL
- Serial controller even simpler