Microblaze timing

Microblaze Memory Map

This is the standard configuration given to you for lab3.

0x00000000 - 0x00000FFF (4 KB)    on-chip fast memory (LMB SRAM)
0x00800000 - 0x0087FFFF (512 KB)  OPB external SRAM
0xFEFE0000 - 0xFEFEFFFF (64 KB)   OPB audio device
0xFEFF0000 - 0xFEFF00FF (256 B)   OPB interrupt controller
0xFEFF0100 - 0xFEFF01FF (256 B)   OPB UART (serial) device
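
If you want the map available in your C code, a header along these lines may help. This is only a sketch: the macro and function names are made up for illustration (they are not from the lab files). Only the addresses come from the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the addresses are taken from the memory map. */
#define LMB_SRAM_BASE   0x00000000u
#define LMB_SRAM_END    0x00000FFFu
#define OPB_SRAM_BASE   0x00800000u
#define OPB_SRAM_END    0x0087FFFFu
#define OPB_AUDIO_BASE  0xFEFE0000u
#define OPB_AUDIO_END   0xFEFEFFFFu
#define OPB_INTC_BASE   0xFEFF0000u
#define OPB_INTC_END    0xFEFF00FFu
#define OPB_UART_BASE   0xFEFF0100u
#define OPB_UART_END    0xFEFF01FFu

/* Size in bytes of the inclusive address range [base, end]. */
uint32_t region_size(uint32_t base, uint32_t end)
{
    assert(end >= base);
    return end - base + 1u;
}

/* Returns 1 if addr lies inside the inclusive range [base, end], else 0. */
int in_region(uint32_t addr, uint32_t base, uint32_t end)
{
    return addr >= base && addr <= end;
}
```

For example, region_size(LMB_SRAM_BASE, LMB_SRAM_END) gives the 4096 bytes of LMB.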


The program (code, data, stack, etc.) can be located in either LMB-SRAM or external OPB-SRAM.
There is a tradeoff : LMB is faster, but it has a much lower capacity.

In your first labs you used only LMB (remember the 4 KB limitation ?).
For a bigger project, however, you have to decide which code / data will be stored in LMB and which in OPB-SRAM.



The video display takes the first 640*480=307200 bytes of OPB-SRAM memory (i.e. 0x00800000 - 0x0084AFFF).

Microblaze is configured with a 2 KB instruction cache, which covers the last 128 KB of the
OPB-SRAM : 0x00860000 - 0x0087FFFF; there is NO data cache.
This area is ideal for storing code, but it can also store data.

The gap in between (i.e. 0x0084B000 - 0x0085FFFF - not cached !) can also be used freely.
In my samples - for simplicity - all upper code / data are stored starting from 0x00860000.
Remember this 84 KB gap if you run out of memory in your final project !
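
The boundaries of the OPB-SRAM layout can be double-checked with a few lines of C. This is a sketch for checking the arithmetic only. The names are hypothetical, and the sizes (640*480 bytes of video, a 128 KB cached window at the top) are the ones stated in this section.

```c
#include <assert.h>
#include <stdint.h>

#define OPB_SRAM_BASE   0x00800000u
#define OPB_SRAM_END    0x0087FFFFu
#define VIDEO_BYTES     (640u * 480u)     /* video display buffer size */
#define ICACHE_WINDOW   (128u * 1024u)    /* cached part, at the top   */

/* Last address used by the video display. */
uint32_t video_end(void)   { return OPB_SRAM_BASE + VIDEO_BYTES - 1u; }

/* First address covered by the instruction cache. */
uint32_t cached_base(void) { return OPB_SRAM_END + 1u - ICACHE_WINDOW; }

/* First address of the uncached gap, right after the video memory. */
uint32_t gap_base(void)    { return video_end() + 1u; }

/* Size in bytes of the uncached gap. */
uint32_t gap_bytes(void)
{
    assert(gap_base() <= cached_base());
    return cached_base() - gap_base();
}
```

gap_bytes() evaluates to 86016 bytes, i.e. the 84 KB between 0x0084B000 and 0x0085FFFF.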

Estimate your worst case

The following section will allow you to make speed ESTIMATES of your code.
As most embedded applications are real-time, this is ESSENTIAL. For the same reason,
we are interested in estimating the WORST CASE, not the average.

Microblaze execution speed

Microblaze is a 32 bit RISC CPU (so it's a load / store architecture)

Our Microblaze clock is 50 MHz (FPGAs are ideal for prototyping, but slow)

An instruction takes a variable number of cycles:

fetch penalty + instr. latency + data access (only for load / store)

- instr. latency 
most instr. execute in 1 cycle (e.g. ADD, AND, etc.)
several instr. (e.g. branches, barrel shifts, etc.) take 2-3 cycles.

- fetch penalty
if the code executes from LMB or is cached, this is 0 : you run at full speed
if it executes from OPB-SRAM (not cached), this is 7 cycles !!!!

- data access
all LMB data accesses take 1 cycle, regardless of size or direction (load vs. store)
the OPB-SRAM accesses take:
store 32/16/8 ( SW/SH/SB )  : 3 cycles
load 32 ( LW ) : 7 cycles !!!!
load 16/8 (LHU, LBU) : 5 cycles !!
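
The rules above can be put into a small estimator that you run on your PC (not on the board). This is a minimal sketch: the enum and function names are invented for illustration, and the cycle counts are simply the ones listed above.

```c
#include <assert.h>

/* Where the instruction is fetched from. */
enum code_src { CODE_LMB, CODE_OPB_CACHED, CODE_OPB_NOT_CACHED };
/* Where the data (if any) lives. */
enum data_mem { DATA_LMB, DATA_OPB };
/* Kind of data access: none, SW/SH/SB, LW, or LHU/LBU. */
enum data_op  { OP_NONE, OP_STORE, OP_LOAD32, OP_LOAD_SMALL };

/* Fetch penalty: 0 from LMB or cache, 7 from uncached OPB-SRAM. */
int fetch_penalty(enum code_src src)
{
    return src == CODE_OPB_NOT_CACHED ? 7 : 0;
}

/* Data access cycles: 1 for LMB, 3 / 7 / 5 for OPB-SRAM. */
int data_cycles(enum data_mem mem, enum data_op op)
{
    if (op == OP_NONE)
        return 0;
    if (mem == DATA_LMB)
        return 1;
    switch (op) {
    case OP_STORE:      return 3;
    case OP_LOAD32:     return 7;
    case OP_LOAD_SMALL: return 5;
    default:            return 0;
    }
}

/* Total = fetch penalty + instr. latency + data access. */
int instr_cycles(enum code_src src, int latency,
                 enum data_mem mem, enum data_op op)
{
    assert(latency >= 1);
    return fetch_penalty(src) + latency + data_cycles(mem, op);
}
```

For instance, a SW to OPB-SRAM executed from LMB is instr_cycles(CODE_LMB, 1, DATA_OPB, OP_STORE) = 4 cycles, while an LW from OPB-SRAM executed from uncached OPB-SRAM costs 7 + 1 + 7 = 15 cycles.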


EXAMPLE
We want to clear the screen. This takes 640*480 / 4 = 76800 32-bit stores to the video memory.
Assuming that the loop is unrolled and the code executes from LMB, each store takes
1 + 3 = 4 cycles. Total : 76800 * 4 = 307200 cycles. As the clock is 50 MHz, the time is
307200 / 50 MHz = 6.1 ms. Adding a little loop overhead, we can safely guarantee 8 ms.
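
A possible shape for such a routine, together with the estimate, is sketched below. The function names and the unroll factor of 8 are my own choices for illustration. The routine takes a pointer so it can also be tried on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCREEN_WORDS (640u * 480u / 4u)   /* 76800 32-bit words */

/* Clear nwords 32-bit words, 8 stores per loop iteration.
   nwords must be a multiple of 8 (76800 is). */
void clear_words(volatile uint32_t *dst, size_t nwords)
{
    assert(nwords % 8u == 0u);
    for (size_t i = 0; i < nwords; i += 8) {
        dst[i]     = 0; dst[i + 1] = 0; dst[i + 2] = 0; dst[i + 3] = 0;
        dst[i + 4] = 0; dst[i + 5] = 0; dst[i + 6] = 0; dst[i + 7] = 0;
    }
}

/* Estimate: each store costs 1 (instr.) + 3 (OPB-SRAM write) cycles.
   Loop overhead is not included. */
unsigned long clear_cycles(unsigned long nwords)
{
    return nwords * 4ul;
}

/* Convert cycles to milliseconds at the 50 MHz clock. */
double cycles_to_ms(unsigned long cycles)
{
    return (double)cycles / 50e6 * 1e3;
}
```

On the board the call would be something like clear_words((volatile uint32_t *)0x00800000, SCREEN_WORDS). The estimate clear_cycles(SCREEN_WORDS) gives 307200 cycles, about 6.1 ms.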

CACHE
The instr. cache is 2 KB, direct mapped, with a 32-bit (1 word) cache line.
This sounds poor, but you'll see it does miracles.
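
To get a feeling for what "direct mapped" means here: 2 KB of 1-word lines is 512 lines, and each code address can live in exactly one of them. The sketch below assumes the usual direct-mapped indexing (word address modulo number of lines). I have not checked which address bits the actual hardware uses, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ICACHE_BYTES 2048u
#define LINE_BYTES   4u
#define ICACHE_LINES (ICACHE_BYTES / LINE_BYTES)   /* 512 lines */

/* Cache line used by the instruction at addr (assumed indexing). */
uint32_t icache_line(uint32_t addr)
{
    assert(addr % LINE_BYTES == 0u);   /* instructions are word aligned */
    return (addr / LINE_BYTES) % ICACHE_LINES;
}

/* 1 if two different instructions compete for the same cache line. */
int icache_conflict(uint32_t a, uint32_t b)
{
    return a != b && icache_line(a) == icache_line(b);
}
```

Under this assumption, two instructions whose addresses differ by a multiple of 2 KB (e.g. 0x00860000 and 0x00860800) evict each other, while neighbours never do.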

What if the code is in cacheable OPB-SRAM ?
Assume (worst case !) the code is not cached yet. The first pass through the loop pays the
fetch penalty on every instruction, but the next thousands of passes execute at full speed.
Result : we can also guarantee 8 ms.
Conclusion : place the code in the OPB-SRAM; why waste precious LMB ?


EXAMPLE
Multiplication. You already know that Microblaze / Spartan does not have a h/w multiplier.
Instead, these ops are done by s/w routines.

The mulsi routine has ~ 22 instr. and executes ~ 210 instr. in the worst case
(210 > 22 because of the loops :-)
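
To see why a short routine executes so many instructions, here is a classic shift-and-add multiply in C. This is NOT the actual mulsi code from the library, just a sketch of the general technique with names I made up. It loops once per significant bit of the second operand, up to 32 times.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiply; returns the low 32 bits of a * b.
   Works for signed operands too (two's complement). */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a;   /* add the shifted multiplicand */
        a <<= 1;
        b >>= 1;
    }
    return result;
}

/* Number of loop iterations soft_mul needs for a given b. */
int soft_mul_iterations(uint32_t b)
{
    int n = 0;
    while (b != 0) {
        b >>= 1;
        n++;
    }
    assert(n <= 32);
    return n;
}
```

With the top bit of b set the loop runs 32 times; at a handful of instructions per pass, that is in the same range as the ~ 210 figure above.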

If executed from the LMB, it will take approx 210 cycles (that's 4.2 us).

What if the code is in SRAM ? Worst case: not cached.
Overhead : 22 * 7 = 154 cycles.

Is this the only penalty we have to pay ? No !
Assume mulsi is called from a function also stored in OPB-SRAM.
In the WORST case, the mulsi code will evict from the cache EXACTLY the code that follows the call.
So, after mulsi returns, we need another 22 * 7 = 154 cycles to cache that code back.

Worst case : 154 + 210 + 154 = 518 cycles (that's 10.36 us, more than double)
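
The same arithmetic as a small C sketch, in case you want to redo it for your own routines. The function names are hypothetical. The 7-cycle miss penalty and the 50 MHz clock are the figures used in this section.

```c
#include <assert.h>

#define CLOCK_HZ     50000000.0
#define MISS_PENALTY 7            /* cycles per uncached instr. fetch */

/* Worst case for a routine in cacheable OPB-SRAM:
   fill the cache with the routine, execute it, then
   re-fetch the same amount of evicted caller code. */
int worst_cycles_opb(int code_instr, int executed_instr)
{
    assert(code_instr > 0 && executed_instr >= code_instr);
    int refill = code_instr * MISS_PENALTY;
    return refill + executed_instr + refill;
}

/* Convert cycles to microseconds at the 50 MHz clock. */
double cycles_to_us(int cycles)
{
    return (double)cycles / CLOCK_HZ * 1e6;
}
```

worst_cycles_opb(22, 210) gives the 518 cycles above, versus 210 cycles when mulsi runs from LMB.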


Conclusion : Real-time design is playing against the devil.
You CANNOT assume that the above scenario will not occur.


So: where to place mulsi ? It takes 22 * 4 = 88 bytes.
It is not a small size for LMB, but not a disaster either.
Estimate how many muls you need to execute in a second.
Compute the time they take in both situations.
Then settle the tradeoff one way or the other.


Time spent multiplying in each second (LMB / OPB-SRAM worst case):

Assumption 1: 100 mul/s : 420 us / 1036 us  -- place it in OPB-SRAM

Assumption 2: 10000 mul/s : 42 ms / 103.6 ms -- place it in LMB

Assumption 3: 1000000 mul/s : 4.2 s / 10.36 s -- OOPS -- you have to build some h/w to help you
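
A quick way to redo this table for your own numbers is sketched below. The function names are made up. The inputs are the cycles per multiplication from the example (210 from LMB, 518 worst case from OPB-SRAM).

```c
#include <assert.h>

#define CLOCK_HZ 50000000.0

/* Seconds of CPU time spent multiplying during one second of run time. */
double mul_load(double muls_per_second, int cycles_per_mul)
{
    assert(muls_per_second >= 0.0 && cycles_per_mul > 0);
    return muls_per_second * cycles_per_mul / CLOCK_HZ;
}

/* 1 if the multiplications alone take less than the whole CPU, else 0. */
int mul_load_fits(double muls_per_second, int cycles_per_mul)
{
    return mul_load(muls_per_second, cycles_per_mul) < 1.0;
}
```

A result above 1.0 (as for 1000000 mul/s, even from LMB) means the CPU cannot keep up no matter where the code is placed.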