# FPGA JPEG Image Compression Accelerator

**EECS 4840** 

Yuxiang Chen, Xinyi Chang, Song Wang, Nan Zhao Electrical Engineering, Columbia University Prof. Stephen A. Edwards

### JPEG Image Compression



#### Software to FPGA

- > 64 pixels x 8 bits/pixel = 256 bits
- > 256 bits / 32 bits/stream = 16 stream
- Input = buffer[i] + buffer[i+1] << 8 + buffer[i+2] << 16 + buffer[i+3] << 24</p>
- Decode the 32 bits data in HW
- > HW waits for 16 write states, then go to the next state computing DCT

| ta 3      | data 2 | data 1 |  |
|-----------|--------|--------|--|
| → 32 bits |        |        |  |
|           |        |        |  |

$$y(k) = w(k) \sum_{n=1}^{N} x(n) \cos\left(\frac{\pi}{2N} (2n-1)(k-1)\right), \quad k = 1, 2, \dots, N, \qquad w(k) = \begin{cases} \frac{1}{\sqrt{N}}, & k = 1, \\ \sqrt{\frac{2}{N}}, & 2 \le k \le N, \end{cases}$$

#### **Loeffler Algorithm**

- Number of multiplications reach the theoretical low limit.
- 4 Stages
- MultAddSub Blocks



#### [1]M. Jridi and A. Alfalou,

#### Canonical signed digit (CSD) representation

$$y(k) = w(k) \sum_{n=1}^{N} x(n) \cos\left(\frac{\pi}{2N} (2n-1)(k-1)\right), \quad k = 1, 2, \dots, N, \qquad w(k) = \begin{cases} \frac{1}{\sqrt{N}}, & k = 1, \\ \sqrt{\frac{2}{N}}, & 2 \le k \le N, \end{cases}$$

#### TABLE I 8-POINT DCT FIXED COEFFICIENT REPRESENTATION

#### CSD

- Signed representation containing the fewest number of nonzero bits
- Effective way to carry out constant multiplier for DCT.
- Number of additions and subtractions will be minimized.
- Identified common elements in CSD constant coefficients and shared required resource

 $X = 2^a \pm 2^b \pm 2^c \pm ....$ 

| Real value            | Decimal    | Natural binary | Partial products | CSD       | Partial products |  |
|-----------------------|------------|----------------|------------------|-----------|------------------|--|
| $cos \frac{3\pi}{16}$ | 106        | 01101010       | 4                | +0-0+0+0  | 4                |  |
| $sin\frac{3\pi}{16}$  | 71         | 01000111       | 4                | 0+00+00-  | 3                |  |
| $cos \frac{\pi}{16}$  | 126        | 01111110       | 6                | +00000-0  | 2                |  |
| $sin\frac{\pi}{16}$   | 25         | 00011001       | 3                | 00+0-00+  | 3                |  |
| $cos \frac{6\pi}{16}$ | 49         | 00110001       | 3                | 0+0-000+  | 3                |  |
| $sin\frac{6\pi}{16}$  | 118        | 01110110       | 5                | +000-0-0  | 3                |  |
| V(2)                  | 181        | 10110101       | 5                | +0-0-0+0+ | 5                |  |
| Total Partia          | l products | 1              | 30               | 23        |                  |  |

#### [1]M. Jridi and A. Alfalou,

# RTL Block Diagram for DCT-1 and DCT-2



# RTL Block Diagram for DCT-2



#### Quantization

| 16  | 16  | 16  | 16  | 32  | 64  | 64  | 64  |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 16  | 16  | 16  | 16  | 32  | 64  | 64  | 64  |
| 16  | 16  | 16  | 32  | 32  | 64  | 64  | 64  |
| 16  | 16  | 32  | 32  | 32  | 64  | 64  | 64  |
| 32  | 32  | 32  | 64  | 128 | 128 | 128 | 128 |
| 64  | 64  | 64  | 64  | 128 | 128 | 128 | 128 |
| 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |

Table 2: Modified Normalization Matrix For Hardware Simplification

- The step where we actually throw away data.
- Reduce most of the less important high frequency DCT coefficients to zero,
- Lower numbers in the upper left direction and large numbers in the lower right direction

Zigzag



Zigzag Scan Order

• Obtain the one-dimensional vectors with a lot of consecutive zeroes

|  | Run Cate | egory Bit Value |  | Run | Category | Bit Value | EOB |  |
|--|----------|-----------------|--|-----|----------|-----------|-----|--|
|--|----------|-----------------|--|-----|----------|-----------|-----|--|

- Run represents the number of previous consecutive zeros.
- Category represents the bit value length of non-zero value.
- End with EOB when last bits are 0..

# **DC Huffman Encoding**

- dc\_diff = dc\_current -dc\_previous
- > dc\_diff\_length = getCategory(dc\_diff)
- > dc\_codeword = dc\_lookup\_table(dc\_diff)
- register\_line = register\_line + (ac\_codeword << category) + dc\_diff</pre>



### AC Huffman Encoding

- ac\_diff\_length = getCategory(bit\_value)
- > ac\_codeword = ac\_lookup\_table()
- register\_line = register\_line + (ac\_codeword << category) + bit\_value</pre>



#### **Bit Stream Compression**

- Initialized a 1024-bit length register\_line,
- $\succ$  While (there is data):
  - register\_line = (register\_line << data\_length) + data;</pre>
  - total\_line\_size += data\_length



#### Compressed Data to Software

- register\_line << register\_length,</pre>
- ≻ do:
  - data\_back = register\_line[1023:991]
  - Register\_line << 32 bits</li>
- > while(data != 0)



# Result

| 1. DCT input:                                                                                                                                                                                                      | 2. D                   | CT output:                                                                                                                                                                                                                                                            | 3.Quantization output:                               |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|
| Start                                                                                                                                                                                                              | input =                |                                                                                                                                                                                                                                                                       | output =                                             |
| Input:Block:1<br>100, 0, 0, 100, 0, 0, 0, 0, 0,<br>100, 0, 0, 0, 0, 0, 0, 0, 0,<br>100, 0, 0, 0, 0, 0, 0, 0, 0,<br>100, 200, 0, 0, 0, 0, 0, 0, 32,<br>100, 0, 0, 0, 0, 0, 0, 0, 0,<br>100, 0, 0, 0, 0, 0, 0, 0, 0, | 12 -8 -28<br>82 -15 29 | $\begin{array}{cccccccccccccccccccccccccccccccccccc$                                                                                                                                                                                                                  | $\begin{array}{cccccccccccccccccccccccccccccccccccc$ |
| 0, 0, 0, 0, 0, 0, 0, 0, 0,<br>0, 0, 0, 0, 0, 0, 0, 100,                                                                                                                                                            |                        | Columns 1 through 22                                                                                                                                                                                                                                                  |                                                      |
| 0, 0, 0, 0, 0, 0, 0, 100,                                                                                                                                                                                          | 4. Zigzag output       | <ul> <li>-61 6 4 0 3 8 3 (</li> <li>Columns 23 through 44</li> </ul>                                                                                                                                                                                                  | 0 -7 0 2 0 -2 2 2 0 0 -1 0 0 0                       |
| 5. RLC output:                                                                                                                                                                                                     |                        | 0 0 0 1 0 0 0<br>Columns 45 through 64                                                                                                                                                                                                                                | 0 0 0 0 0 0 0 0 0 0 0 0 0 0                          |
| DC: -61 -> (14) ->                                                                                                                                                                                                 | 111000001              | 0 0 0 0 0 0 0                                                                                                                                                                                                                                                         | 0 0 0 0 0 0 0 0 0 0 0 0                              |
| AC: (0, 6)->(0,3,6)->(100,110)->38<br>(0, 4)->(0,3,4)->(100,100)->36                                                                                                                                               | 100110<br>100100       | <ol><li>Bitstream out</li></ol>                                                                                                                                                                                                                                       | tput:                                                |
| $ \begin{array}{cccccccccccccccccccccccccccccccccccc$                                                                                                                                                              |                        | Start<br>Input:Block:1<br>100, 0, 0, 100, 0, 77, 0, 88,<br>100, 0, 0, 0, 0, 0, 0, 0, 0,<br>100, 0, 0, 0, 0, 0, 0, 0,<br>100, 200, 0, 0, 0, 0, 0, 0,<br>100, 0, 0, 0, 0, 0, 0, 0,<br>100, 0, 0, 0, 0, 0, 0, 0,<br>0, 0, 0, 0, 0, 0, 0, 0,<br>0, 0, 0, 0, 0, 0, 0, 100, |                                                      |
| (7,1)->(7,1,1)->(11111010,1)->501<br>(0,0)->(1010)->10                                                                                                                                                             | 111110101<br>1010      | Output:<br>1110000010100110100100110111110                                                                                                                                                                                                                            | 01110000111111100100011011101101011001101110001111   |

#### Reference

[1] M. Jridi, A. Alfalou, "A low-power, high-speed DCT architecture for image compression: Principle and implementation," 18th IEEE/IFIP VLSI System on Chip Conference (VLSI-SoC), 2010, pp. 304-309.

[2] Y. H. Chen , T. Y. Chang and C. Y. Li , "Highthroughput DA-based DCT with high accuracy error-compensated adder tree" , IEEE Trans. VeryLarge Scale Integr. (VLSI) Syst. , vol. 19 , no. 4 , pp.709 -714 , 2011

[3] V. Gupta, D. Mohapatra, A. Raghunathan and K. Roy, "Low-power digital signal processing using approximate adders", IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp.124-137, 2013

[4] P. Kulkarni, P. Gupta, and M. D. Ercegovac, "Trading accuracy for power in a multiplier architecture," J. Low Power Electron., vol. 7, no. 4, pp.490–501, 2011.

[5] Y. V. Ivanov and C. J. Bleakley, "Real-time h.264 video encoding in software with fast mode decision and dynamic complexity control," ACM Trans. Multimedia Comput. Commun. Applicat., vol. 6, pp. 5:1–5:21, Feb. 2010.

[6] Wei-Yi We, National Taiwan University, "An Introduction to Image Compression"