



| Section | Title                        |
|---------|------------------------------|
| 1       | Introduction                 |
| 1.1     | Aim                          |
| 1.2     | Overview                     |
| 2       | Design Architecture          |
| 2.1     | System Architecture          |
| 2.2     | Hardware Section             |
| 2.2.1   | Avalon Bus                   |
| 2.3     | Software Section             |
| 3       | Simulation                   |
| 3.1     | Introduction                 |
| 3.2     | Simulation Test Bench        |
| 3.2.1   | Component Simulation         |
| 3.2.2   | RAM Simulation               |
| 3.2.3   | Conclusion                   |
| 4       | Hardware Design              |
| 4.1     | Interfacing with Software    |
| 4.1.1   | Memory - RAM                 |
| 4.2     | Network Fabric               |
| 4.2.1   | Crossbar Switch Model        |
| 4.3     | Scheduling Algorithm         |
| 4.3.1   | Single Input Queue Scheduler |
| 4.3.2   | Performance Comparison       |
| 4.3.3   | The whole suite              |
| 5       | Software                     |
| 5.1     | Implementation details       |
| 5.1.1   | Packet Generator             |
| 5.1.2   | Validator                    |
| 6       | Evaluation                   |
| 6.1     | FPGA Switch Performance      |
| 7       | Conclusion                   |
| 7.1     | Lessons Learnt               |
| 7.2     | Future Work                  |
| 8       | Appendix                     |
| 8.1     | File Listings                |



# 1.1 Aim

The aim of the project is to create an FPGA-based switch. The main focus is on optimising the throughput of a network switch through the implementation of a scheduler. Decoding of actual incoming packets is not considered in this project<sup>1</sup>. The generated packets therefore contain only the following items:

- Randomly generated data payload of variable length
- 8 bits of header that determines the destination port
- 8 bits storing the random seed number used for the payload generation

## 1.2 Overview

The FPGA contains a few components that make up the entire switch. The routing algorithm is handled by the scheduler within the FPGA to optimise the amount of throughput that the switch can handle<sup>2</sup>. The scheduler has to maintain correctness while working towards maximum efficiency. Random Access Memory (RAM) blocks also exist on the FPGA and model the real world input and output ports.

The user space consists of the packet generator and the validator, which interface with the FPGA. The packet generator is responsible for generating packets with random destination ports and feeding them into the FPGA module. The validator then reads from the output RAM and ensures that packets are routed correctly and no segments are dropped.

<sup>1</sup>There will be no decoder in the mainframe of the project and so any packet that is generated and sent to the switch will pass through to the port specified.

<sup>2</sup>Throughput is defined as the number of packets received at the output port in one clock cycle.



# 2.1 System Architecture

The design architecture of the system is shown in Figure 2.1, where both the software and the hardware components appear in the block diagram.



Figure 2.1: Block diagram showing the overall functionality and flow of the system

The userspace packet generator is responsible for generating random packets each with:

- 8 bits of header representing the destination port
- 8 bits storing the random seed that the data is generated based upon
- variable length data payload up to 64 bits

These packets are then sent to the packet sorting fabric on the FPGA, which decides which RAM each packet is stored in based on its source port and destination port. Each of the 4 inputs to the Scheduler has a cascade of 4 RAMs, one per destination port, so the RAM holding a packet identifies where it has to be routed. The Scheduler then routes the packets from the source RAMs to the corresponding destination ports. Its main aim is to maximise throughput by routing as many packets as possible through the switch in every clock cycle. The RAMs at the output store the routed packets, and each output RAM contains only packets whose destination port corresponds to that output. The final step is the userspace validation, where the data stored in the output RAMs is retrieved and used to validate the integrity of the packets sent through the switch.

## 2.2 Hardware Section

The hardware section of the system consists of the blocks shown in Figure 2.2. It is responsible for storing the input packets and routing them to the correct destination output port. The hardware implemented on the FPGA interfaces with the userspace software program using a master-slave architecture. Note that the hardware architecture is not affected by the length of the packet being routed: it keeps routing the words of a packet to the destination port until an 'end-of-packet' identifier is reached. This identifier is transmitted as a 32-bit word of zeros.



Figure 2.2: Hardware segment of the system

# 2.2.1 Avalon Bus

The userspace talks to the FPGA over the Avalon bus. It has access to various registers, which are registered with the device driver and accessed through ioctl32 read and write calls. This is the only part of the project that has not been evaluated on Verilator, because the slides already show the real scenario for the assert signals. Figures 2.3 and 2.4 show the readdata and writedata transfer timing diagrams used throughout the project.



Figure 2.3: Avalon Bus Read Signal



Figure 2.4: Avalon Bus Write Signal

# 2.3 Software Section

The software segment of the system is responsible for generating the input packets and validating the output packets after they have been routed through the switch. It consists of the userspace packet generator and the userspace validator, as shown in Figure 2.5 below. The userspace packet generator uses a random number generator to produce a data payload of variable length, up to 64 bits. It also includes in the packet a header containing the destination port and the seed number used for the generation. This ensures that, on the validation side of the userspace, the software can regenerate the packet from the same seed number to verify its integrity. This is explained in detail in a later section.
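The packet structure described above can be sketched in a few lines of Python. This is only an illustrative model, not the project's packetgen.c: the function name `make_packet`, the placement of the seed in bits 15:8 and the port in bits 7:0 of the header word, and the restriction to non-zero seeds and payload words (so that nothing before the terminator looks like the 32-bit zero end-of-packet word) are all assumptions.

```python
import random

END_OF_PACKET = 0x00000000  # a 32-bit zero word marks the end of a packet


def make_packet(dest_port, seed, n_words):
    """Build one packet as a list of 32-bit words (the layout is an assumption)."""
    if not (0 <= dest_port < 4 and 1 <= seed <= 255 and n_words >= 1):
        raise ValueError("dest_port in 0..3, seed in 1..255, n_words >= 1")
    rng = random.Random(seed)          # the payload depends only on the seed
    header = (seed << 8) | dest_port   # assumed: seed in bits 15:8, port in bits 7:0
    # Payload words are drawn from 1..2**32-1 so none equals the terminator.
    payload = [rng.randint(1, 0xFFFFFFFF) for _ in range(n_words)]
    return [header] + payload + [END_OF_PACKET]


packet = make_packet(dest_port=2, seed=17, n_words=4)
```

Because the payload is a pure function of the seed, calling `make_packet` again with the same arguments reproduces the packet word for word, which is the property the validator relies on.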



Figure 2.5: Software segment of the system



## 3.1 Introduction

The project depends heavily on simulation, so a robust test suite was created for it. Of the various simulators available, Verilator was chosen for compiling the hardware code, and an exhaustive test bench was written in C++ to interface with the compiled model. Note that the same hardware code is also compiled in Quartus for the actual hardware, and the two tool flows differ in some nuances and quirks. For example, Altera's compiler limits the number of iterations in a for loop to 255, which Verilator does not. There are many such differences between the Verilator simulations and the actual hardware implementation, so care is needed when experimenting.

# 3.2 Simulation Test Bench

Simulating Altera's IP cores in Verilator was an integral part of the simulation, and also the most challenging one, since over the course of the project the design contained different IP cores such as FIFO, MUX and RAM. After many iterations, several of these IP cores were removed or replaced in the final design, but that would not have been possible without a deeper understanding of the timing diagrams and of the design issues that needed to be resolved.

SwitchON can simulate Altera's IP cores in the design, with several caveats. For instance, the RAM module is defined in the altera\_mf.v file in the EDA simulation directory, but that file is not standard, i.e. it cannot be compiled by Verilator as it is, so the other components in it (about 100k lines) have to be removed and several helper functions need to be added. Altera also uses a lot of tri-state logic, which does not simulate properly in Verilator. This logic can be removed, but extra lint waivers must then be added with care, because the change leaves signals assigned in both blocking and non-blocking fashion. The Veripool community is very helpful, and some of the scripts they provided using Veripool-perl were instrumental in simulating the altsync RAMs. Again, the idea was to convert the tri-state logic to wire logic.

# 3.2.1 Component Simulation

The test bench can compile each component separately as well as the full model suite. This modelling style allowed each module to be examined in great detail, which in turn made it possible to optimize each clock cycle and achieve high throughput through the scheduler. Each IP core has its quirks, and although it was frustrating when they did not work as expected, it was a good learning experience.

#### 3.2.2 RAM Simulation

The simulation of the altsync RAM was the most challenging part of the project. The first challenge was to find the library in which the module is defined; running grep system-wide located it in the EDA simulation directory of Quartus. However, the RAM that Altera uses contains a lot of tri-state logic, which prevents the data from reaching the output q port during Verilator simulation. In Verilator's defence, it did warn about this tri-state logic. Converting all such tri-states to wires finally made the altsync RAM easy to simulate in Verilator<sup>1</sup>. Simulation results for the particular sections are shown in the Evaluation chapter.

### 3.2.3 Conclusion

Verilator provides an easy-to-use platform that is fast and simulates what actually goes into the hardware. Its compilation time is negligible compared to Quartus, and it provides a natural way to feed any random signal into the model so that it can be tested to its limit. Furthermore, the output signals can be verified simply by scripting over the generated signals. It does have a steep learning curve, but simulating is worth the effort.

<sup>1</sup>There is a script by Todd Strader at https://github.com/twosigma/verilator\_support that can convert tri-state logic directly to wire logic. For a quick solution, convert the tri-state logic to wires by hand and mark that section with the appropriate Verilator lint-off comments.



# 4.1 Interfacing with Software

This is the front interface of the hardware. The packet data coming from the user space is received here. The main function of this module is to channel the packet data into appropriate RAMs. These RAMs are symbolic of the input ports of a network switch.

# 4.1.1 Memory - RAM

The Random Access Memory (RAM) modules are used in the system to model the input and output ports. They are implemented on the FPGA as embedded memory IP blocks supported by Altera's Mega Wizard plugin in the Altera Quartus software.

A RAM is a type of computer data storage that allows stored data items to be accessed quickly. It typically has fast read and write times, but it is a form of volatile memory that loses its stored data when it loses power.

# The Altera Embedded Memory IP Block

The RAM modules implemented on the FPGA are simple dual-port RAMs. This type supports one read and one write operation to different locations at the same time, which is important in this system to minimise the number of clock cycles required to access data from the RAMs. Figure 4.1 below shows the inputs and outputs configured for the RAM module. It takes its clock from the overall system clock, has a word length of 32 bits and a storage space of 4096 words. Accesses are controlled by the input signals rden (read enable) and wren (write enable).

Figure 4.2 shows the timing diagram of the Altera RAM module.



Figure 4.1: Snippet of code showing the inputs and outputs attached to the RAM module



Figure 4.2: Timing diagram of the read and write operations of the RAM
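To make the read and write behaviour concrete, the following is a minimal behavioural sketch of a simple dual-port RAM with a registered read output. It is a Python model written for illustration, not the Altera IP: the class name, the `clock` method, the `wraddress`/`rdaddress` argument names and the choice that a simultaneous read and write to the same address returns the old word are assumptions.

```python
class SimpleDualPortRam:
    """Behavioural model: one write port, one read port, registered read output."""

    def __init__(self, depth=4096, width=32):
        self.mem = [0] * depth
        self.mask = (1 << width) - 1
        self.q = 0  # read data register, holds its value while rden is low

    def clock(self, wren=False, wraddress=0, data=0, rden=False, rdaddress=0):
        """Advance one clock edge and return the value now visible on q."""
        if rden:
            self.q = self.mem[rdaddress]  # assumed: a read sees the old word
        if wren:
            self.mem[wraddress] = data & self.mask
        return self.q


ram = SimpleDualPortRam()
ram.clock(wren=True, wraddress=5, data=0xDEADBEEF)
# Read address 5 while writing address 6 on the same clock edge.
ram.clock(wren=True, wraddress=6, data=0x1, rden=True, rdaddress=5)
```

The second call exercises the property the design depends on: a read from one location and a write to another complete on the same clock edge.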

## 4.2 Network Fabric

A network switching fabric is the hardware topology that is responsible for transporting each input packet to its respective output port. The network fabric employed in this project is the crossbar architecture, a network topology in the form of a matrix, as shown in Figure 4.3.

#### 4.2.1 Crossbar Switch Model

In this project, a single-layer 4×4 topology with 4 inputs and 4 outputs is used. Figure 4.3 illustrates how every input is connected to every output at the intersections of the matrix, termed crosspoints. The crossbar switch model is implemented on the FPGA. Each input-to-output connection is completely independent of the others, so the fabric supports simultaneous communications, except when two input ports wish to use the same output port.



Figure 4.3: Illustration of the Crossbar Architecture that will be responsible for the network switching fabric

#### How the Crossbar Switch Works

The crossbar switch architecture works in a similar way to active addressing in a light-emitting diode (LED) matrix. The inputs are connected to every output by lines that can be turned on and off depending on the destination of the source packet. For example, in Figure 4.3 the orange line shows how input 1 sends a packet through the network fabric to output 2 by turning on its horizontal line and the vertical line that corresponds to output 2. As mentioned earlier, the lines are independent of one another, so in a single time slot both input 1 and input 2 can send packets to outputs 2 and 3 respectively without colliding. Theoretically, and in some cases practically, it is possible to get n<sup>1</sup> packets at the outputs.

#### Implementation in the system

In our implementation, 16 RAM modules are used at the input ports of the network fabric, so each input port has exactly 4 RAMs, one for each output port, as shown in Figure 4.4. They have the capacity and word length mentioned above. The function of these input RAMs is to store the packets sorted by output port. When the scheduling algorithm runs and selects a packet to be routed to an output port, the stored packet is read out of the input RAM and routed through the network fabric. The same type of RAM is used at the output ports to store the routed packets.

<sup>1</sup>where n is the number of output ports, in this case 4



Figure 4.4: An exploded view of a single input port to the network fabric
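The sorting step at a single input port can be sketched as follows. This is an illustrative Python model rather than the hardware code: the function name `sort_into_rams` is hypothetical, and it assumes that the low 8 bits of the header word carry a valid destination port, that the header word itself is non-zero, and that each packet ends with a 32-bit zero word that is stored along with it.

```python
def sort_into_rams(words, n_outputs=4):
    """Split one input port's word stream into one list per destination port."""
    rams = [[] for _ in range(n_outputs)]
    dest = None                  # None means the next word is a packet header
    for word in words:
        if dest is None:
            dest = word & 0xFF   # assumed: low 8 bits of the header hold the port
        rams[dest].append(word)
        if word == 0:            # a 32-bit zero word ends the current packet
            dest = None
    return rams


# Two packets arriving back to back on the same input port.
stream = [0x1102, 0xAAAA, 0xBBBB, 0, 0x2200, 0xCCCC, 0]
rams = sort_into_rams(stream)
```

Each of the four lists stands in for one of the per-destination RAMs of Figure 4.4, so a packet for output 2 never sits behind a packet for output 0.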

# 4.3 Scheduling Algorithm

The Scheduling algorithm is at the core of the network switch. The scheduler makes all the decisions regarding the routing of the packets. It looks at the incoming packets and based on the header decides the output port of the packet and routes the content accordingly.

The first priority of the scheduler is always correctness: none of the packets may be lost, and all the information must be transferred as required. Efficiency comes second, that is, how fast the scheduler can route the data on the input ports using the least number of clock cycles. The scheduling algorithm for our project was developed in these same two stages.

Another important point is that the hardware code is agnostic to the size of the packet. A packet starts with a header containing the port information, followed by an unknown number of 32-bit words and finally a 32-bit zero value that marks the end of the packet. Once the zero value is encountered, the Scheduler knows that the packet has ended and prepares itself for the next packet.

One performance constraint on both designs comes from the RAMs used to model the input and output ports: 3 clock cycles are required to analyze and transfer each 32-bit segment of data. It takes one clock cycle to increment the address and another for the output to appear. With only these two clock cycles, the data would sometimes appear late, resulting in consistent errors, so a third clock cycle is spent to make sure that the data has stabilized. The speeds achieved can therefore be scaled by an appropriate factor when considering a real-world scenario.

The two Scheduler designs are discussed below. The initial design focuses only on correctness, while the second tries to improve performance with some additional hardware logic.

## 4.3.1 Single Input Queue Scheduler

The initial design of the scheduler was a simple crossbar switch. The scheduler looks at the head of the packets on different ports, and simply routes the data according to that information. In case of collision, the data is transferred one by one, holding the data at one of the ports while the other one is transferred and then transferring the data from the next port.

In case of a collision, preference is always given to the lower-numbered port. If there are two packets at ports 1 and 2, both waiting to go to output port 2, preference is always given to the packet at port 1. The upside of this approach is that it is very simple to implement. The downside is that if the next packet on port 1 also has to go to port 2, it still precedes the packet on port 2. This can lead to starvation: theoretically, the port 2 packet may never flow through if all the packets on port 1 are destined for port 2.

The scheduler achieves this by storing state for each input port. One variable per port stores the destination port of the packet currently coming in through that input port. Another variable indicates the end-of-packet signal, which means the scheduler has to refresh its transfer information in the next cycle.
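The fixed-priority rule and the starvation it can cause can be illustrated with a small Python sketch. It is a behavioural illustration only, not the contents of oScheduler.sv; the function name `arbitrate`, the zero-based port numbers and the use of `None` for an idle input are assumptions.

```python
def arbitrate(heads):
    """heads[i] is the output port wanted by input i, or None if input i is idle.

    Returns {output_port: granted_input}; the lowest-numbered input wins.
    """
    grants = {}
    for inp, out in enumerate(heads):
        if out is not None and out not in grants:
            grants[out] = inp
    return grants


# Input 0 always has another packet for output 2, so input 1 is never granted.
history = [arbitrate([2, 2, None, None]) for _ in range(5)]
```

In every entry of `history` output 2 is granted to input 0, which is the starvation scenario described above. A rotating priority would avoid it, at the cost of extra state.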

#### **PPS Architecture**

The second part of the project attempts to optimize performance by adding hardware complexity. Instead of one RAM per input port, four RAMs (the number of output ports) are used per input port. From the user space, packets are still sent to four input ports rather than sixteen; a layer inside the hardware divides these packets based on their output destination.

Figure 4.5 shows what the Scheduler looks like. The inp[4][4] signals are the inputs from the RAMs, and the input\_ram\_rd\_add signals control the RAM address from which data is read. The outp[4] signals are the outputs that carry the packets to their destination ports, and the out\_ram\_wr signals control writes to the output RAMs.

This helps because segregating packets by destination port greatly improves timing efficiency. A packet meant for output port two does not have to wait behind another packet meant for output port one. It eliminates the time in which one or more output ports sit idle because none of their packets were at the front of the queue on their respective input ports.



Figure 4.5: Block Diagram of the Scheduler

This approach optimizes across different output ports but still faces the starvation problem of the initial design, because preference still always goes to the packet on port one. It nevertheless performs better, because packets for other ports do not get stuck when the front packet cannot pass through.
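The difference between the two designs can be sketched with a toy Python model that counts time slots. It is a simplification under stated assumptions, not a model of Scheduler.sv: packets are treated as unit length, ports are zero-based, the function names are hypothetical, and in the PPS model each output is assumed to drain its own per-destination RAMs independently, so one input may feed several outputs in the same slot.

```python
from collections import deque


def single_queue_slots(queues):
    """Slots needed when each input has one FIFO and only its head can be sent."""
    qs = [deque(q) for q in queues]
    slots = 0
    while any(qs):
        busy = set()
        for q in qs:                     # lower-numbered inputs are served first
            if q and q[0] not in busy:
                busy.add(q.popleft())
        slots += 1
    return slots


def pps_slots(queues, n_out=4):
    """Slots needed when each input keeps one queue per destination port."""
    voq = [[deque() for _ in range(n_out)] for _ in queues]
    for i, q in enumerate(queues):
        for dest in q:
            voq[i][dest].append(dest)
    slots = 0
    while any(v for row in voq for v in row):
        for out in range(n_out):
            for row in voq:              # lowest-numbered input with a packet wins
                if row[out]:
                    row[out].popleft()
                    break
        slots += 1
    return slots


# Every input holds packets for outputs 0, 1, 2, 3 in that order.
blocked = [[0, 1, 2, 3]] * 4
# Every input holds one packet for output 2.
same_port = [[2], [2], [2], [2]]
```

With the `blocked` pattern the single-queue model needs 7 slots and the per-destination model needs 4, because the heads of all four queues initially compete for output 0. With `same_port` both models need 4 slots, which matches the observation that the worst case is the same for both designs.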

Figures 4.6 and 4.7 give the timing diagrams of the implemented scheduler. Figure 4.6 shows the situation where packets for different output ports appear on the input ports, while Figure 4.7 shows the case where all packets are meant for the same output port.

### 4.3.2 Performance Comparison

The performance of the two scheduling algorithms is compared to investigate the differences. The tests are random runs of 100 packets, with lengths ranging from 4 to 64, spread uniformly across the source ports but randomly across the destination ports. Figure 4.8 shows the relative average speeds achieved with the two Schedulers. The performance statistics for the initial design are shown in red, while blue highlights the performance of the optimized scheduler.



Figure 4.6: Scheduler timing diagram showing packets to different output ports

[Waveform, 90 ns to 210 ns: signals clk, inp(0)(0)[31:0] to inp(3)(0)[31:0], outp(0)[31:0] and total\_time[31:0]. The four inputs hold the words 00000001 to 00000004, which appear one after another on outp(0).]

Figure 4.7: Scheduler timing diagram showing packets to the same output port

It is clearly visible that the optimized algorithm performs better, both on average and in the best case. The lower performance of the initial design is consistent with what is expected from the algorithm. While the average performance differs, the worst case for both designs occurs when all the packets are scheduled to the same output port. The speed in that case is calculated as follows, since 32 bits of data are transferred every three clock cycles:

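$$\text{Speed} = \frac{32}{3}\ \text{bits/cycle}$$

$$\text{Speed} = \frac{32}{3 \times 20 \times 10^{-9}}\ \text{bits/s}$$

$$\text{Speed} = 0.533 \times 10^{9}\ \text{bits/s}$$

$$\text{Speed} = 508.626\ \text{Mb/s}$$

Here the clock period is 20 ns, and the final figure takes 1 Mb as $2^{20}$ bits.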

This comes out to be consistent with the data presented in the test runs. If the test bench is modified to ensure that all packets are sent to the same output port, the transfer speed matches the above mentioned speed up to three decimal places.
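The arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the conversion factor is an inference from the numbers rather than something stated in the text: the 508.626 figure is reproduced when 1 Mb is taken as 2**20 bits with a 20 ns clock period.

```python
def worst_case_speed_mb_per_s(word_bits=32, cycles_per_word=3, clock_period_s=20e-9):
    """Speed when one word is transferred every `cycles_per_word` clock cycles."""
    bits_per_second = word_bits / (cycles_per_word * clock_period_s)
    return bits_per_second / 2**20   # inferred: 1 Mb = 2**20 bits


def speed_from_cycles(total_bits, total_clock_cycles, clock_period_s=20e-9):
    """Speed for a measured run, given the cycle count reported by the hardware."""
    return total_bits / (total_clock_cycles * clock_period_s) / 2**20


speed = worst_case_speed_mb_per_s()
```

The second function applies the same conversion to a measured cycle count, for example 100 words of 32 bits transferred in 300 cycles, which gives the same worst-case value.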

This also gives a logical explanation for the wider variance seen in the graph of the optimized algorithm compared to the initial one. Since the spectrum for the optimized algorithm is higher, with a higher average, the variation and the peaks are also higher.



Figure 4.8: Comparison plot of the performance achieved with the two Scheduler algorithms

## 4.3.3 The whole suite

Figure 4.9 shows the block diagram of the entire suite and its interaction signals with the Avalon slave bus. Figures 4.10 and 4.11 show the timing diagrams of the flow of packets to and from the user space. It can be seen that the integrity of the packets is maintained as they move through the system.



Figure 4.9: Block diagram of the entire suite and its interaction with the Avalon slave bus



Figure 4.10: Flow of packet from the user space



Figure 4.11: Flow of packet to the user space



The software side talks to the hardware using ioctl32 calls. It generates the packets that are to be routed through the Switch, then reads them back from the output RAMs, regenerates each packet locally and verifies it. If all packets pass verification, it calculates the throughput of the Switch for that iteration.

# 5.1 Implementation details

The userspace consists of:

- Packet Generator: generates a seeded random number of packets, up to NUM\_PACKETS (defined in packetgen.h).
- Validator: after the packet transfer is complete, runs over the contents of each RAM, verifying that the contents are consistent and that each packet is on the matching port.

#### 5.1.1 Packet Generator

The Packet Generator consists of packetgen.c and packetgen.h, with main.c as the entry point to these modules. Based on the seeded input value, it generates a 32-bit random number in which each group of 8 bits, except the 2 most significant bits, has its own minimum requirement; these are again defined in the packetgen.h header file.

Once the packet generator has sent all the packets using ioctl32 calls, and before shutting itself off, it sends the WRITE\_ENABLE\_SCHEDULER and READ\_ENABLE opcodes to the module, which starts the Scheduler.

#### 5.1.2 Validator

After the FPGA processing, when the packets have been routed to their appropriate output ports (modelled by memory locations), the validator runs and checks that the packets are stored in the correct memory locations. As discussed above, the generated packet consists of a random sequence of bits, with the last two bits representing the destination port, as listed in Table 5.1. The validator makes sure that this value matches the memory space in which the packet is stored and reports any errors encountered.

| Last Bits | Output Port |
|-----------|-------------|
| 0000 0000 | Port 0      |
| 0000 0001 | Port 1      |
| 0000 0010 | Port 2      |
| 0000 0011 | Port 3      |

Table 5.1: I/O Mapping from RAM

The validator validates the packets received from the output RAMs. Using the seed stored in each packet, it seeds its own generator and regenerates the packet locally. Each octet of the received packet is then compared against the locally generated octet; if they match, that part of the packet is marked OK and the validator moves on to the next one. Should a packet not match the data regenerated from its seed, an error is raised and the program exits. For the simulation scenario, it is ensured that no packets are dropped and none are stored in the wrong place.
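The regenerate-and-compare idea can be sketched in Python as follows. This is not validator.c: the function names are hypothetical, Python's `random.Random` stands in for whatever generator the project uses, and instead of exiting on a mismatch the sketch returns the index of the first differing octet.

```python
import random


def regenerate_octets(seed, n_octets):
    """Recreate the payload octets that a generator seeded with `seed` emits."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(n_octets)]


def validate(seed, received_octets):
    """Return the index of the first mismatching octet, or None if all match."""
    expected = regenerate_octets(seed, len(received_octets))
    for i, (got, want) in enumerate(zip(received_octets, expected)):
        if got != want:
            return i
    return None


good = regenerate_octets(42, 16)   # what an intact packet would carry
bad = list(good)
bad[5] ^= 0x01                     # flip one bit to emulate corruption
```

Here `validate(42, good)` reports no mismatch, while `validate(42, bad)` points at octet 5. Only the seed has to travel with the packet, not a second copy of the payload.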

After all the validation passes, the validator sends an opcode to the FPGA, which returns the total\_clock\_cycles it took to transmit the data. From this, the throughput of the Switch for that iteration can be calculated. It has been observed that for both the PPS and the Single Input Queue architectures the throughput remains fairly constant, with a swing of about $\pm$ 200 Mbits/s around the average, which is reasonable given the packets being sent. Because the packets are generated randomly, the generation may be skewed towards a particular output port; the packets for that port are then queued, which increases the number of cycles and hence the transfer time. This issue is common to both architectures: when all the packets go to only one port, only a single packet is transferred in each clock cycle, in other words the packet transfer becomes linear.



# 6. Evaluation

# **6.1** FPGA Switch Performance

Having implemented and tested the functionality of the FPGA switch, the next step would be to evaluate the performance of the switch and how it performs under various types of load.



Figure 6.1: Plot showing the data transfer speed over 25 iterations

The first test measures the average data transfer speed that the switch is capable of achieving, shown in Figure 6.1. There are large fluctuations in the plot, and the points circled in blue are worth noting. These points are outliers that occur whenever a concentrated number of packets is sent to a specific port. Because the packets are distributed randomly across the destination ports, there will be cases where, for example, 50 of the 100 generated packets are destined for output port 3. Such a case results in an outlier where the data transfer speed is severely reduced, because more clock cycles are required to process the packets concentrated on a single output port. Even including these outliers, the average transfer rate of the switch remains at approximately 1400 Megabits/second (Mbps).



Figure 6.2: Plot showing the number of clock cycles needed to process a given data size

The second test increases the total amount of data transferred to investigate how the number of clock cycles required to complete the routing changes. Figure 6.2 shows that the number of clock cycles required increases linearly with the data size. The gradient of the line reflects the data transfer speed at that point, and since the plot is linear, the transfer speed of the switch remains constant regardless of the data size. This means that under heavy data loads the switch is still able to perform at its maximum capacity.



Using the simulated Switch implemented in hardware, it can be concluded that for real cases in which packet arrival can be modelled as Poisson, the PPS architecture is helpful, because head-of-line blocking is avoided and throughput increases, as can be seen from the results. However, it is futile to expect the throughput to scale with the number of input ports, because of the distribution with which the packets arrive. This can be mitigated by using fixed input ports for particular packet sizes, but it is not entirely avoidable.

# 7.1 Lessons Learnt

Throughout the course of the project, there are a few lessons to be learnt:

- Hardware is hard. Programming hardware is very different from working purely on software. In software, logic takes precedence: good logic means an efficient and correctly functioning piece of code. In hardware, the logic has to be perfect, but the timing of every hardware module must also be taken into consideration for data stability reasons. A few modules might be slow in giving out their data, so not everything works according to the designed clock cycles.
- Simulating the hardware in software that produces timing diagrams is the best way to debug hardware issues. Timing diagrams, while tedious and time-consuming to set up, provide insight into what the hardware is actually doing, saving time in the end.
- While simulations provide insight, they may not be truly representative of what actually goes on in hardware. More often than not the simulations hold true, but every once in a while they go off on a tangent, so always check what the actual hardware is telling you. For example, there were cases in which Verilator would simulate the altsync RAM but there would be no output on the q port. Furthermore, Quartus limits the number of iterations in a for loop to 255, whereas Verilator is platform agnostic and compiles the code anyway, so logic may work in simulation only to fail in hardware.
- Hardware documentation is not as robust as what is available for open-source software such as Python libraries. Simulating and creating test benches early and often is a good way to debug. The simulated code should also be as close as possible to the code synthesized in Quartus.

# 7.2 Future Work

The project can be taken forward in the following directions:

- DMA can be implemented so that the simulation can be run for a large number of packets.
- Different scheduling algorithms that are less greedy in practice can be simulated to compare their performance.
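
As one example of a less greedy policy, a round-robin arbiter rotates priority among the inputs, so an input that was just served is considered last on the next grant. The C sketch below is hypothetical and is not part of the project's code: `round_robin_grant`, its arguments, and the convention of starting the search one position after the last granted input are all assumptions made for illustration.

```c
#include <assert.h>

#define PORTS 4

/* Returns the index of the granted input, or -1 if no input is requesting.
 * request[i] is non-zero when input i has a word for this output.
 * *last holds the previously granted input and is updated on a grant. */
int round_robin_grant(const int request[PORTS], int *last)
{
    assert(*last >= 0 && *last < PORTS);
    for (int step = 1; step <= PORTS; step++) {
        int candidate = (*last + step) % PORTS;
        if (request[candidate]) {
            *last = candidate;
            return candidate;
        }
    }
    return -1;   /* nothing to schedule; *last is left unchanged */
}
```

One arbiter of this kind would be needed per output port. When inputs 0 and 2 both request continuously, the grants alternate between them instead of always favouring the lowest index.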



Figure 7.1: Finally

# 8. Appendix

# 8.1 File Listings

Following are the files included:

- 1. Hardware
  - (a) **VGA\_LED.sv** Interfaces with the Avalon slave. Responsible for handling the incoming packets from the slave bus.
  - (b) **Scheduler.sv** Routes the data through the switch. Contains the PPS algorithm.
  - (c) **Buffer.sv** Also interfaces with the Avalon slave. Responsible for sending the packets through the slave bus to the validator.
  - (d) **oScheduler.sv** The old scheduler implementation with single input queues. For reference purposes.
  - (e) Does not include the RAM IP core files generated by the MegaWizard, which are required for simulation.
- 2. Verilator
  - (a) **vgacounter.cpp** C++ file used to simulate the top level and produce its VCD. Includes appropriate signal changes throughout the switch.
  - (b) **schedulercounter.cpp** C++ file to simulate the scheduler in Verilator.
  - (c) **ramcounter.cpp** C++ file to simulate the RAM.
  - (d) **buffercounter.cpp**
  - (e) Makefile
  - (f) Does not include the modified RAM files used by Verilator for compilation.
- 3. Software
  - (a) **main.c** Top-level file used to generate and send packets through the Avalon bus.
  - (b) **packetgen.h** Header file for the packet generator.
  - (c) **packetgen.c** Packet generator file responsible for generating the packets. Used by main.c.
  - (d) **validator.c** Validator responsible for reading the packets from the output RAMs. Validates the count, sequence and length, and also calculates the transfer speed.
  - (e) **vga\_led.h** Header file for vga\_led.c.
  - (f) **vga\_led.c** Similar to the vga\_led.c included as part of lab3, with code changes to support 32-bit transfers, 4-bit addresses and reads from the slave.
  - (g) **Makefile** `make main` for main.c and `make validator` for validator.c. `make` for vga\_led.c and `insmod` for installing it into the kernel.

### Hardware: VGA\_LED.sv

```
//START_MODULE_NAME
//
// Module Name      : VGA_LED
//
// Description      : Reads values from RAMS and enables the scheduler.
//
// Limitation       : None
//
// Results expected : Enables the Scheduler and Buffer, communicate with
//                    ioctl.
//
//END_MODULE_NAME

module VGA_LED(input logic         clk,
               input logic         reset,
               input logic [31:0]  writedata,
               input logic         write, read,
               input               chipselect,
               input logic [3:0]   address,

               output logic [7:0]  VGA_R, VGA_G, VGA_B,
               output logic        VGA_CLK, VGA_HS, VGA_VS, VGA_BLANK_n,
               output logic        VGA_SYNC_n,
               output logic [31:0] readdata);

    // Naming convention is the part of module the signal is for followed by
    // the use of the signal, written in camel case. For example, fifo_in
    logic [31:0] inp[4][4], outp[4], input_ram_wr_in[4][4];
    logic [11:0] input_ram_rd_add[4][4], input_ram_wr_add[4][4];
    logic        input_ram_rden[4][4], input_ram_wren[4][4];
    logic        out_ram_wr[4];
    // logic signals to enable write and read to the output RAM.
    logic        write_enable, read_enable;
    // signal to reset rams. Not being used right now. Was giving us problems.
    // We burn the hardware again after each test run of packets.
    // The only visible option.
    logic [1:0]  reset_rams;
    // Calculates the number of clock cycles it takes to transfer the entire
    // data from the input rams. Necessary to calculate the effective speed.
    logic [31:0] total_time;
    logic [1:0]  port[4];
    logic        eop[4];
```

```
    initial begin
        reset_rams = 0; write_enable = 0; read_enable = 0; total_time = 0;
        for (int i=0; i<4; i++) begin
            for (int j=0; j<4; j++) begin
                input_ram_rd_add[i][j] = 0;
                input_ram_wr_add[i][j] = 0;
                input_ram_rden[i][j] = 0;
                input_ram_wren[i][j] = 0;
            end
            port[i] = 0;
            eop[i] = 1;
        end
    end

    // Incoming packets modeled as 16 rams, 1 for each combination of input and
    // output port
    RAM input_ram00(.clock(clk), .data(input_ram_wr_in[0][0]),
        .rdaddress(input_ram_rd_add[0][0]), .rden(input_ram_rden[0][0]),
        .wraddress(input_ram_wr_add[0][0]), .wren(input_ram_wren[0][0]),
        .q(inp[0][0]));
    RAM input_ram01(.clock(clk), .data(input_ram_wr_in[0][1]),
        .rdaddress(input_ram_rd_add[0][1]), .rden(input_ram_rden[0][1]),
        .wraddress(input_ram_wr_add[0][1]), .wren(input_ram_wren[0][1]),
        .q(inp[0][1]));
    RAM input_ram02(.clock(clk), .data(input_ram_wr_in[0][2]),
        .rdaddress(input_ram_rd_add[0][2]), .rden(input_ram_rden[0][2]),
        .wraddress(input_ram_wr_add[0][2]), .wren(input_ram_wren[0][2]),
        .q(inp[0][2]));
    RAM input_ram03(.clock(clk), .data(input_ram_wr_in[0][3]),
        .rdaddress(input_ram_rd_add[0][3]), .rden(input_ram_rden[0][3]),
        .wraddress(input_ram_wr_add[0][3]), .wren(input_ram_wren[0][3]),
        .q(inp[0][3]));
    RAM input_ram10(.clock(clk), .data(input_ram_wr_in[1][0]),
        .rdaddress(input_ram_rd_add[1][0]), .rden(input_ram_rden[1][0]),
        .wraddress(input_ram_wr_add[1][0]), .wren(input_ram_wren[1][0]),
        .q(inp[1][0]));
    RAM input_ram11(.clock(clk), .data(input_ram_wr_in[1][1]),
        .rdaddress(input_ram_rd_add[1][1]), .rden(input_ram_rden[1][1]),
        .wraddress(input_ram_wr_add[1][1]), .wren(input_ram_wren[1][1]),
        .q(inp[1][1]));
    RAM input_ram12(.clock(clk), .data(input_ram_wr_in[1][2]),
        .rdaddress(input_ram_rd_add[1][2]), .rden(input_ram_rden[1][2]),
        .wraddress(input_ram_wr_add[1][2]), .wren(input_ram_wren[1][2]),
        .q(inp[1][2]));
    RAM input_ram13(.clock(clk), .data(input_ram_wr_in[1][3]),
        .rdaddress(input_ram_rd_add[1][3]), .rden(input_ram_rden[1][3]),
        .wraddress(input_ram_wr_add[1][3]), .wren(input_ram_wren[1][3]),
        .q(inp[1][3]));
    RAM input_ram20(.clock(clk), .data(input_ram_wr_in[2][0]),
        .rdaddress(input_ram_rd_add[2][0]), .rden(input_ram_rden[2][0]),
```

```
        .wraddress(input_ram_wr_add[2][0]), .wren(input_ram_wren[2][0]),
        .q(inp[2][0]));
    RAM input_ram21(.clock(clk), .data(input_ram_wr_in[2][1]),
        .rdaddress(input_ram_rd_add[2][1]), .rden(input_ram_rden[2][1]),
        .wraddress(input_ram_wr_add[2][1]), .wren(input_ram_wren[2][1]),
        .q(inp[2][1]));
    RAM input_ram22(.clock(clk), .data(input_ram_wr_in[2][2]),
        .rdaddress(input_ram_rd_add[2][2]), .rden(input_ram_rden[2][2]),
        .wraddress(input_ram_wr_add[2][2]), .wren(input_ram_wren[2][2]),
        .q(inp[2][2]));
    RAM input_ram23(.clock(clk), .data(input_ram_wr_in[2][3]),
        .rdaddress(input_ram_rd_add[2][3]), .rden(input_ram_rden[2][3]),
        .wraddress(input_ram_wr_add[2][3]), .wren(input_ram_wren[2][3]),
        .q(inp[2][3]));
    RAM input_ram30(.clock(clk), .data(input_ram_wr_in[3][0]),
        .rdaddress(input_ram_rd_add[3][0]), .rden(input_ram_rden[3][0]),
        .wraddress(input_ram_wr_add[3][0]), .wren(input_ram_wren[3][0]),
        .q(inp[3][0]));
    RAM input_ram31(.clock(clk), .data(input_ram_wr_in[3][1]),
        .rdaddress(input_ram_rd_add[3][1]), .rden(input_ram_rden[3][1]),
        .wraddress(input_ram_wr_add[3][1]), .wren(input_ram_wren[3][1]),
        .q(inp[3][1]));
    RAM input_ram32(.clock(clk), .data(input_ram_wr_in[3][2]),
        .rdaddress(input_ram_rd_add[3][2]), .rden(input_ram_rden[3][2]),
        .wraddress(input_ram_wr_add[3][2]), .wren(input_ram_wren[3][2]),
        .q(inp[3][2]));
    RAM input_ram33(.clock(clk), .data(input_ram_wr_in[3][3]),
        .rdaddress(input_ram_rd_add[3][3]), .rden(input_ram_rden[3][3]),
        .wraddress(input_ram_wr_add[3][3]), .wren(input_ram_wren[3][3]),
        .q(inp[3][3]));

    Scheduler scheduler(.*);
    Buffer buffer(.*);

    always_ff @(posedge clk) begin
        if (reset_rams == 1) begin
            reset_rams = 2;
        end
        else if (reset_rams == 2) begin
            reset_rams = 0;
        end

        for (int i=0; i<4; i++) begin
            for (int j=0; j<4; j++) begin
                if (input_ram_wren[i][j]) begin
                    input_ram_wren[i][j] = 0;
                    input_ram_wr_add[i][j] = input_ram_wr_add[i][j] + 1;
                end
            end
        end

        if (chipselect && write) begin
```

```
            case (address)
                0 : begin
                    // If the previous packet has finished
                    // transferring (characterized by 32 bit zero values), the
                    // port information has to be re-established from the
                    // packet header.
                    if (eop[0] && writedata) begin
                        eop[0] = 0;
                        port[0] = writedata[1:0];
                    end
                    // If in between transfer of a packet, continue
                    // transferring to the same port.
                    if (!eop[0]) begin
                        for (int i=0; i<4; i++) begin
                            if (port[0] == i) begin
                                input_ram_wr_in[0][i] = writedata;
                                input_ram_wren[0][i] = 1;
                            end
                        end
                    end
                    // If the end of packet is reached (32 bit zero value), eop
                    // signal is set to high. In the next cycle the port
                    // information will be re-established.
                    if (!writedata) begin
                        eop[0] = 1;
                    end
                end

                1 : begin
                    if (eop[1] && writedata) begin
                        eop[1] = 0;
                        port[1] = writedata[1:0];
                    end
                    if (!eop[1]) begin
                        for (int i=0; i<4; i++) begin
                            if (port[1] == i) begin
                                input_ram_wr_in[1][i] = writedata;
                                input_ram_wren[1][i] = 1;
                            end
                        end
                    end
                    if (!writedata) begin
                        eop[1] = 1;
                    end
                end

                2 : begin
                    if (eop[2] && writedata) begin
                        eop[2] = 0;
                        port[2] = writedata[1:0];
                    end
```

```
                    if (!eop[2]) begin
                        for (int i=0; i<4; i++) begin
                            if (port[2] == i) begin
                                input_ram_wr_in[2][i] = writedata;
                                input_ram_wren[2][i] = 1;
                            end
                        end
                    end
                    if (!writedata) begin
                        eop[2] = 1;
                    end
                end

                3 : begin
                    if (eop[3] && writedata) begin
                        eop[3] = 0;
                        port[3] = writedata[1:0];
                    end
                    if (!eop[3]) begin
                        for (int i=0; i<4; i++) begin
                            if (port[3] == i) begin
                                input_ram_wr_in[3][i] = writedata;
                                input_ram_wren[3][i] = 1;
                            end
                        end
                    end
                    if (!writedata) begin
                        eop[3] = 1;
                    end
                end
                // Special signal to control the flow of data within the
                // switch from input port to output port. Required to
                // specifically determine the number of cycles it took for
                // data transfer and hence the speed.
                15 : write_enable = 1;
                // Controls the read from the output rams. Not really
                // necessary, but we have added this in our user space code
                // and may have a valid use case.
                14 : read_enable = 1;
                // Reset all rams. Not being used.
                13 : begin
                    for (int i=0; i<4; i++) begin
                        for (int j=0; j<4; j++) begin
                            input_ram_wr_add[i][j] = 0;
                        end
                    end
                    reset_rams = 1;
                end
            endcase
        end
        else begin
```
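
The framing rule applied by each `case` branch above can be summarised in a small software model: after an end of packet, the first non-zero word is the header and its two low bits select the destination queue; every word of the packet, including the terminating zero word, is stored in that queue; a zero word ends the packet. The C sketch below is a simplified illustration of that rule for one input port. The type and function names and the `DEPTH` bound are assumptions, and the one-cycle delay between `wren` and the address increment in the hardware is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORTS 4
#define DEPTH 64

typedef struct {
    int      eop;                 /* 1 when the previous packet has ended  */
    unsigned port;                /* destination of the packet in flight   */
    uint32_t ram[PORTS][DEPTH];   /* one queue per destination port        */
    size_t   wr_add[PORTS];       /* write address of each queue           */
} input_port_t;

void input_port_init(input_port_t *p)
{
    p->eop = 1;
    p->port = 0;
    for (int i = 0; i < PORTS; i++)
        p->wr_add[i] = 0;
}

/* Feed one 32-bit word, as written over the bus to this input port. */
void input_port_write(input_port_t *p, uint32_t word)
{
    if (p->eop && word != 0) {    /* header word: latch the destination */
        p->eop = 0;
        p->port = word & 0x3u;
    }
    if (!p->eop) {                /* inside a packet: store the word */
        assert(p->wr_add[p->port] < DEPTH);
        p->ram[p->port][p->wr_add[p->port]++] = word;
    }
    if (word == 0)                /* zero word terminates the packet */
        p->eop = 1;
}
```

Note that a zero word arriving while no packet is in flight is dropped rather than stored, which matches the order of the three `if` statements in the listing.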

### Hardware: Scheduler.sv

```
//START_MODULE_NAME
//
// Module Name      : Scheduler
//
// Description      : Reads values from RAMS and schedules to prevent
//                    collisions.
//
// Limitation       : NONE
//
// Results expected : Packets routed to proper ports.
//
//END_MODULE_NAME

module Scheduler (input logic         clk,
                  input logic [31:0]  inp[4][4],
                  input logic         write_enable,
                  input logic [1:0]   reset_rams,
                  input logic [11:0]  input_ram_wr_add[4][4],

                  output logic [31:0] total_time,
                  output logic        out_ram_wr[4],
                  output logic [31:0] outp[4],
                  output logic [11:0] input_ram_rd_add[4][4],
                  output logic        input_ram_rden[4][4]);

    // Write cycle, to make sure that the signal is stable on the output
    // of the RAMs. It usually takes three clock cycles for the data to
    // stabilize: one clock for the address to be incremented, second for
    // data to appear on the output wire. Theoretically, it should take two,
    // but sometimes there was a delay and it didn't. Hence, added the third.
    logic [1:0] write_cycle;
    // For end of packets.
    logic       eop[4];
    // Source packet information.
    logic [1:0] sport[4];
    // To determine if the total time should be incremented.
    logic       time_inc;

    initial begin
        write_cycle = 0;
        for (int i=0; i<4; i++) begin
            eop[i] = 1;
            sport[i] = 0;
```

```
        end
    end

    always_ff @(posedge clk) begin
        // Reset ram code. Not being used.
        if (reset_rams) begin
            for (int i=0; i<4; i++) begin
                for (int j=0; j<4; j++) begin
                    input_ram_rd_add[i][j] = 0;
                end
            end
        end

        // If the write enable is high.
        if (write_enable) begin
            time_inc = 0;
            for (int i=0; i<4; i++) begin
                for (int j=0; j<4; j++) begin
                    // We tried setting these read signals high once and for
                    // all in the beginning, but if the ram is empty this
                    // tends to go low. Hence, doing this in every cycle. May
                    // be a better way, but going with brute force to avoid
                    // any unnecessary nuisance.
                    input_ram_rden[i][j] = 1;
                    // time_inc = 1 if for any ram, the read address is less
                    // than the write address, which means there is data to be
                    // read.
                    time_inc = time_inc |
                               (input_ram_rd_add[i][j] < input_ram_wr_add[i][j]);
                end
            end
            total_time = total_time + time_inc;
            if (write_cycle == 2) begin
                write_cycle = 0;
                // Here i represents the output rams and corresponding j, i
                // represent the input ram from which the information is
                // flowing. So, in essence internally it's a 16x4 flow network.
                for (int i=0; i<4; i++) begin
                    for (int j=0; j<4; j++) begin
                        // Similar to the code in VGA_LED.
                        // If eop is reached and there is a next packet, set
```

```
                        // eop low and set the port information.
                        if (eop[i] && inp[j][i] &&
                            input_ram_rd_add[j][i] < input_ram_wr_add[j][i]) begin
                            eop[i] = 0;
                            sport[i] = j;
                        end
                        // If eop is not reached (eop is low), check from which
                        // input ram the information is flowing, transfer
                        // the word and increment the address for the next
                        // word. Also, if the word is empty, set eop high.
                        if (!eop[i] && sport[i]==j) begin
                            outp[i] = inp[j][i];
                            out_ram_wr[i] = 1;
                            input_ram_rd_add[j][i] = input_ram_rd_add[j][i] + 1;
                            if (!inp[j][i]) begin
                                eop[i] = 1;
                                break;
                            end
                        end
                    end
                end
            end
            else begin
                write_cycle = write_cycle + 1;
                for (int i=0; i<4; i++) begin
                    // Set write enable signals to the rams low.
                    out_ram_wr[i] = 0;
                end
            end
        end
    end
endmodule
```
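
The arbitration performed when `write_cycle == 2` can be followed more easily in a software model. The C sketch below mirrors the two `if` statements of the inner loop: an idle output locks onto the first input whose queue holds a non-zero head word, then moves one word per step from that input until the zero terminator releases it. It is a simplified illustration, not the project's code: the struct, the helper names, and `DEPTH` are assumptions, RAM read latency and the three-cycle pacing are ignored, and an empty queue is treated as presenting a zero word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PORTS 4
#define DEPTH 64

typedef struct {
    uint32_t ram[PORTS][PORTS][DEPTH];  /* ram[j][i]: input j to output i */
    size_t   rd_add[PORTS][PORTS];
    size_t   wr_add[PORTS][PORTS];
    int      eop[PORTS];                /* per output: 1 = idle            */
    int      sport[PORTS];              /* per output: input being served  */
} switch_model_t;

void switch_model_init(switch_model_t *s)
{
    memset(s, 0, sizeof *s);
    for (int i = 0; i < PORTS; i++)
        s->eop[i] = 1;
}

void enqueue(switch_model_t *s, int in, int out, uint32_t word)
{
    assert(s->wr_add[in][out] < DEPTH);
    s->ram[in][out][s->wr_add[in][out]++] = word;
}

/* One scheduling step. For every output i that moved a word, out[i]
 * receives the word and wrote[i] is set to 1; otherwise wrote[i] is 0. */
void scheduler_step(switch_model_t *s, uint32_t out[PORTS], int wrote[PORTS])
{
    for (int i = 0; i < PORTS; i++) {
        wrote[i] = 0;
        for (int j = 0; j < PORTS; j++) {
            int has_data = s->rd_add[j][i] < s->wr_add[j][i];
            uint32_t head = has_data ? s->ram[j][i][s->rd_add[j][i]] : 0;
            if (s->eop[i] && head != 0) {   /* idle output: lock onto input j */
                s->eop[i] = 0;
                s->sport[i] = j;
            }
            if (!s->eop[i] && s->sport[i] == j) {
                out[i] = head;
                wrote[i] = 1;
                if (has_data)
                    s->rd_add[j][i]++;
                if (head == 0) {            /* terminator: release the output */
                    s->eop[i] = 1;
                    break;
                }
            }
        }
    }
}
```

If inputs 0 and 2 both queue a packet for output 1, the model delivers the whole packet from input 0, terminator included, before it starts on input 2, so packets are never interleaved on an output. Because the scan always starts at input 0, lower-numbered inputs are favoured, which is the greedy behaviour noted under Future Work.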

### Hardware: Buffer.sv

```
//START_MODULE_NAME
//
// Module Name      : Buffer
//
// Description      : Stores the data coming from Scheduler into the RAMS
//
// Limitation       : NONE
//
// Results expected : Packets stored with appropriate lengths to proper
//
//END_MODULE_NAME

module Buffer (input logic         clk,
               input logic         chipselect, read, read_enable,
               input logic [1:0]   reset_rams,
               input logic [3:0]   address,
               input logic [31:0]  outp[4],
               input logic         out_ram_wr[4],
               input logic [31:0]  total_time,
               output logic [31:0] readdata);

    // Output RAM signals. Read & Write address, enable signals and output
    // signals.
    logic [11:0] ram0_rdaddress, ram1_rdaddress, ram2_rdaddress, ram3_rdaddress;
    logic [11:0] ram0_wraddress, ram1_wraddress, ram2_wraddress, ram3_wraddress;
    logic        ram0_wren, ram1_wren, ram2_wren, ram3_wren;
    logic        ram0_rden, ram1_rden, ram2_rden, ram3_rden;
    logic [31:0] ram0_q, ram1_q, ram2_q, ram3_q;

    // read cycle signals to ensure that address is incremented only once
    // while reading from the RAM. We toggle these logic signals to ensure
    // that all the work at Buffer happens only during one clock cycle out
    // of the two used by the Avalon bus.
    logic        read_cycle0, read_cycle1, read_cycle2, read_cycle3;

    // Four output RAMs that model the four output ports.
    RAM output_ram0(.clock(clk), .data(outp[0]), .rdaddress(ram0_rdaddress),
```

```
        .rden(ram0_rden), .wraddress(ram0_wraddress), .wren(ram0_wren),
        .q(ram0_q));
    RAM output_ram1(.clock(clk), .data(outp[1]), .rdaddress(ram1_rdaddress),
        .rden(ram1_rden), .wraddress(ram1_wraddress), .wren(ram1_wren),
        .q(ram1_q));
    RAM output_ram2(.clock(clk), .data(outp[2]), .rdaddress(ram2_rdaddress),
        .rden(ram2_rden), .wraddress(ram2_wraddress), .wren(ram2_wren),
        .q(ram2_q));
    RAM output_ram3(.clock(clk), .data(outp[3]), .rdaddress(ram3_rdaddress),
        .rden(ram3_rden), .wraddress(ram3_wraddress), .wren(ram3_wren),
        .q(ram3_q));

    initial begin
        ram0_wraddress = 0; ram1_wraddress = 0; ram2_wraddress = 0; ram3_wraddress = 0;
        ram0_rdaddress = 0; ram1_rdaddress = 0; ram2_rdaddress = 0; ram3_rdaddress = 0;
        ram0_wren = 0; ram1_wren = 0; ram2_wren = 0; ram3_wren = 0;
        ram0_rden = 0; ram1_rden = 0; ram2_rden = 0; ram3_rden = 0;
        read_cycle0 = 1; read_cycle1 = 1; read_cycle2 = 1; read_cycle3 = 1;
    end

    // We store the values in the outp[i] signals in the RAM passed by
    // Scheduler along with the write signals controlled by the same.
    // Here we are delaying the storage by one clock cycle just to make sure
    // that the signal is strong when we save it to the RAM.
    always_ff @(posedge clk) begin
        if (reset_rams) begin
            ram0_wraddress <= 0; ram1_wraddress <= 0; ram2_wraddress <= 0;
            ram3_wraddress <= 0;
        end
        if (out_ram_wr[0])
            if (ram0_wren)
                ram0_wraddress <= ram0_wraddress + 1;
            else
                ram0_wren <= 1;
        else
            if (ram0_wren) begin
                ram0_wren <= 0;
                ram0_wraddress <= ram0_wraddress + 1;
            end

        if (out_ram_wr[1])
            if (ram1_wren)
                ram1_wraddress <= ram1_wraddress + 1;
            else
```

```
                ram1_wren <= 1;
        else
            if (ram1_wren) begin
                ram1_wren <= 0;
                ram1_wraddress <= ram1_wraddress + 1;
            end

        if (out_ram_wr[2])
            if (ram2_wren)
                ram2_wraddress <= ram2_wraddress + 1;
            else
                ram2_wren <= 1;
        else
            if (ram2_wren) begin
                ram2_wren <= 0;
                ram2_wraddress <= ram2_wraddress + 1;
            end

        if (out_ram_wr[3])
            if (ram3_wren)
                ram3_wraddress <= ram3_wraddress + 1;
            else
                ram3_wren <= 1;
        else
            if (ram3_wren) begin
                ram3_wren <= 0;
                ram3_wraddress <= ram3_wraddress + 1;
            end
    end

    // The read signals to all the RAMs are turned high as soon as the
    // read_enable signal is turned on.
    always_ff @(posedge clk) begin
        if (read_enable) begin
            ram0_rden <= 1; ram1_rden <= 1; ram2_rden <= 1; ram3_rden <= 1;
        end
    end

    // Block to control the reads from the output RAM.
    always_ff @(posedge clk) begin
        if (reset_rams) begin
            ram0_rdaddress <= 0; ram1_rdaddress <= 0; ram2_rdaddress <= 0;
            ram3_rdaddress <= 0;
        end
        if (chipselect && read) begin
            case (address)
                7  : readdata <= total_time;
                8  : readdata <= ram0_rdaddress;
                9  : readdata <= ram1_rdaddress;
                10 : readdata <= ram2_rdaddress;
                11 : readdata <= ram3_rdaddress;
                12 : readdata <= ram0_wraddress;
```

```
                13 : readdata <= ram1_wraddress;
                14 : readdata <= ram2_wraddress;
                15 : readdata <= ram3_wraddress;
                0 : begin
                    if (ram0_rdaddress <= ram0_wraddress) begin
                        if (read_cycle0) begin
                            ram0_rdaddress <= ram0_rdaddress + 1;
                            read_cycle0 <= 0;
                            readdata <= ram0_q;
                        end
                        else begin
                            read_cycle0 <= 1;
                        end
                    end
                    else
                        readdata <= ram0_q;
                end

                1 : begin
                    if (ram1_rdaddress <= ram1_wraddress) begin
                        if (read_cycle1) begin
                            ram1_rdaddress <= ram1_rdaddress + 1;
                            read_cycle1 <= 0;
                            readdata <= ram1_q;
                        end
                        else begin
                            read_cycle1 <= 1;
                        end
                    end
                    else
                        readdata <= ram1_q;
                end

                2 : begin
                    if (ram2_rdaddress <= ram2_wraddress) begin
                        if (read_cycle2) begin
                            ram2_rdaddress <= ram2_rdaddress + 1;
                            read_cycle2 <= 0;
                            readdata <= ram2_q;
                        end
                        else begin
                            read_cycle2 <= 1;
                        end
                    end
                    else
                        readdata <= ram2_q;
                end

                3 : begin
                    if (ram3_rdaddress <= ram3_wraddress) begin
                        if (read_cycle3) begin
```

```
                            ram3_rdaddress <= ram3_rdaddress + 1;
                            read_cycle3 <= 0;
                            readdata <= ram3_q;
                        end
                        else begin
                            read_cycle3 <= 1;
                        end
                    end
                    else
                        readdata <= ram3_q;
                end
                default : readdata <= 255;
            endcase
        end
    end

endmodule
```
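
The `read_cycle` toggles in the read block exist because, as the comments in the listing state, an Avalon read keeps `chipselect && read` asserted for two clock cycles, and the read address must advance only once per bus read. The C sketch below models that behaviour for one output port. It is illustrative only: the names are hypothetical, and the two-clock read is taken from the comment in the listing rather than from the bus specification.

```c
#include <assert.h>

typedef struct {
    unsigned rd_add;      /* read address of the output RAM               */
    unsigned wr_add;      /* write address of the output RAM              */
    int      read_cycle;  /* 1 = the next read clock advances the address */
} out_port_t;

/* Model one clock edge with chipselect && read asserted.
 * Returns 1 if the read address advanced on this edge. */
int out_port_clock(out_port_t *p)
{
    if (p->rd_add <= p->wr_add) {
        if (p->read_cycle) {
            p->rd_add++;
            p->read_cycle = 0;
            return 1;
        }
        p->read_cycle = 1;   /* second clock of the same bus read */
    }
    return 0;
}

/* One bus read, assumed to span two clock edges. */
int bus_read(out_port_t *p)
{
    int advanced = out_port_clock(p);
    advanced += out_port_clock(p);
    return advanced;
}
```

Starting from `read_cycle = 1`, as in the `initial` block, each call to `bus_read` advances the address exactly once, so consecutive reads from user space walk through the RAM one word at a time.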

### Hardware: oScheduler.sv

```
//START_MODULE_NAME
//
// Module Name      : Old Scheduler
//
// Description      : Reads values from RAMS (4 x 4 architecture) and
//                    schedules to prevent collisions.
//
// Limitation       : None
//
// Results expected : Schedules without collisions to appropriate RAM's
//
//END_MODULE_NAME

module Scheduler (input logic         clk,
                  input logic [31:0]  input1, input2, input3,
                  input logic [11:0]  input_ram_wr_add1, input_ram_wr_add2,
                                      input_ram_wr_add3,
                  input logic         write_enable,

                  output logic        out_ram_wr1, out_ram_wr2, out_ram_wr3,
                  output logic [31:0] output1, output2, output3,
                  output logic [11:0] input_ram_rd_add1, input_ram_rd_add2,
                                      input_ram_rd_add3,
                  output logic        input_ram_rden1, input_ram_rden2,
                                      input_ram_rden3);

    logic       empty1, empty2, empty3;
    logic [1:0] write_cycle;

    initial begin
        write_cycle = 0;
        output1 = 0; output2 = 0; output3 = 0;
        out_ram_wr1 = 0; out_ram_wr2 = 0; out_ram_wr3 = 0;
        input_ram_rd_add1 = 0; input_ram_rd_add2 = 0; input_ram_rd_add3 = 0;
    end

    function logic set_rd(logic [31:0] data, logic empty);
        if (!empty)
            case (data[1:0])
                2'b00 : if (!out_ram_wr2) begin
                    output2 = data;
                    out_ram_wr2 = 1;
                    return 1;
                end
                else
```

```
                    return 0;
                2'b10 : if (!out_ram_wr2) begin
                    output2 = data;
                    out_ram_wr2 = 1;
                    return 1;
                end
                else
                    return 0;
                2'b01 : if (!out_ram_wr1) begin
                    output1 = data;
                    out_ram_wr1 = 1;
                    return 1;
                end
                else
                    return 0;
                2'b11 : if (!out_ram_wr3) begin
                    output3 = data;
                    out_ram_wr3 = 1;
                    return 1;
                end
                else
                    return 0;
            endcase
        else
            return 0;
    endfunction

    always_ff @(posedge clk) begin
        input_ram_rden1 = 1; input_ram_rden2 = 1; input_ram_rden3 = 1;
        // all packets have been written to RAM
        if (write_enable) begin
            if (write_cycle == 2) begin
                write_cycle = 0;
                if (input_ram_rd_add1 < input_ram_wr_add1)
                    empty1 = 0;
                else
                    empty1 = 1;
                if (input_ram_rd_add2 < input_ram_wr_add2)
                    empty2 = 0;
                else
                    empty2 = 1;
                if (input_ram_rd_add3 < input_ram_wr_add3)
                    empty3 = 0;
                else
                    empty3 = 1;

                input_ram_rd_add1 = input_ram_rd_add1 + set_rd(input1, empty1);
                input_ram_rd_add2 = input_ram_rd_add2 + set_rd(input2, empty2);
                input_ram_rd_add3 = input_ram_rd_add3 + set_rd(input3, empty3);
```
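The `set_rd` function in the listing above decides, for one input RAM, whether the word at its read pointer can be forwarded in the current cycle: it decodes the two least significant bits of the word into an output port and grants the transfer only if that port has not already been claimed. A minimal C model of that decision is sketched below. It assumes the port mapping exactly as printed in the listing (both `2'b00` and `2'b10` select output 2); the names `xbar_t` and `model_set_rd` are hypothetical and are not part of the project sources.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical software model of the three output ports. */
typedef struct {
    bool     wr[4];   /* wr[n] mirrors out_ram_wrN; index 0 is unused */
    uint32_t out[4];  /* out[n] mirrors outputN; index 0 is unused */
} xbar_t;

/* Returns 1 and claims the output port if the word can be forwarded,
 * 0 if the input is empty or the port is already taken this cycle. */
static int model_set_rd(xbar_t *x, uint32_t data, bool empty)
{
    /* Mapping copied from the listing: 00->2, 01->1, 10->2, 11->3. */
    static const int port_of[4] = { 2, 1, 2, 3 };
    int port;

    if (empty)
        return 0;
    port = port_of[data & 0x3u];
    if (x->wr[port])
        return 0;          /* contention: another input won this port */
    x->out[port] = data;
    x->wr[port] = true;
    return 1;
}
```

Because the return value is added to the read address, an input that loses the contention keeps its read pointer and retries the same word on a later scheduling pass.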

## Verilator: vgacounter.cpp

```
// Instantiates VGA_LED.sv and exercises it with a write phase followed by
// a read phase (300 loop iterations each).
#include "VVGA_LED.h"
#include "verilated.h"
#include "verilated_vcd_c.h"
#include <stdlib.h>
#include <time.h>
#include <iostream>

// This is required, otherwise the module doesn't get instantiated and the
// linker throws an error.
vluint64_t main_time = 0;       // Current simulation time
        // This is a 64-bit integer to reduce wrap over issues and
        // allow modulus. You can also use a double, if you wish.

double sc_time_stamp () {       // Called by $time in Verilog
    return main_time;           // converts to double, to match
                                // what SystemC does
}

int main(int argc, char** argv)
{
    Verilated::commandArgs(argc, argv);
    time_t t;

    // init top verilog instance
    VVGA_LED* top = new VVGA_LED();

    // init trace dump
    Verilated::traceEverOn(true);
    VerilatedVcdC* tfp = new VerilatedVcdC;
    top->trace(tfp, 99);
    tfp->open("vgaled.vcd");

    // initialize simulation inputs
    top->clk = 1;
    top->write = 0;
    top->reset = 0;
    top->read = 0;
    int num_packets = 10;
    srand((unsigned) time(&t));

    // Write phase: run the simulation for 300 loop iterations
    for (int i = 0; i < 300; i++)
    {
        if (i >= 8 && i < 8 + 8 * num_packets) {
            top->write = 1;
            top->chipselect = 1;
            // top->address = 1;
            if (i%8 == 0)
                top->address = i/8%4;
            if (i%2 == 0 && i%8 < 6)
                top->writedata = rand() + 1;
            else if (i%8 == 6)
                top->writedata = 0;
        }
        else if (i >= 10 + 8 * num_packets && i < 12 + 8 * num_packets && i%2 == 0) {
            // Address 15: write-enable the scheduler
            top->write = 1;
            top->chipselect = 1;
            top->address = 15;
            top->writedata = 0;
        }
        else if (i%2 == 0) {
            top->write = 0;
            top->chipselect = 0;
            top->address = 0;
            top->writedata = 0;
        }

        for (int clk = 0; clk < 2; ++clk) {
            top->eval();
            tfp->dump((2 * i) + clk);
            if (clk == 1) {
                top->clk = !top->clk;
            }
        }
    }

    int ram0_size = top->v__DOT__buffer__DOT__ram0_wraddress;
    int ram1_size = top->v__DOT__buffer__DOT__ram1_wraddress;
    int ram2_size = top->v__DOT__buffer__DOT__ram2_wraddress;
    int ram3_size = top->v__DOT__buffer__DOT__ram3_wraddress;
    int j = 0;  // number of words read back so far

    // Read phase: drain the four output RAMs one after another
    for (int i = 300; i < 600; i++)
    {
        if (i < 312) {
            // Address 14: read-enable the output RAMs
            top->chipselect = 1;
            top->address = 14;
            top->write = 1;
        } else if (j < ram0_size) {
            top->write = 0;
            top->chipselect = 1;
            top->address = 0;
            if (i%6 < 4)
                top->read = 1;
            else
                top->read = 0;
        }
        else if (j >= ram0_size && j < ram1_size + ram0_size) {
            top->write = 0;
            top->chipselect = 1;
            top->address = 1;
            if (i%6 < 4)
                top->read = 1;
            else
                top->read = 0;
        }
        else if (j >= ram1_size + ram0_size && j < ram1_size + ram2_size + ram0_size) {
            top->write = 0;
            top->chipselect = 1;
            top->address = 2;
            if (i%6 < 4)
                top->read = 1;
            else
                top->read = 0;
        }
        else if (j >= ram1_size + ram0_size + ram2_size && j < ram1_size + ram2_size + ram3_size + ram0_size) {
            top->write = 0;
            top->chipselect = 1;
            top->address = 3;
            if (i%6 < 4)
                top->read = 1;
            else
                top->read = 0;
        }
        else if (i >= 590 && i < 592) {
            top->write = 1;
            top->chipselect = 1;
            top->address = 13;
            top->read = 0;
        } else {
            top->write = 0;
            top->chipselect = 0;
            top->address = 0;
            top->read = 0;
        }

        if (i > 312 && i%6 == 5)
            j++;

        for (int clk = 0; clk < 2; ++clk) {
            top->eval();
            tfp->dump((2 * i) + clk);
            if (clk == 1) {
                top->clk = !top->clk;
            }
        }
    }
    tfp->close();
}
```
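Each pass of the outer loop in this test bench evaluates the model twice, dumps at timestamps `2*i` and `2*i + 1`, and inverts `clk` once, so one full clock period spans two loop iterations. The sketch below replays only that toggling pattern in plain C, without Verilator, to show how many rising edges a given number of iterations produces. The helper name `count_rising_edges` is hypothetical.

```c
/* Hypothetical helper: replays the clk toggling of the test-bench loop
 * and counts how many rising edges the design would see. */
static int count_rising_edges(int iterations, int initial_clk)
{
    int clk = initial_clk;
    int edges = 0;
    int i, phase;

    for (i = 0; i < iterations; i++) {
        for (phase = 0; phase < 2; ++phase) {
            /* top->eval() and tfp->dump(2*i + phase) would run here */
            if (phase == 1) {
                int next = !clk;
                if (!clk && next)
                    edges++;      /* 0 -> 1 transition */
                clk = next;
            }
        }
    }
    return edges;
}
```

With `clk` starting at 1, a 300-iteration loop corresponds to 150 rising edges, which is the number of clock cycles the design actually sees during that phase.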

## Verilator: schedulercounter.cpp

```
// For easy interfacing with the scheduler.

#include "VScheduler.h"
#include "verilated.h"
#include "verilated_vcd_c.h"

int main(int argc, char** argv)
{
    Verilated::commandArgs(argc, argv);

    // init top verilog instance
    VScheduler* top = new VScheduler();

    // init trace dump
    Verilated::traceEverOn(true);
    VerilatedVcdC* tfp = new VerilatedVcdC;
    top->trace(tfp, 99);
    tfp->open("scheduler.vcd");

    // initialize simulation inputs
    top->clk = 1;
    top->write_enable = 1;
    top->reset_rams = 0;

    // Test 1 (24 loop iterations): every input targets a different output
    for (int i = 0; i < 24; i++)
    {
        if (i == 8) {
            top->input_ram_wr_add[0][0] = 2;
            top->input_ram_wr_add[1][1] = 2;
            top->input_ram_wr_add[2][2] = 2;
            top->input_ram_wr_add[3][3] = 2;
        }
        if (top->input_ram_rd_add[0][0] == 0)
            top->inp[0][0] = 1;
        else
            top->inp[0][0] = 0;
        if (top->input_ram_rd_add[1][1] == 0)
            top->inp[1][1] = 2;
        else
            top->inp[1][1] = 0;
        if (top->input_ram_rd_add[2][2] == 0)
            top->inp[2][2] = 3;
        else
            top->inp[2][2] = 0;
        if (top->input_ram_rd_add[3][3] == 0)
            top->inp[3][3] = 4;
        else
            top->inp[3][3] = 0;

        for (int clk = 0; clk < 2; ++clk)
        {
            top->eval();
            tfp->dump((2 * i) + clk);
            if (clk == 1)
                top->clk = !top->clk;
        }
    }

    // Reset the pointers before the second test
    for (int j = 0; j < 4; j++) {
        for (int k = 0; k < 4; k++) {
            top->input_ram_rd_add[j][k] = 0;
        }
    }
    top->input_ram_wr_add[0][0] = 0;
    top->input_ram_wr_add[1][1] = 0;
    top->input_ram_wr_add[2][2] = 0;
    top->input_ram_wr_add[3][3] = 0;
    top->total_time = 0;

    // Test 2 (72 loop iterations): all four inputs contend for output 0
    for (int i = 24; i < 96; i++)
    {
        if (i == 32) {
            top->input_ram_wr_add[0][0] = 2;
            top->input_ram_wr_add[1][0] = 2;
            top->input_ram_wr_add[2][0] = 2;
            top->input_ram_wr_add[3][0] = 2;
        }
        if (top->input_ram_rd_add[0][0] == 0)
            top->inp[0][0] = 1;
        else
            top->inp[0][0] = 0;
        if (top->input_ram_rd_add[1][0] == 0)
            top->inp[1][0] = 2;
        else
            top->inp[1][0] = 0;
        if (top->input_ram_rd_add[2][0] == 0)
            top->inp[2][0] = 3;
        else
            top->inp[2][0] = 0;
        if (top->input_ram_rd_add[3][0] == 0)
            top->inp[3][0] = 4;
        else
            top->inp[3][0] = 0;

        for (int clk = 0; clk < 2; ++clk)
        {
            top->eval();
            tfp->dump((2 * i) + clk);
            if (clk == 1)
                top->clk = !top->clk;
        }
    }
    tfp->close();
}
```

## Verilator: ramcounter.cpp

```
// For easy interfacing with the RAM
#include "VRAM.h"
#include "verilated.h"
#include "verilated_vcd_c.h"

vluint64_t main_time = 0;       // Current simulation time
        // This is a 64-bit integer to reduce wrap over issues and
        // allow modulus. You can also use a double, if you wish.

double sc_time_stamp () {       // Called by $time in Verilog
    return main_time;           // converts to double, to match
                                // what SystemC does
}

int main(int argc, char** argv)
{
    Verilated::commandArgs(argc, argv);

    // init top verilog instance
    VRAM* top = new VRAM();

    // init trace dump
    Verilated::traceEverOn(true);
    VerilatedVcdC* tfp = new VerilatedVcdC;
    top->trace(tfp, 99);
    tfp->open("ram.vcd");

    // initialize simulation inputs
    top->clock = 0;

    // run simulation for 100 loop iterations
    for (int i = 0; i < 100; i++)
    {
        // Write 0xA to address 1, then 0xB to address 2
        if (i >= 13 && i < 15) {
            top->data = 0xA;
            top->wren = 0x1;
            top->wraddress = 0x1;
        }
        else if (i >= 15 && i < 17) {
            top->data = 0xB;
            top->wren = 0x1;
            top->wraddress = 0x2;
        }
        else {
            top->data = 0;
            top->wren = 0;
        }

        // Read both addresses back
        if (i >= 17 && i < 19) {
            top->rden = 0x1;
            top->rdaddress = 0x1;
        }
        else if (i >= 19 && i < 21) {
            top->rden = 0x1;
            top->rdaddress = 0x2;
        }
        else {
            top->rden = 0;
        }

        for (int clk = 0; clk < 2; ++clk) {
            top->eval();
            tfp->dump((2 * i) + clk);
            if (clk == 1)
                top->clock = !top->clock;
        }
    }
    tfp->close();
}
```

## Verilator: buffercounter.cpp

```
// For simulating Buffer; it's better to simulate the full suite
#include "VBuffer.h"
#include "verilated.h"
#include "verilated_vcd_c.h"
#include <cstdio>
#include <iostream>

vluint64_t main_time = 0;       // Current simulation time
        // This is a 64-bit integer to reduce wrap over issues and
        // allow modulus. You can also use a double, if you wish.

double sc_time_stamp () {       // Called by $time in Verilog
    return main_time;           // converts to double, to match
                                // what SystemC does
}

int main(int argc, char** argv)
{
    Verilated::commandArgs(argc, argv);

    // init top verilog instance
    VBuffer* top = new VBuffer();

    // init trace dump
    Verilated::traceEverOn(true);
    VerilatedVcdC* tfp = new VerilatedVcdC;
    top->trace(tfp, 99);
    tfp->open("buffer.vcd");
    top->read_enable = 1;

    // initialize simulation inputs
    top->clk = 1;

    // run simulation for 100 loop iterations
    int add = 0;
    for (int i = 0; i < 100; i++) {
        // Place dummy data on the write bus. You need to write first.
        // RAM 0 & RAM 1
        if (i >= 10 && i < 14) {
            top->out_ram_wr[0] = 1;  // Enable the RAM 0 write strobe
            top->outp[0] = 1;        // Put data on the result signal
            top->out_ram_wr[1] = 1;
            top->outp[1] = 2;
        }
        else {
            top->out_ram_wr[0] = 0;  // Deassert the RAM 0 write strobe
            top->outp[0] = 0;        // Clear result 0
        }

        // RAM 2 & RAM 3
        if (i >= 14 && i < 18) {
            top->out_ram_wr[2] = 1;
            top->outp[2] = 3;
            top->out_ram_wr[3] = 1;
            top->outp[3] = 4;
        }
        else {
            top->out_ram_wr[2] = 0;
            top->outp[2] = 0;
        }

        // Generate read signals
        if (i >= 20 && i < 36) {
            top->chipselect = 1;
            top->read = 1;
            top->address = add;
            if (i%4 == 3)
                add = add + 1;
            printf("%i\n", add);
        }
        else {
            top->chipselect = 0;
            top->address = 0;
            top->read = 0;
        }

        for (int clk = 0; clk < 2; ++clk) {
            top->eval();
            tfp->dump((2 * i) + clk);
            if (clk == 1) {
                top->clk = !top->clk;
            }
        }
    }
    tfp->close();
}
```

## Verilator: Makefile

```
# SwitchON hardware simulation file. Compiles all the modules individually
# or can compile them into one top module.

# List the includes here
# altera_mf.v contains scfifo and altsync modules.
INCLUDES=altera_mf.v

# List all the warning flags with the reason to skip them.
WFLAGS= -Wno-INITIALDLY -Wno-lint -Wno-MULTIDRIVEN -Wno-UNOPTFLAT -Wno-COMBDLY
#WFLAGS=

# Warning Flags Description
# (http://www.veripool.org/projects/verilator/wiki/Manual-verilator)
#
# 1) -Wno-INITIALDLY:
#    Warns that you have a delayed assignment inside of an initial or final
#    block. If this message is suppressed, Verilator will convert this to a
#    non-delayed assignment. See also the COMBDLY warning. Ignoring this
#    warning may make Verilator simulations differ from other simulators.
#    Our Observation:
#    Since some of the Altera modules (more than hundreds) did not have
#    this explicitly set we disabled it, and have not faced any issue as
#    such.
#
# 2) -Wno-lint:
#    Disable all lint related warning messages, and all style warnings.
#    This is equivalent to "-Wno-ALWCOMBORDER -Wno-CASEINCOMPLETE
#    -Wno-CASEOVERLAP -Wno-CASEX -Wno-CASEWITHX -Wno-CMPCONST -Wno-ENDLABEL
#    -Wno-IMPLICIT -Wno-LITENDIAN -Wno-PINCONNECTEMPTY -Wno-PINMISSING
#    -Wno-SYNCASYNCNET -Wno-UNDRIVEN -Wno-UNSIGNED -Wno-UNUSED -Wno-WIDTH"
#    plus the list shown for Wno-style.
#    It is strongly recommended you cleanup your code rather than using this
#    option, it is only intended to be use when running test-cases of code
#    received from third parties.
#
# 3) -Wno-MULTIDRIVEN:
#    Warns that the specified signal comes from multiple always blocks. This
#    is often unsupported by synthesis tools, and is considered bad style.
#    It will also cause longer runtimes due to reduced optimizations.
#    Ignoring this warning will only slow simulations, it will simulate
#    correctly.
#
# 4) -Wno-UNOPTFLAT:
#    Warns that due to some construct, optimization of the specified signal
#    or block is disabled. The construct should be cleaned up to improve
#    runtime. A less obvious case of this is when a module instantiates
#    two submodules. Inside submodule A, signal I is input and signal O is
#    output. Likewise in submodule B, signal O is an input and I is an
#    output. A loop exists and a UNOPT warning will result if AI & AO both
#    come from and go to combinatorial blocks in both submodules, even if
#    they are unrelated always blocks. This affects performance because
#    Verilator would have to evaluate each submodule multiple times to
#    stabilize the signals crossing between the modules. Ignoring this
#    warning will only slow simulations, it will simulate correctly.
#
# 5) -Wno-COMBDLY:
#    Warns that you have a delayed assignment inside of a combinatorial
#    block. Using delayed assignments in this way is considered bad form,
#    and may lead to the simulator not matching synthesis. If this message
#    is suppressed, Verilator, like synthesis, will convert this to a
#    non-delayed assignment, which may result in logic races or other
#    nasties. See
#    http://www.sunburst-design.com/papers/CummingsSNUG2000SJ_NBA_rev1_2.pdf
#    Ignoring this warning may make Verilator simulations differ from other
#    simulators.

# Name of the TOP MODULE into which all modules will be mushed.
TOPMODULE=VGA_LED

# Define individual modules below with the appropriate simulators.
# Notation to define simulation file is <modulenamecounter.cpp>

# TOP level module depends on Fifo.v Scheduler.v Buffer.v megamux.v
# Define the simulation file for this module.
VGA_LED_SIM=vgacounter.cpp
vgaled:
	verilator $(WFLAGS) --top-module $(TOPMODULE) -I $(INCLUDES) --cc \
	  --trace VGA_LED.sv --exe $(VGA_LED_SIM)
	make -j -C obj_dir/ -f VVGA_LED.mk VVGA_LED
	obj_dir/VVGA_LED

# The RAMs on the output port of the Switch
# Define the simulation file for this module.
buffer_SIM=buffercounter.cpp
buffer:
	verilator -Wno-lint --top-module Buffer -I $(INCLUDES) --cc \
	  --trace Buffer.sv --exe $(buffer_SIM)
	make -j -C obj_dir/ -f VBuffer.mk VBuffer
	obj_dir/VBuffer

# Compiles the scheduler; depends on None. This is the Crossbar switch
scheduler_SIM=schedulercounter.cpp
scheduler:
	verilator -Wno-lint --cc --trace Scheduler.sv --exe $(scheduler_SIM)
	make -j -C obj_dir/ -f VScheduler.mk VScheduler
	obj_dir/VScheduler

# Compiles into Altera's scfifo; depends on scfifo.v
#fifo_SIM=fifocounter.cpp
#fifo:
#	verilator -Wno-INITIALDLY -Wno-lint -Wno-MULTIDRIVEN --top-module Fifo \
#	  --cc --trace Fifo.v --exe $(fifo_SIM)
#	make -j -C obj_dir/ -f VFifo.mk VFifo
#	obj_dir/VFifo

## Compiles the Megamuxes
#mux_SIM=muxcounter.cpp
#mux:
#	verilator -Wno-lint --cc --trace lpm_mux.v --top-module lpm_mux --exe \
#	  $(mux_SIM)
#	make -j -C obj_dir/ -f Vlpm_mux.mk Vlpm_mux
#	obj_dir/Vlpm_mux

# Compiles the RAM module.
ram:
	verilator $(WFLAGS) -I $(INCLUDES) --cc --trace RAM.v --top-module RAM \
	  --exe ramcounter.cpp
	make -j -C obj_dir/ -f VRAM.mk VRAM
	obj_dir/VRAM

clean:
	rm -rf obj_dir
	rm -f *.vcd
```

## Software: main.c

```
/*
 * Userspace program that communicates with the led_vga device driver
 * primarily through ioctls.
 * Based on Stephen Edwards's code.
 * Specific words (see packetgen.h) are reserved for RAM/Scheduler control.
 * Architecture of the Switch
 *   | Address | Status                                           |
 *   | 15      | write_enable // Kicks the Scheduler into motion. |
 *   | 14      | read_enable  // Kicks the output RAMs.           |
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "vga_led.h"
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include "packetgen.h"

int vga_led_fd;
int sent[VGA_LED_DIGITS], received[VGA_LED_DIGITS];

int main()
{
    vga_led_arg_t vla;
    int i;
    time_t t; // Use the system time to seed the pseudo random generator
    srand((unsigned) time(&t));
    static const char filename[] = "/dev/vga_led";

    printf("Switch ON Packet Generator started\n");
    if ((vga_led_fd = open(filename, O_RDWR)) == -1) {
        fprintf(stderr, "could not open %s\n", filename);
        return -1;
    }
    for (i = 0; i < VGA_LED_DIGITS; i++) {
        sent[i] = 0;
        received[i] = 0;
    }

    int* input;
    char* packet_info;
    // Generate each packet and send it.
    for (i = 0; i < NUM_PACKETS; i++) {
        packet_info = mkpkt();
        input = generate(packet_info);
        int sport = i%4;
        printf("Sending packet to port: %u, of length: %u, with seed: %u\n",
               packet_info[0], packet_info[2], packet_info[1]);
        write_segments(vga_led_fd, input, sport, packet_info[2]);
        // Cast so the index stays in 0..3 even where plain char is signed.
        sent[(unsigned char) packet_info[0] % 4]++;
        free(input);
        free(packet_info);
    }
    for (i = 0; i < VGA_LED_DIGITS; i++) {
        printf("Packets sent to RAM %i: %i\n", i, sent[i]);
    }
    printf("Done Sending Packets, run validator to check!!, terminating\n");

    vla.digit = WRITE_ENABLE_SCHEDULER; // For starting the Scheduler
    vla.segments = 0; // No address needed
    if (ioctl(vga_led_fd, VGA_LED_WRITE_DIGIT, &vla)) {
        perror("ioctl(VGA_LED_WRITE_DIGIT) failed");
        return -1;
    }
    vla.digit = READ_ENABLE_SCHEDULER; // For read-enabling the output RAMs
    vla.segments = 0;
    if (ioctl(vga_led_fd, VGA_LED_WRITE_DIGIT, &vla)) {
        perror("ioctl(VGA_LED_WRITE_DIGIT) failed");
        return -1;
    }
    return 0;
}
```

## Software: packetgen.h

```
/*
 * packetgen headers:
 * Contains various headers defining packet parameters.
 * Team SwitchON
 * Columbia University
 */
#include <stdint.h>
#ifndef __PACKETGEN_H__
#define __PACKETGEN_H__

/* Packet parameters */

// Crossbar Architecture
//   -1 -|-|-|
//   -2 -|-|-|
//   -3 -|-|-|
//       1 2 3

// Packet Structure (all length in bytes)
// | LENGTH | LENGTH | SEED | DPORT |
//     1        1       1      1

// Destination port parameters
#define MIN_DPORT 1 // Minimum dst port that must be generated
#define DPORT_BITS 256 // 1 Byte
#define NUM_PACKETS 150 // Total Packets to be sent.
#define SEED_BITS 256 // Keep the seed of 1 byte
#define WRITE_ENABLE_SCHEDULER 15 // Write Enable the scheduler.
#define READ_ENABLE_SCHEDULER 14 // Read Enable the Output RAMs.
#define NUM_RAMS 4 // Define the number of RAMs.
#define TIME_PER_CYCLE 20e-9 // Seconds; '^' is XOR in C, so 20*10^-9 was wrong

char* mkpkt();
void write_segments(int vga_led_fd, int* input, int sport, int len);
int* generate(char packet_info[4]);

#endif
```
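The packet structure comment in `packetgen.h` describes a 32-bit header word with the destination port in the lowest byte, the seed in the next byte, and a two-byte length on top. The sketch below packs and unpacks such a word with fixed-width unsigned types. The helper names (`pack_header`, `header_dport`, `header_seed`, `header_length`) are hypothetical and do not appear in the project; the layout is inferred from the comment and from how `generate()` and the validator use the fields.

```c
#include <stdint.h>

/* Hypothetical helpers; the field layout follows the packet structure
 * comment in packetgen.h: | LENGTH (2 bytes) | SEED | DPORT |. */
static uint32_t pack_header(uint8_t dport, uint8_t seed, uint16_t length)
{
    return ((uint32_t)length << 16) | ((uint32_t)seed << 8) | dport;
}

/* Lowest byte: destination port. */
static uint8_t header_dport(uint32_t h)
{
    return (uint8_t)(h & 0xFFu);
}

/* Second byte: seed for the pseudo-random payload. */
static uint8_t header_seed(uint32_t h)
{
    return (uint8_t)((h >> 8) & 0xFFu);
}

/* Upper two bytes: packet length in 32-bit words. */
static uint16_t header_length(uint32_t h)
{
    return (uint16_t)(h >> 16);
}
```

For example, `pack_header(3, 0x5A, 8)` yields `0x00085A03`, and the output RAM for that packet is `header_dport(h) % 4`, matching the `dport % 4` computation in the validator.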

## Software: packetgen.c

```
/*
 * Userspace program that generates packets with random contents
 * Headers are defined in packetgen.h
 * Define the function prototypes in the packetgen.h headers
 */
#include <stdlib.h>
#include "packetgen.h"
#include "vga_led.h"
#include <stdio.h>
#include <sys/ioctl.h>

// mkpkt returns a char pointer to a newly allocated 4-byte header with
// randomly generated fields. The caller frees it.
char* mkpkt() {
    char* input = (char *) malloc(4);
    input[0] = rand()%DPORT_BITS; // LSB 8 bits destination port.
    input[1] = rand()%SEED_BITS; // Seed for the data.
    input[2] = rand()%60+4; // Length of the packet.
    input[3] = 0; // Length of packet MSB
    return input;
}

// Writes the packet to vla.segments, one 32-bit word per ioctl.
void write_segments(int vga_led_fd, int* packet, int sport, int len)
{
    vga_led_arg_t vla;
    int i;
    vla.digit = sport; // Source port on which to send.
    for (i = 0; i < len; i++) {
        vla.segments = packet[i];
        if (ioctl(vga_led_fd, VGA_LED_WRITE_DIGIT, &vla)) {
            perror("ioctl(VGA_LED_WRITE_DIGIT) failed");
            return;
        }
    }
}

// Pushes the 32-bit header word, then fills the rest of the packet with
// pseudo-random words derived from the seed, and ends it with a 0 word.
int* generate(char packet_info[4]) {
    int i = 0;
    int len = (int) packet_info[2];
    int* input = (int *) malloc(len*4);
    // Read the header bytes as unsigned so that values above 127 are not
    // sign-extended on platforms where plain char is signed.
    const unsigned char* hdr = (const unsigned char *) packet_info;
    input[0] = ((unsigned int) hdr[3] << 24) | (hdr[2] << 16) |
               (hdr[1] << 8) | (hdr[0]);
    srand((unsigned)(hdr[1]));
    for (i = 1; i < len - 1; i++) {
        input[i] = rand() + 1;
    }
    input[len-1] = 0;
    return input;
}
```
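The payload produced by `generate()` is fully determined by the one-byte seed in the header: reseeding `rand()` with the same value reproduces the same word sequence, which is what lets a receiver check a packet without a stored copy of it. The sketch below shows that round trip in isolation. The helper names `fill_payload` and `payload_matches` are hypothetical, and the check only holds when both sides use the same C library implementation of `rand()`.

```c
#include <stdlib.h>

/* Hypothetical helper: fills words[1..len-2] with rand()+1 after seeding
 * and terminates the packet with 0, mirroring generate(). words[0] is
 * left for the header. */
static void fill_payload(unsigned seed, unsigned int *words, int len)
{
    int i;
    srand(seed);
    for (i = 1; i < len - 1; i++)
        words[i] = (unsigned int)rand() + 1u;
    words[len - 1] = 0;
}

/* Returns 1 if the payload is what the seed would regenerate and the
 * packet ends with the 0 end-of-packet word, 0 otherwise. */
static int payload_matches(unsigned seed, const unsigned int *words, int len)
{
    int i;
    srand(seed);
    for (i = 1; i < len - 1; i++)
        if (words[i] != (unsigned int)rand() + 1u)
            return 0;
    return words[len - 1] == 0;
}
```

Adding 1 to every payload word keeps the payload non-zero, so a 0 word can serve unambiguously as the end-of-packet marker.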

## Software: validator.c

```
* Switch ON validator, after main has completed sending the packets,
     this * connects to the vga_led device and extracts all the
     information about t * he current status of the RAM's, based on which
     it extracts the packet fr * om each RAM. Now it also locally seeds
     itself with the encoded packet's * seed and then matches the
     information one by one till EOP((End of packe * t), at which stage
     it resets it's seed and waits for another packet.
3
4 */
5 #include < stdio.h>
6 #include < stdlib.h>
7 #include <time.h>
8 #include "vga_led.h"
9 #include <sys/ioctl.h>
10 #include <sys/types.h>
#include <sys/stat.h>
12 #include <fcntl.h>
13 #include < string . h>
14 #include <unistd.h>
15 #include <math.h>
16 #include "packetgen.h"
int vga_led_fd;
 int received[VGA_LED_DIGITS], packets[VGA_LED_DIGITS];
 int main()
20
21
      vga_led_arg_t vla;
22
      int i,j,k,total_packets = 0,transferred_data=0;
23
      static const char filename[] = "/dev/vga_led";
24
      printf(" Userspace Validation of sent data \n");
      if (vga led fd = open(filename, ORDWR)) == -1)
          fprintf(stderr, "could not open %s\n", filename);
          return -1;
28
      for(i=0; i < VGA_LED_DIGITS; i++){</pre>
30
          received[i] = 0;
          packets[i] = 0;
      // Output Ram's count
34
      for (i=0; i < VGA\_LED\_DIGITS; i++){
35
          vla.digit = 12+i;
          if (ioctl(vga_led_fd, VGA_LED_READ_DIGIT, &vla)) {
              perror("ioctl(VGA_LED_READ_DIGIT) failed");
              return;
          }
          received[i] = vla.segments;
41
          printf("RAM %i (32 Bits Transferred, includes all 4): %i\n", i,
     received[i]);
          transferred_data = transferred_data + received[i];
```

```
44
      printf("Transferred Data (Bytes) : %i\n",transferred_data*4);
45
      for(i=0; i<VGA_LED_DIGITS; i++){</pre>
          vla.digit = 8+i;
47
          if (ioctl(vga_led_fd, VGA_LED_READ_DIGIT, &vla)) {
               perror("ioctl(VGA_LED_READ_DIGIT) failed");
               return;
50
          }
51
    for (i = 0; i < VGA_LED_DIGITS; i++) {
        // Start extracting values from the output RAMs
        printf("Validating from RAM: %i\n", i);
        vla.digit = i;

        for (j = 0; j < received[i]; j++) {
            // Extract the next word from the RAM; a packet starts with its header.
            if (ioctl(vga_led_fd, VGA_LED_READ_DIGIT, &vla)) {
                perror("ioctl(VGA_LED_READ_DIGIT) failed");
                return;
            }
            if (vla.segments == 0) {
                printf("Received 0\n");
                continue;
            }

            // Header layout: length in the upper 16 bits, seed in bits 15..8,
            // destination port taken as the word modulo 4.
            unsigned int seedMask = 65280; // 0xFF00: extract the middle bits
            int length = vla.segments;
            int seed = length;
            int dport = seed;

            length = length >> 16;
            seed = ((seed & seedMask) >> 8); // Extracts the seed from the packet
            dport = dport % 4;

            if (dport != i) {
                printf("Invalid RAM location and dport from packet header\n");
                exit(1);
            }

            // Regenerate the payload from the seed and compare word by word.
            srand(seed);
            for (k = 1; k < length; k++) {
                if (ioctl(vga_led_fd, VGA_LED_READ_DIGIT, &vla)) {
                    perror("ioctl(VGA_LED_READ_DIGIT) failed");
                    return;
                }
                if (k < length - 1) {
                    int a = rand() + 1;
                    if (vla.segments != a) {
                        printf("Packet value does not match: %i, %i\n",
                               a, vla.segments);
                        exit(1);
                    }
                } else if (k == length - 1 && vla.segments != 0) {
                    // The last word of every packet must be the 0 terminator.
                    printf("Length of packet reached but 0 not received.\n");
                    exit(1);
                }
                j++; // Each payload word also counts towards received[i]
            }

            packets[i]++;
            total_packets++; // Increment the total packet sent counter.
        }
    }

    printf("All RAMs have passed Validation!! \n");
    printf("Total Packets Sent : %i\n", total_packets);
    for (i = 0; i < 4; i++)
        printf("Output RAM: %i Packet Count: %i\n", i, packets[i]);

    // Register 7 holds the number of clock cycles the transfer took.
    vla.digit = 7;
    if (ioctl(vga_led_fd, VGA_LED_READ_DIGIT, &vla)) {
        perror("ioctl(VGA_LED_READ_DIGIT) failed");
        return;
    }

    int num_clock_cycles = 0; // Number of clock cycles it took in total
    num_clock_cycles = vla.segments;
    printf("Number of cycles required for transfer: %i\n", num_clock_cycles);

    float var = 20E-9; // Clock period, assuming the FPGA runs on a 50 MHz clock.
    printf("Speed through the Switch is:%f (in Mbits/s) \n",
           (transferred_data * 4 * 8) / (var * 1024 * 1024 * num_clock_cycles));
    return 0;
}
```

# Software: vga\_led.h

```
#ifndef _VGA_LED_H
#define _VGA_LED_H

#include <linux/ioctl.h>

#define VGA_LED_DIGITS 4

typedef struct {
    unsigned char digit;
    unsigned int segments; /* LSB is segment a, MSB is decimal point */
} vga_led_arg_t;

#define VGA_LED_MAGIC 'q'

/* ioctls and their arguments */
#define VGA_LED_WRITE_DIGIT _IOW(VGA_LED_MAGIC, 1, vga_led_arg_t *)
#define VGA_LED_READ_DIGIT _IOWR(VGA_LED_MAGIC, 2, vga_led_arg_t *)
#endif
```

# Software: vga\_led.c

```
/*
 * Device driver for the VGA LED Emulator
 *
 * A Platform device implemented using the misc subsystem
 *
 * Stephen A. Edwards
 * Columbia University
 *
 * References:
 * Linux source: Documentation/driver-model/platform.txt
 *               drivers/misc/arm-charlcd.c
 * http://www.linuxforu.com/tag/linux-device-drivers/
 * http://free-electrons.com/docs/
 *
 * "make" to build
 * insmod vga_led.ko
 *
 * Check code style with
 * checkpatch.pl --file --no-tree vga_led.c
 */

#include <linux/module.h>
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/version.h>
#include <linux/kernel.h>
#include <linux/platform_device.h>
#include <linux/miscdevice.h>
#include <linux/slab.h>
#include <linux/io.h>
#include <linux/of.h>
#include <linux/of_address.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include "vga_led.h"

#define DRIVER_NAME "vga_led"

/*
 * Information about our device
 */
struct vga_led_dev {
    struct resource res; /* Resource: our registers */
    void __iomem *virtbase; /* Where registers can be accessed in memory */
    u32 segments[VGA_LED_DIGITS];
} dev;

/*
 * Write segments of a single digit
 * Assumes digit is in range and the device information has been set up
 */
static void write_digit(unsigned int digit, u32 segments)
{
    iowrite32(segments, dev.virtbase + 4 * digit);
    /* The cached copy only has VGA_LED_DIGITS entries; do not write past it */
    if (digit < VGA_LED_DIGITS)
        dev.segments[digit] = segments;
}

/*
 * Handle ioctl() calls from userspace:
 * Read or write the segments on single digits.
 * Note extensive error checking of arguments
 */
static long vga_led_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    vga_led_arg_t vla;

    switch (cmd) {
    case VGA_LED_WRITE_DIGIT:
        if (copy_from_user(&vla, (vga_led_arg_t *) arg,
                           sizeof(vga_led_arg_t)))
            return -EACCES;
        /* if (vla.digit > 8) */
        /*     return -EINVAL; */
        write_digit(vla.digit, vla.segments);
        break;

    case VGA_LED_READ_DIGIT:
        if (copy_from_user(&vla, (vga_led_arg_t *) arg,
                           sizeof(vga_led_arg_t)))
            return -EACCES;
        if (vla.digit > 15)
            return -EINVAL;
        /* Read the hardware register directly instead of the cached copy */
        vla.segments = ioread32(dev.virtbase + 4 * vla.digit);
        /* vla.segments = dev.segments[vla.digit]; */
        if (copy_to_user((vga_led_arg_t *) arg, &vla,
                         sizeof(vga_led_arg_t)))
            return -EACCES;
        break;

    default:
        return -EINVAL;
    }

    return 0;
}

/* The operations our device knows how to do */
static const struct file_operations vga_led_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = vga_led_ioctl,
};

/* Information about our device for the "misc" framework - like a char dev */
static struct miscdevice vga_led_misc_device = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = DRIVER_NAME,
    .fops  = &vga_led_fops,
};

/*
 * Initialization code: get resources (registers) and display
 * a welcome message
 */
static int __init vga_led_probe(struct platform_device *pdev)
{
    /* static unsigned char welcome_message[VGA_LED_DIGITS] = {
     *     0x3E, 0x7D, 0x77, 0x08, 0x38, 0x79, 0x5E, 0x00};
     */
    int i, ret;
    static unsigned char welcome_message[4] = {0, 0, 0, 0};

    /* Register ourselves as a misc device: creates /dev/vga_led */
    ret = misc_register(&vga_led_misc_device);
    if (ret)
        return ret;

    /* Get the address of our registers from the device tree */
    ret = of_address_to_resource(pdev->dev.of_node, 0, &dev.res);
    if (ret) {
        ret = -ENOENT;
        goto out_deregister;
    }

    /* Make sure we can use these registers */
    if (request_mem_region(dev.res.start, resource_size(&dev.res),
                           DRIVER_NAME) == NULL) {
        ret = -EBUSY;
        goto out_deregister;
    }

    /* Arrange access to our registers */
    dev.virtbase = of_iomap(pdev->dev.of_node, 0);
    if (dev.virtbase == NULL) {
        ret = -ENOMEM;
        goto out_release_mem_region;
    }

    /* Display a welcome message (all zeros: clears the first four registers) */
    for (i = 0; i < VGA_LED_DIGITS; i++)
        write_digit(i, welcome_message[i]);

    return 0;

out_release_mem_region:
    release_mem_region(dev.res.start, resource_size(&dev.res));
out_deregister:
    misc_deregister(&vga_led_misc_device);
    return ret;
}

/* Clean-up code: release resources */
static int vga_led_remove(struct platform_device *pdev)
{
    iounmap(dev.virtbase);
    release_mem_region(dev.res.start, resource_size(&dev.res));
    misc_deregister(&vga_led_misc_device);
    return 0;
}

/* Which "compatible" string(s) to search for in the Device Tree */
#ifdef CONFIG_OF
static const struct of_device_id vga_led_of_match[] = {
    { .compatible = "altr,vga_led" },
    {},
};
MODULE_DEVICE_TABLE(of, vga_led_of_match);
#endif

/* Information for registering ourselves as a "platform" driver */
static struct platform_driver vga_led_driver = {
    .driver = {
        .name = DRIVER_NAME,
        .owner = THIS_MODULE,
        .of_match_table = of_match_ptr(vga_led_of_match),
    },
    .remove = __exit_p(vga_led_remove),
};

/* Called when the module is loaded: set things up */
static int __init vga_led_init(void)
{
    pr_info(DRIVER_NAME ": init\n");
    return platform_driver_probe(&vga_led_driver, vga_led_probe);
}

/* Called when the module is unloaded: release resources */
static void __exit vga_led_exit(void)
{
    platform_driver_unregister(&vga_led_driver);
    pr_info(DRIVER_NAME ": exit\n");
}

module_init(vga_led_init);
module_exit(vga_led_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Stephen A. Edwards, Columbia University");
MODULE_DESCRIPTION("VGA 7-segment LED Emulator");
```

# Software: Makefile

```
# Use gcc for compilation
CC = gcc

# Include extra directories
INCLUDES =

# Compilation Options:
# -g for debugging, -Wall enables all warnings
CFLAGS = -g -Wall $(INCLUDES)

# Linking Options:
# -g for debugging info
LDFLAGS = -g

# List of libraries which need to be linked in LDLIBS
LDLIBS =

# Specify targets in a recursive way.
# We rely on make's implicit rules:
#   $(CC) $(LDFLAGS) <all-dependent-.o-files> $(LDLIBS)

# Main is the main target that is compiled; it contains references to other
# functions.

# The philosophy is pretty simple: main depends on everything, so include all the
# *.o files in main. Other files might have internal dependencies, e.g. packetgen
# depends on common, hence they are compiled together.

.PHONY:
main: main.o packetgen.o common.o

main.o: main.c packetgen.h

packetgen.o: packetgen.c packetgen.h common.h

common.o: common.c common.h

# Target based compilation
# Headers are pulled in through #include, so only the .c files are listed.

packetgen:
	$(CC) $(CFLAGS) packetgen.c common.c

.PHONY: clean
clean:
	rm -f *.o a.out main core packet_gen common executable

.PHONY: all
all: clean packetgen
```