Columbia Video Network Tool (CoVit)

Subhrendu Sarkar
Columbia University
New York, NY 10027
USA
ss3295@columbia.edu

Abstract

CoVit is a robust video network tool that sends and receives video data over the Internet in real time. The software captures video from a webcam, encodes the video frames and sends the data over the network to a receiver using RTP. The receiver receives the video data, decodes it and displays it. The operations of the tool can be remotely controlled from another application with the help of XML-RPC methods.

Introduction

There are many video communication tools available. However, very few of them offer the flexibility to choose among different coding formats. In addition, this software has been built to be reusable as a component, so that other applications can plug in this tool and use it to demonstrate video communication.

The software broadly consists of six components:
- video capture from a camera input.
- encoding into compressed video formats (currently supported formats: MPEG-1, MPEG-2, MPEG-4 and H.263).
- sending and receiving video frames using an existing RTP library.
- decoding from compressed video formats.
- rendering in a video window.
- remote control, so that CoVit can be used as a component.

This project primarily deals with video, but its design and interfaces can easily be extended to audio as well.
Protocols such as SIP, RTSP and H.323 are outside the scope of this project. The software does not include any call setup, session initiation or control protocols. CoVit is not concerned with signaling and assumes that it is handled by the user (either a human or an application) of this software tool.

Design Overview

Communication using RTP provides a common media transport layer without any dependencies on the signaling protocol or the application. The sender is responsible for capturing the video stream from a camera device (a webcam in our case). It can optionally compress the video stream into one of the supported compressed video formats; it then forms RTP packets and sends them to the receiver. The receiver listens for video streams on a designated port. It receives RTP packets from the network on that port, processes them, and extracts both the timing information associated with the video data and the video data itself. It then decompresses the media stream (video data in our case) and displays the video in a rendering window. The receiver is also in charge of deciding whether to render given data in the window, depending on the timing information of the associated packet. Figures 1 and 2 broadly lay out the flow of media from the sender to the receiver over RTP.
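To make the "form RTP packets" and "process the RTP packets" steps concrete, the following is a minimal sketch of the 12-byte fixed RTP header defined in RFC 3550. CoVit delegates this work to an existing RTP library, so the function names pack_rtp and parse_rtp are illustrative assumptions and not part of the tool. Header extensions and padding are deliberately not handled.

```python
import struct

RTP_VERSION = 2
# Fixed RTP header: V/P/X/CC byte, M/PT byte, sequence number,
# timestamp, SSRC, all in network byte order (12 bytes in total).
RTP_HEADER = struct.Struct("!BBHII")


def pack_rtp(payload_type, seq, timestamp, ssrc, payload, marker=False):
    """Build one RTP packet (hypothetical helper, not CoVit's API)."""
    byte0 = RTP_VERSION << 6  # no padding, no extension, zero CSRC entries
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                             timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp(packet):
    """Split a packet into its header fields and the video payload."""
    if len(packet) < RTP_HEADER.size:
        raise ValueError("packet shorter than the fixed RTP header")
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    if byte0 >> 6 != RTP_VERSION:
        raise ValueError("unsupported RTP version")
    csrc_count = byte0 & 0x0F
    offset = RTP_HEADER.size + 4 * csrc_count  # skip any CSRC identifiers
    return {
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,  # timing information used by the receiver
        "ssrc": ssrc,
        "payload": packet[offset:],
    }
```

For example, pack_rtp(34, 1, 90000, 0x1234, frame_bytes, marker=True) would wrap one encoded frame using payload type 34, the static assignment for H.263 in RFC 3551; the sequence number, timestamp and SSRC values here are arbitrary.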

Figure 1: The Media Sender

Figure 2: The Media Receiver
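The receiver's decision on whether to render a frame, based on the timing information of its packet, could be sketched as follows. This is only one possible policy, assumed for illustration: the first frame anchors the playout schedule and later frames are dropped when they arrive too late. The class name PlayoutClock and the 100 ms lateness threshold are assumptions; the 90 kHz clock rate is the one commonly used by RTP video payload formats.

```python
VIDEO_CLOCK_HZ = 90_000  # RTP timestamp rate commonly used for video


class PlayoutClock:
    """Hypothetical render-or-drop policy based on RTP timestamps."""

    def __init__(self, max_late_s=0.1):
        self.max_late_s = max_late_s  # assumed tolerance, not from CoVit
        self.base_ts = None           # RTP timestamp of the first frame
        self.base_arrival = None      # local arrival time of the first frame

    def should_render(self, rtp_timestamp, arrival_s):
        """Return True if the frame is on time enough to be displayed."""
        if self.base_ts is None:
            # The first frame defines the mapping from media time to local time.
            self.base_ts = rtp_timestamp
            self.base_arrival = arrival_s
            return True
        # Media time elapsed since the first frame; the mask handles
        # wraparound of the 32-bit RTP timestamp.
        elapsed_ticks = (rtp_timestamp - self.base_ts) & 0xFFFFFFFF
        due_s = self.base_arrival + elapsed_ticks / VIDEO_CLOCK_HZ
        return arrival_s - due_s <= self.max_late_s
```

In use, the receiver would call should_render with the timestamp extracted from each packet and the local arrival time (for example from time.monotonic()), and skip rendering when it returns False.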

Architecture

The software is broadly divided into three layers:
1. The XML-RPC module:
This is the topmost layer and is primarily concerned with remote control of the tool by another user. It also acts as an interface between the remote user and the underlying communication and capture layers of the software. This layer exposes well-defined APIs both for the remote user to access and for interacting with the underlying RTP and capture modules (described below).

2. The RTP module:
This middle layer is responsible for generating RTP packets (sender) or receiving RTP packets from the network (receiver). Apart from handling the RTP protocol details, this layer is also responsible for encoding or decoding the video data. It also acts as a channel through which the XML-RPC layer can communicate with the lower capture layer when required, for instance to configure the camera. A remote user can thus configure the camera capture formats with the help of the RTP layer APIs.

3. The capture module:
This lowest layer deals with capturing video streams from a camera device. This module provides a generic interface to initialize and interact with video device drivers. The capture module continuously captures video data from the device and provides the data to the middle layer.

Figure 3 shows the interdependencies between the different layers of the software. The figure shows the model for a single video stream capture, sender and receiver. CoVit supports multiple video sending streams and multiple video receiving streams. There is always exactly one XML-RPC module, whereas there are multiple RTP, capture and rendering modules when there are multiple streams.

Figure 3: The Layered Architecture (for a single video stream capture, sender and receiver)
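The layering described above, in which the XML-RPC layer reaches the capture layer only through the RTP layer, can be illustrated with a small sketch. All class and method names below are hypothetical and chosen only to mirror the three layers; they are not CoVit's actual interfaces.

```python
class CaptureModule:
    """Lowest layer: talks to one video device (hypothetical interface)."""

    def __init__(self, device):
        self.device = device
        self.frame_format = None

    def configure(self, width, height, pixel_format):
        self.frame_format = (width, height, pixel_format)


class RtpModule:
    """Middle layer: owns one capture module and forwards configuration."""

    def __init__(self, capture):
        self._capture = capture

    def configure_capture(self, width, height, pixel_format):
        # The upper layer never touches the capture module directly.
        self._capture.configure(width, height, pixel_format)

    def capture_format(self):
        return self._capture.frame_format


class XmlRpcModule:
    """Top layer: a single instance managing one RTP module per stream."""

    def __init__(self):
        self._senders = {}  # stream identifier -> RtpModule

    def add_send_stream(self, stream_id, device):
        self._senders[stream_id] = RtpModule(CaptureModule(device))

    def set_capture_format(self, stream_id, width, height, pixel_format):
        self._senders[stream_id].configure_capture(width, height, pixel_format)

    def get_capture_format(self, stream_id):
        return self._senders[stream_id].capture_format()
```

The single XmlRpcModule holding a dictionary of RtpModule objects reflects the one-to-many relationship between the XML-RPC module and the per-stream modules.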

The software is built on a threading model.
The threads can be categorized as follows:
1. The XML-RPC Server
2. The RTP Sender
3. The RTP Receiver
4. The Capture
When the software tool is started, only the XML-RPC Server thread is active and running. The other three threads are created and run as and when requested by the user. The XML-RPC server thread continuously listens on a certain port (the port number is specified in the command line arguments when the tool is run).
E.g.: ./covit.o 8082 (which means the XML-RPC server is listening on port 8082)
Depending on the commands from the remote user, the remaining three threads are created and start running.
E.g.: When the remote user wants the tool to send RTP packets with video, both the RTP sender thread and the capture thread are created and run.
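The startup behaviour described above, with one always-on XML-RPC server that creates worker threads on demand, can be sketched with Python's standard xmlrpc.server module. The method names start_send, stop_stream and active_streams are hypothetical and do not come from CoVit's real API, and the worker threads here are placeholders that only wait to be stopped.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer


class CovitControl:
    """Hypothetical remote-control object; names are illustrative only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._streams = {}  # stream identifier -> (stop event, worker threads)

    def start_send(self, stream_id, host, port):
        """Create the capture and RTP sender threads for a new send stream."""
        with self._lock:
            if stream_id in self._streams:
                return False  # identifiers must be unique
            stop = threading.Event()
            workers = [
                threading.Thread(target=self._worker, args=(stop,),
                                 name=f"{role}-{stream_id}", daemon=True)
                for role in ("capture", "rtp-sender")
            ]
            for worker in workers:
                worker.start()
            self._streams[stream_id] = (stop, workers)
            return True

    def stop_stream(self, stream_id):
        """Stop the stream's threads and forget its identifier."""
        with self._lock:
            entry = self._streams.pop(stream_id, None)
        if entry is None:
            return False
        stop, workers = entry
        stop.set()
        for worker in workers:
            worker.join()
        return True

    def active_streams(self):
        with self._lock:
            return sorted(self._streams)

    @staticmethod
    def _worker(stop):
        # Placeholder for the real capture or send loop.
        stop.wait()


def serve(port):
    """Run the XML-RPC server on the given port until interrupted."""
    server = SimpleXMLRPCServer(("0.0.0.0", port), logRequests=False)
    server.register_instance(CovitControl())  # underscore methods stay private
    server.serve_forever()
```

Calling serve(8082) would correspond to the command line example above. A remote application could then issue commands with xmlrpc.client.ServerProxy("http://localhost:8082"), for instance proxy.start_send(1, "192.0.2.10", 5004), where the address and port are placeholder values.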

The XML-RPC Server thread is always on and listening for incoming commands from the remote user. The RTP sender thread is created and starts running when the remote user gives a command to the XML-RPC Server to send data to a receiver. Such a command also starts the capture thread.

The software maintains a circular list of media buffers which it uses for processing the continuous video media stream. The circular list holds 30 buffers, each of which holds one encoded video frame. The capture thread keeps grabbing frames from the video input device and pushes each grabbed frame into one of the available media buffers. The RTP sender thread processes the video frames by pulling them out of the media buffers being filled by the capture thread. The media buffers are a shared resource between the capture thread and the RTP sender thread, so accesses to them are protected by appropriate mutexes. The capture thread, which operates on the input device, may also have an internal list of buffers if the video driver is implemented in that fashion. The implementation and design of the RTP sender is, however, completely independent of the design and model of the underlying video device driver.

The operation of the RTP receiver thread is simpler in this context. It listens on a certain port number for RTP packets, decodes the video frames it receives from the network and displays them in a rendering window. The receiver is also responsible for deciding whether to display a video frame depending on the timing information associated with it.

CoVit supports multiple camera capture devices, multiple RTP senders and multiple RTP receivers. Each capture thread and RTP sender pair shares a unique video send stream identifier. Similarly, each RTP receiver is identified by a unique video receive stream identifier. The user is responsible for creating video streams (send streams and receive streams) by specifying unique video stream identifiers, and for subsequently stopping those video streams. CoVit frees all resources associated with a video stream identifier and removes the identifier when the XML-RPC server receives a command to stop that video stream.
Figure 4 shows the Threading Model.

Figure 4: The Threading Model
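The shared circular list of media buffers between the capture thread and the RTP sender thread can be sketched as a mutex-protected ring. This is a minimal illustration under stated assumptions: the class name MediaBufferRing is hypothetical, and the choice to block the producer when all buffers are full (rather than drop frames) is an assumption, since the text does not say what CoVit does in that case.

```python
import threading

NUM_BUFFERS = 30  # number of media buffers mentioned in the text


class MediaBufferRing:
    """Hypothetical fixed-size circular list shared by two threads."""

    def __init__(self, size=NUM_BUFFERS):
        self._slots = [None] * size
        self._head = 0    # index of the oldest unread frame
        self._count = 0   # number of frames currently stored
        self._cond = threading.Condition()  # wraps the protecting mutex

    def push(self, frame):
        """Called by the capture thread; waits while the ring is full."""
        with self._cond:
            while self._count == len(self._slots):
                self._cond.wait()
            tail = (self._head + self._count) % len(self._slots)
            self._slots[tail] = frame
            self._count += 1
            self._cond.notify_all()

    def pop(self):
        """Called by the RTP sender thread; waits while the ring is empty."""
        with self._cond:
            while self._count == 0:
                self._cond.wait()
            frame = self._slots[self._head]
            self._slots[self._head] = None
            self._head = (self._head + 1) % len(self._slots)
            self._count -= 1
            self._cond.notify_all()
            return frame
```

A capture thread would call push for every grabbed frame while the sender thread loops on pop; frames come out in the order they were captured, and neither thread touches the slots without holding the lock.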

Program Documentation