Checking Distributed Systems

Introduction

This project basically tends to demenstrate the implementation of DMT systems in practical environment. Most DMT systems have a common assumption that the environment input remains unchanged for different runs. This assumption raises many doubts about practical distribution of DMT systems, because the outside environment is always an undecidalbe factor to the whole system.

How to make the DMT systems dealing with various external environment? Our observation is that most events will happen in an expected manner. For example, in Apache, the main thread is always receiving the first request from a client. Also, in aget (a multithreaded downloader), every network event is expected to obtain a specific piece of data from the server.

Besides the special events, it's possible to make the whole system determinitic with a DMT scheduler in most times. On the other hand, since the number of special, or say unexpected events, is supposed to be reasonably small, we tend to apply an input exploration algorithm to expose all possible patterns of unexpected events. For example, we can adopt a model checker to do this job.

Regular project meeting: 12pm - 1pm, every Wednesday in System Lab.

What's been done

We have a LD_PRELOAD based dynamic instrumentation library and a LLVM-based static instrumentation library. These libraries allow recording and controlling program execution.
We have a DMT scheduler framework and implementations of multiple scheduling algorithms, e.g. Round-Robin, First Come First Serve.
We are able to print execution trace in a log file and produce partial order view or total order view of the execution trace. So does trace signature.
We have setup multiple benchmarks, including aget, Apache, Berkeley DB replication. Setup of some other benchmarks (like MongoDB, COPS, and NYU's YFS) are still on going.

What's our current plan

Build a DMT scheduler algorithm that enforces most determinism even if the input enviroment changes. Some simple ideas could be applied in this algorithm:
1. Delay and batch input messages
2. Run network events as internal operations unless some unexpected behavior is observed. For example, unexpected delay or message content changes.
Application evaluation. Each application will be executed in a certain configuration with DMT for many times. We compare the difference between different runs to determine unexpected events and the portion of deterministic part. In this part, we want to demenstrate
1. The number of unexpected events is relatively small.
2. DMT system is applicable in most times.
More discussion on program behavior in different configurations.