Recent updates

Feb 29, 2012

Application's dependency on physical time turns out to be an obstacle on the road of DMT. Take a glimpse at the following code appeared in YFS.

void async_call(int socket, timespec finaldeadline)
{
  do {
    clock_timespec(now);
    nextdeadline = now + delta;
    if (pthread_cond_timedwait(sth, nextdeadline))
      process message queue and return;
  } while (nextdeadline < finaldeadline);
}

First, function pthread_cond_timedwait itself is non-deterministic. The physical time elapses all the way so one cannot guarantee a precise waiting time of delta. Secondly, it's hard to determine how long this timedwait is gonna take. The variant cost of timedwait leads to an undecidable number of loop iterations.

In addition, physical time also matters in the interaction between different nodes in the distributed system. Think about the lease protocol. The lease server grants a client a lease for one minute. And the client guarantees the period by doing pthread_cond_timedwait with a deadline after 50 seconds. However, if the physical time is converted into turn numbers, the client is not able to make the same physical time guarantee it used to make.

To solve this issue, we need sort of timing model of physical time in applications.

ST-local

Single threaded local programs doesn't care about physical time at all.

This assertion is easy to make: if you only have one thread, and you don't interact with any other threads, processes or remote party, why do you need physical time?

MT-local

In multithreaded local programs, physical time operations are based on the assumption of enough CPU speed.

Let's look at sleep and timedwait operation. A thread makes a sleep mostly for two purposes, waiting for some operations to finish, or delay the current thread to save CPU resources. On the other hand, the length of physical time used in timewait operation basically means some operations are expected to be finished within the period of time.

MT-distributed

Between two nodes in the distributed system, physical time ise used to measure network latency. In addition, the physical time eclapsing speed should be almost the same on different nodes.

Applicaion should not make any assumption of the absolute clock on another computer.

Epoch-based Timing Model

Based on discussions above, here we propose a realime model for DMT systems based on the concept of epoch: an epoch is a small period of time, when each core processes at most one task. Here we change the concept of epoch a little bit. At most one thread is allowed to run on one CPU in a single epoch, and in addition, the length of epoch is almost fixed.

With the definition of epoch, the good thing is that physical time can be measured by the number of epochs taken. On the other hand, making epochs with a fixed length slows down the execution. While this approach is very similar with the one that fixes the realtime length of each turn, it ticks the clock by the computation capabitliy, instead of the number of threads/turns. The latter leads to abnormal faster (measured by turns) or slower (measured by threads) speed of time elapsing.

Question 1. How timedwait and sleep work with epochs?

We can do almost precise convertion between turn number and realtime. Suppose each epoch's length is .1 ms, and a thread is goning to sleep for 100 ms. We can convert the time into 1,000 turns. When the thread wakes up from sleeping, the real clock elapses approximately 0.1 * 1000 = 100 ms.

Question 2. How lease protocol works with epochs?

The issue here is we want to make the time elapsing in the same speed on different machines. This property can be implemented based on the clock of local computer because the real clocks are running in almost the same speed. If we have the assumption, lease protocol will work like this. The lease server issues a lease with one minute's availability including 5 seconds’ roundtrip cost. When the client receives the lease, it launches a thread waiting for 40 seconds, and the thread then returns the lease after waken up. Based on the assumption, 40 seconds on the client takes approximately 40 seconds on the server, the lease availability is guaranteed.

Question 3. What if the system is suffering load imbalance?

Epoch + CPU approach is able to measure the computation scalability of a machine. When a machine is running too many threads/processes, it suffers an inherent slow down and epoch approach only makes an extra slow down.

Question 4. What if an operatoins takes longer than an epoch?

This should be an interesting problem. If you explored any potential performance optimization of the epoch approach, you would find out that the longer operations are certainly be an issue.

First, the scheduler should always approach the exact epoch point in execution: If an operation finishes before the next epoch point, we delay it. And if it finishes after the next epoch point, we launch the next operation immediately and hope it could balance the delay.

Question 5. How's the potential performance issue?

A simple solution is balance epoch length and operation cost. If we set the epoch length long, we have determinism but the application is slowed down. If we shorten the epoch length, the number of longer operations increases and will lead to non-determinism, but we gain a better performance. An easy way to obtain an appropriate conversion rate is sampling.

An alternative optimization is to make coarse-grain clock tick. That allows more chance to balance delay of late operation.

Implementation Plan

Hook real time functions. (done)
Add real time checker in schedule functions like getTurn
Merge previous code of realtime-turn conversion.
Add tests (almost done, but lack of inter-proc test)

Feb 12, 2012

Round-Robin scheduler is making a huge overhead without knowledge of blocking operations! In this case we have to adopt a monitor thread checking and skiping blocked threads. Because we have to wait for a reasonable time to make sure the operation is not gonna return in this round. This blocks other thread for a threshold lenght every round!

In YFS, the configuration has to wait for 20 seconds for each client's setup. This time slot used to be 1 second without DMT system!

Feb 9, 2012

Some notes on the handling of physical time.

It seems like some codes rely on physical time to accomplish operations, and a timed out could be treated as error or something.
Round-Robin scheduler could cause thread imblance in terms of physical time. Because some operations finsh quickly, while others take some time.
How to handle timedwait and sleep will an issue.

Feb 8, 2012

Discussion today focuses on evaluation and measurement. Basically we have two targets before next week's meeting.

The first is to make a latency analysis of each operation, and to use this data to figure out expected events and unexpected events step-by-step.
Secondly, after building a working DMT scheduler with expected/unexpected events, try to obtain the approximate number of essential schedules required for an application.

Here are the working items.

Enable recording latency in log files together with insid. (Check with YiHong)
Enable Round-Robin scheduler with Jingyang's YFS. (Huayang)
Try to speed up Round-Robin schedule from the ugly wait-skip manner.
Setup benchmark configurations.
1. AGET (done). Download www.cs.columbia.edu/~huayang/files/demeter.pdf with 4 threads.
2. YFS. 3 lock servers, 1 extent server and 2 yfs_clients. Execute a fixed sequence of operations on each clients in parallel.
Do latency analysis for AGET and YFS.
Figure out expected/unexpected events in the benchmarks.
Run YFS 1000 times to get the approx number of schedules.
Try some other configurations.

Feb 1, 2012

Discussion today focuses on how to demenstrate expected and unexpected in real programs. The basic intuition here is to look at the expected response time of a blocking function. Generally, an expected event should behave like following.

Latency Graph

This graph shows the relationship between response time and possibility.
x axis shows the response time, from 0 to INF.
y axis shows the possibility.
An expected event should have most responses appear in a relatvely short interval.
Events out of the expected interval are treated as unexpected event.

Jan 29, 2012

Design for a non-blocking runtime scheduler algorithm.

A simple timer to skip threads?

Dec 7, 2011

Currently the project focuses on measuring the effect of changing external inputs for DMT system. So the concern is mostly about timing issue.

What's done.

Instrumented external blocking functions. One schedule point is inserted before the function call, and another schedule point is inserted after the call. Instrumented functions include accept(), connect(), recv(), recvfrom(), recvmsg(), read(), select().
Being able to record a message trace. The basic trace is a sequence of tuple (tid, op, turn).
Being able to reproduce (prefix of) the input message timing. The current algorithm is to enforce schedule points after external call.
Being able to run a modified the message trace (prefix). There are two possible outputs for a modified message trace (prefix): either a successful run with corresponding new message trace, or an error message like invalid message trace (prefix).
Being able to merge message trace and synchronization operations for a global (partial order) view.

What's next.

A scheduling algorithm works well for arbitrary external input timing? Three system models here.
- Infinite looping even without new input coming (Bound it)
- Concurrent transaction model (Round-Robin)
- Sequential transaction model
Timing exploration algorithm.
- Earliest turn of (returning) scheduling point. (By default it's the next turn of entering schedule point)
- Search granularity. (By default it explores all possible the turn numbers of each operation)

Sequantial transaction based scheduling algorithm

void block()	//	schedule point before ex_call
{
    Remove the thread from scheduler
}
void wakeup()	//	schedule point after ex_call
{
    while (true)
    {
        lock(scheduler);
        if Round-Robin-Queue.empty()
        {
            add myself into the schedule
        } else
        {
            donothing
        }
        unlock(scheduler);
    }
    general schedule
}

Timing Exploration Algorithm

void DFS(Sequence s, int prefix_fixed)
{
  DFS(s, prefix_fixed + 1); // fix the next operation
  op = s.get_operation(prefix_fixed);  // the next variable operation
  new_op_rt = op.inc_rt();
  s_1 = s.prefix(prefix_fixed) + op;
  s_2 = run_DMT_with_prefix(s_1);  // notice op's real return time might not be new_op_rt, it's determined by the DMT scheduler.
                                  // In addition, the next operation might not be op, and other operations can be brought forward.
  if (s_2 is valid)
    DFS(s_2, prefix_fixed);
}