Home | CV | Publications | Research | Quotes | Misc

 [Last updated: Oct. 2011]

This page provides a brief overview of my research interests and highlights some of my current and past projects. I work with Columbia's Intrusion Detection Systems, Network Security and Machine Learning labs.


 

NETWORK TRACE ANONYMIZATION

Network trace anonymization research seeks to provide ways for organizations to distribute traffic captures for research purposes without worrying about privacy loss. We are working on behavior-based traffic anonymization using a mixed-net approach (for offline data) on hosts with similar behavior. By modeling host-layer network behavior using time-series models then using these models to determine groups with similar behavior we can mix traffic among similar hosts to provide privacy from identification for these groups while minimizing the overall statistical-information loss within the dataset. In addition, many related techniques such as data compression and network behavior visualization emerges naturally due to our time-series-model and kernel-based approach.
 


ADVANCED SHELLCODE POLYMORPHISM

We are working on the design of a new polymorphic shellcode engine which combines several obfuscation techniques into one. Advanced encoding tactics such as multi-layer ciphers with random reversible operations, NOP insertion and others, are combined into one platform to generate highly obfuscated shellcode which yields no recognizable signature. This work is meant as a proof of concept that the era of signature based sensors is drawing to an end. A paper is being prepared for this work. More information will be made available later.
 

 

SPECTROGRAM: DETECTING WEB-LAYER CODE INJECTION ATTACKS

Web layer attacks are growing in complexity and scale. It is estimated by security experts that, in the year 2008 alone, there were hundreds of thousands, even up to millions, of web-layer code-injection attacks, with over 16,000 websites successfully compromised, per day. These attacks are primarily composed of PHP file-inclusion, Javascript cross-site-scripting and SQL injections. Columbia University's computer science department receives about one to two thousand of these attacks per day.

I designed a new machine-learning-based sensor which learns statistical models for legitimate input and can successfully detect the majority of these attacks as anomalies. The sensor, which we named Spectrogram, utilizes a mixture of Markov chains to model the content, and structure, of proper web-layer requests, and uses packet-reassembly to dynamically reconstruct the web-layer content stream while sitting at the network layer. This lets the sensor remain off-host but still being able to monitor multiple remote hosts, simultaneously.


The picture you see above is a geographic mapping of the sources of a sample of attacks against Columbia, as detected by Spectrogram. This work was published in the 2009 proceedings of the Network and Distributed Systems Security (NDSS) Symposium in San Diego California.

 



Means and covariances pull towards the ML solution.

SEMI-PARAMETRIC LEARNING

In machine learning we have the familiar independently distributed and independent identically distributed modeling paradigms. We introduce an interpolation between these extremes with a new method which we call independent similarly distributed (i.s.d.) sampling. The key insight is to view the i.i.d. case as a learning the i.d. case with equality constraints on all of the different models. We relax this constraint to be similar instead of equal using the Bhattacharyya affinity as the similarity metric and derive the i.s.d. posterior. We prove that this posterior is log-concave and consistent for certain families of distributions and derive the EM-like update rules for the Gaussian case. The result is an  interpolation of between i.d. and i.i.d using only a single parameter.

Here is video (856 kb) of our algorithm at work. Notice how as we adjust our single parameter, the Parzen windows converge towards the maximum likelihood solution. Notice the intermediate stages where the covariances form a band around data.
 
A paper based on this work will appear in NIPS this year.
 


UNSUPERVISED LEARNING FOR TIME-SERIES DATA

This project is representative of my general interests in time-series models. I connected several machine learning techniques into the time-series realm through the use of kernels between HMMs.

We had a lot of success clustering time series data such as the CMU motion capture data shown left. We took different running sequences and walking sequences, as well as synthesized sequences where we rotated the same sequences by certain degrees, and successfully clustered and embedded them.

A paper based on this work was accepted into ECML, here is our abstract:

Clustering has recently enjoyed progress via spectral methods which group data using only pair-wise affinities and avoid parametric assumptions. While spectral clustering of vector inputs is straightforward, extensions to structured data or time-series data remain less explored. We developed a clustering method for time-series data that couples non-parametric spectral clustering with parametric hidden Markov models (HMMs). HMMs add some beneficial structural and parametric assumptions such as Markov properties and hidden state variables which are useful for clustering. Probabilistic pair-wise kernel estimates between parametric models provides improved experimental results for unsupervised clustering and visualization of real and synthetic datasets.

The full paper is available on my publications page.
 


SHELLCODE POLYMORPHISM

Ever wonder how "hacking" works? The basic idea is that almost all personal computers are designed to hold data and instructions next to each other in memory and the operating systems does its best to keep track of which is which. Buffer overflow attacks, involves tricking the the target in to copying more data than it can handle, overflowing the data buffer and overwriting the return address of the function in execution, which, conveniently enough, is stored on the stack next to the static data buffer. This ends up overwriting the return address of that function. The attacker will overwrite it so that the return address points back into the data that he sent you -- that data contains his code, which is then executed.

The code he sends is called shellcode, and typically opens up a backdoor or downloads and executes another program. Shellcode is  typically pretty hard to write so good shellcode gets reused a lot. This is good for the defense community because they can detect when these code snippets show up in the network. But recently, the attackers have introduced polymorphic shellcode, code that mutates the payload by encrypting it and decrypting it dynamically using a decoder, which also mutates. This makes it very hard to detect incoming attacks since each instance of shellcode will look different. The image on the left shows a typical decoder stripped from an polymorphic engine used in the wild. My research applied machine learning techniques to analyzing these polymorphic techniques.

A paper based on this work was accepted into ACM CCS and is available on my publications page.
 



An early variant of the algorithm detects an armored personnel carrier (APC).


Detecting and tracking a tank.
 

AUTOMATIC TARGET RECOGNITION

Vehicle detection and tracking.

I designed a targeting system for infrared video, based on correlation filtering, spectral clustering and support vector machines. The dataset used was from Redstone Arsenal and consists of 50 Forward Looking Infrared (FLIR) videos, containing sequences of missiles closing-in on their targets.

Here is a
VIDEO (1.7 mb) of my targeting algorithm at work, tracking an M60 tank. My algorithm draws the crosshairs. This is the simpler version, without spectral clustering. There is a false positive in one of the frames (out of around 770) toward the end. I had to reduce the quality of the video a bit to keep the size down.

Here is another
VIDEO (1.6 mb). This one is a close-in sequence where the target isn't moving.

This work is on hold right now, I wish I had more time to work on it -- it's a really interesting topic -- but unfortunately many other projects have precedence over this. If any government entity wants to fund a revival of this project, I'm accepting checks, heh.


  SOME EARLIER INTERESTING PROJECTS
 
 

PROGRAMMING LANGUAGES

Development of the "Ninja Card Language" (NCL)
I was a part of a team of five students who designed and developed our own language for programming card games. The group was called The N.I.N.J.A C.L.A.N, a double recursive name that stood for "N.I.N.J.A C.L.A.N Is Not Just Another Card Language Authoring N.I.N.J.A C.L.A.N". Our language came to be called the Ninja Card Language, NCL for short, it's also recursive but very long so I'll skip the details. This was a project for a programming languages course and we did everything from designing the syntax of the language to building a complete working compiler.
 

 

 

 


The first runner program belongs to user A, the other two to user B.

OPERATING SYSTEMS

The following is a list of some of the interesting projects that I did for my Operating Systems course where we did a lot of kernel programming with the Red Hat 2.4 Kernel. This class made me interested in the systems security field.

Implementing a New Scheduler
The heart of an operating system lies in the kernel, a small program that makes all the important higher level decisions. One of the most important elements of the kernel is the scheduler which decides which tasks to run, at what time and for how long. Normally the kernel uses task oriented scheduling, giving the same amount of runtime for each round robin scheduled task. So if user A has 2 tasks running and user B has 1, then user A is taking up 67% of the CPU cycles. For this project, I implemented a new "User Weighted Round Robin" algorithm which schedules tasks according to the users themselves. Now user A gets about 25% of the CPU for each of his two tasks and user B gets about 50% for his one task. Moreover, the root can set a weight for each user. If user A has double the weight of user B, then A gets 67% of the CPU once again and B gets only 33%.

Adding an Access Control List (ACL)
In POSIX systems, we have the familiar -rwx-rwx-rwx- permission bits system. This divides the permission realm into three domains: owner, group and everyone else. This means that you can't explicitly deny/give one specific user access to a particular file while excluding everyone else. Access Control Lists gives you this ability. My implementation associates a special User ID indexed permission list with each file.

Kernel Level Firewall
I implemented a kernel level firewall for UDP and TCP protocols, allowing root to block packets that originate from banned IP addresses. The was a lot of socket programming involved and we wrote our own server and client applications.