updated: Oct. 2011]
This page provides a brief overview of my research interests and
some of my current and past projects. I work with Columbia's
NETWORK TRACE ANONYMIZATION
Network trace anonymization research seeks to provide ways for organizations to distribute traffic captures for research purposes without worrying about privacy loss.
We are working on behavior-based traffic anonymization using a mixed-net approach (for offline data) on hosts with similar behavior. By modeling host-layer network behavior using time-series models then using these models to determine
groups with similar behavior we can mix traffic among similar hosts to provide privacy from identification for these groups while minimizing the overall statistical-information loss within the dataset.
In addition, many related techniques such as data compression and network behavior visualization emerges naturally due to our time-series-model and kernel-based approach.
ADVANCED SHELLCODE POLYMORPHISM
We are working on the design of a new
polymorphic shellcode engine which combines several
obfuscation techniques into one. Advanced encoding tactics
such as multi-layer ciphers with random reversible
operations, NOP insertion and others, are combined into
one platform to generate highly obfuscated shellcode which
yields no recognizable signature. This work is meant as a
proof of concept that the era of signature based sensors
is drawing to an end. A paper is being prepared for this
work. More information will be made available later.
WEB-LAYER CODE INJECTION ATTACKS
Web layer attacks are growing in
complexity and scale. It is estimated by security experts
that, in the year 2008 alone, there were hundreds of
thousands, even up to millions, of web-layer code-injection
attacks, with over 16,000 websites successfully compromised,
per day. These attacks are primarily composed of PHP
injections. Columbia University's computer science department
receives about one to two thousand of these attacks per day.
I designed a
new machine-learning-based sensor which learns statistical
models for legitimate input and can successfully detect the
majority of these attacks as anomalies. The sensor, which we
named Spectrogram, utilizes a mixture of Markov chains to
model the content, and structure, of proper web-layer
requests, and uses packet-reassembly to dynamically
reconstruct the web-layer content stream while sitting at the
network layer. This lets the sensor remain off-host but still
being able to monitor multiple remote hosts, simultaneously.
The picture you
see above is a geographic mapping of the sources of a sample
of attacks against Columbia, as detected by Spectrogram. This
work was published in the 2009 proceedings of the Network and
Distributed Systems Security (NDSS) Symposium in San Diego
Means and covariances pull towards
the ML solution.
In machine learning we have the familiar independently
distributed and independent identically distributed modeling
paradigms. We introduce an interpolation between these
extremes with a new method which we call independent
similarly distributed (i.s.d.) sampling. The key insight
is to view the i.i.d. case as a learning the i.d. case with
equality constraints on all of the different models. We relax
this constraint to be similar instead of equal using the
Bhattacharyya affinity as the similarity metric and derive the
i.s.d. posterior. We prove that this posterior is log-concave
and consistent for certain families of distributions and
derive the EM-like update rules for the Gaussian case. The
result is an interpolation of between i.d. and i.i.d
using only a single parameter.
Here is video (856 kb) of our algorithm
at work. Notice how as we adjust our single parameter, the Parzen windows converge towards the maximum likelihood
solution. Notice the intermediate stages where the covariances
form a band around data.
A paper based on this work will appear in NIPS this year.
UNSUPERVISED LEARNING FOR TIME-SERIES DATA
This project is representative of my general interests in
time-series models. I connected several machine learning
techniques into the time-series realm through the use of
kernels between HMMs.
We had a lot of success clustering time series data such as
the CMU motion capture data shown left. We took different
running sequences and walking sequences, as well as
synthesized sequences where we rotated the same sequences by
certain degrees, and successfully clustered and embedded them.
A paper based on this work was accepted into ECML, here is our
Clustering has recently enjoyed progress via spectral methods which group data using only pair-wise affinities and avoid parametric assumptions. While spectral clustering of vector inputs is straightforward, extensions to structured data or
time-series data remain less explored. We developed a
clustering method for time-series data that couples non-parametric
spectral clustering with parametric hidden Markov models (HMMs).
HMMs add some beneficial structural and parametric assumptions
such as Markov properties and hidden state variables which are
useful for clustering. Probabilistic pair-wise kernel estimates between
parametric models provides improved experimental results for
unsupervised clustering and visualization of real and
The full paper is available on my publications page.
Ever wonder how "hacking" works? The basic idea is that almost all personal computers are designed to hold data and instructions next to each other in memory and the operating systems does its best to keep track of which is which. Buffer overflow attacks, involves tricking the the target in to copying more data than it can handle, overflowing the data buffer and overwriting the return address of the function in execution, which, conveniently enough, is stored on the stack next to the static data buffer. This ends up overwriting the return address of that function. The attacker will overwrite it so that the return address points back into the data that he sent you -- that data contains his code, which is then executed.
The code he sends is called shellcode, and typically opens up a backdoor or downloads and executes another program. Shellcode is typically pretty hard to write so good shellcode gets reused a lot. This is good for the defense community because they can detect when these code snippets show up in the network. But recently, the attackers have introduced polymorphic shellcode, code that mutates the payload by encrypting it and decrypting it dynamically using a decoder, which also mutates. This makes it very hard to detect incoming attacks since each instance of shellcode will look different. The image on the left shows a typical decoder stripped from an polymorphic engine used in the wild. My research applied machine learning techniques to analyzing these polymorphic techniques.
A paper based on this work was accepted into ACM CCS and is available on my publications page.
An early variant of the algorithm detects an armored personnel carrier (APC).
Detecting and tracking a tank.
AUTOMATIC TARGET RECOGNITION
Vehicle detection and tracking.
designed a targeting system for infrared video, based on correlation filtering, spectral clustering and support vector machines. The dataset used was from Redstone Arsenal and consists of 50 Forward
Looking Infrared (FLIR) videos, containing sequences of
missiles closing-in on their targets.
Here is a
VIDEO (1.7 mb) of my targeting
algorithm at work, tracking an M60 tank. My algorithm draws
the crosshairs. This is the simpler
version, without spectral clustering. There is a false
positive in one of the frames (out of around 770) toward the
end. I had to reduce
the quality of the video a bit to keep the size down.
Here is another
(1.6 mb). This one is a close-in sequence where the target
This work is on hold right now, I wish I had more time to work on it -- it's a really interesting topic -- but unfortunately many other projects have precedence over this. If any government entity wants to fund a revival of this project, I'm accepting checks, heh.
||SOME EARLIER INTERESTING PROJECTS
Development of the "Ninja Card
I was a part of a team of five students who designed and
developed our own language for programming card games. The group was called
The N.I.N.J.A C.L.A.N, a double recursive name that stood for
"N.I.N.J.A C.L.A.N Is Not Just Another
Card Language Authoring N.I.N.J.A
C.L.A.N". Our language came to be called the Ninja Card
Language, NCL for short, it's also recursive but very long so
I'll skip the details. This was a project for a programming
languages course and we did everything from designing the
syntax of the language to building a complete working
The first runner program belongs to user A, the other two to
The following is a list of some
of the interesting projects that I did for my Operating Systems course where we did a lot of kernel programming with the Red Hat 2.4 Kernel. This class made me interested in the systems security field.
The heart of an operating system lies in the kernel, a small
program that makes all the important higher level decisions.
One of the most important elements of the kernel is the
scheduler which decides which tasks to run, at what time and
for how long. Normally the kernel uses task oriented
scheduling, giving the same amount of runtime for each round
robin scheduled task. So if user A has 2 tasks running and
user B has 1, then user A is taking up 67% of the CPU cycles.
For this project, I implemented a new "User Weighted Round
Robin" algorithm which schedules tasks according to the users
themselves. Now user A gets about 25% of the CPU for each of
his two tasks and user B gets about 50% for his one task.
Moreover, the root can set a weight for each user. If user A
has double the weight of user B, then A gets 67% of the CPU
once again and B gets only 33%.
Adding an Access Control List
In POSIX systems, we have the familiar -rwx-rwx-rwx-
permission bits system. This divides the permission realm into
three domains: owner, group and everyone else. This means that
you can't explicitly deny/give one specific user access to a
particular file while excluding everyone else. Access Control
Lists gives you this ability. My implementation associates a special User ID
indexed permission list with each file.
Kernel Level Firewall
I implemented a kernel level firewall for UDP and TCP protocols, allowing root to block packets that originate from banned IP addresses. The was a lot of socket programming involved and we wrote our own server and client applications.