Spring 2009

January 21

Rob Thomas, Cymru

January 28 - Spectrogram: A Mixture-of-Markov-Chains Model for Anomaly Detection in Web Traffic

Yingbo Song, Columbia PhD Student

We present Spectrogram, a machine-learning-based statistical anomaly detection (AD) sensor for defense against web-layer code-injection attacks. These attacks include PHP file inclusion, SQL injection, and cross-site scripting; memory-layer exploits such as buffer overflows are addressed as well. Statistical AD sensors offer the advantage of being driven by the data being protected rather than by malcode samples captured in the wild. While models using higher-order statistics can often improve accuracy, trade-offs with false-positive rates and model efficiency remain a limiting usability factor. This paper presents a new model and sensor framework that offers a favorable balance under this constraint and demonstrates improvement over some existing approaches. Spectrogram is a network-situated sensor that dynamically assembles packets to reconstruct content flows and learns to recognize legitimate web-layer script input. We describe an efficient model for this task in the form of a mixture of Markov chains and derive the corresponding training algorithm. Our evaluations show significant detection results on an array of real-world web-layer attacks, comparing favorably against other AD approaches.
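
Spectrogram's actual model and its EM-based training are considerably more involved, but the core idea of scoring request strings under a mixture of Markov chains can be illustrated with a toy sketch. Everything concrete below is invented for illustration: the training strings, the smoothing constant, and the degenerate single-chain "mixture".

```python
from collections import defaultdict
import math

def train_chain(samples):
    """Estimate first-order Markov transition probabilities from
    legitimate request strings by simple counting."""
    counts, totals = defaultdict(float), defaultdict(float)
    for s in samples:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return {k: v / totals[k[0]] for k, v in counts.items()}

def score(s, chains, weights):
    """Per-transition normalized log-likelihood of s under a mixture
    of Markov chains; low scores indicate anomalous input."""
    total = 0.0
    for w, p in zip(weights, chains):
        like = 1.0
        for a, b in zip(s, s[1:]):
            like *= p.get((a, b), 1e-6)   # smooth unseen transitions
        total += w * like
    return math.log(total) / max(len(s) - 1, 1)

# Hypothetical legitimate query-parameter strings:
legit = ["id=123", "id=231", "id=312"]
chain = train_chain(legit)
normal = score("id=123", [chain], [1.0])
attack = score("id=1;x", [chain], [1.0])   # unseen ';' transition tanks the score
```

A real deployment would train several chains over reconstructed content flows and learn the mixture weights with EM, but the scoring step has this shape.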

February 4 - Autonomous Security for Autonomous Systems: Pretty Good BGP

Josh Karlin, University of New Mexico

The Internet's interdomain routing protocol, BGP, supports a complex network of Autonomous Systems which is vulnerable to a number of potentially crippling attacks. Several promising cryptography-based systems have been proposed, but they have not been widely deployed. Rather than force centralized control on a distributed network, the talk will describe a distributed anomaly detection and response system that is both effective and incrementally deployable. The talk will also describe preliminary results of my current research, which explores the potential impact of political policies, such as censorship and wiretaps, on interdomain routing.
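
Pretty Good BGP's central heuristic can be sketched compactly: remember which origin ASes have historically announced each prefix, and treat a route from an unfamiliar origin as suspicious (and depreferenced) for a quarantine period, rather than rejecting it outright. The class name, interface, and constants below are illustrative, not PGBGP's actual implementation.

```python
import time

class PGBGPLite:
    """Toy sketch of PGBGP's anomaly heuristic: a new origin AS for a
    known prefix is quarantined (depreferenced) before being trusted."""
    QUARANTINE = 24 * 3600            # seconds a new origin stays suspicious

    def __init__(self):
        self.first_seen = {}          # (prefix, origin AS) -> first observation time

    def classify(self, prefix, origin, now=None):
        now = time.time() if now is None else now
        first = self.first_seen.setdefault((prefix, origin), now)
        return "trusted" if now - first >= self.QUARANTINE else "suspicious"
```

A router would prefer "trusted" routes when any exist, so a hijack announced by an unfamiliar origin is routed around without central coordination, which is what makes the scheme incrementally deployable.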

February 11 - Secure Anonymous Database Search

Mariana Raykova and Binh Vo, Columbia PhD Students

There exist many large collections of private data that must be protected on behalf of the entities that hold them or the clients they serve. However, there are also often many legitimate reasons to share portions of this data with other parties. For example, two intelligence agencies might be willing to cooperate by sharing documents about a specific case, yet be unwilling to allow each other open access to their data.

We address the problem of allowing such entities to search each other's data in a way that conceals the identities of both participants from each other, as well as the contents of the queries and of documents unrelated to them. In order to achieve this in an efficient and practical manner, we make use of Bloom filters and of a novel encryption primitive: reroutable encryption.
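
The full protocol (including reroutable encryption and the anonymity properties) is beyond a short sketch, but the Bloom-filter ingredient can be illustrated: with hashes keyed by a shared secret, a party can test keyword membership against an index that reveals neither the indexed keywords nor the query to anyone without the key. The sizes, key, and keywords below are all made up for illustration.

```python
import hmac, hashlib

class BloomFilter:
    """Keyed Bloom filter: membership is only testable with the key."""
    def __init__(self, size=1024, hashes=4, key=b"example-shared-key"):
        self.size, self.hashes, self.key = size, hashes, key
        self.bits = bytearray(size)

    def _positions(self, word):
        # Derive `hashes` bit positions from an HMAC over (index, word).
        for i in range(self.hashes):
            mac = hmac.new(self.key, b"%d:%s" % (i, word.encode()), hashlib.sha256)
            yield int.from_bytes(mac.digest()[:4], "big") % self.size

    def add(self, word):
        for p in self._positions(word):
            self.bits[p] = 1

    def might_contain(self, word):
        # No false negatives; false positives at a tunable (small) rate.
        return all(self.bits[p] for p in self._positions(word))

# Hypothetical index over one document's keywords:
bf = BloomFilter()
for kw in ("budget", "case-42"):
    bf.add(kw)
```

Each document's filter can then be matched against an encrypted query without exposing unrelated document contents, which is the efficiency trick the abstract alludes to.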

February 18 - Advances in Privacy-Preserving Machine Learning

Claire Monteleoni, CCLS

This talk introduces the problem of privacy-preserving machine learning, and some recent results. The goal of privacy-preserving machine learning is to provide machine learning algorithms that adhere to strong privacy protocols, yet are useful in practice. As increasing amounts of sensitive data are being digitally stored and aggregated, maintaining the privacy of individuals is critical. However, learning cumulative patterns, such as disease risks from medical records, could benefit society. Our work on privacy-preserving machine learning seeks to facilitate a compromise between these two opposing goals by providing general techniques for designing algorithms that learn from private databases while managing the inherent trade-off between privacy and learnability.

I will present a new method for designing privacy-preserving machine learning algorithms. Researchers in the cryptography and information security community [Dwork et al. '06] showed that if any function learned from a database is randomly perturbed in a certain way, the output respects a very strong privacy definition. The amount of perturbation depends on the function, however, and could render the output ineffectual for machine learning purposes. We introduce a new paradigm: perturb the optimization problem, instead of its solution, for functions learned via optimization. It turns out that, for a canonical machine learning algorithm, regularized logistic regression, our new method yields a significantly stronger learning performance guarantee, and demonstrates improved empirical performance over the previous approach, while adhering to the same privacy definition. Our techniques also apply to a broad class of convex loss functions.

This talk is based on joint work with Kamalika Chaudhuri (UC San Diego).
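
The "perturb the problem, not the solution" idea above can be sketched for one-dimensional regularized logistic regression: add a random linear term b·w/n to the objective, then minimize the perturbed objective exactly as usual. The noise scale, learning rate, and data below are illustrative and do not reproduce the paper's exact privacy calibration.

```python
import math, random

def private_logreg_1d(xs, ys, eps, lam=0.1, iters=500, lr=0.1, seed=0):
    """Objective-perturbation sketch: minimize
        (1/n) sum log(1 + exp(-y_i x_i w)) + (lam/2) w^2 + b*w/n
    by gradient descent, where b is random noise whose magnitude
    shrinks as the privacy parameter eps grows."""
    rng = random.Random(seed)
    n = len(xs)
    # Illustrative noise: |b| ~ Gamma(1, 2/eps), random sign.
    b = rng.choice([-1, 1]) * rng.gammavariate(1, 2.0 / eps)
    w = 0.0
    for _ in range(iters):
        grad = lam * w + b / n
        for x, y in zip(xs, ys):
            grad += (-y * x) / (1.0 + math.exp(y * x * w)) / n
        w -= lr * grad
    return w

# Toy separable data; with a weak privacy constraint (large eps) the
# perturbation is tiny and the learned w recovers the right sign.
w = private_logreg_1d([1.0, 2.0, -1.0, -2.0], [1, 1, -1, -1], eps=100.0)
```

The point of the construction is that the perturbed problem is still a well-behaved convex optimization, so accuracy degrades gracefully with the privacy budget rather than being destroyed by output noise.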

March 4 - Smashing the Stack with Hydra: The Many Heads of Advanced Shellcode Polymorphism

Pratap Prabhu and Yingbo Song, Columbia Students

Recent work on the analysis of polymorphic shellcode engines suggests that modern obfuscation methods will soon eliminate the usefulness of signature-based network intrusion detection methods, and supports the growing view that the new generation of shellcode cannot be efficiently and accurately represented by the string signatures on which current IDS and AV scanners rely. We extend this prior work with additional empirical evidence in the form of a new polymorphic engine that we designed and implemented.

The Hydra polymorphic shellcode engine distinguishes itself by integrating an array of obfuscation techniques into one package while offering improvements upon existing strategies. Hydra was developed to present an updated view of the frontier of modern polymorphic shellcode, as well as to provide an effective tool for the evaluation of IDSes, cyber test ranges, and other related security technologies. The talk highlights the key components of shellcode obfuscation, such as multi-layer ciphering and NOP injection, and describes our extensions thereof.

March 11 - An Anonymous Product Delivery System

Elli Androulaki, Columbia PhD Student

Delivery of products bought online can violate customers' privacy, although not in a straightforward way. In particular, delivery companies that have contracted with a website know the company selling the product, as well as the name and address of the customer. To make matters worse, if the same delivery company has contracted with many websites, aggregated information per address may be used to profile customers' transaction activities. In this paper, we present a fair delivery service system with guaranteed customer anonymity and merchant-customer unlinkability, with reasonable assumptions about the threat model.

March 25 - Compression, Correction, Confidentiality, and Comprehension: A Modern Look at Commercial Telegraph Codes

Steve Bellovin, Columbia Professor

Telegraph codes are a more-or-less forgotten part of history. However, many of their features are present, albeit in different form, today. It is worth looking back to see how things evolved.

April 1 - JI's Holistic Performance-Measurement Agency

John Ioannidis

We were called in to measure the performance of (and identify performance bottlenecks in) a large distributed system at a large firm. The system consisted of several groups of applications, written and maintained by different teams, communicating over the network. It should come as no big surprise that there were problems with the end-to-end performance of the system even though the individual application groups claimed that their part worked.

In this talk, we'll discuss what we had to do to take *meaningful* measurements. It was a complex problem: we were measuring a production system (so we could not afford to cause any problems); we had no access to the application code; our only source of network traffic was SPAN ports on switches; the system had to process a huge volume of information (so it had to do data reduction early); we had to hook into the GPS infrastructure to get accurate timings; and, worst of all, we had to navigate corporate politics. Our solution involved a few thousand lines of C and Perl code, coupled with a healthy disregard for protocol layering.

This is joint work with Perry Metzger.

April 15 - Adaptive Anomaly Detection via Self-Calibration and Dynamic Updating

Gabriela F. Cretu-Ciocarlie

The deployment and use of Anomaly Detection (AD) sensors often require the intervention of a human expert to manually calibrate and optimize their performance. Depending on the site and the type of traffic it receives, operators might have to provide recent and sanitized training data sets, the characteristics of expected traffic (i.e., the outlier ratio), and exceptions or even expected future modifications in the system's behavior. In this paper, we study the potential performance issues that stem from fully automating the AD sensors' day-to-day maintenance and calibration. Our goal is to remove the dependence on a human operator, using an unlabeled, and thus potentially dirty, sample of incoming traffic. To that end, we propose to enhance the training phase of AD sensors with a self-calibration phase, leading to the automatic determination of the optimal AD parameters. We show how this novel calibration phase can be employed in conjunction with previously proposed methods for training-data sanitization, resulting in a fully automated AD maintenance cycle. Our approach is completely agnostic to the underlying AD sensor algorithm. Furthermore, the self-calibration can be applied in an online fashion to ensure that the resulting AD models reflect changes in the system's behavior that would otherwise render the sensor's internal state inconsistent. We verify the validity of our approach through a series of experiments in which we compare the manually obtained optimal parameters with the ones computed from the self-calibration phase. Using traffic from two different sources, in the worst case the fully automated calibration shows a 7.08% reduction in detection rate and a 0.06% increase in false positives when compared to the optimal selection of parameters. Finally, our adaptive models outperform the statically generated ones, retaining the gains in performance from the sanitization process over time.

This is joint work with Angelos Stavrou, Michael E. Locasto, and Salvatore J. Stolfo.
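
The flavor of self-calibration described above, choosing a sensor threshold from an unlabeled and possibly dirty traffic sample, can be sketched in a few lines. This is a deliberately simplified stand-in for the paper's calibration procedure; the quantile rule and the assumed outlier ratio are illustrative.

```python
def self_calibrate(scores, outlier_ratio=0.05):
    """Pick an anomaly threshold from an unlabeled traffic sample:
    assume at most `outlier_ratio` of the sample is anomalous and
    cut at that quantile of the sensor's scores, so no hand-tuned
    threshold is needed."""
    ranked = sorted(scores)
    cut = int(len(ranked) * (1 - outlier_ratio))
    return ranked[max(cut - 1, 0)]

# Toy sample: 95 ordinary scores plus 5 high (anomalous) ones.
sample = [0.1] * 95 + [5.0] * 5
thr = self_calibrate(sample, outlier_ratio=0.05)
```

Rerunning such a calibration periodically on fresh traffic is what lets the threshold track drift in the system's behavior, which is the online aspect the abstract emphasizes.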

April 22 - Laissez-faire file sharing

Maritza Johnson

When file systems' access control mechanisms render users who need to share files unable to do so, these users will inevitably find other ways to share. Put another way, file systems with access control policies that are restricted by a central administrator suffer from failure modes similar to centrally-planned economies: individuals either leave the systems or learn to circumvent them as matters of necessity.

To better address the file sharing needs of information workers, we formalize requirements for laissez-faire sharing, which parallel the requirements of free market economies. We describe our development of Slackcess Control, an email extension layered onto Windows shared folders that attempts to reduce the friction and increase the transparency of file sharing. We describe the problems in developing Slackcess Control that resulted from a fundamental conflict between the Windows shared folders architecture and our laissez-faire requirements.

We argue that by providing file sharing systems that meet laissez-faire requirements, organizations may obviate the need for their information workers to circumvent these systems. Conversely, policies and systems that do not meet the laissez-faire requirements will be circumvented, at considerable risk to the organizations that rely on them. We conclude that laissez-faire access control increases not just individual productivity but security as well.

This is joint work with Stuart Schechter, Rob Reeder, and Steve Bellovin.

April 29 - Digital Oversight Committee: Who will monitor the monitors?

Isaac Greenbaum

Many different monitoring systems have been created to identify and prevent a host of different attacks. However, should a stealthy attacker turn off or disable these monitoring systems without detection, he would be able to perpetrate the very attacks that these systems were designed to stop. In this presentation, we present a novel technique to "monitor the monitors" in such a way that (a) unauthorized shutdowns of critical monitors are detected with high probability, (b) authorized shutdowns raise no alarm, and (c) the proper shutdown sequence for authorized shutdowns is not stored anywhere in memory. We shall demonstrate a proof-of-concept implementation of the proposed self-monitoring monitoring system that operates on a single host, as well as a distributed version allowing remote collaborative hosts to monitor each other. The techniques were inspired by the safety technology devised to prevent unauthorized discharge (turning on) of nuclear weapons.

This is joint work with Sal Stolfo and Simha Sethumadhavan.
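
Properties (a) through (c) above can be sketched with two toy monitors watching each other: each holds only a hash of the authorized shutdown code, so the code itself is never in memory, and each alarms if its peer stops running without having presented that code. The class, heartbeat scheme, and secret below are invented for illustration and are far simpler than the demonstrated system.

```python
import hashlib

class Monitor:
    """Toy mutual monitor: stores only a digest of the shutdown code."""
    def __init__(self, shutdown_hash):
        self.shutdown_hash = shutdown_hash  # SHA-256 digest, never the code
        self.running = True
        self.authorized = False
        self.alarm = False

    def request_shutdown(self, code):
        # (b)/(c): an operator presenting the right code shuts down cleanly,
        # yet only the hash is ever resident in the monitor's memory.
        if hashlib.sha256(code).digest() == self.shutdown_hash:
            self.running = False
            self.authorized = True
            return True
        return False

    def watch(self, peer):
        # (a): a peer that stopped without authorization raises the alarm.
        if not peer.running and not peer.authorized:
            self.alarm = True

h = hashlib.sha256(b"hypothetical-code").digest()
a, b = Monitor(h), Monitor(h)
a.running = False     # attacker disables `a` without the code
b.watch(a)            # `b` notices the unauthorized shutdown
```

A distributed version would replace the direct `running` check with authenticated heartbeats over the network, but the invariant is the same: silence without proof of authorization is treated as an attack.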