Spring 2006

COMPUTER SCIENCE COLLOQUIA

High Performance Object-Based Storage Systems
David Nagle
Panasas
March 1, 2006
ABSTRACT: Applications’ demands on storage are insatiable. From data mining and biotechnology to oil-and-gas exploration, the speed of storage ultimately determines overall application performance. With many applications running for weeks or months, even a modest improvement in storage performance can save many days of total run time. To keep up, storage has begun adopting clustering technologies and the newly standardized object-based storage device (OSD) interface to deliver tens of GBytes/sec of storage bandwidth today, and TB/sec performance in the near future.

This talk explores object-based storage and the techniques used to deliver scalability across large storage clusters. The talk begins by presenting the core design of our object-based storage system. I then examine how semantic information in the OSD interface is used to efficiently manage OSD resources across the storage cluster. Finally, I discuss how storage's bursty and synchronized network traffic patterns can overwhelm TCP/IP, and present a novel storage layout mechanism that ensures that TCP/IP and commodity networks can support 1000s of clients and OSDs within a single cluster.
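
For intuition about why synchronized storage traffic stresses TCP/IP, consider the "incast" pattern that arises when a client reads a file striped across many OSDs: all stripe units converge on the client's switch port at nearly the same instant. The back-of-the-envelope sketch below uses assumed, purely illustrative numbers (not figures from the talk or from Panasas) to show how easily such a burst can exceed a commodity switch's per-port buffering.

```python
# Back-of-the-envelope incast sketch; all numbers are illustrative assumptions.
num_osds = 64                       # assumed stripe width (OSDs answering one read)
stripe_unit_bytes = 64 * 1024       # assumed stripe unit returned by each OSD
port_buffer_bytes = 512 * 1024      # assumed per-port buffer on a commodity switch

burst = num_osds * stripe_unit_bytes
print(f"synchronized burst: {burst // 1024} KB, port buffer: {port_buffer_bytes // 1024} KB")
if burst > port_buffer_bytes:
    # Overflow means dropped segments; the stalled flows then wait out TCP
    # retransmission timeouts, collapsing the effective storage bandwidth.
    print("burst exceeds buffer: expect drops and TCP timeout stalls")
```

The storage layout mechanism presented in the talk is aimed at this mismatch between synchronized storage traffic and commodity network buffering.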

BIOGRAPHY: David Nagle is currently the Advanced Development Architect at Panasas, where he works on high-performance cluster storage systems. David started his storage career as a Professor at Carnegie Mellon University (CMU) and Director of the CMU Parallel Data Lab. His CMU Network-Attached Secure Disks (NASD) project has led to two startups (Panasas and Lustre), object-based storage projects within most major storage vendors (IBM, Seagate, Sun), and the ANSI T10 SCSI OSD standard. His other research has focused on Micro-Electrical-Mechanical Storage Systems (MEMS) and migrating storage functionality into network devices (Active Storage Networks).
Data, technologies and populations for genome wide association studies
Itsik Pe’er
The Program for Medical and Population Genetics, Broad Institute of MIT and Harvard, and the Center for Human Genetic Research, Massachusetts General Hospital
March 22, 2006
ABSTRACT: The pervasive effect of genetic variation on medically important phenotypes provides a means for dissecting their underlying mechanisms by identifying variants that are associated with traits of interest. Current trends in human genetics now make it possible, for the first time, to pursue this potential through large-scale studies that scan the entire genome for potentially associated variants. Specifically, the talk will present:

1. The International HapMap Project, a data resource we participated in developing to enable genome-wide association studies, and what our analyses of these data tell us about human variation.
2. The current generation of SNP array technology, and how computational and statistical improvements allow it to cover the majority of common human variants.
3. The tale of a pilot association scan in an isolated population in Micronesia, where we show such scans are more promising than elsewhere, though we expose practical complexities of real data and the computational challenges they present.

Some of the research presented was performed as part of the International HapMap Analysis Team, or in collaboration with Affymetrix Inc. and the Friedman lab at Rockefeller University.

Using context to assist in personal file retrieval
Craig Soules
Carnegie Mellon University
March 27, 2006
ABSTRACT: The drastic increases in personal storage space over the last ten years have virtually eliminated the need for users to limit what data they store. Instead of having to decide what data to keep, users are now faced with the challenge of locating the data they want within a sea of information. And, despite recent interest in desktop file search, it is still often easier for a user to locate data on the web than in their personal file space.

This talk describes how context information, identified automatically by the file system, can be used to improve existing desktop search tools. By identifying and summarizing context information for each file, the system can extend traditional content-only search results with closely related contextual results, allowing the system to locate even files with non-indexable content. Furthermore, using the context information to re-rank the results improves the accuracy of searches. The end result is that context-enhanced search reduces both false negatives and false positives compared to content-only search.
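
To make the idea concrete, here is a minimal, hypothetical sketch (not the system described in the talk) of how content-only results might be extended and re-ranked with context: files the file system has observed being used together contribute score to one another, which also lets files with non-indexable content surface in the results. The blending weight alpha and all scores below are assumed values for illustration only.

```python
# Hypothetical context-enhanced search sketch; all data and weights are assumed.
content_hits = {"notes.txt": 0.9, "draft.tex": 0.6}      # content-index scores
context_graph = {                                          # file -> related files observed together
    "notes.txt": {"figure.png": 0.8, "data.csv": 0.5},
    "draft.tex": {"refs.bib": 0.7},
}

def context_enhanced_search(content_hits, context_graph, alpha=0.5):
    """Blend content scores with context propagated from related files."""
    scores = dict(content_hits)
    for hit, score in content_hits.items():
        for neighbor, weight in context_graph.get(hit, {}).items():
            # Non-indexable files (e.g. images) can still surface via context.
            scores[neighbor] = scores.get(neighbor, 0.0) + alpha * score * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(context_enhanced_search(content_hits, context_graph))
```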

FACULTY CANDIDATE SEMINARS

Byzantine Fault-Tolerance and Beyond
Jean-Philippe Martin
University of Texas, Austin
March 6, 2006
ABSTRACT: Computer systems should be trustworthy in the sense that they should reliably answer requests from legitimate users and protect confidential information from unauthorized users. Building such systems is challenging, even more so in the increasingly common case where control is split between multiple administrative domains.

Byzantine fault tolerance techniques can elegantly provide reliability without overly increasing the complexity of the system and have recently earned the attention of the systems community. In the first part of this talk I discuss some of the contributions I have made toward practical Byzantine fault tolerance, in particular how to reduce the cost of replication and how to reconcile replication with confidentiality. In the second part of the talk I argue that Byzantine fault tolerance alone is not sufficient for cooperative services spanning multiple administrative domains, where nodes may deviate from their specification not just because they are broken or compromised, but also because they are selfish. To address this challenge, I propose BAR, a new failure model that combines concepts from Byzantine fault tolerance and game theory. I will describe BAR, present an architecture for building BAR services, and briefly discuss BAR-B, a BAR-tolerant cooperative backup system.

BIOGRAPHY: Jean-Philippe Martin is a Ph.D. candidate in the Department of Computer Sciences at The University of Texas at Austin. He received his M.S. and B.S. in Computer Science from the Swiss Federal Institute of Technology (EPFL). His main research interests are trustworthy systems, Byzantine fault tolerance, and cooperative systems. His papers on cooperative services (SOSP’05) and fast Byzantine consensus (DSN’05) were recognized among the best papers at those conferences and were selected for journal publication.
Six ways to touch an elephant — modeling different aspects of the biomolecular system
Chen-Hsiang Yeang
Center for Biomolecular Science and Engineering, University of California, Santa Cruz
March 20, 2006

ABSTRACT: Life is a complex phenomenon comprising many inter-related subsystems such as gene regulation, signal transduction, and metabolism. Recent progress in high throughput technologies and computational biology allows us to probe different aspects of this complex system. A comprehensive understanding of a biological system relies not only on the detailed knowledge about individual molecules, genes and pathways but also on the grasp of system-wide properties and the interplay between the subsystems. To develop methods of integrating information from different sources, modeling complex biological systems, and processing an astronomical amount of data, collaboration across disciplines of biology, computer science, statistics and other sciences is necessary.

In this talk I will present six interrelated works modeling different aspects of the complex biological system. In the first part, I will describe a modeling framework integrating physical interaction, gene expression, and metabolic flux data to reconstruct the gene regulatory network and the coupling between gene regulation and metabolism. We annotated the networks of physical interactions and metabolic reactions with various functional attributes, such as the function and direction of each physical interaction. Using physical interaction, knockout gene expression, and metabolic flux data to constrain the network, we built a probabilistic graphical model (a factor graph) over the attributes in the network. By applying the max-product inference algorithm, we approximately identified the attribute values that best fit the data. Because the current data do not sufficiently constrain the model, there are multiple model configurations that fit the data equally well. We used an information-theoretic score to rank new knockout experiments to disambiguate these models, and performed several of the top-ranking experiments. The new data validated two putative pathways suggested by the inferred models and disambiguated the functions of the transcription factors along these pathways.
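
As a toy illustration of the inference step (a deliberately simplified, assumed example, not the model from the talk), the network attributes can be thought of as discrete variables, each data source as a factor scoring their joint assignment, and the best-fitting configuration as the one maximizing the product of factors. The real system runs the max-product algorithm over large annotated networks; on a two-attribute example the maximization can simply be done by enumeration:

```python
# Toy factor-graph example; attributes, priors, and evidence are all assumed.
from itertools import product

attributes = ["sign_TF_A", "sign_A_B"]   # hypothetical edge-sign attributes
domain = [+1, -1]                        # +1 = activation, -1 = repression

def factor_prior(sign):
    # Mild assumed prior favoring activation.
    return 0.6 if sign == +1 else 0.4

def factor_knockout(sign_tf_a, sign_a_b):
    # Assumed knockout evidence: deleting the TF lowers B's expression,
    # which is most consistent with a net positive path TF -> A -> B.
    return 0.9 if sign_tf_a * sign_a_b == +1 else 0.1

def score(config):
    s1, s2 = config
    return factor_prior(s1) * factor_prior(s2) * factor_knockout(s1, s2)

best = max(product(domain, repeat=len(attributes)), key=score)
print(dict(zip(attributes, best)), "score:", round(score(best), 4))
```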

BIOGRAPHY: Chen-Hsiang Yeang is a postdoctoral researcher at the Center for Biomolecular Science and Engineering at UC Santa Cruz (David Haussler’s group). His main research interest is building quantitative models of living systems. Specifically, his current research focuses on integrating different types of data to uncover gene regulation, signal transduction, and metabolism, and on modeling the dependent evolution between multiple components in a molecular system. He received a B.S. in Electrical Engineering from National Taiwan University, and an M.S. and Ph.D. (advisor: Tommi Jaakkola) in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology. He was a visiting postdoc at the Max Planck Institute for Molecular Genetics (Martin Vingron’s group) after graduating from MIT in 2004. He has been at UC Santa Cruz since 2005. Besides research, he writes stage plays and screenplays as a hobby.