Steven Bellovin is this year’s recipient of the ESORICS Outstanding Research Award

Steven M. Bellovin

For the lasting impact of his contributions in furthering the theory and development of secure systems, Steven M. Bellovin has been named the 2016 recipient of the Outstanding Research Award, which is bestowed annually by the European Symposium on Research in Computer Security (ESORICS).

The Percy K. and Vida L.W. Hudson Professor of Computer Science, Bellovin is a recognized security and network expert whose research, opinions and commentary regularly appear in academic and technical publications as well as prominent media outlets such as Wired and CNN.

Before coming to Columbia in 2005, Bellovin worked for more than 20 years at Bell Labs (and later AT&T Labs Research), where he was among the first to recognize the importance of firewalls to network security and where he investigated protocol failures, routing security, and encrypted key exchange protocols. With William Cheswick, he co-wrote the first book on firewalls, the best-selling Firewalls and Internet Security: Repelling the Wily Hacker (1994). His most recent book is Thinking Security: Stopping Next Year’s Hackers. He holds numerous patents on cryptographic and network protocols.

Bellovin has long been active in public policy questions involving security. Earlier this year he was appointed the first Technology Scholar to the Privacy and Civil Liberties Oversight Board and previously served as Chief Technologist of the Federal Trade Commission (2012-13). He is currently a member of the Computer Science and Telecommunications Board of the National Academies and has served on numerous National Academies study committees. In the past, he has been a member of the Science and Technology Advisory Committee of the Department of Homeland Security and the Technical Guidelines Development Committee of the Election Assistance Commission (2009-2012). Last month, he joined Verified Voting’s board of advisors. He has previously held leadership positions in the Internet Engineering Task Force, including Security Area co-director (2002-2004); from 1996-2002, he was a member of the Internet Architecture Board.

The ESORICS award is his second in as many months; in August, Bellovin received the Electronic Frontier Foundation’s Pioneer Award. Among his many other honors, he was elected to the Cybersecurity Hall of Fame (2014), received the NIST/NSA National Computer Systems Security Award (2007), and shared the 1995 Usenix Lifetime Achievement (“The Flame”) Award with Tom Truscott and Jim Ellis for his part in creating USENET.

Computing in Context, a computer science course for liberal arts majors, expands with new SIPA track

Computing in Context has added a new track for the School of International and Public Affairs (SIPA) to teach computational concepts and coding in the context of solving policy problems. SIPA students will be taught both by a computer science professor, who lectures on basic computer and programming skills while teaching students to think like computer scientists, and by a SIPA professor, who shows how those skills can augment traditional policy analysis. Projects and assignments will be geared toward the policy arena to give students a command of technical solutions for problems they are likely to encounter in their classes and future work.

SIPA’s is the first new track to be added since Computing in Context debuted in spring 2015 with tracks in digital humanities, social science, and economics and finance. Aimed at liberal arts majors who might not otherwise take computer science, Computing in Context is the first of its kind to provide a contextualized introduction that combines algorithmic thinking and programming with projects and assignments from different liberal arts disciplines.

Gregory Falco

How much should students in the School of International and Public Affairs (SIPA) know about computer science?

In a digital world where information is collected at unprecedented rates and government decision-making becomes more data-driven, computer science is fast becoming fundamental to policy analysis. Computational methods offer an efficient way to navigate and assess a variety of systems and their data, and they make it possible to comb even massive data sets for subtle patterns that might otherwise go undiscovered. A relatively small amount of code can replace tedious, time-consuming manual efforts to gather data and refine it for analysis.

As machine learning and text mining turn texts into data analyzable by a computer, computational methods once reserved for quantitative data can now be applied to almost any type of document—emails, tweets, public records, transcripts of hearings—or to a corpus of tens or hundreds of thousands of documents. These new methods for computationally analyzing texts and documents make computer science relevant to humanities and social science disciplines that traditionally have not been studied computationally. Social science majors may analyze vast numbers of social media posts; English majors may automate stylistic analyses of literary works; finance students may mine data for new economic trends.
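
To make the idea concrete, here is a minimal sketch, in Python (the language the course teaches), of what turning texts into data can look like: a few invented tweets are reduced to word counts a program can then sort, compare, or chart. The tweets are made up purely for illustration.

    from collections import Counter
    import re

    # A tiny invented corpus standing in for tweets or any other documents.
    tweets = [
        "Loving the new transit schedule, so much faster",
        "The new transit schedule is a disaster, trains always late",
        "Faster trains, happier commuters",
    ]

    # Tokenize each tweet into lowercase words and tally them across the corpus.
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z']+", tweet.lower()))

    # The texts are now ordinary data: word frequencies a program can analyze.
    print(counts.most_common(5))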

Liberal arts students have been increasingly skipping the cursory computer science class intended for non-majors (1001) and enrolling in computer science classes alongside computer science majors. Adam Cannon, who has been teaching introductory computer science for 15 years, has watched the number of liberal arts students in his classes climb. Outnumbered 4-to-1 in his classes five or six years ago, liberal arts students last year surpassed the number of computer science majors. “These students want more than an appreciation of computer science; they want to apply computer science techniques in their own fields,” says Cannon.

To think like a computer scientist

Though his classes had the rigor students were demanding, Cannon worried that classes and projects meant for computer science and engineering students were too focused on numeric processing to be immediately relevant to liberal arts students, who study mainly texts and who might walk from his class into a lecture on Jane Austen or the evolution of liberation theology in Latin America.

To provide context for liberal arts students and demonstrate how computational methods relate to their fields, Cannon last year introduced a new computer science class, Computing in Context, focused on text processing. A hybrid class taught by a team of professors, it combines basic computer science concepts and Python programming with lectures and projects by humanities and social science professors who show how those skills and methods apply to a specific liberal arts discipline.

The computer science component is rigorous. While covering basic computer and programming concepts, the class teaches students to think algorithmically so they can, like computer scientists, reformulate problems in a way that lets a computer analyze them. A framework for problem solving, algorithmic thinking forces students to think hard about a problem and see it clearly enough to break it down into its component parts. It’s a logical, step-by-step approach that entails critical thinking and deductive reasoning and has wide application across different types of problems. Perhaps more than any other single subject, computer science bridges the logical reasoning and the abstract and creative thought that should be an essential part of a college education.
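
A small, hypothetical exercise of the kind such a class might assign shows what that decomposition looks like in practice: a stylistic question an English major might ask (how long are this author’s sentences, on average?) is restated as explicit steps a computer can follow.

    def average_sentence_length(text):
        # Step 1: break the text into sentences (naively, on periods).
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        # Step 2: break each sentence into its component words.
        words_per_sentence = [len(s.split()) for s in sentences]
        # Step 3: combine the parts into a single answer.
        return sum(words_per_sentence) / len(words_per_sentence)

    print(average_sentence_length("Call me Ishmael. Some years ago I went to sea. It was cold."))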

Computer science within a context

Algorithmic thinking is critical for designing solutions to new problems and analyzing new data sets, but the nature of the problems and the data sets depends on the particular field of study. Different liberal arts disciplines require different kinds of computational proficiency; for this reason, Computing in Context maintains separate tracks for each discipline, with each track taught by a different professor. The class debuted with three tracks: social science (Matthew Jones), digital humanities (Dennis Tenen), and economics and finance (Karl Sigman). All students take the computer science component and learn the same basic concepts, but then divide into separate tracks to learn how those concepts apply to their particular discipline.

It’s a modular design that makes it easy to insert additional tracks as more departments and professional schools act to make computer science part of their students’ curriculum. The first time a new track is offered, a professor from that department lectures live, and then records those lectures for future semesters. This flipped classroom approach—where students view videos of lectures outside class and use classroom time to discuss the content of those videos—helps make the class financially sustainable since each new track represents a one-time expense.

SIPA’s is the first track to be added since Computing in Context was introduced and is being taught by Gregory Falco, a Columbia adjunct faculty member who is also an executive at Accenture and is currently pursuing his PhD in Cybersecurity of Critical Urban Infrastructure at MIT. With an MS in Sustainability Management from Columbia University, Falco specializes in applying data, analytics, and sensors to solve complex sustainability and security policy problems.

Having Falco teach a track within Computing in Context is part of SIPA’s commitment to deeply integrating technology courses into its curriculum and equipping students with a robust tech and computer science skill set. It is one way SIPA Deans Merit Janow and Dan McIntyre, along with Falco, are pioneering the next generation of policy education.

What SIPA students can expect

For the first six weeks of the course, SIPA students will attend the twice-weekly lectures on computer science along with all other students. At the halfway point, the track lectures kick in, and SIPA students go to lectures given by Falco, who will also assign homework and projects geared specifically to public policy. While economics and finance students price options and digital humanities students run sentiment analysis on tweets, SIPA students might be troubleshooting sources of environmental pollution, evaluating the effectiveness of public housing policy, or determining the impact of local financial markets on international healthcare or education.

Because SIPA is a professional school, Falco’s lectures and assignments are aimed at helping students carry what they learn in the classroom into professional settings and the job market.

Unlike the other tracks, the SIPA track will have live lectures each time it is given, because policy problems shift with current events and the class must evolve with them. The skills SIPA students learn in Computing in Context will also be integrated into the capstone research projects that serve as their graduate theses; since Falco both teaches Computing in Context and will advise those projects, his in-class presence will give SIPA students a continuous resource of expertise on data and computing.

“This is a one-of-a-kind, very cool policy class because it enables SIPA students to think like computer scientists and see the art of the possible in relation to how technology, data analytics, and artificial intelligence can be used to address policy problems,” says Falco. “Beyond coding, the class helps foster the language of digital literacy, which is invaluable in the professional world for policy practitioners.”

The SIPA track will be the first test of how well Computing in Context can scale to meet demand, which is only expected to grow as more departments and schools like SIPA integrate computer science into their curricula.

Posted: 9/20/2016

 – Linda Crane

 

An interview with John Paparrizos: Hunting early signs of pancreatic cancer in web searches

With few warning signs and no screening test, pancreatic cancer is usually not diagnosed until it is well advanced; 75% of pancreatic cancer patients die within a year of diagnosis. The prognosis improves if the disease is caught early, but the cancer’s nonspecific symptoms (bloating, abdominal or back pain) are not initially concerning. Doctors will want to rule out more likely diseases before ordering expensive lab tests; adding to the difficulty, doctors often lack a full medical history for patients they see only intermittently or for the first time, and patients may have visited other doctors or facilities. Doctors, in other words, have very little data to go on.

Search engines, however, collect a great deal of data, and as more people turn to the web for health-related information, they are constructing a deep and broad data set of medical histories that may contain clues as to how diseases emerge.

In a study started last summer, researchers Eric Horvitz and Ryen White of Microsoft, both recognized experts in the field of information retrieval, along with Columbia PhD student John Paparrizos, looked through 18 months of US Bing search data to see if they could detect patterns of symptoms of pancreatic cancer before people were diagnosed with the disease. In 5–15% of such cases they could, and with extremely few false positives (as low as 1 in 100,000).

A paper detailing their results was the August 2016 cover article of the Journal of Oncology Practice and was widely covered in the popular press. In this interview, John Paparrizos discusses the methodology behind those results and more recent work presented at KDD in August.

John Paparrizos,
CS PhD student
(advisor Luis Gravano)

How did you know which users had been diagnosed with pancreatic cancer?

We were very careful. We looked for people whose first-person searches indicated they had the disease: “I was just diagnosed with pancreatic cancer,” “why did I get cancer in pancreas,” and “I was told I have pancreatic cancer what to expect.” We then worked backward in time through previous searches made by these people to see if they had earlier searched for symptoms associated with the disease—bloating or abdominal discomfort, unexplained weight loss, yellowing skin. From 9.2 million users with searches relevant to pancreatic cancer and its symptoms, we focused on 3,203 cases matching the diagnostic pattern.
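
The study’s code is not published, but that first step can be sketched roughly as follows: flag queries whose wording signals a first-person diagnosis, then walk back through that user’s earlier, timestamped queries for symptom searches. The patterns and the example history below are illustrative only, not the study’s actual rules.

    import re
    from datetime import date

    # Illustrative first-person diagnostic patterns (not the study's real list).
    DIAGNOSIS_PATTERNS = [
        r"\bi was (just )?diagnosed with pancreatic cancer\b",
        r"\bwhy did i get (pancreatic cancer|cancer in pancreas)\b",
        r"\bi was told i have pancreatic cancer\b",
    ]

    def is_diagnostic_query(query):
        q = query.lower()
        return any(re.search(p, q) for p in DIAGNOSIS_PATTERNS)

    def earlier_symptom_queries(history, symptom_terms):
        """Return symptom searches made before the user's first diagnostic query."""
        diagnosis_dates = [t for t, q in history if is_diagnostic_query(q)]
        if not diagnosis_dates:
            return []
        first_diagnosis = min(diagnosis_dates)
        return [(t, q) for t, q in history
                if t < first_diagnosis
                and any(term in q.lower() for term in symptom_terms)]

    # Hypothetical query history for one user.
    history = [
        (date(2015, 1, 10), "constant bloating and back pain"),
        (date(2015, 4, 2), "i was just diagnosed with pancreatic cancer"),
    ]
    print(earlier_symptom_queries(history, ["bloating", "yellowing skin", "weight loss"]))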

It’s important to point out that this data was completely anonymized. Users were identified through a unique ID; we had no names or personal information about them. Just search queries.

You study computer science. How did you come to work on a medical problem?

Before starting an internship last summer at Microsoft, I talked by phone with Eric and Ryen about possible projects. They have spent several years studying how people search online, particularly how people use the web for health purposes. The idea this time was to do some type of prediction through analysis of patients’ query logs over time for a specific disease. Pancreatic cancer was a good candidate; it’s a fast-developing cancer with few overt symptoms. Search logs could conceivably provide early warning of the disease, something that doesn’t exist now.

My particular focus was to develop an approach for capturing characteristics of user behavior over time, as expressed in search query logs, that would discriminate, as early as possible, between users who might experience pancreatic cancer and users who are simply exploring the topic.

Since human web searches contain a lot of noise, we spent a great deal of time cleaning and annotating data before inputting it into classifiers that separate out the true signals. We excluded searches where people were looking up symptoms for general knowledge or on behalf of someone else. We excluded idiomatic phrases that might be mistaken for real symptoms, like “I’m sick of hearing about the Kardashians.” A complication particular to this problem was the wave of queries generated by Steve Jobs’s death from pancreatic cancer.
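
A rough sketch of that kind of filtering, with tiny made-up exclusion lists standing in for the study’s far more careful annotation:

    # Hypothetical exclusion lists; the study's annotation was much more thorough.
    IDIOMS = ["sick of", "sick and tired", "makes me sick"]
    THIRD_PARTY = ["my mother", "my father", "my friend", "my husband", "my wife"]
    NEWS_SPIKES = ["steve jobs"]          # celebrity-driven queries, not personal symptoms

    def keep_query(query):
        """Keep only queries that plausibly describe the searcher's own symptoms."""
        q = query.lower()
        return not any(phrase in q for phrase in IDIOMS + THIRD_PARTY + NEWS_SPIKES)

    queries = [
        "unexplained weight loss and yellowing skin",
        "i'm sick of hearing about the kardashians",
        "steve jobs pancreatic cancer",
    ]
    print([q for q in queries if keep_query(q)])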

Was this the first time you had worked with medical data?

Yes. It was new to me, and it required learning about pancreatic cancer and its symptoms, which people describe in many different ways. Medical records are usually clean, formatted in templates, and written in the precise vocabulary of doctors working in the field, but people describe those same symptoms in very different terms, even using slang. We had to build a two-level ontology of terms for symptoms and their synonyms.
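
A two-level ontology of that kind can be represented as a simple mapping from each canonical symptom to the lay phrasings people actually type; the entries below are invented examples, not the study’s vocabulary.

    # Level 1: canonical symptom.  Level 2: colloquial phrasings seen in queries.
    SYMPTOM_ONTOLOGY = {
        "jaundice": ["yellowing skin", "yellow eyes", "skin turning yellow"],
        "abdominal pain": ["stomach ache", "belly pain", "pain in my gut"],
        "unexplained weight loss": ["losing weight without trying", "dropping weight fast"],
    }

    def canonical_symptoms(query):
        """Map a free-text query to the canonical symptoms it mentions."""
        q = query.lower()
        return [symptom for symptom, phrases in SYMPTOM_ONTOLOGY.items()
                if any(phrase in q for phrase in phrases)]

    print(canonical_symptoms("why is my skin turning yellow and i have belly pain"))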

The distribution of the data was also different. Normally you build classifiers where the positive and negative cases are in balance. With this problem, maybe 1 in 10,000 people will have the disease, so a high misprediction rate could unnecessarily alert millions of users.
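
The arithmetic behind that concern is simple. With roughly 1 case per 10,000 users (the figure mentioned above), even a fairly specific classifier produces many false alarms for every true case, which is why the false positive rate of about 1 in 100,000 cited earlier matters. A hypothetical population of 10 million users makes the point:

    prevalence = 1 / 10_000      # rough share of users who actually have the disease
    users = 10_000_000           # hypothetical search population

    for false_positive_rate in (1 / 1_000, 1 / 100_000):
        true_cases = users * prevalence
        false_alarms = users * (1 - prevalence) * false_positive_rate
        print(f"FPR {false_positive_rate:.5f}: {true_cases:,.0f} true cases, "
              f"{false_alarms:,.0f} healthy users alerted")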

There were issues of scale. In an academic environment we might be running analysis involving 100 or 1000 users, not millions. When you’re dealing with at least 3 orders of magnitude more, there is a lot of engineering involved along with the research task.

You knew from people’s previous searches who had cancer. Could you take this model and apply it to another data set where it’s not known whether people have the cancer or not?

Our KDD paper, presented last month, describes a similar experiment. Where our first paper describes training a model on users from October 2013 to May 2015, the KDD paper describes applying that model to a new set of users from August 2014 to January 2016—users not used in training. We wanted to see if we could predict which searchers would later input first-person queries for the disease, and we could do so at about the same accuracy as before.
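
In effect, the second experiment is a temporal holdout: fit the model on one window of users and score it on a later, disjoint set. A bare-bones sketch of that split (the record format and example dates are placeholders, not the paper’s actual pipeline):

    from datetime import date

    # Hypothetical records: (user_id, date of first relevant query, label).
    records = [
        ("u1", date(2014, 2, 1), 1),
        ("u2", date(2014, 12, 5), 0),
        ("u3", date(2015, 9, 20), 1),
    ]

    def temporal_split(records, train_start, train_end, test_start, test_end):
        train = [r for r in records if train_start <= r[1] <= train_end]
        trained_ids = {r[0] for r in train}
        # Score only users from the later window who were never seen in training.
        test = [r for r in records
                if test_start <= r[1] <= test_end and r[0] not in trained_ids]
        return train, test

    train, test = temporal_split(records,
                                 date(2013, 10, 1), date(2015, 5, 31),
                                 date(2014, 8, 1), date(2016, 1, 31))
    print(len(train), len(test))   # 2 users for training, 1 held-out user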

The second paper also extended the methodology. In addition to looking for people with first-person searches (“I have this cancer”), it goes a step further and looks for users who searched for specific treatments for this disease—Whipple procedure, pancreaticoduodenectomy, neoadjuvant therapy—treatments that only someone with first-hand knowledge of the disease would know.

We got slightly better results because the data was cleaner, and this raised our accuracy level from 90% to 92%.

What is the possibility that web searches will actually be used to screen for cancer?

Our goal was to demonstrate the feasibility of mining the medical histories of a large population of people to detect symptoms at an early stage. We did this using anonymized search data, but that’s not practical in the real world: we don’t have names, so we can’t contact individuals, which would raise massive privacy issues in any case.

What may be feasible is to link our method to a hospital or clinical group where patients allow their data to be shared and combined. For this we need help from doctors working directly with patients, which is why we published first in a medical journal; we wanted to engage the medical community with the prospect of using nontraditional data such as web searches, in conjunction with standard practices, as a way to do low-cost, passive surveillance that flags hidden problems and improves cancer screening.

Our system won’t make the diagnosis—that is for the doctor to do—but it pulls together more clues, gives more warning that something serious might be developing, and might recommend that a specific patient get a certain test or meet with a physician.

We want doctors to see what’s possible when you have more data. You learn more about a single patient by having that patient’s data centralized in one place and seeing what symptoms they had months ago, but you also learn more about the disease. For example, in examining the histories of many pancreatic cancer cases, we found that symptoms appearing in a particular order—such as indigestion before abdominal pain—correlate positively with users who actually went on to experience the disease.
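
One simple way to capture an ordering signal like “indigestion before abdominal pain” is to compare the dates of each symptom’s first mention in a user’s query history; the sketch below is illustrative, not the study’s actual feature set.

    from datetime import date

    def first_mention(history, phrases):
        """Date of the earliest query containing any of the given phrases, or None."""
        dates = [t for t, q in history if any(p in q.lower() for p in phrases)]
        return min(dates) if dates else None

    def indigestion_before_abdominal_pain(history):
        indigestion = first_mention(history, ["indigestion", "heartburn"])
        pain = first_mention(history, ["abdominal pain", "stomach pain"])
        return indigestion is not None and pain is not None and indigestion < pain

    # Hypothetical query history for one user.
    history = [
        (date(2015, 1, 4), "constant heartburn after meals"),
        (date(2015, 3, 2), "sharp abdominal pain at night"),
    ]
    print(indigestion_before_abdominal_pain(history))   # True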

To better understand when such a web-based health surveillance approach could be applied, and how the presence of particular risk factors or symptoms affects the model’s performance, we analyzed the performance of our model when conditioned on users with specific symptoms or risk factors. We found that if we focus our predictions on users who search for risk factors such as alcoholism or obesity, many more users would benefit from a prediction than would be mistakenly alerted.

The table describes the performance of the model when conditioned on users with particular symptoms or risk factors. The last column gives the benefit of the researchers’ model to users with the corresponding symptom; values greater than 1 indicate that more people would benefit from using the model than would be mistakenly alerted.
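
Read that way, the benefit figure is just a ratio: among users who also searched for a given risk factor, how many are correctly alerted for each one mistakenly alerted. The counts below are invented, purely to show the calculation.

    # Invented counts for users who also searched for a given risk factor (e.g. obesity).
    correctly_alerted = 40      # users flagged by the model who go on to show the disease
    mistakenly_alerted = 25     # healthy users flagged by the model

    benefit = correctly_alerted / mistakenly_alerted
    print(f"benefit = {benefit:.1f}")   # values > 1: more people helped than misled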

Will you continue to work with medical data sets? Did this project make you want to keep working in this field?

Perhaps. It’s an extremely interesting application. In our field, you don’t often get to see such a direct application to users; we build a new model or classifier or algorithm, but often it’s not linked to such a concrete application. This is an excellent example of how we can use what we are learning in computer science for social good.

Posted: 9/13/2016