Carl Vondrick Wins NSF CAREER Award

Assistant Professor Carl Vondrick has won the National Science Foundation’s (NSF) Faculty Early Career Development award for his proposal to develop machine perception systems that robustly detect and track objects even when they disappear from sight, thereby enabling machines to build spatial awareness of their surroundings.

Papers That Explore Novel Methods to Use Artificial Intelligence Accepted to AAAI 2021

Research from the department was accepted to the 35th AAAI Conference on Artificial Intelligence. The conference promotes research in artificial intelligence (AI) and scientific exchange among AI researchers, practitioners, scientists, and engineers in affiliated disciplines. 


Automated Symbolic Law Discovery: A Computer Vision Approach
Hengrui Xing Columbia University, Ansaf Salleb-Aouissi Columbia University, Nakul Verma Columbia University

One of the most exciting applications of modern artificial intelligence is to automatically discover scientific laws from experimental data. This is not a trivial problem, as it involves searching for a complex mathematical relationship over a large set of explanatory variables and operators that can be combined in an infinite number of ways. Inspired by the incredible success of deep learning in computer vision, the authors tackle this problem by adapting various successful network architectures into the symbolic law discovery pipeline. The novelty of this approach lies in (1) encoding the input data as an image with super-resolution, (2) developing an appropriate deep network pipeline, and (3) predicting the importance of each mathematical operator from the relationship image. This allows the exponentially large search to be prioritized using the predicted importance of the symbolic operators, which can significantly accelerate the discovery process.
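To make the idea concrete, here is a minimal, hypothetical sketch of the two ingredients described above: rendering sampled observations of an unknown law as an image, and a small convolutional network that scores which symbolic operators are likely to appear. The operator list, architecture, and image encoding are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np
import torch
import torch.nn as nn

def relationship_image(x, y, bins=64):
    """Render sampled (x, y) observations of an unknown law as a 2-D
    histogram 'image' that a convolutional network can consume."""
    img, _, _ = np.histogram2d(x, y, bins=bins, density=True)
    return torch.tensor(img, dtype=torch.float32).unsqueeze(0)  # 1 x bins x bins

OPERATORS = ["add", "mul", "div", "pow", "sin", "exp", "log"]  # illustrative set

class OperatorScorer(nn.Module):
    """Small CNN that predicts, for each symbolic operator, a score for
    whether it appears in the law underlying the rendered data."""
    def __init__(self, n_ops=len(OPERATORS)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fc = nn.Linear(32 * 16 * 16, n_ops)
    def forward(self, img):
        return torch.sigmoid(self.fc(self.conv(img).flatten(1)))  # per-operator scores

# Example: data generated from y = x * sin(x); a trained scorer's output
# would be used to prioritize the symbolic search over operators.
x = np.random.uniform(-3, 3, 5000)
scores = OperatorScorer()(relationship_image(x, x * np.sin(x)).unsqueeze(0))
print(dict(zip(OPERATORS, scores.squeeze(0).tolist())))
```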

The model was then applied to a variety of plausible relationships—both simulated and from physics and mathematics domains—involving different dimensions and constituents. The authors show that their model is able to identify the underlying operators from data, achieving high accuracy and AUC (91% and 0.96 on average, respectively) for systems with as many as ten independent variables. Their method significantly outperforms the current state of the art in terms of data fitting (R^2), discovery rate (recovering the true relationship), and succinctness (output formula complexity). The discovered equations can be seen as first drafts of scientific laws that can help scientists with (1) hypothesis building and (2) understanding the complex underlying structure of the studied phenomena. This novel approach holds real promise for speeding up the rate of scientific discovery.

 

Bounding Causal Effects on Continuous Outcome 
Junzhe Zhang Columbia University, Elias Bareinboim Columbia University

One of the most common methods for policy learning used throughout the empirical sciences is randomization of the treatment assignment. This method is considered the gold standard within many disciplines and can be traced back, at least, to Fisher (Fisher 1935) and Neyman (Neyman 1923). Whenever human subjects are at the center of the experiment, unfortunately, issues of non-compliance arise: subjects do not necessarily follow the experimental protocol and end up doing what they want. It is well understood that under such conditions unobserved confounding bias will emerge. For instance, subjects who did not comply with the treatment assignment may be precisely those who would have responded adversely to the treatment. Therefore, the actual causal effect of the treatment, when it is applied uniformly to the population, might be substantially smaller than the data suggest. Moreover, since one does not observe how subjects decide/respond to the realized treatment, the actual treatment effects are not uniquely computable from the collected data; they are said to be non-identifiable.

Robins (1989) and Manski (1990) derived the first informative bounds on causal effects from studies with imperfect compliance under a set of non-parametric assumptions called instrumental variables (IV). In their seminal work, Balke and Pearl (1994a, 1997) improved these earlier results by employing an algebraic method to derive analytic expressions for the causal bounds, which are provably optimal. However, this approach assumes that the primary outcome is discrete and finite, and solving the underlying optimization program can become intractable when high-dimensional context variables are present.

This paper presents novel non-parametric methods to bound causal effects on a continuous outcome from studies with imperfect compliance. These methods can be generalized to settings with a high-dimensional context. Perhaps surprisingly, the paper introduces a latent data representation that characterizes all constraints on the observational and interventional distributions implied by the IV assumptions, even when the primary outcome is continuous. This representation allows one to reduce the original bounding problem to a series of linear programs; solving these programs leads to tight causal bounds.
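For intuition, the sketch below poses the classical discrete instrumental-variable bounding problem (in the style of Balke and Pearl) as a linear program over response-type probabilities, using scipy; the paper's contribution is a representation that lets the continuous-outcome case be reduced to a series of such programs. The observed distribution here is invented purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Potential responses for binary Z, X, Y: X as a function of Z (compliance
# types) and Y as a function of X (response types).
def x_of_z(i, z): return [0, z, 1 - z, 1][i]
def y_of_x(j, x): return [0, x, 1 - x, 1][j]

# Hypothetical observed distribution P(X=x, Y=y | Z=z), keyed by (z, x, y).
p_obs = {
    (0, 0, 0): 0.32, (0, 0, 1): 0.28, (0, 1, 0): 0.22, (0, 1, 1): 0.18,
    (1, 0, 0): 0.12, (1, 0, 1): 0.08, (1, 1, 0): 0.35, (1, 1, 1): 0.45,
}

n = 16                        # one variable q[i, j] per (compliance, response) pair
idx = lambda i, j: 4 * i + j

# Observational constraints plus the normalization sum(q) = 1.
A_eq, b_eq = [], []
for (z, x, y), p in p_obs.items():
    row = np.zeros(n)
    for i in range(4):
        for j in range(4):
            if x_of_z(i, z) == x and y_of_x(j, x) == y:
                row[idx(i, j)] = 1.0
    A_eq.append(row)
    b_eq.append(p)
A_eq.append(np.ones(n))
b_eq.append(1.0)

# Objective: the interventional quantity P(Y = 1 | do(X = 1)).
c = np.zeros(n)
for i in range(4):
    for j in range(4):
        if y_of_x(j, 1) == 1:
            c[idx(i, j)] = 1.0

lo = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, 1)] * n)
hi = linprog(-c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, 1)] * n)
print(f"P(Y=1 | do(X=1)) lies in [{lo.fun:.3f}, {-hi.fun:.3f}]")
```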

 

Estimating Identifiable Causal Effects through Double Machine Learning
Yonghan Jung, Jin Tian, Elias Bareinboim Columbia University

Learning causal effects from observational data is a pervasive challenge found throughout the data-intensive sciences. General methods for determining the identifiability of causal effects from a combination of observational data and causal knowledge about the underlying system are well understood in theory. In practice, however, there are still challenges to estimating identifiable causal functionals from finite samples. Recently, a novel approach named double/debiased machine learning (DML) (Chernozhukov et al. 2018) was proposed to learn parameters leveraging modern machine learning techniques, yielding estimators that are robust to both model misspecification (‘doubly robust’) and slow convergence (‘debiased’). Still, DML has only been used for causal estimation in settings where the back-door condition (also known as conditional ignorability) holds.

This paper aims to bridge this gap by developing a general class of estimators, for any identifiable causal functional, that exhibit the robustness properties of DML estimators; the authors call this class ‘DML-ID.’ In particular, they provide a complete procedure for deriving an essential ingredient of the DML estimator called an influence function (IF) and construct a general class of estimators based on the IF. This means that one can estimate any identifiable causal functional and enjoy two robustness properties: double robustness and debiasedness.
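As a point of reference, here is a minimal sketch of the standard cross-fitted, doubly robust DML estimator for the back-door (conditional ignorability) setting that the paper generalizes. It is illustrative only, with generic nuisance models and toy data, and is not the DML-ID procedure itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def dml_ate(X, T, Y, n_splits=2, seed=0):
    """Cross-fitted AIPW/DML estimate of E[Y | do(T=1)] - E[Y | do(T=0)]
    under the back-door condition, adjusting for covariates X."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Nuisance 1: propensity score e(x) = P(T = 1 | X = x)
        e = GradientBoostingClassifier().fit(X[train], T[train])
        e_hat = np.clip(e.predict_proba(X[test])[:, 1], 0.01, 0.99)
        # Nuisance 2: outcome regressions mu_t(x) = E[Y | T = t, X = x]
        mu1 = GradientBoostingRegressor().fit(X[train][T[train] == 1], Y[train][T[train] == 1])
        mu0 = GradientBoostingRegressor().fit(X[train][T[train] == 0], Y[train][T[train] == 0])
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        t, y = T[test], Y[test]
        # Doubly robust score: consistent if either nuisance is well estimated
        psi[test] = m1 - m0 + t * (y - m1) / e_hat - (1 - t) * (y - m0) / (1 - e_hat)
    return psi.mean()

# Toy synthetic data with confounding through X; the true effect is 2.0
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * T + X[:, 0] + rng.normal(size=5000)
print(dml_ate(X, T, Y))
```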

 

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding 
Long Chen Tencent AI Lab, Wenbo Ma Zhejiang University, Jun Xiao Zhejiang University, Hanwang Zhang Nanyang Technological University, Shih-Fu Chang Columbia University

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. 

In this paper, the researchers argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: the proposals are generated solely based on detection confidence (i.e., they are expression-agnostic), in the hope that they contain all of the instances mentioned in the expression (i.e., that they are expression-aware). Due to this mismatch, current two-stage methods suffer a severe performance drop when ground-truth proposals are replaced with detected ones.

The paper proposes Ref-NMS, the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores guide the NMS operation to filter out boxes irrelevant to the expression, increasing the recall of critical objects and significantly improving grounding performance.
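A minimal sketch of the idea, assuming a per-box detection confidence and a predicted expression-relatedness score are already available (the lightweight scoring module is the paper's contribution and is not reproduced here): fuse the two scores and run standard non-maximum suppression on the fused ranking.

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def expression_aware_nms(boxes, det_scores, rel_scores, iou_thresh=0.5, keep_top=50):
    """Rank proposals by a fused score (detection confidence times predicted
    relatedness to the expression's nouns) before suppression."""
    order = np.argsort(-(det_scores * rel_scores))
    keep = []
    while order.size and len(keep) < keep_top:
        i, order = order[0], order[1:]
        keep.append(int(i))
        order = order[iou(boxes[i], boxes[order]) <= iou_thresh]
    return keep
```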

Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Code is available at: https://github.com/ChopinSharp/ref-nms.

 

Q&A: Jihye Kwon on PhD Research Projects

Jihye Kwon, a computer engineering PhD student, talks about her research projects and what it took to win a Best Paper award.

Jihye Kwon

What drew you to computer engineering, specifically the application of machine learning to computer-aided design? What questions or issues do you hope to answer?

I was attracted to the concept of a computer: a machine that performs calculations. I found it very interesting how modern computers evolved from executing one instruction at a time to executing many instructions simultaneously by exploiting multiple levels of parallelism. Still, various challenges remained, or newly arose, so I dreamed about designing a brand-new computer system. That is what I had in mind when coming to Columbia.

At the beginning of my PhD, I experimented and learned how to design the core parts of special-purpose computers, using computer-aided design tools. I also explored machine learning from both theoretical and practical perspectives. These activities led me to work on my current research problems.

In advanced computer-aided design of computer systems, computers solve many complex optimization problems in steps to generate a final design. They do so guided by the designers through configurable ‘knobs’. My focus is on the designers’ work.

For a target system, designers run the computer-aided design tools repeatedly with the many different knob configurations until the tools output final designs with optimal or desired properties, e.g., in timing, area, and power. I wondered if machines can learn, from designers’ previous work, how to configure the knobs to optimize a new target system. Can designers virtually collaborate across time and tasks through the machine learning models? These are the main questions that I hope to answer.

Could you talk about your research and how you collaborated with other groups? Was this something you considered when applying to Columbia – that there are opportunities to do multi-disciplinary work?

When I was applying to Columbia, I wished I could have collaboration opportunities to study and work in the interdisciplinary research communities at the center of New York City. I wanted to explore applications of computer science in different areas to eventually gain insight and inspiration for my own research, which is centered on computer engineering.

Fortunately, these were realized as I worked with my advisor, Professor Luca Carloni. I was invited to join the project “Energy Efficient Computing with Chip-Based Photonics”, which is part of a large initiative supported by the government and industry. In this project, I worked closely with the Lightwave Research Laboratory in Electrical Engineering on a new optical computing system. We proposed the concept of a next-generation computing system co-designed with silicon photonics and electronic circuitry, in order to overcome the fundamental physical limitations of today’s computers.

Another project on optical communication was initiated from a student project that I mentored in my advisor’s class, Embedded Scalable Platforms. This project investigated the use of photonic switches in optically-connected memory systems for deep learning applications.

Outside Columbia, I have also collaborated with researchers at the IBM T.J. Watson Research Center through my summer internships, on a project auto-tuning computer-aided design flows for commercial supercomputers. All these collaborations opened new horizons for me.

 

You won the MLCAD 2020 Best Paper award for your research. Can you talk about your process – how did the research come about? How long did it take you to complete the work? What were the things you had to overcome?

In this work, I proposed a novel machine learning approach for computer-aided design optimization of hardware accelerators. I wanted to address this problem because it is computationally very expensive to explore the entire optimization space. It took me about one year to complete the work. One of the biggest difficulties I faced was the limited availability of the data for applying machine learning to the problem.

Then I found out that transfer learning had recently been applied successfully in other areas with limited data. In transfer learning, a model trained for a related problem (e.g., natural image recognition) is transferred to aid learning for the target problem (e.g., face recognition). Hence, I tried to apply transfer learning to my research problem. I trained a neural network model for a different accelerator design and transferred the model to predict the design properties of a target accelerator.
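A generic sketch of that plain transfer setup, with hypothetical names and a deliberately simple model (not the one from the award-winning paper): the feature layers trained on a source accelerator are reused, and only a fresh prediction head is fit on the small target dataset.

```python
import torch.nn as nn

class KnobRegressor(nn.Module):
    """Toy MLP mapping a knob configuration (e.g., unrolling factors,
    pipelining options) to predicted design properties such as latency and area."""
    def __init__(self, n_knobs, n_targets=2):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(n_knobs, 64), nn.ReLU(),
                                      nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(64, n_targets)
    def forward(self, x):
        return self.head(self.features(x))

def transfer(source_model, n_knobs, freeze_features=True):
    """Reuse feature layers learned on a source accelerator; re-train only a
    fresh head on the (small) target-accelerator dataset."""
    target_model = KnobRegressor(n_knobs)
    target_model.features.load_state_dict(source_model.features.state_dict())
    if freeze_features:
        for p in target_model.features.parameters():
            p.requires_grad = False
    return target_model
```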

However, the transferred model did not perform well in this case. I realized that due to the diverse characteristics of the accelerators, I needed to distinguish which piece of the source information should be transferred. Based on this intuition, I constructed a series of new models, and eventually, proposed one with promising performance. While it was a long process of building new models without knowing the answers, my advisor greatly encouraged me in our discussions to keep moving forward, and it was very rewarding in the end.

The Machine Learning for Systems session from the 2nd ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) can be viewed here, and the Best Paper announcement here.

 

Looking back, how have you grown as a researcher and a person?

Besides expanding my problem-solving capabilities and technical skills, I have grown to be a better presenter and communicator. One of the tasks of a researcher is to explain one’s work to various groups and different types of audiences. I had a number of opportunities to present my work at academic conferences, seminars at companies, lightning talks, and annual project reviews. Initially, I struggled to meet the interests of audiences whose expertise spans a diverse range of areas and levels. Through those opportunities, I received very helpful feedback, tried to ask myself questions from other people’s perspectives, and progressively learned to keep a good balance between abstraction and elaboration.

Also, by interacting with a lot of students with heterogeneous backgrounds in the classes I TA’ed, I have learned to understand what their questions mean and where they come from. Based on that, I tried to adjust my answers to have more relatable conversations. From those conversations, sometimes the students found the topics very interesting, and sometimes I learned something new from them. It was such a great pleasure to inspire others and to be inspired. I think those experiences have made me a better researcher and person.

 

There are many organizations on campus; why did you choose to join Womxn in Computer Science (WiCS)?

In Fall 2017, I received an invitation from WiCS’ president, Julia Di, and was impressed by the passionate and caring board members working on the common goal of supporting the advancement of womxn in computer science. In my second year I launched the WiCS Lightning Talks for students with research experience to share their work and stories. The goal was for young students to get to know more about research and demystify the process.

I am one of the few women at Columbia in my research area of computer engineering and would like to contribute to inspiring the next generation to join us.

 

What was the highlight of your time at Columbia?

Every moment was special for me. Some of the highlights were during happy hour with members of the fishbowl. The fishbowl is a large office occupied by the majority of PhD students in computer engineering. We call it the fishbowl because it is surrounded by large windows and the students inside look like small fish. Once, my colleagues talked about their memories of old computers that I had never seen. I enjoyed imagining the machines from their descriptions and thinking about different types and generations of computers.

 

What is your advice to students on how to navigate their time at Columbia?

Explore, experience, and exploit. There are recommended lists of classes, activities, and companies, depending on your track and interests, but no one is exactly like you. There is such a great variety of opportunities and resources at Columbia and in New York City. I hope you can spend enough time exploring them and get involved in many ways before determining your academic and career goals.

 

Is there anything else that you think people should know?

Columbia is beautiful in the snow! It gets pretty windy in the winter, so please be aware if you are coming from warmer places. There are many places where you can study but Avery Library is my favorite library on campus. If you have any questions or opinions on this Q&A story, please feel free to drop me a line!

Three CS Students Recognized By The Computing Research Association

For this year’s Outstanding Undergraduate Researcher Award, Payal Chandak, Sophia Kolak, and Yanda Chen were among the students recognized by the Computing Research Association (CRA) for their work in computing research.


Payal Chandak
Finalist

Using Machine Learning to Identify Adverse Drug Effects Posing Increased Risk to Women
Payal Chandak Columbia University, Nicholas Tatonetti Columbia University

The researchers developed AwareDX – Analysing Women At Risk for Experiencing Drug toXicity – a machine learning algorithm that identifies and predicts differences in adverse drug effects between men and women by analyzing 50 years’ worth of reports in an FDA database. The algorithm automatically corrects for biases in these data that stem from an overrepresentation of male subjects in clinical research trials.
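One simple way to illustrate the kind of correction involved (an illustration with hypothetical column names, not the published AwareDX pipeline) is to repeatedly subsample sex-balanced cohorts from the adverse-event reports before comparing how often an effect is reported for a drug in women versus men.

```python
import pandas as pd

def balanced_reporting_rates(reports, drug, effect, n_iter=25, seed=0):
    """Compare reporting rates of `effect` for `drug` between women and men
    after subsampling equal-sized cohorts of each sex.

    `reports` is assumed to be a DataFrame with columns: sex, drug, adverse_effect."""
    exposed = reports[reports["drug"] == drug]
    n = min((exposed["sex"] == "F").sum(), (exposed["sex"] == "M").sum())
    rates = {"F": [], "M": []}
    for i in range(n_iter):
        for sex in ("F", "M"):
            cohort = exposed[exposed["sex"] == sex].sample(n, random_state=seed + i)
            rates[sex].append((cohort["adverse_effect"] == effect).mean())
    # Average over resamples; a large gap flags a candidate sex difference
    return {sex: sum(v) / len(v) for sex, v in rates.items()}
```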

Though men and women can have different responses to medications – the sleep aid Ambien, for example, is metabolized more slowly in women, causing next-day grogginess – doctors may not know about these differences because most clinical trial data is biased toward men. This trickles down to impact prescribing guidelines, drug marketing, and ultimately, patients’ health. Unfortunately, clinical trials have historically studied men almost exclusively, often not including women at all. As a result, there is far less information about how women respond to drugs compared to men. The research tries to bridge this information gap.



Sophia Kolak
Finalist

It Takes a Village to Build a Robot: An Empirical Study of The ROS Ecosystem
Sophia Kolak Columbia University, Afsoon Afzal Carnegie Mellon University, Claire Le Goues Carnegie Mellon University, Michael Hilton Carnegie Mellon University, Christopher Steven Timperley Carnegie Mellon University

The Robot Operating System (ROS) is the most popular framework for robotics development. In this paper, the researchers conducted the first major empirical study of ROS, with the goal of understanding how developers collaborate across the many technical disciplines that coalesce in robotics.

Building a complete robot is a difficult task that involves bridging many technical disciplines. ROS aims to simplify development by providing reusable libraries, tools, and conventions for building a robot. Still, as building a robot requires domain expertise in software, mechanical, and electrical engineering, as well as artificial intelligence and robotics, ROS faces knowledge-based barriers to collaboration. The researchers wanted to understand how the necessity of domain-specific knowledge impacts the open-source collaboration model in ROS.

Virtually no one is an expert in every subdomain of robotics: experts who create computer vision packages likely need to rely on software designed by mechanical engineers to implement motor control. As a result, the researchers found that development in ROS is centered around a few unique subgroups each devoted to a different specialty in robotics (i.e. perception, motion). This is unlike other ecosystems, where competing implementations are the norm.

Detecting Performance Patterns with Deep Learning
Sophia Kolak Columbia University

Performance has a major impact on the overall quality of a software project. Performance bugs—bugs that substantially decrease run-time—have long been studied in software engineering, and yet they remain incredibly difficult for developers to handle. In this project, the researchers leveraged contemporary methods in machine learning to create graph embeddings of Python code that can be used to automatically predict performance.

Using un-optimized programming language concepts can lead to performance bugs and the researchers hypothesized that statistical language embeddings could help reveal these patterns. By transforming code samples into graphs that captured the control and data flow of a program, the researchers studied how various unsupervised embeddings of these graphs could be used to predict performance.  

Implementing “sort” by hand as opposed to using the built-in Python sort function is an example of a choice that typically slows down a program’s run-time. When the researchers embedded the AST and data flow of a code snippet in Euclidean space (using DeepWalk), patterns like this were captured in the embedding and allowed classifiers to learn which structures are correlated with various levels of performance.   
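A rough sketch of such a pipeline (illustrative only: it builds a plain parent-child AST graph rather than the full control- and data-flow graphs used in the study), using networkx for the graph and gensim's Word2Vec for DeepWalk-style embeddings:

```python
import ast
import random
import networkx as nx
from gensim.models import Word2Vec

def ast_graph(source):
    """Build a simple graph over AST nodes (parent-child edges only)."""
    tree, g = ast.parse(source), nx.Graph()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            g.add_edge(id(parent), id(child))
            g.nodes[id(parent)]["label"] = type(parent).__name__
            g.nodes[id(child)]["label"] = type(child).__name__
    return g

def deepwalk_embedding(g, walks_per_node=10, walk_len=20, dim=64):
    """DeepWalk-style embedding: random walks over the graph are treated as
    'sentences' and fed to a skip-gram Word2Vec model."""
    walks = []
    for _ in range(walks_per_node):
        for start in g.nodes:
            walk, node = [], start
            for _ in range(walk_len):
                walk.append(g.nodes[node]["label"])
                node = random.choice(list(g.neighbors(node)))
            walks.append(walk)
    return Word2Vec(walks, vector_size=dim, window=5, min_count=1, sg=1)

# The resulting node vectors (or their average) can be fed to any classifier
# trained to predict a performance label for the snippet.
g = ast_graph("def squares(xs):\n    return [x * x for x in xs]")
model = deepwalk_embedding(g)
```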

“I was surprised by how often research changes directions,” said Sophia Kolak. In both projects, the researchers started out with one set of questions but answered completely different ones by the end. “It showed me that, in addition to persistence, research requires open-mindedness.”

 


Yanda Chen
Honorable Mention

Cross-language Sentence Selection Via Data Augmentation and Rationale Training
Yanda Chen Columbia University, Chris Kedzie Columbia University, Suraj Nair University of Maryland, Petra Galuscakova University of Maryland, Rui Zhang Yale University, Douglas Oard University of Maryland, and Kathleen McKeown Columbia University

In this project, the researchers proposed a new approach to cross-language sentence selection, where they used models to predict sentence-level query relevance with English queries over sentences within document collections in low-resource languages such as Somali, Swahili, and Tagalog. 

The system is used as part of a cross-lingual information retrieval and query-focused summarization pipeline. For example, if a user enters the query “business activity” and specifies Swahili as the language of the source documents, the system automatically retrieves the Swahili documents related to “business activity” and produces short summaries that are then translated from Swahili to English.

A major challenge of the project was the lack of training data for low-resource languages. To tackle this problem, the researchers proposed to generate a relevance dataset of query-sentence pairs through data augmentation based on parallel corpora collected from the web. To mitigate the spurious correlations learned by the model, they proposed the idea of rationale training where they first trained a phrase-based statistical machine translation system and used the alignment information to provide additional supervision for the models. 
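A toy sketch of the data-augmentation step, with made-up placeholder sentences (the rationale-training component, which uses a statistical MT system's word alignments as extra supervision, is not shown): sample a content word from the English side of each parallel pair as a synthetic query, label the aligned foreign sentence as relevant, and label randomly drawn foreign sentences as non-relevant.

```python
import random

def augment_relevance_pairs(parallel, negatives_per_positive=1, seed=0):
    """Turn a parallel corpus [(english, foreign), ...] into synthetic
    (query, foreign_sentence, label) training triples."""
    rng = random.Random(seed)
    triples = []
    for english, foreign in parallel:
        words = [w for w in english.lower().split() if len(w) > 3]
        if not words:
            continue
        query = rng.choice(words)            # synthetic English query term
        triples.append((query, foreign, 1))  # aligned sentence is relevant
        for _ in range(negatives_per_positive):
            _, negative = rng.choice(parallel)
            if negative != foreign:
                triples.append((query, negative, 0))
    return triples

# Placeholder corpus; in practice the pairs come from web-crawled parallel data.
parallel = [("the market opened early today", "<foreign sentence 1>"),
            ("business activity increased last year", "<foreign sentence 2>")]
print(augment_relevance_pairs(parallel))
```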

The approach achieved state-of-the-art results on both text and speech across three languages – Somali, Swahili, and Tagalog. 

 

Natural Language Processing Papers Accepted to EMNLP 2020

Six papers from the Speech & NLP group were accepted to the Empirical Methods in Natural Language Processing (EMNLP) conference. 

Generating Similes Effortlessly Like a Pro: A Style Transfer Approach for Simile Generation
Tuhin Chakrabarty Columbia University, Smaranda Muresan Columbia University, and Nanyun Peng University of Southern California and University of California, Los Angeles

Abstract:
Literary tropes, from poetry to stories, are at the crux of human imagination and communication. Figurative language, such as a simile, goes beyond plain expressions to give readers new insights and inspirations. We tackle the problem of simile generation. Generating a simile requires proper understanding for effective mapping of properties between two concepts. To this end, we first propose a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit to their literal counterpart using structured common sense knowledge. We then fine-tune a pre-trained sequence-to-sequence model, BART (Lewis et al., 2019), on the literal-simile pairs to generate novel similes given a literal sentence. Experiments show that our approach generates 88% novel similes that do not share properties with the training data. Human evaluation on an independent set of literal statements shows that our model generates similes better than two literary experts 37% of the time, and better than three baseline systems, including a recent metaphor generation model, 71% of the time when compared pairwise. We also show how replacing literal sentences with similes from our best model in machine-generated stories improves evocativeness and leads to better acceptance by human judges.
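For readers unfamiliar with the setup, below is a minimal fine-tuning sketch using the Hugging Face transformers API; the literal-simile pairs and hyperparameters are placeholders, not the paper's corpus or training recipe.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Hypothetical literal -> simile pairs; the paper constructs such a corpus
# automatically from Reddit similes and common-sense knowledge.
pairs = [("The night was very dark.", "The night was as dark as coal."),
         ("He ran very fast.", "He ran like the wind.")]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for literal, simile in pairs:
    batch = tokenizer(literal, return_tensors="pt")
    labels = tokenizer(simile, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Generate a simile for an unseen literal sentence
out = model.generate(**tokenizer("The exam was very hard.", return_tensors="pt"),
                     num_beams=5, max_length=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```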

 

Content Planning for Neural Story Generation with Aristotelian Rescoring
Seraphina Goldfarb-Tarrant University of Southern California and University of Edinburgh, Tuhin Chakrabarty Columbia University, Ralph Weischedel University of Southern California and Nanyun Peng University of Southern California and University of California, Los Angeles

Abstract:
Long-form narrative text generated from large language models manages a fluent impersonation of human writing, but only at the local sentence level, and lacks structure or global cohesion. We posit that many of the problems of story generation can be addressed via high-quality content planning, and present a system that focuses on how to learn good plot structures to guide story generation. We utilize a plot-generation language model along with an ensemble of rescoring models that each implement an aspect of good story-writing as detailed in Aristotle’s Poetics. We find that stories written with our more principled plot structure are both more relevant to a given prompt and higher quality than baselines that do not content plan, or that plan in an unprincipled way.

 

Severing the Edge Between Before and After: Neural Architectures for Temporal Ordering of Events
Miguel Ballesteros Amazon AI, Rishita Anubhai Amazon AI, Shuai Wang Amazon AI, Nima Pourdamghani Amazon AI, Yogarshi Vyas Amazon AI, Jie Ma Amazon AI, Parminder Bhatia Amazon AI, Kathleen McKeown Columbia University and Amazon AI and Yaser Al-Onaizan Amazon AI

Abstract:
In this paper, we propose a neural architecture and a set of training methods for ordering events by predicting temporal relations. Our proposed models receive a pair of events within a span of text as input and identify the temporal relation (Before, After, Equal, Vague) between them. Given that a key challenge of this task is the scarcity of annotated data, our models rely on pre-trained representations (i.e., RoBERTa, BERT, or ELMo), transfer and multi-task learning (by leveraging complementary datasets), and self-training techniques. Experiments on the MATRES dataset of English documents establish a new state-of-the-art on this task.

 

Controllable Meaning Representation to Text Generation: Linearization and Data Augmentation Strategies
Chris Kedzie Columbia University and Kathleen McKeown Columbia University

Abstract:
We study the degree to which neural sequence-to-sequence models exhibit fine-grained controllability when performing natural language generation from a meaning representation. Using two task-oriented dialogue generation benchmarks, we systematically compare the effect of four input linearization strategies on controllability and faithfulness. Additionally, we evaluate how a phrase-based data augmentation method can improve performance. We find that properly aligning input sequences during training leads to highly controllable generation, both when training from scratch and when fine-tuning a larger pre-trained model. Data augmentation further improves control on difficult, randomly generated utterance plans.

Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations
Emily Allaway Columbia University and Kathleen McKeown Columbia University

Abstract:
Stance detection is an important component of understanding hidden influences in everyday life. Since there are thousands of potential topics to take a stance on, most with little to no training data, we focus on zero-shot stance detection: classifying stance from no training examples. In this paper, we present a new dataset for zero-shot stance detection that captures a wider range of topics and lexical variation than in previous datasets. Additionally, we propose a new model for stance detection that implicitly captures relationships between topics using generalized topic representations and show that this model improves performance on a number of challenging linguistic phenomena.

 

Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios
Ramy Eskander Columbia University, Smaranda Muresan Columbia University, and Michael Collins Columbia University

Abstract:
We describe a fully unsupervised cross-lingual transfer approach for part-of-speech (POS) tagging under a truly low resource scenario. We assume access to parallel translations between the target language and one or more source languages for which POS taggers are available. We use the Bible as parallel data in our experiments: small size, out-of-domain, and covering many diverse languages. Our approach innovates in three ways: 1) a robust approach of selecting training instances via cross-lingual annotation projection that exploits best practices of unsupervised type and token constraints, word-alignment confidence and density of projected POS, 2) a Bi-LSTM architecture that uses contextualized word embeddings, affix embeddings and hierarchical Brown clusters, and 3) an evaluation on 12 diverse languages in terms of language family and morphological typology. In spite of the use of limited and out-of-domain parallel data, our experiments demonstrate significant improvements in accuracy over previous work. In addition, we show that using multi-source information, either via projection or output combination, improves the performance for most target languages.

 

Research by CS Undergrad Published in Cell

Payal Chandak (CC ’21) developed a machine learning model, AwareDX, that helps detect adverse drug effects specific to women patients. AwareDX mitigates sex biases in a drug safety dataset maintained by the FDA.

Below, Chandak talks about how her internship under the guidance of Nicholas Tatonetti, associate professor of biomedical informatics and a member of the Data Science Institute, inspired her to develop a machine learning tool to improve healthcare for women. 


Payal Chandak

How did the project come about? 
I initiated this project during my internship at the Tatonetti Lab (T-lab) the summer after my first year. T-lab uses data science to study the side effects of drugs. I did some background research and learned that women face a two-fold greater risk of adverse events compared to men. While knowledge of sex differences in drug response is critical to drug prescription, there currently isn’t a comprehensive understanding of these differences. Dr. Tatonetti and I felt that we could use machine learning to tackle this problem and that’s how the project was born. 

How many hours did you work on the project? How long did it last? 
The project lasted about two years. We refined our machine learning (ML) model, AwareDX, over many iterations to make it less susceptible to biases in the data. I probably spent a ridiculous number of hours developing it but the journey has been well worth it. 

Were you prepared to work on it or did you learn as the project progressed? 
As a first-year student, I definitely didn’t know much when I started. Learning on the go became the norm. I understood some things by taking relevant CS classes and through reading Medium blogs and GitHub repositories –– this ability to learn independently might be one of the most valuable skills I have gained. I am very fortunate that Dr. Tatonetti guided me through this process and invested his time in developing my knowledge. 

What were the things you already knew and what were the things you had to learn while working on the project? 
While I was familiar with biology and mathematics, computer science was totally new! In fact, T-Lab launched my journey to exploring computer science. This project exposed me to the great potential of artificial intelligence (AI) for revolutionizing healthcare, which in turn inspired me to explore the discipline academically. I went back and forth between taking classes relevant to my research and applying what I learned in class to my research. As I took increasingly technical classes like ML and probabilistic modelling, I was able to advance my abilities. 

Looking back, what were the skills that you wished you had before the project? 
Having some experience with implementing real-world machine learning projects on giant datasets with millions of observations would have been very valuable. 

Was this your first project to collaborate on? How was it? 
This was my first project and I worked under the guidance of Dr. Tatonetti. I thought it was a wonderful experience – not only has it been extremely rewarding to see my work come to fruition, but the journey itself has been so valuable. And Dr. Tatonetti has been the best mentor that I could have asked for! 

Did working on this project make you change your research interests? 
I actually started off as pre-med. I was fascinated by the idea that “intelligent machines” could be used to improve medicine, and so I joined T-Lab. Over time, I’ve realized that recent advances in machine learning could redefine how doctors interact with their patients. These technologies have an incredible potential to assist with diagnosis, identify medical errors, and even recommend treatments. My perspective on how I could contribute to healthcare shifted completely, and I decided that bioinformatics has more potential to change the practice of medicine than a single doctor will ever have. This is why I’m now hoping to pursue a PhD in Biomedical Informatics. 

Do you think your skills were enhanced by working on the project? 
Both my knowledge of ML and statistics and my ability to implement my ideas have grown immensely as a result of working on this project. Also, I failed about seven times over two years. We were designing the algorithm and it was an iterative process – the initial versions of the algorithm had many flaws and we started from scratch multiple times. The entire process took over two years and required a lot of patience and persistence. So, I guess it has taught me immense patience and persistence.

Why did you decide to intern at the T-Lab? 
I was curious to learn more about the intersection of artificial intelligence and healthcare. I’m endlessly fascinated by the idea of improving the standards of healthcare by using machine learning models to assist doctors. 

Would you recommend volunteering or seeking projects out to other students? 
Absolutely. I think everyone should explore research. We have incredible labs here at Columbia with the world’s best minds leading them. Research opens the doors to work closely with them. It creates an environment for students to learn about a niche discipline and to apply the knowledge they gain in class.