CS Undergrad Alyssa Hwang (SEAS ’20) Presented Her Research at Columbia, Harvard, and Stanford Research Conferences

Hwang spent the summer working for the Natural Language Text Processing Lab (NLP) and the Data Science Institute (DSI) on a joint project, doing research on gang violence in Chicago.


What was the topic/central focus of your research project?
I used the DSI’s Deep Neural Inspector to evaluate an NLP model that classified Tweets from gang-related users.

What were your findings?
Through my research, I found that the DNI reported higher correlation between hypothesis functions and neuron/layer output in trained models than in random models, which confirms that the models learn how to classify the input data.

The aggression model showed an interesting correlation with activation hypotheses, as did the loss model with imagery hypotheses, which implies that aggressive speech tends to be very active (intense) and that text expressing loss tends to use language that is concrete rather than abstract. If I had more time to continue this research, I would love to explore different types and sentiments of text and how they affect how well a model learns its task.
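The comparison described above, correlating a hypothesis function with a unit's output in a trained model and in a random one, can be illustrated with a minimal sketch. This is not the Deep Neural Inspector's API or the project's code: the word list, the example sentences, and both "units" are hypothetical stand-ins chosen only to show the shape of the measurement.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical word list standing in for an "activation" (intensity) lexicon.
INTENSE_WORDS = {"angry", "fight", "hate"}


def intensity_hypothesis(text):
    """Hypothesis function: fraction of words found in the toy lexicon."""
    words = text.lower().split()
    return sum(w in INTENSE_WORDS for w in words) / len(words)


def trained_unit(text):
    """Stand-in for a unit in a trained model: its output tracks word intensity."""
    return 0.8 * intensity_hypothesis(text) - 0.1


_rng = random.Random(0)  # seeded so repeated runs give the same numbers


def untrained_unit(text):
    """Stand-in for a unit in a random model: its output ignores the text."""
    return _rng.uniform(-1.0, 1.0)


# Invented example sentences, not drawn from the project's dataset.
texts = [
    "we hate losing and we fight back",
    "calm morning walk by the lake",
    "angry angry crowd",
    "i miss my brother",
    "fight",
    "they were angry about the delay",
]

scores = [intensity_hypothesis(t) for t in texts]
trained_corr = pearson(scores, [trained_unit(t) for t in texts])
untrained_corr = pearson(scores, [untrained_unit(t) for t in texts])
print(f"trained unit:   {trained_corr:.2f}")
print(f"untrained unit: {untrained_corr:.2f}")
```

Because the stand-in trained unit is a linear function of the hypothesis score, its correlation is essentially 1, while the stand-in random unit has no systematic relationship to the hypothesis. A real analysis would read activations from actual model layers rather than from hand-written functions.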

What about the project did you find interesting?
The most interesting part of my research was seeing how interconnected all of these disciplines are. I split most of my time between the Natural Language Processing Lab and the Data Science Institute, but I also had the chance to meet some great people from the School of Social Work. Their work on gang-related speech is part of an even bigger project to predict, and later prevent, violence based on social media data.

How did you get involved in/choose this project?
I’ve been working at the NLP Lab since freshman year and decided to continue working there over the summer. In my opinion, research is one of the best ways to develop your skillset and ask questions of people already established in the same field. I knew I wanted to pursue research even before I decided to major in computer science, and I feel so grateful to be included in a lab that combines so many of my interests and develops technology that matters.

How much time did it take and who did you work with?
The project lasted three months, and I worked with two CS faculty members: Professor Kathy McKeown and Professor Eugene Wu.

Which CS classes were most helpful in putting this project together?
Python, Data Structures

What were some obstacles you faced in working on this project?
I had just finished my sophomore year when I tackled this project, which means that the most advanced class I had taken at that point was Advanced Programming. I spent a lot of time just learning: figuring out how machine learning models work, reading a natural language processing textbook, and even conducting a literature review on violence, social media, and Chicago gangs just so I could familiarize myself with the dataset. I felt that I had to absorb an enormous amount of information all at once, which was intimidating, but I was surrounded by people with infinite patience for all of my questions.

What were some positives of this project?
Through this project, I really started to appreciate how accessible computer science is. Half of the answers we need are already out on the internet. The other half is exactly why we need research. I can learn an entire programming language for free in a matter of days thanks to all of these online resources, but it takes a bit more effort to answer the questions I am interested in: What makes text persuasive? What’s a fair way of summarizing emotional multi-document texts?

Can you discuss your experience presenting?
Along with the Columbia Summer Symposium, I have presented my research at the Harvard National Collegiate Research Conference and the Stanford Research Conference.

Do you plan to present this research at any other events/conferences?
Yes, but I have yet to hear if I have been accepted.

What do you plan to do with your CS undergraduate degree?
Not sure yet, but definitely something in the natural language understanding/software engineering space.

Do you see yourself pursuing research after graduation?
Yes! I loved working on a project that mattered and added good to the world beyond just technology. I also loved presenting my research because it inspired me to think beyond my project: what more can we do, how can others use this research, and how can we keep thinking bigger?

How Can We Keep Genetic Data Safe?

In light of how easy it is to identify people based on their DNA, researchers suggest ways to protect genetic information.


Genetic information uploaded to a website is now used to help identify criminals. This technique, employed by law enforcement to solve the Golden State Killer case, took genetic material from the crime scene and compared it to publicly available genetic information on the third-party website GEDmatch.

Inspired by how the Golden State Killer was caught, researchers set out to see just how easy it is to identify individuals by searching databases and finding genetic matches through distant relatives. The paper out today in Science Magazine also proposes a way to protect genetic information.

“We want people to discover their genetic data,” said the paper’s lead author, Yaniv Erlich, a computer scientist at Columbia University and Chief Science Officer at MyHeritage, a genealogy and DNA testing company. “But we have to think about how to keep people safe and prevent issues.”

Commercially available genetic tests are increasingly popular, and users can opt to have their information used by genetic testing companies. Companies like 23andMe have used customers’ data for research to discover therapeutics and come up with hypotheses for developing medicines. People can also upload their genetic information to third-party websites, such as GEDmatch and DNA.Land, to find long-lost relatives.

In these scenarios, the data is used for good, but what about the opposite? The same techniques could easily be turned to harmful ends, for example against people who work in covert operations (aka spies) and need their identities to remain secret.

Erlich shared that it takes roughly a day and a half to sift through a dataset of 1.28 million individuals to identify a third cousin. Such a match is especially likely for people of European descent in the United States. Then, based on sex, age, and area of residence, it is easy to narrow the pool down to 40 individuals. At that point, the information can be used as an investigative lead.
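The narrowing step described above, going from a relative's extended family to a short list using sex, age, and area of residence, amounts to a simple filter. The sketch below is only an illustration under stated assumptions: the `Candidate` fields, the `narrow_candidates` helper, and every record are hypothetical and synthetic, and none of it reflects the researchers' actual pipeline or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A synthetic person record; the field names are assumptions for illustration."""
    name: str
    sex: str
    birth_year: int
    region: str


def narrow_candidates(candidates, sex, birth_years, region):
    """Keep candidates matching the known sex, birth-year range, and region."""
    first_year, last_year = birth_years
    return [
        c for c in candidates
        if c.sex == sex
        and first_year <= c.birth_year <= last_year
        and c.region == region
    ]


# Invented relatives of a hypothetical third-cousin match.
family_tree = [
    Candidate("person_a", "M", 1948, "CA"),
    Candidate("person_b", "F", 1950, "CA"),
    Candidate("person_c", "M", 1975, "CA"),
    Candidate("person_d", "M", 1946, "NY"),
    Candidate("person_e", "M", 1952, "CA"),
]

leads = narrow_candidates(family_tree, sex="M", birth_years=(1940, 1955), region="CA")
print([c.name for c in leads])
```

Each attribute removes part of the pool, which is why a handful of ordinary demographic facts can shrink a large family tree to a list short enough to investigate by hand.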

To reduce the risk and protect people, the researchers propose that raw data be encrypted so that only those with the right key can view and use it.

“Things are complicated, but with the right strategy and policy we can mitigate the risks,” said Erlich.