Yaniv Erlich: Dissecting the complex relationships of genes, health, and privacy

Yaniv Erlich

This January, Yaniv Erlich joins the computer-science faculty at Columbia Engineering. Working in the field of quantitative genomics, he is at the forefront of new gene sequencing techniques and the issue of genetic privacy. He will continue to build on work he started while a Fellow at MIT’s Whitehead Institute, where his lab created new algorithms for examining genetic information at both the molecular level and within large-scale human populations. The appointment is a joint one with the New York Genome Center, one that will bridge the strong computer-science communities at Columbia with the Genome Center’s focused efforts on translating genomic research into patient care.

Health and predisposition to disease strongly depend on the genetic material carried deep within the cells of every individual. Increasingly this genetic information, unique to every individual, is becoming public. For Yaniv Erlich, these are two parts of the complex relationship between genes and health, one increasingly understood through quantitative analysis and computer algorithms.
“Computational methods are necessary at every step in examining the genome,” says Erlich. “The strings of nucleotides (A, T, G, and C) that form each person’s unique genome are billions of letters long. It’s not even possible to look at these long sequences without a computer. As more genetic data is collected, concepts from machine learning, statistics and signal processing are needed to detect the subtle variations and statistical patterns that reveal traits and predisposition to disease.”
Erlich is well-positioned for untangling the complex genetic underpinnings of human biology. An initial love of math and a friend’s chance remark that biology entailed a lot of mathematics piqued his interest, and he went on to study both biology and genetics, two sciences increasingly awash in data and in need of algorithms for inferring information from the data. He approached both subjects from a computational perspective.
At MIT’s Whitehead Institute, where he headed a research lab from 2010 through 2014, he had the chance to apply his computational tool kit to genetics. The results were impressive. One method sequenced tens of thousands of samples at a time. Another harnessed signal processing and statistical learning to extract genetic information from short tandem repeats, or STRs, fast-mutating fragments of DNA so small they had been mostly neglected by the research community. Both methods contributed new information about how genes operate at the molecular level.
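To make the idea of a short tandem repeat concrete, here is a minimal Python sketch that scans a DNA string for a short unit repeated back to back. It is a toy illustration only, not the method developed in Erlich’s lab: the function name `find_strs` and its parameters are hypothetical, and real STR analysis works on noisy sequencing reads and relies on the signal-processing and statistical-learning steps mentioned above rather than on exact string matching.

```python
import re

def find_strs(sequence, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each exact tandem repeat.

    Hypothetical helper for illustration: a repeat is a unit of
    min_unit..max_unit letters copied at least min_repeats times in a row.
    """
    # ([ACGT]{2,6}?) captures the shortest candidate unit first;
    # \1{n,} then requires that same unit to follow at least n more times.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        repeat_count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeat_count))
    return hits

# A made-up sequence: the unit GACA repeated four times, flanked by TT.
example = "TT" + "GACA" * 4 + "TT"
print(find_strs(example))
```

The number of copies of the unit is the quantity of interest: because STRs mutate quickly, that count tends to differ from person to person.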
But Erlich was also interested in how genes affect health and traits. Is the relationship linear, where mutations predictably sum to a trait, or is it nonlinear, with mutations interacting unpredictably with one another? The answer required a population-scale analysis, one too large to be constructed using traditional data collection methods. Here Erlich and his Whitehead colleagues came up with an innovative solution. Using existing genealogical data taken from social-media sites, they created a genealogical tree of 13 million individuals dating back to the 15th century. Such a deep genealogy reveals clusters of genetic variation, some tied to longevity, others to rare disorders. (As a side bonus, it also offers an intriguing view into human migration.) Looking at how mutations ripple through populations, researchers can measure the frequency of a certain trait, and thus evaluate the genetic contribution. With a larger tree, even more will become possible.
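The linear-versus-nonlinear question can be stated in a few lines of Python. The sketch below is a simplified illustration with invented numbers, not a model from Erlich’s work: in the linear (additive) case each mutation contributes a fixed amount to the trait, while in the nonlinear case an extra term appears only when two particular mutations occur together. The function names and the `interactions` dictionary are assumptions made for this example.

```python
def additive_trait(genotype, effects):
    """Linear model: the trait is the sum of each mutation's own effect.

    genotype: list of 0/1 flags (mutation absent/present), hypothetical data.
    effects:  list of per-mutation contributions, same length as genotype.
    """
    return sum(g * e for g, e in zip(genotype, effects))

def interacting_trait(genotype, effects, interactions):
    """Nonlinear model: additive part plus pairwise interaction terms.

    interactions maps a pair of mutation indices (i, j) to an extra
    contribution that applies only when both mutations are present.
    """
    value = additive_trait(genotype, effects)
    for (i, j), weight in interactions.items():
        value += weight * genotype[i] * genotype[j]
    return value

# Invented example: mutations 0 and 2 are present, mutation 1 is absent.
genotype = [1, 0, 1]
effects = [1.0, 2.0, 0.5]
interactions = {(0, 2): -1.0}  # the two mutations partly cancel each other

print(additive_trait(genotype, effects))
print(interacting_trait(genotype, effects, interactions))
```

If the additive function already predicts the trait well across a large population, the relationship is close to linear; a persistent gap between the two predictions points to interactions between mutations.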
The promise is great but it all means nothing if there isn’t genetic material to work with. And that requires trust.
People today are more wary about revealing personal data, and they are right to be so. Erlich himself is one of the first to flag the ethical complexities involving genetics and privacy. A paper he spearheaded, released in January 2013, caused a stir by showing how easy it is to take apparently anonymized genetic information donated by research participants and cross-reference it to online data sources. Using only Internet searches with no actual DNA, Erlich and his research team were able to correlate the donated DNA to a surname in 13% of the US population, a result that surprised even Erlich.
“Our study highlighted current gaps in genetic privacy as we enter the brave new era of ubiquitous genetic information,” explains Erlich. “However, we must remember that sharing genetic information is crucial to understand the hereditary basis of devastating disorders. We were pleased to see that our work has helped to facilitate discussions and procedures to better share genetic information in ways that respect participants’ preferences.”
Ensuring better safeguarding of genetic information will encourage more people to contribute their DNA, speeding the day when personalized medicine becomes a reality for all, especially those most at risk of rare genetic diseases. For them, the bigger danger may be in not contributing genetic information.
BSc, Tel-Aviv University; PhD, Watson School of Biological Sciences at the Cold Spring Harbor Laboratory (New York)
Photo credit: Jared Leeds
Posted 1/13/2015
Linda Crane