Overview

How can we best measure, describe and quantify differences between individual DNA sequences? How does sequence variation affect biological processes? How can we use it to understand and influence human disease? All these questions pose complex analytical challenges, with direct impact on medical research.

Human genetics is as ancient as human history. Its computational foundations are intertwined with the most fundamental developments in statistics. Such statistical quantification reveals the tremendous degree to which medical traits are heritable, and motivates a large research community to investigate the connections between gene variants (genotypes) and observed traits (phenotypes). In the third millennium, genetics is flourishing more than ever, with high-throughput technologies generating large-scale data sets, yet the need for computational innovation and for methods that turn these data into meaningful biomedical insights is also greater than ever. The upcoming era, in which complete genotype information will be available to each individual on the planet, therefore holds the potential for great discoveries and poses the challenge of analyzing these data powerfully and rigorously.


Developing methods for analysis of high throughput sequencing data

As the cost of DNA sequencing falls, there is an increased need for accurate and fast analysis algorithms.

We have developed a tool to model copy number variants from exome sequencing data using read
depth information, normalized against the other samples in the experiment. In addition, we are
evaluating imputation from exome sequencing as a way to capture lower-frequency variants in an
isolated population. Imputation reduces the number of individuals that need to be sequenced to
capture the variants present in the population. We are using these exome-derived variants to better
understand associations with common disease in this isolated population.
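
The read-depth idea above can be illustrated with a minimal sketch. It is not the tool's actual implementation: each sample's per-exon read counts are scaled by its total depth, compared with the median of the other samples in the experiment, and converted to a log2 ratio. The function names, the toy counts and the gain/loss thresholds are all illustrative assumptions.

```python
import math
from statistics import median


def log2_depth_ratios(depths, test_sample):
    """Per-exon log2 ratio of one sample against the other samples.

    depths maps a sample name to its read counts per exon (toy input format).
    """
    # Scale by total depth so that differences in library size cancel out.
    scaled = {s: [c / sum(counts) for c in counts] for s, counts in depths.items()}
    references = [s for s in scaled if s != test_sample]
    ratios = []
    for i, value in enumerate(scaled[test_sample]):
        baseline = median(scaled[s][i] for s in references)
        if value == 0 or baseline == 0:
            ratios.append(None)  # no usable depth signal at this exon
        else:
            ratios.append(math.log2(value / baseline))
    return ratios


def call_copy_number(ratios, loss=-0.6, gain=0.4):
    """Label each exon; the thresholds are assumed, not calibrated."""
    calls = []
    for r in ratios:
        if r is None:
            calls.append("no-call")
        elif r <= loss:
            calls.append("deletion")
        elif r >= gain:
            calls.append("duplication")
        else:
            calls.append("normal")
    return calls


# Toy experiment: sample "t" has half the expected depth at the second exon.
depths = {
    "r1": [100, 200, 150, 120, 130],
    "r2": [100, 200, 150, 120, 130],
    "r3": [100, 200, 150, 120, 130],
    "t": [100, 100, 150, 120, 130],
}
calls = call_copy_number(log2_depth_ratios(depths, "t"))
```

A real caller would also need to account for capture efficiency, GC content and noise across neighboring exons, none of which this sketch attempts.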

High-throughput sequencing technologies allow us to better examine rare genetic variants
and their role in both rare and common traits.

Contact person: Arthi


Inferring the transcriptional network from genetic-genomic data

Unraveling the structural and functional organization of human transcription, and how it is
driven by genetic variation, holds promise for a better understanding of various biological
mechanisms and, often, their clinical implications.

We develop computational models, clustering algorithms and inference methods that leverage
the emerging data from current technologies to improve our understanding of the influence
of genetic variants on gene expression via multiple regulatory elements (i.e. network
organization). In addition to pathway architecture, we are interested in the mechanisms of gene
expression regulation.

Specifically, we are inferring modules of genes that share similar functions and are associated
with a single genetic variant. This model can be applied to different tissues to explore how
regulatory processes may change between them. It can also be extended to study the difference
in regulation between healthy and affected tissues, focusing on diseases where regulatory
functions are compromised (e.g. cancer). Our ultimate goal is to create a comprehensive,
predictive model of the human regulatory network.

Contact person: Anat


Population genetics of identity-by-descent

Data-driven studies of populations help us understand demographic and evolutionary phenomena
and their relationship with genotypic and phenotypic variation.

Leveraging the presence of long DNA segments that are co-inherited from common ancestors
within and across populations, we have developed mathematical and computational tools to
detect and quantify demographic events such as expansion, bottlenecks, founder events,
migration and admixture. Our methods are particularly focused on genetic variation that
developed in recent millennia, where classical analysis offers limited resolution.

We apply our tools to quantify the substructure, relationships and recent genetic history of
several groups of diverse geographical and ethnic origins. Here, complex demographic and
evolutionary processes reveal a substantially different pattern from panmictic, neutral
models of evolution.

By elucidating crucial events in the recent history of our species, our methods help us
understand the relationships between historical events and evolutionary processes,
providing the context for the study of rare genetic variants and their role in the heritability
of common traits.

Contact person: Pier


Development and application of computational methodologies to analyze
cancer data in order to identify novel gene variants

Cancer is a complex disease with genetic factors that are inherited or acquired.
Identification of these factors, which are under constant selective pressure in a
tumor micro-environment, is a core research objective.

We utilize novel, computationally efficient methodologies for integrated statistical
analysis of germline and somatic variants, which traditionally have been
examined independently. This enables us to identify novel gene variants and their
contributions to the unexplained component of the heritability of cancer.

Specifically, we have found inherited variants that were also selected for somatic
amplification in glioblastoma. A subset of these variants corresponded to allele-specific
expression levels of known cancer-related genes. These findings improve our
understanding of the connection between somatic and germline genotypes.

Contact person: Ninad


Mapping causal variants underlying common diseases in founder
populations

Elucidating common variants underlying complex human diseases is an important
first step for implicating novel biological pathways, and holds the potential for future therapies.

We have developed methods to map common and rare genetic variation to evolutionarily
and biomedically relevant traits in large-scale populations, in particular genome-wide
association studies (GWAS) in founder populations such as South Pacific Islanders and
Ashkenazi Jews. In populations with reduced genetic, phenotypic and environmental
heterogeneity, the connection between the genotype and phenotype can be studied
under a more carefully controlled set of conditions. However, it is possible that
non-genetic confounding factors exist in these populations, and we attempt to control
for such factors.

We have focused on analyzing specific variations that, in the founder population, might
have increased in frequency due to genetic drift. Our strategy uses a combination of
identity-by-descent haplotype mapping and variance component models.

Contact person: Eimear


Discovering and leveraging hidden relatedness in large populations

Given a large enough study cohort, we find that individuals are much more closely related than
neutral models predict. Efficiently determining the specific genomic boundaries of these rare
instances of relatedness can enrich our understanding of the history of a population
or phenotype.

We have developed algorithms for efficiently mapping shared genomic segments in massive
populations; in practice, as large as 50,000 assayed genomes. Focusing on an isolated
founder population, we have used such sharing to select individuals that optimally saturate
our understanding of variant diversity. Our methods of analyzing recent relatedness have
uncovered rare causal variants that are missed by neutral models of variation.

We apply these methods to the study of recent population structure by classifying populations
and analyzing changes within a population. By identifying specific regions with unexpected
levels of relatedness, we are also discovering genetic variants associated with harmless
phenotypes or with disease.

Contact person: Sasha


Genetics of Common Variants: Genomewide Association Studies

Whole-genome arrays of Single Nucleotide Polymorphisms (SNPs) and Copy Number Variants (CNVs) have been designed to represent common variation in the human genome, with the aim of associating such variation with health-related traits. We are involved in the analysis of such Genomewide Association Study (GWAS) data across multiple phenotypes.




Genetics of Rare Variants: Identity by Descent between Purportedly Unrelated Individuals

The co-inheritance of long haplotypes in recent generations is key to the analysis of rare variants carried on the background of these segments, which are Identical By Descent (IBD). We have developed a linear-time algorithm that scans a large population for IBD segments without the quadratic exhaustive search of all pairs of individuals. This enables genomewide analysis of thousands of samples and paves the way to multiple avenues of research. Specifically, we have been using IBD between unrelated individuals to inform phasing, deletion detection, population structure, and fine-mapping of associated variants.
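
One common way to avoid comparing all pairs is to hash short haplotype windows and only examine individuals that land in the same bucket. The sketch below illustrates that general idea under stated assumptions and is not a description of the algorithm mentioned above: haplotypes are assumed to be phased strings of '0'/'1' alleles of equal length, matching is exact, and the function names, window size and minimum run length are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_windows(haplotypes, window):
    """Map each pair of haplotypes to the window indices where they match exactly."""
    n_sites = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_sites - window + 1, window)):
        # One pass over the samples: identical windows fall into the same bucket.
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        # Pairs are enumerated only inside a bucket, never across all samples.
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                matches[pair].append(w)
    return matches


def merge_runs(window_ids, min_windows):
    """Join consecutive window indices into (first, last) runs of sufficient length."""
    runs = []
    start = prev = None
    for w in window_ids:
        if prev is not None and w == prev + 1:
            prev = w
            continue
        if start is not None and prev - start + 1 >= min_windows:
            runs.append((start, prev))
        start = prev = w
    if start is not None and prev - start + 1 >= min_windows:
        runs.append((start, prev))
    return runs


def ibd_candidates(haplotypes, window, min_windows):
    """Candidate shared segments per pair, expressed in window coordinates."""
    found = {}
    for pair, ids in shared_windows(haplotypes, window).items():
        runs = merge_runs(ids, min_windows)
        if runs:
            found[pair] = runs
    return found


# Toy data: A and B share the first two windows; A and C share only the last.
haps = {
    "A": "010111000011",
    "B": "010111001111",
    "C": "111100000011",
}
candidates = ibd_candidates(haps, window=4, min_windows=2)
```

The cost of this sketch grows with the number of samples plus the number of matching pairs, so it stays close to linear only when matches are sparse. A practical method would also have to tolerate genotyping and phasing errors, which exact matching does not.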


High Throughput Sequencing for Comprehensively Cataloging Variants

While most current genetic data are based on SNP markers, DNA sequencing has been increasing in throughput and cost-effectiveness at a pace that exceeds Moore's law. We have been developing and applying computational methods to tackle this torrent of sequence data. Specifically, we have refined models of genomic coverage in worm resequencing data, observing that coverage fits a Gamma distribution rather than a Poisson model. We further developed a novel method for sequencing DNA from pools of individuals. The method designs overlapping pools, so that an individual carrier of a discovered variant can be traced through the intersection of such pools. Error-correcting codes make such pool designs robust to experimental and statistical error.
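
The overlapping-pool idea can be illustrated with a toy design. This is a sketch under strong assumptions, not the published method: each individual receives a distinct binary codeword, pool j contains everyone whose codeword has a 1 at position j, and a variant is assumed to have a single carrier. A simple repetition code stands in for a real error-correcting code, and all names are hypothetical.

```python
def codeword(index, bits, repeats=3):
    """Binary representation of index (lowest bit first), repeated for redundancy."""
    base = [(index >> b) & 1 for b in range(bits)]
    return base * repeats


def design_pools(individuals, bits, repeats=3):
    """Assign codewords and list the members of each pool."""
    if len(individuals) > 2 ** bits - 1:
        raise ValueError("not enough bits for this many individuals")
    # Start at 1 so that nobody gets the all-zero codeword (member of no pool).
    codes = {name: codeword(i, bits, repeats)
             for i, name in enumerate(individuals, start=1)}
    pools = [[name for name in individuals if codes[name][j]]
             for j in range(bits * repeats)]
    return codes, pools


def decode_carrier(observed, codes):
    """Return the individual whose codeword is closest to the observed pool pattern."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codes, key=lambda name: distance(codes[name], observed))


individuals = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
codes, pools = design_pools(individuals, bits=3)

# The variant is seen in the pools of s5, except that one pool result is wrong.
observed = list(codes["s5"])
observed[0] = 1 - observed[0]
carrier = decode_carrier(observed, codes)
```

Because any two codewords here differ in at least three positions, a single erroneous pool result still decodes to the right individual. Seven individuals are covered by nine pools in this toy case; more efficient codes would need fewer pools for the same robustness.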