Background: Long Haplotypes, Identical By Descent (IBD) as Evidence for Relatedness
A pair of descendants of the same ancestor may share
haplotypes that have been transmitted along the respective lineages leading to them.
At each particular locus, the chance of this happening decreases exponentially
with the number of transmissions required.
However, if this event does occur at a specific site, then neighboring sites are likely to be shared as well:
Sharing will cease along the genome only upon reaching a site of ancestral recombination.
K-th cousins, for instance, are expected to share fewer than one region genomewide for K>5,
but when a region is shared and, say, K<20, that region is expected to exceed 2.5 cM.
With current SNP density on commercial genotyping arrays,
we therefore expect such a region to span several hundred SNPs,
far more than the number of consecutive SNPs expected to be IBS (Identical By State) by chance alone.
Sharing of a segment IBD is therefore rare but unequivocal evidence of hidden, remote relatedness.
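The figures quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below is not taken from the publication; it assumes full K-th cousins who share one ancestral couple, 2(K+1) transmissions (meioses) separating the pair, segment lengths that are exponentially distributed with a mean of 100/(number of meioses) cM, and a genome length of roughly 3500 cM. The function names and the genome length are illustrative assumptions.

```python
def meioses(k):
    """Transmissions separating two K-th cousins via a shared ancestor."""
    return 2 * (k + 1)


def locus_ibd_probability(k):
    """Chance that full K-th cousins share a given locus IBD.

    Assumption: an ancestral couple contributes 4 haplotypes, and each
    transmission passes a given haplotype on with probability 1/2.
    """
    return 4 * 0.5 ** meioses(k)


def mean_segment_cm(k):
    """Expected length (cM) of a shared segment, given that one exists.

    Assumption: recombination ends the segment at a rate of one event
    per Morgan per meiosis, so the mean is 100 / meioses(k) cM.
    """
    return 100.0 / meioses(k)


def expected_segments(k, genome_cm=3500.0):
    """Rough expected number of shared segments genomewide.

    Assumption: genome_cm is an approximate human genetic map length.
    """
    expected_shared_cm = locus_ibd_probability(k) * genome_cm
    return expected_shared_cm / mean_segment_cm(k)


if __name__ == "__main__":
    for k in (1, 5, 10, 19):
        print(k, locus_ibd_probability(k), mean_segment_cm(k), expected_segments(k))
```

Under these assumptions the per-locus probability halves with every additional transmission, while the mean segment length shrinks only in proportion to the number of transmissions, which is why remote sharing is rare but long when it does occur.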
Hypothesis: Hidden Relatedness Becomes Frequent in Large Cohorts
A pair of purportedly unrelated individuals descended from an ancestral population will have
some chance of their ancestral lineages randomly reaching the same ancestor.
This probability is inversely proportional to the effective size of the ancestral population.
While this event is rare,
its frequency is sufficient for hidden relatedness to be observed in the
Human Haplotype Map cohorts of 60-90 unrelated individuals per population.
When sample size is scaled up by two or three orders of magnitude, the chance that any given pair shares a segment remains the same,
but the number of pairs observed increases quadratically, by up to a millionfold.
We therefore hypothesized that IBD would be ubiquitous in large cohorts.
Unfortunately, scanning for hidden relatedness genomewide across all pairs of individuals
requires computational resources quadratic in the cohort size, and quickly becomes prohibitive.
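The quadratic growth described above is simple to make concrete. The sketch below counts pairs in a cohort and scales a per-pair sharing probability by that count; the probability value in the demo is a made-up placeholder, not an estimate from any dataset, and the function names are illustrative.

```python
def pair_count(n):
    """Number of distinct pairs of individuals in a cohort of size n."""
    return n * (n - 1) // 2


def expected_sharing_pairs(n, p_share):
    """Expected number of pairs sharing a segment IBD.

    Assumption: every pair shares with the same probability p_share,
    which is treated here as a hypothetical input.
    """
    return pair_count(n) * p_share


if __name__ == "__main__":
    p = 1e-3  # hypothetical per-pair probability, for illustration only
    for n in (90, 9_000, 90_000):
        print(n, pair_count(n), expected_sharing_pairs(n, p))
```

A cohort of 90 yields 4005 pairs, and a cohort a thousand times larger yields roughly a million times as many, which is also the number of comparisons an exhaustive pairwise scan would have to perform.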
GERMLINE: Genetic Error-tolerant Regional Matching with LINear-time Extension
We developed a linear time method to create a dictionary of haplotype "words".
This hash table for each region along the genome allows filtering IBD matches rapidly,
and extending only promising candidates to report long shared haplotypes in a cohort.
The GERMLINE program is available for download.
The algorithmic improvement facilitates rapid processing of thousands of samples genomewide.
Furthermore, the combinatorial nature of GERMLINE facilitates greater accuracy than statistical methods.
See our publication for further details.
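The seed-and-extend idea described above can be illustrated with a toy sketch. This is not the GERMLINE implementation: it assumes one phased haplotype per sample given as a string of alleles, fixed windows of word_len markers, a per-window mismatch allowance, and a minimum length in markers rather than in cM. All names and parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def find_shared_segments(haplotypes, word_len, max_mismatch=0, min_len=0):
    """Return {(sample_a, sample_b): [(start, end), ...]} in marker indices.

    haplotypes: dict of sample name -> allele string, all of equal length.
    """
    n_markers = len(next(iter(haplotypes.values())))
    n_windows = n_markers // word_len  # a trailing partial window is ignored

    def word(name, w):
        return haplotypes[name][w * word_len:(w + 1) * word_len]

    # Seeding: one hash table per window; identical words share a bucket,
    # so only samples in the same bucket are ever paired up.
    seeds = defaultdict(set)  # pair -> windows with an exact word match
    for w in range(n_windows):
        buckets = defaultdict(list)
        for name in haplotypes:
            buckets[word(name, w)].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                seeds[pair].add(w)

    # Extension: grow each seed while neighbouring windows nearly match.
    result = {}
    for (a, b), windows in seeds.items():
        def close(w):
            return mismatches(word(a, w), word(b, w)) <= max_mismatch

        segments = set()
        for w in windows:
            lo = hi = w
            while lo > 0 and close(lo - 1):
                lo -= 1
            while hi < n_windows - 1 and close(hi + 1):
                hi += 1
            if (hi + 1 - lo) * word_len >= min_len:
                segments.add((lo * word_len, (hi + 1) * word_len))
        if segments:
            result[(a, b)] = sorted(segments)
    return result
```

Building the per-window tables takes time linear in the number of samples. Pairs are enumerated only within a bucket, so the pairwise work stays small as long as words are long enough that identical words are uncommon among unrelated samples.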
Contact person: Sasha
Accurate IBD Analysis Improves Phasing, Detects Deletion Variants
Applying GERMLINE to multiple datasets, we detected IBD segments
at high resolution, which revealed short gaps along IBD segments
that are otherwise as long as expected. These gaps can be classified into two types:
- Phasing errors in segments that are shared by remote relatives falsely appear as
frequent crossovers of the identical haplotype, and are observed as IBD gaps.
Correcting each pair of such adjacent switches
improves the initial phasing of the data.
- Deletions occurring on the background of one or both copies of the shared segment
harbor hemizygous markers that are genotyped as completely homozygous along the deleted region,
and are observed as IBD gaps.
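The two gap types suggest a simple first-pass triage, sketched below. This heuristic is an illustration rather than the procedure used in the analysis: it assumes an individual's two phased haplotypes are given as allele strings, that a gap is a half-open marker interval [start, end), and that a fully homozygous gap marks a candidate deletion while any other gap marks a candidate pair of phasing switches. Function names are hypothetical.

```python
def classify_gap(hap1, hap2, start, end):
    """Label an IBD gap in one individual as a deletion or phasing candidate.

    Assumption: a run of markers that are all homozygous across the gap is
    treated as possible hemizygosity; short runs can also arise by chance.
    """
    if start >= end:
        raise ValueError("gap must contain at least one marker")
    if all(a == b for a, b in zip(hap1[start:end], hap2[start:end])):
        return "candidate deletion"
    return "candidate phasing error"


def swap_phase(hap1, hap2, start, end):
    """Undo a pair of adjacent switch errors by exchanging the alleles
    of the two haplotypes between the switch points."""
    fixed1 = hap1[:start] + hap2[start:end] + hap1[end:]
    fixed2 = hap2[:start] + hap1[start:end] + hap2[end:]
    return fixed1, fixed2
```

For example, swap_phase("0000", "1111", 1, 3) returns ("0110", "1001"). In practice, a candidate would need support from several sharing pairs before being treated as a deletion or used to correct the phasing.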
Contact person: Sasha
Accurate IBD Analysis Refines Association Signals in an Isolated Population
Analyzing association signals for multiple phenotypes on the island of Kosrae, Micronesia,
we occasionally observe multiple association signals in the same region.
Traditional statistical analysis of signal independence may support such signals,
but fails to indicate conclusively whether the spans of the associated haplotypes overlap.
Analysis of IBD segments maps variants to long range haplotypes, and enables identifying the
independently associated haplotypes with high fidelity.
Specifically, we were able to detect overlapping haplotypes
associated with levels of plasma plant sterols at the ABCG8 gene.
Contact persons: Eimear, Sasha