Background: Long Haplotypes Identical By Descent (IBD) as Evidence for Relatedness
A pair of descendants of the same ancestor may share haplotypes that have been transmitted along the respective lineages leading to them. At any particular locus, the chance of this happening decreases exponentially with the number of transmissions required. However, if sharing does occur at a specific site, then neighboring sites are likely to be shared as well: sharing ceases along the genome only upon reaching a site of ancestral recombination. K-th cousins, for instance, are expected to share less than one region genomewide for K>5, but when such a region is shared it is expected to exceed 2.5 cM as long as, say, K<20. With current SNP density on commercial genotyping arrays, we therefore expect several hundred SNPs to be spanned by such a region, far more than the number of consecutive SNPs expected to be IBS (Identical By State) by chance alone. Sharing of a segment IBD is therefore rare but unequivocal evidence of hidden, remote relatedness.
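The arithmetic behind these figures can be sketched as follows. This is a rough back-of-the-envelope model, not a calculation taken from the text: it assumes an autosomal genetic map of about 3500 cM, 22 autosomes, cousins descended from a single ancestral couple, and the standard approximation that a segment surviving d meioses has a mean length of 100/d cM. The function name and its parameters are hypothetical.

```python
def expected_ibd_sharing(k, genome_cm=3500.0, n_chromosomes=22):
    """Rough expectations for IBD sharing between K-th cousins.

    Assumptions (illustrative, not from the text): a 3500 cM autosomal
    map, 22 autosomes, and a shared ancestral couple (4 haplotypes).
    Returns (per-locus IBD probability, mean segment length in cM,
    expected number of shared segments genomewide).
    """
    meioses = 2 * (k + 1)            # transmissions separating the two cousins
    # Any of the couple's four haplotypes may be co-inherited at a locus.
    p_locus = 4 * 0.5 ** meioses
    # Recombination breaks a segment at a rate of `meioses` per Morgan.
    mean_len_cm = 100.0 / meioses
    # Number of candidate segments along the genome times the sharing chance.
    n_candidates = n_chromosomes + genome_cm * meioses / 100.0
    n_segments = p_locus * n_candidates
    return p_locus, mean_len_cm, n_segments


for k in (1, 5, 10, 19):
    p, length, count = expected_ibd_sharing(k)
    print(f"K={k:2d}  P(IBD at locus)={p:.2e}  mean length={length:.2f} cM  "
          f"expected segments={count:.2e}")
```

Under these assumptions, 19th cousins are separated by 40 meioses, which gives a mean segment length of 2.5 cM, while the expected number of shared segments is already below one for fifth cousins and shrinks by roughly a factor of four with each further degree.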
Hypothesis: Hidden Relatedness Becomes Frequent in Large Cohorts
A pair of purportedly unrelated individuals descended from an ancestral population will have some chance of their ancestral lineages randomly reaching the same ancestor. This probability is inversely proportional to the effective size of the ancestral population. While this event is rare, its frequency is sufficient for hidden relatedness to be observed in the Human Haplotype Map cohorts of 60-90 unrelated individuals per population. When sample size is scaled up by two or three orders of magnitude, the chance that any given pair shares a segment may remain the same, but the number of observed pairs increases quadratically, up to a millionfold. We therefore hypothesized that IBD would be ubiquitous in large cohorts. Unfortunately, scanning for hidden relatedness genomewide across all pairs of individuals requires computational resources quadratic in the cohort size, and quickly becomes prohibitive.
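The quadratic growth in the number of pairs is easy to check numerically. The sketch below compares a HapMap-sized cohort of 60 with a cohort 1000 times larger; the figure of 60,000 is an illustrative choice, not a cohort described in the text.

```python
from math import comb


def pair_count(n_samples):
    """Number of distinct pairs among n_samples individuals: n(n-1)/2."""
    return comb(n_samples, 2)


small = pair_count(60)        # HapMap-sized cohort
large = pair_count(60_000)    # hypothetical cohort, 1000 times larger

print(small)                  # 1770
print(large)                  # 1799970000
print(round(large / small))   # about a million times more pairs
```

The same count governs the cost of a naive all-pairs scan, which is why a pairwise comparison that is feasible for tens of samples becomes impractical for tens of thousands.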
GERMLINE: Genetic Error-tolerant Regional Matching with Linear-time Extension
We developed a linear-time method to create a dictionary of haplotype "words". Building such a hash table for each region along the genome allows filtering IBD matches rapidly, and extending only promising candidates to report long shared haplotypes in a cohort. The GERMLINE program is available for download. The algorithmic improvement facilitates rapid processing of thousands of samples genomewide. Furthermore, the combinatorial nature of GERMLINE facilitates greater accuracy than statistical methods. See our publication for further details.
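The hash-and-extend idea can be illustrated with a toy sketch. This is not the GERMLINE implementation: it assumes phased haplotypes given as equal-length strings of alleles, uses fixed non-overlapping windows as "words", requires exact matches, and extends a match only across consecutive matching windows, with none of the error tolerance the real program provides. The function name and the parameters `word_len` and `min_words` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_shared_segments(haplotypes, word_len=4, min_words=2):
    """Toy hash-and-extend search for long exact haplotype matches.

    haplotypes: dict mapping a haplotype name to a string of alleles,
                all of the same length (an assumed input format).
    Returns a dict mapping (name_a, name_b) to a list of half-open
    (start_marker, end_marker) intervals shared by that pair.
    """
    n_markers = len(next(iter(haplotypes.values())))
    n_words = n_markers // word_len

    # For each window, hash every haplotype's word; only haplotypes that
    # land in the same bucket are ever compared as a pair.
    seeds = defaultdict(list)   # pair -> window indices with identical words
    for w in range(n_words):
        lo, hi = w * word_len, (w + 1) * word_len
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[lo:hi]].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                seeds[(a, b)].append(w)

    # Extend seeds: merge runs of consecutive matching windows and keep
    # only runs spanning at least min_words windows.
    segments = {}
    for pair, windows in seeds.items():
        runs = []
        start = prev = windows[0]
        for w in windows[1:]:
            if w != prev + 1:
                runs.append((start, prev))
                start = w
            prev = w
        runs.append((start, prev))
        kept = [(s * word_len, (e + 1) * word_len)
                for s, e in runs if e - s + 1 >= min_words]
        if kept:
            segments[pair] = kept
    return segments


haps = {
    "a": "001101011111",
    "b": "001101010000",
    "c": "110010100101",
}
print(find_shared_segments(haps))   # {('a', 'b'): [(0, 8)]}
```

The point of the dictionary is that pairs sharing no word in a region are never examined, so the work done scales with the number of candidate matches instead of the number of all possible pairs.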
Accurate IBD Analysis Improves Phasing, Detects Deletion Variants
Applying GERMLINE to multiple datasets, we detected IBD segments at high resolution, which revealed short gaps along IBD segments that are otherwise as long as expected. These gaps can be classified into two types:
- Phasing errors in segments that are shared by remote relatives falsely appear as frequent crossovers of the identical haplotype, and are observed as IBD gaps. Correcting each pair of such adjacent switches improves the initial phasing of the data.
- Deletions occurring on the background of one or both copies of the shared segment harbor hemizygous markers that are genotyped as homozygous along the deleted region, and are observed as IBD gaps.
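Locating such gaps is a simple scan over one pair's reported segments. The sketch below is only an illustration of that first step, under assumed conventions: segments are (start, end) positions in cM on a single chromosome, and the 1 cM gap threshold is an arbitrary placeholder. It does not attempt to tell phasing errors from deletions, which requires inspecting the genotypes inside each gap.

```python
def short_gaps(segments, max_gap_cm=1.0):
    """Return short gaps between consecutive IBD segments of one pair.

    segments: list of (start_cm, end_cm) tuples on one chromosome
              (an assumed input format).
    max_gap_cm: hypothetical threshold for calling a gap "short".
    """
    ordered = sorted(segments)
    gaps = []
    for (_, end_prev), (start_next, _) in zip(ordered, ordered[1:]):
        gap = start_next - end_prev
        if 0 < gap <= max_gap_cm:
            gaps.append((end_prev, start_next))
    return gaps


print(short_gaps([(10.0, 14.0), (14.5, 20.0), (40.0, 45.0)]))
# [(14.0, 14.5)]
```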
Accurate IBD Analysis Refines Association Signals in an Isolated Population
Analyzing association signals for multiple phenotypes on the Island of Kosrae, Micronesia, we occasionally observe multiple association signals in the same region. Traditional statistical analysis of signal independence may support such signals, but fails to conclusively indicate potential overlap in the span of the associated haplotypes. Analysis of IBD segments maps variants to long-range haplotypes, and enables identifying the independently associated haplotypes with high fidelity. Specifically, we were able to detect overlapping haplotypes associated with levels of plasma plant sterols at the ABCG8 gene.