Despite our obsessive interest in humans, they make a poor model organism. Their genetics, for example, is complicated by generations of populations sorting apart and merging back together. The resulting violations of standard statistical assumptions, such as random mating and idealized samples, are a major problem in disease association studies. Fortunately, the genome-wide arrays that profile an individual's genetic makeup for disease studies also carry clues about the origin of that individual's ancestors.

Our lab has recently completed development of Xplorigin, a software tool that deciphers the population ancestry of different regions along an individual's genome. The tool is based on a Generalized Hidden Markov Model trained on data from the International HapMap Project.

Analysis of population ancestry relies on differences in the frequency of variants between populations. The first methods to perform such analysis relied on ancestry-informative markers, which were selected because they show large frequency differences between populations. However, the most abundant source of genetic data today is whole-genome arrays. The markers on these arrays are chosen for technological and linkage disequilibrium (LD) considerations rather than for informativeness with respect to ancestry. As a result, any particular marker shows only an incidental, typically slight, frequency difference between populations.
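To make the contrast concrete, the following is a minimal sketch, not part of Xplorigin, that compares the expected evidence carried by one allele at a hand-picked ancestry-informative marker with that of a typical array marker. The function name `expected_llr` and all allele frequencies are hypothetical values chosen for illustration. The measure used is the expected log-likelihood ratio between two populations, which is the Kullback-Leibler divergence between the two allele distributions.

```python
import math


def expected_llr(freq_pop_a, freq_pop_b):
    """Expected log-likelihood ratio (population A vs. B) per observed allele,
    assuming the allele truly comes from population A.

    This is the Kullback-Leibler divergence between two Bernoulli
    distributions with the given allele frequencies (in nats).
    """
    return (
        freq_pop_a * math.log(freq_pop_a / freq_pop_b)
        + (1.0 - freq_pop_a) * math.log((1.0 - freq_pop_a) / (1.0 - freq_pop_b))
    )


# Hypothetical frequencies, for illustration only.
aim_info = expected_llr(0.90, 0.10)    # ancestry-informative marker
array_info = expected_llr(0.55, 0.45)  # typical array marker

# Rough number of independent array markers needed to match one such marker.
markers_needed = aim_info / array_info
```

Under these assumed frequencies a single array marker carries only a small fraction of the evidence of the ancestry-informative marker, which is why information has to be pooled across many markers.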

Xplorigin takes data from whole-genome arrays as input and pools information across many consecutive markers to decide ancestry at each locus. The dependence of nearby markers on one another is a major obstacle to combining this information in a mathematically sound way. We therefore use haplotypes within haplotype blocks as our atomic variant. This not only captures the intermarker correlation structure but also increases information content: the differences in haplotype frequencies across populations are typically greater (in terms of power to distinguish origin) than the differences in SNP allele frequencies.
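The idea of pooling evidence across consecutive haplotype blocks can be illustrated with a toy example. The sketch below is a plain hidden Markov model decoded with the Viterbi algorithm, a simplification of the Generalized Hidden Markov Model mentioned above and not Xplorigin's actual implementation. The function `viterbi_ancestry`, the parameters `switch_prob` and `floor`, the population labels, and all haplotype frequencies are assumptions made for illustration.

```python
import math


def viterbi_ancestry(block_haplotypes, hap_freqs, switch_prob=0.01, floor=1e-4):
    """Most likely ancestry label for each haplotype block (toy HMM).

    block_haplotypes: observed haplotype per block, e.g. ["00", "11"].
    hap_freqs: {population: [ {haplotype: frequency}, ... one dict per block ]}.
    switch_prob: assumed probability that ancestry changes between blocks.
    floor: minimum frequency, so unseen haplotypes do not give log(0).
    """
    if not block_haplotypes:
        return []
    populations = sorted(hap_freqs)
    n = len(populations)
    stay = math.log(1.0 - switch_prob)
    switch = math.log(switch_prob / (n - 1)) if n > 1 else float("-inf")

    def emit(pop, i):
        freq = hap_freqs[pop][i].get(block_haplotypes[i], 0.0)
        return math.log(max(freq, floor))

    def step(prev, cur):
        return stay if prev == cur else switch

    # Uniform prior over populations at the first block.
    score = {p: math.log(1.0 / n) + emit(p, 0) for p in populations}
    back = []
    for i in range(1, len(block_haplotypes)):
        new_score, pointers = {}, {}
        for p in populations:
            best_prev = max(populations, key=lambda q: score[q] + step(q, p))
            new_score[p] = score[best_prev] + step(best_prev, p) + emit(p, i)
            pointers[p] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(populations, key=lambda p: score[p])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical haplotype frequencies for two populations over four blocks.
freqs = {
    "A": [{"00": 0.8, "11": 0.2}] * 4,
    "B": [{"00": 0.2, "11": 0.8}] * 4,
}
path = viterbi_ancestry(["00", "00", "11", "11"], freqs, switch_prob=0.1)
```

With these made-up numbers the decoded path assigns the first two blocks to population A and the last two to population B. The `switch_prob` parameter controls how much evidence is required before the model accepts a change of ancestry: with a much smaller value, the same four blocks would be assigned to a single population.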