Background: Sequencing Throughput At The Gigabase Scale
While most current genetic data are based on SNP markers, DNA sequencing has been improving in throughput and cost-effectiveness at a rate that outpaces Moore's law. Multiple technologies can now attain raw throughput of gigabases per day and per $1000 of reagents. Building the computational infrastructure to handle this flood of genomic data and to make sense of it is a major current challenge for genetics.
Sequencing Pooled DNA From Multiple Individuals While Detecting Mutation-Carrier Identity
The throughput of the current generation of sequencing platforms facilitates comprehensive variant discovery along a small fraction of the genome. Ideally, one would sequence such a region across hundreds or thousands of individuals, each covered sufficiently to detect single-copy mutations. Fitting this design to the throughput of existing platforms means performing multiple sequencing runs, each on pooled DNA from dozens or hundreds of samples. Unfortunately, with such pooling the identity of the individual carriers of detected mutations is lost unless barcoding or other experimental enhancements are used.
We developed a novel method for sequencing DNA from pools of individuals. The method designs overlapping pools so that an individual carrier of a discovered variant can be traced through the intersection of the pools in which the variant appears. Error-correcting codes make such pool designs robust to experimental sequencing error and, more importantly, to statistical undersampling. Such under-representation of individual copies of mutation-carrying haplotypes is likely given the observed distributions of genomic coverage. We use block codes based on Steiner systems to design pooling schemes that are robust to undersampling.
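The idea of tracing a carrier through overlapping pools can be illustrated with a minimal sketch. It assumes the smallest Steiner triple system, S(2,3,7) (the Fano plane), with 7 individuals and 7 pools; the actual systems and pool sizes used in the method are not stated here. The names FANO_BLOCKS and decode_carrier are hypothetical. The sketch also assumes a single carrier and no false-positive pools. Because any two pools share exactly one individual in this design, the carrier is still identified when one of its three pools fails to show the variant through undersampling.

```python
# Illustrative pooling design from the Steiner triple system S(2,3,7).
# Assumption: individual i is placed in the three pools listed in FANO_BLOCKS[i].
# Every pair of pools co-occurs in exactly one block, so any two positive
# pools are enough to pin down a single carrier.
FANO_BLOCKS = [
    (0, 1, 2), (0, 3, 4), (0, 5, 6),
    (1, 3, 5), (1, 4, 6),
    (2, 3, 6), (2, 4, 5),
]


def decode_carrier(positive_pools, blocks=FANO_BLOCKS, max_dropouts=1):
    """Return the individuals consistent with the observed positive pools.

    An individual is consistent if every positive pool is one of its pools
    and at most `max_dropouts` of its pools failed to show the variant
    (for example, because the carrier haplotype was undersampled).
    """
    positive = set(positive_pools)
    candidates = []
    for individual, block in enumerate(blocks):
        pools = set(block)
        missed = pools - positive
        if positive <= pools and len(missed) <= max_dropouts:
            candidates.append(individual)
    return candidates


if __name__ == "__main__":
    # Variant seen in all three pools of individual 4.
    print(decode_carrier({1, 4, 6}))
    # Same carrier, but pool 4 missed the variant.
    print(decode_carrier({1, 6}))
```

With all three pools positive, or with any one of them missing, the decoder returns the same single individual. A single positive pool is ambiguous in this small design and returns no candidate under the default tolerance.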
Sequencing Of C. elegans Mutants Reveals Gamma-Distributed Genomic Coverage
Coverage, the number of sequence reads observing a particular site, is a key figure for attaining high-confidence calls of the base at that site and, in particular, for detecting sequence variants. Traditional sequencing methods have been modeled with coverage that is uniform across sites, and therefore Poisson-distributed in any sequencing experiment. However, current technologies, which are based on massively parallel sequencing of short reads, do not obey the assumption of uniformity. Our lab enjoys a fruitful collaboration with the C. elegans lab of Oliver Hobert, which resulted in the first mutant animal being fully resequenced. This allowed us to observe that coverage on the Solexa GA-2 platform is indeed poorly modeled by a Poisson distribution. However, a single additional degree of freedom, the shape parameter of the more general Gamma distribution, yields a model that is accurate across the vast majority of the reads.
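The contrast between the two models can be made concrete with a small sketch. It is an illustration only: it uses a method-of-moments fit, which is an assumption (the fitting procedure actually used is not described here), and the function names and simulated values are hypothetical, not measured data. For a Poisson model the variance of per-site coverage equals its mean, so a variance-to-mean ratio well above 1 signals the overdispersion that a Gamma shape parameter can absorb.

```python
import random
from statistics import fmean, pvariance


def dispersion_index(coverage):
    """Variance-to-mean ratio of per-site coverage (about 1 under a Poisson model)."""
    mean = fmean(coverage)
    return pvariance(coverage, mu=mean) / mean


def fit_gamma_moments(coverage):
    """Method-of-moments Gamma fit: shape = mean^2 / var, scale = var / mean."""
    mean = fmean(coverage)
    var = pvariance(coverage, mu=mean)
    return mean * mean / var, var / mean


if __name__ == "__main__":
    # Simulated coverage values (hypothetical parameters, not real data).
    rng = random.Random(0)
    simulated = [rng.gammavariate(4.0, 5.0) for _ in range(20000)]
    shape, scale = fit_gamma_moments(simulated)
    print(f"dispersion index: {dispersion_index(simulated):.2f}")
    print(f"fitted shape: {shape:.2f}, fitted scale: {scale:.2f}")
```

The fitted shape and scale should land close to the simulation parameters, and the dispersion index close to the scale, well above the value of 1 expected under a Poisson model.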
Contact persons: Yufeng, Snehit