Background: Sequencing Throughput At The Gigabase Scale
While most current genetic data is based on SNP markers,
DNA sequencing has been increasing in throughput and cost-effectiveness
at a rate that outpaces Moore's law.
Currently, multiple technologies attain raw throughput of gigabases per day and per $1000
of reagents.
Building the computational infrastructure to handle this flood of genomic data and to make sense of it
is a major current challenge for genetics.
Sequencing Pooled DNA From Multiple Individuals While Detecting Mutation-Carrier Identity
The throughput of the current generation of sequencing platforms
facilitates comprehensive variant discovery along a small fraction of the genome.
Ideally, one would sequence such a region across hundreds or thousands of individuals,
each being covered sufficiently to detect single-copy mutations.
Fitting this design to current sequencing throughput
means performing multiple runs, each on pooled DNA from dozens or hundreds of samples.
Unfortunately, under this protocol the identity of the individual carriers of detected mutations
is lost unless barcoding or other experimental enhancements are used.
We developed a novel method for sequencing DNA from pools of individuals.
The method designs overlapping pools, so that an individual carrier of a discovered variant
can be traced through the intersection of such pools.
Error-correcting codes make such pool designs robust to experimental sequencing error
and, more importantly, to statistical undersampling.
Such under-representation of individual copies of mutation-carrying haplotypes
is likely given the observed distributions of genomic coverage.
We use block codes based on Steiner systems to design pooling schemes that are robust to undersampling.
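As a rough illustration of how overlapping pools can identify a carrier, the sketch below uses the smallest nontrivial Steiner system, the Fano plane S(2, 3, 7). It is not the design used in the project: the choice of the Fano plane, the function names, the single-carrier assumption, and the absence of false-positive pools are all assumptions made for the example. Each of 7 individuals is placed in 3 of 7 pools, and because every pair of pools is shared by exactly one individual, the carrier can still be identified when one of its pools fails to show the variant.

```python
from itertools import combinations

# Fano plane, the Steiner system S(2, 3, 7): every pair of the 7 points
# lies in exactly one of the 7 blocks. In this sketch (an assumption, not
# the project's actual design) points are pools and block i lists the
# pools that receive DNA from individual i.
FANO_BLOCKS = [
    (0, 1, 2), (0, 3, 4), (0, 5, 6),
    (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5),
]


def is_steiner_pair_design(blocks, n_points):
    """Check that every pair of points occurs in exactly one block."""
    counts = {pair: 0 for pair in combinations(range(n_points), 2)}
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return all(count == 1 for count in counts.values())


def decode_carrier(positive_pools, blocks, max_dropouts=1):
    """Return individuals consistent with the pools showing the variant.

    Assumes a single carrier and no false-positive pools. Up to
    `max_dropouts` of the carrier's pools may have missed the variant,
    for example because of undersampling.
    """
    positive = set(positive_pools)
    candidates = []
    for individual, block in enumerate(blocks):
        block_set = set(block)
        missed = block_set - positive
        if positive <= block_set and len(missed) <= max_dropouts:
            candidates.append(individual)
    return candidates


if __name__ == "__main__":
    print(is_steiner_pair_design(FANO_BLOCKS, 7))
    # Individual 3 sits in pools 1, 3 and 5; suppose pool 3 was undersampled.
    print(decode_carrier({1, 5}, FANO_BLOCKS))
```

With all three pools positive, or with any one of them missing, the decoder returns a single candidate. With two pools missing it returns no candidate, which signals that the evidence is insufficient rather than guessing.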
Contact person: Snehit
Sequencing Of C. elegans Mutants Reveals Gamma-Distributed Genomic Coverage
Coverage, the number of sequence reads observing a particular site,
is a key figure for attaining high-confidence calls of the base at that site
and, in particular, for detecting sequence variants.
Traditional sequencing methods have modeled coverage as
uniform across sites, and therefore
Poisson-distributed in any sequencing experiment.
However, current technologies, which are based on massively parallel sequencing of short reads,
do not obey this assumption of uniformity.
Our lab enjoys a fruitful collaboration with the C. elegans lab of
Oliver Hobert, which resulted in
the first mutant animal being fully resequenced.
This allowed us to observe that coverage on the Solexa GA-2 platform
is indeed poorly modeled by a Poisson distribution.
However, a single additional degree of freedom, the shape parameter of the more general Gamma distribution,
yields a model that is accurate across the vast majority of the reads.
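To make the contrast between the two models concrete, the sketch below fits a Gamma distribution to per-site coverage values by the method of moments and reports the variance-to-mean ratio, which is 1 for Poisson-distributed counts. The fitting procedure, the function names, and the sample depths are assumptions chosen for illustration; they are not the lab's actual analysis or data.

```python
from statistics import fmean, pvariance


def dispersion_index(coverage):
    """Variance-to-mean ratio; equals 1 for Poisson-distributed counts."""
    mean = fmean(coverage)
    return pvariance(coverage, mu=mean) / mean


def fit_gamma_moments(coverage):
    """Method-of-moments Gamma fit, returning (shape, scale).

    A Gamma distribution has mean = shape * scale and
    variance = shape * scale**2, so
    shape = mean**2 / variance and scale = variance / mean.
    """
    mean = fmean(coverage)
    variance = pvariance(coverage, mu=mean)
    if mean <= 0 or variance <= 0:
        raise ValueError("coverage must have positive mean and variance")
    return mean * mean / variance, variance / mean


if __name__ == "__main__":
    # Hypothetical per-site read depths, made up for this example.
    depths = [3, 12, 7, 25, 1, 9, 14, 40, 6, 11]
    shape, scale = fit_gamma_moments(depths)
    print(f"dispersion index: {dispersion_index(depths):.2f}")
    print(f"gamma shape: {shape:.2f}, scale: {scale:.2f}")
```

A dispersion index well above 1 indicates more site-to-site variation than a Poisson model allows, which is the situation where the extra shape parameter helps.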
Contact persons: Yufeng, Snehit