The future of DNA sequencing is already in the classroom
Halfway through their Ubiquitous Genomics class, 20 students were handed a MinION device, a mobile DNA sequencer the size of two matchboxes laid end to end. This $1000 device, now fully available after being introduced in an early access program, is expected to play an important role in advancing the goal of real-time, on-site DNA sequencing, vastly increasing the applications for DNA sequencing and, just as far-reaching, expanding the number of people who can do DNA sequencing. For their professor, Dr. Yaniv Erlich, the device has a more immediate purpose: a teaching tool that gives students direct experience with handling and sequencing DNA samples for themselves. Plus he was curious. What happens when you give smart, ambitious students a new device not yet fully explored?
The parasites were a surprise. In sequencing a food sample pre-measured to contain 80% beef and 20% tomato, the students identified the DNA of three parasites (Babesia bigemina, Wuchereria bancrofti, Onchocerca ochengi) and duly noted it as part of their assignment. Identifying parasites in food hadn't been the original intent, but when you give students a brand new tool not yet in general use, it's never clear how they are going to use it or what they will find. That's part of the fun, and the learning, too, and it shows the promise of on-site, immediate DNA sequencing.
But it was not all smooth sailing. While students found the accidental parasites, some also misidentified the beef—purchased from a local New York City grocery store—as bighorn sheep. Not a huge leap (both animals are in the same family), but it does give pause to the idea that real-time DNA sequencing will soon be in use at airports to screen passengers.
Classroom encounters with DNA sequencing
Sequencing DNA from food samples was the first of two hackathons in the class Ubiquitous Genomics, offered for the first time at Columbia and developed by Dr. Yaniv Erlich, an assistant professor of computer science at Columbia who is also a faculty member of the New York Genome Center. The class teaches the basics of DNA sequencing with an eye on future sequencing technologies that promise to make DNA identification possible in real time at almost any location.
Taught in conjunction with Sophie Zaaijer, a postdoc in Erlich's New York Genome Center lab, the class combines aspects of computer science, biology, electrical engineering, algorithms, and data science, particularly the special challenges of acquiring, storing, and analyzing huge amounts of genomic data. (The first reading assignment was Big Data: Astronomical or Genomical? by ZD Stephens and others.)
The class, however, has a major DIY twist. Rather than sending out DNA samples to a lab equipped with $1M sequencing machines, Erlich would have students learn DNA sequencing by actually doing it themselves.
What makes this scenario even imaginable, let alone possible, is a new, portable DNA sequencing device called a MinION. Inexpensive (approximately $1000), portable, and capable of sequencing DNA in almost real time, the MinION will vastly broaden both the applications of DNA sequencing and the range of people who can carry it out.
The MinION uses a sequencing method different from traditional (or sequential) DNA sequencing, which works by first breaking up the DNA into tiny snippets before painstakingly reassembling them, mapping them against a template DNA—a process that can take days and requires a high level of expertise.
Instead, MinION relies on nanopore sequencing, where a single-stranded DNA molecule passes through a small biological pore, or nanopore, embedded in an electrical field. As the DNA molecule transits through the nanopore, the individual nucleotides (A, T, G, C) that construct a string of the DNA disrupt the ion current in characteristic ways, creating a profile (called a squiggle) that can be analyzed by software to "decode" the nucleotide sequence, almost in real time.
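The idea of turning a squiggle into letters can be illustrated with a deliberately simplified toy. The sketch below assumes, purely for illustration, that each nucleotide produces one fixed current level (the values in LEVELS are hypothetical), splits the signal into events wherever the current jumps, and labels each event with the nearest level. In practice the current depends on several neighbouring nucleotides at once, and runs of identical bases produce no jump at all, so real base-calling software is considerably more sophisticated than this.

```python
# Toy illustration only: the current levels below are hypothetical,
# not measured values from a real nanopore.
LEVELS = {"A": 50.0, "C": 60.0, "G": 70.0, "T": 80.0}


def nearest_base(level):
    """Return the base whose assumed current level is closest to `level`."""
    return min(LEVELS, key=lambda base: abs(LEVELS[base] - level))


def decode_squiggle(samples, jump=5.0):
    """Split a current trace into events and call one base per event.

    A new event starts whenever a sample differs from the running mean
    of the current event by more than `jump`.
    """
    if not samples:
        return ""
    events = []
    current = [samples[0]]
    for sample in samples[1:]:
        mean = sum(current) / len(current)
        if abs(sample - mean) > jump:
            events.append(mean)      # close the finished event
            current = [sample]       # start a new one
        else:
            current.append(sample)
    events.append(sum(current) / len(current))
    return "".join(nearest_base(level) for level in events)


if __name__ == "__main__":
    trace = [50.2, 49.8, 50.1, 70.3, 69.9, 60.1, 59.8, 60.0, 80.2, 79.9]
    print(decode_squiggle(trace))  # prints AGCT
```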
Erlich was able to procure for his class five MinIONs because the device's manufacturer, Oxford Nanopore Technologies, is interested in exploring the potential applications of the MinION in education. (The class has generated interest among the community growing up around the MinION and was covered by a GenomeWeb article.)
Two hackathons count for half the grade
Half the grade would be determined by two hackathons, where the 20 graduate and undergraduate students, working in small groups, would be given the five MinIONs along with five PCs running MinION software. The first hackathon, "Snack to Sequence," required student teams to identify ingredients of a food sample prepared by Zaaijer. In the second, "CSI Columbia," students were given human DNA and asked to identify the specific individual who donated it. The first went much more smoothly than the second.
Before each hackathon, DNA samples were first prepared to create a DNA library for feeding into the MinION, a step that was done by Zaaijer. Though generating DNA libraries for MinIONs is much simpler than for other sequencers, it is time-consuming, requires a lab setting (and is therefore not mobile yet), and takes some finesse and experience.
With the libraries prepared, the students take over. Using a pipette, they dispense a solution containing the prepared DNA into the MinION's flow cell (which contains 512 channels containing nanopores). Care must be taken to not introduce air bubbles that render the pores inaccessible. Pipetting is tricky, and generally one person on each team learned how to do it and performed the task each time.
As the solution seeps through the flow cell, individual DNA molecules transit the nanopores, and software on the PC powering the MinION starts detecting the ion current disruptions. This raw data (in HDF5 format) gets uploaded to the cloud, where software analyzes the recorded events to identify the individual bases. Minutes later, students begin seeing preliminary sequencing data on their screens. (All reads, along with any new code written, were posted to the class GitHub site.)
Not all 512 channels contain a nanopore that produces reads, but those that do produce individual files for each sequenced read. It's a lot of data in a very short time, both the promise of the MinION and the beginning of the difficulty for the students.
Right away, students were faced with the question of how to transfer thousands of individual files from the lab-supplied PCs to their own (mostly Mac) computers, where they could carry out their analysis. The sizes of the files precluded using cloud-based products such as Dropbox, whose free accounts don't support synchronizing data at such a large scale. The file-transfer issue, after some grappling, was finally solved by placing the data in a BitTorrent Sync folder that was then synced to students' computers (maxing out the hard drive in at least one case).
With the sequenced data downloaded, the students head out. Their task is now to compare their reads with existing DNA sequences found online to identify the sample DNA. This they do using existing alignment tools, many free, that compare two or more reads and produce a similarity score.
For the snack hackathon, students all used NCBI BLAST, a tool that makes it easy to run stand-alone searches for similar sequences and to discover, for instance, whether a given read aligns more closely with a template read from a tomato or from a zucchini. The concept is simple, but the difficulty level can ratchet up quickly depending on which two sequences are being compared. Discriminating between two species is one thing; differentiating between two humans who share many of the same traits is something else entirely.
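To make the notion of a similarity score concrete, here is a minimal sketch of a local alignment score in the Smith-Waterman style. This is not how BLAST works internally (BLAST uses faster heuristics), and the scoring values and the short "tomato" and "zucchini" sequences are invented for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman dynamic programming, keeping only two rows in memory.
    The match/mismatch/gap values are illustrative choices.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    read = "GATTACA"
    # Hypothetical reference snippets, not real tomato or zucchini DNA.
    references = {"tomato": "CCGATTACAGG", "zucchini": "CCGATAACAGG"}
    for name, ref in references.items():
        print(name, local_alignment_score(read, ref))
```

The read is assigned to whichever reference scores highest; here the perfect match to the made-up tomato snippet outscores the zucchini snippet, which differs by one base.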
Difficulty level increases in second hackathon
Of the two hackathons, CSI Columbia proved to be much more open-ended. Here the aim was to test whether MinION sequencing could be used to identify a single person. Normally short tandem repeats (STRs) are used to identify individuals (the FBI typically uses 13 different STRs for identification purposes), not the long reads returned by nanopore sequencing. As yet, no scientific framework exists on how to identify an individual using the reads generated from the MinION nanopore sequencer. While there are existing alignment tools for comparing two or more human DNA sequences, almost all were developed for traditional sequencing methods.
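A short tandem repeat is a short motif repeated back to back, and individuals differ in how many copies they carry at a given location. As a rough sketch of what an STR profiler measures, the function below counts the longest uninterrupted run of a motif in a sequence. The sequence and motif are made up, and a real profiler must also cope with sequencing errors and locate the repeat within the genome.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


if __name__ == "__main__":
    # Hypothetical sequence: three copies of GATA flanked by other bases.
    print(longest_tandem_run("TTGATAGATAGATACC", "GATA"))  # prints 3
```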
Choosing an alignment tool took time. With many different ones, it was hard to know where to begin. Even downloading the tools took time, a step that often had to be repeated when students discovered their first tool choice didn't work well.
File formats were another issue and consumed a significant amount of time for the teams. Different tools accept and output different file formats. Many were incompatible; only some were standard.
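The kind of glue code this calls for is often small. As one illustration, the sketch below converts FASTQ (sequence plus per-base quality) to FASTA (sequence only), two widely used text formats. The article does not say which formats the students actually had to convert between, so this pairing is an assumption, and the function expects simple four-line FASTQ records.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, dropping the quality lines."""
    lines = [line.strip() for line in fastq_text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, _quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records) + "\n" if records else ""


if __name__ == "__main__":
    # Hypothetical reads for demonstration.
    example = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n"
    print(fastq_to_fasta(example), end="")
```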
For CSI Columbia, the difficulty level ratcheted up much more than even Erlich and Zaaijer had imagined. (In fact, CSI Columbia had initially been slated to occur first, ahead of the snack hackathon. However, preparing the DNA libraries for CSI Columbia took longer than planned, necessitating a switch in the order of hackathons.)
Students were not originally given any clues as to the identities of the individuals whose DNA was being sequenced; they were told only to search several online genetic databases for a close match. With students having to spend considerable time finding the right tool and overcoming file incompatibilities, halfway through the assignment Erlich narrowed the scope, naming himself, Craig Venter, James Watson, or someone in the 1000 Genomes Project as the possible suspects. This extra information changed the scope considerably: rather than finding a single individual in a sea of others, the task became to look closely at a few individuals, and rule out others. Even then, only one of the five groups made the correct identification.
The main issue had to do with the number of reads students actually had to work with. Nanopore sequencing is less accurate than traditional sequencing, with more errors (deletions, insertions, and substitutions) and more noise. After filtering out reads that did not meet quality requirements for nanopore sequencing, students were left with a subset of reads covering only around 1% of the genome. Such low coverage poses a challenge since much information about ancestry or traits is derived from tiny changes in the DNA (SNPs). Even so, students were able to learn some aspects of an individual's ancestry and traits (including susceptibility to diseases).
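The relationship between filtering and coverage is simple arithmetic: coverage is the total number of bases in the reads that survive filtering, divided by the length of the genome. The sketch below shows one way this could be computed, assuming reads come with Phred+33 quality strings and using an arbitrary error threshold. The reads and genome length in the example are toy values, not the class's data.

```python
def mean_error_probability(quality, offset=33):
    """Average per-base error probability from a Phred-encoded quality string."""
    if not quality:
        raise ValueError("quality string must not be empty")
    # Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality) / len(quality)


def filter_reads(reads, max_error=0.05):
    """Keep (sequence, quality) pairs whose mean error is at most max_error.

    The 0.05 default is an illustrative threshold, not a standard.
    """
    return [(seq, qual) for seq, qual in reads
            if mean_error_probability(qual) <= max_error]


def coverage(reads, genome_length):
    """Total bases in the reads divided by the genome length."""
    return sum(len(seq) for seq, _ in reads) / genome_length


if __name__ == "__main__":
    # Toy data: one good read and one poor read against a 1,000-base "genome".
    reads = [("ACGTACGTAC", "5555555555"), ("ACGTA", "!!!!!")]
    kept = filter_reads(reads)
    print(len(kept), coverage(kept, genome_length=1000))
```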
(Erlich wants to offer the class again and is considering adding an intermediate, "where-you-are" report so students can help one another over encountered roadblocks.)
Fortunately for the students, the grade depended more on methodology and designing a workable sequencing pipeline than coming up with a correct identification. In this regard, the students excelled, even with the severe computational challenges of constructing an integrated pipeline out of several distinct steps (acquisition, storage, distribution, and analysis), each with its own particular file incompatibilities and data storage problems. Without a clear route already mapped out by others, students responded by writing their own code to plug up the holes and seamlessly transition data from one step to another.
The fundamental structure was sound; it was the data that was lacking. But even then, students demonstrated they were able to properly interpret the data they had. If they couldn't identify the exact donor, they still were able to provide a list of traits that in the real world would help narrow the number of suspects.
Zaaijer points out also that students were dealing with a technology that is not yet mature. "Mobile sequencing is just now getting off the ground, and the error-rate in the reads is still relatively high compared to traditional DNA sequencing—though many scientific groups are working on improving this. It was good for the students to experience that not everything is an iPhone where you open the box and it works. Technology evolves by hard work of many people who see a future (and applications) for new types of devices and machines. The hackathons were a good learning experience. Even though there are obstacles to overcome, the students also saw the opportunities the technology has."
Students not only demonstrated they absorbed the basics of DNA sequencing but added ideas and strategies of their own. One team had taken a throw-processing-power-at-the-problem approach, setting up a dedicated server for the sole purpose of downloading the entire genomes of Watson and Venter—enormous files weighing in at 100 gigs for Watson, 80 gigs for Venter. It ran for over 24 hours before the team called a halt.
Interestingly, the one group that did correctly identify its suspect actually had the fewest reads but compensated by using a statistical approach that assigned probabilities to different templates, thus narrowing choices to the most likely candidate. It was an impressive and highly workable solution that Erlich sees as the subject of a possible scientific paper.
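The article does not describe the team's method in detail, so the following is only a guess at the general shape such an approach might take, not a reconstruction of their code. It assumes each candidate has one known base at a handful of variable positions, that every observed base is wrong with a fixed probability, and that all candidates are equally likely beforehand; Bayes' rule then gives a probability for each candidate. The names, positions, and error rate are hypothetical.

```python
import math


def candidate_posteriors(observations, candidates, error_rate=0.15):
    """Posterior probability of each candidate given the observed bases.

    observations: list of (position, observed_base) pairs taken from reads.
    candidates:   {name: {position: expected_base}}; every candidate must
                  have an entry for each observed position.
    Assumes a uniform prior and independent errors, with a wrong base
    equally likely to be any of the other three.
    """
    log_likelihood = {}
    for name, alleles in candidates.items():
        total = 0.0
        for position, base in observations:
            agrees = alleles[position] == base
            total += math.log(1 - error_rate if agrees else error_rate / 3)
        log_likelihood[name] = total
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_likelihood.values())
    weights = {name: math.exp(value - peak)
               for name, value in log_likelihood.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical candidates and observations for demonstration.
    candidates = {
        "suspect_a": {1: "A", 2: "C", 3: "G"},
        "suspect_b": {1: "A", 2: "T", 3: "T"},
    }
    observed = [(1, "A"), (2, "C"), (3, "G")]
    for name, prob in candidate_posteriors(observed, candidates).items():
        print(f"{name}: {prob:.4f}")
```

The appeal of a scheme like this is that it still produces a ranked answer when there are few, error-prone reads, since every observation shifts the probabilities a little rather than having to match exactly.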
The final project, good for 25% of the grade, had students work in pairs to describe a new use for the MinION. Each group proposed a different application, from wastewater management, to safe person identification at borders, to sequencing in zero gravity. Especially innovative was the idea for at-home sequencing to trace potential transplant rejection; another proposed using the sequencer when traveling to find edible food and clean water resources.
How soon before these applications or any others start appearing in the real world? Once before and once after the hackathons, students were asked to estimate when mobile DNA sequencing might replace passport checks at national borders. Their answers were more conservative at the second asking, but not by much. Only one or two students revised their answer. Students clearly see the potential for mobile DNA sequencing, even with first-hand knowledge of the work and dedication still needed to optimize the technology.
Though there were hiccups, the problems had mostly to do with finding the proper tools and overcoming incompatible file formats. Erlich and Zaaijer had been pushing from the beginning to see how far the students could go; that some original assumptions didn't work out was only to be expected. However, the main goal was clearly achieved: students new to DNA sequencing were able, with a little training, to successfully set up a sequencing pipeline and imagine new uses for the MinION. That a sophisticated process once relegated to specialized labs played out relatively smoothly in the classroom points to the huge possibilities of mobile, on-site DNA sequencing.
Says Erlich, "The future is here: we can place DNA sequencers in the hands of our students. No more theoretical explanation of how sequencers work, no more just data wrangling. We can let them feel the internal, promote critical thinking, and a sense of ownership. DNA is everywhere. In your food, on your clothes, everything you touch. By having these sequencers, we can let students get a glimpse for this rich data layer around them."
- Linda Crane
Class photos by Tim Lee
About the researchers
Yaniv Erlich is an assistant professor of computer science at Columbia University and a Core Member of the New York Genome Center. His research is in the fast-moving field of computational genetics, where he has developed algorithms for faster, more accurate sequencing while also inventing new ways to harness the large genomic datasets necessary for examining the genetic basis of diseases. His work has been closely followed in the mainstream press, with features in The New York Times, The Atlantic, NPR, and The Wall Street Journal, among others.
He is also among the first to call attention to the ethics of genetic privacy. A paper he coauthored in 2013 found that anonymous research participants could be identified from their DNA. When he and Joseph Pickrell last year launched DNA.Land, a site for crowdsourcing human genome information for scientific research, the site's privacy rules drew praise for their transparency and clear, understandable language.
Erlich received his PhD in genomics and bioinformatics from the Watson School of Biological Sciences at the Cold Spring Harbor Laboratory in New York, and his B.Sc. in computational neuroscience from Tel-Aviv University. From 2010 to 2015, he was a Fellow at MIT's Whitehead Institute where his research lab created a sequencing strategy to find rare genetic variations, and a short tandem repeat profiler for personal genomes. He also assembled what is probably the world's largest family tree by pooling 43 million profiles from a publicly available genealogy site.
Erlich is the recipient of the Burroughs Wellcome Career Award (2013), Harold M. Weintraub award (2013), the IEEE/ACM-CS HPC award, Goldberg-Lindsay Fellowship, Wolf foundation scholarship for Excellence in exact science, and Emmanuel Ax scholarship, and he was selected for Genome Technology's 2010 Tomorrow's PIs team.
Sophie Zaaijer is a Post-doctoral Researcher in Yaniv Erlich's lab at the New York Genome Center and Columbia University. She is interested in exploring the boundaries of the latest sequencing methodologies.
Zaaijer is from the Netherlands, where she did her undergraduate studies in Music (viola) and Food Technology. For her Masters, she studied Medical Biotechnology at Wageningen University and went to Harvard Medical School to finish her thesis work in Monica Colaiacovo's lab. She next went on to do a PhD in Molecular Biology and Genetics in Julie Cooper's lab at Cancer Research UK, London (now the Crick Institute) as well as at the National Institutes of Health, Bethesda. Her PhD work focused on dysfunctional telomeres and faulty chromosome segregation during cell division.