The future of DNA sequencing is already in the classroom

Halfway through their Ubiquitous Genomics class, 20 students were handed a MinION device, a mobile DNA sequencer the size of two matchboxes laid end to end. This $1000 device, still in development, is expected to play an important role in advancing the goal of real-time, on-site DNA sequencing, vastly increasing the applications for DNA sequencing and who can perform it. For their professor Dr. Yaniv Erlich, the device has a more immediate purpose: a teaching tool that gives students a direct experience with handling and analyzing DNA samples and a close-up and early look at the possibilities of mobile DNA sequencing. Plus he was curious. What happens when you give smart, ambitious students a new device not yet fully explored?

The parasites were a surprise. A food sample given out was pre-measured to contain 80% beef and 20% tomato, but the students sequencing the distributed sample identified three parasites (Babesia bigemina, Wuchereria bancrofti, Onchocerca ochengi) by their DNA and duly noted them as part of their assignment. Identifying parasites in food hadn’t been the original intent, but when you give students a brand new tool not yet in general use, it’s never clear how they are going to use it or what they will find. That’s part of the fun, and the learning, too, and it shows the promise of onsite, immediate DNA sequencing.

But it was not all smooth sailing. While students found the accidental parasites, some also mis-identified the beef—purchased from a local New York City grocery store—as bighorn sheep. Not a huge leap (both animals are in the same family), but it does dampen the excitement about how soon real-time DNA sequencing can be used at airports to screen passengers.

Classroom encounters with DNA sequencing

Sequencing DNA from food samples was the first of two hackathons in the class Ubiquitous Genomics, offered for the first time at Columbia and developed by Dr. Yaniv Erlich, an assistant professor of computer science at Columbia who is also a faculty member of the New York Genome Center. The class teaches the basics of DNA sequencing with an eye on future sequencing technologies that promise to make DNA identification possible in real time at almost any location.

Taught in conjunction with Sophie Zaaijer, a postdoc in Erlich’s NY Genome Center lab, the class combines aspects of computer science, biology, electrical engineering, algorithms, and data science, particularly the special challenges of acquiring, storing, and analyzing huge amounts of genomic data. (The first reading assignment was “Big Data: Astronomical or Genomical?” by Stephens et al.)

Slightly more than half the students were computer science majors.

The class, however, has a major DIY twist. Rather than sending out DNA samples to a lab equipped with $1M sequencing machines, Erlich would have students learn DNA sequencing by handling and sequencing DNA samples themselves.

What makes this scenario even imaginable, let alone possible, is a new, portable DNA sequencing device called a MinION. Inexpensive (approximately $1000), portable, and capable of sequencing DNA in almost real time, the MinION will vastly broaden the applications of DNA sequencing and who can accomplish it.

The MinION is four inches long, weighs 4 ounces, and gets power from a computer’s USB port.

The MinION uses a sequencing method different from traditional (or sequential) DNA sequencing, which works by first breaking up the DNA into tiny snippets and then painstakingly reassembling them by mapping them against a template DNA, a process that can take days and requires a high level of expertise.

Instead, MinION relies on nanopore sequencing, where a single-stranded DNA molecule passes through a small biological pore, or nanopore, embedded in an electrical field. As the DNA molecule transits through the nanopore, the individual nucleotides (A, T, G, C) that make up the DNA strand disrupt the ion current in characteristic ways, creating a profile (called a squiggle) that can be analyzed by software to “decode” the nucleotide sequence, almost in real time.
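To make the squiggle idea concrete, here is a deliberately simplified toy sketch. It assumes each base produces one flat current level, and the level values in `TOY_LEVELS` are invented for illustration; a real pore senses several bases at once, and the actual base-calling software is far more sophisticated. The toy also cannot tell repeated bases apart, since two identical bases in a row produce no change in level.

```python
from statistics import mean

# Hypothetical current levels (arbitrary units), one per base. Invented for illustration.
TOY_LEVELS = {"A": 50.0, "C": 60.0, "G": 70.0, "T": 80.0}


def segment(signal, jump=5.0):
    """Split a non-empty current trace into events wherever the level jumps by more than `jump`.

    Returns the mean current of each event, in order.
    """
    events = []
    current = [signal[0]]
    for sample in signal[1:]:
        if abs(sample - mean(current)) > jump:
            # The level shifted: close the current event and start a new one.
            events.append(mean(current))
            current = [sample]
        else:
            current.append(sample)
    events.append(mean(current))
    return events


def decode(signal):
    """Map each event to the base whose toy level is closest to the event mean."""
    return "".join(
        min(TOY_LEVELS, key=lambda base: abs(TOY_LEVELS[base] - level))
        for level in segment(signal)
    )


# A made-up trace with four flat stretches of slightly noisy current.
trace = [50.5, 49.8, 50.1, 70.2, 69.9, 80.3, 79.6, 80.1, 60.0, 59.7]
decoded = decode(trace)
```

The two steps, cutting the trace into events and then matching each event to a known profile, mirror the description above only in outline.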

Erlich was able to procure for his class 5 MinIONs from the device’s manufacturer, Oxford Nanopore Technologies in the U.K. The devices, still under development, are being selectively distributed to researchers for feedback useful for refining and improving the device. (The class has generated interest among the community growing up around the MinION and was covered by a GenomeWeb article.)

Two hackathons count for half the grade

Half the grade would be determined by two hackathons, where the 20 graduate and undergraduate students, working in small groups, would be given the five MinIONs along with five PCs running MinION software. The first hackathon, “Snack to Sequence,” required student teams to identify ingredients of a food sample prepared by Zaaijer. In the second, “CSI Columbia,” students were given human DNA and asked to identify the specific individual who donated it. The first went much more smoothly than the second.

Before each hackathon, DNA samples have to first be prepared into what is called a DNA library, a step that was done by Zaaijer. Though generating DNA libraries for MinIONs is much simpler than for other sequencers, it is time-consuming, requires a lab setting (and is therefore not mobile yet), and takes some finesse and experience.

With the libraries prepared, the students take over. Using a pipette, they dispense a solution containing the prepared DNA into the MinION’s flow cell (which contains 512 channels containing nanopores). Care must be taken to not introduce air bubbles that render the pores inaccessible. It is tricky, and generally one person on each team learned how and performed the task each time.

As the solution seeps through the flow cell, individual molecules transit the nanopores, and software on the PC powering the MinION starts detecting the ion current disruptions. This raw data (in HDF5 format) gets uploaded to the cloud, where software analyzes the recorded events to identify the individual bases. Minutes later, students begin seeing preliminary sequencing data on their screens. (All reads, along with the new code written, were posted to the class GitHub site: https://github.com/dspeyer/ubiq_genome.)

Not all 512 channels contain a nanopore that produces reads, but those that do produce individual files for each sequenced read. It’s a lot of data in a very short time, both the promise of the MinION and the beginning of the difficulty for the students.

Photos: In a classroom at NY Genome Center, students observe MinION data during the second hackathon. A screenshot shows stats on the number and length of reads. Eventually students get the base pairs needed for identifying samples.
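Statistics on the number and length of reads are easy to compute once the read lengths are known. The sketch below is a minimal illustration, not the MinION software's own report: the function name `read_stats` and the choice of fields are assumptions. It includes N50, a common summary in sequencing: the read length at which reads that long or longer hold at least half of all sequenced bases.

```python
def read_stats(lengths):
    """Summarize a run from its read lengths: count, total bases, mean length, and N50."""
    if not lengths:
        raise ValueError("no reads to summarize")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if running * 2 >= total:
            # Reads at least this long account for half of all bases.
            n50 = length
            break
    return {
        "reads": len(ordered),
        "bases": total,
        "mean_length": total / len(ordered),
        "n50": n50,
    }


# Made-up read lengths, for illustration only.
stats = read_stats([2, 3, 4, 5, 6])
```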

A lot of data needing management

Right away, students were faced with the question of how to transfer thousands of individual files from the lab-supplied PCs to their own (mostly Mac) computers where they could carry out their analysis. The sizes of the files precluded using cloud-based products such as Dropbox, whose free accounts don’t support synchronizing such a large amount of data. The file-transfer issue, after some grappling, was finally solved by placing the data in a BitTorrent Sync folder that was then synced to students’ computers (maxing out the hard drive in at least one case).

Once sequenced data is downloaded, the students head out. Their task is now to compare their reads with existing DNA sequences already in public genetic databases to identify the sample DNA. This they do using existing alignment tools, many free, that compare two or more reads and produce a similarity score.

For the “Snack to Sequence” hackathon, students all used NCBI BLAST, a tool that makes it easy to run stand-alone searches for similar sequences, letting students know whether their read aligns more closely with tomato than zucchini, for instance. The concept is simple, but the difficulty level can ratchet up quickly depending on what two sequences are being compared. Discriminating between two species is one thing; differentiating between two humans who share many of the same traits is something else entirely.
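The idea of a similarity score can be illustrated with a small sketch. This is not BLAST, which uses fast heuristics to search large databases; it is a plain Smith-Waterman local alignment, a classic exact method, with scoring values chosen arbitrarily. The reference sequences are short made-up strings, not real tomato or zucchini DNA.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two strings (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below zero.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_reference(read, references):
    """Return the name of the reference sequence that scores highest against the read."""
    return max(references, key=lambda name: local_alignment_score(read, references[name]))


# Made-up sequences for illustration only.
references = {"tomato": "CCGATTACACC", "zucchini": "CCGTTTTCACC"}
winner = best_reference("GATTACA", references)
```

With error-prone reads, the winning score will rarely be a perfect match, which is why the margin between the best and second-best reference matters as much as the score itself.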

CSI Columbia proved to be much more open ended. Here the aim was to test whether MinION sequencing could be used to identify one specific individual. Normally short tandem repeats (STRs) are used to identify individuals (by the FBI for instance), not the long reads returned by nanopore sequencing. As yet, no scientific framework exists on how to identify an individual using the reads generated from the MinION nanopore sequencer. In trying to do so, students were venturing into new territory.

While there are existing alignment tools for comparing two or more human DNA sequences, almost all were developed for traditional sequencing methods, and the students on their own would need to figure out which alignment tool would work best.

Students in new territory

Choosing an alignment tool took time. With many different ones (many free), it was hard to know even where to begin. Among the first to see MinION data, students were operating without any clear guidelines on what tools would work best on nanopore sequencing data. Even downloading the tools took time, a step that often had to be repeated when students discovered their first tool choice didn’t work well. (Oxford Nanopore Technologies is working on data analysis tools specialized for MinION data.)

File formats were another issue and consumed a significant amount of time for the teams. Different tools accept and output different file formats; many were incompatible, only some were standard.
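Much of this glue work amounts to small converters. As one illustration, the sketch below turns FASTQ (reads with per-base quality scores) into FASTA (sequences only), two standard text formats in sequencing. Whether the teams needed this exact conversion is an assumption; the function name `fastq_to_fasta` is hypothetical, and the code assumes the common four-lines-per-record FASTQ layout.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy names and sequences from four-line FASTQ records into FASTA; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if header == "":
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        sequence = fastq_handle.readline().rstrip("\n")
        separator = fastq_handle.readline()
        fastq_handle.readline()  # quality line, dropped because FASTA has no qualities
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + sequence + "\n")
        count += 1
    return count


# In-memory example with two made-up reads.
source = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\nII\n")
target = io.StringIO()
converted = fastq_to_fasta(source, target)
```

The conversion is lossy by design: the quality scores are discarded, so any quality filtering has to happen before this step.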

For CSI Columbia, the difficulty level ratcheted up much more than even Erlich and Zaaijer had imagined. (In fact, CSI Columbia had initially been slated to occur first, ahead of the snack hackathon. However, preparing the DNA libraries for CSI Columbia took longer than planned, necessitating a switch in the order of hackathons.)

Students were not originally given any clues as to the identity of their suspect individuals, just told to take their human DNA sample and search public genetic databases to find the person. With students having difficulty finding the right tool and overcoming file incompatibilities, halfway through the assignment Erlich named three possible subjects (Erlich himself, Craig Venter, and James Watson), with someone in the 1000 Genomes Project as a further possibility. This extra information changed the scope considerably: rather than finding a single individual in a sea of others, the task became looking closely at three individuals, and ruling out two. Even then, only one of the five groups made the correct identification.

The main issue had to do with the number of reads remaining after students filtered out reads not meeting the quality requirements for nanopore sequencing, leaving a subset of reads to be aligned to the reference genome. The MinION reads covered only around 1% of the genome, which is extremely low coverage. In addition, nanopore sequencing is less accurate and has more errors (deletions, insertions, and substitutions) and more noise than traditional sequencing. This poses a challenge, since much of the information about ancestry or traits lies in tiny changes in the DNA, such as single-nucleotide polymorphisms (SNPs). Even so, students were able to learn aspects of an individual’s ancestry and traits (including susceptibility to diseases) but didn’t have enough data to differentiate one individual from others who shared some of the same characteristics.
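A back-of-the-envelope calculation shows why 1% coverage is so limiting. The sketch below assumes reads land uniformly at random on the genome (a standard Poisson simplification), uses a round figure of about 3.1 billion bases for the human genome, and picks 1,000 informative sites purely as an example. At such low depth, the average depth and the fraction of the genome covered are nearly the same number.

```python
import math


def coverage_depth(total_aligned_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_aligned_bases / genome_size


def fraction_covered(depth):
    """Expected fraction of positions hit by at least one read, under a Poisson model."""
    return 1.0 - math.exp(-depth)


def expected_sites_observed(n_sites, depth):
    """Expected number of informative sites (for example, SNPs) seen at least once."""
    return n_sites * fraction_covered(depth)


# Illustrative numbers: 31 million aligned bases on a roughly 3.1 billion base genome.
depth = coverage_depth(31_000_000, 3_100_000_000)
observed = expected_sites_observed(1_000, depth)
```

Under these assumptions, only about 10 of every 1,000 informative sites would be observed even once, and each of those observations is itself subject to the read errors described above.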

(Erlich, who wants to offer the class again, is considering adding an intermediate, “where-you-are” report so students can help one another over particular humps.)

Fortunately for the students, the grade depended more on methodology and designing a workable pipeline for sequencing DNA and analyzing the results. In this regard, the students excelled, even with the severe computational challenges of constructing an integrated pipeline out of several distinct steps (acquisition, storage, distribution, and analysis), each with its own particular file incompatibilities and data storage problems. Without a clear route already mapped out by others, students responded by writing their own code to plug up the holes and seamlessly transition data from one step to another.

The fundamental structure was sound; it was the data that was lacking. But even then, students demonstrated they were able to properly interpret the data. If they couldn’t identify the exact donor with the data they had, they still were able to provide a list of traits that in the real world would help narrow the number of suspects.

Zaaijer points out also that students were dealing with a technology that is not yet mature. “Mobile sequencing is just now getting off the ground, and the error-rate in the reads is still relatively high compared to traditional DNA sequencing—though many scientific groups are working on improving this. It was good for the students to experience that not everything is an iPhone where you open the box and it works. Technology evolves by hard work of many people who see a future (and applications) for new types of devices and machines. The hackathons were a good learning experience. Even though there are obstacles to overcome, the students also saw the opportunities the technology has.”

Students not only demonstrated they absorbed the basics of DNA sequencing but added ideas and strategies of their own. One team had taken a throw-processing-power-at-the-problem approach, setting up a dedicated server for the sole purpose of downloading the entire genomes of Watson and Venter—enormous files (100 gigs for Watson, 80 gigs for Venter). It ran for over 24 hours before the team called a halt.

Interestingly the one group that did identify its suspect actually had the fewest reads, but compensated by using a statistical approach that assigned probabilities to different templates, thus narrowing choices to the most likely candidate. Even though their data was far from complete, the team made the correct identification. It was an impressive and highly workable solution that Erlich sees as the subject of a possible scientific paper.
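The article does not describe the team's actual method, so the following is only a generic sketch of how probabilities can be assigned to candidate templates. It assumes a uniform prior over candidates, one expected base per site for each candidate (ignoring that humans carry two copies of each chromosome), independent observations, and an invented per-base error rate of 15%. All names and data are hypothetical.

```python
import math


def candidate_posteriors(observations, candidates, error_rate=0.15):
    """Posterior probability of each candidate given the bases observed at known variable sites.

    observations: list of (site, observed_base) pairs taken from the reads.
    candidates: dict mapping candidate name -> dict of site -> expected base.
    error_rate: assumed chance that an observed base is wrong (must be between 0 and 1).
    """
    log_likelihoods = {}
    for name, genotype in candidates.items():
        total = 0.0
        for site, base in observations:
            expected = genotype.get(site)
            if expected is None:
                continue  # no information about this site for this candidate
            # A wrong base is assumed equally likely to be any of the other three.
            probability = 1.0 - error_rate if base == expected else error_rate / 3.0
            total += math.log(probability)
        log_likelihoods[name] = total
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    top = max(log_likelihoods.values())
    weights = {name: math.exp(value - top) for name, value in log_likelihoods.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Made-up genotypes and observations for two hypothetical candidates.
candidates = {
    "candidate_1": {1: "A", 2: "C", 3: "G"},
    "candidate_2": {1: "A", 2: "T", 3: "T"},
}
posteriors = candidate_posteriors([(1, "A"), (2, "C"), (3, "G")], candidates)
```

The appeal of this kind of approach is that a few sites where candidates differ can shift the probabilities decisively, even when most of the genome has not been observed at all.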

Final project

The final project, good for 25% of the grade, had students work in pairs to describe a new use for the MinION. Each group proposed a different application, from waste water management, to safe person identification at borders, to sequencing in zero gravity. Especially innovative was the idea for at-home sequencing to trace potential transplant rejection; another proposed using the sequencer when traveling to find edible food and clean water resources.

How soon before these applications or any others start appearing in the real world? The students themselves, after the hackathon experiences, may be more conservative than others who haven’t actually tried to do mobile DNA sequencing. Students were asked twice when mobile DNA sequencing would replace passport checks at national borders, once before the hackathons and once after. Their answers were more conservative and perhaps more realistic at the second asking. Students still see the breadth of new applications for mobile DNA sequencing, but know firsthand that technical difficulties still need to be resolved.

Erlich and Zaaijer, however, focus on how much students new to genomics were able to accomplish.

Though there were hiccups, the problems had more to do with finding the proper tools and overcoming incompatible file formats. Erlich and Zaaijer had been pushing from the beginning to see how far the students could go; that some original assumptions didn’t work out was only to be expected. However, the main goal was clearly achieved: students new to DNA sequencing were able with a little training to successfully set up a sequencing pipeline and imagine new uses for the MinION. A sophisticated process once relegated to specialized labs worked in the classroom. It points to the huge possibilities of mobile, onsite DNA sequencing.

Says Erlich, “The future is here: we can place DNA sequencers in the hands of our students. No more theoretical explanation of how sequencers work, no more just data wrangling. We can let them feel the internal, promote critical thinking, and a sense of ownership. DNA is everywhere. In your food, on your clothes, everything you touch. By having these sequencers, we can let students get a glimpse for this rich data layer around them.”

Posted 2/23/16
Class photos by Tim Lee
– Linda Crane
