About ten years ago, I spent a summer with other high school students in a program at the Waksman Institute of Microbiology. The program's goal was to introduce us to protocols for extracting plant DNA and isolating regions of interest for sequencing. We learned how to use restriction enzymes to cut the DNA into smaller fragments, bacterial transformation to make copies of the DNA within E. coli, PCR to make copies of DNA without the help of E. coli, and gel electrophoresis to separate the DNA fragments by size and isolate the one(s) we wanted. Finally, the DNA had to be sequenced, and for this, we were introduced to the Sanger method, published in 1977 by Frederick Sanger and his colleagues.
The Sanger method involves adding modified nucleotides called dideoxynucleotides, which can form a bond at only one end. Think of a Lego piece with a flat top: nothing more can be attached to it. Thus, a growing DNA chain that incorporates such a nucleotide terminates immediately. If these nucleotides are mixed in with regular nucleotides during a process like PCR, the result is a set of DNA fragments with the same starting point and varying endpoints. If only one type of dideoxynucleotide, such as dideoxyadenosine triphosphate (ddATP), is used, then all the resulting fragments end with an 'A'. If these fragments are then separated by gel electrophoresis, one can get a rough idea of the positions where 'A' shows up in the DNA sequence of interest. If lanes for 'C', 'G', and 'T' are run next to the one for 'A', one can read the DNA sequence directly off the gel, from the shortest fragment to the longest. This is the basic principle of the Sanger method.
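To make the principle concrete, here is a minimal Python sketch of the idea, not of any real lab protocol. The function names, the termination probability, and the number of copies are all assumptions chosen for illustration. It also simplifies the biology by treating the input string as the strand being synthesized, ignoring the complementary template, and by assuming the gel resolves every fragment length perfectly.

```python
import random

BASES = "ACGT"


def sanger_fragments(strand, dd_base, n_copies=2000, p_terminate=0.2, seed=0):
    """Simulate one lane: return the set of fragment lengths observed.

    Each copy grows one base at a time. Whenever the next base matches
    dd_base, a dideoxynucleotide is incorporated with probability
    p_terminate (an assumed value), which ends that copy.
    """
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_copies):
        for position, base in enumerate(strand, start=1):
            if base == dd_base and rng.random() < p_terminate:
                lengths.add(position)  # chain terminated at this length
                break
    return lengths


def read_gel(lanes, length):
    """Read the sequence from the shortest fragment to the longest.

    lanes maps each base to the fragment lengths seen in its lane.
    A length with no band, or an ambiguous one, is reported as 'N'.
    """
    sequence = []
    for size in range(1, length + 1):
        hits = [base for base in BASES if size in lanes[base]]
        sequence.append(hits[0] if len(hits) == 1 else "N")
    return "".join(sequence)


def sequence_by_sanger(strand):
    """Run four single-dideoxynucleotide lanes and read off the result."""
    lanes = {base: sanger_fragments(strand, base) for base in BASES}
    return read_gel(lanes, len(strand))


if __name__ == "__main__":
    print(sanger_fragments("GATTACA", "A"))  # lengths at which 'A' ends a fragment
    print(sequence_by_sanger("GATTACAGGCT"))
```

With enough copies, every position ends at least one fragment in exactly one lane, so reading the four lanes side by side in order of fragment length recovers the sequence. With too few copies, or a termination probability that is too high, later positions would go missing and show up as 'N', which hints at why read length is limited in practice.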
By the time school started again, we had become familiar with the techniques and protocols. We continued to return to the Waksman Institute periodically to apply them. We would eventually use the sequence data from these visits to construct a phylogenetic tree of the Allium (i.e. onion) genus. Unfortunately, the data collection process could often be slow and annoying. There were many stages at which something could go wrong, and I would have to return to the beginning. All of this work produced just a tiny fraction of the sequence information in these genomes.
A lot can happen in ten years. Thanks to my friends in the Broad's Outreach Program, I had a chance to visit 320 Charles St., the location of the Broad Institute's DNA sequencing facility. It is sometimes called a high-throughput production facility because of the rate at which it sequences DNA. The facility was responsible for many of the sequences that were part of the Human Genome Project, and I was about to find out how they did it.
We entered 320 Charles St. and sat down for a presentation. Before we could start our tour of the facility, one of the scientists wanted to describe the process. To my surprise, she described the Sanger method. How could this be the process of a high-throughput production facility? Once the tour started, it became clear how: they had industrialized the process. We had entered a factory, complete with conveyor belts, robotic arms, and computers. A group of technicians made sure that the work on this genome assembly line went smoothly. Others, including the scientist leading the tour, were working on ways to industrialize new and improved sequencing methods developed by Solexa and 454.
It was interesting to learn that part of the increase in sequencing rate has come from engineering solutions to scale up production. The amount of sequence data now available is enabling some researchers to ask questions that may previously have been too time-consuming to answer. I have talked to biologists this summer who have told me how challenging data collection can be, and I am starting to realize how those difficulties play a role in the questions they ask. How might these questions change if other protocols for data gathering were similarly industrialized?