HOW IS THE GENOME DECODED?
The first whole human genome sequence was completed in draft form in 2001 and in a more finished form in 2003. The project took over a decade to complete, involved ~2,000 researchers, and cost $500 million to $1 billion. The genome sequence is a composite because it was prepared using DNA pooled from several individuals, and thus does not represent a single individual’s genome.
The genome sequence was determined primarily using technologies that decoded short fragments, ~1,000 bases in length, followed by computer-based assembly of those short “reads” into longer continuous sequences based on information from fragment overlaps. The longer continuous sequences, about 150,000 base pairs in length, were mapped and assembled into the whole genome sequence based on a previously developed rough map of the genome. The final sequence, referred to as the “reference genome,” is approximately 3 billion base pairs in length and includes each of the 22 autosomes plus the X and Y sex chromosomes. Some gaps in the reference genome sequence still exist; these occur in regions of the genome that are difficult to sequence using current technologies or that are highly variable among people.
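The idea of assembling reads from their overlaps can be illustrated with a toy example. The following is a minimal sketch, not the method used for the reference genome: it assumes error-free reads from a single strand with no repeats, and the function names and the `min_overlap` parameter are hypothetical. It repeatedly merges the two sequences that share the longest suffix-to-prefix overlap.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; leave the contigs separate
        # Join the two sequences, writing the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three toy reads that overlap end to end.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Real assemblers must also cope with sequencing errors, both DNA strands, and repeated sequences, which is one reason a separate map of the genome was needed to order the longer pieces.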
Using current technologies, it is now possible to sequence an individual’s genome in a few days. The present approach involves sequencing millions of short fragments, typically about 100–150 bases in length (Figure a and Figure b). The fragment sequences are mapped to the reference genome, and variants with respect to the reference are identified (or “called” in geneticists’ lingo). The methods have a ~1% error rate. Therefore, to ensure accuracy, each base is typically sequenced an average of 30 times (i.e., the “sequencing depth” is 30-fold or 30X). This method works well for calling single base variants and short insertions and deletions. Larger structural variants (large insertions, deletions, and inversions), however, are more difficult to determine, and specially designed computer algorithms are used to find them (Appendix A). The end result is that a person’s genome sequence is defined as a set of variants relative to a reference sequence, and those variants that land in a gene associated with human disease are scrutinized most heavily to assess whether they might be disease-causing.
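The arithmetic behind sequencing depth can be sketched as follows. Average depth is the total number of sequenced bases divided by the genome size, so the number of reads needed is depth × genome size ÷ read length. The second function is a simplified illustration of why redundancy helps: it assumes read errors are independent and occur at a fixed rate, which real data only approximate. The function names are hypothetical.

```python
from math import ceil, comb


def reads_needed(genome_size: int, read_length: int, depth: float) -> int:
    """Number of reads required to reach a given average depth."""
    return ceil(depth * genome_size / read_length)


def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """Binomial probability that k or more of `depth` reads are wrong at one base.

    Assumes independent errors at a fixed per-base rate (a simplification).
    """
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


# Values taken from the text: ~3 billion bases, 150-base reads, 30X depth.
print(reads_needed(3_000_000_000, 150, 30))

# Chance that half or more of 30 reads are erroneous at a 1% error rate.
print(prob_at_least_k_errors(30, 15, 0.01))
```

With the figures quoted above, a 30X genome corresponds to roughly 600 million reads of 150 bases, and under these assumptions the chance that errors alone account for half of the reads at a position is vanishingly small.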
Figure. (a) Deciphering genome sequences. A person’s genomic DNA is broken into short fragments, and sequences (typically 100–150 nucleotides) are determined from the ends of the fragments. These are mapped to the reference genome and variants (black bars) are identified. (b) An example of a variant identified in the ABCA4 gene, which is involved in retinal function. Mutation in both copies of this gene can lead to retinal disease. Gray bars represent sequences that are identical to the reference sequence shown at the bottom of the figure. Differences in the sequences are shown as letters on the gray bars. In this individual, about one half of the reads contain a T instead of a C within the coding sequence; the individual is thus heterozygous, that is, one gene copy has a variant and the other matches the reference sequence.
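The reasoning in panel (b), where about half of the reads show a T and the rest match the reference C, can be expressed as a simple rule on the fraction of reads carrying the non-reference base. The sketch below is illustrative only: the thresholds (`min_depth`, `het_low`, `het_high`) are arbitrary assumptions, and production variant callers use probabilistic models rather than fixed cutoffs.

```python
from collections import Counter


def call_genotype(
    pileup_bases: str,
    ref_base: str,
    min_depth: int = 10,      # assumed minimum number of reads to make a call
    het_low: float = 0.2,     # assumed lower bound for a heterozygous call
    het_high: float = 0.8,    # assumed upper bound for a heterozygous call
) -> str:
    """Classify one position from the bases observed in the reads covering it."""
    counts = Counter(pileup_bases.upper())
    depth = sum(counts.values())
    if depth < min_depth:
        return "no call"
    non_ref = {base: n for base, n in counts.items() if base != ref_base}
    if not non_ref:
        return "homozygous reference"
    # Consider only the most frequent non-reference base.
    alt_base, alt_count = max(non_ref.items(), key=lambda item: item[1])
    fraction = alt_count / depth
    if fraction < het_low:
        return "homozygous reference"  # a few stray bases, likely read errors
    if fraction <= het_high:
        return f"heterozygous {ref_base}/{alt_base}"
    return f"homozygous {alt_base}/{alt_base}"


# 30 reads at a position where the reference base is C: 16 show C, 14 show T.
print(call_genotype("C" * 16 + "T" * 14, "C"))
```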
It is important to note that either one copy or both copies of any given gene (i.e., genes from both parents) can contain a variant with respect to the reference sequence, and it is also possible that different variants can be found in the same or both copies of any given gene. Phasing is a process by which variants are mapped to the same or opposite chromosomes. If two variants occur together on the same chromosome they are “in phase.” Phasing is important for predicting the functional consequence of variants in a given gene (Figure 1). For example, if sequencing reveals the presence of two deleterious variants affecting a gene that has an important role in taste, and phasing reveals that the two deleterious variants are on the same chromosome (in the same copy of a gene), the copy of the gene on the opposite chromosome will be unaffected and the presence of the normal gene product might mask the effect of the deleterious variants—and the sense of taste will be normal. In contrast, if the two deleterious variants occur in separate copies of the gene (i.e., the genes from both parents are affected), no normal gene product will be produced—and the sense of taste will be compromised.
Figure 1. The importance of “phasing variants.” When two deleterious variants reside in one gene copy and none in the other, one good copy exists (top). However, if both gene copies carry a deleterious variant, then both copies are inactivated and this can result in disease. When the two variants are different, this situation is called a “compound heterozygote.” Thus, it is important not only to identify variants but also to know where they lie relative to one another.
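The two scenarios in Figure 1 can be told apart mechanically once variants are phased. The sketch below assumes each deleterious variant in a gene is written as a phased genotype string such as "1|0" or "0|1" (the convention used in VCF files, where the position before the bar refers to one chromosome copy and the position after it to the other, and "0" means the reference allele). The function name is hypothetical, and the sketch assumes every listed variant is deleterious.

```python
def gene_copies_affected(phased_genotypes: list[str]) -> str:
    """Report whether deleterious variants hit one or both copies of a gene.

    Each entry is a phased diploid genotype such as "1|0"; "0" is the
    reference allele and any other value is a variant allele.
    """
    copy_hit = [False, False]
    for genotype in phased_genotypes:
        if "|" not in genotype:
            raise ValueError(f"genotype {genotype!r} is not phased")
        first, second = genotype.split("|")
        copy_hit[0] = copy_hit[0] or first != "0"
        copy_hit[1] = copy_hit[1] or second != "0"
    if all(copy_hit):
        return "both copies affected"
    if any(copy_hit):
        return "one copy affected"
    return "no copy affected"


# Two variants on the same chromosome: one intact copy remains.
print(gene_copies_affected(["1|0", "1|0"]))

# Two variants on opposite chromosomes: a compound heterozygote.
print(gene_copies_affected(["1|0", "0|1"]))
```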
To date, many genomes have been sequenced. A large-scale project called the “1000 Genomes Project” has determined the genome sequences of over one thousand people from highly diverse backgrounds and regions around the world. One of the products of this effort is a catalog of many of the common variants found in human populations. Over 50 million variants have been identified. In addition, over one thousand healthy people have had their genomes sequenced, including many individuals who are excited about the potential of personalized medicine and genomics and who, therefore, often committed their own resources to have their genome sequence determined. These include celebrities such as Ozzy Osbourne and Glenn Close as well as many regular people. Additionally, many thousands of individuals with diseases such as cancer or diseases of unknown causation have had their genomes sequenced.
There are alternatives to whole genome sequencing that determine DNA sequences of specific, targeted areas of the genome. For example, exome sequencing is aimed at determining the sequences of the protein-coding regions of the genome (Figure 2 and Figure 3). The exome is the portion of the genome that encodes mature RNA, which carries out many biological functions; it is the part of the genome that is easiest to interpret. Other even more targeted approaches determine the sequences of particular subsets of genes that are commonly implicated in certain diseases. These various targeted approaches have historically been less expensive than whole genome sequencing and can provide more accurate sequence information for the regions of interest because “deeper” sequencing is feasible, meaning that the region can be sequenced more times. Exome and targeted approaches may increase the ability to detect heterogeneity (for example, somatic mutations in a subset of tumor cells) in a sample. Although the targeted approaches may be more resource- and time-efficient and permit increased depth and accuracy compared with whole genome sequencing, these advantages must be weighed against the lesser amount of sequence information obtained. Also, targeted approaches are limited in their ability to detect structural variants. On the other hand, whole genome sequencing generates a more complete set of information, but its value is necessarily constrained by our current limited knowledge about the non-protein-coding regions of the human genome and their role in health and disease. Exome sequencing is presently the most widely used method for decoding a person’s DNA, but whole genome sequencing is gaining in popularity.
Figure 2. Genes can be divided into sections, with only some sections represented in the final, mature RNA. The parts of the gene that encode the mature RNA are called exons; the intervening noncoding regions (which are removed or “spliced” out in the final, mature RNA) are called introns. The exome refers to the portion of the genome that is represented in mature RNA. Presently, it is the easiest part of the genome to interpret.
Figure 3. Exome sequencing. Probes can be used to capture only the exonic regions of the genome, which can then be sequenced. Because the protein-coding exome comprises only ~1%–2% of the total genomic DNA, exome sequencing results in enormous cost savings compared with sequencing the entire genome. Also, because the scale is smaller, the exome is amenable to deep sequencing, which allows detection of rare variants that might otherwise be overlooked (as can occur in cancer).
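A back-of-the-envelope calculation shows why a smaller target permits deeper sequencing. The sketch below asks what average depth the exome would reach if the same amount of sequence used for a 30X whole genome were spent on the exome alone. It is an idealized upper bound that assumes every sequenced base lands on the target, which real capture experiments do not achieve; the function name is hypothetical.

```python
def target_depth(total_bases_sequenced: float, genome_size: float, target_fraction: float) -> float:
    """Average depth over a target that makes up `target_fraction` of the genome."""
    return total_bases_sequenced / (genome_size * target_fraction)


GENOME_SIZE = 3_000_000_000          # ~3 billion base pairs (from the text)
WHOLE_GENOME_BASES = 30 * GENOME_SIZE  # sequence output of a 30X whole genome

# The exome is ~1%-2% of the genome (from the text).
for fraction in (0.01, 0.02):
    depth = target_depth(WHOLE_GENOME_BASES, GENOME_SIZE, fraction)
    print(f"exome = {fraction:.0%} of genome -> about {depth:.0f}X")
```

Equivalently, reaching a fixed depth over the exome requires only about 1%–2% of the sequencing needed to reach that depth over the whole genome, which is the source of the cost savings.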
Appendix A. Some of the different computational approaches used for mapping structural variants. (1) By comparing the sequences at the ends of fragments of known length with the reference genome, it can be deduced whether there is a deletion, insertion, or inversion in the sequenced regions. (2) By simply counting the number of independent sequences that map to a genomic interval, it can be deduced whether there are normal numbers of copies, too few (i.e., a deletion of one or more copies), or extra copies of a region. An example of a region containing a deletion (dip on the left) and one or more extra copies (increased signal on the right) is shown beneath the illustration. (3) Sequences that are juxtaposed in a person’s genome when normally separated in the reference genome indicate the presence of a deletion in that person’s DNA.
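Approaches (1) and (2) can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular tool: the function names, the `tolerance` parameter, and the example numbers are assumptions. For approach (1), the two ends of a fragment of known length are expected to map a matching distance apart and on opposite strands; for approach (2), read depth in a window is compared with the depth expected for two copies.

```python
def classify_read_pair(
    mapped_distance: int,
    expected_distance: int,
    tolerance: int,
    same_strand: bool,
) -> str:
    """Approach (1): interpret how the two ends of one fragment map to the reference."""
    if same_strand:
        return "inversion"   # one end maps in the unexpected orientation
    if mapped_distance > expected_distance + tolerance:
        return "deletion"    # ends are farther apart in the reference than in the person
    if mapped_distance < expected_distance - tolerance:
        return "insertion"   # ends are closer in the reference than in the person
    return "concordant"


def copy_number_from_depth(
    window_depths: list[float],
    expected_depth: float,
    ploidy: int = 2,
) -> list[int]:
    """Approach (2): estimate copy number per window from read depth."""
    return [round(ploidy * depth / expected_depth) for depth in window_depths]


# Fragments are ~400 bases long; this pair maps 900 bases apart on the reference.
print(classify_read_pair(900, 400, tolerance=100, same_strand=False))

# Depth per window for a genome sequenced at 30X: a dip, then an increase.
print(copy_number_from_depth([30, 15, 31, 46, 0], expected_depth=30))
```

In practice, a single discordant pair or a single low-depth window is weak evidence; callers look for clusters of supporting reads before reporting a structural variant.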