Human genome sequencing has provided some startling insights:
- Coding sequences constitute less than 2% of the total genome, while greater than 50% of the genome represents blocks of repetitive nucleotides of uncertain function.
- The genome contains only 20,000 to 25,000 genes that code for proteins, although alternative splicing can generate more than 100,000 proteins.
- On average, any two individuals share 99.5% of their DNA sequences, so that the remarkable diversity of humanity (and the basis of genetic diseases) rests in less than 0.5% of the DNA (or roughly 15 million base pairs).
- The most common forms of genetic variation are single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). SNPs are variations at single nucleotide positions and are mostly biallelic (i.e., only two nucleotide choices exist at a given site in the population). More than 6 million SNPs have been identified in the human genome, but less than 1% occur in coding regions. Thus, although SNPs may have significance in causing disease, most are probably only markers co-inherited with the authentic genetic disease locus. CNVs are variations in the number of copies of large contiguous stretches of DNA, from 1000 base pairs to millions of base pairs. While some CNVs are biallelic, others have multiple different variants in the population. CNVs account for 5 to 24 million base pairs of sequence difference between any two individuals; roughly 50% involve gene-coding sequences and may thus form much of the basis of phenotypic diversity.

With the completed human genome sequence, analysis need no longer focus narrowly on individual genes; genomic analysis, the study of all genes and their interactions, is now possible. Genomics promises to help unravel complex multigenic diseases; DNA microarray analysis of tumours is but one example. However, alterations in primary sequence cannot alone explain human genetic diversity. Epigenetics (heritable changes in gene expression that are not caused by changes in the DNA sequence) is involved in tissue-specific expression of genes and genetic imprinting (see later discussion). Beyond genomics, proteomics, the analysis of all the proteins expressed in a cell, is providing additional pathogenic insights. The ability to analyse all the patterns of genetic and protein expression is the province of computer-based bioinformatics.

It is also increasingly appreciated that genes that do not code for proteins can have important regulatory functions.
For example, small RNA molecules called microRNAs (miRNAs) inhibit gene expression; there are approximately 1000 miRNA genes in the human genome (5% of the total genome). Transcripts from these miRNA genes are processed in the cytoplasm to yield mature oligomers that are 21 to 30 nucleotides in length. Subsequent interactions between the processed miRNA, its target messenger RNA (mRNA), and the RNA-induced silencing complex (RISC) lead to cleavage of the mRNA or repression of its translation. The same pathway can be exploited therapeutically by introducing exogenous small interfering RNAs (siRNAs) targeted to specific mRNA species (e.g., oncogene transcripts).
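
Several of the figures quoted in the list at the start of this section can be cross-checked with simple arithmetic. The short Python sketch below is only a back-of-envelope illustration: the function names are hypothetical, and the genome size is not stated in the text but is inferred from the two quoted numbers (0.5% and roughly 15 million base pairs).

```python
"""Back-of-envelope checks on the genome figures quoted in the text.

All function names are illustrative; the input numbers are the rounded
values given in the text, so the results are rough estimates only.
"""


def implied_genome_size_bp(differing_bp: float, differing_percent: float) -> float:
    """Genome size implied by an absolute difference and its percentage."""
    return differing_bp / (differing_percent / 100)


def coding_snp_upper_bound(total_snps: float, coding_percent: float) -> float:
    """Upper bound on coding-region SNPs, given 'less than X%' are coding."""
    return total_snps * coding_percent / 100


def proteins_per_gene(total_proteins: float, gene_count: float) -> float:
    """Average number of protein products per protein-coding gene."""
    return total_proteins / gene_count


if __name__ == "__main__":
    # 0.5% of the genome is said to equal roughly 15 million base pairs.
    genome_bp = implied_genome_size_bp(15_000_000, 0.5)
    print(f"Implied genome size: about {genome_bp:,.0f} bp")

    # "More than 6 million SNPs ... less than 1% occur in coding regions."
    print(f"Coding SNPs: fewer than {coding_snp_upper_bound(6_000_000, 1):,.0f}")

    # 20,000 to 25,000 genes giving rise to more than 100,000 proteins.
    low = proteins_per_gene(100_000, 25_000)
    high = proteins_per_gene(100_000, 20_000)
    print(f"Proteins per gene: at least {low:.0f} to {high:.0f} on average")
```

Under these assumptions the quoted numbers are mutually consistent with a genome of about 3 billion base pairs, and alternative splicing would need to yield at least four to five protein products per gene on average to account for the stated protein count.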