12 Genomics and Beyond
12.1 Learning Objectives
By the end of this chapter, you should be able to:
- Define genomics and distinguish it from genetics
- Describe major genomic technologies including sequencing, microarrays, and bioinformatics
- Explain how genomes are annotated and what information genomic sequences contain
- Compare genome organization across different organisms
- Describe functional genomics approaches to understand gene function
- Explain the principles and applications of comparative genomics
- Analyze ethical, legal, and social implications of genomic information
- Apply genomic concepts to understand human health, evolution, and biodiversity
12.2 Introduction
Genomics represents a paradigm shift in biology—from studying individual genes to analyzing entire genomes. This chapter explores how modern technologies enable us to sequence, assemble, and interpret genomes, transforming our understanding of biology, medicine, and evolution. We will examine how genomic approaches build upon classical and molecular genetics, providing comprehensive views of genetic information and its expression. As we enter the post-genomic era, we face both unprecedented opportunities and significant challenges in interpreting and applying genomic knowledge.
12.3 Introduction to Genomics
12.3.1 Defining Genomics
Genomics: Study of genomes—their structure, function, evolution, and mapping
Contrast with genetics:
- Genetics: Study of individual genes and their inheritance
- Genomics: Holistic study of all genes and their interactions
12.3.2 Historical Milestones
1977: First DNA genome sequenced (ΦX174 bacteriophage, 5,386 bp)
1995: First bacterial genome (Haemophilus influenzae, 1.8 Mb)
1996: First eukaryotic genome (Saccharomyces cerevisiae, 12 Mb)
1998: First multicellular organism (Caenorhabditis elegans, 97 Mb)
2000: Drosophila melanogaster (180 Mb)
2001: Human genome draft sequences published
2022: >100,000 genomes sequenced across species
12.3.3 Types of Genomics
Structural genomics: Genome sequencing and mapping
Functional genomics: Gene function analysis
Comparative genomics: Genome comparisons across species
Metagenomics: Study of microbial communities
Pharmacogenomics: Drug response based on genetics
Epigenomics: Genome-wide epigenetic patterns
12.3.4 The Genomics Revolution
Impact:
- Biology: New insights into evolution, development, diversity
- Medicine: Personalized medicine, disease gene discovery
- Agriculture: Crop improvement, livestock breeding
- Forensics: DNA fingerprinting, identification
- Conservation: Biodiversity assessment
12.4 Genome Sequencing Technologies
12.4.1 Sanger Sequencing
Developer: Frederick Sanger (1977)
Principle: Dideoxy chain termination
- Dideoxynucleotides: Lack 3’-OH, terminate chain elongation
- Four reactions: Each with different ddNTP
- Detection: Originally radioactive, later fluorescent
Advantages: High accuracy (~99.999%)
Limitations: Low throughput, expensive per base
Human Genome Project: Primarily used Sanger sequencing
12.4.2 Next-Generation Sequencing (NGS)
Characteristics: Massively parallel, high throughput, lower cost
Major platforms:
- Illumina: Sequencing by synthesis, reversible terminators
- Ion Torrent: Semiconductor detection of pH changes
- SOLiD: Sequencing by ligation
Workflow:
- Library preparation: Fragment DNA, add adapters
- Cluster amplification: On flow cell (Illumina) or beads
- Sequencing: Cyclic addition of nucleotides
- Data analysis: Base calling, alignment, variant calling
Applications: Whole-genome sequencing, exome sequencing, RNA-seq, ChIP-seq
12.4.3 Third-Generation Sequencing
Single-molecule sequencing: No amplification needed
Platforms:
- PacBio: Real-time sequencing (SMRT technology)
- Oxford Nanopore: Measures current changes as DNA passes through pore
Advantages:
- Long reads: Up to 100 kb (PacBio), >1 Mb (Nanopore)
- Direct detection: Modifications (methylation) detectable
- Portable: MinION device (USB-sized)
Challenges: Higher error rates, requires computational correction
12.4.4 Sequencing Statistics
Coverage: Average number of times each base is sequenced
- Human genome: Typically 30× coverage for WGS
- Calculation: (Total bases sequenced) / (Genome size)
Read length: Varies by technology
- Illumina: 50-300 bp
- PacBio: 10-100 kb
- Nanopore: Up to 2 Mb
Cost reduction: From ~$100 million per human genome (2001 estimate) to <$1,000 (2020s); the Human Genome Project as a whole cost ~$3 billion
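The coverage calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming all reads have the same length and all of them come from the target genome; the function names and the example run are hypothetical, and real coverage is uneven across a genome.

```python
import math


def coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth = total bases sequenced / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical run: 1 million 150-bp reads from a 5 Mb bacterial genome.
depth = coverage(1_000_000, 150, 5_000_000)
needed = reads_needed(30, 150, 5_000_000)
print(f"{depth:.0f}x")  # 30x
print(needed)           # 1000000
```

The same two functions cover both directions of the calculation: depth from a finished run, and the number of reads to order for a planned depth.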
12.5 Genome Assembly and Annotation
12.5.1 Genome Assembly
Challenge: Reconstruct chromosome sequences from short reads
Approaches:
- De novo assembly: No reference genome
  - Overlap-layout-consensus: For longer reads
  - De Bruijn graphs: For short reads
- Reference-based assembly: Align to existing genome
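To make the de Bruijn graph idea concrete, here is a toy sketch, not a working assembler. It assumes error-free reads from a single strand. The reads, the value of k, and the function name are invented for illustration. Each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, so overlapping reads merge automatically into shared nodes.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy reads (hypothetical) that overlap on "GGC".
reads = ["ATGGC", "GGCGT"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from AT through TG, GG, GC, CG to GT spells out ATGGCGT, the sequence both reads came from. Real assemblers must also handle sequencing errors, repeats, and both strands.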
Metrics:
- N50: Contig length such that contigs of this length or longer contain at least 50% of the total assembly length
- Contiguity: Fewer, longer contigs/scaffolds preferred
- Completeness: BUSCO (Benchmarking Universal Single-Copy Orthologs)
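The N50 metric is easiest to understand by computing it. The sketch below assumes contig lengths are given as a plain list of integers; the function name and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Hypothetical assembly: 7 contigs totalling 300 kb (lengths in kb).
lengths_kb = [80, 70, 50, 40, 30, 20, 10]
assembly_n50 = n50(lengths_kb)
print(assembly_n50)  # 70
```

Here the two longest contigs (80 + 70 = 150 kb) already reach half of the 300 kb total, so the N50 is 70 kb. Note that N50 describes contiguity only; it says nothing about whether the contigs are correct.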
12.5.2 Genome Annotation
Process: Identifying functional elements in genome sequence
Steps:
- Repeat masking: Identify and mask repetitive elements
- Gene prediction: Identify protein-coding genes
  - Ab initio: Based on statistical properties
  - Evidence-based: Using cDNA, protein similarity
  - Combined: Both approaches
- Functional annotation: Assign functions to genes
  - Similarity searches: BLAST against databases
  - Domain identification: Pfam, InterPro
  - Pathway assignment: KEGG, Reactome
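As a rough illustration of sequence-only gene prediction, the sketch below scans the forward strand for open reading frames (ORFs): an ATG followed in frame by a stop codon. This is a naive stand-in, not how production ab initio gene finders work; those rely on statistical models of coding sequence and, in eukaryotes, must handle introns. The function name, the `min_codons` parameter, and the test sequence are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs; end is exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical sequence containing one short ORF.
seq = "CCATGAAATTTTAGGG"
orfs = find_orfs(seq)
for start, end in orfs:
    print(start, end, seq[start:end])
```

In a bacterial genome a scan like this (run on both strands, with a much larger minimum length) gives a first list of candidate genes, which evidence-based methods can then confirm or reject.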
Challenges:
- Alternative splicing: Multiple transcripts per gene
- Non-coding genes: Harder to predict
- Pseudogenes: Non-functional copies
12.5.3 Genomic Databases
Major resources:
- NCBI: GenBank, RefSeq, dbSNP
- ENSEMBL: Vertebrate genomes
- UCSC Genome Browser: Visualization and data mining
- UniProt: Protein sequences and annotation
12.6 Genome Organization and Content
12.6.1 Genome Size Variation
C-value paradox: Lack of correlation between genome size and organismal complexity
Range: From 0.16 Mb (Carsonella ruddii, bacterium) to 150 Gb (Paris japonica, plant)
Factors affecting genome size:
- Gene number: Only part of explanation
- Intron size: Varies widely
- Repetitive elements: Major contributor
- Polyploidy: Whole-genome duplication
12.6.2 Gene Content
Gene number estimates:
- Bacteria: 500-10,000 genes
- Yeast: ~6,000 genes
- Nematode: ~20,000 genes
- Human: ~20,000 protein-coding genes
- Plants: Often >30,000 genes
Surprise: Human gene number similar to “simpler” organisms
Explanation: Alternative splicing, regulatory complexity, non-coding RNAs
12.6.3 Repetitive Elements
Types:
- Tandem repeats: Satellite, minisatellite, microsatellite DNA
- Interspersed repeats: Transposable elements
  - DNA transposons: Cut-and-paste
  - Retrotransposons: Copy-and-paste via RNA intermediate
    - LTR: Long terminal repeat (retrovirus-like)
    - Non-LTR: LINEs, SINEs
Percentage of human genome:
- Transposable elements: ~45%
- Other repeats: ~5%
- Total repetitive: ~50%
Functions: Some regulate gene expression, genome evolution
12.6.4 Non-Coding Functional Elements
Regulatory elements: Promoters, enhancers, silencers, insulators
Non-coding RNAs: miRNA, siRNA, piRNA, lncRNA
Structural elements: Centromeres, telomeres
Conserved non-coding elements: Often regulatory
12.7 Comparative Genomics
12.7.1 Principles
Comparative genomics: Compare genomes across species
Goals:
- Understand evolutionary relationships
- Identify functional elements
- Study genome evolution
- Understand genetic basis of traits
12.7.2 Evolutionary Conservation
Sequence conservation: Similar sequences suggest functional importance
Phylogenetic footprinting: Conserved non-coding regions often regulatory
Types of conservation:
- Synteny: Conservation of gene order along chromosomes
- Orthology: Genes in different species derived from a common ancestral gene by speciation
- Paralogy: Genes related by duplication within a genome
12.7.3 Model Organism Genomes
Key species sequenced early:
- E. coli: Bacterium, molecular biology workhorse
- S. cerevisiae: Yeast, simple eukaryote
- C. elegans: Nematode, developmental biology
- D. melanogaster: Fruit fly, genetics
- M. musculus: Mouse, mammalian model
- A. thaliana: Plant, small genome
Human genome comparisons: ~98% identical to chimpanzee in aligned sequence; ~85% identical to mouse in protein-coding regions
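Percent-identity figures like those above depend on how the comparison is made. The sketch below shows one common convention, assuming the two sequences are already aligned and that columns containing a gap ("-") are excluded from the count. The function name and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns in which both sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Toy alignment: one mismatch and one gap column.
seq_a = "ACGTA-CGTAC"
seq_b = "ACGTTGCGTAC"
identity = percent_identity(seq_a, seq_b)
print(identity)  # 90.0
```

Counting the gap column as a difference would lower the result (9 matches over 11 columns), which is one reason published identity values for the same pair of species can differ.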
12.7.4 Genome Evolution
Mechanisms:
- Gene duplication: Whole-genome, segmental, tandem
- Horizontal gene transfer: Especially in bacteria
- Exon shuffling: New genes from existing parts
- Transposable element activity: Genome rearrangements
Rates: Vary across lineages and genomic regions
12.8 Functional Genomics
12.8.1 Transcriptomics
Study of: All RNA transcripts in cell/tissue
Technologies:
- Microarrays: Hybridization-based, known sequences
- RNA-seq: Sequencing-based, unbiased
Applications:
- Expression profiling: Which genes are expressed where/when
- Differential expression: Changes between conditions
- Isoform discovery: Alternative splicing
- Fusion genes: In cancer
Single-cell RNA-seq: Transcriptomes of individual cells
- Reveals: Cell heterogeneity, developmental trajectories
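Differential expression can be illustrated with a simple calculation: scale raw read counts to counts per million (CPM) so that samples of different sequencing depth are comparable, then take a log2 ratio between conditions. This is a sketch with made-up counts and hypothetical gene names. Real analyses use replicates and statistical models (for example, packages such as DESeq2 or edgeR) rather than a single ratio.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(b / a) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Made-up read counts for three genes in two conditions.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 400, "geneB": 300, "geneC": 300}

lfc = log2_fold_change(counts_per_million(control), counts_per_million(treated))
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

A value near +2 means roughly fourfold higher expression in the treated sample, 0 means no change, and a value near -1 means roughly half the control level.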
12.8.2 Proteomics
Study of: All proteins in cell/tissue
Technologies:
- Mass spectrometry: Identify and quantify proteins
- Protein microarrays: High-throughput protein interaction
Challenges: Dynamic range, post-translational modifications
12.8.3 Metabolomics
Study of: All metabolites in biological sample
Technologies: Mass spectrometry, NMR
Applications: Metabolic pathways, disease biomarkers
12.8.4 Interaction Networks
Protein-protein interactions: Yeast two-hybrid, co-immunoprecipitation
Genetic interactions: Synthetic lethality screens
Regulatory networks: Transcription factor binding sites
Network properties: Scale-free, small-world, modular
12.9 Human Genomics
12.9.1 Human Genome Project (HGP)
Timeline: 1990-2003
Cost: ~$3 billion
Consortium: International public effort
Private effort: Celera Genomics (Craig Venter)
Findings:
- ~20,000-25,000 protein-coding genes
- ~98% non-coding
- ~50% repetitive elements
- Many gene families
- High degree of polymorphism
12.9.2 Human Genetic Variation
Types:
- SNPs: Single nucleotide polymorphisms (most common)
- Indels: Insertions/deletions
- CNVs: Copy number variations
- Structural variants: Inversions, translocations
Databases:
- dbSNP: Catalog of human variation
- 1000 Genomes Project: Comprehensive variation map
- gnomAD: Allele frequencies across populations
Clinical significance: Some variants cause disease, most are neutral
12.9.3 Genome-Wide Association Studies (GWAS)
Approach: Scan genomes of many individuals for variants associated with trait
Method: Compare cases (with trait) vs. controls (without)
Findings:
- Complex traits: Many genes with small effects
- Missing heritability: GWAS explains only part of genetic contribution
- Caveats: Population stratification (confounding by ancestry) and participant privacy
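The case-control comparison at the heart of GWAS can be sketched for a single SNP as a chi-square test on a 2×2 table of allele counts. The counts below are invented, the function names are hypothetical, and the code assumes every row and column total is non-zero. A real GWAS repeats a test like this for a very large number of variants, corrects for multiple testing (a threshold of 5 × 10⁻⁸ is a common convention), and adjusts for ancestry.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Invented allele counts: 1,000 cases and 1,000 controls (2 alleles per person).
chi2 = allelic_chi_square(case_alt=600, case_ref=1400, control_alt=400, control_ref=1600)
p = p_value_1df(chi2)
odds_ratio = (600 * 1600) / (1400 * 400)
print(round(chi2, 2), p, round(odds_ratio, 2))
```

In this example the alternate allele is more frequent in cases than in controls (30% vs. 20%), giving an odds ratio above 1 and a chi-square statistic far larger than expected by chance.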
12.9.4 Personal Genomics
Direct-to-consumer testing: 23andMe, AncestryDNA
Information provided: Ancestry, traits, health risks
Limitations: Limited clinical validity, interpretation challenges
Clinical genomics: Diagnosis of rare diseases, cancer genomics, pharmacogenomics
12.11 Future Directions
12.11.1 Precision Medicine
Goal: Tailor medical treatment to individual characteristics
Components: Genomics, environment, lifestyle
Examples: Cancer genomics, pharmacogenomics
12.11.2 Synthetic Biology
Engineering biological systems: From parts to whole genomes
Minimal genomes: Essential genes for life
Synthetic organisms: Mycoplasma mycoides JCVI-syn3.0 (473 genes)
12.11.3 Environmental Genomics
Metagenomics: Study of microbial communities without culturing
Applications: Human microbiome, environmental monitoring, biotechnology
12.11.4 Long-Read Sequencing
Advancing technologies: Even longer reads, higher accuracy
Applications: Complete genome assembly, structural variant detection
12.11.5 Integration with Other Data
Multi-omics: Combine genomics, transcriptomics, proteomics, metabolomics
Systems biology: Holistic understanding of biological systems
12.12 Chapter Summary
12.12.1 Key Concepts
- Genomics: Study of entire genomes, enabled by high-throughput technologies
- Sequencing technologies: Evolved from Sanger to NGS to third-generation
- Genome annotation: Identifying genes and functional elements in sequence data
- Genome organization: Varies widely, with repetitive elements comprising large fractions
- Comparative genomics: Reveals evolutionary relationships and functional elements
- Functional genomics: Studies gene function at genome scale
- Human genomics: Applications in medicine, raises ethical issues
- Future directions: Precision medicine, synthetic biology, environmental genomics
12.12.2 Sequencing Technology Comparison
| Technology | Read Length | Throughput | Accuracy | Cost per Gb |
|---|---|---|---|---|
| Sanger | 500-1000 bp | Low | Very high (>99.999%) | High |
| Illumina | 50-300 bp | Very high | High (>99.9%) | Low |
| PacBio | 10-100 kb | Medium | Medium (~99%) | Medium |
| Nanopore | Up to 2 Mb | Variable | Lower (~95%) | Variable |
12.12.3 Genomic Scale Statistics
| Organism | Genome Size | Protein-coding Genes | % Coding | Year Sequenced |
|---|---|---|---|---|
| E. coli | 4.6 Mb | 4,300 | 88% | 1997 |
| Yeast | 12 Mb | 6,000 | 70% | 1996 |
| Nematode | 97 Mb | 20,000 | 27% | 1998 |
| Human | 3.2 Gb | 20,000 | 1.5% | 2001 |
| Mouse | 2.7 Gb | 23,000 | ~2% | 2002 |
| Rice | 389 Mb | 37,000 | ~10% | 2005 |
12.12.4 Major Genomic Projects
| Project | Goal | Key Findings |
|---|---|---|
| Human Genome Project | Sequence human genome | ~20,000 genes, much non-coding |
| 1000 Genomes Project | Catalog human variation | Millions of variants, population differences |
| ENCODE | Identify functional elements | ~80% of genome shows biochemical activity (functional relevance debated) |
| Human Microbiome Project | Characterize human microbes | Microbial communities affect health |
| Cancer Genome Atlas | Catalog cancer mutations | Mutational signatures, driver genes |
12.12.5 Ethical Principles for Genomics
- Respect for persons: Autonomy, informed consent
- Beneficence: Maximize benefits, minimize harms
- Justice: Fair distribution of benefits and burdens
- Privacy: Protect genetic information
- Transparency: Openness about methods and findings
12.13 Review Questions
12.13.1 Level 1: Recall and Understanding
- Define genomics and distinguish it from genetics.
- What are the main differences between Sanger sequencing and next-generation sequencing?
- What is genome annotation, and what are its main steps?
- Why is the human gene count surprisingly low compared to some “simpler” organisms?
- What are the main ethical issues in genomics?
12.13.2 Level 2: Application and Analysis
- If a genome of 100 Mb is sequenced with 300 million 100-bp reads, what is the coverage?
- Why might comparative genomics help identify functional non-coding elements?
- How could genome-wide association studies be used to find genes involved in a complex disease?
- What are the advantages and disadvantages of long-read sequencing technologies?
- How does the percentage of coding DNA vary across organisms, and what explains this variation?
12.13.3 Level 3: Synthesis and Evaluation
- Evaluate the statement: “Sequencing a genome is just the beginning—the real challenge is understanding what it means.”
- How has genomics changed our understanding of what a gene is?
- What are the potential benefits and risks of personal genomics for individuals and society?
- Design a research project using genomic approaches to study an environmental problem.
12.14 Key Terms
- Genomics: Study of genomes—their structure, function, evolution, and mapping
- Next-generation sequencing (NGS): Massively parallel DNA sequencing technologies
- Read: Short DNA sequence generated by sequencing
- Coverage: Average number of times each base is sequenced
- Genome assembly: Process of reconstructing genome sequence from reads
- Annotation: Process of identifying genes and other functional elements in genome sequence
- Comparative genomics: Comparison of genome sequences across species
- Functional genomics: Study of gene function using genome-scale approaches
- Transcriptomics: Study of all RNA transcripts in a cell or tissue
- Single nucleotide polymorphism (SNP): Single base pair variation in DNA sequence
- Genome-wide association study (GWAS): Study scanning genomes for variants associated with traits
- Metagenomics: Study of genetic material recovered directly from environmental samples
- Personal genomics: Analysis and interpretation of an individual’s genome
- Precision medicine: Medical approach tailoring treatment to individual characteristics
12.15 Further Reading
12.15.1 Books
- Gibson, G., & Muse, S. V. (2009). A Primer of Genome Science (3rd ed.). Sinauer Associates.
- Lesk, A. M. (2017). Introduction to Genomics (3rd ed.). Oxford University Press.
- Green, E. D., et al. (Eds.). (2020). Genomic Medicine. Cold Spring Harbor Laboratory Press.
12.15.2 Scientific Articles
- Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.
- Venter, J. C., et al. (2001). The sequence of the human genome. Science, 291(5507), 1304-1351.
- ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74.
12.15.3 Online Resources
- NCBI Genome: https://www.ncbi.nlm.nih.gov/genome
- ENSEMBL: https://www.ensembl.org
- UCSC Genome Browser: https://genome.ucsc.edu
- 1000 Genomes Project: https://www.internationalgenome.org
12.16 Quantitative Problems
- Sequencing Coverage:
  - A 3 Gb genome is sequenced with 300 million 150-bp reads. What is the coverage?
  - If 30× coverage is desired, how many 100-bp reads are needed?
  - If sequencing costs $0.01 per Mb, what is the cost to sequence this genome at 30× coverage?
- Genome Annotation: A 2 Mb bacterial genome has 2,000 predicted genes averaging 1 kb each.
  - What percentage of the genome is protein-coding?
  - If average intergenic region is 200 bp, how many genes are expected?
  - If RNA-seq finds expression for 1,800 genes, what percentage of predicted genes are likely real?
- Population Genetics: A SNP has allele frequencies: A=0.7, G=0.3 in population.
  - What are the expected genotype frequencies under Hardy-Weinberg equilibrium?
  - In a sample of 1000 individuals, how many heterozygotes are expected?
  - If observed heterozygotes = 350, is the population in HWE?
12.17 Case Study: Cancer Genomics
Background: Cancer results from accumulation of somatic mutations in driver genes.
Questions:
- How do whole-genome and whole-exome sequencing contribute to cancer research?
- What types of genomic alterations are commonly found in cancer?
- How can genomic information guide cancer treatment (precision oncology)?
- What are the challenges in interpreting cancer genomic data?
Data for analysis:
- Typical cancer: 1,000-10,000 somatic mutations
- Driver mutations: 3-8 per cancer
- Mutational signatures: Patterns revealing causes (UV, smoking, etc.)
- Tumor heterogeneity: Subclones with different mutations
- Liquid biopsies: Circulating tumor DNA for monitoring
End of Part III: Genetics & Molecular Biology
Next Part: Part IV: Evolution & Diversity