12  Genomics and Beyond

12.1 Learning Objectives

By the end of this chapter, you should be able to:

  1. Define genomics and distinguish it from genetics
  2. Describe major genomic technologies including sequencing, microarrays, and bioinformatics
  3. Explain how genomes are annotated and what information genomic sequences contain
  4. Compare genome organization across different organisms
  5. Describe functional genomics approaches to understand gene function
  6. Explain the principles and applications of comparative genomics
  7. Analyze ethical, legal, and social implications of genomic information
  8. Apply genomic concepts to understand human health, evolution, and biodiversity

12.2 Introduction

Genomics represents a paradigm shift in biology—from studying individual genes to analyzing entire genomes. This chapter explores how modern technologies enable us to sequence, assemble, and interpret genomes, transforming our understanding of biology, medicine, and evolution. We will examine how genomic approaches build upon classical and molecular genetics, providing comprehensive views of genetic information and its expression. As we enter the post-genomic era, we face both unprecedented opportunities and significant challenges in interpreting and applying genomic knowledge.


12.3 Introduction to Genomics

12.3.1 Defining Genomics

Genomics: Study of genomes—their structure, function, evolution, and mapping

Contrast with genetics:

  • Genetics: Study of individual genes and their inheritance
  • Genomics: Holistic study of all genes and their interactions

12.3.2 Historical Milestones

1977: First DNA genome sequenced (ΦX174 bacteriophage, 5,386 bp)

1995: First bacterial genome (Haemophilus influenzae, 1.8 Mb)

1996: First eukaryotic genome (Saccharomyces cerevisiae, 12 Mb)

1998: First multicellular organism (Caenorhabditis elegans, 97 Mb)

2000: Drosophila melanogaster (180 Mb)

2001: Human genome draft sequences published

2022: >100,000 genomes sequenced across species

12.3.3 Types of Genomics

Structural genomics: Genome sequencing and mapping

Functional genomics: Gene function analysis

Comparative genomics: Genome comparisons across species

Metagenomics: Study of microbial communities

Pharmacogenomics: Drug response based on genetics

Epigenomics: Genome-wide epigenetic patterns

12.3.4 The Genomics Revolution

Impact:

  1. Biology: New insights into evolution, development, diversity
  2. Medicine: Personalized medicine, disease gene discovery
  3. Agriculture: Crop improvement, livestock breeding
  4. Forensics: DNA fingerprinting, identification
  5. Conservation: Biodiversity assessment

12.4 Genome Sequencing Technologies

12.4.1 Sanger Sequencing

Developer: Frederick Sanger (1977)

Principle: Dideoxy chain termination

  • Dideoxynucleotides: Lack 3’-OH, terminate chain elongation
  • Four reactions: Each with different ddNTP
  • Detection: Originally radioactive, later fluorescent

Advantages: High accuracy (~99.999%)

Limitations: Low throughput, expensive per base

Human Genome Project: Primarily used Sanger sequencing

12.4.2 Next-Generation Sequencing (NGS)

Characteristics: Massively parallel, high throughput, lower cost

Major platforms:

  • Illumina: Sequencing by synthesis, reversible terminators
  • Ion Torrent: Semiconductor detection of pH changes
  • SOLiD: Sequencing by ligation

Workflow:

  1. Library preparation: Fragment DNA, add adapters
  2. Cluster amplification: On flow cell (Illumina) or beads
  3. Sequencing: Cyclic addition of nucleotides
  4. Data analysis: Base calling, alignment, variant calling

Applications: Whole-genome sequencing, exome sequencing, RNA-seq, ChIP-seq

12.4.3 Third-Generation Sequencing

Single-molecule sequencing: No amplification needed

Platforms:

  • PacBio: Real-time sequencing (SMRT technology)
  • Oxford Nanopore: Measures current changes as DNA passes through pore

Advantages:

  • Long reads: Up to 100 kb (PacBio), >1 Mb (Nanopore)
  • Direct detection: Modifications (methylation) detectable
  • Portable: MinION device (USB-sized)

Challenges: Historically higher raw error rates than short-read platforms; often requires computational correction

12.4.4 Sequencing Statistics

Coverage: Average number of times each base is sequenced

  • Human genome: Typically 30× coverage for WGS
  • Calculation: (Total bases sequenced) / (Genome size)
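
The coverage formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names and the example numbers (a hypothetical 5 Mb genome and 150-bp reads) are assumptions made for the example, and the calculation treats every read as fully usable.

```python
import math


def coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average coverage = total bases sequenced / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest whole number of reads that reaches the target coverage."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical example: 1 million 150-bp reads on a 5 Mb genome.
print(coverage(1_000_000, 150, 5_000_000))   # 30.0
print(reads_needed(30, 150, 5_000_000))      # 1000000
```

In practice the usable coverage is somewhat lower than this estimate, because some reads fail quality filters or do not map uniquely.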

Read length: Varies by technology

  • Illumina: 50-300 bp
  • PacBio: 10-100 kb
  • Nanopore: Up to 2 Mb

Cost reduction: From roughly $100 million per human genome in 2001 (the Human Genome Project as a whole cost about $3 billion) to <$1,000 (2020s)


12.5 Genome Assembly and Annotation

12.5.1 Genome Assembly

Challenge: Reconstruct chromosome sequences from short reads

Approaches:

  • De novo assembly: No reference genome
    • Overlap-layout-consensus: For longer reads
    • De Bruijn graphs: For short reads
  • Reference-based assembly: Align to existing genome
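
As a rough illustration of the De Bruijn approach listed above, the sketch below builds a graph in which nodes are (k-1)-mers and every k-mer found in a read adds an edge from its prefix to its suffix. The reads and the value of k are made up for the example, and the sketch ignores sequencing errors, reverse complements, and k-mer multiplicity, all of which real assemblers must handle.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Hypothetical overlapping reads from the sequence ATGGCGT.
reads = ["ATGGC", "TGGCG", "GGCGT"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from AT through to GT spells out the original sequence, which is the core idea behind assembling short reads without computing all pairwise overlaps.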

Metrics:

  • N50: Contig length such that contigs of this length or longer contain at least 50% of the assembled bases
  • Contiguity: Fewer, longer contigs/scaffolds preferred
  • Completeness: BUSCO (Benchmarking Universal Single-Copy Orthologs)
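
The N50 metric is easier to grasp with a small calculation. The sketch below assumes the common convention of measuring against the total assembly length (not the true genome size); the contig lengths are invented for the example.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical assembly: total 300 kb, so half is 150 kb.
print(n50([100, 80, 60, 40, 20]))  # 80
```

Here the two longest contigs (100 + 80) are the first to reach half of the 300 total, so the N50 is 80.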

12.5.2 Genome Annotation

Process: Identifying functional elements in genome sequence

Steps:

  1. Repeat masking: Identify and mask repetitive elements
  2. Gene prediction: Identify protein-coding genes
    • Ab initio: Based on statistical properties
    • Evidence-based: Using cDNA, protein similarity
    • Combined: Both approaches
  3. Functional annotation: Assign functions to genes
    • Similarity searches: BLAST against databases
    • Domain identification: Pfam, InterPro
    • Pathway assignment: KEGG, Reactome
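
To give a feel for the gene prediction step, the sketch below scans the forward strand for simple open reading frames (ATG to the first in-frame stop codon). This is a deliberately naive stand-in: real ab initio predictors rely on statistical models of coding sequence, and they also handle the reverse strand and, in eukaryotes, introns. The function name, the `min_codons` parameter, and the test sequence are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and keep scanning this frame
    return orfs


# Hypothetical sequence with one short ORF in frame 2.
example = "CCATGAAACCCGGGTAACC"
for start, end, frame in find_orfs(example):
    print(start, end, frame, example[start:end])
```

An evidence-based pipeline would then check candidates like these against cDNA or protein alignments before accepting them as genes.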

Challenges:

  • Alternative splicing: Multiple transcripts per gene
  • Non-coding genes: Harder to predict
  • Pseudogenes: Non-functional copies

12.5.3 Genomic Databases

Major resources:

  • NCBI: GenBank, RefSeq, dbSNP
  • Ensembl: Vertebrate genomes
  • UCSC Genome Browser: Visualization and data mining
  • UniProt: Protein sequences and annotation

12.6 Genome Organization and Content

12.6.1 Genome Size Variation

C-value paradox: Lack of correlation between genome size and organismal complexity

Range: From 0.16 Mb (Carsonella ruddii, bacterium) to 150 Gb (Paris japonica, plant)

Factors affecting genome size:

  • Gene number: Only part of explanation
  • Intron size: Varies widely
  • Repetitive elements: Major contributor
  • Polyploidy: Whole-genome duplication

12.6.2 Gene Content

Gene number estimates:

  • Bacteria: 500-10,000 genes
  • Yeast: ~6,000 genes
  • Nematode: ~20,000 genes
  • Human: ~20,000 protein-coding genes
  • Plants: Often >30,000 genes

Surprise: Human gene number similar to “simpler” organisms

Explanation: Alternative splicing, regulatory complexity, non-coding RNAs

12.6.3 Repetitive Elements

Types:

  1. Tandem repeats: Satellite, minisatellite, microsatellite DNA
  2. Interspersed repeats: Transposable elements
    • DNA transposons: Cut-and-paste
    • Retrotransposons: Copy-and-paste via RNA intermediate
      • LTR: Long terminal repeat (retrovirus-like)
      • Non-LTR: LINEs, SINEs

Percentage of human genome:

  • Transposable elements: ~45%
  • Other repeats: ~5%
  • Total repetitive: ~50%

Functions: Some regulate gene expression, genome evolution

12.6.4 Non-Coding Functional Elements

Regulatory elements: Promoters, enhancers, silencers, insulators

Non-coding RNAs: miRNA, siRNA, piRNA, lncRNA

Structural elements: Centromeres, telomeres

Conserved non-coding elements: Often regulatory


12.7 Comparative Genomics

12.7.1 Principles

Comparative genomics: Compare genomes across species

Goals:

  1. Understand evolutionary relationships
  2. Identify functional elements
  3. Study genome evolution
  4. Understand genetic basis of traits

12.7.2 Evolutionary Conservation

Sequence conservation: Similar sequences suggest functional importance

Phylogenetic footprinting: Conserved non-coding regions often regulatory

Types of conservation:

  • Synteny: Conservation of gene order
  • Orthology: Genes in different species that diverged from a single ancestral gene through speciation
  • Paralogy: Genes related by a duplication event rather than by speciation

12.7.3 Model Organism Genomes

Key species sequenced early:

  • E. coli: Bacterium, molecular biology workhorse
  • S. cerevisiae: Yeast, simple eukaryote
  • C. elegans: Nematode, developmental biology
  • D. melanogaster: Fruit fly, genetics
  • M. musculus: Mouse, mammalian model
  • A. thaliana: Plant, small genome

Human genome comparisons: ~98% sequence identity with chimpanzee in aligned regions; ~85% with mouse in protein-coding sequence
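
Identity figures like these come from comparing aligned sequences position by position. The sketch below shows one simple way to compute percent identity for two sequences that have already been aligned. The function name, the choice to skip gapped columns, and the toy sequences are assumptions for illustration; published genome-wide values depend heavily on which regions are aligned and how insertions and deletions are counted.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns where both sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments differing at one of ten positions.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # 90.0
```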

12.7.4 Genome Evolution

Mechanisms:

  1. Gene duplication: Whole-genome, segmental, tandem
  2. Horizontal gene transfer: Especially in bacteria
  3. Exon shuffling: New genes from existing parts
  4. Transposable element activity: Genome rearrangements

Rates: Vary across lineages and genomic regions


12.8 Functional Genomics

12.8.1 Transcriptomics

Study of: All RNA transcripts in cell/tissue

Technologies:

  • Microarrays: Hybridization-based, known sequences
  • RNA-seq: Sequencing-based, unbiased

Applications:

  • Expression profiling: Which genes are expressed where/when
  • Differential expression: Changes between conditions
  • Isoform discovery: Alternative splicing
  • Fusion genes: In cancer
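
Differential expression can be illustrated with a toy calculation: normalize raw read counts to counts per million (CPM) and compare conditions as a log2 fold change. The gene names, counts, and pseudocount below are hypothetical, and this sketch omits the replicates and statistical testing that a real analysis requires.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    Both samples are assumed to list the same genes.
    """
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treated)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in control
    }


# Hypothetical counts for two genes in two conditions.
control = {"geneA": 100, "geneB": 100}
treated = {"geneA": 300, "geneB": 100}
for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```

Note that geneB comes out as "down" even though its raw count did not change, because CPM is a relative measure. This composition effect is one reason dedicated statistical methods are used for real data.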

Single-cell RNA-seq: Transcriptomes of individual cells

  • Reveals: Cell heterogeneity, developmental trajectories

12.8.2 Proteomics

Study of: All proteins in cell/tissue

Technologies:

  • Mass spectrometry: Identify and quantify proteins
  • Protein microarrays: High-throughput protein interaction

Challenges: Dynamic range, post-translational modifications

12.8.3 Metabolomics

Study of: All metabolites in biological sample

Technologies: Mass spectrometry, NMR

Applications: Metabolic pathways, disease biomarkers

12.8.4 Interaction Networks

Protein-protein interactions: Yeast two-hybrid, co-immunoprecipitation

Genetic interactions: Synthetic lethality screens

Regulatory networks: Transcription factor binding sites

Network properties: Scale-free, small-world, modular
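
One way to see what "scale-free" means is to tabulate a network's degree distribution: many nodes have few partners, while a small number of hubs have many. The sketch below counts degrees from an edge list. The protein names and interactions are invented, and a network this small can only illustrate the bookkeeping, not demonstrate a power law.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


# Hypothetical interaction list with one hub (P1).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
distribution = degree_distribution(edges)
for k in sorted(distribution):
    print(f"degree {k}: {distribution[k]} node(s)")
```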


12.9 Human Genomics

12.9.1 Human Genome Project (HGP)

Timeline: 1990-2003

Cost: ~$3 billion

Consortium: International public effort

Private effort: Celera Genomics (Craig Venter)

Findings:

  1. ~20,000-25,000 protein-coding genes
  2. 98% non-coding
  3. ~50% repetitive elements
  4. Many gene families
  5. High degree of polymorphism

12.9.2 Human Genetic Variation

Types:

  • SNPs: Single nucleotide polymorphisms (most common)
  • Indels: Insertions/deletions
  • CNVs: Copy number variations
  • Structural variants: Inversions, translocations

Databases:

  • dbSNP: Catalog of human variation
  • 1000 Genomes Project: Comprehensive variation map
  • gnomAD: Allele frequencies across populations

Clinical significance: Some variants cause disease, most are neutral

12.9.3 Genome-Wide Association Studies (GWAS)

Approach: Scan genomes of many individuals for variants associated with trait

Method: Compare cases (with trait) vs. controls (without)
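
The case-control comparison can be sketched for a single variant as a chi-square test on allele counts. The function names and the counts below are hypothetical, and the sketch uses only the standard library. Real studies test millions of variants, so they apply much stricter significance thresholds and correct for confounders.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical allele counts: alternate allele is more common in cases.
chi2 = allelic_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
print(chi2, p_value_1df(chi2))
```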

Findings:

  • Complex traits: Many genes with small effects
  • Missing heritability: GWAS explains only part of genetic contribution
  • Caveats: Confounding by population stratification; privacy of participant data

12.9.4 Personal Genomics

Direct-to-consumer testing: 23andMe, AncestryDNA

Information provided: Ancestry, traits, health risks

Limitations: Limited clinical validity, interpretation challenges

Clinical genomics: Diagnosis of rare diseases, cancer genomics, pharmacogenomics


12.11 Future Directions

12.11.1 Precision Medicine

Goal: Tailor medical treatment to individual characteristics

Components: Genomics, environment, lifestyle

Examples: Cancer genomics, pharmacogenomics

12.11.2 Synthetic Biology

Engineering biological systems: From parts to whole genomes

Minimal genomes: Essential genes for life

Synthetic organisms: Mycoplasma mycoides JCVI-syn3.0 (473 genes)

12.11.3 Environmental Genomics

Metagenomics: Study of microbial communities without culturing

Applications: Human microbiome, environmental monitoring, biotechnology

12.11.4 Long-Read Sequencing

Advancing technologies: Even longer reads, higher accuracy

Applications: Complete genome assembly, structural variant detection

12.11.5 Integration with Other Data

Multi-omics: Combine genomics, transcriptomics, proteomics, metabolomics

Systems biology: Holistic understanding of biological systems


12.12 Chapter Summary

12.12.1 Key Concepts

  1. Genomics: Study of entire genomes, enabled by high-throughput technologies
  2. Sequencing technologies: Evolved from Sanger to NGS to third-generation
  3. Genome annotation: Identifying genes and functional elements in sequence data
  4. Genome organization: Varies widely, with repetitive elements comprising large fractions
  5. Comparative genomics: Reveals evolutionary relationships and functional elements
  6. Functional genomics: Studies gene function at genome scale
  7. Human genomics: Applications in medicine, raises ethical issues
  8. Future directions: Precision medicine, synthetic biology, environmental genomics

12.12.2 Sequencing Technology Comparison

Technology  Read Length   Throughput  Accuracy              Cost per Gb
Sanger      500-1000 bp   Low         Very high (>99.999%)  High
Illumina    50-300 bp     Very high   High (>99.9%)         Low
PacBio      10-100 kb     Medium      Medium (~99%)         Medium
Nanopore    Up to 2 Mb    Variable    Lower (~95%)          Variable

12.12.3 Genomic Scale Statistics

Organism  Genome Size  Protein-coding Genes  % Coding  Year Sequenced
E. coli   4.6 Mb       4,300                 88%       1997
Yeast     12 Mb        6,000                 70%       1996
Nematode  97 Mb        20,000                27%       1998
Human     3.2 Gb       20,000                1.5%      2001
Mouse     2.7 Gb       23,000                ~2%       2002
Rice      389 Mb       37,000                ~10%      2005

12.12.4 Major Genomic Projects

Project                   Goal                          Key Findings
Human Genome Project      Sequence human genome         ~20,000 genes, much non-coding
1000 Genomes Project      Catalog human variation       Millions of variants, population differences
ENCODE                    Identify functional elements  ~80% of genome shows biochemical activity (functional significance debated)
Human Microbiome Project  Characterize human microbes   Microbial communities affect health
Cancer Genome Atlas       Catalog cancer mutations      Mutational signatures, driver genes

12.12.5 Ethical Principles for Genomics

  1. Respect for persons: Autonomy, informed consent
  2. Beneficence: Maximize benefits, minimize harms
  3. Justice: Fair distribution of benefits and burdens
  4. Privacy: Protect genetic information
  5. Transparency: Openness about methods and findings

12.13 Review Questions

12.13.1 Level 1: Recall and Understanding

  1. Define genomics and distinguish it from genetics.
  2. What are the main differences between Sanger sequencing and next-generation sequencing?
  3. What is genome annotation, and what are its main steps?
  4. Why is the human gene count surprisingly low compared to some “simpler” organisms?
  5. What are the main ethical issues in genomics?

12.13.2 Level 2: Application and Analysis

  1. If a genome of 100 Mb is sequenced with 300 million 100-bp reads, what is the coverage?
  2. Why might comparative genomics help identify functional non-coding elements?
  3. How could genome-wide association studies be used to find genes involved in a complex disease?
  4. What are the advantages and disadvantages of long-read sequencing technologies?
  5. How does the percentage of coding DNA vary across organisms, and what explains this variation?

12.13.3 Level 3: Synthesis and Evaluation

  1. Evaluate the statement: “Sequencing a genome is just the beginning—the real challenge is understanding what it means.”
  2. How has genomics changed our understanding of what a gene is?
  3. What are the potential benefits and risks of personal genomics for individuals and society?
  4. Design a research project using genomic approaches to study an environmental problem.

12.14 Key Terms

  • Genomics: Study of genomes—their structure, function, evolution, and mapping
  • Next-generation sequencing (NGS): Massively parallel DNA sequencing technologies
  • Read: Short DNA sequence generated by sequencing
  • Coverage: Average number of times each base is sequenced
  • Genome assembly: Process of reconstructing genome sequence from reads
  • Annotation: Process of identifying genes and other functional elements in genome sequence
  • Comparative genomics: Comparison of genome sequences across species
  • Functional genomics: Study of gene function using genome-scale approaches
  • Transcriptomics: Study of all RNA transcripts in a cell or tissue
  • Single nucleotide polymorphism (SNP): Single base pair variation in DNA sequence
  • Genome-wide association study (GWAS): Study scanning genomes for variants associated with traits
  • Metagenomics: Study of genetic material recovered directly from environmental samples
  • Personal genomics: Analysis and interpretation of an individual’s genome
  • Precision medicine: Medical approach tailoring treatment to individual characteristics

12.15 Further Reading

12.15.1 Books

  1. Gibson, G., & Muse, S. V. (2009). A Primer of Genome Science (3rd ed.). Sinauer Associates.
  2. Lesk, A. M. (2017). Introduction to Genomics (3rd ed.). Oxford University Press.
  3. Green, E. D., et al. (Eds.). (2020). Genomic Medicine. Cold Spring Harbor Laboratory Press.

12.15.2 Scientific Articles

  1. Lander, E. S., et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.
  2. Venter, J. C., et al. (2001). The sequence of the human genome. Science, 291(5507), 1304-1351.
  3. ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74.

12.15.3 Online Resources

  1. NCBI Genome: https://www.ncbi.nlm.nih.gov/genome
  2. Ensembl: https://www.ensembl.org
  3. UCSC Genome Browser: https://genome.ucsc.edu
  4. 1000 Genomes Project: https://www.internationalgenome.org

12.16 Quantitative Problems

  1. Sequencing Coverage:
    1. A 3 Gb genome is sequenced with 300 million 150-bp reads. What is the coverage?
    2. If 30× coverage is desired, how many 100-bp reads are needed?
    3. If sequencing costs $0.01 per Mb, what is the cost to sequence this genome at 30× coverage?
  2. Genome Annotation: A 2 Mb bacterial genome has 2,000 predicted genes averaging 1 kb each.
    1. What percentage of the genome is protein-coding?
    2. If average intergenic region is 200 bp, how many genes are expected?
    3. If RNA-seq finds expression for 1,800 genes, what percentage of predicted genes are likely real?
  3. Population Genetics: A SNP has allele frequencies: A=0.7, G=0.3 in population.
    1. What are the expected genotype frequencies under Hardy-Weinberg equilibrium?
    2. In a sample of 1000 individuals, how many heterozygotes are expected?
    3. If observed heterozygotes = 350, is the population in HWE?

12.17 Case Study: Cancer Genomics

Background: Cancer results from accumulation of somatic mutations in driver genes.

Questions:

  1. How do whole-genome and whole-exome sequencing contribute to cancer research?
  2. What types of genomic alterations are commonly found in cancer?
  3. How can genomic information guide cancer treatment (precision oncology)?
  4. What are the challenges in interpreting cancer genomic data?

Data for analysis:

  • Typical cancer: 1,000-10,000 somatic mutations
  • Driver mutations: 3-8 per cancer
  • Mutational signatures: Patterns revealing causes (UV, smoking, etc.)
  • Tumor heterogeneity: Subclones with different mutations
  • Liquid biopsies: Circulating tumor DNA for monitoring

End of Part III: Genetics & Molecular Biology
