3 Information in Biology
3.1 Learning Objectives
By the end of this chapter, you should be able to:
- Define biological information and distinguish it from other forms of information
- Explain how information is encoded, stored, and processed in biological systems
- Describe the central dogma of molecular biology and its modern revisions
- Calculate information content of biological sequences using Shannon entropy
- Analyze error rates in biological information processing and their evolutionary significance
- Compare biological information systems with human-designed information technologies
- Explain how information theory provides insights into biological organization and evolution
3.2 Introduction
Information is as fundamental to life as energy. While energy provides the capacity for work, information provides the instructions for that work. Biological information—encoded in DNA, expressed through RNA and proteins, and processed by cellular networks—directs the development, functioning, and evolution of all living organisms. This chapter explores how living systems store, transmit, process, and interpret information, examining both the molecular mechanisms and the theoretical principles that govern biological information flow.
3.3 What is Biological Information?
3.3.1 Defining Information in Biological Contexts
Information: In biology, information refers to instructions that specify biological structure and function. Unlike the mathematical definition (reduction of uncertainty), biological information has semantic content—it means something to the cell.
Key Properties of Biological Information:
- Specificity: Information corresponds to specific biological outcomes
- Stability: Information persists across generations with reasonable fidelity
- Copyability: Information can be accurately replicated
- Interpretability: Cellular machinery can “read” and act on the information
- Evolvability: Information can change gradually through mutation and selection
3.3.2 Levels of Biological Information
Genetic Information: Encoded in DNA/RNA sequences
- Example: Gene specifying hemoglobin protein
Epigenetic Information: Chemical modifications affecting gene expression without changing DNA sequence
- Example: DNA methylation patterns
Structural Information: Three-dimensional arrangement of molecules
- Example: Protein folding determined by amino acid sequence
Cellular Information: Signaling pathways and regulatory networks
- Example: MAP kinase cascade transmitting growth signals
Organismal Information: Neural patterns and hormonal signals
- Example: Memory storage in synaptic connections
Ecological Information: Species interactions and environmental cues
- Example: Flower color patterns attracting pollinators
3.4 The Central Dogma of Molecular Biology
Formulated by Francis Crick in 1958, the central dogma describes the flow of genetic information:
3.4.1 Original Formulation
DNA → RNA → Protein
Three Classes of Transfers:
- General transfers: DNA → DNA (replication), DNA → RNA (transcription), RNA → Protein (translation)
- Special transfers: RNA → RNA (replication), RNA → DNA (reverse transcription), DNA → Protein (direct translation, demonstrated only in cell-free systems)
- Never observed: Protein → Protein, Protein → DNA, Protein → RNA
3.4.2 Molecular Mechanisms
3.4.2.1 DNA Replication
- Process: Semiconservative replication
- Enzymes: DNA polymerase, helicase, ligase
- Accuracy: Error rate ~$10^{-9}$ per base pair (after proofreading and mismatch repair)
- Proofreading: 3’→5’ exonuclease activity
3.4.2.2 Transcription
- Process: DNA → RNA
- Enzymes: RNA polymerase
- Accuracy: Error rate ~$10^{-4}$ per base
- Processing: Capping, splicing, polyadenylation
3.4.2.3 Translation
- Process: RNA → Protein
- Machinery: Ribosomes, tRNAs, translation factors
- Accuracy: Error rate ~$10^{-4}$ per amino acid
- Genetic Code: Triplet code, redundant, nearly universal
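The per-step error rates quoted above compound over the length of a molecule. As a rough sketch, assuming errors are independent and the rate is the same at every position (a simplification), the chance that a product of a given length contains no errors can be estimated as follows. The lengths used are illustrative values, not figures from this chapter.

```python
def error_free_probability(error_rate, length):
    """Probability that `length` independent copying steps all succeed."""
    return (1 - error_rate) ** length

# Illustrative lengths (assumptions): a 300-residue protein and a 1,000-base transcript
print(f"protein, 300 aa at 1e-4:     {error_free_probability(1e-4, 300):.4f}")
print(f"transcript, 1000 nt at 1e-4: {error_free_probability(1e-4, 1000):.4f}")
```

Under these assumptions, a few percent of proteins carry at least one misincorporated residue. Cells can tolerate this because each faulty molecule is only one of many copies.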
3.4.3 Modern Revisions and Exceptions
Reverse Transcription: RNA → DNA
- Examples: Retroviruses (HIV), telomerase, some transposons
RNA Replication: RNA → RNA
- Examples: RNA viruses (influenza, SARS-CoV-2)
Prions: Protein → Protein conformational change
- Examples: Mad cow disease, Creutzfeldt-Jakob disease
Epigenetic Inheritance: Information passed without DNA sequence change
- Examples: DNA methylation patterns, histone modifications
3.5 Information Theory in Biology
3.5.1 Shannon Entropy
Shannon Information (H): Measures uncertainty or information content
For a sequence with $n$ possible symbols, each with probability $p_i$:
$$H = -\sum_{i=1}^{n} p_i \log_2 p_i \quad \text{(bits per symbol)}$$
- Maximum entropy: Occurs when all symbols are equally probable
- Minimum entropy: Occurs when one symbol has probability 1
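The definition and its two limiting cases can be checked numerically. The function below is a minimal sketch; the name `shannon_entropy` is chosen here for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Return H = -sum(p * log2(p)) in bits per symbol."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Symbols with p == 0 are skipped: p * log2(p) tends to 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # four equally likely symbols: maximum entropy
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # one certain symbol: minimum entropy
```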
3.5.2 Application to DNA Sequences
DNA alphabet: 4 nucleotides (A, T, C, G)
- Maximum entropy: $H_{\max} = -4 \times (0.25 \times \log_2 0.25) = 2$ bits/base
- Actual entropy: Typically 1.8-1.9 bits/base due to non-random distribution
Coding vs. non-coding regions:
- Coding regions: Lower entropy (conserved sequences)
- Non-coding regions: Higher entropy (more variable)
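To see how base composition pulls entropy below 2 bits/base, the sketch below estimates entropy from observed nucleotide counts. Both sequences are made-up toy examples, and the estimate looks only at single-base frequencies. It ignores the order of bases, so it does not capture codon structure or repeats.

```python
import math
from collections import Counter

def sequence_entropy(seq):
    """Shannon entropy (bits/base) from the observed base frequencies of seq."""
    counts = Counter(seq.upper())
    total = len(seq)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

balanced = "ACGT" * 25                                 # toy sequence, equal base frequencies
at_rich = "A" * 40 + "T" * 40 + "C" * 10 + "G" * 10    # toy AT-rich sequence

print(f"balanced: {sequence_entropy(balanced):.3f} bits/base")
print(f"AT-rich:  {sequence_entropy(at_rich):.3f} bits/base")
```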
3.5.3 Information Content of the Genetic Code
Coding capacity: $4^3 = 64$ possible codons
Amino acids: 20 standard amino acids + stop signals
Redundancy: Multiple codons specify same amino acid (degeneracy)
Error minimization: Similar codons often specify same or similar amino acids
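Redundancy and error minimization can both be read directly off the codon table. The sketch below builds the standard genetic code (NCBI translation table 1) from its usual compact string form. It then counts the codons per amino acid and the single-base changes at each codon position that leave the encoded meaning unchanged. A change from one stop codon to another is counted as synonymous here, which is a convention adopted for this example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Number of codons assigned to each amino acid (and to the stop signal)
degeneracy = Counter(CODON_TABLE.values())

def synonymous_changes(position):
    """Count single-base changes at one codon position (0, 1 or 2) that keep the same meaning."""
    count = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base != codon[position]:
                mutant = codon[:position] + base + codon[position + 1:]
                count += CODON_TABLE[mutant] == aa
    return count

print(len(CODON_TABLE), "codons encode", len(degeneracy), "meanings (20 amino acids + stop)")
for pos in range(3):
    # Each position has 64 codons x 3 alternative bases = 192 possible changes
    print(f"position {pos + 1}: {synonymous_changes(pos)} of 192 changes are synonymous")
```

Most synonymous changes fall at the third codon position, which is one concrete sense in which similar codons specify the same amino acid.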
3.5.4 Channel Capacity and Noise
Biological channels: DNA replication, transcription, translation
Noise sources: Polymerase errors, chemical damage, environmental stress
Error correction: Proofreading, mismatch repair, spell-checking mechanisms
Channel capacity theorem: Maximum reliable information transmission rate depends on bandwidth and signal-to-noise ratio
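The dependence on bandwidth and signal-to-noise ratio is the Shannon-Hartley form of the theorem, $C = B \log_2(1 + S/N)$, which strictly applies to a channel with additive Gaussian noise. Treating a biological process as such a channel is an analogy, not an exact model. A minimal sketch with arbitrary numbers:

```python
import math

def channel_capacity(bandwidth_hz, signal_to_noise):
    """Shannon-Hartley capacity in bits per second.

    signal_to_noise is a linear power ratio (not decibels).
    """
    return bandwidth_hz * math.log2(1 + signal_to_noise)

# Arbitrary illustrative values: capacity grows only logarithmically with S/N
for snr in (1, 3, 15):
    print(f"B = 1000 Hz, S/N = {snr}: {channel_capacity(1000, snr):.0f} bits/s")
```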
3.6 DNA as an Information Storage Medium
3.6.1 Storage Capacity
Human genome: ~$3.2 \times 10^9$ base pairs
Information content: ~$6.4 \times 10^9$ bits (800 MB)
Compression: Actual information less due to redundancy and non-coding regions
Comparison with digital storage:
- DNA: ~2 bits per nucleotide
- Hard drives: ~1 bit per $10^6$ atoms
- DNA is orders of magnitude more dense
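The 800 MB figure follows from simple arithmetic: 2 bits per base pair, 8 bits per byte, and $10^6$ bytes per megabyte. The short sketch below reproduces it and shows the effect of using a measured entropy below 2 bits/base. The 1.9 value is taken from the typical range quoted earlier in this chapter.

```python
def genome_capacity_bits(base_pairs, bits_per_base=2.0):
    """Information capacity of a sequence, in bits."""
    return base_pairs * bits_per_base

def bits_to_megabytes(bits):
    """Convert bits to megabytes (1 MB = 10**6 bytes)."""
    return bits / 8 / 1e6

HUMAN_GENOME_BP = 3.2e9  # haploid genome size used in this chapter

print(bits_to_megabytes(genome_capacity_bits(HUMAN_GENOME_BP)), "MB at 2 bits/base")
print(bits_to_megabytes(genome_capacity_bits(HUMAN_GENOME_BP, 1.9)), "MB at 1.9 bits/base")
```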
3.6.2 Durability and Stability
Half-life of DNA:
- Under ideal conditions: Thousands of years
- In living cells: Constantly repaired
- Sequencing of ancient DNA: Up to ~1 million years old
Damage and repair:
- Types: Depurination, deamination, crosslinks, breaks
- Repair systems: Base excision, nucleotide excision, mismatch, double-strand break
3.6.3 Information Redundancy
Genetic code redundancy: 64 codons for 20 amino acids
Diploidy: Two copies of each gene (in most eukaryotes)
Gene families: Multiple similar genes with related functions
Regulatory redundancy: Multiple pathways achieve same outcome
3.7 Information Processing in Cells
3.7.1 Gene Regulation
Prokaryotic regulation:
- Operons: Coordinated gene clusters
- Lac operon: Inducible system for lactose metabolism
- Trp operon: Repressible system for tryptophan synthesis
Eukaryotic regulation:
- Transcriptional: Promoters, enhancers, transcription factors
- Post-transcriptional: Alternative splicing, RNA editing
- Translational: Initiation factors, microRNAs
- Post-translational: Phosphorylation, glycosylation, proteolysis
3.7.2 Signal Transduction
Information flow: Extracellular signal → Receptor → Intracellular pathway → Cellular response
Amplification: One signal molecule can activate many intracellular effectors
- Example: Epinephrine → $10^8$ glucose molecules released
Integration: Multiple signals combined for coordinated response
Specificity: Different cells respond differently to same signal
3.7.3 Neural Information Processing
- Action potentials: All-or-nothing electrical signals
- Frequency coding: Information encoded in firing rate
- Temporal coding: Information in precise timing of spikes
- Population coding: Information distributed across neuron groups
Comparison to digital computers:
- Neurons: Analog inputs, digital outputs, parallel processing
- Computers: Digital processing, serial operations
- Brain: ~$10^{11}$ neurons, ~$10^{15}$ synapses
- Computer: ~$10^{10}$ transistors
3.8 Error Rates and Their Consequences
3.8.1 Mutation Rates
Definition: Probability of error per base per replication
Variation across organisms:
- RNA viruses: $10^{-3}$ to $10^{-5}$ per base per replication
- DNA viruses: $10^{-6}$ to $10^{-8}$
- Bacteria: ~$10^{-9}$ to $10^{-10}$
- Eukaryotes: ~$10^{-9}$ to $10^{-11}$
Evolutionary trade-off:
- High mutation rates: Increased adaptability but more deleterious mutations
- Low mutation rates: Genetic stability but slower adaptation
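Per-base rates are easier to compare once they are scaled by genome size, because the number of new mutations per genome copy is what selection acts on. In the sketch below the rates are picked from within the ranges listed above. The genome sizes are round illustrative figures (assumptions), apart from the human value used elsewhere in this chapter.

```python
def mutations_per_replication(rate_per_base, genome_size):
    """Expected number of new mutations each time a genome is copied."""
    return rate_per_base * genome_size

# (rate per base per replication, genome size in bases); illustrative values
examples = {
    "RNA virus": (1e-4, 1e4),
    "bacterium": (1e-9, 5e6),
    "human": (1e-10, 3.2e9),
}

for name, (rate, size) in examples.items():
    expected = mutations_per_replication(rate, size)
    print(f"{name}: {expected:.3g} expected mutations per replication")
```

With these numbers, an RNA virus picks up about one mutation per genome copy, while the other two pick up far fewer than one, despite per-base rates that differ by several orders of magnitude.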
3.8.2 Error Correction Mechanisms
DNA replication:
- Proofreading: 3’→5’ exonuclease activity
- Mismatch repair: Post-replication correction
Transcription:
- Lower fidelity than replication (errors affect single RNA molecules)
- No proofreading in most systems
Translation:
- Ribosome accuracy mechanisms
- Kinetic proofreading: Delay before peptide bond formation
3.8.3 Consequences of Errors
Neutral mutations: No effect on fitness
Deleterious mutations: Reduce fitness
- Recessive disorders: Require two mutant copies (cystic fibrosis)
- Dominant disorders: One mutant copy sufficient (Huntington’s disease)
Advantageous mutations: Increase fitness
- Antibiotic resistance in bacteria
- Lactose tolerance in human populations
Cancer: Accumulation of mutations in growth control genes
3.9 Evolution as Information Gain
3.9.1 Information in Evolutionary Context
Fisher’s Fundamental Theorem: “The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”
Information gain through natural selection:
- Selection increases information about environment
- Populations become better “adapted” (carry more information about optimal traits)
Shannon information and fitness:
$H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the frequency of genotype $i$
Selection reduces entropy (increases information) by changing genotype frequencies
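The entropy reduction can be made concrete with a toy population. The frequencies below are hypothetical: two variants start at equal frequency and selection shifts them to 0.8 and 0.2. The drop in entropy is taken here as the information gained, a simple reading of the statement above and not a full population-genetic model.

```python
import math

def entropy_bits(frequencies):
    """Shannon entropy in bits of a set of genotype (or allele) frequencies."""
    return sum(-p * math.log2(p) for p in frequencies if p > 0)

def information_gain(before, after):
    """Entropy lost between two frequency distributions, in bits."""
    return entropy_bits(before) - entropy_bits(after)

before = [0.5, 0.5]   # hypothetical starting frequencies
after = [0.8, 0.2]    # hypothetical frequencies after selection

print(f"H before: {entropy_bits(before):.3f} bits")
print(f"H after:  {entropy_bits(after):.3f} bits")
print(f"gain:     {information_gain(before, after):.3f} bits")
```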
3.9.2 Molecular Evolution
Neutral theory: Most evolutionary change at the molecular level is due to neutral mutations
Molecular clock: Roughly constant rate of neutral evolution
Positive selection: Detectable as dN/dS > 1
- dN: Rate of nonsynonymous substitutions
- dS: Rate of synonymous substitutions
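A dN/dS ratio is usually read against 1: values well above 1 suggest positive selection, values well below 1 suggest purifying selection, and values near 1 are consistent with neutrality. The sketch below encodes that reading. The `tolerance` band is an arbitrary assumption made for this example, and real analyses rely on statistical tests instead of a fixed cutoff.

```python
def classify_selection(dn, ds, tolerance=0.1):
    """Return (dN/dS, interpretation) for given substitution rates.

    tolerance is an arbitrary band around 1 used only for this illustration.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    omega = dn / ds
    if omega > 1 + tolerance:
        return omega, "positive selection"
    if omega < 1 - tolerance:
        return omega, "purifying selection"
    return omega, "approximately neutral"

# Hypothetical rate pairs (dN, dS)
for dn, ds in [(0.3, 0.1), (0.02, 0.2), (0.1, 0.1)]:
    omega, label = classify_selection(dn, ds)
    print(f"dN={dn}, dS={ds}: dN/dS={omega:.2f} ({label})")
```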
Comparative genomics: Information content differences between species reveal evolutionary history
3.9.3 Information and Complexity
Complexity increase: Not inevitable but common in evolution
Information integration: More complex organisms process more information
Trade-offs: Information processing has energy costs
C-value paradox: Genome size doesn’t correlate with organismal complexity
- Humans: 3.2 Gb
- Lungfish: 130 Gb
- Paris japonica (plant): 150 Gb
3.10 Biological vs. Technological Information Systems
3.10.1 Comparison of Features
| Feature | Biological Systems | Human-designed Systems |
|---|---|---|
| Storage medium | DNA/RNA | Magnetic, optical, solid-state |
| Information density | ~2 bits/nm³ | ~0.01 bits/nm³ |
| Error rates | $10^{-9}$ to $10^{-11}$ | $10^{-15}$ to $10^{-18}$ |
| Repair mechanisms | Active, energy-dependent | Redundancy, error correction codes |
| Energy efficiency | ~$10^4$ kT/bit | ~$10^9$ kT/bit |
| Self-replication | Intrinsic | Not inherent |
| Evolution | Continuous, adaptive | Designed, periodic updates |
3.10.2 Lessons from Biology for Technology
- Error tolerance: Biological systems function despite errors
- Distributed processing: No central controller in cells
- Robustness: Multiple pathways to same function
- Adaptability: Systems evolve to meet changing demands
3.10.3 Synthetic Biology: Engineering Biological Information
- Genetic circuits: Designed gene networks with specified functions
- Minimal genomes: Identifying essential genetic information
- DNA data storage: Using DNA for long-term archival storage
- Biological computers: Using cells or molecules for computation
3.11 Chapter Summary
3.11.1 Key Concepts
- Biological information specifies structure and function through encoded instructions
- Central dogma: DNA → RNA → Protein (with important exceptions)
- Shannon entropy measures information content; biological sequences have less than maximum entropy
- DNA is a remarkably dense and durable information storage medium
- Cellular information processing involves gene regulation, signal transduction, and neural coding
- Error rates vary across organisms and processes; evolution balances adaptability and stability
- Natural selection increases information about environment in populations
- Biological and technological information systems differ in fundamental ways but can inform each other
3.11.2 Important Relationships
Information and energy: Information processing requires energy
- DNA replication: ~2 ATP per base pair
- Protein synthesis: ~4 ATP equivalents per amino acid
Information and evolution: More information → better adaptation (generally)
Information and complexity: More complex organisms process more information
3.11.3 Quantitative Measures
- DNA information content: Up to 2 bits per base pair
- Mutation rates: $10^{-3}$ to $10^{-11}$ per base per replication
- Genetic code: 64 codons for 20 amino acids + stops
- Human genome: $3.2 \times 10^9$ bp = ~800 MB information
- Channel capacity: Limits information transmission rate given noise
3.12 Review Questions
3.12.1 Level 1: Recall and Understanding
- Define biological information and list its key properties.
- Describe the central dogma of molecular biology and its three main transfers.
- What is Shannon entropy and how is it calculated?
- List three mechanisms cells use to correct errors in information processing.
- How does the genetic code provide error minimization?
3.12.2 Level 2: Application and Analysis
- Calculate the Shannon entropy for a DNA sequence where: P(A)=0.3, P(T)=0.3, P(C)=0.2, P(G)=0.2.
- Compare the mutation rates of RNA viruses, bacteria, and humans. What evolutionary advantages and disadvantages does each rate provide?
- Explain how alternative splicing increases the information content of a genome without increasing DNA sequence length.
- A gene has a dN/dS ratio of 2.5. What does this suggest about its evolutionary history?
3.12.3 Level 3: Synthesis and Evaluation
- Some scientists argue that epigenetic information represents a “Lamarckian” form of inheritance. Evaluate this claim in light of modern understanding of information flow in biology.
- Compare biological information storage (DNA) with human-designed storage systems. What advantages might DNA storage have for long-term archival purposes?
- How does information theory help explain why some DNA sequences are conserved across species while others vary widely?
- Design an experiment to measure the information content of a regulatory network. What would you measure and how would you interpret the results?
3.13 Key Terms
- Information: In biology, instructions that specify biological structure and function
- Central Dogma: DNA → RNA → Protein information flow
- Transcription: Synthesis of RNA from DNA template
- Translation: Synthesis of protein from mRNA template
- Genetic Code: Correspondence between nucleotide triplets and amino acids
- Shannon Entropy: Measure of information content or uncertainty
- Mutation Rate: Probability of error per base per replication
- Epigenetics: Heritable changes in gene expression not due to DNA sequence changes
- Signal Transduction: Process of converting extracellular signals into cellular responses
- Neutral Theory: Most evolutionary change at molecular level is due to neutral mutations
- dN/dS Ratio: Ratio of nonsynonymous to synonymous substitution rates
- Synthetic Biology: Engineering biological systems for new functions
3.14 Further Reading
3.14.1 Books
- Yockey, H. P. (2005). Information Theory, Evolution, and the Origin of Life. Cambridge University Press.
- Adami, C. (1998). Introduction to Artificial Life. Springer.
- Jablonka, E., & Lamb, M. J. (2014). Evolution in Four Dimensions: Genetic, Epigenetic, Behavioral, and Symbolic Variation in the History of Life. MIT Press.
- Dennett, D. C. (1995). Darwin’s Dangerous Idea: Evolution and the Meanings of Life. Simon & Schuster.
3.14.2 Scientific Articles
- Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.
- Adami, C. (2004). Information theory in molecular biology. Physics of Life Reviews, 1(1), 3-22.
3.14.3 Online Resources
- NIH Genetics Home Reference: How Genes Work
- Information Theory in Biology: Tutorials and Resources
- DNA Learning Center: Central Dogma
3.15 Quantitative Problems
Information Content Calculation:
A protein has 300 amino acids.
- What is the minimum number of DNA base pairs needed to code for it?
- If the DNA sequence has equal nucleotide frequencies, what is its information content in bits?
- If the actual sequence has H = 1.85 bits/base, what is its information content?
Error Rate Analysis: The human genome has ~$3.2 \times 10^9$ base pairs.
- With a mutation rate of $10^{-9}$ per base per replication, how many mutations occur per cell division?
- Given ~$10^{13}$ cell divisions in a human lifetime, estimate total mutations.
- Only ~2% of the genome codes for proteins. How many mutations would affect protein sequences?
Evolutionary Information: A population has two alleles: A (frequency 0.7) and a (frequency 0.3).
- Calculate the Shannon entropy of this locus.
- If selection increases frequency of A to 0.9, what is the new entropy?
- How much information has been gained by selection?
3.16 Case Study: The Genetic Code
Background: The genetic code is nearly universal, redundant, and shows patterns of error minimization.
Questions:
- Why is the genetic code redundant (multiple codons per amino acid)?
- How does the code minimize effects of point mutations?
- What evidence suggests the code is not random but optimized?
- Are there exceptions to the universal code? What do they tell us about code evolution?
- Could a different genetic code evolve? What constraints exist?
Data for analysis:
- Codon table shows systematic patterns
- Similar codons often specify same or similar amino acids
- Changes in first base often change amino acid class (hydrophobic, polar, etc.)
- Mitochondrial codes differ slightly from nuclear codes
3.17 Research Project: Information in Your Genome
Objective: Estimate the information content of a human gene.
Procedure:
- Choose a human gene from NCBI database (e.g., HBB for beta-globin)
- Obtain DNA sequence for coding region
- Calculate nucleotide frequencies
- Compute Shannon entropy per base
- Compare to maximum possible entropy (2 bits/base)
- Research: What percentage of mutations in this gene are deleterious?
- Analyze: How does information content relate to evolutionary conservation?
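The computational steps of the procedure (frequencies, entropy per base, comparison to the 2 bits/base maximum) can be sketched as below, assuming the coding sequence has been saved as a single-record FASTA text. The embedded record is a made-up placeholder, not the real HBB sequence, and the function names are chosen for illustration.

```python
import math
from collections import Counter

def read_fasta_sequence(text):
    """Join the sequence lines of a single-record FASTA string, skipping the '>' header."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">")).upper()

def nucleotide_frequencies(seq):
    """Relative frequency of A, C, G and T; ambiguity codes such as N are ignored."""
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}

def entropy_per_base(freqs):
    """Shannon entropy in bits per base."""
    return sum(-p * math.log2(p) for p in freqs.values() if p > 0)

# Placeholder record for demonstration only (NOT a real gene sequence)
example_fasta = """>placeholder_gene made-up sequence
ATGAAACCC
GGGTTTTAA
"""

seq = read_fasta_sequence(example_fasta)
freqs = nucleotide_frequencies(seq)
h = entropy_per_base(freqs)
H_MAX = 2.0  # log2(4) bits per base

print("frequencies:", {base: round(p, 3) for base, p in freqs.items()})
print(f"entropy: {h:.3f} bits/base")
print(f"redundancy relative to maximum: {1 - h / H_MAX:.1%}")
```

To run this on a real gene, replace `example_fasta` with the contents of the downloaded FASTA file. For a multi-record file, the reader would need to be extended to split on header lines.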
Extensions:
- Compare information content across species for same gene
- Analyze non-coding regions (introns, regulatory sequences)
- Estimate information loss in genetic diseases affecting this gene