22 Bioinformatics and Computational Biology
22.1 Learning Objectives
By the end of this chapter, you should be able to:
- Define bioinformatics and computational biology and distinguish their scopes
- Describe the major types of biological data and databases
- Perform basic sequence analysis including alignment, comparison, and search
- Explain approaches to genome annotation and gene prediction
- Analyze phylogenetic relationships using molecular data
- Apply basic structural bioinformatics methods to predict protein structure and function
- Evaluate systems biology approaches to model biological networks
- Discuss ethical considerations in biological data sharing and analysis
22.2 Introduction
Bioinformatics and computational biology represent the fusion of biology with computer science, statistics, and mathematics. As biological research generates ever-increasing volumes of data—from DNA sequences to protein structures to ecological observations—computational methods have become essential for storing, analyzing, and interpreting this information. These fields transform raw data into biological insight, enabling discoveries that would be impossible through traditional laboratory methods alone. From identifying disease genes to understanding evolutionary relationships to predicting protein functions, computational approaches have become integral to virtually all areas of modern biology.
22.3 Foundations of Bioinformatics
22.3.1 Definitions and Scope
Bioinformatics: Development and application of computational tools to acquire, store, analyze, and visualize biological data.
Computational biology: Development and application of theoretical methods, mathematical modeling, and computational simulation techniques to study biological systems.
Key relationships:
- Bioinformatics → Data-driven, focused on specific analysis tasks
- Computational biology → Theory-driven, focused on models and simulations
- Systems biology → Integration of multiple data types to understand systems
22.3.2 Historical Development
Early phase (1960s-1980s): Sequence databases, dynamic programming for alignment
Genomics era (1990s-2000s): Human Genome Project, large-scale data analysis
Post-genomics era (2000s-present): Multi-omics integration, systems biology, machine learning
22.3.3 Core Computational Skills
Programming: Python, R, Perl, Java
Statistics: Probability, hypothesis testing, multivariate analysis
Database management: SQL, NoSQL, data structures
Algorithms: Search, alignment, clustering, optimization
Visualization: Graphs, networks, genomic browsers
22.4 Biological Databases
22.4.1 Sequence Databases
Primary databases: Original submitted data
- GenBank (NCBI): DNA and RNA sequences
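- ENA (European Nucleotide Archive): European counterpart, hosted by EMBL-EBI
- DDBJ (DNA Data Bank of Japan): Japanese counterpart; GenBank, ENA, and DDBJ exchange data as members of the International Nucleotide Sequence Database Collaboration (INSDC)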
Secondary databases: Curated, annotated data
- RefSeq: Non-redundant, curated reference sequences
- UniProt: Protein sequences and functional information
- Ensembl: Genome annotations for vertebrates
22.4.2 Structure Databases
Protein Data Bank (PDB): 3D structures of proteins, nucleic acids, complexes
CATH, SCOP: Protein structure classification
ModBase: Comparative protein structure models
22.4.3 Specialized Databases
Gene expression: GEO, ArrayExpress
Pathways and interactions: KEGG, Reactome, STRING
Genetic variation: dbSNP, dbVar, ClinVar
Model organisms: FlyBase, WormBase, SGD, TAIR
22.4.4 Database Characteristics
Flat files: Simple text format (FASTA, GenBank format)
Relational databases: Structured tables with relationships (SQL)
Object-oriented databases: Store complex biological objects
Data formats: FASTA, FASTQ, SAM/BAM, VCF, GFF/GTF
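As an illustration of how simple the flat-file formats are, the following is a minimal sketch of a FASTA reader. It assumes well-formed input (a `>` header line followed by one or more sequence lines); the function name `parse_fasta` and the example records are hypothetical, and production work would normally rely on an established library instead.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record FASTA text
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
GGCC
"""
records = parse_fasta(example)
print(records)
```

Multi-line sequences are joined into one string, which is why the first record comes back as a single 12-base sequence.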
22.5 Sequence Analysis
22.5.1 Sequence Alignment
Pairwise alignment: Comparing two sequences
- Global alignment: Entire length alignment (Needleman-Wunsch)
- Local alignment: Best matching regions (Smith-Waterman)
- Scoring: Match/mismatch scores, gap penalties
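The two dynamic-programming recurrences can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it assumes a linear gap penalty (one cost per gap position) rather than separate opening and extension penalties, returns only the optimal score without a traceback, and uses illustrative default scores.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (linear gap penalty, score only)."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: same recurrence, but cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ACGT", "AGT"))          # 1: three matches, one gap
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))   # 3: the shared ACG
```

The only differences between the two functions are the initialization (gap penalties versus zeros), the floor at zero, and where the final score is read from (last cell versus the maximum cell).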
Multiple sequence alignment: Aligning three or more sequences
- Progressive methods: ClustalW, Clustal Omega, PRANK (phylogeny-aware)
- Iterative refinement methods: MUSCLE, MAFFT
- Conservation scores: Identify important positions
Alignment statistics:
- Identity: Percentage of identical residues
- Similarity: Percentage of similar residues (considering properties)
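- E-value: Expected number of hits with an equal or better score arising by chance in a database of that size
- Bit score: Raw alignment score normalized by the statistical parameters of the scoring system, so that scores are comparable across searches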
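The relationship between raw score, bit score, and E-value can be made concrete with Karlin-Altschul statistics, where E = Kmn e^(-λS) and the bit score is S' = (λS - ln K) / ln 2, so that E = mn 2^(-S'). The sketch below assumes these standard formulas; the parameter values in the usage lines are illustrative placeholders, not values taken from any particular scoring matrix.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalize a raw score S into bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lam * S); m = query length, n = database length."""
    return K * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Equivalent form using the bit score: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative placeholder parameters
S, lam, K, m, n = 100, 0.3, 0.1, 250, 1_000_000
bits = bit_score(S, lam, K)
print(bits, e_value(S, m, n, lam, K), e_value_from_bits(bits, m, n))
```

Because E is proportional to the search space mn, the same alignment becomes less significant as the database grows, while its bit score stays unchanged.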
22.5.2 Sequence Similarity Search
BLAST (Basic Local Alignment Search Tool): Heuristic search algorithm
- BLASTN: Nucleotide vs. nucleotide
- BLASTP: Protein vs. protein
- BLASTX: Translated nucleotide vs. protein
- TBLASTN: Protein vs. translated nucleotide
- TBLASTX: Translated nucleotide vs. translated nucleotide
Parameters: E-value threshold, word size, scoring matrix
Interpretation: Significance, coverage, biological context
22.5.3 Motif and Pattern Discovery
Sequence motifs: Short, conserved patterns with biological significance
Regular expressions: Pattern matching syntax
Position-specific scoring matrices (PSSMs): Weight matrices for motif representation
MEME, Gibbs sampling: Algorithms for discovering motifs
Databases: PROSITE, Pfam, SMART, InterPro
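A PROSITE-style pattern translates almost directly into a regular expression. The sketch below uses the N-glycosylation site pattern N-{P}-[ST]-{P} (asparagine, anything but proline, serine or threonine, anything but proline). The protein sequence is a made-up example, and the helper name `find_n_glycosylation_sites` is hypothetical.

```python
import re

# PROSITE pattern N-{P}-[ST]-{P} as a regex; the lookahead lets matches overlap
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation_sites(protein):
    """Return (1-based start position, matched residues) for every site."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(protein)]


# Hypothetical sequence: NPTG is rejected because proline follows the asparagine
sites = find_n_glycosylation_sites("MKNASANPTGNGTQ")
print(sites)  # [(3, 'NASA'), (11, 'NGTQ')]
```

A pattern match only flags a candidate site; such short motifs occur frequently by chance, so hits need biological context to be meaningful.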
22.6 Genomics and Genome Analysis
22.6.1 Genome Assembly
Sequencing technologies:
- Sanger sequencing: ~1 kb reads, high accuracy
- Next-generation sequencing (NGS): Millions of short reads (100-300 bp)
- Third-generation sequencing: Long reads (10+ kb), historically with higher per-read error rates
Assembly approaches:
- Overlap-layout-consensus: For long reads
- De Bruijn graph: For short reads
- Hybrid assembly: Combining short and long reads
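To make the De Bruijn approach concrete, the sketch below builds the graph from a set of reads: every k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is a toy illustration that assumes error-free reads from one strand; real assemblers also handle sequencing errors, reverse complements, and repeats. The reads are invented.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


# Two hypothetical overlapping reads
graph = de_bruijn(["ACGTC", "GTCA"], k=3)
print(graph)
# {'AC': ['CG'], 'CG': ['GT'], 'GT': ['TC', 'TC'], 'TC': ['CA']}
```

The repeated edge from GT to TC records that the k-mer GTC was seen twice; following the edges from AC spells out ACGTCA, the sequence both reads came from.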
Assembly metrics: N50, L50, completeness, contiguity
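N50 and L50 are easy to compute once contig lengths are sorted: N50 is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly size, and L50 is the number of contigs needed to get there. A minimal sketch with invented contig lengths:

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: total 290, so half is 145; 80 + 70 = 150 crosses it
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why it is reported alongside completeness measures.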
22.6.2 Genome Annotation
Gene prediction:
- Ab initio: Based on statistical patterns (GENSCAN, Glimmer)
- Homology-based: Using known genes from related species
- Transcript-based: Using RNA-seq data
Functional annotation:
- Similarity searches: BLAST against databases
- Domain identification: Pfam, SMART
- Gene Ontology (GO): Standardized vocabulary for gene functions
Comparative genomics: Comparing genomes across species
- Synteny: Conservation of gene order
- Orthologs: Homologous genes that diverged through a speciation event
- Paralogs: Homologous genes that diverged through a duplication event, often within the same genome
22.6.3 Variant Analysis
SNPs (Single Nucleotide Polymorphisms): Single base changes
Indels: Insertions/deletions
CNVs (Copy Number Variations): Duplications/deletions of regions
Structural variants: Large-scale rearrangements
Variant calling: Identifying variants from sequencing data
Variant annotation: Predicting functional consequences
Population genetics: Allele frequencies, selection signals
22.7 Phylogenetics and Evolution
22.7.1 Molecular Evolution Models
Nucleotide substitution models:
- Jukes-Cantor: All substitutions equally likely
- Kimura 2-parameter: Different rates for transitions/transversions
- General Time Reversible (GTR): Most general model
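The first two models have closed-form distance corrections. For an observed proportion of differing sites p, the Jukes-Cantor distance is d = -(3/4) ln(1 - 4p/3). For the Kimura 2-parameter model, with P the proportion of transitions and Q the proportion of transversions, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)). The sketch below assumes two aligned, gap-free DNA sequences of equal length; the sequences are invented.

```python
import math

PURINES = {"A", "G"}


def jc69_distance(a, b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1 - 4 * p / 3)  # fails if p >= 0.75 (saturation)


def k2p_distance(a, b):
    """Kimura 2-parameter distance: transitions and transversions counted separately."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    P, Q = transitions / len(a), transversions / len(a)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Hypothetical pair: one transition (A/G) and one transversion (C/A) in 10 sites
s1, s2 = "AACCGGTTAC", "AACCGGTTGA"
print(jc69_distance(s1, s2), k2p_distance(s1, s2))
```

Both corrected distances exceed the raw proportion of differences (0.2 here) because they account for multiple substitutions at the same site.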
Amino acid substitution models:
- PAM (Percent Accepted Mutation): Based on global alignments
- BLOSUM (Blocks Substitution Matrix): Based on local alignments
Model selection: AIC, BIC, likelihood ratio tests
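Information criteria trade goodness of fit against model complexity: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n the sample size (commonly taken to be the number of alignment sites). The log-likelihoods below are invented for illustration, and only substitution-model parameters are counted, on the assumption that all models are fitted to the same tree.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; penalizes parameters more as n grows."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model name -> (log-likelihood, free substitution parameters)
fits = {"JC69": (-2100.0, 0), "K80": (-2080.0, 1), "GTR": (-2078.0, 8)}
n_sites = 1000

best_by_aic = min(fits, key=lambda name: aic(*fits[name]))
best_by_bic = min(fits, key=lambda name: bic(*fits[name], n_sites))
print(best_by_aic, best_by_bic)  # K80 K80
```

In this made-up example GTR fits slightly better than K80, but not by enough to justify its seven extra parameters under either criterion.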
22.7.2 Phylogenetic Tree Construction
Distance methods: UPGMA, Neighbor-Joining
Character-based methods:
- Maximum Parsimony: Minimize number of changes
- Maximum Likelihood: Find the tree that maximizes the probability of the observed data under a substitution model
- Bayesian Inference: Posterior probability of trees
Tree evaluation:
- Bootstrap: Resampling to assess support
- Posterior probabilities: From Bayesian analysis
- Consensus trees: Combining multiple trees
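The resampling step of the bootstrap can be sketched as follows: alignment columns are drawn with replacement until a pseudo-alignment of the original length is obtained, and a tree is then rebuilt from each replicate. The sketch covers only the resampling; tree building is left out, and the tiny alignment is invented.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; returns an alignment of the same size."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Hypothetical three-taxon alignment; a seeded generator keeps runs repeatable
alignment = ["ACGT", "ACGA", "TCGA"]
rng = random.Random(0)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Bootstrap support for a clade is then the fraction of replicate trees in which that clade appears.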
22.7.3 Molecular Clock and Dating
Molecular clock hypothesis: Constant rate of molecular evolution
Relaxed molecular clocks: Allow rate variation across lineages
Calibration: Using fossil dates or geological events
Applications: Dating evolutionary events, studying rate variation
22.8 Structural Bioinformatics
22.8.1 Protein Structure Prediction
Ab initio prediction: From sequence and physical principles alone, without a template structure (e.g., Rosetta)
Homology modeling: Using known structures of related proteins
Threading/fold recognition: Matching sequence to known folds
Recent advances: Deep learning approaches (AlphaFold2, RoseTTAFold)
Structure validation:
- Ramachandran plot: Backbone dihedral angles
- Steric clashes: Atoms too close together
- Energy scores: Force field evaluations
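The backbone angles plotted in a Ramachandran diagram are dihedral angles defined by four consecutive atoms: phi uses C(i-1), N(i), CA(i), C(i), and psi uses N(i), CA(i), C(i), N(i+1). The sketch below computes a signed dihedral from four 3D points with the standard atan2 formulation. The coordinates are simple test geometries, not real atom positions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(u1, u2), _cross(u2, u3)  # normals of the two planes
    y = math.sqrt(_dot(u2, u2)) * _dot(u1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Test geometries around a bond along the z-axis
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis: 0.0
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # 90.0
```

Applying this to every residue of a model and plotting (phi, psi) pairs gives the Ramachandran plot; residues in sparsely populated regions are candidates for closer inspection.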
22.8.2 Molecular Docking
Protein-ligand docking: Predicting how small molecules bind proteins
Protein-protein docking: Predicting protein complexes
Methods: Rigid docking, flexible docking, induced fit
Scoring functions: Evaluate binding poses
Applications: Drug discovery, protein engineering
22.8.3 Molecular Dynamics
Simulating atomic motions: Newton’s equations of motion
Force fields: Mathematical models of interatomic forces
Timescales: Picoseconds to milliseconds
Applications: Conformational changes, ligand binding, folding/unfolding
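The core of a molecular dynamics engine is a numerical integrator for Newton's equations. The sketch below applies the velocity Verlet scheme, a common choice in MD codes, to a one-dimensional harmonic oscillator standing in for a single bond. The spring constant, mass, and time step are arbitrary illustrative values in reduced units, not parameters from a real force field.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in 1D; returns the positions and the final velocity."""
    a = force(x) / mass
    positions = [x]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # update position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # update velocity with averaged acceleration
        a = a_new
        positions.append(x)
    return positions, v


# Harmonic "bond": F = -k x with k = 1, m = 1, so the period is 2*pi
k_spring, mass = 1.0, 1.0
positions, v_final = velocity_verlet(1.0, 0.0, lambda x: -k_spring * x, mass, 0.01, 628)
energy = 0.5 * mass * v_final ** 2 + 0.5 * k_spring * positions[-1] ** 2
print(positions[-1], energy)  # close to the starting position and to the initial energy of 0.5
```

After roughly one period the particle returns near its starting point and the total energy stays close to its initial value, which is the property that makes this integrator attractive for long simulations.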
22.9 Systems Biology
22.9.1 Network Analysis
Network types:
- Protein-protein interaction networks: Physical interactions
- Gene regulatory networks: Transcriptional regulation
- Metabolic networks: Biochemical reactions
- Signal transduction networks: Cellular signaling
Network properties:
- Degree distribution: Number of connections per node
- Clustering coefficient: Tendency to form clusters
- Path length: Shortest paths between nodes
- Centrality measures: Importance of nodes
Network motifs: Recurring small patterns with specific functions
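The basic network properties can be computed directly from an adjacency structure. The sketch below uses a small invented undirected network stored as a dictionary of neighbor sets and computes node degree, the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected), and shortest path length by breadth-first search.

```python
from collections import deque

# Hypothetical interaction network: edges A-B, A-C, B-C, C-D
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def degrees(adj):
    return {node: len(neighbors) for node, neighbors in adj.items()}


def clustering_coefficient(adj, node):
    """Connected neighbor pairs divided by all possible neighbor pairs."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u in neighbors for v in neighbors if u < v and v in adj[u])
    return 2 * links / (k * (k - 1))


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


print(degrees(network))                          # C has the highest degree (3)
print(clustering_coefficient(network, "A"))      # 1.0
print(shortest_path_length(network, "A", "D"))   # 2
```

Node C is the hub of this toy network: it has the highest degree and lies on every path to D, which is the kind of observation centrality measures formalize.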
22.9.2 Omics Data Integration
Multi-omics integration: Combining genomics, transcriptomics, proteomics, metabolomics
Dimensionality reduction: PCA, t-SNE, UMAP
Clustering: K-means, hierarchical, DBSCAN
Machine learning: Classification, regression, feature selection
22.9.3 Mathematical Modeling
Ordinary differential equations (ODEs): Continuous deterministic models
Stochastic models: Incorporating randomness
Boolean networks: Simplified discrete models
Constraint-based models: Flux balance analysis for metabolism
Parameter estimation: Fitting models to data
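A minimal ODE model illustrates the deterministic approach. Assume an mRNA is produced at a constant rate k and degraded at a rate proportional to its amount, dm/dt = k - γm, which has the steady state m* = k/γ. The sketch integrates this with the forward Euler method; the rate constants are arbitrary illustrative values, and stiff or larger models would call for a dedicated ODE solver.

```python
def simulate_expression(k_production, gamma, m0, dt, n_steps):
    """Forward Euler integration of dm/dt = k_production - gamma * m."""
    m = m0
    trajectory = [m]
    for _ in range(n_steps):
        m = m + dt * (k_production - gamma * m)
        trajectory.append(m)
    return trajectory


# Illustrative rates: steady state should approach k / gamma = 4.0
trajectory = simulate_expression(k_production=2.0, gamma=0.5, m0=0.0, dt=0.01, n_steps=4000)
print(trajectory[-1])
```

Fitting k and γ so that the simulated trajectory matches measured time-course data is the parameter estimation problem in its simplest form.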
22.11 Chapter Summary
22.11.1 Key Concepts
- Bioinformatics develops computational tools for biological data analysis
- Biological databases organize and provide access to diverse data types
- Sequence analysis includes alignment, search, and motif discovery
- Genome analysis involves assembly, annotation, and variant calling
- Phylogenetics reconstructs evolutionary relationships from molecular data
- Structural bioinformatics predicts and analyzes biomolecular structures
- Systems biology models biological networks and integrates multi-omics data
- Ethical considerations include privacy, bias, and responsible data use
22.11.2 Major Database Types
| Database Category | Examples | Primary Content |
|---|---|---|
| Nucleotide sequences | GenBank, ENA, DDBJ | DNA/RNA sequences |
| Protein sequences | UniProt, RefSeq | Protein sequences, annotations |
| 3D structures | PDB, CATH, SCOP | Atomic coordinates of biomolecules |
| Gene expression | GEO, ArrayExpress | Microarray, RNA-seq data |
| Pathways | KEGG, Reactome | Metabolic/signaling pathways |
| Genetic variation | dbSNP, ClinVar | SNPs, clinical variants |
| Literature | PubMed, PubMed Central | Scientific publications |
22.11.3 Sequence Alignment Algorithms
| Algorithm Type | Method | Time Complexity | Use Case |
|---|---|---|---|
| Global alignment | Needleman-Wunsch | O(mn) | Comparing similar length sequences |
| Local alignment | Smith-Waterman | O(mn) | Finding conserved domains |
| Heuristic search | BLAST | Faster but approximate | Database searching |
| Multiple alignment | CLUSTAL, MUSCLE | O(N²L²) to O(NL²) | Aligning protein families |
| Genome alignment | BLASTZ, LASTZ | Optimized for large sequences | Comparative genomics |
22.11.4 Phylogenetic Methods
| Method | Approach | Advantages | Limitations |
|---|---|---|---|
| Distance-based | UPGMA, Neighbor-Joining | Fast, good for large datasets | Less accurate, lose character information |
| Maximum Parsimony | Minimize evolutionary changes | Intuitive, no model assumptions | Long branch attraction, less statistical foundation |
| Maximum Likelihood | Find tree that maximizes the probability of the data under a model | Statistical framework, model-based | Computationally intensive |
| Bayesian Inference | Posterior probability of trees | Incorporates prior knowledge, provides support values | Very computationally intensive |
| Network methods | Split networks, consensus networks | Show conflicting signals, recombination | Harder to interpret as trees |
22.11.5 Protein Structure Prediction Methods
| Method Type | Principle | Accuracy | Applications |
|---|---|---|---|
| Comparative modeling | Based on homologous structure | High if good template available | Most common practical method |
| Fold recognition/threading | Match to known folds | Moderate | When no clear homologs |
| Ab initio | Physical principles only | Lower, improving with deep learning | Novel folds, small proteins |
| Deep learning | Neural networks on sequence/structure | State-of-the-art (AlphaFold2) | Revolutionizing the field |
| Molecular dynamics | Simulate physical movements | Atomistic detail, but limited timescales | Refinement, dynamics, binding |
22.11.6 Systems Biology Modeling Approaches
| Model Type | Mathematical Form | Scale | Applications |
|---|---|---|---|
| Boolean networks | Discrete (0/1) states | Large networks | Gene regulation, signal transduction |
| Ordinary differential equations | Continuous, deterministic | Small to medium | Detailed kinetics, metabolic pathways |
| Stochastic models | Include randomness | Small systems | Gene expression noise, small populations |
| Constraint-based | Flux balance analysis | Genome-scale | Metabolic networks, engineering |
| Agent-based | Individual agent rules | Cellular populations | Development, ecology, immunology |
22.12 Review Questions
22.12.1 Level 1: Recall and Understanding
- What are the main differences between bioinformatics and computational biology?
- List five major types of biological databases and what they contain.
- Explain the difference between global and local sequence alignment.
- What is the purpose of BLAST, and what do the E-value and bit score represent?
- Describe the basic steps in genome annotation.
22.12.2 Level 2: Application and Analysis
- You have a protein sequence and want to predict its function. What bioinformatics tools and databases would you use, and in what order?
- Compare and contrast maximum parsimony and maximum likelihood methods for phylogenetic tree construction.
- How might biases in training data affect machine learning models in genomics, and what can be done to address this?
- Explain how multiple sequence alignment can be used to identify functionally important residues in a protein family.
- What information can be gained from integrating transcriptomic and proteomic data that couldn’t be obtained from either alone?
22.12.3 Level 3: Synthesis and Evaluation
- Design a bioinformatics pipeline for analyzing RNA-seq data to identify differentially expressed genes between cancer and normal tissues.
- Evaluate the ethical implications of direct-to-consumer genetic testing from the perspectives of privacy, interpretation, and clinical utility.
- How has the availability of complete genome sequences transformed biological research, and what challenges remain?
- Propose a strategy for making biological data more FAIR (Findable, Accessible, Interoperable, Reusable).
22.13 Key Terms
- Bioinformatics: Application of computational tools to biological data
- BLAST: Basic Local Alignment Search Tool for sequence similarity
- Genome assembly: Reconstructing complete genomes from sequencing reads
- Genome annotation: Identifying genes and their functions in genomes
- Homology: Shared ancestry between sequences or structures, usually inferred from their similarity
- Multiple sequence alignment: Alignment of three or more biological sequences
- Phylogenetics: Study of evolutionary relationships
- Protein Data Bank (PDB): Repository for 3D structural data of biological macromolecules
- Sequence alignment: Arrangement of sequences to identify regions of similarity
- Systems biology: Study of biological systems as integrated wholes
- Transcriptomics: Study of complete set of RNA transcripts
- Variant calling: Identification of genetic variants from sequencing data
22.14 Further Reading
22.14.1 Books
- Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Laboratory Press.
- Lesk, A. M. (2019). Introduction to Bioinformatics (5th ed.). Oxford University Press.
- Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
22.14.2 Scientific Articles
- Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
- Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
- 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68-74.
22.14.3 Online Resources
- NCBI (National Center for Biotechnology Information): https://www.ncbi.nlm.nih.gov
- EBI (European Bioinformatics Institute): https://www.ebi.ac.uk
- ROSALIND: Bioinformatics programming platform: http://rosalind.info
- Coursera Bioinformatics Specialization: https://www.coursera.org/specializations/bioinformatics
22.15 Quantitative Problems
- Sequence Alignment Scoring: Given the scoring scheme match = +2, mismatch = -1, gap opening = -4, gap extension = -1, align the sequences ACGTACGT and ACGACGT.
- Calculate score for optimal global alignment
- Calculate score for optimal local alignment
- If sequence length increases 10-fold, how does computation time change?
- BLAST Statistics: A BLAST search returns a hit with E-value = 1e-10. Database size = 10⁹ residues; query length = 300 residues. Use Karlin-Altschul statistics, E = Kmn e^(-λS), where K = 0.1, λ = 0.25, m = query length, n = database length, and S = raw alignment score.
- What is the bit score?
- What E-value would correspond to bit score = 50?
- If database size doubles, how does E-value change for same alignment?
- Phylogenetic Tree Probability: For 4 taxa there are 3 possible unrooted trees (and 15 possible rooted trees). Under the Jukes-Cantor model with sequence length = 1000 bases, Tree 1 has log-likelihood = -2050 and Tree 2 has log-likelihood = -2100.
- What are the relative probabilities of the two trees?
- If prior is uniform, what are posterior probabilities?
- How does increasing sequence length affect confidence?
- Genome Assembly: Genome size = 3 × 10⁹ bp, read length = 150 bp, coverage = 30×
- How many reads are needed?
- If reads have a 1% error rate, how many errors are expected per read?
- With 30× coverage, what is the probability that a given base is not covered (assuming Poisson-distributed coverage)?
22.16 Case Study: The Human Genome Project
Background: The Human Genome Project (1990-2003) was an international research effort to sequence and map all human genes. It transformed biology and medicine while raising important ethical, legal, and social issues.
Questions:
- What were the main technological innovations that made the Human Genome Project possible?
- How did the public (International Human Genome Sequencing Consortium) and private (Celera Genomics) approaches differ?
- What were the major scientific findings from the human genome sequence?
- How has the human genome sequence enabled subsequent research in genomics and medicine?
- What ethical issues were raised by the project, and how were they addressed?
Data for analysis:
- Cost: $3 billion over 13 years (public project)
- Size: ~3.2 billion base pairs, ~20,000 protein-coding genes
- Timeline: Started 1990, draft 2000, complete 2003
- Impact: ~$1 trillion economic impact (estimated)
- Sequencing cost reduction: From ~$1 per base (1990) to <$0.000001 per base (2020)
Next Chapter: Medical Applications and Biotechnology