22  Bioinformatics and Computational Biology

22.1 Learning Objectives

By the end of this chapter, you should be able to:

  1. Define bioinformatics and computational biology and distinguish their scopes
  2. Describe the major types of biological data and databases
  3. Perform basic sequence analysis including alignment, comparison, and search
  4. Explain approaches to genome annotation and gene prediction
  5. Analyze phylogenetic relationships using molecular data
  6. Apply basic structural bioinformatics methods to predict protein structure and function
  7. Evaluate systems biology approaches to model biological networks
  8. Discuss ethical considerations in biological data sharing and analysis

22.2 Introduction

Bioinformatics and computational biology represent the fusion of biology with computer science, statistics, and mathematics. As biological research generates ever-increasing volumes of data—from DNA sequences to protein structures to ecological observations—computational methods have become essential for storing, analyzing, and interpreting this information. These fields transform raw data into biological insight, enabling discoveries that would be impossible through traditional laboratory methods alone. From identifying disease genes to understanding evolutionary relationships to predicting protein functions, computational approaches have become integral to virtually all areas of modern biology.


22.3 Foundations of Bioinformatics

22.3.1 Definitions and Scope

Bioinformatics: Development and application of computational tools to acquire, store, analyze, and visualize biological data.

Computational biology: Development and application of theoretical methods, mathematical modeling, and computational simulation techniques to study biological systems.

Key relationships:

  • Bioinformatics → Data-driven, focused on specific analysis tasks
  • Computational biology → Theory-driven, focused on models and simulations
  • Systems biology → Integration of multiple data types to understand systems

22.3.2 Historical Development

Early phase (1960s-1980s): Sequence databases, dynamic programming for alignment

Genomics era (1990s-2000s): Human Genome Project, large-scale data analysis

Post-genomics era (2000s-present): Multi-omics integration, systems biology, machine learning

22.3.3 Core Computational Skills

Programming: Python, R, Perl, Java

Statistics: Probability, hypothesis testing, multivariate analysis

Database management: SQL, NoSQL, data structures

Algorithms: Search, alignment, clustering, optimization

Visualization: Graphs, networks, genomic browsers


22.4 Biological Databases

22.4.1 Sequence Databases

Primary databases: Original submitted data

  • GenBank (NCBI): DNA and RNA sequences
  • ENA (European Nucleotide Archive): European counterpart
  • DDBJ (DNA Data Bank of Japan): Japanese counterpart; together with GenBank and ENA it forms the INSDC, whose members exchange data daily

Secondary databases: Curated, annotated data

  • RefSeq: Non-redundant, curated reference sequences
  • UniProt: Protein sequences and functional information
  • Ensembl: Genome annotations for vertebrates

22.4.2 Structure Databases

Protein Data Bank (PDB): 3D structures of proteins, nucleic acids, complexes

CATH, SCOP: Protein structure classification

ModBase: Comparative protein structure models

22.4.3 Specialized Databases

Gene expression: GEO, ArrayExpress

Pathways and interactions: KEGG, Reactome, STRING

Genetic variation: dbSNP, dbVar, ClinVar

Model organisms: FlyBase, WormBase, SGD, TAIR

22.4.4 Database Characteristics

Flat files: Simple text format (FASTA, GenBank format)

Relational databases: Structured tables with relationships (SQL)

Object-oriented databases: Store complex biological objects

Data formats: FASTA, FASTQ, SAM/BAM, VCF, GFF/GTF
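
As a small illustration of the flat-file idea, the sketch below reads FASTA-formatted text into a dictionary. The function name `parse_fasta` and the two example records are hypothetical, and the parser assumes well-formed input (a `>` header line followed by one or more sequence lines); production code would more likely rely on an established library.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Record ID = first whitespace-delimited token after '>'
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[current_id] += line.upper()
    return records


# Hypothetical two-record example
example = """>seq1 illustrative record
ACGTAC
GTAC
>seq2
ttgaca
"""
fasta_records = parse_fasta(example.splitlines())
```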


22.5 Sequence Analysis

22.5.1 Sequence Alignment

Pairwise alignment: Comparing two sequences

  • Global alignment: Entire length alignment (Needleman-Wunsch)
  • Local alignment: Best matching regions (Smith-Waterman)
  • Scoring: Match/mismatch scores, gap penalties
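
The two dynamic-programming recurrences can be sketched as follows. This is a minimal version that returns only the optimal score and assumes a linear gap penalty (every gap position costs the same); affine schemes with separate opening and extension penalties need additional matrices. The function names and default scores are illustrative choices, not values from this chapter.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch: optimal score for aligning a and b end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Both functions fill an (m+1) × (n+1) table one row at a time, so the running time is O(mn) while memory stays O(n). Recovering the alignment itself would require keeping the full table for traceback.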

Multiple sequence alignment: Aligning three or more sequences

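  • Progressive methods: Build the alignment along a guide tree (CLUSTAL; PRANK is a phylogeny-aware variant)
  • Iterative refinement methods: Repeatedly realign to improve an initial progressive alignment (MUSCLE, MAFFT)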
  • Conservation scores: Identify important positions

Alignment statistics:

  • Identity: Percentage of identical residues
  • Similarity: Percentage of similar residues (considering properties)
  • E-value: Expected number of hits with an equal or better score arising by chance in a search of that size
  • Bit score: Normalized alignment score
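
These statistics can be computed directly, as in the sketch below. It assumes the standard Karlin-Altschul relations, bit score S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), and it takes percent identity over all alignment columns (other denominators, such as the shorter sequence length, are also in use). Function names are illustrative.

```python
import math


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue ('-' marks a gap)."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * identical / len(aligned_a)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score using the scoring-system parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)
```

Because the bit score already absorbs λ and K, E-values from searches that use different scoring matrices can be compared once the search-space size m · n is known.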

22.5.3 Motif and Pattern Discovery

Sequence motifs: Short, conserved patterns with biological significance

Regular expressions: Pattern matching syntax

Position-specific scoring matrices (PSSMs): Weight matrices for motif representation

MEME, Gibbs sampling: Algorithms for discovering motifs

Databases: PROSITE, Pfam, SMART, InterPro
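
A short example of regular-expression motif matching: the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} translates to the regular expression `N[^P][ST][^P]`. The sketch wraps it in a lookahead so that overlapping sites are reported; the function name and the test peptide are made up for illustration.

```python
import re
from typing import List, Tuple

# PROSITE-style pattern N-{P}-[ST]-{P} written as a Python regex.
# The lookahead (?=...) allows overlapping matches to be found.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str) -> List[Tuple[int, str]]:
    """Return (1-based position, matched residues) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(protein.upper())]


# Hypothetical peptide with two overlapping sites near the C-terminus
hits = find_motif("MKNGSANPTANNTSA")
```

A regular expression is all-or-nothing. A PSSM instead scores each position, which lets weak but genuine sites rank above unrelated sequence.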


22.6 Genomics and Genome Analysis

22.6.1 Genome Assembly

Sequencing technologies:

  • Sanger sequencing: ~1 kb reads, high accuracy
  • Next-generation sequencing (NGS): Millions of short reads (100-300 bp)
  • Third-generation sequencing: Long reads (10+ kb); historically higher error rates, though high-accuracy long-read modes now exist

Assembly approaches:

  • Overlap-layout-consensus: For long reads
  • De Bruijn graph: For short reads
  • Hybrid assembly: Combining short and long reads
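
To make the de Bruijn approach concrete, the sketch below builds the graph from a set of reads: each k-mer becomes an edge from its (k−1)-base prefix to its (k−1)-base suffix. It is a simplified illustration that assumes error-free, single-strand reads; the function name is hypothetical, and real assemblers add error correction, reverse complements, and graph simplification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer occurrence, so repeated k-mers give parallel edges
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

An assembly then corresponds to a path that uses every edge (an Eulerian path), which is why short reads are broken into k-mers rather than compared pairwise as in overlap-layout-consensus.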

Assembly metrics: N50, L50, completeness, contiguity
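
N50 and L50 are easy to misread, so a small sketch may help. N50 is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly; L50 is the number of contigs needed to get there. The function name is illustrative.

```python
from typing import Iterable, Tuple


def n50_l50(contig_lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: cumulative sum always reaches half the total")
```

For contigs of lengths 2 through 10 (total 54), the three longest contigs (10, 9, 8) sum to 27, so N50 is 8 and L50 is 3.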

22.6.2 Genome Annotation

Gene prediction:

  • Ab initio: Based on statistical patterns (GENSCAN, Glimmer)
  • Homology-based: Using known genes from related species
  • Transcript-based: Using RNA-seq data
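
The simplest sequence-only signal is an open reading frame (ORF). The sketch below scans the forward strand for ATG ... stop stretches. It is far cruder than the statistical models in tools such as GENSCAN or Glimmer, and it ignores the reverse strand, introns, and alternative start codons. The function name and the `min_codons` threshold are assumptions for illustration.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> List[Tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs.

    start/end are 0-based, end exclusive; min_codons counts codons
    before the stop codon, including the ATG.
    """
    seq = dna.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                break  # first in-frame stop ends this ORF
    return orfs
```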

Functional annotation:

  • Similarity searches: BLAST against databases
  • Domain identification: Pfam, SMART
  • Gene Ontology (GO): Standardized vocabulary for gene functions

Comparative genomics: Comparing genomes across species

  • Synteny: Conservation of gene order
  • Orthologs: Genes in different species that diverged from a common ancestral gene through speciation
  • Paralogs: Genes related through a duplication event, whether found in the same genome or in different species

22.6.3 Variant Analysis

  • SNPs (Single Nucleotide Polymorphisms): Single base changes

  • Indels: Insertions/deletions

  • CNVs (Copy Number Variations): Duplications/deletions of regions

  • Structural variants: Large-scale rearrangements

  • Variant calling: Identifying variants from sequencing data

  • Variant annotation: Predicting functional consequences

  • Population genetics: Allele frequencies, selection signals
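
Variant calling can be illustrated with a deliberately naive pileup caller: count the bases that aligned reads place at each reference position and report non-reference bases that exceed a frequency threshold. The sketch assumes ungapped alignments given as (start, sequence) pairs and ignores base qualities, indels, and diploid genotype models. The function name and thresholds are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def call_snps(
    reference: str,
    reads: List[Tuple[int, str]],
    min_depth: int = 3,
    min_alt_fraction: float = 0.2,
) -> List[Tuple[int, str, str, float]]:
    """Return (0-based position, ref base, alt base, alt fraction) for candidate SNPs."""
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence at this position
        for base, n in counts.items():
            if base != reference[pos] and n / depth >= min_alt_fraction:
                calls.append((pos, reference[pos], base, n / depth))
    return calls


# Hypothetical reads: three of four support A>T at position 4
example_reads = [(0, "ACGTAC"), (2, "GTTCGT"), (1, "CGTTCG"), (0, "ACGTTC")]
example_calls = call_snps("ACGTACGT", example_reads)
```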


22.7 Phylogenetics and Evolution

22.7.1 Molecular Evolution Models

Nucleotide substitution models:

  • Jukes-Cantor: All substitutions equally likely
  • Kimura 2-parameter: Different rates for transitions/transversions
  • General Time Reversible (GTR): Most general time-reversible model, with six exchange rates and unequal base frequencies
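
Substitution models matter because the observed fraction of differing sites underestimates the true number of substitutions once sites change more than once. The sketch below applies the standard corrections: Jukes-Cantor, d = −(3/4) ln(1 − 4p/3), and Kimura 2-parameter, d = −(1/2) ln[(1 − 2P − Q) √(1 − 2Q)], where P and Q are the transition and transversion fractions. Function names are illustrative, and columns with gaps or ambiguous bases are simply skipped.

```python
import math
from typing import Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq_a: str, seq_b: str) -> Tuple[float, float]:
    """Return (transition fraction P, transversion fraction Q) for aligned sequences."""
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return transitions / compared, transversions / compared


def jukes_cantor(p: float) -> float:
    """Corrected distance when all substitutions are equally likely."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(p_transitions: float, q_transversions: float) -> float:
    """Corrected distance with separate transition and transversion rates."""
    a = 1 - 2 * p_transitions - q_transversions
    b = 1 - 2 * q_transversions
    if a <= 0 or b <= 0:
        raise ValueError("differences too large for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))
```

When transitions make up exactly one third of the differences, as Jukes-Cantor expects, the two corrections give the same distance.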

Amino acid substitution models:

  • PAM (Percent Accepted Mutation): Based on global alignments
  • BLOSUM (Blocks Substitution Matrix): Based on local alignments

Model selection: AIC, BIC, likelihood ratio tests

22.7.2 Phylogenetic Tree Construction

Distance methods: UPGMA, Neighbor-Joining

Character-based methods:

  • Maximum Parsimony: Minimize number of changes
  • Maximum Likelihood: Find the tree that maximizes the probability of the observed sequences given the tree and substitution model
  • Bayesian Inference: Posterior probability of trees
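
Character-based methods search over tree topologies, and the size of that search space explains why heuristics are needed. For n taxa there are (2n − 5)!! unrooted and (2n − 3)!! rooted bifurcating topologies, where !! denotes the double factorial. The helper names in this sketch are illustrative.

```python
def double_factorial(n: int) -> int:
    """Product n * (n - 2) * (n - 4) * ... down to 1 or 2 (1 when n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies for n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def rooted_topologies(n_taxa: int) -> int:
    """Number of rooted bifurcating tree topologies for n_taxa labeled tips."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)
```

Four taxa give 3 unrooted and 15 rooted topologies; ten taxa already give 2,027,025 unrooted topologies, so exhaustive evaluation quickly becomes impractical.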

Tree evaluation:

  • Bootstrap: Resampling to assess support
  • Posterior probabilities: From Bayesian analysis
  • Consensus trees: Combining multiple trees
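
The resampling step of the bootstrap can be sketched as follows: alignment columns are drawn with replacement to build a pseudo-replicate of the same length, a tree is inferred from each replicate, and support for a clade is the fraction of replicate trees that contain it. Only the resampling is shown here; the function name and the toy alignment are assumptions.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement (one bootstrap pseudo-replicate)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # The same column indices are used for every taxon, keeping columns intact
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


# Toy alignment; a seeded generator makes the replicate reproducible
toy_alignment = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "AGGTTC"}
replicate = bootstrap_alignment(toy_alignment, random.Random(42))
```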

22.7.3 Molecular Clock and Dating

Molecular clock hypothesis: Molecular changes accumulate at an approximately constant rate over time and across lineages

Relaxed molecular clocks: Allow rate variation across lineages

Calibration: Using fossil dates or geological events

Applications: Dating evolutionary events, studying rate variation


22.8 Structural Bioinformatics

22.8.1 Protein Structure Prediction

Ab initio prediction: From sequence and physical principles alone, without a structural template (e.g., Rosetta)

Homology modeling: Using known structures of related proteins

Threading/fold recognition: Matching sequence to known folds

Recent advances: Deep learning approaches (AlphaFold2, RoseTTAFold)

Structure validation:

  • Ramachandran plot: Backbone dihedral angles
  • Steric clashes: Atoms too close together
  • Energy scores: Force field evaluations

22.8.2 Molecular Docking

Protein-ligand docking: Predicting how small molecules bind proteins

Protein-protein docking: Predicting protein complexes

Methods: Rigid docking, flexible docking, induced fit

Scoring functions: Evaluate binding poses

Applications: Drug discovery, protein engineering

22.8.3 Molecular Dynamics

Simulating atomic motions: Newton’s equations of motion

Force fields: Mathematical models of interatomic forces

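Timescales: Femtosecond integration steps; simulations typically span nanoseconds to microseconds, with milliseconds reachable on specialized hardware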

Applications: Conformational changes, ligand binding, folding/unfolding
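
Integrating Newton's equations numerically is the core of any molecular dynamics engine. The sketch below uses the velocity Verlet scheme, a common choice in MD codes, on a single particle attached to a harmonic spring. The one-dimensional system, the reduced units (mass = 1, spring constant = 1), and the function name are simplifying assumptions; a real force field sums bonded and non-bonded terms over many atoms.

```python
import math
from typing import Callable, List, Tuple


def velocity_verlet(
    x: float,
    v: float,
    force: Callable[[float], float],
    mass: float,
    dt: float,
    n_steps: int,
) -> List[Tuple[float, float]]:
    """Integrate m * x'' = force(x); return the (position, velocity) trajectory."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * (f / mass) * dt * dt   # position update
        f_new = force(x)                              # force at the new position
        v = v + 0.5 * (f + f_new) / mass * dt         # velocity from averaged force
        f = f_new
        trajectory.append((x, v))
    return trajectory


# Harmonic oscillator in reduced units: exact solution is x(t) = cos(t)
md_trajectory = velocity_verlet(1.0, 0.0, lambda pos: -pos, 1.0, 0.01, 1000)
final_x, final_v = md_trajectory[-1]
final_energy = 0.5 * final_v ** 2 + 0.5 * final_x ** 2
exact_x = math.cos(10.0)
```

The total energy stays close to its initial value of 0.5 over the run. That long-term stability is the main reason symplectic integrators such as this one are preferred over simple Euler steps.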


22.9 Systems Biology

22.9.1 Network Analysis

Network types:

  • Protein-protein interaction networks: Physical interactions
  • Gene regulatory networks: Transcriptional regulation
  • Metabolic networks: Biochemical reactions
  • Signal transduction networks: Cellular signaling

Network properties:

  • Degree distribution: Number of connections per node
  • Clustering coefficient: Tendency to form clusters
  • Path length: Shortest paths between nodes
  • Centrality measures: Importance of nodes
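
These properties can be computed on a small undirected graph stored as an adjacency dictionary, as sketched below. The four-node network is a made-up example, node labels are assumed to be sortable (strings here), and function names are illustrative; larger analyses would normally use a dedicated graph library.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from an edge list."""
    graph: Dict[str, Set[str]] = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def clustering_coefficient(graph: Dict[str, Set[str]], node: str) -> float:
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u in neighbors for v in neighbors if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))


def shortest_path_length(graph: Dict[str, Set[str]], source: str, target: str) -> Optional[int]:
    """Breadth-first search; returns None if target is unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical interaction network: a triangle A-B-C with D attached to C
network = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
degrees = {node: len(neighbors) for node, neighbors in network.items()}
```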

Network motifs: Recurring small patterns with specific functions

22.9.2 Omics Data Integration

Multi-omics integration: Combining genomics, transcriptomics, proteomics, metabolomics

Dimensionality reduction: PCA, t-SNE, UMAP

Clustering: K-means, hierarchical, DBSCAN

Machine learning: Classification, regression, feature selection
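
As one concrete instance of the clustering methods listed above, the sketch below implements plain K-means (Lloyd's algorithm) on tuples of numbers. Taking the first k points as starting centroids is a simplification chosen to keep the example deterministic; practical implementations use random or k-means++ initialization and several restarts. The function name and data are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, n_iter: int = 20) -> Tuple[List[Point], List[List[Point]]]:
    """Cluster points into k groups; returns (centroids, clusters)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids: List[Point] = [tuple(points[i]) for i in range(k)]  # simplistic init
    clusters: List[List[Point]] = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, clusters


# Hypothetical two-group data, e.g. samples described by two expression values
samples = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
sample_centroids, sample_clusters = kmeans(samples, 2)
```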

22.9.3 Mathematical Modeling

Ordinary differential equations (ODEs): Continuous deterministic models

Stochastic models: Incorporating randomness

Boolean networks: Simplified discrete models

Constraint-based models: Flux balance analysis for metabolism

Parameter estimation: Fitting models to data
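
A minimal ODE example is constitutive gene expression with first-order degradation, dx/dt = k_syn − k_deg · x, whose exact solution approaches the steady state k_syn / k_deg. The sketch integrates it with the forward Euler method and compares against the analytic solution. The parameter values and function names are arbitrary illustrations, and stiff or larger models would call for an adaptive solver.

```python
import math
from typing import List


def simulate_expression(k_syn: float, k_deg: float, x0: float, dt: float, n_steps: int) -> List[float]:
    """Forward Euler integration of dx/dt = k_syn - k_deg * x."""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        x += dt * (k_syn - k_deg * x)
        xs.append(x)
    return xs


def analytic_expression(k_syn: float, k_deg: float, x0: float, t: float) -> float:
    """Exact solution: exponential relaxation toward k_syn / k_deg."""
    steady = k_syn / k_deg
    return steady + (x0 - steady) * math.exp(-k_deg * t)


# Arbitrary parameters: steady state is 2.0 / 0.2 = 10 molecules
levels = simulate_expression(k_syn=2.0, k_deg=0.2, x0=0.0, dt=0.01, n_steps=5000)
exact_final = analytic_expression(2.0, 0.2, 0.0, t=50.0)
```

Parameter estimation reverses this process: k_syn and k_deg are adjusted until the simulated curve matches measured expression levels.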


22.10 Ethical and Social Considerations

22.10.2 Data Sharing and Access

Open science: Making data and methods publicly available

FAIR principles: Findable, Accessible, Interoperable, Reusable

Data repositories: NCBI, EBI, Dryad, Zenodo

Preprint servers: bioRxiv, arXiv

22.10.3 Algorithmic Bias and Fairness

Bias in training data: Underrepresentation of certain populations

Algorithmic fairness: Ensuring equitable performance across groups

Interpretability: Understanding how algorithms make decisions

Reproducibility: Ability to reproduce computational analyses

22.10.4 Dual Use and Security

Pathogen genomics: Sequencing dangerous pathogens

Synthesis screening: Preventing synthesis of harmful sequences

Cyberbiosecurity: Security of biological data and systems

Responsible conduct: Ethical guidelines for computational biologists


22.11 Chapter Summary

22.11.1 Key Concepts

  1. Bioinformatics develops computational tools for biological data analysis
  2. Biological databases organize and provide access to diverse data types
  3. Sequence analysis includes alignment, search, and motif discovery
  4. Genome analysis involves assembly, annotation, and variant calling
  5. Phylogenetics reconstructs evolutionary relationships from molecular data
  6. Structural bioinformatics predicts and analyzes biomolecular structures
  7. Systems biology models biological networks and integrates multi-omics data
  8. Ethical considerations include privacy, bias, and responsible data use

22.11.2 Major Database Types

Database Category | Examples | Primary Content
------------------|----------|----------------
Nucleotide sequences | GenBank, ENA, DDBJ | DNA/RNA sequences
Protein sequences | UniProt, RefSeq | Protein sequences, annotations
3D structures | PDB, CATH, SCOP | Atomic coordinates of biomolecules
Gene expression | GEO, ArrayExpress | Microarray, RNA-seq data
Pathways | KEGG, Reactome | Metabolic/signaling pathways
Genetic variation | dbSNP, ClinVar | SNPs, clinical variants
Literature | PubMed, PubMed Central | Scientific publications

22.11.3 Sequence Alignment Algorithms

Algorithm Type | Method | Time Complexity | Use Case
---------------|--------|-----------------|---------
Global alignment | Needleman-Wunsch | O(mn) | Comparing sequences of similar length
Local alignment | Smith-Waterman | O(mn) | Finding conserved domains
Heuristic search | BLAST | Faster but approximate | Database searching
Multiple alignment | CLUSTAL, MUSCLE | O(N²L²) to O(NL²) | Aligning protein families
Genome alignment | BLASTZ, LASTZ | Optimized for large sequences | Comparative genomics

22.11.4 Phylogenetic Methods

Method | Approach | Advantages | Limitations
-------|----------|------------|------------
Distance-based | UPGMA, Neighbor-Joining | Fast, good for large datasets | Less accurate, loses character information
Maximum Parsimony | Minimize evolutionary changes | Intuitive, no explicit substitution model | Long-branch attraction, weaker statistical foundation
Maximum Likelihood | Find tree that maximizes probability of the data | Statistical framework, model-based | Computationally intensive
Bayesian Inference | Posterior probability of trees | Incorporates prior knowledge, provides support values | Very computationally intensive
Network methods | Split networks, consensus networks | Show conflicting signals, recombination | Harder to interpret than trees

22.11.5 Protein Structure Prediction Methods

Method Type | Principle | Accuracy | Applications
------------|-----------|----------|-------------
Comparative modeling | Based on homologous structure | High if good template available | Most common practical method
Fold recognition/threading | Match to known folds | Moderate | When no clear homologs
Ab initio | Physical principles only | Lower, improving with deep learning | Novel folds, small proteins
Deep learning | Neural networks on sequence/structure | State-of-the-art (AlphaFold2) | Revolutionizing the field
Molecular dynamics | Simulate physical movements | Atomistic detail, but limited timescales | Refinement, dynamics, binding

22.11.6 Systems Biology Modeling Approaches

Model Type | Mathematical Form | Scale | Applications
-----------|-------------------|-------|-------------
Boolean networks | Discrete (0/1) states | Large networks | Gene regulation, signal transduction
Ordinary differential equations | Continuous, deterministic | Small to medium | Detailed kinetics, metabolic pathways
Stochastic models | Include randomness | Small systems | Gene expression noise, small populations
Constraint-based | Flux balance analysis | Genome-scale | Metabolic networks, engineering
Agent-based | Individual agent rules | Cellular populations | Development, ecology, immunology

22.12 Review Questions

22.12.1 Level 1: Recall and Understanding

  1. What are the main differences between bioinformatics and computational biology?
  2. List five major types of biological databases and what they contain.
  3. Explain the difference between global and local sequence alignment.
  4. What is the purpose of BLAST, and what do the E-value and bit score represent?
  5. Describe the basic steps in genome annotation.

22.12.2 Level 2: Application and Analysis

  1. You have a protein sequence and want to predict its function. What bioinformatics tools and databases would you use, and in what order?
  2. Compare and contrast maximum parsimony and maximum likelihood methods for phylogenetic tree construction.
  3. How might biases in training data affect machine learning models in genomics, and what can be done to address this?
  4. Explain how multiple sequence alignment can be used to identify functionally important residues in a protein family.
  5. What information can be gained from integrating transcriptomic and proteomic data that couldn’t be obtained from either alone?

22.12.3 Level 3: Synthesis and Evaluation

  1. Design a bioinformatics pipeline for analyzing RNA-seq data to identify differentially expressed genes between cancer and normal tissues.
  2. Evaluate the ethical implications of direct-to-consumer genetic testing from the perspectives of privacy, interpretation, and clinical utility.
  3. How has the availability of complete genome sequences transformed biological research, and what challenges remain?
  4. Propose a strategy for making biological data more FAIR (Findable, Accessible, Interoperable, Reusable).

22.13 Key Terms

  • Bioinformatics: Application of computational tools to biological data
  • BLAST: Basic Local Alignment Search Tool for sequence similarity
  • Genome assembly: Reconstructing complete genomes from sequencing reads
  • Genome annotation: Identifying genes and their functions in genomes
  • Homology: Shared ancestry between sequences or structures, usually inferred from their similarity
  • Multiple sequence alignment: Alignment of three or more biological sequences
  • Phylogenetics: Study of evolutionary relationships
  • Protein Data Bank (PDB): Repository for 3D structural data of biological macromolecules
  • Sequence alignment: Arrangement of sequences to identify regions of similarity
  • Systems biology: Study of biological systems as integrated wholes
  • Transcriptomics: Study of complete set of RNA transcripts
  • Variant calling: Identification of genetic variants from sequencing data

22.14 Further Reading

22.14.1 Books

  1. Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Laboratory Press.
  2. Lesk, A. M. (2019). Introduction to Bioinformatics (5th ed.). Oxford University Press.
  3. Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

22.14.2 Scientific Articles

  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
  2. Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
  3. 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68-74.

22.14.3 Online Resources

  1. NCBI (National Center for Biotechnology Information): https://www.ncbi.nlm.nih.gov
  2. EBI (European Bioinformatics Institute): https://www.ebi.ac.uk
  3. ROSALIND (bioinformatics programming platform): http://rosalind.info
  4. Coursera Bioinformatics Specialization: https://www.coursera.org/specializations/bioinformatics

22.15 Quantitative Problems

  1. Sequence Alignment Scoring: Given the scoring scheme match = +2, mismatch = -1, gap opening = -4, gap extension = -1, align the sequences ACGTACGT and ACGACGT.
    1. Calculate score for optimal global alignment
    2. Calculate score for optimal local alignment
    3. If sequence length increases 10-fold, how does computation time change?
  2. BLAST Statistics: A BLAST search returns a hit with E-value = 1e-10. Database size = 10⁹ residues; query length = 300 residues. Use Karlin-Altschul statistics, E = Kmn e^(-λS), where K = 0.1, λ = 0.25, m = query length, n = database length, and S = raw alignment score. The bit score is S' = (λS - ln K) / ln 2.
    1. What is the bit score?
    2. What E-value would correspond to bit score = 50?
    3. If database size doubles, how does E-value change for same alignment?
  3. Phylogenetic Tree Probability: For 4 taxa there are 3 possible unrooted tree topologies (15 rooted). Under the Jukes-Cantor model with sequence length = 1000 bases, Tree 1 has log-likelihood = -2050 and Tree 2 has log-likelihood = -2100.
    1. What are the relative probabilities of the two trees?
    2. If prior is uniform, what are posterior probabilities?
    3. How does increasing sequence length affect confidence?
  4. Genome Assembly: Genome size = 3 × 10⁹ bp, read length = 150 bp, coverage = 30×
    1. How many reads are needed?
    2. If reads have a 1% per-base error rate, how many errors are expected per read?
    3. With 30× coverage, what is the probability that a given base is not covered (assume Poisson-distributed coverage)?

22.16 Case Study: The Human Genome Project

Background: The Human Genome Project (1990-2003) was an international research effort to sequence and map all human genes. It transformed biology and medicine while raising important ethical, legal, and social issues.

Questions:

  1. What were the main technological innovations that made the Human Genome Project possible?
  2. How did the public (International Human Genome Sequencing Consortium) and private (Celera Genomics) approaches differ?
  3. What were the major scientific findings from the human genome sequence?
  4. How has the human genome sequence enabled subsequent research in genomics and medicine?
  5. What ethical issues were raised by the project, and how were they addressed?

Data for analysis:

  • Cost: $3 billion over 13 years (public project)
  • Size: ~3.2 billion base pairs, ~20,000 protein-coding genes
  • Timeline: Started 1990; working draft announced 2000 (published 2001); declared essentially complete 2003
  • Impact: ~$1 trillion economic impact (estimated)
  • Sequencing cost reduction: From ~$1 per base (1990) to <$0.000001 per base (2020)

Next Chapter: Medical Applications and Biotechnology