
The Foundation of BioGA
Multi-Objective Genetic Algorithm for Genomic Data Optimization
Dany Mukesha
2025-12-04
Source: vignettes/Foundation.Rmd
Introduction
The BioGA package implements a multi-objective Genetic Algorithm (GA) specifically tailored for the optimization and analysis of genomic data. A detailed mathematical and theoretical analysis of the algorithms implemented in the package is given below, covering all major components of the evolutionary process with formal mathematical notation, proofs, and a discussion of key properties.
The Genetic Algorithm Framework
The BioGA package employs a standard generational Genetic Algorithm structure, modified for multi-objective optimization. This framework involves the iterative evolutionary cycle of initialization, evaluation, selection, reproduction (crossover and mutation), and replacement.
Mathematical Representation
Let us define the core elements of the search space and the algorithm’s state:
- $G = (V, E)$ is the gene network, where $V = \{g_1, \dots, g_m\}$ are genes.
- $X \in \mathbb{R}^{m \times n}$ is the genomic data matrix, consisting of $m$ genes across $n$ samples.
- $P^{(t)} \in \mathbb{R}^{N \times m}$ is the population matrix at generation $t$, containing $N$ individuals (rows) and $m$ gene values (columns).
The evolution from one generation to the next, $P^{(t)} \to P^{(t+1)}$, is described by the following functional relationship:
$$P^{(t+1)} = \mathcal{R}\big(\mathcal{M}\big(\mathcal{C}\big(\mathcal{S}\big(\mathcal{F}(P^{(t)})\big)\big)\big)\big)$$
Where the functions represent the core operators:
- $\mathcal{F}$: Fitness Evaluation function.
- $\mathcal{S}$: Selection operator (NSGA-II inspired).
- $\mathcal{C}$: Crossover operator (SBX).
- $\mathcal{M}$: Mutation operator (Adaptive).
- $\mathcal{R}$: Replacement operator (Elitism + Diversity Preservation).
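The generational cycle above can be sketched as a plain loop. The following is an illustrative Python sketch, not the package's implementation; the operator arguments are stand-ins for the components described in the sections that follow:

```python
import numpy as np

def evolve(X, n_individuals, n_generations,
           evaluate, select, crossover, mutate, replace, rng=None):
    """Generic generational loop: P(t+1) = R(M(C(S(F(P(t))))))."""
    rng = rng or np.random.default_rng(0)
    m, n = X.shape
    # Initialization: each gene value is sampled from that gene's observed
    # expression values across the n samples.
    population = X[np.arange(m), rng.integers(0, n, size=(n_individuals, m))]
    for t in range(n_generations):
        fitness = evaluate(population, X)                     # F: fitness evaluation
        parents = select(population, fitness, rng)            # S: NSGA-II-style selection
        offspring = crossover(parents, rng)                   # C: SBX crossover
        offspring = mutate(offspring, t, n_generations, rng)  # M: adaptive mutation
        population = replace(population, offspring, fitness, rng)  # R: elitism + diversity
    return population
```

Each callable here is a placeholder; plugging in identity operators already yields a valid (if inert) run, which makes the loop easy to test in isolation.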
Population Initialization
The initialization phase creates the starting population, $P^{(0)}$, by drawing random samples directly from the measured genomic data $X$.
Mathematical Formulation
Given the genomic data $X \in \mathbb{R}^{m \times n}$ and a population size $N$, the value of gene $j$ in the $i$-th individual of the initial population is set as:
$$P^{(0)}_{ij} = X_{js}, \quad s \sim \text{Uniform}\{1, \dots, n\}$$
This means that for each element in $P^{(0)}$, a sample index $s$ is chosen uniformly at random from the $n$ available samples, and the corresponding gene expression value is used.
Initialization with Clustering:
If a clustering breakdown is provided (e.g., $\mathcal{C} = \{C_1, \dots, C_K\}$ is the set of gene clusters), the initialization respects the structure: the same sample index $s$ is used for all genes belonging to the same cluster $C_k$.
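This sampling scheme, including the cluster-aware variant, can be sketched as follows. This is an illustrative numpy sketch under the vignette's definitions, not the package's code; the function name and defaults are assumptions:

```python
import numpy as np

def initialize_population(X, n_individuals, clusters=None, rng=None):
    """Draw each initial gene value from the observed samples of that gene.

    X        : (m, n) matrix of m genes across n samples.
    clusters : optional length-m array of cluster labels; genes sharing a
               label draw from the same randomly chosen sample, preserving
               co-expression structure within each cluster.
    """
    rng = rng or np.random.default_rng(42)
    m, n = X.shape
    if clusters is None:
        # Independent sampling: P0[i, j] = X[j, s] with s ~ Uniform{0..n-1}.
        sample_idx = rng.integers(0, n, size=(n_individuals, m))
    else:
        clusters = np.asarray(clusters)
        sample_idx = np.empty((n_individuals, m), dtype=int)
        for label in np.unique(clusters):
            members = clusters == label
            # One sample index per (individual, cluster), shared by all member genes.
            sample_idx[:, members] = rng.integers(0, n, size=(n_individuals, 1))
    return X[np.arange(m), sample_idx]
```

Because every column of the result contains only values actually observed for that gene, the per-gene empirical distribution is preserved, which is exactly the "Distribution Preservation" property below.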
Statistical Properties
- Distribution Preservation: This method ensures that the initial population individuals maintain the original statistical distribution of expression values for each gene $j$.
- Structural Integrity: If clustering is used, the method preserves the pre-defined cluster structure within the initial population.
- Expected Value: The expected value of any element $P^{(0)}_{ij}$ over the initialization process is the mean of the observed expression for gene $j$:
$$\mathbb{E}\big[P^{(0)}_{ij}\big] = \frac{1}{n} \sum_{s=1}^{n} X_{js} = \bar{X}_j$$
Fitness Evaluation
The BioGA package utilizes a multi-objective approach,
simultaneously optimizing two conflicting goals: data fidelity
(Expression Difference) and model parsimony (Sparsity).
Objective 1: Expression Difference ($f_1$)
This function measures the squared Euclidean distance between the individual and all observed data samples, summed over all genes. It quantifies how accurately the individual reflects the overall expression patterns.
$$f_1(P_i) = \sum_{j=1}^{m} \sum_{s=1}^{n} \big(P_{ij} - X_{js}\big)^2$$
Properties of $f_1$
- Convexity: $f_1$ is a convex function, which is desirable for optimization because it guarantees that any local minimum is also a global minimum in the absence of other constraints. The minimum occurs when $P_{ij}$ equals the mean expression $\bar{X}_j$.
- Gradient (for theoretical analysis): The partial derivative with respect to the gene value $P_{ij}$ is:
$$\frac{\partial f_1}{\partial P_{ij}} = 2 \sum_{s=1}^{n} \big(P_{ij} - X_{js}\big) = 2n\big(P_{ij} - \bar{X}_j\big)$$
Objective 2: Sparsity ($f_2$)
This objective promotes parsimonious solutions by penalizing non-zero gene values. It measures the fraction of non-zero genes within an individual $P_i$:
$$f_2(P_i) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\big(|P_{ij}| > \epsilon\big)$$
Here, $\mathbb{1}(\cdot)$ is the indicator function, which equals 1 if the condition is met, and $\epsilon$ is a small positive constant used to define “non-zero.”
Properties of $f_2$
- Mathematical Form: $f_2$ is inherently non-convex and non-differentiable due to the indicator function $\mathbb{1}(\cdot)$. This makes standard gradient-based optimization methods unsuitable, justifying the use of a heuristic approach like the GA.
- Goal: It actively encourages sparse solutions, in which many gene values are effectively zero.
- Range: The range of the function is restricted to $[0, 1]$, where $f_2 = 0$ represents a fully sparse individual ($|P_{ij}| \le \epsilon$ for all $j$) and $f_2 = 1$ represents a fully dense individual.
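The two objectives are cheap to express with numpy broadcasting. A minimal sketch following the formulas above (the `eps` default is an assumed placeholder, not the package's value):

```python
import numpy as np

def expression_difference(P, X):
    """f1: summed squared distance from each individual to every sample.

    f1(P_i) = sum_j sum_s (P[i, j] - X[j, s])**2, minimized at the gene means.
    """
    # Broadcast (N, m, 1) against (1, m, n), then reduce over genes and samples.
    diff = P[:, :, None] - X[None, :, :]
    return np.sum(diff ** 2, axis=(1, 2))

def sparsity(P, eps=1e-6):
    """f2: fraction of genes whose value is numerically non-zero (in [0, 1])."""
    return np.mean(np.abs(P) > eps, axis=1)
```

Note the trade-off these induce: the all-zero individual minimizes $f_2$ but generally not $f_1$, while the gene-wise mean minimizes $f_1$ but is typically dense, so no single individual minimizes both.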
Selection (NSGA-II Inspired)
The selection mechanism is based on the principles of the Non-dominated Sorting Genetic Algorithm II (NSGA-II), ensuring that evolutionary pressure drives the population toward the Pareto front.
Domination Criteria
Individual $P_a$ dominates individual $P_b$ (denoted $P_a \prec P_b$) if and only if:
- Individual $P_a$ is no worse than $P_b$ on all objectives, and
- Individual $P_a$ is strictly better than $P_b$ on at least one objective.
Formally (for minimization):
$$P_a \prec P_b \iff \forall k:\ f_k(P_a) \le f_k(P_b) \ \ \text{and} \ \ \exists k:\ f_k(P_a) < f_k(P_b)$$
Proof of Partial Order
The two-objective domination relation defines a partial order on the solution set because it satisfies the following properties:
- Irreflexive: No individual dominates itself. If $P_a = P_b$, the strict condition ($\exists k: f_k(P_a) < f_k(P_b)$) is false.
- Asymmetric: If $P_a \prec P_b$, then $P_b$ cannot dominate $P_a$. Since $P_a$ is strictly better on at least one objective $k$, $P_b$ is strictly worse on that objective ($f_k(P_b) > f_k(P_a)$), so the “no worse on all objectives” condition fails for $P_b$.
- Transitive: If $P_a \prec P_b$ and $P_b \prec P_c$, then $P_a \prec P_c$. This follows directly from the transitivity of the combined $\le$ and $<$ relations.
Front Construction
The population is sorted into non-dominated fronts:
- Domination Metrics: For every individual $P_i$, we compute its domination count $n_i$ (the number of individuals that dominate it) and its dominated set $S_i$ (the set of individuals it dominates).
- First Front ($F_1$): This front consists of all individuals whose domination count is $n_i = 0$. These are the current non-dominated solutions.
- Subsequent Fronts: Individuals in $F_1$ are temporarily removed, and the domination counts of the individuals they dominated are reduced by one. The next front, $F_2$, consists of all remaining individuals whose updated domination count is $0$. This process continues until the entire population is sorted.
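The three steps above can be sketched directly. An illustrative implementation assuming an $(N, k)$ objective matrix with all objectives minimized (not the package's code):

```python
import numpy as np

def non_dominated_sort(F):
    """Sort rows of an (N, k) objective matrix into Pareto fronts.

    Returns a list of index lists; fronts[0] holds the non-dominated solutions.
    """
    N = F.shape[0]
    # a dominates b: no worse on every objective and strictly better on one.
    no_worse = np.all(F[:, None, :] <= F[None, :, :], axis=2)
    better = np.any(F[:, None, :] < F[None, :, :], axis=2)
    dominates = no_worse & better            # dominates[a, b] is True if a ≺ b
    remaining = dominates.sum(axis=0)        # domination count n_i per individual
    fronts = []
    assigned = np.zeros(N, dtype=bool)
    while not assigned.all():
        current = np.where((remaining == 0) & ~assigned)[0]
        fronts.append(current.tolist())
        assigned[current] = True
        # Removing this front lowers the counts of everything it dominated.
        remaining = remaining - dominates[current].sum(axis=0)
    return fronts
```

The pairwise comparison matrices make the $O(kN^2)$ cost of the comparison phase explicit: every pair of rows is compared on every objective.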
Theorem on Computational Complexity
The non-dominated sorting and front construction algorithm terminates in time quadratic in the population size.
Theorem: The front construction algorithm terminates in $O(kN^2)$ time, where $N$ is the population size and $k$ is the number of objectives.
Proof Sketch:
- The pairwise domination check between any two individuals is an $O(k)$ operation.
- Computing the domination count and dominated set for all individuals requires $O(kN^2)$ time, as it involves $O(N^2)$ pairwise comparisons.
- Once these metrics are computed, the iterative process of assigning individuals to fronts requires $O(N)$ time per front, so the overall complexity is dominated by the initial comparison phase.
Crossover (Simulated Binary Crossover - SBX)
The Simulated Binary Crossover (SBX) is a prominent operator for real-coded GAs, designed to mimic the convergence properties of binary crossover while operating directly on continuous variables.
Given two parent individuals, $P_a, P_b \in \mathbb{R}^m$, the SBX operator generates a new offspring vector $O$. For each gene $j$, with crossover probability $p_c$ the offspring gene is determined as follows:
1. A random number is drawn: $u \sim U(0, 1)$.
2. The spreading factor $\beta$ is computed from the distribution index $\eta$:
$$\beta = \begin{cases} (2u)^{\frac{1}{\eta+1}} & \text{if } u \le 0.5 \\ \left(\dfrac{1}{2(1-u)}\right)^{\frac{1}{\eta+1}} & \text{otherwise} \end{cases}$$
3. The gene value is calculated:
$$O_j = \tfrac{1}{2}\big[(1+\beta)\,P_{a,j} + (1-\beta)\,P_{b,j}\big]$$
Otherwise (with probability $1 - p_c$), the gene is copied unchanged from a parent: $O_j = P_{a,j}$.
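A compact sketch of this procedure, with illustrative defaults for `eta` and `p_crossover` (the package's actual parameter values are not assumed here):

```python
import numpy as np

def sbx_offspring(p_a, p_b, eta=2.0, p_crossover=0.9, rng=None):
    """Simulated Binary Crossover producing one offspring from two parents."""
    rng = rng or np.random.default_rng(7)
    u = rng.random(p_a.shape)
    # Spreading factor from the SBX polynomial distribution.
    beta = np.where(u <= 0.5,
                    (2 * u) ** (1 / (eta + 1)),
                    (1 / (2 * (1 - u))) ** (1 / (eta + 1)))
    child = 0.5 * ((1 + beta) * p_a + (1 - beta) * p_b)
    # Genes that skip crossover are copied from the first parent.
    do_cross = rng.random(p_a.shape) < p_crossover
    return np.where(do_cross, child, p_a)
```

One useful sanity check falls out of the formula: when the parents are identical, the $\beta$ term cancels and the offspring equals the parents regardless of the random draw.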
Properties of SBX
- Mean Preservation: The expected value of the offspring gene is the average of the parents’ genes: $\mathbb{E}[O_j] = \frac{P_{a,j} + P_{b,j}}{2}$.
- Distribution Control: The distribution index $\eta$ controls the proximity of the offspring to the parents.
- Low $\eta$: Produces a wide spread and approaches uniform crossover, generating many extreme values.
- High $\eta$: Clusters offspring tightly around the parents, approaching no crossover.
Mutation
The mutation operation introduces small random changes to the offspring for exploration. BioGA uses an adaptive mutation rate and incorporates network constraints.
For each gene $j$, mutation is applied with adaptive probability $p_m(t)$:
$$p_m(t) = p_0\left(1 + \frac{t}{T}\right)$$
where $p_0$ is the initial rate, $t$ is the current generation, and $T$ is the total number of generations. This rate increases linearly over time to enhance exploration in later stages.
The change in the gene value, $\Delta_{ij}$, is applied as follows:
1. A Gaussian perturbation is generated: $\delta \sim \mathcal{N}(0, \sigma^2)$.
2. If a gene network is provided (with $A$ being the adjacency matrix), the mutation is penalized by the network connectivity:
$$\Delta_{ij} = \frac{\delta}{1 + \sum_{l} A_{jl}}$$
3. If no network is provided, standard Gaussian mutation is applied: $\Delta_{ij} = \delta$.
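Both branches can be combined in one routine. A sketch under the formulas above; the linear schedule, the damping term, and the `p0`/`sigma` defaults are reconstructions for illustration, not the package's settings:

```python
import numpy as np

def adaptive_mutate(P, t, T, A=None, p0=0.05, sigma=0.1, rng=None):
    """Adaptive, optionally network-constrained Gaussian mutation.

    Each gene mutates with probability p_m(t) = p0 * (1 + t/T), which grows
    linearly with the generation t. If an (m, m) adjacency matrix A is given,
    perturbations of highly connected genes are damped by 1 / (1 + degree).
    """
    rng = rng or np.random.default_rng(3)
    P = P.copy()
    rate = p0 * (1 + t / T)
    mask = rng.random(P.shape) < rate           # which genes mutate this round
    delta = rng.normal(0.0, sigma, size=P.shape)
    if A is not None:
        degree = A.sum(axis=1)                  # connectivity of each gene
        delta = delta / (1.0 + degree)          # hub genes mutate less
    P[mask] += delta[mask]
    return P
```

The damping divisor makes the variance-control property below concrete: the perturbation applied to gene $j$ shrinks quadratically in its connectivity.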
Properties of Adaptive Network Mutation
- Adaptive Rate: The mutation rate $p_m(t)$ monotonically increases with generation $t$, preventing premature convergence and maintaining diversity late in the run.
- Network Constraint Impact: The term $\frac{1}{1 + \sum_l A_{jl}}$ acts as a damping coefficient. Highly connected genes (large $\sum_l A_{jl}$) experience a reduced magnitude of mutation, preserving known network structures.
- Expected Change: Since $\mathbb{E}[\delta] = 0$, the expected change remains zero: $\mathbb{E}[\Delta_{ij}] = 0$.
- Variance Control: For the network-constrained case, the variance of the change is:
$$\mathrm{Var}(\Delta_{ij}) = \frac{\sigma^2}{\left(1 + \sum_l A_{jl}\right)^2}$$
This mathematical structure formalizes how network relationships dynamically control evolutionary exploration.
Replacement
The replacement operator constructs the next population $P^{(t+1)}$ from the current population $P^{(t)}$ and the offspring $O$. It employs an Elitism strategy combined with a metric for diversity preservation.
- Elitism: The best individual, $P^* = \arg\min_i f_1(P_i)$, defined as the solution minimizing the primary objective $f_1$, is always retained in the next generation: $P^* \in P^{(t+1)}$.
- Diversity-Preserving Replacement: For the remaining $N - 1$ slots in $P^{(t+1)}$, a non-elitist replacement strategy is performed iteratively:
  - Randomly select an individual $P_i$ from the current population $P^{(t)}$.
  - Select an offspring $O_i$ from the newly generated offspring $O$.
  - Replace $P_i$ with $O_i$ only if the increase in diversity is significant, $d(O_i, P_i) > \tau$, where the diversity metric $d$ is the squared Euclidean distance: $d(x, y) = \sum_j (x_j - y_j)^2$.
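A minimal sketch of this replacement step, assuming a threshold parameter `tau` and the $f_1$ values of the current population (illustrative names, not the package's API):

```python
import numpy as np

def replace_population(P, O, f1, tau=0.0, rng=None):
    """Elitist, diversity-aware replacement.

    The f1-best individual is always kept; every other slot accepts a random
    offspring only when its squared Euclidean distance to the individual it
    would replace exceeds tau.
    """
    rng = rng or np.random.default_rng(11)
    P_next = P.copy()
    elite = int(np.argmin(f1))                  # elitism on the primary objective
    for i in range(len(P)):
        if i == elite:
            continue                            # never replace the elite
        candidate = O[rng.integers(len(O))]
        if np.sum((candidate - P[i]) ** 2) > tau:
            P_next[i] = candidate
    return P_next
```

Setting `tau` very high degenerates to pure elitist survival of the old population; `tau = 0` accepts any offspring that differs at all, so the threshold directly tunes how aggressively diversity is enforced.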
Theorem on Population Quality
Theorem: This replacement strategy preserves elitism while ensuring that the population’s diversity does not collapse into a single point.
Proof Sketch:
- Elitism Preservation (Quality): By explicitly retaining $P^*$, the search process ensures that the best solution found so far on the primary objective is never lost. This guarantees monotonic, non-decreasing convergence quality on $f_1$.
- Diversity Maintenance: Replacements are conditional on $d(O_i, P_i) > \tau$. Since only replacements that keep the pairwise distance (and hence diversity) above the threshold are accepted, the expected diversity of the population is non-decreasing over the generations, effectively counteracting sampling drift.
Convergence Analysis
The convergence behavior of the BioGA algorithm follows from established Multi-objective Evolutionary Algorithm (MOEA) theory, which guarantees convergence to the optimal set under general conditions.
Assumptions
For convergence proofs in MOEAs, the following are often assumed:
- Finite Search Space: While the gene values are continuous, they can be discretized in theory (or bounded in practice).
- Strictly Positive Mutation Probability: $p_m(t) > 0$ for all $t$, which ensures ergodicity.
- Elitism: The best solutions are preserved across generations.
Theorem on Convergence
Theorem: Under the assumptions above, the BioGA algorithm converges in probability to the global Pareto front of the multi-objective optimization problem.
Proof Sketch:
- Preservation of Optimality: The NSGA-II selection and Elitism replacement ensure that once a solution is found on the Pareto front, it or a strictly better non-dominated alternative is retained.
- Ergodicity: The strictly positive mutation probability guarantees that any point in the search space can theoretically be reached from any other point in a finite number of steps.
- Convergence Result: Based on general MOEA convergence theorems (e.g., Rudolph 1998, or similar proofs for NSGA-II), the combination of selection pressure toward non-dominated solutions and exploratory power (mutation/crossover) ensures that the population tends toward the true Pareto optimal set as $t \to \infty$.
Computational Complexity
Analyzing the complexity is crucial for understanding the algorithm’s performance on large genomic datasets. Key variables are defined:
- $N$ = population size
- $m$ = number of genes
- $n$ = number of samples
- $k$ = number of objectives ($k = 2$ in BioGA)
- $T$ = number of generations
Component Complexities
| GA Component | Complexity (Per Generation) | Dominating Variables |
|---|---|---|
| Initialization | $O(Nm)$ | $N$, $m$ |
| Fitness Evaluation | $O(Nmn)$ | $N$, $m$, $n$ |
| Selection (Sorting) | $O(kN^2)$ (quadratic) | $N$ |
| Crossover (SBX) | $O(Nm)$ | $N$, $m$ |
| Mutation (Adaptive) | $O(Nm)$ | $N$, $m$ |
| Replacement | $O(Nm)$ | $N$, $m$ |
Total Complexity
The total complexity for $T$ generations is dominated by the Fitness Evaluation and Selection phases:
$$O\big(T \cdot (Nmn + kN^2)\big)$$
Factoring out $N$ gives $O\big(TN(mn + kN)\big)$. This result highlights that scalability is limited by the quadratic dependency on population size (due to sorting) and the linear dependency on the product of genes and samples (due to fitness evaluation).
Mathematical Optimization Interpretation
From a purely mathematical perspective, the BioGA algorithm serves as a robust stochastic optimization solver for the following multi-objective problem:
$$\min_{P_i \in \mathbb{R}^m} \big(f_1(P_i),\ f_2(P_i)\big)$$
Core Characteristics that Favor GA:
- Multi-Objective Nature: Explicitly seeking a set of trade-off solutions (the Pareto front).
- High Dimensionality: The search space is $\mathbb{R}^m$, with $m$ often in the thousands for genomic data, making deterministic grid search intractable.
- Non-Differentiable Objectives: The sparsity objective $f_2$ is non-differentiable, ruling out standard calculus-based methods.
- Non-Convexity: The fitness landscape, when combined, may be non-convex, meaning the GA’s global search capability is essential to avoid local optima.
Biological Interpretation and Relevance
The mathematical operations within the BioGA framework have direct analogues in the biological context of genomic data analysis, ensuring the algorithm is biologically meaningful.
| Mathematical Component | Biological Concept | Function |
|---|---|---|
| Population Initialization | Initial Biological Variability | Starting the search process by sampling observed expression levels. |
| Fitness ($f_1$, $f_2$) | Functional Efficacy & Parsimony | $f_1$ ensures the optimized solution is relevant to observed data; $f_2$ enforces biological simplicity (Minimal Regulatory Model). |
| Network Constraints | Gene-Gene Interactions | Incorporating known biological constraints (e.g., regulatory or metabolic pathways) to guide mutation. |
| Clustering | Co-expressed Gene Modules | Respecting known functional groupings during initialization. |
Conclusion
The BioGA package implements a theoretically sound and
computationally structured multi-objective evolutionary algorithm. Its
mathematical foundation—encompassing NSGA-II principles for
multi-objective handling, sophisticated SBX and adaptive
network-constrained mutation for exploration, and diversity-preserving
replacement—provides a rigorous framework for simultaneously achieving
data fidelity and model sparsity in genomic data optimization.
Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.5.2 (2025-10-31)
#> os Ubuntu 24.04.3 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2025-12-04
#> pandoc 3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#> quarto NA
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bookdown 0.45 2025-10-03 [1] RSPM
#> bslib 0.9.0 2025-01-30 [1] RSPM
#> cachem 1.1.0 2024-05-16 [1] RSPM
#> cli 3.6.5 2025-04-23 [1] RSPM
#> desc 1.4.3 2023-12-10 [1] RSPM
#> digest 0.6.39 2025-11-19 [1] RSPM
#> evaluate 1.0.5 2025-08-27 [1] RSPM
#> fastmap 1.2.0 2024-05-15 [1] RSPM
#> fs 1.6.6 2025-04-12 [1] RSPM
#> htmltools 0.5.8.1 2024-04-04 [1] RSPM
#> htmlwidgets 1.6.4 2023-12-06 [1] RSPM
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 2.0.0 2025-03-27 [1] RSPM
#> knitr 1.50 2025-03-16 [1] RSPM
#> lifecycle 1.0.4 2023-11-07 [1] RSPM
#> pkgdown 2.2.0 2025-11-06 [1] any (@2.2.0)
#> R6 2.6.1 2025-02-15 [1] RSPM
#> ragg 1.5.0 2025-09-02 [1] RSPM
#> rlang 1.1.6 2025-04-11 [1] RSPM
#> rmarkdown 2.30 2025-09-28 [1] RSPM
#> sass 0.4.10 2025-04-11 [1] RSPM
#> sessioninfo 1.2.3 2025-02-05 [1] RSPM
#> systemfonts 1.3.1 2025-10-01 [1] RSPM
#> textshaping 1.0.4 2025-10-10 [1] RSPM
#> xfun 0.54 2025-10-30 [1] RSPM
#> yaml 2.3.11 2025-11-28 [1] RSPM
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/4.5.2/lib/R/site-library
#> [3] /opt/R/4.5.2/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────