Skip to contents

Introduction

The BioGA utilizes multi-objective Genetic Algorithm (GA) specifically tailored for the optimization and analysis of genomic data. A detailed mathematical and theoretical analysis of the algorithms implemented within this package is described below; covering all major components of the evolutionary process with formal mathematical notation, proofs, and a discussion of key properties.

The Genetic Algorithm Framework

The BioGA package employs a standard generational Genetic Algorithm structure, modified for multi-objective optimization. This framework involves the iterative evolutionary cycle of initialization, evaluation, selection, reproduction (crossover and mutation), and replacement.

Mathematical Representation

Let us define the core elements of the search space and the algorithm’s state:

  • G=(V,E)G = (V, E) is the gene network, where V={g1,g2,,gn}V = \{g_1, g_2, \ldots, g_n\} are nn genes.
  • Xn×mX \in \mathbb{R}^{n \times m} is the genomic data matrix, consisting of nn genes across mm samples.
  • Ptp×nP_t \in \mathbb{R}^{p \times n} is the population matrix at generation tt, containing pp individuals (rows) and nn gene values (columns).

The evolution from one generation to the next, PtPt+1P_t \to P_{t+1}, is described by the following functional relationship:

Pt+1=R(M(C(S(Pt,f(Pt)),X)),Pt,f(Pt)) P_{t+1} = R(M(C(S(P_t, f(P_t)), X)), P_t, f(P_t))

Where the functions represent the core operators:

  • ff: Fitness Evaluation function.
  • SS: Selection operator (NSGA-II inspired).
  • CC: Crossover operator (SBX).
  • MM: Mutation operator (Adaptive).
  • RR: Replacement operator (Elitism + Diversity Preservation).

Population Initialization

The initialization phase creates the starting population, P0P_0, by drawing random samples directly from the measured genomic data XX.

Mathematical Formulation

Given the genomic data Xn×mX \in \mathbb{R}^{n \times m} and a population size pp, the value of gene jj in the ii-th individual of the initial population is set as:

P0[i,j]=X[j,k]wherekUniform{1,,m} P_0[i,j] = X[j, k] \quad \text{where} \quad k \sim \text{Uniform}\{1, \ldots, m\} This means that for each element (i,j)(i, j) in P0P_0, a sample kk is chosen uniformly at random from the mm available samples, and the corresponding gene expression value X[j,k]X[j, k] is used.

Initialization with Clustering:

If a clustering breakdown is provided (e.g., CC is the set of clusters), the initialization respects the structure: P0[i,j]=X[j,k]jc,kUniform{1,,m} P_0[i,j] = X[j, k] \quad \forall j \in c, \quad k \sim \text{Uniform}\{1, \ldots, m\} for all genes jj belonging to the same cluster cc.

Statistical Properties

  1. Distribution Preservation: This method ensures that the initial population individuals maintain the original statistical distribution of expression values for each gene jj.
  2. Structural Integrity: If clustering is used, the method preserves the pre-defined cluster structure within the initial population.
  3. Expected Value: The expected value of any element P0[i,j]P_0[i,j] over the initialization process is the mean of the observed expression for gene jj: 𝔼[P0[i,j]]=μj=1mk=1mX[j,k] \mathbb{E}[P_0[i,j]] = \mu_j = \frac{1}{m}\sum_{k=1}^m X[j,k]

Fitness Evaluation

The BioGA package utilizes a multi-objective approach, simultaneously optimizing two conflicting goals: data fidelity (Expression Difference) and model parsimony (Sparsity).

Objective 1: Expression Difference (f1f_1)

This function measures the squared Euclidean distance between the individual ii and all mm observed data samples XjkX_{jk} for all nn genes. It quantifies how accurately the individual reflects the overall expression patterns.

f1(i)=j=1nk=1m(XjkPij)2 f_1(i) = \sum_{j=1}^n \sum_{k=1}^m (X_{jk} - P_{ij})^2

Properties of f1f_1

  1. Convexity: f1f_1 is a convex function, which is a desirable property for optimization as it guarantees that any local minimum is also a global minimum in the absence of other constraints. The minimum occurs when PijP_{ij} is the mean expression μj\mu_j.
  2. Gradient (for theoretical analysis): The partial derivative with respect to the gene value PijP_{ij} is: Pijf1=2k=1m(XjkPij) \nabla_{P_{ij}} f_1 = -2\sum_{k=1}^m (X_{jk} - P_{ij})

Objective 2: Sparsity (f2f_2)

This objective promotes parsimonious solutions by penalizing non-zero gene values. It measures the fraction of non-zero genes within an individual ii.

f2(i)=j=1nI(|Pij|>ϵ)n f_2(i) = \frac{\sum_{j=1}^n I(|P_{ij}| > \epsilon)}{n}

Here, II is the indicator function, which equals 1 if the condition is met, and ϵ\epsilon is a small constant (typically 10610^{-6}) used to define “non-zero.”

Properties of f2f_2

  1. Mathematical Form: f2f_2 is inherently Non-convex and Non-differentiable due to the indicator function II. This makes standard gradient-based optimization methods unsuitable, justifying the use of a heuristic approach like the GA.
  2. Goal: It actively encourages sparse solutions, where many gene values are effectively zero.
  3. Range: The range of the function is restricted to [0,1][0, 1], where 00 represents a fully sparse individual (Pij=0P_{ij} = 0 for all jj) and 11 represents a non-sparse individual.

Combined Fitness

The overall fitness value F(i)F(i) for an individual ii is a weighted combination of the two objectives:

F(i)=w1f1(i)+w2f2(i) F(i) = w_1f_1(i) + w_2f_2(i)

Where w1w_1 and w2w_2 are user-specified weights that dictate the trade-off between minimizing expression error (f1f_1) and maximizing sparsity (f2f_2).

Selection (NSGA-II Inspired)

The selection mechanism is based on the principles of the Non-dominated Sorting Genetic Algorithm II (NSGA-II), ensuring that evolutionary pressure drives the population toward the Pareto front.

Domination Criteria

Individual ii dominates individual jj (denoted iji \succ j) if and only if:

  1. Individual ii is no worse than jj on all objectives, and
  2. Individual ii is strictly better than jj on at least one objective.

Formally: ijiffk{1,2}:fk(i)fk(j)andk{1,2}:fk(i)<fk(j) i \succ j \quad \text{iff} \quad \forall k \in \{1, 2\}: f_k(i) \leq f_k(j) \quad \text{and} \quad \exists k \in \{1, 2\}: f_k(i) < f_k(j)

Proof of Partial Order

The two-objective domination relation ()(\succ) defines a partial order on the solution set because it satisfies the following properties:

  1. Irreflexive: No individual dominates itself. If i=ji = j, the second condition (k:fk(i)<fk(j)\exists k: f_k(i) < f_k(j)) is false.
  2. Antisymmetric Property: If iji \succ j, then jj cannot dominate ii. If ii is strictly better on at least one objective, jj must be strictly worse (k:fk(j)fk(i)\forall k: f_k(j) \not\leq f_k(i)).
  3. Transitive: If iji \succ j and jkj \succ k, then iki \succ k. This follows directly from the transitivity of the combined \leq and << relations.

Front Construction

The population is sorted into non-dominated fronts:

  1. Domination Metrics: For every individual, we compute its domination count (number of individuals that dominate it) and its dominated set (set of individuals it dominates).
  2. First Front (1\mathcal{F}_1): This front consists of all individuals whose domination count is 00. These are the current non-dominated solutions.
  3. Subsequent Fronts: Individuals in t\mathcal{F}_t are temporarily removed. The domination counts of the individuals they dominated are reduced by one. The next front, t+1\mathcal{F}_{t+1}, consists of all remaining individuals whose updated domination count is 00. This process continues until the entire population is sorted.

Theorem on Computational Complexity

The non-dominated sorting and front construction algorithm terminates in quadratic time complexity relative to the population size.

Theorem: The front construction algorithm terminates in O(p2o)O(p^2o) time, where pp is the population size and oo is the number of objectives.

Proof Sketch:

  • The pairwise domination check between any two individuals is an O(o)O(o) operation.
  • The computation of the domination count and dominated set for all individuals requires O(p2o)O(p^2o) time, as it involves (p2)\binom{p}{2} comparisons.
  • Once these metrics are computed, the iterative process of assigning individuals to fronts requires O(p)O(p) time per front, leading to an overall complexity dominated by the O(p2o)O(p^2o) initial comparison phase.

Crossover (Simulated Binary Crossover - SBX)

The Simulated Binary Crossover (SBX) is a prominent operator for real-coded GAs, designed to mimic the convergence properties of binary crossover while operating directly on continuous variables.

Given two parent individuals, x,ynx, y \in \mathbb{R}^n, the SBX operator generates a new offspring vector zz. For each gene jj:

With crossover probability pcp_c, the offspring gene zjz_j is determined: 1. A random number uu is drawn: uUniform(0,1)u \sim \text{Uniform}(0,1). 2. The spreading factor β\beta is computed based on the distribution index η\eta: β={(2u)1/(η+1)if u0.5(12(1u))1/(η+1)otherwise \beta = \begin{cases} (2u)^{1/(\eta+1)} & \text{if } u \leq 0.5 \\ \left(\frac{1}{2(1-u)}\right)^{1/(\eta+1)} & \text{otherwise} \end{cases} 3. The gene value zjz_j is calculated: zj=0.5[(xj+yj)β|yjxj|] z_j = 0.5[(x_j + y_j) - \beta|y_j - x_j|]

Otherwise (with probability 1pc1-p_c): zj=xjorzj=yj z_j = x_j \quad \text{or} \quad z_j = y_j

Properties of SBX

  1. Mean Preservation: The expected value of the offspring gene is average of the parents’ genes: 𝔼[zj]=xj+yj2\mathbb{E}[z_j] = \frac{x_j + y_j}{2}.
  2. Distribution Control: The distribution index η\eta controls the proximity of the offspring to the parents.
  3. Low η\eta (e.g., η0\eta \to 0): Produces a wide spread and approaches uniform crossover, generating many extreme values.
  4. High η\eta (e.g., η\eta \to \infty): Clusters offspring tightly around the parents, approaching no crossover.

Mutation

The mutation operation introduces random small changes to the offspring for exploration. The BioGA uses an adaptive mutation rate and incorporates network constraints.

For each gene jj, with adaptive probability pm(t)p_m(t): pm(t)=p0(1+0.5tT) p_m(t) = p_0\left(1 + 0.5\frac{t}{T}\right) where p0p_0 is the initial rate, tt is the current generation, and TT is the total number of generations. This rate increases linearly over time to enhance exploration in later stages.

The change in the gene value, zjz_j, is applied as follows:

  1. A Gaussian perturbation is generated: ΔjN(0,σ2)\Delta_j \sim N(0, \sigma^2).

  2. If a Gene Network NN is provided (NjkN_{jk} being the adjacency matrix): The mutation is penalized by the network connectivity: zjzj+Δj(1kNjkzk) z_j \leftarrow z_j + \Delta_j\left(1 - \sum_{k} N_{jk}z_k\right)

  3. If no Network is provided: Standard Gaussian mutation is applied: zjzj+Δj z_j \leftarrow z_j + \Delta_j

Properties of Adaptive Network Mutation

  1. Adaptive Rate: The mutation rate monotonically increases with generation tt, preventing premature convergence and maintaining diversity late in the run.
  2. Network Constraint Impact: The term (1kNjkzk)(1 - \sum_{k} N_{jk}z_k) acts as a coefficient. Highly connected genes (large kNjk\sum_{k} N_{jk}) experience a reduced magnitude of mutation, preserving known network structures.
  3. Expected Change: Since 𝔼[Δj]=0\mathbb{E}[\Delta_j] = 0, the expected change remains zero: 𝔼[Δzj]=0\mathbb{E}[\Delta z_j] = 0.
  4. Variance Control: For the network-constrained case, the variance of the change is: Var(Δzj)=σ2(1kNjkzk)2 \text{Var}(\Delta z_j) = \sigma^2\left(1 - \sum_{k} N_{jk}z_k\right)^2 This mathematical structure formalizes how network relationships dynamically control evolutionary exploration.

Replacement

The replacement operator RR constructs the next population Pt+1P_{t+1} from the current population PtP_t and the offspring OtO_t. It employs an Elitism strategy combined with a metric for diversity preservation.

  1. Elitism: The best individual, x*x^*, defined as the solution minimizing the primary objective f1f_1, is always retained in the next generation: x*=argminxPtOtf1(x) x^* = \text{argmin}_{x \in P_t \cup O_t} f_1(x)
  2. Diversity-Preserving Replacement: For the remaining p1p-1 slots in Pt+1P_{t+1}, a non-elitist replacement strategy is performed iteratively:
    • Randomly select an individual xx from the current population PtP_t.
    • Select an offspring yy from the newly generated offspring OtO_t.
    • Replace xx with yy only if the increase in diversity is significant: Replace xyifdiversity(x,y)>ϵ \text{Replace } x \leftarrow y \quad \text{if} \quad \text{diversity}(x,y) > \epsilon Where the diversity metric is the squared Euclidean distance: diversity(x,y)=xy22\text{diversity}(x,y) = \|x - y\|_2^2.

Theorem on Population Quality

Theorem: This replacement strategy preserves elitism while ensuring that the population’s diversity does not collapse into a single point.

Proof Sketch:

  1. Elitism Preservation (Quality): By explicitly retaining x*x^*, the search process ensures that the best solution found so far across the primary objective f1f_1 is never lost. This guarantees monotonic, non-decreasing convergence quality on f1f_1.
  2. Diversity Maintenance: Replacements are conditional on diversity(x,y)>ϵ\text{diversity}(x,y) > \epsilon. Since only replacements that increase or maintain the distance (and hence diversity) above a threshold are accepted, the expected diversity of the population is non-decreasing over the generations, effectively counteracting sampling drift.

Convergence Analysis

The convergence of the BioGA algorithm, derived from established Multi-objective Evolutionary Algorithm (MOEA) theory, guarantees convergence to the optimal set under general conditions.

Assumptions

For convergence proofs in MOEAs, the following are often assumed:

  1. Finite Search Space: While the gene values are continuous, they can be discretized in theory (or bounded in practice).
  2. Strictly Positive Mutation Probability: pm(t)>0p_m(t) > 0 for all tt, which ensures ergodicity.
  3. Elitism: The best solutions are preserved across generations.

Theorem on Convergence

Theorem: Under the assumptions above, the BioGA algorithm converges in probability to the global Pareto front of the multi-objective optimization problem.

Proof Sketch:

  1. Preservation of Optimality: The NSGA-II selection and Elitism replacement ensure that once a solution is found on the Pareto front, it or a strictly better non-dominated alternative is retained.
  2. Ergodicity: The strictly positive mutation probability guarantees that any point in the search space can theoretically be reached from any other point in a finite number of steps.
  3. Convergence Result: Based on general MOEA convergence theorems (e.g., Rudolph 1998, or similar proofs for NSGA-II), the combination of selection pressure toward non-dominated solutions and exploratory power (mutation/crossover) ensures that the population will tend toward the true Pareto optimal set as tt \to \infty.

Computational Complexity

Analyzing the complexity is crucial for understanding the algorithm’s performance on large genomic datasets. Key variables are defined:

  • pp = population size
  • nn = number of genes
  • mm = number of samples
  • oo = number of objectives (o=2o=2 in BioGA)
  • TT = number of generations

Component Complexities

GA Component Complexity (Per Generation) Dominating Variables
Initialization O(pn)O(pn) p,np, n
Fitness Evaluation O(pn)×O(m)O(pn) \times O(m)\toO(pmn)O(pmn) p,m,np, m, n
Selection (Sorting) O(p2o)O(p^2o) pp (quadratic)
Crossover (SBX) O(pn)O(pn) p,np, n
Mutation (Adaptive) O(pn)O(pn) p,np, n
Replacement O(pn)O(pn) p,np, n

Total Complexity

The total complexity for TT generations is dominated by the Fitness Evaluation and the Selection phases:

Total Complexity=O(T(pmn+p2o)) \text{Total Complexity} = O(T \cdot (pmn + p^2o)) Factoring pp: Total Complexity=O(Tp(mn+po)) \text{Total Complexity} = O(Tp(mn + po)) This result highlights that scalability is limited by the quadratic dependency on population size pp (due to sorting) and the linear dependency on the product of genes/samples mnmn (due to fitness evaluation).

Mathematical Optimization Interpretation

From a purely mathematical perspective, the BioGA algorithm serves as a robust stochastic optimization solver for the following multi-objective problem:

minimize (f1(P),f2(P))subject to Pp×n \begin{aligned} \text{minimize } & (f_1(P), f_2(P)) \\ \text{subject to } & P \in \mathbb{R}^{p \times n} \end{aligned}

Core Characteristics that Favor GA:

  1. Multi-Objective Nature: Explicitly seeking a set of trade-off solutions (the Pareto front).
  2. High Dimensionality: Search space size is p×np \times n, making deterministic grid search intractable.
  3. Non-Differentiable Objectives: The sparsity objective f2f_2 is non-differentiable, making standard calculus-based methods impossible.
  4. Non-Convexity: The fitness landscape, when combined, may be non-convex, meaning the GA’s global search capability is essential to avoid local optima.

Biological Interpretation and Relevance

The mathematical operations within the BioGA framework have direct analogues in the biological context of genomic data analysis, ensuring the algorithm is biologically meaningful.

Mathematical Component Biological Concept Function
Population Initialization Initial Biological Variability Starting the search process by sampling observed expression levels.
Fitness (f1,f2f_1, f_2) Functional Efficacy & Parsimony f1f_1 ensures the optimized solution is relevant to observed data; f2f_2 enforces biological simplicity (Minimal Regulatory Model).
Network Constraints Gene-Gene Interactions Incorporating known biological constraints (e.g., regulatory or metabolic pathways) to guide mutation.
Clustering Co-expressed Gene Modules Respecting known functional groupings during initialization.

Conclusion

The BioGA package implements a theoretically sound and computationally structured multi-objective evolutionary algorithm. Its mathematical foundation—encompassing NSGA-II principles for multi-objective handling, sophisticated SBX and adaptive network-constrained mutation for exploration, and diversity-preserving replacement—provides a rigorous framework for simultaneously achieving data fidelity and model sparsity in genomic data optimization.

Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.2 (2025-10-31)
#>  os       Ubuntu 24.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2025-12-04
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>  quarto   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bookdown      0.45    2025-10-03 [1] RSPM
#>  bslib         0.9.0   2025-01-30 [1] RSPM
#>  cachem        1.1.0   2024-05-16 [1] RSPM
#>  cli           3.6.5   2025-04-23 [1] RSPM
#>  desc          1.4.3   2023-12-10 [1] RSPM
#>  digest        0.6.39  2025-11-19 [1] RSPM
#>  evaluate      1.0.5   2025-08-27 [1] RSPM
#>  fastmap       1.2.0   2024-05-15 [1] RSPM
#>  fs            1.6.6   2025-04-12 [1] RSPM
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM
#>  htmlwidgets   1.6.4   2023-12-06 [1] RSPM
#>  jquerylib     0.1.4   2021-04-26 [1] RSPM
#>  jsonlite      2.0.0   2025-03-27 [1] RSPM
#>  knitr         1.50    2025-03-16 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM
#>  pkgdown       2.2.0   2025-11-06 [1] any (@2.2.0)
#>  R6            2.6.1   2025-02-15 [1] RSPM
#>  ragg          1.5.0   2025-09-02 [1] RSPM
#>  rlang         1.1.6   2025-04-11 [1] RSPM
#>  rmarkdown     2.30    2025-09-28 [1] RSPM
#>  sass          0.4.10  2025-04-11 [1] RSPM
#>  sessioninfo   1.2.3   2025-02-05 [1] RSPM
#>  systemfonts   1.3.1   2025-10-01 [1] RSPM
#>  textshaping   1.0.4   2025-10-10 [1] RSPM
#>  xfun          0.54    2025-10-30 [1] RSPM
#>  yaml          2.3.11  2025-11-28 [1] RSPM
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.5.2/lib/R/site-library
#>  [3] /opt/R/4.5.2/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────