Skip to contents

Introduction

This part reports a detailed mathematical analysis of the BioGA R package, which implements a multi-objective genetic algorithm (GA) for genomic data optimization. The analysis covers all major components of the algorithm with formal mathematical notation and proofs where applicable.

GA framework

The package implements a standard generational GA with the following components:

  1. Population initialization
  2. Fitness evaluation
  3. Selection (NSGA-II inspired)
  4. Crossover (SBX)
  5. Mutation (Adaptive)
  6. Replacement (Elitism + Diversity Preservation)

Mathematical representation

Let:

  • G=(V,E)G = (V, E) be the gene network where V={g1,g2,,gn}V = \{g_1, g_2, \ldots, g_n\} are genes
  • Xn×mX \in \mathbb{R}^{n \times m} be the genomic data matrix (nn genes × mm samples)
  • Ptp×nP_t \in \mathbb{R}^{p \times n} be the population at generation tt (pp individuals × nn genes)

The GA can be represented as:

Pt+1=R(M(C(S(Pt,f(Pt)),X)),Pt,f(Pt)) P_{t+1} = R(M(C(S(P_t, f(P_t)), X)), P_t, f(P_t))

Where:

  • ff: Fitness evaluation function

  • SS: Selection operator

  • CC: Crossover operator

  • MM: Mutation operator

  • RR: Replacement operator

Population initialization

Mathematical formulation

Given genomic data Xn×mX \in \mathbb{R}^{n \times m} and population size pp:

P0[i,j]=X[j,k]wherekUniform{1,,m} P_0[i,j] = X[j, k] \quad \text{where} \quad k \sim \text{Uniform}\{1, \ldots, m\}

With clustering (if provided): For each cluster cCc \in C:

P0[i,j]=X[j,k]jc,kUniform{1,,m} P_0[i,j] = X[j, k] \quad \forall j \in c, \quad k \sim \text{Uniform}\{1, \ldots, m\}

Properties

  1. Maintains original data distribution per gene
  2. Preserves cluster structure if provided
  3. Expected value: 𝔼[P0[i,j]]=μj\mathbb{E}[P_0[i,j]] = \mu_j (mean of gene jj)

Fitness evaluation

The package implements a multi-objective fitness function with two components:

Objective 1: Expression difference

f1(i)=jk(XjkPij)2 f_1(i) = \sum_j \sum_k (X_{jk} - P_{ij})^2

This measures how well the individual matches the observed expression patterns.

Properties:

  1. Convex function with minimum at Pij=μjP_{ij} = \mu_j

  2. Gradient: f1=2k(XjkPij)\nabla f_1 = -2\sum_k(X_{jk} - P_{ij})

Objective 2: Sparsity

f2(i)=jI(|Pij|>ϵ)n f_2(i) = \frac{\sum_j I(|P_{ij}| > \epsilon)}{n}

Where II is the indicator function and ϵ\epsilon is a small constant (10610^{-6}).

Properties:

  1. Non-convex, non-differentiable

  2. Encourages sparse solutions

  3. Range: [0,1][0, 1]

Combined fitness

F(i)=w1f1(i)+w2f2(i) F(i) = w_1f_1(i) + w_2f_2(i)

Where ww are user-provided weights.

Selection (NSGA-II inspired)

The selection implements a simplified version of NSGA-II’s non-dominated sorting:

Domination criteria

Individual ii dominates jj iff:

k:fk(i)fk(j)andk:fk(i)<fk(j) \forall k: f_k(i) \leq f_k(j) \quad \text{and} \quad \exists k: f_k(i) < f_k(j)

Proof of partial order:

  1. Reflexive: No individual dominates itself

  2. Antisymmetric: If ii dominates jj, jj cannot dominate ii

  3. Transitive: If ii dominates jj and jj dominates kk, then ii dominates kk

Front construction

  1. Compute domination counts and dominated sets
  2. First front: Individuals with domination count =0= 0
  3. Subsequent fronts: Remove current front, update counts

Theorem: The front construction algorithm terminates in O(p2o)O(p^2o) time where pp is population size and oo is number of objectives.

Proof:

  • Domination check between two individuals is O(o)O(o)

  • All pairs check is O(p2o)O(p^2o)

  • Front construction is O(p)O(p) per front

Crossover (Simulated Binary Crossover - SBX)

Given parents x,ynx, y \in \mathbb{R}^n, create offspring zz:

For each gene jj: With probability pcp_c: uUniform(0,1)β={(2u)1/(η+1)if u0.5(12(1u))1/(η+1)otherwisezj=0.5[(xj+yj)β|yjxj|] \begin{aligned} u &\sim \text{Uniform}(0,1) \\ \beta &= \begin{cases} (2u)^{1/(\eta+1)} & \text{if } u \leq 0.5 \\ \left(\frac{1}{2(1-u)}\right)^{1/(\eta+1)} & \text{otherwise} \end{cases} \\ z_j &= 0.5[(x_j + y_j) - \beta|y_j - x_j|] \end{aligned} Else: zj=xj z_j = x_j

Properties:

  1. Preserves mean: 𝔼[zj]=xj+yj2\mathbb{E}[z_j] = \frac{x_j + y_j}{2}

  2. Variance controlled by η\eta (distribution index)

  3. For η0\eta \to 0: approaches uniform crossover

  4. For η\eta \to \infty: approaches no crossover (z=xz = x or yy)

Mutation

Adaptive mutation with network constraints:

For each gene jj: With probability pm(t)=p0(1+0.5t/T)p_m(t) = p_0(1 + 0.5t/T): ΔjN(0,σ2)If network provided:zjzj+Δj(1kNjkzk)Else:zjzj+Δj \begin{aligned} \Delta_j &\sim N(0, \sigma^2) \\ \text{If network provided:} & \quad z_j \leftarrow z_j + \Delta_j(1 - \sum_k N_{jk}z_k) \\ \text{Else:} & \quad z_j \leftarrow z_j + \Delta_j \end{aligned}

Properties:

  1. Mutation rate increases with generation tt

  2. Network term reduces mutation magnitude for highly connected genes

  3. Expected change: 𝔼[Δzj]=0\mathbb{E}[\Delta z_j] = 0

  4. Variance: Var(Δzj)=σ2(1kNjkzk)2\text{Var}(\Delta z_j) = \sigma^2(1 - \sum_k N_{jk}z_k)^2 if network provided

Replacement

Elitism + diversity-preserving replacement:

  1. Keep best individual: x*=argmin f1(x)x^* = \text{argmin } f_1(x)
  2. For remaining replacements:
    • Select random individual xx
    • Select offspring yy
    • Replace xx with yy if diversity(x,y)>ϵ\text{diversity}(x,y) > \epsilon

Where diversity(x,y)=xy22\text{diversity}(x,y) = \|x - y\|_2^2

Theorem: This strategy preserves elitism while maintaining population diversity.

Proof:

  1. Best solution is never lost

  2. Expected diversity is non-decreasing since replacements only occur when diversity increases

Convergence analysis

The algorithm can be shown to converge under certain conditions:

Assumptions:

  1. Finite search space

  2. Strictly positive mutation probability 3. Elitism is maintained

Theorem: The algorithm converges in probability to the Pareto front.

Proof sketch:

  1. The selection and replacement strategies preserve Pareto optimal solutions (elitism)

  2. Mutation provides ergodicity (any state reachable)

  3. By the multi-objective GA convergence theorems (Rudolph 1998), the algorithm converges to the Pareto front

Computational complexity

Let: - pp = population size - nn = number of genes - mm = number of samples - oo = number of objectives - TT = number of generations

Component complexities:

  1. Initialization: O(pn)O(pn)

  2. Fitness evaluation: O(Tpmn)O(Tpmn) (parallelized)

  3. Selection: O(Tp2o)O(Tp^2o) worst case

  4. Crossover: O(Tpn)O(Tpn)

  5. Mutation: O(Tpn)O(Tpn)

  6. Replacement: O(Tpn)O(Tpn)

Total complexity: O(Tp(po+mn))O(Tp(po + mn))

Mathematical optimization interpretation

The algorithm can be viewed as a stochastic optimization method for:

minimize (f1(P),f2(P))subject to Pp×n \begin{aligned} \text{minimize } & (f_1(P), f_2(P)) \\ \text{subject to } & P \in \mathbb{R}^{p \times n} \end{aligned}

Where: - f1f_1 measures data fidelity - f2f_2 measures sparsity

The GA approach is particularly suitable because:

  1. The problem is multi-objective

  2. The search space is high-dimensional

  3. The fitness landscape may be non-convex

  4. Sparsity objective is non-differentiable

Special Cases and Relationships

  1. Single objective case (w2=0w_2 = 0):
    • Reduces to nonlinear least squares optimization
    • GA serves as global optimizer avoiding local minima
  2. No Network Constraints:
    • Mutation becomes standard Gaussian mutation
    • Problem decomposes by genes
  3. High Crossover Rate:
    • Approaches a recombination-based search
    • Faster convergence but reduced diversity

Biological interpretation

The mathematical operations correspond to biological concepts:

  1. Population Initialization: Sampling from observed biological variability
  2. Fitness: Measuring both functional efficacy (expression matching) and parsimony (sparsity)
  3. Network Constraints: Incorporating known gene-gene interactions
  4. Clustering: Respecting co-expressed gene modules

Conclusion

This mathematical foundation shows that the BioGA package implements a theoretically sound multi-objective evolutionary algorithm for genomic data optimization, with proper attention to both computational efficiency and biological relevance.

Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.1 (2025-06-13)
#>  os       Ubuntu 24.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2025-07-09
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>  quarto   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  BiocManager   1.30.26 2025-06-05 [1] RSPM
#>  BiocStyle   * 2.36.0  2025-04-15 [1] Bioconduc~
#>  bookdown      0.43    2025-04-15 [1] RSPM
#>  bslib         0.9.0   2025-01-30 [1] RSPM
#>  cachem        1.1.0   2024-05-16 [1] RSPM
#>  cli           3.6.5   2025-04-23 [1] RSPM
#>  desc          1.4.3   2023-12-10 [1] RSPM
#>  digest        0.6.37  2024-08-19 [1] RSPM
#>  evaluate      1.0.4   2025-06-18 [1] RSPM
#>  fastmap       1.2.0   2024-05-15 [1] RSPM
#>  fs            1.6.6   2025-04-12 [1] RSPM
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM
#>  htmlwidgets   1.6.4   2023-12-06 [1] RSPM
#>  jquerylib     0.1.4   2021-04-26 [1] RSPM
#>  jsonlite      2.0.0   2025-03-27 [1] RSPM
#>  knitr         1.50    2025-03-16 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM
#>  pkgdown       2.1.3   2025-05-25 [1] any (@2.1.3)
#>  R6            2.6.1   2025-02-15 [1] RSPM
#>  ragg          1.4.0   2025-04-10 [1] RSPM
#>  rlang         1.1.6   2025-04-11 [1] RSPM
#>  rmarkdown     2.29    2024-11-04 [1] RSPM
#>  sass          0.4.10  2025-04-11 [1] RSPM
#>  sessioninfo   1.2.3   2025-02-05 [1] RSPM
#>  systemfonts   1.2.3   2025-04-30 [1] RSPM
#>  textshaping   1.0.1   2025-05-01 [1] RSPM
#>  xfun          0.52    2025-04-02 [1] RSPM
#>  yaml          2.3.10  2024-07-26 [1] RSPM
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.5.1/lib/R/site-library
#>  [3] /opt/R/4.5.1/lib/R/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────