Skip to contents

Introduction

This report provides a detailed mathematical analysis of the BioGA R package, which implements a multi-objective genetic algorithm (GA) for genomic data optimization. The analysis covers all major components of the algorithm with formal mathematical notation and proofs where applicable.

Genetic Algorithm Framework

The package implements a standard generational GA with the following components:

  1. Population Initialization
  2. Fitness Evaluation
  3. Selection (NSGA-II inspired)
  4. Crossover (SBX)
  5. Mutation (Adaptive)
  6. Replacement (Elitism + Diversity Preservation)

Mathematical Representation

Let:

  • G=(V,E)G = (V, E) be the gene network where V={g1,g2,,gn}V = \{g_1, g_2, \ldots, g_n\} are genes
  • Xn×mX \in \mathbb{R}^{n \times m} be the genomic data matrix (nn genes × mm samples)
  • Ptp×nP_t \in \mathbb{R}^{p \times n} be the population at generation tt (pp individuals × nn genes)

The GA can be represented as:

Pt+1=R(M(C(S(Pt,f(Pt)),X)),Pt,f(Pt)) P_{t+1} = R(M(C(S(P_t, f(P_t)), X)), P_t, f(P_t))

Where:

  • ff: Fitness evaluation function

  • SS: Selection operator

  • CC: Crossover operator

  • MM: Mutation operator

  • RR: Replacement operator

Population Initialization

Mathematical Formulation

Given genomic data Xn×mX \in \mathbb{R}^{n \times m} and population size pp:

P0[i,j]=X[j,k]wherekUniform{1,,m} P_0[i,j] = X[j, k] \quad \text{where} \quad k \sim \text{Uniform}\{1, \ldots, m\}

With clustering (if provided): For each cluster cCc \in C:

P0[i,j]=X[j,k]jc,kUniform{1,,m} P_0[i,j] = X[j, k] \quad \forall j \in c, \quad k \sim \text{Uniform}\{1, \ldots, m\}

Properties

  1. Maintains original data distribution per gene
  2. Preserves cluster structure if provided
  3. Expected value: 𝔼[P0[i,j]]=μj\mathbb{E}[P_0[i,j]] = \mu_j (mean of gene jj)

Fitness Evaluation

The package implements a multi-objective fitness function with two components:

Objective 1: Expression Difference

f1(i)=jk(XjkPij)2 f_1(i) = \sum_j \sum_k (X_{jk} - P_{ij})^2

This measures how well the individual matches the observed expression patterns.

Properties:

  1. Convex function with minimum at Pij=μjP_{ij} = \mu_j

  2. Gradient: f1=2k(XjkPij)\nabla f_1 = -2\sum_k(X_{jk} - P_{ij})

Objective 2: Sparsity

f2(i)=jI(|Pij|>ϵ)n f_2(i) = \frac{\sum_j I(|P_{ij}| > \epsilon)}{n}

Where II is the indicator function and ϵ\epsilon is a small constant (10610^{-6}).

Properties:

  1. Non-convex, non-differentiable

  2. Encourages sparse solutions

  3. Range: [0,1][0, 1]

Combined Fitness

F(i)=w1f1(i)+w2f2(i) F(i) = w_1f_1(i) + w_2f_2(i)

Where ww are user-provided weights.

Selection (NSGA-II Inspired)

The selection implements a simplified version of NSGA-II’s non-dominated sorting:

Domination Criteria

Individual ii dominates jj iff:

k:fk(i)fk(j)andk:fk(i)<fk(j) \forall k: f_k(i) \leq f_k(j) \quad \text{and} \quad \exists k: f_k(i) < f_k(j)

Proof of Partial Order:

  1. Reflexive: No individual dominates itself

  2. Antisymmetric: If ii dominates jj, jj cannot dominate ii

  3. Transitive: If ii dominates jj and jj dominates kk, then ii dominates kk

Front Construction

  1. Compute domination counts and dominated sets
  2. First front: Individuals with domination count =0= 0
  3. Subsequent fronts: Remove current front, update counts

Theorem: The front construction algorithm terminates in O(p2o)O(p^2o) time where pp is population size and oo is number of objectives.

Proof:

  • Domination check between two individuals is O(o)O(o)

  • All pairs check is O(p2o)O(p^2o)

  • Front construction is O(p)O(p) per front

Crossover (Simulated Binary Crossover - SBX)

Given parents x,ynx, y \in \mathbb{R}^n, create offspring zz:

For each gene jj: With probability pcp_c: uUniform(0,1)β={(2u)1/(η+1)if u0.5(12(1u))1/(η+1)otherwisezj=0.5[(xj+yj)β|yjxj|] \begin{aligned} u &\sim \text{Uniform}(0,1) \\ \beta &= \begin{cases} (2u)^{1/(\eta+1)} & \text{if } u \leq 0.5 \\ \left(\frac{1}{2(1-u)}\right)^{1/(\eta+1)} & \text{otherwise} \end{cases} \\ z_j &= 0.5[(x_j + y_j) - \beta|y_j - x_j|] \end{aligned} Else: zj=xj z_j = x_j

Properties:

  1. Preserves mean: 𝔼[zj]=xj+yj2\mathbb{E}[z_j] = \frac{x_j + y_j}{2}

  2. Variance controlled by η\eta (distribution index)

  3. For η0\eta \to 0: approaches uniform crossover

  4. For η\eta \to \infty: approaches no crossover (z=xz = x or yy)

Mutation

Adaptive mutation with network constraints:

For each gene jj: With probability pm(t)=p0(1+0.5t/T)p_m(t) = p_0(1 + 0.5t/T): ΔjN(0,σ2)If network provided:zjzj+Δj(1kNjkzk)Else:zjzj+Δj \begin{aligned} \Delta_j &\sim N(0, \sigma^2) \\ \text{If network provided:} & \quad z_j \leftarrow z_j + \Delta_j(1 - \sum_k N_{jk}z_k) \\ \text{Else:} & \quad z_j \leftarrow z_j + \Delta_j \end{aligned}

Properties:

  1. Mutation rate increases with generation tt

  2. Network term reduces mutation magnitude for highly connected genes

  3. Expected change: 𝔼[Δzj]=0\mathbb{E}[\Delta z_j] = 0

  4. Variance: Var(Δzj)=σ2(1kNjkzk)2\text{Var}(\Delta z_j) = \sigma^2(1 - \sum_k N_{jk}z_k)^2 if network provided

Replacement

Elitism + diversity-preserving replacement:

  1. Keep best individual: x*=argmin f1(x)x^* = \text{argmin } f_1(x)
  2. For remaining replacements:
    • Select random individual xx
    • Select offspring yy
    • Replace xx with yy if diversity(x,y)>ϵ\text{diversity}(x,y) > \epsilon

Where diversity(x,y)=xy22\text{diversity}(x,y) = \|x - y\|_2^2

Theorem: This strategy preserves elitism while maintaining population diversity.

Proof:

  1. Best solution is never lost

  2. Expected diversity is non-decreasing since replacements only occur when diversity increases

Convergence Analysis

The algorithm can be shown to converge under certain conditions:

Assumptions:

  1. Finite search space

  2. Strictly positive mutation probability 3. Elitism is maintained

Theorem: The algorithm converges in probability to the Pareto front.

Proof Sketch:

  1. The selection and replacement strategies preserve Pareto optimal solutions (elitism)

  2. Mutation provides ergodicity (any state reachable)

  3. By the multi-objective GA convergence theorems (Rudolph 1998), the algorithm converges to the Pareto front

Computational Complexity

Let: - pp = population size - nn = number of genes - mm = number of samples - oo = number of objectives - TT = number of generations

Component Complexities:

  1. Initialization: O(pn)O(pn)

  2. Fitness Evaluation: O(Tpmn)O(Tpmn) (parallelized)

  3. Selection: O(Tp2o)O(Tp^2o) worst case

  4. Crossover: O(Tpn)O(Tpn)

  5. Mutation: O(Tpn)O(Tpn)

  6. Replacement: O(Tpn)O(Tpn)

Total Complexity: O(Tp(po+mn))O(Tp(po + mn))

Mathematical Optimization Interpretation

The algorithm can be viewed as a stochastic optimization method for:

minimize (f1(P),f2(P))subject to Pp×n \begin{aligned} \text{minimize } & (f_1(P), f_2(P)) \\ \text{subject to } & P \in \mathbb{R}^{p \times n} \end{aligned}

Where: - f1f_1 measures data fidelity - f2f_2 measures sparsity

The GA approach is particularly suitable because:

  1. The problem is multi-objective

  2. The search space is high-dimensional

  3. The fitness landscape may be non-convex

  4. Sparsity objective is non-differentiable

Special Cases and Relationships

  1. Single Objective Case (w2=0w_2 = 0):
    • Reduces to nonlinear least squares optimization
    • GA serves as global optimizer avoiding local minima
  2. No Network Constraints:
    • Mutation becomes standard Gaussian mutation
    • Problem decomposes by genes
  3. High Crossover Rate:
    • Approaches a recombination-based search
    • Faster convergence but reduced diversity

Biological Interpretation

The mathematical operations correspond to biological concepts:

  1. Population Initialization: Sampling from observed biological variability
  2. Fitness: Measuring both functional efficacy (expression matching) and parsimony (sparsity)
  3. Network Constraints: Incorporating known gene-gene interactions
  4. Clustering: Respecting co-expressed gene modules

Conclusion

This mathematical foundation shows that the BioGA package implements a theoretically sound multi-objective evolutionary algorithm for genomic data optimization, with proper attention to both computational efficiency and biological relevance.

Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.1 (2025-06-13)
#>  os       Ubuntu 24.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2025-07-03
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>  quarto   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  BiocManager   1.30.26 2025-06-05 [1] RSPM
#>  BiocStyle   * 2.36.0  2025-04-15 [1] Bioconduc~
#>  bookdown      0.43    2025-04-15 [1] RSPM
#>  bslib         0.9.0   2025-01-30 [1] RSPM
#>  cachem        1.1.0   2024-05-16 [1] RSPM
#>  cli           3.6.5   2025-04-23 [1] RSPM
#>  desc          1.4.3   2023-12-10 [1] RSPM
#>  digest        0.6.37  2024-08-19 [1] RSPM
#>  evaluate      1.0.4   2025-06-18 [1] RSPM
#>  fastmap       1.2.0   2024-05-15 [1] RSPM
#>  fs            1.6.6   2025-04-12 [1] RSPM
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM
#>  htmlwidgets   1.6.4   2023-12-06 [1] RSPM
#>  jquerylib     0.1.4   2021-04-26 [1] RSPM
#>  jsonlite      2.0.0   2025-03-27 [1] RSPM
#>  knitr         1.50    2025-03-16 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM
#>  pkgdown       2.1.3   2025-05-25 [1] any (@2.1.3)
#>  R6            2.6.1   2025-02-15 [1] RSPM
#>  ragg          1.4.0   2025-04-10 [1] RSPM
#>  rlang         1.1.6   2025-04-11 [1] RSPM
#>  rmarkdown     2.29    2024-11-04 [1] RSPM
#>  sass          0.4.10  2025-04-11 [1] RSPM
#>  sessioninfo   1.2.3   2025-02-05 [1] RSPM
#>  systemfonts   1.2.3   2025-04-30 [1] RSPM
#>  textshaping   1.0.1   2025-05-01 [1] RSPM
#>  xfun          0.52    2025-04-02 [1] RSPM
#>  yaml          2.3.10  2024-07-26 [1] RSPM
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.5.1/lib/R/site-library
#>  [3] /opt/R/4.5.1/lib/R/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────