📐 Mathematics Behind

Introduction

This document presents a detailed mathematical analysis of the BioGA R package, which implements a multi-objective genetic algorithm (GA) for genomic data optimization. It covers each major component of the algorithm in formal notation, with proofs or proof sketches where applicable.

GA framework

The package implements a standard generational GA with the following components:

Population initialization
Fitness evaluation
Selection (NSGA-II inspired)
Crossover (SBX)
Mutation (Adaptive)
Replacement (Elitism + Diversity Preservation)

Mathematical representation

Let:

$G = (V, E)$ be the gene network, where $V = \{g_1, g_2, \ldots, g_n\}$ is the set of genes and $E$ the set of gene-gene interactions
$X \in \mathbb{R}^{n \times m}$ be the genomic data matrix ( $n$ genes × $m$ samples)
$P_t \in \mathbb{R}^{p \times n}$ be the population at generation $t$ ( $p$ individuals × $n$ genes)

The GA can be represented as:

$P_{t+1} = R(M(C(S(P_t, f(P_t)), X)), P_t, f(P_t))$

Where:

$f$ : Fitness evaluation function
$S$ : Selection operator
$C$ : Crossover operator
$M$ : Mutation operator
$R$ : Replacement operator
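
The composition above can be read as a single function that threads the population through the five operators. The following is a minimal Python sketch, not the package's own implementation: the function name `ga_step` and the toy operators are hypothetical and chosen only to make the order of application visible.

```python
def ga_step(P, X, f, S, C, M, R):
    """One generation: P_{t+1} = R(M(C(S(P_t, f(P_t)), X)), P_t, f(P_t))."""
    fitness = f(P)                 # f(P_t)
    parents = S(P, fitness)        # S(P_t, f(P_t))
    children = C(parents, X)       # C(., X)
    mutated = M(children)          # M(.)
    return R(mutated, P, fitness)  # R(., P_t, f(P_t))


# Toy operators (assumptions for illustration only).
def toy_f(P):
    return [ind[0] ** 2 for ind in P]

def toy_S(P, fitness):
    order = sorted(range(len(P)), key=lambda i: fitness[i])
    return [P[i] for i in order[:2]]           # keep the two fittest

def toy_C(parents, X):
    a, b = parents
    return [[(u + v) / 2 for u, v in zip(a, b)]]  # midpoint child

def toy_M(children):
    return children                            # identity "mutation"

def toy_R(children, P, fitness):
    order = sorted(range(len(P)), key=lambda i: fitness[i])
    ranked = [P[i] for i in order]
    return (children + ranked)[:len(P)]        # children first, then best old


P0 = [[3.0], [1.0], [2.0], [4.0]]
P1 = ga_step(P0, None, toy_f, toy_S, toy_C, toy_M, toy_R)
```

Passing the operators as arguments keeps the sketch agnostic about how each one is implemented; only the nesting order is taken from the formula.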

Population initialization

Mathematical formulation

Given genomic data $X \in \mathbb{R}^{n \times m}$ and population size $p$ :

$P_0[i,j] = X[j, k] \quad \text{where} \quad k \sim \text{Uniform}\{1, \ldots, m\}$

With clustering (if provided), one sample index $k$ is drawn per cluster and shared by all genes in that cluster. For each cluster $c$ in the set of clusters:

$P_0[i,j] = X[j, k] \quad \forall j \in c, \quad k \sim \text{Uniform}\{1, \ldots, m\}$

Properties

Maintains original data distribution per gene
Preserves cluster structure if provided
Expected value: $\mathbb{E}[P_0[i,j]] = \mu_j$ (mean of gene $j$ )
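
The sampling rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the package's code: `init_population` is a hypothetical name, `X` is a list of `n` gene rows with `m` sample values each, and clusters are given as lists of gene indices that together cover every gene.

```python
import random

def init_population(X, p, clusters=None, rng=None):
    """Build p individuals by copying, per gene, a value from a random sample."""
    rng = rng or random.Random()
    n, m = len(X), len(X[0])
    if clusters is None:
        # No clustering: every gene draws its own sample index.
        clusters = [[j] for j in range(n)]
    population = []
    for _ in range(p):
        individual = [0.0] * n
        for cluster in clusters:
            k = rng.randrange(m)          # k ~ Uniform{0, ..., m-1}
            for j in cluster:             # genes in a cluster share k
                individual[j] = X[j][k]
        population.append(individual)
    return population


X = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
P_free = init_population(X, 50, rng=random.Random(0))
P_clustered = init_population(X, 50, clusters=[[0, 1]], rng=random.Random(0))
```

In the clustered call both genes are always copied from the same sample column, which is how co-expression within a cluster survives initialization.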

Fitness evaluation

The package implements a multi-objective fitness function with two components:

Objective 1: Expression difference

$f_1(i) = \sum_j \sum_k (X_{jk} - P_{ij})^2$

This measures how well the individual matches the observed expression patterns.

Properties:

Convex (quadratic) in each $P_{ij}$, with minimum at $P_{ij} = \mu_j$, the mean of gene $j$ across samples
Partial derivative: $\frac{\partial f_1}{\partial P_{ij}} = -2\sum_k(X_{jk} - P_{ij})$
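
The stated minimizer can be checked directly. As a short worked derivation, writing $m$ for the number of samples and $\mu_j = \frac{1}{m}\sum_k X_{jk}$ for the per-gene mean, setting the derivative with respect to $P_{ij}$ to zero gives:

```latex
\begin{aligned}
\frac{\partial f_1}{\partial P_{ij}}
  &= -2\sum_{k=1}^{m}\,(X_{jk} - P_{ij}) = 0
  \;\Longrightarrow\;
  P_{ij} = \frac{1}{m}\sum_{k=1}^{m} X_{jk} = \mu_j, \\
\frac{\partial^2 f_1}{\partial P_{ij}^2}
  &= 2m > 0 .
\end{aligned}
```

The positive second derivative confirms that $f_1$ is convex in each coordinate, so $\mu_j$ is the unique minimizer.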

Objective 2: Sparsity

$f_2(i) = \frac{\sum_j I(|P_{ij}| > \epsilon)}{n}$

Where $I$ is the indicator function and $\epsilon$ is a small constant ( $10^{-6}$ ).

Properties:

Non-convex, non-differentiable
Encourages sparse solutions
Range: $[0, 1]$

Combined fitness

$F(i) = w_1f_1(i) + w_2f_2(i)$

Where $w_1$ and $w_2$ are user-provided weights.
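
As a concrete illustration of the two objectives and their weighted sum, here is a minimal Python sketch. The function names are hypothetical and the code is not taken from the package; `X` is a list of gene rows and `ind` is one individual $P_{i\cdot}$.

```python
def f1_expression_difference(X, ind):
    """Sum over genes j and samples k of (X[j][k] - ind[j])^2."""
    return sum((x_jk - ind[j]) ** 2
               for j, row in enumerate(X)
               for x_jk in row)

def f2_sparsity(ind, eps=1e-6):
    """Fraction of genes whose absolute value exceeds eps."""
    return sum(1 for v in ind if abs(v) > eps) / len(ind)

def combined_fitness(X, ind, w1=1.0, w2=1.0, eps=1e-6):
    """Weighted sum F = w1 * f1 + w2 * f2."""
    return w1 * f1_expression_difference(X, ind) + w2 * f2_sparsity(ind, eps)


X = [[1.0, 3.0], [2.0, 2.0]]
ind = [2.0, 0.0]
f1 = f1_expression_difference(X, ind)   # (1-2)^2 + (3-2)^2 + 2^2 + 2^2
f2 = f2_sparsity(ind)                   # one of two genes is non-zero
F = combined_fitness(X, ind, w1=1.0, w2=2.0)
```

Here the first gene sits at its mean, so only the second gene contributes more than the unavoidable within-gene spread to $f_1$.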

Selection (NSGA-II inspired)

The selection implements a simplified version of NSGA-II’s non-dominated sorting:

Domination criteria

Individual $i$ dominates $j$ iff:

$\forall k: f_k(i) \leq f_k(j) \quad \text{and} \quad \exists k: f_k(i) < f_k(j)$

Proof of strict partial order:

Irreflexive: No individual dominates itself, because the strict inequality cannot hold against itself
Asymmetric: If $i$ dominates $j$ , $j$ cannot dominate $i$
Transitive: If $i$ dominates $j$ and $j$ dominates $k$ , then $i$ dominates $k$
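
The domination test translates directly into code. The sketch below is illustrative only: `dominates` is a hypothetical helper that takes two tuples of objective values, with all objectives minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


a, b, c = (1.0, 2.0), (2.0, 2.0), (1.0, 3.0)
```

With these values `a` dominates both `b` and `c`, while `b` and `c` are mutually non-dominated because each is better in one objective.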

Front construction

Compute domination counts and dominated sets
First front: Individuals with domination count $= 0$
Subsequent fronts: Remove current front, update counts

Theorem: The front construction algorithm terminates in $O(p^2o)$ time where $p$ is population size and $o$ is number of objectives.

Proof:

Domination check between two individuals is $O(o)$
All pairs check is $O(p^2o)$
Front construction visits each dominated pair at most once while updating counts, so it costs $O(p^2)$ in total over all fronts
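
The three construction steps can be sketched as follows. This is a minimal Python version of the standard fast non-dominated sort, given as an assumption about how the steps fit together rather than as the package's code; `non_dominated_fronts` is a hypothetical name.

```python
def non_dominated_fronts(objs):
    """Partition indices 0..p-1 into successive non-dominated fronts."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    p = len(objs)
    dominated = [[] for _ in range(p)]  # dominated[i]: indices i dominates
    count = [0] * p                     # count[j]: how many dominate j
    for i in range(p):                  # all pairs: O(p^2 * o)
        for j in range(p):
            if i != j and dominates(objs[i], objs[j]):
                dominated[i].append(j)
                count[j] += 1

    fronts = []
    current = [i for i in range(p) if count[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:               # remove front, update counts
            for j in dominated[i]:
                count[j] -= 1
                if count[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts


objs = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
fronts = non_dominated_fronts(objs)
```

Every individual ends up in exactly one front, and the loop stops once all domination counts have reached zero.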

Crossover (Simulated Binary Crossover - SBX)

Given parents $x, y \in \mathbb{R}^n$ , create offspring $z$ :

For each gene $j$, with probability $p_c$:

$\begin{aligned} u &\sim \text{Uniform}(0,1) \\ \beta &= \begin{cases} (2u)^{1/(\eta+1)} & \text{if } u \leq 0.5 \\ \left(\frac{1}{2(1-u)}\right)^{1/(\eta+1)} & \text{otherwise} \end{cases} \\ z_j &= 0.5[(x_j + y_j) - \beta|y_j - x_j|] \end{aligned}$

Otherwise: $z_j = x_j$

Properties:

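Mean: averaged over the symmetric offspring pair $0.5[(x_j + y_j) \mp \beta|y_j - x_j|]$ , the mean equals the parental midpoint $\frac{x_j + y_j}{2}$ ; the single offspring $z_j$ defined above lies $0.5\,\beta|y_j - x_j|$ below that midpoint
Spread controlled by $\eta$ (distribution index)
For small $\eta$ : $\beta$ is widely dispersed, so offspring can fall far from both parents
For $\eta \to \infty$ : $\beta \to 1$ , so $z_j \to \min(x_j, y_j)$ and the offspring copies a parental value (no effective crossover)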
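
The per-gene rule above can be sketched in Python as follows. This is an illustrative transcription of the formula as written in this section (a single offspring using the $-\beta$ branch), not the package's implementation; the function names and default values of `p_c` and `eta` are assumptions.

```python
import random

def sbx_beta(u, eta):
    """Spread factor beta for a uniform draw u in [0, 1)."""
    if u <= 0.5:
        return (2.0 * u) ** (1.0 / (eta + 1.0))
    return (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))

def sbx_offspring(x, y, p_c=0.9, eta=2.0, rng=None):
    """Create one offspring z from parents x and y, gene by gene."""
    rng = rng or random.Random()
    z = []
    for xj, yj in zip(x, y):
        if rng.random() < p_c:
            beta = sbx_beta(rng.random(), eta)
            z.append(0.5 * ((xj + yj) - beta * abs(yj - xj)))
        else:
            z.append(xj)               # no crossover: copy first parent
    return z


x, y = [1.0, 4.0, 2.0], [3.0, 0.0, 2.0]
z_copy = sbx_offspring(x, y, p_c=0.0, rng=random.Random(1))
z_large_eta = sbx_offspring(x, y, p_c=1.0, eta=1e9, rng=random.Random(1))
```

With a very large `eta` the spread factor is numerically 1, so each gene of the offspring collapses onto the smaller of the two parental values.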

Mutation

Adaptive mutation with network constraints:

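For each gene $j$, with probability $p_m(t) = p_0(1 + 0.5t/T)$, where $t$ is the current generation and $T$ the total number of generations:

$\begin{aligned} \Delta_j &\sim N(0, \sigma^2) \\ \text{If network provided:} & \quad z_j \leftarrow z_j + \Delta_j(1 - \sum_k N_{jk}z_k) \\ \text{Else:} & \quad z_j \leftarrow z_j + \Delta_j \end{aligned}$

Here $N_{jk}$ denotes the entry of the network matrix linking genes $j$ and $k$.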

Properties:

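Mutation rate increases linearly with generation $t$ , from $p_0$ at $t = 0$ to $1.5\,p_0$ at $t = T$
Network term scales the mutation step by $|1 - \sum_k N_{jk}z_k|$ , damping it for connected genes whenever $0 < \sum_k N_{jk}z_k < 2$
Expected change: $\mathbb{E}[\Delta z_j] = 0$ (conditional on the current $z$ )
Variance: $\text{Var}(\Delta z_j) = \sigma^2(1 - \sum_k N_{jk}z_k)^2$ if network provided, and $\sigma^2$ otherwise (conditional on the gene being mutated)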
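
A minimal Python sketch of this mutation rule is given below. It is illustrative and rests on assumptions not stated in the text: the names `mutation_rate` and `mutate` are hypothetical, the rate is clipped at 1 so that it remains a probability, and the network sum uses the current (already partly mutated) values of $z$.

```python
import random

def mutation_rate(p0, t, T):
    """p_m(t) = p0 * (1 + 0.5 * t / T), clipped at 1 (clipping is an assumption)."""
    return min(1.0, p0 * (1.0 + 0.5 * t / T))

def mutate(z, t, T, p0=0.1, sigma=0.1, network=None, rng=None):
    """Return a mutated copy of z; network is an n x n list of lists or None."""
    rng = rng or random.Random()
    p_m = mutation_rate(p0, t, T)
    out = list(z)
    for j in range(len(out)):
        if rng.random() < p_m:
            delta = rng.gauss(0.0, sigma)      # Delta_j ~ N(0, sigma^2)
            if network is not None:
                s = sum(network[j][k] * out[k] for k in range(len(out)))
                delta *= (1.0 - s)             # network-scaled step
            out[j] += delta
    return out


unchanged = mutate([1.0, 2.0], t=0, T=10, p0=0.0, rng=random.Random(1))
changed = mutate([1.0, 2.0], t=5, T=10, p0=1.0, rng=random.Random(1))
# Each gene's network sum equals 1 here, so the scaled step is zero.
frozen = mutate([1.0, 1.0], t=5, T=10, p0=1.0,
                network=[[0.0, 1.0], [1.0, 0.0]], rng=random.Random(1))
```

The last call shows the damping effect at its extreme: when the network sum for a gene equals 1, the factor $1 - \sum_k N_{jk}z_k$ vanishes and the gene is left untouched.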

Replacement

Elitism + diversity-preserving replacement:

Keep best individual: $x^* = \text{argmin } f_1(x)$
For remaining replacements:
- Select random individual $x$
- Select offspring $y$
- Replace $x$ with $y$ if $\text{diversity}(x,y) > \epsilon$

Where $\text{diversity}(x,y) = \|x - y\|_2^2$
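
The replacement steps can be sketched as follows. This Python version is an illustration under stated assumptions rather than the package's code: `replace_population` is a hypothetical name, the elite is excluded from the random draw, offspring are taken in order, and `eps` is an arbitrary default.

```python
import random

def squared_distance(x, y):
    """diversity(x, y) = ||x - y||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def replace_population(population, f1_values, offspring, eps=1e-6, rng=None):
    """Keep the f1-best individual; try to place each offspring at random."""
    rng = rng or random.Random()
    best = min(range(len(population)), key=lambda i: f1_values[i])
    new_pop = [list(ind) for ind in population]
    candidates = [i for i in range(len(population)) if i != best]
    for child in offspring:
        if not candidates:
            break
        i = rng.choice(candidates)             # random individual x
        if squared_distance(new_pop[i], child) > eps:
            new_pop[i] = list(child)           # replace x with y
    return new_pop, best


pop = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
f1_vals = [5.0, 1.0, 3.0]
new_pop, best = replace_population(pop, f1_vals, [[9.0, 9.0]],
                                   rng=random.Random(0))
dup_pop, _ = replace_population([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
                                f1_vals, [[0.0, 0.0]], rng=random.Random(0))
```

In the second call the offspring is identical to every replaceable individual, so the distance test rejects it and the population is returned unchanged.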

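Theorem: This strategy preserves the elite individual with respect to $f_1$ and prevents offspring from replacing near-identical individuals.

Proof:

The best solution $x^*$ is kept outside the replacement step, so it is never lost
A replacement occurs only when $\|x - y\|_2^2 > \epsilon$ , so an offspring never overwrites a near-duplicate of itself; this discourages premature collapse of the population, although it does not by itself guarantee that overall diversity is non-decreasing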

Convergence analysis

The algorithm can be shown to converge under certain conditions:

Assumptions:

Finite search space
Strictly positive mutation probability
Elitism is maintained

Theorem: Under these assumptions, the algorithm converges in probability to the Pareto front.

Proof sketch:

Elitism ensures that non-dominated solutions, once found, are not lost during selection and replacement
A strictly positive mutation probability provides ergodicity (any state is reachable from any other in finitely many steps)
By the convergence theorems for multi-objective GAs (Rudolph 1998), these two properties imply convergence to the Pareto front

Computational complexity

Let:
- $p$ = population size
- $n$ = number of genes
- $m$ = number of samples
- $o$ = number of objectives
- $T$ = number of generations

Component complexities:

Initialization: $O(pn)$
Fitness evaluation: $O(Tpmn)$ (parallelized)
Selection: $O(Tp^2o)$ worst case
Crossover: $O(Tpn)$
Mutation: $O(Tpn)$
Replacement: $O(Tpn)$

Total complexity: $O(Tp(po + mn))$
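
The total follows by summing the component costs and keeping the dominant terms. As a short worked step, assuming $T \geq 1$ and $m \geq 1$ so that the $O(pn)$ and $O(Tpn)$ terms are absorbed by the fitness term:

```latex
\begin{aligned}
\underbrace{O(pn)}_{\text{initialization}}
+ \underbrace{O(Tpmn)}_{\text{fitness}}
+ \underbrace{O(Tp^2 o)}_{\text{selection}}
+ \underbrace{3\,O(Tpn)}_{\text{crossover, mutation, replacement}}
&= O\!\left(Tpmn + Tp^2 o\right) \\
&= O\!\left(Tp\,(po + mn)\right).
\end{aligned}
```

Selection dominates when $po \gg mn$ (large populations), and fitness evaluation dominates when $mn \gg po$ (large data matrices).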

Mathematical optimization interpretation

The algorithm can be viewed as a stochastic optimization method for:

$\begin{aligned} \text{minimize } & (f_1(P), f_2(P)) \\ \text{subject to } & P \in \mathbb{R}^{p \times n} \end{aligned}$

Where:
- $f_1$ measures data fidelity
- $f_2$ measures sparsity

The GA approach is particularly suitable because:

The problem is multi-objective
The search space is high-dimensional
The fitness landscape may be non-convex
Sparsity objective is non-differentiable

Special Cases and Relationships

Single objective case ($w_2 = 0$):
- Reduces to a least-squares problem in $P$
- Because $f_1$ is convex with minimizer $P_{ij} = \mu_j$ , the GA acts here as a derivative-free optimizer rather than as a means of escaping local minima
No Network Constraints:
- Mutation becomes standard Gaussian mutation
- Problem decomposes by genes
High Crossover Rate:
- Approaches a recombination-based search
- Faster convergence but reduced diversity

Biological interpretation

The mathematical operations correspond to biological concepts:

Population Initialization: Sampling from observed biological variability
Fitness: Measuring both functional efficacy (expression matching) and parsimony (sparsity)
Network Constraints: Incorporating known gene-gene interactions
Clustering: Respecting co-expressed gene modules

Conclusion

This mathematical foundation shows that the BioGA package implements a theoretically sound multi-objective evolutionary algorithm for genomic data optimization, with proper attention to both computational efficiency and biological relevance.

Session Info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.1 (2025-06-13)
#>  os       Ubuntu 24.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2025-07-09
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>  quarto   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  BiocManager   1.30.26 2025-06-05 [1] RSPM
#>  BiocStyle   * 2.36.0  2025-04-15 [1] Bioconduc~
#>  bookdown      0.43    2025-04-15 [1] RSPM
#>  bslib         0.9.0   2025-01-30 [1] RSPM
#>  cachem        1.1.0   2024-05-16 [1] RSPM
#>  cli           3.6.5   2025-04-23 [1] RSPM
#>  desc          1.4.3   2023-12-10 [1] RSPM
#>  digest        0.6.37  2024-08-19 [1] RSPM
#>  evaluate      1.0.4   2025-06-18 [1] RSPM
#>  fastmap       1.2.0   2024-05-15 [1] RSPM
#>  fs            1.6.6   2025-04-12 [1] RSPM
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM
#>  htmlwidgets   1.6.4   2023-12-06 [1] RSPM
#>  jquerylib     0.1.4   2021-04-26 [1] RSPM
#>  jsonlite      2.0.0   2025-03-27 [1] RSPM
#>  knitr         1.50    2025-03-16 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM
#>  pkgdown       2.1.3   2025-05-25 [1] any (@2.1.3)
#>  R6            2.6.1   2025-02-15 [1] RSPM
#>  ragg          1.4.0   2025-04-10 [1] RSPM
#>  rlang         1.1.6   2025-04-11 [1] RSPM
#>  rmarkdown     2.29    2024-11-04 [1] RSPM
#>  sass          0.4.10  2025-04-11 [1] RSPM
#>  sessioninfo   1.2.3   2025-02-05 [1] RSPM
#>  systemfonts   1.2.3   2025-04-30 [1] RSPM
#>  textshaping   1.0.1   2025-05-01 [1] RSPM
#>  xfun          0.52    2025-04-02 [1] RSPM
#>  yaml          2.3.10  2024-07-26 [1] RSPM
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.5.1/lib/R/site-library
#>  [3] /opt/R/4.5.1/lib/R/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Dany Mukesha

2025-07-09
