Skip to contents

Abstract

Genetic algorithms (GAs) are optimization techniques inspired by the process of natural selection and genetics. They operate by evolving a population of candidate solutions over successive generations, with each individual representing a potential solution to the optimization problem at hand. Through the application of genetic operators such as selection, crossover, and mutation, the genetic algorithms iteratively improve the population by eventually converging towards optimal or near-optimal solutions. In the field of genomics, where data sets are often large, complex, and high-dimensional, the genetic algorithms offer a good approach for addressing optimization challenges such as feature selection, parameter tuning, and model optimization. By harnessing the power of evolutionary principles, genetic algorithms can effectively explore the solution space, identify informative features, and optimize model parameters, leading to improved accuracy and interpretability in genomic data analysis. The BioGA package extends the capabilities of genetic algorithms to the realm of genomic data analysis, providing a suite of functions optimized for handling high throughput genomic data. Implemented in C++ for enhanced performance, BioGA offers efficient algorithms for tasks such as feature selection, classification, clustering, and more. By integrating flawlessly with the Bioconductor ecosystem, BioGA empowers researchers and analysts to take advantage of the power of genetic algorithms within their genomics workflows, facilitating the discovery of biological insights from large-scale genomic data sets.


Getting Started

Installation

To install this package, start R (version “4.4”) and enter:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install(pkgs = "BioGA", version = "devel", force = TRUE)

You can also install the package directly from GitHub using the devtools package:

devtools::install_github("danymukesha/BioGA")

With a simplified example, we illustrated the usage of BioGA for genetic algorithm optimization in the context of high throughput genomic data analysis. We showcased its interoperability with Bioconductor classes, demonstrating how genetic algorithm optimization can be integrated into existing genomics pipelines to improve the analysis and interpretation.

We demonstrated the usage of BioGA in the context of selecting the best combination of genes for predicting a certain trait, such as disease susceptibility.

Overview

Genomic data generally refers to the genetic information stored in a DNA of an organism. The DNA molecules are basically mmade up of sequence of nucleotides (adenine, thymine, cytosine, and guanine).

The genetic information likely provides clear understanding into various biological processes, such as gene expression, genetic variation, and evolutionary relationships.

In this context, genomic data could consist of gene expression profiles measured across different individuals (e.g., patients).

Where:

  • Each row in the genomic_data matrix represents a gene, and each column represents a patient sample.

and

  • Values in the matrix represent the expression levels of each gene in each patient sample.

As an example of genomic data, let’s consider have a table similar to the following:

      Sample 1   Sample 2   Sample 3   Sample 4
Gene1    0.1        0.2        0.3        0.4
Gene2    1.2        1.3        1.4        1.5
Gene3    2.3        2.2        2.1        2.0

In the table above, each row represents a gene (or genomic feature), and each column represents a sample. The values in the matrix represent some measurement of gene expression, such as mRNA levels or protein abundance, in each sample.

For instance, the value 0.1 in “Sample 1” for Gene1 indicates the expression level of Gene1 in “Sample 1”. Similarly, the value 2.2 in “Sample 2” for Gene3 indicates the expression level of Gene3 in “Sample 2”.

Genomic data can be used in various analyses, including genetic association studies, gene expression analysis, and comparative genomics. In the context of the evaluate_fitness_cpp function, genomic data is used to calculate fitness scores for individuals in a population.

Just to clarify, an individual is “feature” typically in the context of genetic algorithm optimization.

The population represents a set of candidate combinations of genes that could be predictive of the trait (a specific characteristic of an individual, which is determined by the genes).

The important information that we need to know is if a gene is part of this set or not. To do so, each individual in the population is represented by a binary vector indicating the presence or absence of each gene.

For example, a set of candidate genes (a population) might be represented as [1, 0, 1], indicating the presence of Gene1 and Gene3 but the absence of Gene2. The population undergoes genetic algorithm operations such as selection, crossover, mutation, and replacement to evolve towards individuals with higher predictive power for the trait.

Example of scenario

Consider an example scenario of using genetic algorithm optimization to select the best combination of genes for predicting a certain trait, such as disease susceptibility.

# define parameters
num_genes <- 1000
num_samples <- 10

# parameters for genetic algorithm
population_size <- 100
generations <- 10
mutation_rate <- 0.5
# generate an example of genomic data using "SummarizedExperiment"
counts <- matrix(rpois(num_genes * num_samples, lambda = 10),
    nrow = num_genes
)
rownames(counts) <- paste0("Gene", 1:num_genes)
colnames(counts) <- paste0("Sample", 1:num_samples)

# create "SummarizedExperiment" object
se <-
  SummarizedExperiment::SummarizedExperiment(assays = list(counts = counts))

# convert "SummarizedExperiment" to matrix for compatibility with `BioGA` 
genomic_data <- assay(se)

In the above example, counts is a matrix that represents the counts of gene expression levels across different samples. Each row corresponds to a gene, and each column corresponds to a sample. We used the SummarizedExperiment class to store this data, which is common Bioconductor class for representing rectangular feature x sample data, such as RNAseq count matrices or microarray data.

head(genomic_data)
#>       Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
#> Gene1       5      11       6      14       9      14       4      12       8
#> Gene2       6      10       3      10      11       6       6       8      10
#> Gene3       9      12       9       9      12      10      12       7      10
#> Gene4      11      11       7      13      13       6      11      16       6
#> Gene5      13      12       8       7       9       9      14      10       9
#> Gene6       4      10      11       7       9      13       7      16      11
#>       Sample10
#> Gene1       12
#> Gene2       10
#> Gene3       10
#> Gene4       11
#> Gene5       12
#> Gene6        8

Initialization

# (select the number of canditate you wish `population`)
population <- BioGA::initialize_population_cpp(genomic_data,
    population_size = 5
)

As we already mentioned, the population represents a set of candidate combinations of genes that could be predictive of the trait. The population undergoes then genetic algorithm operations such as selection, crossover, mutation, and replacement to evolve towards individuals with higher predictive power for the trait.

GA Optimization

fitness_history <- list()

start_time <- Sys.time()

generation <- 0
while (TRUE) {
    generation <- generation + 1

    # evaluate fitness
    fitness <- BioGA::evaluate_fitness_cpp(genomic_data, population)
    fitness_history[[generation]] <- fitness

    # check termination condition
    if (generation == generations) { # defined number of generations
        break
    }

    # selection
    selected_parents <- BioGA::selection_cpp(population,
        fitness,
        num_parents = 2
    )

    # crossover and Mutation
    offspring <- BioGA::crossover_cpp(selected_parents, offspring_size = 4)
    # (no mutation in this example)
    mutated_offspring <- BioGA::mutation_cpp(offspring, mutation_rate = mutation_rate)

    # replacement
    population <- BioGA::replacement_cpp(population, mutated_offspring,
        num_to_replace = 1
    )

    # calculate time progress
    elapsed_time <- difftime(Sys.time(), start_time, units = "secs")

    cat(
        "\rGeneration:", generation, "- Elapsed Time:",
        format(elapsed_time, units = "secs\n"), "     "
    ) 
}
#> Generation: 1 - Elapsed Time: 0.009012699 secs      Generation: 2 - Elapsed Time: 0.01123285 secs      Generation: 3 - Elapsed Time: 0.01172781 secs      Generation: 4 - Elapsed Time: 0.01219893 secs      Generation: 5 - Elapsed Time: 0.01266813 secs      Generation: 6 - Elapsed Time: 0.01313615 secs      Generation: 7 - Elapsed Time: 0.01360917 secs      Generation: 8 - Elapsed Time: 0.01407409 secs      Generation: 9 - Elapsed Time: 0.01455975 secs

Fitness calculation

The fitness calculation described in the provided code calculates a measure of dissimilarity between the gene expression profiles of individuals in the population and the genomic data. This measure of dissimilarity, or “fitness”, quantifies how well the gene expression profile of an individual matches the genomic data.

Mathematically, the fitness calculation can be represented as follows:

Let:

  • gijkg_{ijk} be the gene expression level of gene jj in individual ii and sample kk from the genomic data.

  • pijp_{ij} be the gene expression level of gene jj in individual ii from the population.

  • NN be the number of individuals in the population.

  • GG be the number of genes.

  • SS be the number of samples.

Then, the fitness FiF_i for individual ii in the population can be calculated as the sum of squared differences between the gene expression levels of individual ii and the corresponding gene expression levels in the genomic data, across all genes and samples: Fi=j=1Gk=1S(gijkpij)2 F_i = \sum_{j=1}^{G} \sum_{k=1}^{S} (g_{ijk} - p_{ij})^2

This fitness calculation aims to minimize the overall dissimilarity between the gene expression profiles of individuals in the population and the genomic data. Individuals with lower fitness scores are considered to have gene expression profiles that are more similar to the genomic data and are therefore more likely to be selected for further optimization in the genetic algorithm.

# Plot fitness change over generations
BioGA::plot_fitness_history(fitness_history)

This showcases the integration of genetic algorithms with genomic data analysis and highlights the potential of genetic algorithms for feature selection in genomics.

Here’s how BioGA could work in the context of high throughput genomic data analysis:

  1. Problem Definition: BioGA starts with a clear definition of the problem to be solved. This could include tasks such as identifying genetic markers associated with a particular disease, optimizing gene expression patterns, or clustering genomic data to identify patterns or groupings.

  2. Representation: Genomic data would need to be appropriately represented for use within the genetic algorithm framework. This might involve encoding the data in a suitable format, such as binary strings representing genes or chromosomes.

  3. Fitness Evaluation: BioGA would define a fitness function that evaluates how well a particular solution performs with respect to the problem being addressed. In the context of genomic data analysis, this could involve measures such as classification accuracy, correlation with clinical outcomes, or fitness to a particular model.

  4. Initialization: The algorithm would initialize a population of candidate solutions, typically randomly or using some heuristic method. Each solution in the population represents a potential solution to the problem at hand.

  5. Genetic Operations: BioGA would apply genetic operators such as selection, crossover, and mutation to evolve the population over successive generations. Selection identifies individuals with higher fitness to serve as parents for the next generation. Crossover combines genetic material from two parent solutions to produce offspring. Mutation introduces random changes to the offspring to maintain genetic diversity.

  6. Termination Criteria: The algorithm would continue iterating through generations until a termination criterion is met. This could be a maximum number of generations, reaching a satisfactory solution, or convergence of the population.

  7. Result Analysis: Once the algorithm terminates, BioGA would analyze the final population to identify the best solution(s) found. This could involve further validation or interpretation of the results in the context of the original problem.

Other applications of BioGA in genomic data analysis could include genome-wide association studies (GWAS), gene expression analysis, pathway analysis, and predictive modeling for personalized medicine, among others. By leveraging genetic algorithms, BioGA offers a powerful approach to exploring complex genomic datasets and identifying meaningful patterns and associations.

Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       Ubuntu 22.04.5 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2024-12-04
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  abind                  1.4-8     2024-09-12 [1] RSPM
#>  animation              2.7       2021-10-07 [1] RSPM
#>  Biobase              * 2.66.0    2024-10-29 [1] Bioconduc~
#>  BiocGenerics         * 0.52.0    2024-10-29 [1] Bioconduc~
#>  BiocManager            1.30.25   2024-08-28 [1] RSPM
#>  BiocStyle            * 2.34.0    2024-10-29 [1] Bioconduc~
#>  biocViews              1.74.0    2024-10-29 [1] Bioconduc~
#>  BioGA                * 0.99.6    2024-12-04 [1] local
#>  bitops                 1.0-9     2024-10-03 [1] RSPM
#>  bookdown               0.41      2024-10-16 [1] RSPM
#>  bslib                  0.8.0     2024-07-29 [1] RSPM
#>  cachem                 1.1.0     2024-05-16 [1] RSPM
#>  cli                    3.6.3     2024-06-21 [1] RSPM
#>  colorspace             2.1-1     2024-07-26 [1] RSPM
#>  crayon                 1.5.3     2024-06-20 [1] RSPM
#>  DelayedArray           0.32.0    2024-10-29 [1] Bioconduc~
#>  desc                   1.4.3     2023-12-10 [1] RSPM
#>  digest                 0.6.37    2024-08-19 [1] RSPM
#>  evaluate               1.0.1     2024-10-10 [1] RSPM
#>  fansi                  1.0.6     2023-12-08 [1] RSPM
#>  farver                 2.1.2     2024-05-13 [1] RSPM
#>  fastmap                1.2.0     2024-05-15 [1] RSPM
#>  fs                     1.6.5     2024-10-30 [1] RSPM
#>  GenomeInfoDb         * 1.42.1    2024-11-28 [1] Bioconduc~
#>  GenomeInfoDbData       1.2.13    2024-12-04 [1] Bioconductor
#>  GenomicRanges        * 1.58.0    2024-10-29 [1] Bioconduc~
#>  ggplot2                3.5.1     2024-04-23 [1] RSPM
#>  glue                   1.8.0     2024-09-30 [1] RSPM
#>  graph                  1.84.0    2024-10-29 [1] Bioconduc~
#>  gtable                 0.3.6     2024-10-25 [1] RSPM
#>  htmltools              0.5.8.1   2024-04-04 [1] RSPM
#>  httr                   1.4.7     2023-08-15 [1] RSPM
#>  IRanges              * 2.40.0    2024-10-29 [1] Bioconduc~
#>  jquerylib              0.1.4     2021-04-26 [1] RSPM
#>  jsonlite               1.8.9     2024-09-20 [1] RSPM
#>  knitr                  1.49      2024-11-08 [1] RSPM
#>  labeling               0.4.3     2023-08-29 [1] RSPM
#>  lattice                0.22-6    2024-03-20 [3] CRAN (R 4.4.2)
#>  lifecycle              1.0.4     2023-11-07 [1] RSPM
#>  magrittr               2.0.3     2022-03-30 [1] RSPM
#>  Matrix                 1.7-1     2024-10-18 [3] CRAN (R 4.4.2)
#>  MatrixGenerics       * 1.18.0    2024-10-29 [1] Bioconduc~
#>  matrixStats          * 1.4.1     2024-09-08 [1] RSPM
#>  munsell                0.5.1     2024-04-01 [1] RSPM
#>  pillar                 1.9.0     2023-03-22 [1] RSPM
#>  pkgconfig              2.0.3     2019-09-22 [1] RSPM
#>  pkgdown                2.1.1     2024-09-17 [1] any (@2.1.1)
#>  R6                     2.5.1     2021-08-19 [1] RSPM
#>  ragg                   1.3.3     2024-09-11 [1] RSPM
#>  RBGL                   1.82.0    2024-10-29 [1] Bioconduc~
#>  Rcpp                   1.0.13-1  2024-11-02 [1] RSPM
#>  RCurl                  1.98-1.16 2024-07-11 [1] RSPM
#>  rlang                  1.1.4     2024-06-04 [1] RSPM
#>  rmarkdown              2.29      2024-11-04 [1] RSPM
#>  RUnit                  0.4.33    2024-02-22 [1] RSPM
#>  S4Arrays               1.6.0     2024-10-29 [1] Bioconduc~
#>  S4Vectors            * 0.44.0    2024-10-29 [1] Bioconduc~
#>  sass                   0.4.9     2024-03-15 [1] RSPM
#>  scales                 1.3.0     2023-11-28 [1] RSPM
#>  sessioninfo            1.2.2     2021-12-06 [1] RSPM
#>  SparseArray            1.6.0     2024-10-29 [1] Bioconduc~
#>  SummarizedExperiment * 1.36.0    2024-10-29 [1] Bioconduc~
#>  systemfonts            1.1.0     2024-05-15 [1] RSPM
#>  textshaping            0.4.0     2024-05-24 [1] RSPM
#>  tibble                 3.2.1     2023-03-20 [1] RSPM
#>  UCSC.utils             1.2.0     2024-10-29 [1] Bioconduc~
#>  utf8                   1.2.4     2023-10-22 [1] RSPM
#>  vctrs                  0.6.5     2023-12-01 [1] RSPM
#>  withr                  3.0.2     2024-10-28 [1] RSPM
#>  xfun                   0.49      2024-10-31 [1] RSPM
#>  XML                    3.99-0.17 2024-06-25 [1] RSPM
#>  XVector                0.46.0    2024-10-29 [1] Bioconduc~
#>  yaml                   2.3.10    2024-07-26 [1] RSPM
#>  zlibbioc               1.52.0    2024-10-29 [1] Bioconduc~
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.4.2/lib/R/site-library
#>  [3] /opt/R/4.4.2/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────