Overview
UStatDecouple implements probabilistic decoupling techniques for U-statistics, transforming dependent sample statistics into averages of independent sequences. This package provides rigorous statistical tools for genomic data analysis, enabling accurate variance estimation, hypothesis testing, and confidence interval construction for complex genomic metrics.
The package is based on the theoretical framework developed by de la Peña and colleagues for decoupling inequalities in probability theory, specifically optimized for biological sequence analysis and genomic applications.
The framework is realized through an S4 object-oriented design and a high-performance C++ (Rcpp) back end, which addresses the quadratic ($O(n^2)$) complexity of U-statistics. Parallel processing is supported via BiocParallel, and pre-defined biological kernels facilitate DNA sequence and gene expression analyses. Vignettes and case studies illustrating practical applications of the methods are provided with the package.
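To make the quadratic cost concrete, here is a minimal base R sketch of a naive order-2 U-statistic with a Hamming-distance kernel. The helpers `hamming` and `u_statistic` are hypothetical names written for this illustration; they are not the package's API, and the package's C++ back end may evaluate the kernel differently.

```r
# Hypothetical helper: Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Naive order-2 U-statistic: average the kernel over all unordered pairs.
# The number of kernel evaluations is choose(n, 2), which grows as n^2.
u_statistic <- function(x, kernel) {
  n <- length(x)
  pairs <- combn(n, 2)  # 2 x choose(n, 2) matrix of index pairs
  vals <- apply(pairs, 2, function(ix) kernel(x[[ix[1]]], x[[ix[2]]]))
  mean(vals)
}

seqs <- c("ACGT", "ACGA", "TCGA")
u <- u_statistic(seqs, hamming)       # pairwise distances are 1, 2, 1
n_pairs <- choose(length(seqs), 2)    # 3 pairs for n = 3
```

For n = 100 the same loop would need `choose(100, 2)` = 4950 kernel evaluations per pass, which is why repeating it over many replicates becomes expensive in pure R.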
Installation
Install the package directly from GitHub using the devtools package:
devtools::install_github("danymukesha/UStatDecouple")
Quick Start
The example below performs a decoupled U-statistic analysis on DNA sequence data: it computes kernel-based distances, estimates the decoupled distribution, and plots the result.
library(UStatDecouple)
data <- load_example_sequences()
kernel <- create_kernel(hamming_distance_kernel, "Hamming Distance")
result <- decouple_u_stat(data, kernel, B = 500)
print(result)
#> DecoupleResult object:
#> Original U-statistic: 2.0000
#> Decoupled mean: 1.4696
#> Decoupled SD: 0.2567
#> Kernel: Hamming Distance
#> Method: Friedman-de la Pena Decoupling
#> P-value: 0.0388
#> Z-score: 2.0663
#> Significance: * (p = 0.0388)
plot(result)
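The Z-score and P-value in the printed summary appear to follow from the other three numbers under a two-sided normal approximation. This is inferred from the output rather than stated by the package, so treat the formula below as an assumption. The values are copied from the printout above.

```r
# Values copied from the printed DecoupleResult above.
u_orig   <- 2.0000   # Original U-statistic
dec_mean <- 1.4696   # Decoupled mean
dec_sd   <- 0.2567   # Decoupled SD

# Assumed relationship: standardize the observed statistic against the
# decoupled distribution, then take a two-sided normal tail probability.
z <- (u_orig - dec_mean) / dec_sd
p <- 2 * pnorm(-abs(z))

round(z, 3)  # about 2.066
round(p, 3)  # about 0.039
```

Both values agree with the reported Z-score and P-value up to rounding of the printed inputs.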
Biological Applications
DNA Sequence Diversity Analysis
result <- run_genomic_case_study(
num_sequences = 15,
sequence_length = 100,
B = 200
)
#>
#> === Biological Interpretation ===
#> Original mean Hamming distance: 74.0667
#> Expected distance under independence: 69.1327
#> Observed distance is 3.80 standard deviations from independence expectation
#> Significant evidence of dependence between sequences (p < 0.05)
#> This suggests shared evolutionary history or functional constraints
Gene Expression Correlation Analysis
expr_result <- analyze_gene_expression_correlations(
num_genes = 30,
num_samples = 20,
B = 500
)
#>
#> === Gene Expression Analysis ===
#> Original mean absolute correlation: 0.2255
#> Expected correlation under independence: 0.2512
#> Variance inflation factor: Inf
#> No significant evidence of co-expression structure (p >= 0.05)
#> Genes appear to be expressed independently
Theoretical Background
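The package implements the Friedman-de la Peña decoupling method, which transforms dependent U-statistics:
$$U_n = \frac{1}{n(n-1)} \sum_{i \neq j} h(X_i, X_j)$$
into independent versions:
$$\tilde{U}_n = \frac{1}{n(n-1)} \sum_{i \neq j} h(X_i, X'_j)$$
where $\{X'_j\}$ are independent copies of $\{X_j\}$. This transformation enables the use of classical statistical methods designed for independent data.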
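The decoupling idea can be sketched in base R: replace the second argument of the kernel with values drawn from an independent copy of the sample, and repeat B times to build a reference distribution. This is an illustrative sketch, not the package's implementation. The function `decoupled_stat`, the absolute-difference kernel, and the use of bootstrap resamples as stand-ins for independent copies are all assumptions made for this example.

```r
# Kernel assumed for illustration: absolute difference of two numbers.
abs_diff <- function(a, b) abs(a - b)

# Average h(x_i, y_j) over all ordered pairs with i != j.
decoupled_stat <- function(x, y, kernel) {
  n <- length(x)
  vals <- outer(seq_len(n), seq_len(n),
                Vectorize(function(i, j) kernel(x[i], y[j])))
  sum(vals[row(vals) != col(vals)]) / (n * (n - 1))
}

set.seed(1)
x <- rnorm(20)
B <- 200

# Assumption: each "independent copy" is a bootstrap resample of x.
dec <- replicate(B, decoupled_stat(x, sample(x, replace = TRUE), abs_diff))

# Passing y = x recovers the ordinary U-statistic for a symmetric kernel.
u_orig <- decoupled_stat(x, x, abs_diff)

# Standardize the observed statistic against the decoupled replicates.
z <- (u_orig - mean(dec)) / sd(dec)
```

The loop costs roughly B times n^2 kernel evaluations, which is the workload a compiled back end and parallel workers are meant to absorb.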
Performance
The C++ implementation provides significant speed improvements:
| Dataset Size | Pure R (seconds) | C++ (seconds) | Speedup |
|---|---|---|---|
| n=10, B=100 | 2.1 | 0.05 | 42x |
| n=50, B=500 | 156.3 | 1.8 | 87x |
| n=100, B=1000 | 1242.7 | 12.5 | 99x |