Getting Started with {mwbenchr}

Abstract

Accessing and integrating data from the Metabolomics Workbench (MW) is often cumbersome with the existing tools. This interface simplifies programmatic access to MW’s REST API, enabling efficient retrieval of study metadata, sample information, and compound annotations. By integrating MW’s extensive datasets with R, it enhances data exploration, reproducibility, and seamless integration into bioinformatics workflows. This tool provides a streamlined approach for the users in metabolomics and systems biology to analyze large-scale metabolomics data with ease.

Introduction

The mwbenchr package provides a R interface to the Metabolomics Workbench REST API, enabling programmatic access to one of the largest metabolomics data repositories. This vignette demonstrates how to use mwbenchr to retrieve study data, compound information, and perform common metabolomics data analysis tasks.

What is Metabolomics Workbench?

The Metabolomics Workbench is a public repository for metabolomics data, metadata, and analysis tools. It contains thousands of studies across various species, sample types, and analytical platforms, making it an invaluable resource for metabolomics research.

Getting Started

Set-up libraries

library(mwbenchr)
library(dplyr)
library(ggplot2)

Creating a Client

All interactions with the API begin by creating a client object:

# Caching enabled client (recommended for repeated queries: enable caching)
client <- mw_rest_client(cache = TRUE)

# Custom client configuration
client <- mw_rest_client(
    cache = TRUE,
    cache_dir = tempdir(),
    timeout = 60
)

# Basic client with default settings
client <- mw_rest_client()

client
#> Metabolomics Workbench REST Client
#>   Base URL: https://www.metabolomicsworkbench.org/rest 
#>   Caching: Disabled 
#>   Timeout: 30 seconds

Exploring Available Endpoints

Get an overview of available functions:

list_endpoints(client)

Working with Studies

Discovering Studies

Start by getting an overview of available studies:

# Obtain all studies (this returns a lot of data)
all_studies <- get_study_summary(client)

head(all_studies, 3)
#> # A tibble: 3 × 6
#>   study_id study_title         species institute analysis_type number_of_samples
#>   <chr>    <chr>               <chr>   <chr>     <chr>         <chr>            
#> 1 ST004148 Surface Acoustic W… Homo s… Universi… LC-MS         42               
#> 2 ST004144 Metabolic rewiring… Homo s… Brunel U… LC-MS         294              
#> 3 ST004134 Metabolomic analys… Mus mu… Indian I… LC-MS         13
colnames(all_studies)
#> [1] "study_id"          "study_title"       "species"          
#> [4] "institute"         "analysis_type"     "number_of_samples"

Filtering and Exploring Studies

# Filter for human studies
human_studies <- all_studies %>%
    filter(grepl("Human", study_title, ignore.case = TRUE)) %>%
    select(
        study_id, study_title # , submit_date, last_modified
    )

# Look for specific topics
diabetes_studies <- all_studies %>%
    filter(grepl("diabetes|glucose", study_title, ignore.case = TRUE))

print(paste("Found", nrow(diabetes_studies), "diabetes-related studies"))
#> [1] "Found 113 diabetes-related studies"

Getting Detailed Study Information

Once you’ve identified studies of interest, get detailed information:

# Get experimental factors (sample metadata) for a study
study_id <- "ST000001"
factors <- get_study_factors(client, study_id)
head(factors)
#> # A tibble: 6 × 6
#>   study_id local_sample_id sample_source factors           mb_sample_id raw_data
#>   <chr>    <chr>           <chr>         <chr>             <chr>        <chr>   
#> 1 ST000001 LabF_115904     Plant         Arabidopsis Geno… SA000019     ""      
#> 2 ST000001 LabF_115909     Plant         Arabidopsis Geno… SA000020     ""      
#> 3 ST000001 LabF_115914     Plant         Arabidopsis Geno… SA000021     ""      
#> 4 ST000001 LabF_115919     Plant         Arabidopsis Geno… SA000022     ""      
#> 5 ST000001 LabF_115924     Plant         Arabidopsis Geno… SA000023     ""      
#> 6 ST000001 LabF_115929     Plant         Arabidopsis Geno… SA000024     ""

# Obtain the list of metabolites measured
metabolites <- get_study_metabolites(client, study_id)
head(metabolites)
#> # A tibble: 6 × 5
#>   study_id analysis_id analysis_summary       metabolite_name        refmet_name
#>   <chr>    <chr>       <chr>                  <chr>                  <chr>      
#> 1 ST000001 AN000001    GCMS positive ion mode 1,2,4-benzenetriol     1,2,4-Trih…
#> 2 ST000001 AN000001    GCMS positive ion mode 1-monostearin          MG 18:0/0:…
#> 3 ST000001 AN000001    GCMS positive ion mode 2-hydroxyvaleric acid  2-Hydroxyv…
#> 4 ST000001 AN000001    GCMS positive ion mode 3-phosphoglycerate     3-Phosphog…
#> 5 ST000001 AN000001    GCMS positive ion mode 5-hydroxynorvaline NI… 2-Amino-5-…
#> 6 ST000001 AN000001    GCMS positive ion mode adenosine              Adenosine

# The complete data matrix
study_data <- get_study_data(client, study_id)
dim(study_data)
#> [1] 2448    8

Advanced Study Search with MetStat

The MetStat functionality allows sophisticated searching across multiple criteria:

# Search for human blood LCMS studies
human_blood_lcms <- search_metstat(client,
    analysis_type = "LCMS",
    species = "Human",
    sample_source = "Blood"
)

# Search for mouse liver studies with any analytical method
mouse_liver <- search_metstat(client,
    species = "Mouse",
    sample_source = "Liver"
)

# Search for studies containing specific metabolites
glucose_studies <- search_metstat(client, refmet_name = "Glucose")

# Complex search combining multiple criteria
complex_search <- search_metstat(client,
    analysis_type = "LCMS",
    polarity = "POSITIVE",
    species = "Human",
    sample_source = "Urine"
)

Working with Compounds

Retrieving Compound Information

Get detailed information about specific compounds:

# Get compound by registry number
compound_1 <- get_compound_by_regno(client, "1")
print(compound_1$systematic_name)
#> NULL

# Get compound by PubChem CID
glucose_pubchem <- get_compound_by_pubchem_cid(client, "5793")

# Get specific fields only
compound_name <- get_compound_by_regno(client, "1", fields = "name")

# Get classification hierarchy
classification <- get_compound_classification(client, "regno", "1")
head(classification)
#> # A tibble: 1 × 5
#>   regno name                                  sys_name lm_category lm_main_class
#>   <chr> <chr>                                 <chr>    <chr>       <chr>        
#> 1 1     2-methoxy-12-methyloctadec-17-en-5-y… 2-metho… Fatty Acyl… Other Fatty …

Downloading Structure Files

# Download MOL file
mol_content <- download_compound_structure(client, "regno", "1", "molfile")

# Save to file
# writeLines(mol_content, "glucose.mol")

# Download SDF format
sdf_content <- download_compound_structure(client, "regno", "1", "sdf")

RefMet Integration

RefMet provides standardized metabolite nomenclature:

# Get RefMet information for a metabolite
glucose_refmet <- get_refmet_by_name(client, "Glucose")

# Standardize alternative names
standardized <- standardize_to_refmet(client, "D-glucose")
print(standardized$refmet_name)
#> [1] "Glucose"

# Get all RefMet names (large dataset - use caching!)
client_cached <- mw_rest_client(cache = TRUE)
# all_refmet <- get_all_refmet_names(client_cached)

Mass Spectrometry Tools

Searching by Mass

Find compounds based on accurate mass measurements:

# Search in RefMet database
mass_matches <- search_by_mass(client,
    db = "REFMET",
    mz = 180.063,
    ion_type = "M+H",
    tolerance = 0.01
)

head(mass_matches)
#> # A tibble: 6 × 10
#>   `Input m/z` `Matched m/z` Delta Name  `Systematic name` Formula Ion   Category
#>         <dbl>         <dbl> <dbl> <chr> <chr>             <chr>   <chr> <chr>   
#> 1        180.          180.  4e-4 1D-c… (1R,2R,3R,4R,5S,… C6H12O6 Neut… Organic…
#> 2        180.          180.  4e-4 Allo… (3S,4S,5R,6S)-6-… C6H12O6 Neut… Organic…
#> 3        180.          180.  4e-4 D-Ta… (3S,4S,5R)-1,3,4… C6H12O6 Neut… Organic…
#> 4        180.          180.  4e-4 Fruc… (3S,4R,5R)-1,3,4… C6H12O6 Neut… Carbohy…
#> 5        180.          180.  4e-4 Gala… (3R,4S,5R,6R)-6-… C6H12O6 Neut… Carbohy…
#> 6        180.          180.  4e-4 Gluc… (3R,4S,5S,6R)-6-… C6H12O6 Neut… Carbohy…
#> # ℹ 2 more variables: `Main class` <chr>, `Sub class` <chr>

# Search in lipids database with different ion type
lipid_matches <- search_by_mass(client,
    db = "LIPIDS",
    mz = 760.585,
    ion_type = "M+H",
    tolerance = 0.05
)

# Compare results across databases
mb_matches <- search_by_mass(client,
    db = "MB",
    mz = 180.063,
    ion_type = "M+H",
    tolerance = 0.01
)

Calculating Exact Masses

Calculate theoretical masses for lipid species:

# Calculate exact mass for phosphatidylcholine
pc_mass <- calculate_exact_mass(client, "PC(34:1)", "M+H")
print(paste("PC(34:1) [M+H]+ =", pc_mass$mz))
#> [1] "PC(34:1) [M+H]+ = 760.585077"

# Different ion types
pc_sodium <- calculate_exact_mass(client, "PC(34:1)", "M+Na")
pc_negative <- calculate_exact_mass(client, "PC(34:1)", "M-H")

# Compare masses
mass_comparison <- data.frame(
    ion_type = c("M+H", "M+Na", "M-H"),
    exact_mass = c(pc_mass$mz, pc_sodium$mz, pc_negative$mz)
)
print(mass_comparison)
#>   ion_type exact_mass
#> 1      M+H   760.5851
#> 2     M+Na   782.5670
#> 3      M-H   758.5705

Data Integration and Analysis Workflows

Complete Metabolomics Study Analysis

Here’s a complete workflow for analyzing a metabolomics study:

# 1. Find relevant studies
studies <- search_metstat(client,
    analysis_type = "LCMS",
    species = "Human",
    sample_source = "Blood"
)

# Select a study of interest
target_study <- studies$study[1]

# 2. Get study metadata
study_info <- get_study_summary(client, target_study)
factors <- get_study_factors(client, target_study)
metabolites <- get_study_metabolites(client, target_study)

# 3. Get the data matrix
data_matrix <- get_study_data(client, target_study)

# 4. Basic data exploration
cat("Study:", study_info$study_title, "\n")
#> Study: Surface Acoustic Wave Hemolysis Assay for Evaluating Stored Red Blood Cells
cat("Number of samples:", ncol(data_matrix) - 7, "\n") # Subs. metadata columns
#> Number of samples: 1
cat("Number of metabolites:", nrow(data_matrix), "\n")
#> Number of metabolites: 7014

# 5. Data processing
numeric_data <- data_matrix %>%
    select(-c(study_id:units)) %>% # Remove metadata columns
    mutate(across(everything(), as.numeric))

# Calculate basic statistics
metabolite_stats <- data_matrix %>%
    select(metabolite_name, refmet_name) %>%
    bind_cols(
        mean_intensity = rowMeans(numeric_data, na.rm = TRUE),
        cv = apply(numeric_data, 1, function(x) {
            sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
        })
    )

head(metabolite_stats)
#> # A tibble: 6 × 4
#>   metabolite_name        refmet_name                mean_intensity    cv
#>   <chr>                  <chr>                               <dbl> <dbl>
#> 1 10-Formyldihydrofolate 10-Formyldihydrofolic acid           NaN     NA
#> 2 10-Formyldihydrofolate 10-Formyldihydrofolic acid         11905.    NA
#> 3 10-Formyldihydrofolate 10-Formyldihydrofolic acid         14177.    NA
#> 4 10-Formyldihydrofolate 10-Formyldihydrofolic acid         13401.    NA
#> 5 10-Formyldihydrofolate 10-Formyldihydrofolic acid         35540.    NA
#> 6 10-Formyldihydrofolate 10-Formyldihydrofolic acid         15646.    NA

Standardizing Metabolite Names

Standardize metabolite names across different studies:

# Get metabolites from multiple studies
study1_metabolites <- get_study_metabolites(client, "ST000001")
study2_metabolites <- get_study_metabolites(client, "ST000002")

# Combine and get unique names
all_metabolite_names <- unique(c(
    study1_metabolites$metabolite_name,
    study2_metabolites$metabolite_name
))

# Standardize names (in practice, you'd do this in batches)
standardized_names <- lapply(all_metabolite_names[1:5], function(name) {
    tryCatch(
        {
            result <- standardize_to_refmet(client, name)
            data.frame(
                original_name = name,
                refmet_name = result$refmet_name %||% NA,
                stringsAsFactors = FALSE
            )
        },
        error = function(e) {
            data.frame(
                original_name = name,
                refmet_name = NA,
                stringsAsFactors = FALSE
            )
        }
    )
})

standardization_results <- do.call(rbind, standardized_names)
print(standardization_results)
#>             original_name                     refmet_name
#> 1      1,2,4-benzenetriol         1,2,4-Trihydroxybenzene
#> 2           1-monostearin                 MG 18:0/0:0/0:0
#> 3   2-hydroxyvaleric acid           2-Hydroxyvaleric acid
#> 4      3-phosphoglycerate          3-Phosphoglyceric acid
#> 5 5-hydroxynorvaline NIST 2-Amino-5-hydroxypentanoic acid

Cross-Study Comparisons

Compare metabolite coverage across studies:

# Get metabolites from multiple studies
studies_to_compare <- c("ST000001", "ST000002", "ST000003")

metabolite_coverage <- lapply(studies_to_compare, function(study_id) {
    metabolites <- get_study_metabolites(client, study_id)
    data.frame(
        study_id = study_id,
        metabolite_count = nrow(metabolites),
        unique_refmet = length(unique(metabolites$refmet_name)),
        stringsAsFactors = FALSE
    )
})

coverage_df <- do.call(rbind, metabolite_coverage)
print(coverage_df)
#>   study_id metabolite_count unique_refmet
#> 1 ST000001              102            99
#> 2 ST000002              142           141
#> 3 ST000003               51            51

Visualization Examples

Study Statistics

# Visualize study submission over time
studies_with_dates <- all_studies %>%
    filter(!is.na(submit_date)) %>%
    mutate(
        submit_year = as.numeric(format(as.Date(submit_date), "%Y")),
        analysis_type = case_when(
            grepl("LCMS", study_title, ignore.case = TRUE) ~ "LC-MS",
            grepl("GCMS", study_title, ignore.case = TRUE) ~ "GC-MS",
            grepl("NMR", study_title, ignore.case = TRUE) ~ "NMR",
            TRUE ~ "Other"
        )
    ) %>%
    filter(submit_year >= 2010 & submit_year <= 2023)

# Plot submissions by year and analysis type
ggplot(studies_with_dates, aes(x = submit_year, fill = analysis_type)) +
    geom_bar(position = "stack") +
    labs(
        title = "Metabolomics Workbench Study Submissions by Year",
        x = "Submission Year",
        y = "Number of Studies",
        fill = "Analysis Type"
    ) +
    theme_minimal()

Mass Accuracy Visualization

# Compare mass accuracy across different tolerances
tolerance_values <- c(0.001, 0.005, 0.01, 0.05, 0.1)
target_mz <- 180.063

mass_accuracy_results <- lapply(tolerance_values, function(tol) {
    matches <- search_by_mass(client, "REFMET", target_mz, "M+H", tol)
    data.frame(
        tolerance = tol,
        matches = nrow(matches),
        stringsAsFactors = FALSE
    )
})

accuracy_df <- do.call(rbind, mass_accuracy_results)

ggplot(accuracy_df, aes(x = tolerance, y = matches)) +
    geom_line() +
    geom_point() +
    scale_x_log10() +
    labs(
        title = "Number of Mass Matches vs. Tolerance",
        x = "Mass Tolerance (Da, log scale)",
        y = "Number of Matches"
    ) +
    theme_minimal()

Error Handling and Best Practices

Robust API Calls

Making safely API calls with retries

safe_api_call <- function(func, ..., max_attempts = 3) {
    for (attempt in 1:max_attempts) {
        result <- tryCatch(
            {
                func(...)
            },
            error = function(e) {
                if (attempt == max_attempts) {
                    stop(
                        "API call failed after ", max_attempts, " attempts: ",
                        e$message
                    )
                }
                message("Attempt ", attempt, " failed, retrying...")
                Sys.sleep(1)
                NULL
            }
        )

        if (!is.null(result)) {
            return(result)
        }
    }
}

study_data <- safe_api_call(get_study_summary, client, "ST000001")

Batch Processing

Processing multiple studies efficiently

process_studies <- function(client, study_ids, max_concurrent = 5) {
    results <- list()

    for (i in seq_along(study_ids)) {
        study_id <- study_ids[i]

        cat("Processing study", i, "of", length(study_ids), ":", study_id, "\n")

        tryCatch(
            {
                summary <- get_study_summary(client, study_id)
                metabolites <- get_study_metabolites(client, study_id)

                results[[study_id]] <- list(
                    summary = summary,
                    metabolite_count = nrow(metabolites),
                    metabolites = metabolites$metabolite_name
                )

                if (i %% max_concurrent == 0) {
                    Sys.sleep(1)
                }
            },
            error = function(e) {
                warning("Failed to process study ", study_id, ": ", e$message)
                results[[study_id]] <- list(error = e$message)
            }
        )
    }

    return(results)
}

# simple usage (with a small subset)
study_subset <- all_studies$study_id[1:3]
batch_results <- process_studies(client, study_subset)
#> Processing study 1 of 3 : ST004148 
#> Processing study 2 of 3 : ST004144 
#> Processing study 3 of 3 : ST004134

Performance Optimization

Using Caching

# set up caching for expensive operations
cached_client <- mw_rest_client(
    cache = TRUE,
    cache_dir = file.path(tempdir(), "mwbenchr_cache")
)
refmet_data <- get_all_refmet_names(cached_client) # Large dataset
study_list <- get_study_summary(cached_client) # Frequently accessed

# subsequent calls will use cached data
refmet_data_cached <- get_all_refmet_names(cached_client) # Fast!

Memory Management

# For large datasets, process in chunks
process_large_study <- function(client, study_id) {
    # Get metadata first (small)
    summary <- get_study_summary(client, study_id)
    metabolites <- get_study_metabolites(client, study_id)

    # Check data size before loading full matrix
    n_metabolites <- nrow(metabolites)

    if (n_metabolites > 1000) {
        warning(
            "Large study detected (", n_metabolites, " metabolites). ",
            "Consider processing in chunks."
        )
    }

    # Load full data only if reasonable size
    if (n_metabolites <= 5000) {
        data_matrix <- get_study_data(client, study_id)
        return(list(
            summary = summary,
            metabolites = metabolites,
            data = data_matrix
        ))
    } else {
        return(list(
            summary = summary,
            metabolites = metabolites,
            data = NULL,
            note = "Data matrix too large - not loaded"
        ))
    }
}

Session Info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.1 (2025-06-13)
#>  os       Ubuntu 24.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2025-09-01
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>  quarto   NA
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  BiocManager    1.30.26 2025-06-05 [1] RSPM
#>  BiocStyle    * 2.36.0  2025-04-15 [1] Bioconduc~
#>  bit            4.6.0   2025-03-06 [1] RSPM
#>  bit64          4.6.0-1 2025-01-16 [1] RSPM
#>  bookdown       0.44    2025-08-21 [1] RSPM
#>  bslib          0.9.0   2025-01-30 [1] RSPM
#>  cachem         1.1.0   2024-05-16 [1] RSPM
#>  cli            3.6.5   2025-04-23 [1] RSPM
#>  crayon         1.5.3   2024-06-20 [1] RSPM
#>  curl           7.0.0   2025-08-19 [1] RSPM
#>  desc           1.4.3   2023-12-10 [1] RSPM
#>  digest         0.6.37  2024-08-19 [1] RSPM
#>  dplyr        * 1.1.4   2023-11-17 [1] RSPM
#>  evaluate       1.0.5   2025-08-27 [1] RSPM
#>  farver         2.1.2   2024-05-13 [1] RSPM
#>  fastmap        1.2.0   2024-05-15 [1] RSPM
#>  fs             1.6.6   2025-04-12 [1] RSPM
#>  generics       0.1.4   2025-05-09 [1] RSPM
#>  ggplot2      * 3.5.2   2025-04-09 [1] RSPM
#>  glue           1.8.0   2024-09-30 [1] RSPM
#>  gtable         0.3.6   2024-10-25 [1] RSPM
#>  htmltools      0.5.8.1 2024-04-04 [1] RSPM
#>  httr2          1.2.1   2025-07-22 [1] RSPM
#>  jquerylib      0.1.4   2021-04-26 [1] RSPM
#>  jsonlite       2.0.0   2025-03-27 [1] RSPM
#>  knitr          1.50    2025-03-16 [1] RSPM
#>  labeling       0.4.3   2023-08-29 [1] RSPM
#>  lifecycle      1.0.4   2023-11-07 [1] RSPM
#>  magrittr       2.0.3   2022-03-30 [1] RSPM
#>  mwbenchr     * 0.0.1   2025-09-01 [1] local
#>  pillar         1.11.0  2025-07-04 [1] RSPM
#>  pkgconfig      2.0.3   2019-09-22 [1] RSPM
#>  pkgdown        2.1.3   2025-05-25 [1] any (@2.1.3)
#>  purrr          1.1.0   2025-07-10 [1] RSPM
#>  R6             2.6.1   2025-02-15 [1] RSPM
#>  ragg           1.4.0   2025-04-10 [1] RSPM
#>  rappdirs       0.3.3   2021-01-31 [1] RSPM
#>  RColorBrewer   1.1-3   2022-04-03 [1] RSPM
#>  rlang          1.1.6   2025-04-11 [1] RSPM
#>  rmarkdown      2.29    2024-11-04 [1] RSPM
#>  sass           0.4.10  2025-04-11 [1] RSPM
#>  scales         1.4.0   2025-04-24 [1] RSPM
#>  sessioninfo    1.2.3   2025-02-05 [1] any (@1.2.3)
#>  systemfonts    1.2.3   2025-04-30 [1] RSPM
#>  textshaping    1.0.1   2025-05-01 [1] RSPM
#>  tibble         3.3.0   2025-06-08 [1] RSPM
#>  tidyselect     1.2.1   2024-03-11 [1] RSPM
#>  tzdb           0.5.0   2025-03-15 [1] RSPM
#>  utf8           1.2.6   2025-06-08 [1] RSPM
#>  vctrs          0.6.5   2023-12-01 [1] RSPM
#>  vroom          1.6.5   2023-12-05 [1] RSPM
#>  withr          3.0.2   2024-10-28 [1] RSPM
#>  xfun           0.53    2025-08-19 [1] RSPM
#>  yaml           2.3.10  2024-07-26 [1] RSPM
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.5.1/lib/R/site-library
#>  [3] /opt/R/4.5.1/lib/R/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Conclusion

We demonstrated the key functionality of mwbenchr for accessing and analyzing metabolomics data from the Metabolomics Workbench.

Here are some key takeaways:

Client Setup: Always start with mw_rest_client(), enable caching for better performance
Study Discovery: Use get_study_summary() and search_metstat() to find relevant studies
Data Retrieval: Get detailed study information with study-specific functions
Compound Analysis: Use compound functions and mass spectrometry tools for molecular identification
Name Standardization: Leverage RefMet for consistent metabolite naming
Best Practices: Implement error handling, use caching, and consider rate limiting

For more information, see the function documentation (?function_name) and visit the Metabolomics Workbench website.

Dany Mukesha

2025-09-01