masbayes: Bayesian Genomic Prediction Models for SNPs and Microhaplotypes

Description

Bayesian methods for genomic prediction using SNP or microhaplotype markers. Implements BayesR (four-class mixture) and BayesA (marker-specific variance), each available as full MCMC or stochastic EM. Continuous (Gaussian) and binary (Albert-Chib data augmentation) traits are supported.
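
For orientation, the priors named above can be written out as below. This is a sketch based on the cited formulations (Erbe et al., 2012, for BayesR; Meuwissen et al., 2001, for BayesA) and the standard Albert-Chib probit augmentation. The symbols \(\beta_j\), \(\pi_k\), \(\sigma_g^2\), \(\nu\), \(S^2\) and \(z_i\) are notation chosen here, and whether masbayes uses exactly these variance scales and hyperparameters is an assumption.

```latex
% BayesR: marker effect from a four-component mixture (scales as in Erbe et al., 2012)
\[
\beta_j \sim \pi_1\,\delta_0
        + \pi_2\,N(0,\,10^{-4}\sigma_g^2)
        + \pi_3\,N(0,\,10^{-3}\sigma_g^2)
        + \pi_4\,N(0,\,10^{-2}\sigma_g^2),
\qquad \sum_{k=1}^{4}\pi_k = 1
\]

% BayesA: each marker has its own variance (Meuwissen et al., 2001)
\[
\beta_j \mid \sigma_j^2 \sim N(0,\,\sigma_j^2),
\qquad \sigma_j^2 \sim \text{Scale-inv-}\chi^2(\nu,\,S^2)
\]

% Binary traits: latent Gaussian liability (Albert-Chib data augmentation)
\[
y_i = \mathbb{1}[z_i > 0],
\qquad z_i \sim N(\mu + \mathbf{w}_i^{\top}\boldsymbol{\beta},\,1)
\]
```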

Workflow

The standard pipeline is four steps:

construct_wah_matrix() — build the \(W_{\alpha h}\) design matrix from phased haplotypes.
run_bayesr() or run_bayesa() — fit the Bayesian model. The fit object is auto-saved to results_bayesr.Rds / results_bayesa.Rds by default.
summary(fit) — full report (heritability, MCMC ESS / Geweke, variance components, training metrics, top alleles).
predict(fit, newdata, y_new) — evaluate the model on any data: in-sample, test split, or k-fold CV held-out fold.

Three usage scenarios

summary() and predict() are scheme-agnostic — the same methods cover all three of:

Full-data fit: train on everything; predict(fit) returns in-sample metrics.
Train/test split: train on a subset; evaluate via predict(fit, W_test, y_test).
k-fold CV: the user loops over folds (typically with save_rds = FALSE, verbose = FALSE) and calls predict() on each held-out fold.

Cross-validation orchestration is intentionally left to the caller — the package does not ship a built-in cv_*() helper.
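
A caller-side loop might look like the sketch below. make_folds() and cv_accuracy() are hypothetical helpers written for this example, not part of masbayes. The sketch assumes that W was built once for all individuals, that row-subsetting it per fold is valid, and that the accuracy metric returned by predict() is a single number.

```r
# Assign each of n individuals to one of k folds, as evenly as possible.
# Note: this resets the global RNG seed.
make_folds <- function(n, k, seed = 1) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

# Generic k-fold driver. `fit_fun` and `eval_fun` are supplied by the caller:
#   fit_fun(W_train, y_train)     -> fitted model
#   eval_fun(fit, W_test, y_test) -> one numeric score
cv_accuracy <- function(W, y, k = 5, fit_fun, eval_fun, seed = 1) {
  folds <- make_folds(nrow(W), k, seed)
  vapply(seq_len(k), function(f) {
    test <- folds == f
    fit  <- fit_fun(W[!test, , drop = FALSE], y[!test])
    eval_fun(fit, W[test, , drop = FALSE], y[test])
  }, numeric(1))
}

## Not run: plugging in masbayes, with arguments as in the quick start
# fit_fun <- function(W, y)
#   run_bayesr(W, y, colSums(W^2),
#              sigma2_e_init = var(y) * 0.5,
#              sigma2_ah     = var(y) * 0.5,
#              method        = "mcmc",
#              save_rds = FALSE, verbose = FALSE)
# eval_fun <- function(fit, W, y) predict(fit, W, y)$metrics$accuracy
# acc <- cv_accuracy(W, y, k = 5, fit_fun, eval_fun)
# mean(acc)
```

Passing the fit and evaluation steps in as functions keeps the fold bookkeeping independent of the model, so the same loop can serve run_bayesr() and run_bayesa().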

Quick start

library(masbayes)

# 1. Design matrix
train <- construct_wah_matrix(hap, colnames(hap), allele_freq)
W     <- train$W_ah

# 2. Sufficient statistics
wtw <- colSums(W^2)

# 3. Fit
fit <- run_bayesr(W, y, wtw,
                  sigma2_e_init = var(y) * 0.5,
                  sigma2_ah     = var(y) * 0.5,
                  method        = "mcmc")

# 4. Report and evaluate
summary(fit)
pred <- predict(fit, W_test, y_test)   # train/test mode
pred$metrics$accuracy
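
The quantity computed in step 2 is the per-column sum of squares of W, which equals the diagonal of W'W. The toy check below uses a random matrix as a stand-in for the design matrix; it only illustrates the identity and does not call masbayes.

```r
set.seed(42)
W <- matrix(rnorm(6 * 3), nrow = 6, ncol = 3)  # toy 6 x 3 stand-in for W_ah

wtw <- colSums(W^2)                            # one value per column (allele)

# Same numbers as the diagonal of t(W) %*% W, without forming the p x p matrix
stopifnot(isTRUE(all.equal(wtw, diag(crossprod(W)))))
```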

Main functions

construct_wah_matrix: Build the \(W_{\alpha h}\) design matrix.
run_bayesr: Fit BayesR (four-class mixture).
run_bayesa: Fit BayesA (marker-specific variance).
summary.masbayes_bayesr, summary.masbayes_bayesa: Full-fit reports (S3 methods).
predict.masbayes_bayesr, predict.masbayes_bayesa: GEBV + metrics, scheme-agnostic (S3 methods).

Output of run_bayesr() / run_bayesa()

The returned S3 object (class masbayes_bayesr or masbayes_bayesa) contains:

GEBV, beta_hat, beta_samples (and sigma2_j_samples for BayesA)
sigma2_e_samples, mu_samples, runtime (seconds)
h2, sigma2_g, sigma2_e
training_metrics: R2, RMSE, accuracy/AUC, bias
diagnostics: ESS, Geweke Z (MCMC only)
variance_components: small / medium / large + pi (BayesR), or tertile bins of marker variances (BayesA)
rds_path: where the fit was auto-saved

Disable auto-save with save_rds = FALSE — recommended for CV loops.
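
The metrics and heritability listed above can be reproduced by hand with common conventions, as in the sketch below. prediction_metrics() and h2_from_components() are hypothetical helpers; the definitions used here (accuracy as Pearson correlation, bias as the slope of observed on predicted values, h2 as the genetic share of total variance) are assumptions and may differ from what masbayes computes internally.

```r
# Common prediction metrics for a continuous trait.
# y: observed phenotypes; y_hat: predicted values (e.g. mean + GEBV).
prediction_metrics <- function(y, y_hat) {
  c(R2       = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2),
    RMSE     = sqrt(mean((y - y_hat)^2)),
    accuracy = cor(y, y_hat),                       # Pearson correlation
    bias     = unname(coef(lm(y ~ y_hat))[2]))      # slope; 1 means unbiased
}

# Heritability as the genetic share of total variance.
h2_from_components <- function(sigma2_g, sigma2_e) {
  sigma2_g / (sigma2_g + sigma2_e)
}

# Toy usage with made-up numbers
y     <- c(1.0, 2.0, 3.0, 4.0, 5.0)
y_hat <- c(1.1, 1.9, 3.2, 3.8, 5.0)
prediction_metrics(y, y_hat)
h2_from_components(sigma2_g = 1, sigma2_e = 3)
```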

Low-level bindings

run_bayesr_mcmc(), run_bayesr_em(), run_bayesa_mcmc(), and run_bayesa_em() are the direct Rust bindings. They are exported but undocumented; prefer run_bayesr() / run_bayesa() for routine use.

References

Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4), 1819-1829. doi:10.1093/genetics/157.4.1819
Erbe, M., Hayes, B. J., Matukumalli, L. K., Goswami, S., Bowman, P. J., Reich, C. M., Mason, B. A., & Goddard, M. E. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of Dairy Science, 95(7), 4114-4129. doi:10.3168/jds.2011-5019
Da, Y. (2015). Multi-allelic haplotype model based on genetic partition for genomic prediction and variance component estimation using SNP markers. BMC Genetics, 16(1), 144. doi:10.1186/s12863-015-0301-1

Author(s)

Maintainer: Wibowo Agus <aguswibowo1698@gmail.com>