Fit a BayesR Mixture Model (MCMC or stochastic EM)

Description

BayesR (Erbe et al., 2012) places a four-component mixture prior on allele effects: zero, small, medium, and large effect classes with proportions pi_vec and class variances proportional to sigma2_ah. Use method = "mcmc" for full Bayesian posterior inference, or method = "em" for fast stochastic-EM point estimates (Gaussian only).

Usage

run_bayesr(
  w,
  y,
  wtw_diag,
  X = NULL,
  pi_vec = c(0.95, 0.02, 0.02, 0.01),
  sigma2_e_init,
  sigma2_ah = NULL,
  sigma2_vec = NULL,
  prior_params = NULL,
  mcmc_params = NULL,
  em_params = NULL,
  method = c("mcmc", "em"),
  response_type = c("gaussian", "binary"),
  fold_id = 0L,
  save_rds = FALSE,
  save_path = NULL,
  verbose = FALSE
)

Arguments

w Numeric design matrix (n x p). Typically the $W_ah element returned by construct_wah_matrix.
y Phenotype vector (length n). For binary traits use 0/1 coding.
wtw_diag Pre-computed colSums(w^2), length p.
X Optional fixed-effects design matrix (n x q). When supplied, the model is \(y = X\alpha + W\beta + \mu + \epsilon\) with a flat prior on \(\alpha\). Do not include a column of ones for the intercept; \(\mu\) is sampled separately.
pi_vec Initial mixture proportions for the four classes (zero, small, medium, large). Must sum to 1. Default c(0.95, 0.02, 0.02, 0.01).
sigma2_e_init Initial residual variance. Typically var(y) * 0.5; for binary traits fix at 1.0.
sigma2_ah Prior total genetic variance (required). Typically var(y) * 0.5; for binary traits use 1.0.
sigma2_vec Optional explicit variance vector for EM; if NULL it is derived from sigma2_ah, pi_vec, and prior_params$variance_class.
prior_params Optional named list overriding defaults: a0_e (10), a0_g (10), variance_class (c(0, 0.001, 0.01, 0.1)). b0_e and b0_g are computed internally.
mcmc_params Optional named list: n_iter (40000), n_burn (20000), n_thin (10), seed (123).
em_params Optional named list: max_iter (500), tol (1e-6).
method Either "mcmc" (default) or "em".
response_type "gaussian" (default) or "binary". Binary requires method = "mcmc".
fold_id Integer label printed in progress messages; useful when running CV loops.
save_rds If TRUE, the fit object is written to disk as an RDS file. Default FALSE; keep it FALSE in CV loops to avoid overwriting earlier folds.
save_path Optional explicit RDS path, used when save_rds = TRUE. If NULL (default), the file is written to "results_bayesr.Rds" in the current working directory.
verbose If TRUE, the Rust engine streams per-iteration progress (start banner, Iter X/Y diagnostics, ESS / Geweke, completion) to stderr. Default FALSE keeps the Rust side silent. The brief R post-fit summary (model, runtime, \(h^2\), ESS, Geweke) prints regardless.

Details

Algorithm choice. MCMC uses marginalised Gibbs sampling and returns full posterior chains; recommended when posterior uncertainty, ESS, or Geweke diagnostics are needed. EM is much faster but yields no uncertainty quantification and does not support response_type = "binary".

Auto-save. When save_rds = TRUE, the returned fit is also saved to results_bayesr.Rds in getwd(), or to save_path if supplied. In cross-validation loops keep save_rds = FALSE (the default) to avoid overwriting earlier folds.

Three usage scenarios. The same run_bayesr() + summary() + predict() pipeline supports:

  • Full-data fit: train on all data; predict(fit) returns in-sample metrics.

  • Train-test split: train on a subset; evaluate via predict(fit, W_test, y_test).

  • k-fold CV: the user loops over folds, calling run_bayesr() with save_rds = FALSE and fold_id = k, then predict() on each held-out fold.
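
The k-fold loop can be sketched as follows. This is a minimal illustration, not part of the package: it assumes a design matrix W and phenotype vector y already exist (e.g. as built in the Examples section), assigns folds with base R, and reuses the run_bayesr() arguments shown under Usage:

```r
## Sketch: 5-fold CV, assuming W (n x p) and y (length n) already exist.
set.seed(1)
k    <- 5
fold <- sample(rep(seq_len(k), length.out = length(y)))  # random fold labels
acc  <- numeric(k)
for (j in seq_len(k)) {
  tr   <- fold != j                        # training rows for fold j
  W_tr <- W[tr, , drop = FALSE]
  fit_j <- run_bayesr(
    w             = W_tr,
    y             = y[tr],
    wtw_diag      = colSums(W_tr^2),
    sigma2_e_init = var(y[tr]) * 0.5,
    sigma2_ah     = var(y[tr]) * 0.5,
    save_rds      = FALSE,                 # do not overwrite earlier folds
    fold_id       = j                      # labels progress messages
  )
  acc[j] <- predict(fit_j, W[!tr, , drop = FALSE], y[!tr])$metrics$accuracy
}
mean(acc)                                  # average held-out accuracy
```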

Derived hyperparameters. The wrapper computes b0_e = sigma2_e_init * (a0_e - 1) and b0_g = sigma2_ah * (a0_g - 2) / (a0_g * (1 - pi_vec[1])) before calling the Rust engine. For EM, a sigma2_vec is derived as variance_class * sigma2_ah / ((1 - pi_vec[1]) * p).
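
With the defaults above, those formulas work out as follows. This is a plain base-R illustration of the arithmetic only; the values of var(y) and p are invented for the example (var(y) = 1, so sigma2_e_init = sigma2_ah = 0.5, and p = 100 markers):

```r
# Reproduce the wrapper's derived hyperparameters with the documented defaults.
pi_vec         <- c(0.95, 0.02, 0.02, 0.01)
variance_class <- c(0, 0.001, 0.01, 0.1)
a0_e <- 10
a0_g <- 10
sigma2_e_init <- 0.5   # e.g. var(y) * 0.5 with var(y) = 1 (illustrative)
sigma2_ah     <- 0.5
p             <- 100   # number of markers (illustrative)

b0_e <- sigma2_e_init * (a0_e - 1)                         # 0.5 * 9  = 4.5
b0_g <- sigma2_ah * (a0_g - 2) / (a0_g * (1 - pi_vec[1]))  # 4 / 0.5  = 8
sigma2_vec <- variance_class * sigma2_ah / ((1 - pi_vec[1]) * p)
# c(0, 1e-4, 1e-3, 1e-2): per-marker variance of each mixture class
```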

Value

An object of class c("masbayes_bayesr", "masbayes") — a list with the following key fields:

beta_hat, mu_hat, sigma2_e_hat
Posterior point estimates.
beta_samples, gamma_samples, pi_samples, sigma2_e_samples, sigma2_small_samples, sigma2_medium_samples, sigma2_large_samples, mu_samples
Posterior chains (single-row matrices for EM).
GEBV / pred_train
Training genomic estimated breeding values. For binary traits this is on the liability scale.
prob_train
Binary only: training-set predicted probabilities P(y = 1) = pnorm(pred_train) (probit inverse link from Albert-Chib augmentation).
h2, sigma2_g, sigma2_e
Heritability and variance components.
runtime
Elapsed seconds.
training_metrics
R2, RMSE, accuracy (or AUC for binary), bias. For binary traits, all metrics are on the observed (probability) scale: bias is the calibration slope (1.0 = perfectly calibrated) and RMSE^2 approximates the Brier score; AUC depends only on ranks and is therefore unaffected.
diagnostics
ESS and Geweke Z for key parameters (MCMC only).
variance_components
Posterior mean and 95% CI for each class plus the mixture proportions pi.
rds_path
Path of the saved RDS file, or NULL when save_rds = FALSE.

References

Erbe, M., Hayes, B. J., Matukumalli, L. K., Goswami, S., Bowman, P. J., Reich, C. M., Mason, B. A., & Goddard, M. E. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of Dairy Science, 95(7), 4114-4129. doi:10.3168/jds.2011-5019

Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4), 1819-1829. doi:10.1093/genetics/157.4.1819

See Also

run_bayesa, construct_wah_matrix, summary.masbayes_bayesr, predict.masbayes_bayesr

Examples

library("masbayes")

set.seed(42)
n     <- 200
mcmc  <- list(n_iter = 2000L, n_burn = 1000L, n_thin = 5L, seed = 123L)
X_cov <- cbind(sex = rbinom(n, 1, 0.5), batch = rnorm(n))  # optional

# ---- (A) SNP path ------------------------------------------------------
n_snp <- 100
X     <- matrix(rbinom(n * n_snp, 2,
                       prob = rep(runif(n_snp, 0.1, 0.5), each = n)),
                n, n_snp)  # per-SNP allele frequencies
W_snp <- construct_snp_matrix(X)$W
y     <- drop(W_snp[, 1:5] %*% rnorm(5, 0, 0.5)) + rnorm(n, 0, 1)

fit_snp <- run_bayesr(
  w             = W_snp,
  y             = y,
  wtw_diag      = colSums(W_snp^2),
  X             = X_cov,           # optional
  sigma2_e_init = var(y) * 0.5,
  sigma2_ah     = var(y) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_snp)

# ---- (B) Microhaplotype path ------------------------------------------
n_block <- 50
hap <- matrix(sample.int(3, n * n_block * 2, replace = TRUE), nrow = n)
colnames(hap) <- paste0("hap_", rep(seq_len(n_block), each = 2))
freq <- data.frame(
  haplotype = paste0("hap_", rep(seq_len(n_block), each = 3)),
  allele    = rep(1:3, n_block),
  freq      = rep(c(0.5, 0.3, 0.2), n_block)
)
W_mh <- construct_wah_matrix(hap, colnames(hap), freq)$W_ah
y_mh <- drop(W_mh[, 1:5] %*% rnorm(5, 0, 0.5)) + rnorm(n, 0, 1)

fit_mh <- run_bayesr(
  w             = W_mh,
  y             = y_mh,
  wtw_diag      = colSums(W_mh^2),
  X             = X_cov,           # optional
  sigma2_e_init = var(y_mh) * 0.5,
  sigma2_ah     = var(y_mh) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_mh)

# ---- Train/test split (scheme-agnostic) -------------------------------
idx  <- sample(n, 0.8 * n)
W_tr <- W_snp[idx, ]
y_tr <- y[idx]
fit  <- run_bayesr(
  w             = W_tr,
  y             = y_tr,
  wtw_diag      = colSums(W_tr^2),
  sigma2_e_init = var(y_tr) * 0.5,
  sigma2_ah     = var(y_tr) * 0.5,
  mcmc_params   = mcmc
)
pred <- predict(fit, W_snp[-idx, ], y[-idx])
pred$metrics$accuracy