Fit a BayesR Mixture Model (MCMC or stochastic EM)

Description

BayesR (Erbe et al., 2012) places a four-component mixture prior on allele effects: zero, small, medium, and large effect classes with proportions pi_vec and class variances proportional to sigma2_ah. Use method = "mcmc" for full Bayesian posterior inference, or method = "em" for fast stochastic-EM point estimates (Gaussian only).

Usage

run_bayesr(
  w,
  y,
  wtw_diag,
  X = NULL,
  pi_vec = c(0.95, 0.02, 0.02, 0.01),
  sigma2_e_init,
  sigma2_ah = NULL,
  sigma2_vec = NULL,
  prior_params = NULL,
  mcmc_params = NULL,
  em_params = NULL,
  method = c("mcmc", "em"),
  response_type = c("gaussian", "binary"),
  fold_id = 0L,
  save_rds = FALSE,
  save_path = NULL,
  verbose = FALSE
)

Arguments

w Numeric design matrix (n x p). Typically the $W_ah element returned by construct_wah_matrix.
y Phenotype vector (length n). For binary traits use 0/1 coding.
wtw_diag Pre-computed colSums(w^2), length p.
X Optional fixed-effects design matrix (n x q). When supplied, the model is \(y = X\alpha + W\beta + \mu + \epsilon\) with a flat prior on \(\alpha\). Do not include a column of ones for the intercept; \(\mu\) is sampled separately.
pi_vec Initial mixture proportions for the four classes (zero, small, medium, large). Must sum to 1. Default c(0.95, 0.02, 0.02, 0.01).
sigma2_e_init Initial residual variance. Typically var(y) * 0.5; for binary traits fix at 1.0.
sigma2_ah Prior total genetic variance (required). Typically var(y) * 0.5; for binary traits use 1.0.
sigma2_vec Optional explicit variance vector for EM; if NULL it is derived from sigma2_ah, pi_vec, and prior_params$variance_class.
prior_params Optional named list overriding defaults: a0_e (10), a0_g (10), variance_class (c(0, 0.001, 0.01, 0.1)). b0_e and b0_g are computed internally.
mcmc_params Optional named list: n_iter (40000), n_burn (20000), n_thin (10), seed (123).
em_params Optional named list: max_iter (500), tol (1e-6).
method Either "mcmc" (default) or "em".
response_type "gaussian" (default) or "binary". Binary requires method = "mcmc".
fold_id Integer label printed in progress messages; useful when running CV loops.
save_rds If TRUE, the fit object is written to disk as an RDS file. Default FALSE; keep it FALSE in CV loops to avoid overwriting earlier folds.
save_path Optional explicit RDS path, used when save_rds = TRUE. If NULL (default), the file is written to "results_bayesr.Rds" in the current working directory.
verbose If TRUE, the Rust engine streams per-iteration progress (start banner, Iter X/Y diagnostics, ESS / Geweke, completion) to stderr. Default FALSE keeps the Rust side silent. The brief R post-fit summary (model, runtime, \(h^2\), ESS, Geweke) prints regardless.

Details

Algorithm choice. MCMC uses marginalised Gibbs sampling and returns full posterior chains; recommended when posterior uncertainty, ESS, or Geweke diagnostics are needed. EM is much faster but yields no uncertainty quantification and does not support response_type = "binary".

Auto-save. When save_rds = TRUE, the returned fit is also saved to results_bayesr.Rds in getwd(), or to save_path if supplied. In cross-validation loops keep save_rds = FALSE (the default) to avoid overwriting earlier folds.

Three usage scenarios. The same run_bayesr() + summary() + predict() pipeline supports:

  • Full-data fit: train on all data; predict(fit) returns in-sample metrics.

  • Train-test split: train on a subset; evaluate via predict(fit, W_test, y_test).

  • k-fold CV: the user loops over folds, calling run_bayesr() with save_rds = FALSE and fold_id = k, then predict() on each held-out fold.
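
The k-fold loop can be sketched as follows. This is a minimal illustration, not part of the package: it assumes a design matrix W and phenotype vector y already exist (e.g. as built in the Examples section), assigns folds with base R, and reuses the run_bayesr() arguments shown under Usage:

```r
## Sketch: 5-fold CV, assuming W (n x p) and y (length n) already exist.
set.seed(1)
k    <- 5
fold <- sample(rep(seq_len(k), length.out = length(y)))  # random fold labels
acc  <- numeric(k)
for (j in seq_len(k)) {
  tr   <- fold != j                        # training rows for fold j
  W_tr <- W[tr, , drop = FALSE]
  fit_j <- run_bayesr(
    w             = W_tr,
    y             = y[tr],
    wtw_diag      = colSums(W_tr^2),
    sigma2_e_init = var(y[tr]) * 0.5,
    sigma2_ah     = var(y[tr]) * 0.5,
    save_rds      = FALSE,                 # do not overwrite earlier folds
    fold_id       = j                      # labels progress messages
  )
  acc[j] <- predict(fit_j, W[!tr, , drop = FALSE], y[!tr])$metrics$accuracy
}
mean(acc)                                  # average held-out accuracy
```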

Derived hyperparameters. The wrapper computes b0_e = sigma2_e_init * (a0_e - 1) and b0_g = sigma2_ah * (a0_g - 2) / (a0_g * (1 - pi_vec[1])) before calling the Rust engine. For EM, a sigma2_vec is derived as variance_class * sigma2_ah / ((1 - pi_vec[1]) * p).
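
With the defaults above, those formulas work out as follows. This is a plain base-R illustration of the arithmetic only; the values of var(y) and p are invented for the example (var(y) = 1, so sigma2_e_init = sigma2_ah = 0.5, and p = 100 markers):

```r
# Reproduce the wrapper's derived hyperparameters with the documented defaults.
pi_vec         <- c(0.95, 0.02, 0.02, 0.01)
variance_class <- c(0, 0.001, 0.01, 0.1)
a0_e <- 10
a0_g <- 10
sigma2_e_init <- 0.5   # e.g. var(y) * 0.5 with var(y) = 1 (illustrative)
sigma2_ah     <- 0.5
p             <- 100   # number of markers (illustrative)

b0_e <- sigma2_e_init * (a0_e - 1)                         # 0.5 * 9  = 4.5
b0_g <- sigma2_ah * (a0_g - 2) / (a0_g * (1 - pi_vec[1]))  # 4 / 0.5  = 8
sigma2_vec <- variance_class * sigma2_ah / ((1 - pi_vec[1]) * p)
# c(0, 1e-4, 1e-3, 1e-2): per-marker variance of each mixture class
```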

Value

An object of class c("masbayes_bayesr", "masbayes") — a list with the following key fields:

beta_hat, mu_hat, sigma2_e_hat
Posterior point estimates.
beta_samples, gamma_samples, pi_samples, sigma2_e_samples, sigma2_small_samples, sigma2_medium_samples, sigma2_large_samples, mu_samples
Posterior chains (single-row matrices for EM).
GEBV / pred_train
Training genomic estimated breeding values. For binary traits this is on the liability scale.
prob_train
Binary only: training-set predicted probabilities P(y = 1) = pnorm(pred_train) (probit inverse link from Albert-Chib augmentation).
h2, sigma2_g, sigma2_e
Heritability and variance components.
runtime
Elapsed seconds.
training_metrics
R2, RMSE, accuracy (or AUC for binary), bias. For binary traits, all metrics are on the observed (probability) scale: bias is the calibration slope (1.0 = perfectly calibrated) and RMSE^2 approximates the Brier score; AUC depends only on ranks and is therefore unaffected.
diagnostics
ESS and Geweke Z for key parameters (MCMC only).
variance_components
Posterior mean and 95% CI for each class plus the mixture proportions pi.
rds_path
Path of the saved RDS file, or NULL when save_rds = FALSE.

References

Erbe, M., Hayes, B. J., Matukumalli, L. K., Goswami, S., Bowman, P. J., Reich, C. M., Mason, B. A., & Goddard, M. E. (2012). Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. Journal of Dairy Science, 95(7), 4114-4129. doi:10.3168/jds.2011-5019

Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4), 1819-1829. doi:10.1093/genetics/157.4.1819

See Also

run_bayesa, construct_wah_matrix, summary.masbayes_bayesr, predict.masbayes_bayesr

Examples

library("masbayes")

set.seed(42)
n     <- 200
mcmc  <- list(n_iter = 2000L, n_burn = 1000L, n_thin = 5L, seed = 123L)
X_cov <- cbind(sex = rbinom(n, 1, 0.5), batch = rnorm(n))  # optional

# ---- (A) SNP path ------------------------------------------------------
n_snp <- 100
X     <- matrix(rbinom(n * n_snp, 2,
                       prob = rep(runif(n_snp, 0.1, 0.5), each = n)),
                n, n_snp)  # per-SNP allele frequencies
W_snp <- construct_snp_matrix(X)$W
y     <- drop(W_snp[, 1:5] %*% rnorm(5, 0, 0.5)) + rnorm(n, 0, 1)

fit_snp <- run_bayesr(
  w             = W_snp,
  y             = y,
  wtw_diag      = colSums(W_snp^2),
  X             = X_cov,           # optional
  sigma2_e_init = var(y) * 0.5,
  sigma2_ah     = var(y) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_snp)

# ---- (B) Microhaplotype path ------------------------------------------
n_block <- 50
hap <- matrix(sample.int(3, n * n_block * 2, replace = TRUE), nrow = n)
colnames(hap) <- paste0("hap_", rep(seq_len(n_block), each = 2))
freq <- data.frame(
  haplotype = paste0("hap_", rep(seq_len(n_block), each = 3)),
  allele    = rep(1:3, n_block),
  freq      = rep(c(0.5, 0.3, 0.2), n_block)
)
W_mh <- construct_wah_matrix(hap, colnames(hap), freq)$W_ah
y_mh <- drop(W_mh[, 1:5] %*% rnorm(5, 0, 0.5)) + rnorm(n, 0, 1)

fit_mh <- run_bayesr(
  w             = W_mh,
  y             = y_mh,
  wtw_diag      = colSums(W_mh^2),
  X             = X_cov,           # optional
  sigma2_e_init = var(y_mh) * 0.5,
  sigma2_ah     = var(y_mh) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_mh)

# ---- Train/test split (scheme-agnostic) -------------------------------
idx  <- sample(n, 0.8 * n)
W_tr <- W_snp[idx, ]
y_tr <- y[idx]
fit  <- run_bayesr(
  w             = W_tr,
  y             = y_tr,
  wtw_diag      = colSums(W_tr^2),
  sigma2_e_init = var(y_tr) * 0.5,
  sigma2_ah     = var(y_tr) * 0.5,
  mcmc_params   = mcmc
)
pred <- predict(fit, W_snp[-idx, ], y[-idx])
pred$metrics$accuracy