Fit a BayesA Marker-Specific Variance Model (MCMC or stochastic EM)

Description

BayesA (Meuwissen et al., 2001) places a scaled inverse chi-squared prior on each allele’s effect variance, allowing heavier-tailed effect-size distributions than ridge regression. Use method = “mcmc” for full posterior inference or method = “em” for fast stochastic-EM point estimates (Gaussian only).

Usage

run_bayesa(
  w,
  y,
  wtw_diag,
  X = NULL,
  nu = 4.5,
  sigma2_g,
  sigma2_e_init,
  prior_params = NULL,
  mcmc_params = NULL,
  em_params = NULL,
  method = c("mcmc", "em"),
  response_type = c("gaussian", "binary"),
  fold_id = 0L,
  save_rds = FALSE,
  save_path = NULL,
  verbose = FALSE
)

Arguments

`w`	Numeric design matrix (`n x p`). Typically the `$W_ah` element returned by `construct_wah_matrix`.
`y`	Phenotype vector (length `n`). Use 0/1 for binary traits.
`wtw_diag`	Pre-computed `colSums(w^2)`.
`X`	Optional fixed-effects design matrix (`n x q`). When supplied, the model is $y = X\alpha + W\beta + \mu + \epsilon$ with a flat prior on $\alpha$. Do not include a column of ones for the intercept; $\mu$ is sampled separately.
`nu`	Degrees of freedom for the scaled inverse chi-squared prior on marker variances. Smaller values allow heavier tails. Must be > 2. Default `4.5`.
`sigma2_g`	Prior total genetic variance. Typically `var(y) * 0.5`; for binary traits use `1.0`.
`sigma2_e_init`	Initial residual variance. For binary traits fix at `1.0`.
`prior_params`	Optional named list. Only `a0_e` (default 10) is used; `b0_e` is derived as `sigma2_e_init * (a0_e - 1)`.
`mcmc_params`	Optional named list: `n_iter` (40000), `n_burn` (20000), `n_thin` (10), `seed` (123).
`em_params`	Optional named list: `max_iter` (500), `tol` (`1e-6`).
`method`	Either `“mcmc”` (default) or `“em”`.
`response_type`	`“gaussian”` (default) or `“binary”`. Binary requires `method = “mcmc”`.
`fold_id`	Integer label for progress messages.
`save_rds`	If `TRUE`, the fit object is saved to disk as an RDS file. Set `FALSE` (default) for CV loops.
`save_path`	Optional explicit RDS path. If `NULL` (default), defaults to `“results_bayesa.Rds”` in the current working directory.
`verbose`	If `TRUE`, the Rust engine streams per-iteration progress (start banner, Iter X/Y diagnostics, ESS / Geweke, completion) to stderr. Default `FALSE` keeps the Rust side silent. The brief R post-fit summary prints regardless.

Details

Algorithm choice. MCMC uses marginalised Gibbs sampling and returns full posterior chains. EM is much faster but provides no uncertainty quantification and does not support response_type = “binary”.

Auto-save. By default the returned fit is saved to results_bayesa.Rds in getwd(). Set save_rds = FALSE for CV loops.

Three usage scenarios. run_bayesa() + summary() + predict() support full-data fits, train-test splits, and k-fold CV uniformly. See run_bayesr for example loops.

Derived parameter. The wrapper computes s_squared = sigma2_g * (nu - 2) / (nu * sum(apply(w, 2, var))) so the prior expectation of each marker variance equals sigma2_g / sum_2pq.

Value

An object of class c(“masbayes_bayesa”, “masbayes”) — a list with the following key fields:

beta_hat, mu_hat, sigma2_e_hat, sigma2_j_hat: Posterior point estimates (intercept, allele effects, residual variance, per-marker variances).
beta_samples, sigma2_j_samples, sigma2_e_samples, mu_samples: Posterior chains (single-row matrices for EM).
GEBV / pred_train, h2, sigma2_g, sigma2_e: Training GEBVs (liability scale for binary), heritability, and total variance components.
prob_train: Binary only: training-set predicted probabilities P(y = 1) = pnorm(pred_train) (probit inverse link from Albert-Chib augmentation).
runtime: Elapsed seconds.
training_metrics: R2, RMSE, accuracy (or AUC for binary), bias. For binary, all metrics are on the observed (probability) scale — bias is the calibration slope (1.0 = perfectly calibrated) and RMSE^2 approximates the Brier score. AUC is rank-invariant so unaffected.
diagnostics: ESS / Geweke Z (MCMC only).
variance_components: Per-marker variances binned into tertiles (small / medium / large) reporting mean, range, and marker count.
rds_path: Path of the saved RDS file, or NULL.

References

Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4), 1819-1829. doi:10.1093/genetics/157.4.1819

Examples

library("masbayes")

set.seed(42)
n     <- 200
mcmc  <- list(n_iter = 2000L, n_burn = 1000L, n_thin = 5L, seed = 123L)
X_cov <- cbind(sex = rbinom(n, 1, 0.5), batch = rnorm(n))  # optional

# ---- (A) SNP path ------------------------------------------------------
n_snp <- 100
X     <- matrix(rbinom(n * n_snp, 2, prob = runif(n_snp, 0.1, 0.5)),
                n, n_snp)
W_snp <- construct_snp_matrix(X)$W
y     <- W_snp[, 1:5] %*% rnorm(5, 0, 0.5) + rnorm(n, 0, 1)

fit_snp <- run_bayesa(
  w             = W_snp,
  y             = y,
  wtw_diag      = colSums(W_snp^2),
  X             = X_cov,           # optional
  nu            = 4.5,
  sigma2_g      = var(y) * 0.5,
  sigma2_e_init = var(y) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_snp)

# ---- (B) Microhaplotype path ------------------------------------------
n_block <- 50
hap <- matrix(sample.int(3, n * n_block * 2, replace = TRUE), nrow = n)
colnames(hap) <- paste0("hap_", rep(seq_len(n_block), each = 2))
freq <- data.frame(
  haplotype = paste0("hap_", rep(seq_len(n_block), each = 3)),
  allele    = rep(1:3, n_block),
  freq      = rep(c(0.5, 0.3, 0.2), n_block)
)
W_mh <- construct_wah_matrix(hap, colnames(hap), freq)$W_ah
y_mh <- W_mh[, 1:5] %*% rnorm(5, 0, 0.5) + rnorm(n, 0, 1)

fit_mh <- run_bayesa(
  w             = W_mh,
  y             = y_mh,
  wtw_diag      = colSums(W_mh^2),
  X             = X_cov,           # optional
  nu            = 4.5,
  sigma2_g      = var(y_mh) * 0.5,
  sigma2_e_init = var(y_mh) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_mh)

# ---- Predict on a held-out test set -----------------------------------
idx  <- sample(n, 0.8 * n)
W_tr <- W_snp[idx, ]
y_tr <- y[idx]
fit  <- run_bayesa(
  w             = W_tr,
  y             = y_tr,
  wtw_diag      = colSums(W_tr^2),
  nu            = 4.5,
  sigma2_g      = var(y_tr) * 0.5,
  sigma2_e_init = var(y_tr) * 0.5,
  mcmc_params   = mcmc
)
pred <- predict(fit, W_snp[-idx, ], y[-idx])
pred$metrics$accuracy