Fit a BayesA Marker-Specific Variance Model (MCMC or stochastic EM)

Fit a BayesA Marker-Specific Variance Model (MCMC or stochastic EM)

Description

BayesA (Meuwissen et al., 2001) places a scaled inverse chi-squared prior on each allele’s effect variance, allowing heavier-tailed effect-size distributions than ridge regression. Use method = “mcmc” for full posterior inference or method = “em” for fast stochastic-EM point estimates (Gaussian only).

Usage

run_bayesa(
  w,
  y,
  wtw_diag,
  X = NULL,
  nu = 4.5,
  sigma2_g,
  sigma2_e_init,
  prior_params = NULL,
  mcmc_params = NULL,
  em_params = NULL,
  method = c("mcmc", "em"),
  response_type = c("gaussian", "binary"),
  fold_id = 0L,
  save_rds = FALSE,
  save_path = NULL,
  verbose = FALSE
)

Arguments

w Numeric design matrix (n x p). Typically the $W_ah element returned by construct_wah_matrix.
y Phenotype vector (length n). Use 0/1 for binary traits.
wtw_diag Pre-computed colSums(w^2).
X Optional fixed-effects design matrix (n x q). When supplied, the model is \(y = X\alpha + W\beta + \mu + \epsilon\) with a flat prior on \(\alpha\). Do not include a column of ones for the intercept; \(\mu\) is sampled separately.
nu Degrees of freedom for the scaled inverse chi-squared prior on marker variances. Smaller values allow heavier tails. Must be > 2. Default 4.5.
sigma2_g Prior total genetic variance. Typically var(y) * 0.5; for binary traits use 1.0.
sigma2_e_init Initial residual variance. For binary traits fix at 1.0.
prior_params Optional named list. Only a0_e (default 10) is used; b0_e is derived as sigma2_e_init * (a0_e - 1).
mcmc_params Optional named list: n_iter (40000), n_burn (20000), n_thin (10), seed (123).
em_params Optional named list: max_iter (500), tol (1e-6).
method Either “mcmc” (default) or “em”.
response_type “gaussian” (default) or “binary”. Binary requires method = “mcmc”.
fold_id Integer label for progress messages.
save_rds If TRUE, the fit object is saved to disk as an RDS file. Set FALSE (default) for CV loops.
save_path Optional explicit RDS path. If NULL (default), defaults to “results_bayesa.Rds” in the current working directory.
verbose If TRUE, the Rust engine streams per-iteration progress (start banner, Iter X/Y diagnostics, ESS / Geweke, completion) to stderr. Default FALSE keeps the Rust side silent. The brief R post-fit summary prints regardless.

Details

Algorithm choice. MCMC uses marginalised Gibbs sampling and returns full posterior chains. EM is much faster but provides no uncertainty quantification and does not support response_type = “binary”.

Auto-save. By default the returned fit is saved to results_bayesa.Rds in getwd(). Set save_rds = FALSE for CV loops.

Three usage scenarios. run_bayesa() + summary() + predict() support full-data fits, train-test splits, and k-fold CV uniformly. See run_bayesr for example loops.

Derived parameter. The wrapper computes s_squared = sigma2_g * (nu - 2) / (nu * sum(apply(w, 2, var))) so the prior expectation of each marker variance equals sigma2_g / sum_2pq.

Value

An object of class c(“masbayes_bayesa”, “masbayes”) — a list with the following key fields:

beta_hat, mu_hat, sigma2_e_hat, sigma2_j_hat
Posterior point estimates (intercept, allele effects, residual variance, per-marker variances).
beta_samples, sigma2_j_samples, sigma2_e_samples, mu_samples
Posterior chains (single-row matrices for EM).
GEBV / pred_train, h2, sigma2_g, sigma2_e
Training GEBVs (liability scale for binary), heritability, and total variance components.
prob_train
Binary only: training-set predicted probabilities P(y = 1) = pnorm(pred_train) (probit inverse link from Albert-Chib augmentation).
runtime
Elapsed seconds.
training_metrics
R2, RMSE, accuracy (or AUC for binary), bias. For binary, all metrics are on the observed (probability) scale — bias is the calibration slope (1.0 = perfectly calibrated) and RMSE^2 approximates the Brier score. AUC is rank-invariant so unaffected.
diagnostics
ESS / Geweke Z (MCMC only).
variance_components
Per-marker variances binned into tertiles (small / medium / large) reporting mean, range, and marker count.
rds_path
Path of the saved RDS file, or NULL.

References

Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4), 1819-1829. doi:10.1093/genetics/157.4.1819

See Also

run_bayesr, construct_wah_matrix, summary.masbayes_bayesa, predict.masbayes_bayesa

Examples

library("masbayes")

set.seed(42)
n     <- 200
mcmc  <- list(n_iter = 2000L, n_burn = 1000L, n_thin = 5L, seed = 123L)
X_cov <- cbind(sex = rbinom(n, 1, 0.5), batch = rnorm(n))  # optional

# ---- (A) SNP path ------------------------------------------------------
n_snp <- 100
X     <- matrix(rbinom(n * n_snp, 2, prob = runif(n_snp, 0.1, 0.5)),
                n, n_snp)
W_snp <- construct_snp_matrix(X)$W
y     <- W_snp[, 1:5] %*% rnorm(5, 0, 0.5) + rnorm(n, 0, 1)

fit_snp <- run_bayesa(
  w             = W_snp,
  y             = y,
  wtw_diag      = colSums(W_snp^2),
  X             = X_cov,           # optional
  nu            = 4.5,
  sigma2_g      = var(y) * 0.5,
  sigma2_e_init = var(y) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_snp)

# ---- (B) Microhaplotype path ------------------------------------------
n_block <- 50
hap <- matrix(sample.int(3, n * n_block * 2, replace = TRUE), nrow = n)
colnames(hap) <- paste0("hap_", rep(seq_len(n_block), each = 2))
freq <- data.frame(
  haplotype = paste0("hap_", rep(seq_len(n_block), each = 3)),
  allele    = rep(1:3, n_block),
  freq      = rep(c(0.5, 0.3, 0.2), n_block)
)
W_mh <- construct_wah_matrix(hap, colnames(hap), freq)$W_ah
y_mh <- W_mh[, 1:5] %*% rnorm(5, 0, 0.5) + rnorm(n, 0, 1)

fit_mh <- run_bayesa(
  w             = W_mh,
  y             = y_mh,
  wtw_diag      = colSums(W_mh^2),
  X             = X_cov,           # optional
  nu            = 4.5,
  sigma2_g      = var(y_mh) * 0.5,
  sigma2_e_init = var(y_mh) * 0.5,
  mcmc_params   = mcmc
)
summary(fit_mh)

# ---- Predict on a held-out test set -----------------------------------
idx  <- sample(n, 0.8 * n)
W_tr <- W_snp[idx, ]
y_tr <- y[idx]
fit  <- run_bayesa(
  w             = W_tr,
  y             = y_tr,
  wtw_diag      = colSums(W_tr^2),
  nu            = 4.5,
  sigma2_g      = var(y_tr) * 0.5,
  sigma2_e_init = var(y_tr) * 0.5,
  mcmc_params   = mcmc
)
pred <- predict(fit, W_snp[-idx, ], y[-idx])
pred$metrics$accuracy