Universal REML-BLUP for SNP and Microhaplotype Genomic Prediction

Universal REML-BLUP for SNP and Microhaplotype Genomic Prediction

Description

BLUP-based model for genomic prediction using SNP or microhaplotype (MH) markers. Estimates variance components via REML and predicts genomic estimated breeding values (GEBV) via BLUP. Supports SNP additive, SNP dominance, and MH additive relationship matrices, for both continuous and binary traits. For GWAS-assisted prediction, see run_gwas and gwablup.

Usage

masreml(
  y,
  X = NULL,
  markers = NULL,
  G = NULL,
  method = "auto",
  solver = "auto",
  max_iter = 100L,
  tol = 1e-06,
  n_threads = NULL,
  trait = "continuous",
  link = "logit",
  verbose = TRUE,
  save_rds = FALSE,
  save_path = NULL
)

Arguments

y numeric vector of phenotypes (length n). Named vector recommended for ID alignment. For binary trait, must contain only 0 and 1.
X fixed effects design matrix (n x c). If NULL, intercept only is used.
markers

list of raw marker inputs (optional):

  • snp_add: matrix (n x m), raw genotype 0/1/2

  • snp_dom: matrix (n x m), raw genotype 0/1/2

  • mh_add: list of data.frames, one per chromosome. First column = individual ID, remaining columns = paired haplotype allele codes per block (strand1, strand2, strand1, …)

G

list of pre-built relationship matrices (optional):

  • snp_add: numeric matrix (n x n), SNP additive G

  • snp_dom: numeric matrix (n x n), SNP dominance D

  • mh_add: numeric matrix (n x n), MH additive Agh

  • pedigree: numeric matrix (n x n), pedigree A

method

character, REML algorithm for variance component estimation:

  • “auto” (default): AI-REML for continuous, HE regression for binary

  • “HE”: Haseman-Elston regression (fast, one-step)

  • “AI”: Average Information REML (accurate, iterative)

  • “EM”: Expectation-Maximization REML (stable, slower)

  • “HI”: HE-initialized AI-REML

For binary trait, “HE” is recommended for speed. Any method can be used but “AI” may be slow on working response.
solver

character, EBV solver:

  • “auto” (default): Cholesky for n < 10,000, PCG otherwise

  • “cholesky”: direct Cholesky factorization

  • “pcg”: preconditioned conjugate gradient (large n)

max_iter integer, maximum REML iterations (default 100)
tol numeric, convergence tolerance (default 1e-6)
n_threads integer, number of threads (default: all physical cores)
trait

character, trait distribution:

  • “continuous” (default): Gaussian REML-BLUP

  • “binary”: 0/1 phenotype, single-step Laplace approximation on liability scale

link

character, link function for binary trait:

  • “logit” (default): logistic link, sigma2_e = pi^2/3

  • “probit”: probit link, sigma2_e = 1

verbose logical, if TRUE (default) prints a brief post-fit summary banner (algorithm, runtime, convergence, variance components, h2, GEBV summary) when masreml() returns. The existing progress messages (“Building…”, “Running REML…”, “Solving EBV…”) are unaffected by this flag.
save_rds logical, if TRUE the fit is also written to disk as an RDS file. Default FALSE (differs from masbayes::run_bayesr()) because masreml() is often called inside cv_masreml loops where auto-saving would clobber per-fold output. Set to TRUE for one-off full-data fits you want to persist.
save_path optional explicit RDS path. If NULL (default), defaults to “results_masreml.Rds” in the current working directory. Ignored unless save_rds = TRUE.

Details

Auto-summary. On completion, masreml() prints a brief banner mirroring masbayes::run_bayesr()/ masbayes::run_bayesa(). Set verbose = FALSE to suppress. The full summary.masreml method is unchanged and is still the canonical detailed report.

Runtime field. The returned object now contains a runtime field (elapsed seconds for REML + BLUP), populated even when verbose = FALSE.

Value

Object of class “masreml” with elements:

  • gebv: named list of GEBV vectors per component

  • total_gebv: total GEBV (sum across components)

  • fixed_effects: fixed effect estimates

  • varcomp: list with sigma2 and h2

  • loglik: restricted log-likelihood

  • algorithm: REML algorithm used

  • solver: EBV solver used

  • converged: logical, convergence status

  • n_iter: number of REML iterations

  • n: number of individuals

  • runtime: elapsed seconds for REML + BLUP

  • rds_path: path of saved RDS file, or NULL when save_rds = FALSE

  • training_metrics: list of training-set fit metrics. Continuous: R2, RMSE, accuracy (Pearson r), bias (slope of \(y \sim \hat{y}\)). Binary: same fields plus AUC, computed on the observed (probability) scale (y_01 vs mu_hat) so that bias is the calibration slope (1.0 = perfectly calibrated) and RMSE^2 approximates the Brier score. AUC is rank-invariant under inverse-link transformation.

  • call: matched call

For binary trait (trait = “binary”), additional binary slot contains:

  • link: link function used

  • prevalence: proportion of y = 1

  • h2_liability: h2 on liability scale

  • h2_observed: h2 on observed (0/1) scale via Dempster-Falconer transformation

  • auc: Area Under ROC Curve (Wilcoxon-MWW)

  • fitted: fitted probabilities P(y = 1)

  • converged: always TRUE for single-step

References

VanRaden (2008) Efficient methods to compute genomic predictions. J. Dairy Sci. 91:4414-4423.

Da (2015) Multi-allelic haplotype model for genomic prediction. BMC Genet. 16:144.

Dempster & Falconer (1950) Interpretation of high and low liability. Ann. Hum. Genet. 31:195-203.

Johnson & Thompson (1995) Restricted maximum likelihood estimation of variance components. J. Dairy Sci. 78:449-456.

See Also

run_gwas, gwablup, cv_masreml, compute_accuracy

Examples

library("masreml")

# ── Continuous trait ─────────────────────────────────────
# SNP additive GBLUP
fit <- masreml(y = y, markers = list(snp_add = W))
summary(fit)

# MH additive only
fit <- masreml(y = y, markers = list(mh_add = mh_list))

# Combined SNP additive + dominance
fit <- masreml(
  y       = y,
  markers = list(snp_add = W, snp_dom = W),
  method  = "AI"
)

# Pre-built G matrix
fit <- masreml(y = y, G = list(snp_add = G))

# ── Binary trait ─────────────────────────────────────────
# Binary GBLUP (logit link, default HE)
fit <- masreml(
  y     = y_binary,
  markers = list(snp_add = W),
  trait = "binary"
)
summary(fit)

# Binary with probit link
fit <- masreml(
  y     = y_binary,
  markers = list(snp_add = W),
  trait = "binary",
  link  = "probit"
)

# Binary with EM-REML for variance components
fit <- masreml(
  y      = y_binary,
  markers = list(snp_add = W),
  trait  = "binary",
  method = "EM"
)