Predict GEBV for New Individuals Using a Fitted masreml Model

Description

Computes genomic estimated breeding values (GEBV) for test individuals using the relationship between test and training sets. Two modes are supported: pre-built G matrix (full or cross-block) and raw marker data.

Usage

## S3 method for class 'masreml'
predict(
  object,
  G_full = NULL,
  markers_new = NULL,
  markers_train = NULL,
  train_ids = NULL,
  test_ids = NULL,
  y_new = NULL,
  X_new = NULL,
  link = NULL,
  jitter = 1e-06,
  ...
)

Arguments

`object`	fitted `masreml` object from `masreml()`
`G_full`	numeric matrix (n_total x n_total), pre-built genomic relationship matrix including both training and test individuals. Row/column names must include all training IDs and test IDs. Either `G_full` or `markers_new` must be provided. For multi-component models, provide a named list of G matrices, one per component (names must match `names(object$gebv)`).
`markers_new`	list of raw marker data for test individuals only. Used to compute cross-G matrix `G[test, train]` directly from marker data. Supported elements: `snp_add`: integer matrix (n_test x m), raw genotype 0/1/2 `mh_add`: list of data.frames in same format as `build_G_mh()` input (one per chromosome, first col = ID) Allele frequencies and scaling factors are computed from training markers stored via `markers_train` argument.
`markers_train`	list of raw marker data for training individuals, required when `markers_new` is provided. Same format as `markers_new`. Used to compute consistent allele frequencies and scaling (k) for cross-G computation.
`train_ids`	character vector of training individual IDs, in the same order as used when fitting `object`. Required when using `G_full` mode to subset the training block.
`test_ids`	character vector of test individual IDs. Required for `G_full` mode. Inferred from `markers_new` row names if not provided.
`y_new`	optional numeric vector of observed phenotypes for test individuals. When supplied, prediction metrics (R2, RMSE, accuracy or AUC, bias) are computed automatically and stored in `$metrics`. For binary, metrics are on the observed (probability) scale so `bias` is the calibration slope. Length must match the number of test individuals.
`X_new`	optional numeric matrix of fixed-effects design for test individuals (n_test x c). Required only if the original fit used non-intercept fixed effects. Defaults to an intercept column of 1s.
`link`	character, link function for binary trait prediction: `“logit”` (default) or `“probit”`. Inherited from `object$binary$link` if available.
`jitter`	numeric, small value added to diagonal of `G[train,train]` for numerical stability (default 1e-6).
`…`	ignored

Details

The prediction formula per component k is:

$\hat{u}_{test,k} = G_{te,tr,k} \cdot (G_{tr,tr,k} + \lambda_k I)^{-1} \cdot \hat{u}_{train,k}$

where $\lambda_k = \sigma^2_e / \sigma^2_{g,k}$.

For markers_new mode (SNP), the cross-G is:

$G_{te,tr} = \frac{W_{te} W_{tr}'}{k}$

where $W$ matrices are centered using training allele frequencies and $k = \sum 2p_j(1-p_j)$ computed from training data.

For binary trait, prediction is on the liability scale. Fitted probabilities are obtained by applying the inverse link function to $\eta_{test} = \mu_{fixed} + \hat{u}_{test}$.

Value

list of class “masreml_pred”. Field names are aligned with masbayes::predict.masbayes_*() for cross-package workflows. Backward-compatibility aliases (total_gebv, fitted) are retained.

GEBV: named numeric vector of total GEBV for test individuals (sum across components). Liability scale for binary.
total_gebv: alias for GEBV (back-compat).
gebv: named list of per-component GEBV vectors.
prob: fitted probabilities P(y=1) for binary trait via inverse-link, NULL for continuous.
fitted: alias for prob (back-compat).
metrics: list of test-set metrics if y_new supplied (R2, RMSE, accuracy/AUC, bias), else NULL.
h2, sigma2: heritability and variance components carried over from the fit.
response_type: “continuous” or “binary”.
algorithm: REML algorithm from the fit.
eval_scope: “in-sample (training)”, “test set (G_full)”, or “test set (markers_new)”.
has_truth: TRUE if metrics were computed.
train_ids, test_ids: ID vectors.
n_train, n_test: sample sizes.

Examples

library("masreml")

# ── In-sample mode: shortcut to training metrics ──────────
fit  <- masreml(y_train, markers = list(snp_add = W_train))
pred <- predict(fit)               # no test data → returns training info
pred$metrics                       # = fit$training_metrics

# ── Mode A: G_full pre-built, with auto-metrics ───────────
G_full <- build_G_snp(W_all)       # n_total x n_total
fit    <- masreml(y_train, G = list(snp_add = G_train))

pred <- predict(fit,
  G_full    = list(snp_add = G_full),
  train_ids = rownames(W_train),
  test_ids  = rownames(W_test),
  y_new     = y_test)              # auto-compute test metrics
pred$GEBV
pred$metrics$accuracy

# ── Mode B: markers_new + auto-metrics ────────────────────
fit  <- masreml(y_train, markers = list(snp_add = W_train))
pred <- predict(fit,
  markers_new   = list(snp_add = W_test),
  markers_train = list(snp_add = W_train),
  y_new         = y_test)
pred$metrics$bias                  # calibration slope (binary) / dispersion

# ── Forecast only (no y_new) ──────────────────────────────
fc <- predict(fit, markers_new = list(snp_add = W_test),
                   markers_train = list(snp_add = W_train))
fc$GEBV                            # metrics is NULL