Predict GEBV for New Individuals Using a Fitted masreml Model

Predict GEBV for New Individuals Using a Fitted masreml Model

Description

Computes genomic estimated breeding values (GEBV) for test individuals using the relationship between test and training sets. Two modes are supported: pre-built G matrix (full or cross-block) and raw marker data.

Usage

## S3 method for class 'masreml'
predict(
  object,
  G_full = NULL,
  markers_new = NULL,
  markers_train = NULL,
  train_ids = NULL,
  test_ids = NULL,
  y_new = NULL,
  X_new = NULL,
  link = NULL,
  jitter = 1e-06,
  ...
)

Arguments

object fitted masreml object from masreml()
G_full numeric matrix (n_total x n_total), pre-built genomic relationship matrix including both training and test individuals. Row/column names must include all training IDs and test IDs. Either G_full or markers_new must be provided. For multi-component models, provide a named list of G matrices, one per component (names must match names(object$gebv)).
markers_new

list of raw marker data for test individuals only. Used to compute cross-G matrix G[test, train] directly from marker data. Supported elements:

  • snp_add: integer matrix (n_test x m), raw genotype 0/1/2

  • mh_add: list of data.frames in same format as build_G_mh() input (one per chromosome, first col = ID)

Allele frequencies and scaling factors are computed from training markers stored via markers_train argument.
markers_train list of raw marker data for training individuals, required when markers_new is provided. Same format as markers_new. Used to compute consistent allele frequencies and scaling (k) for cross-G computation.
train_ids character vector of training individual IDs, in the same order as used when fitting object. Required when using G_full mode to subset the training block.
test_ids character vector of test individual IDs. Required for G_full mode. Inferred from markers_new row names if not provided.
y_new optional numeric vector of observed phenotypes for test individuals. When supplied, prediction metrics (R2, RMSE, accuracy or AUC, bias) are computed automatically and stored in $metrics. For binary, metrics are on the observed (probability) scale so bias is the calibration slope. Length must match the number of test individuals.
X_new optional numeric matrix of fixed-effects design for test individuals (n_test x c). Required only if the original fit used non-intercept fixed effects. Defaults to an intercept column of 1s.
link character, link function for binary trait prediction: “logit” (default) or “probit”. Inherited from object$binary$link if available.
jitter numeric, small value added to diagonal of G[train,train] for numerical stability (default 1e-6).
ignored

Details

The prediction formula per component k is:

\(\hat{u}_{test,k} = G_{te,tr,k} \cdot (G_{tr,tr,k} + \lambda_k I)^{-1} \cdot \hat{u}_{train,k}\)

where \(\lambda_k = \sigma^2_e / \sigma^2_{g,k}\).

For markers_new mode (SNP), the cross-G is:

\(G_{te,tr} = \frac{W_{te} W_{tr}'}{k}\)

where \(W\) matrices are centered using training allele frequencies and \(k = \sum 2p_j(1-p_j)\) computed from training data.

For binary trait, prediction is on the liability scale. Fitted probabilities are obtained by applying the inverse link function to \(\eta_{test} = \mu_{fixed} + \hat{u}_{test}\).

Value

list of class “masreml_pred”. Field names are aligned with masbayes::predict.masbayes_*() for cross-package workflows. Backward-compatibility aliases (total_gebv, fitted) are retained.

  • GEBV: named numeric vector of total GEBV for test individuals (sum across components). Liability scale for binary.

  • total_gebv: alias for GEBV (back-compat).

  • gebv: named list of per-component GEBV vectors.

  • prob: fitted probabilities P(y=1) for binary trait via inverse-link, NULL for continuous.

  • fitted: alias for prob (back-compat).

  • metrics: list of test-set metrics if y_new supplied (R2, RMSE, accuracy/AUC, bias), else NULL.

  • h2, sigma2: heritability and variance components carried over from the fit.

  • response_type: “continuous” or “binary”.

  • algorithm: REML algorithm from the fit.

  • eval_scope: “in-sample (training)”, “test set (G_full)”, or “test set (markers_new)”.

  • has_truth: TRUE if metrics were computed.

  • train_ids, test_ids: ID vectors.

  • n_train, n_test: sample sizes.

See Also

masreml, build_G_snp, build_G_mh

Examples

library("masreml")

# ── In-sample mode: shortcut to training metrics ──────────
fit  <- masreml(y_train, markers = list(snp_add = W_train))
pred <- predict(fit)               # no test data → returns training info
pred$metrics                       # = fit$training_metrics

# ── Mode A: G_full pre-built, with auto-metrics ───────────
G_full <- build_G_snp(W_all)       # n_total x n_total
fit    <- masreml(y_train, G = list(snp_add = G_train))

pred <- predict(fit,
  G_full    = list(snp_add = G_full),
  train_ids = rownames(W_train),
  test_ids  = rownames(W_test),
  y_new     = y_test)              # auto-compute test metrics
pred$GEBV
pred$metrics$accuracy

# ── Mode B: markers_new + auto-metrics ────────────────────
fit  <- masreml(y_train, markers = list(snp_add = W_train))
pred <- predict(fit,
  markers_new   = list(snp_add = W_test),
  markers_train = list(snp_add = W_train),
  y_new         = y_test)
pred$metrics$bias                  # calibration slope (binary) / dispersion

# ── Forecast only (no y_new) ──────────────────────────────
fc <- predict(fit, markers_new = list(snp_add = W_test),
                   markers_train = list(snp_add = W_train))
fc$GEBV                            # metrics is NULL