library(masreml)
library(masbayes)
d <- masreml::load_data("large")
# --- Continuous: masreml GBLUP -----------------------------------
y_cont <- d$pheno$y_cont_qtl_snp
names(y_cont) <- d$pheno$id
X_fixed <- stats::model.matrix(~ sex, data = d$pheno)
fit_cont <- masreml::masreml(
y = y_cont,
X = X_fixed,
markers = list(snp_add = d$snp),
trait = "continuous"
)
# --- Binary: masreml Laplace (logit) -----------------------------
y_bin <- d$pheno$y_bin_qtl_snp
names(y_bin) <- d$pheno$id
fit_bin <- masreml::masreml(
y = y_bin,
X = X_fixed,
markers = list(snp_add = d$snp),
trait = "binary",
link = "logit"
)
# fit_bin$h2 is reported on the liability scale.
# --- Binary: masbayes Albert-Chib (BayesR) -----------------------
W <- masbayes::construct_snp_matrix(d$snp, encoding = "vanRaden")
fit_bin_bayesr <- masbayes::run_bayesr(
w = W,
y = y_bin,
wtw_diag = colSums(W ^ 2),
X = X_fixed,
sigma2_e_init = 1.0, # fixed at 1.0 on the liability scale
method = "mcmc",
response_type = "binary"
)
# fit_bin_bayesr$prob_train holds posterior-mean P(y=1).
# --- Evaluation: observed-scale metrics --------------------------
gebv <- predict(fit_bin)$gebv
phat <- predict(fit_bin)$prob
metrics <- masreml::evaluate_prediction(
gebv = gebv,
y = y_bin,
fitted_prob = phat
)
metrics$AUC; metrics$BrierContinuous and Binary trait
1. Traits in quantitative genetics
A trait (or phenotype) is any measurable characteristic of an individual that may have a genetic basis. In quantitative genetics we distinguish traits by the scale of measurement, because this determines the entire downstream statistical machinery.
Continuous traits
Continuous traits take real-valued measurements on an interval scale. They are usually approximately Gaussian after suitable transformation (log, square-root) and respond to many small genetic effects (the infinitesimal limit).
Examples in animal and plant breeding:
- Body weight at harvest (kg)
- Growth rate (g / day)
- Milk yield (L / lactation)
- Egg production (eggs / hen / year)
- Grain yield (kg / ha)
- Days to flowering (days)
For continuous traits, the response is modelled directly with a Gaussian likelihood — see Mixed Model & BLUP Family for the frequentist machinery and Bayesian Alphabet for the Bayesian counterpart.
Binary traits
Binary traits take only two values (0 / 1, presence / absence, diseased / healthy). The genetic basis is still typically polygenic, but the observed response cannot be Gaussian by construction. We therefore introduce a latent liability — an unobserved continuous underlying variable — that takes the role \(\mathbf{y}\) would play in the continuous case; the observed binary outcome is whether the liability crosses a fixed threshold.
Examples in animal and plant breeding:
- Disease presence (resistant / susceptible)
- Survival to weaning (alive / dead)
- Calving difficulty (easy / hard)
- Reproductive success (pregnant / open)
- Flowering occurrence by a fixed date (yes / no)
In the demo dataset (d$pheno), y_bin_qtl_snp and y_bin_qtl_mh are obtained by thresholding the continuous counterparts y_cont_qtl_snp / y_cont_qtl_mh at the population median, so prevalence is 0.5 by construction. See Input Data → Phenotype for previews and downloads.
2. How are traits calculated?
The two packages compute breeding values differently depending on the trait type.
Continuous traits — direct linear solve
The model is the standard mixed model:
\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{K}\sigma_u^2), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2). \tag{1}\]
Both masreml and masbayes solve this directly on the observed scale: masreml via Henderson’s MME with REML variance components; masbayes via Gibbs sampling of \(\boldsymbol{\beta}\), \(\sigma_g^2\), \(\sigma_e^2\) in the marker-effects form \(\mathbf{y} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{W}\boldsymbol{\beta} + \boldsymbol{\epsilon}\). GEBV = \(\hat{\mathbf{u}}\) (BLUP) or the posterior mean of \(\mathbf{W}\boldsymbol{\beta}\) (Bayesian).
Binary traits — liability scale + threshold
The observed binary \(y_i\) is taken to be the indicator of an unobserved continuous liability crossing zero:
\[ \ell_i = \mathbf{x}_i^\top \mathbf{b} + \mathbf{z}_i^\top \mathbf{u} + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \qquad y_i = \mathbb{1}\bigl[\ell_i > 0\bigr]. \tag{2}\]
The unit residual variance is a probit convention; under the logit link it becomes \(\pi^2 / 3\). The two packages handle the latent \(\boldsymbol{\ell}\) differently.
masreml — single-step Laplace approximation
masreml(y, ..., trait = "binary", link = c("logit", "probit")) implements the Laplace approximation (see masreml/R/binary.R), equivalent to penalized quasi-likelihood (Breslow & Clayton 1993):
- From the observed prevalence, initialise \(\eta^{(0)} = \mathrm{link}(\mu^{(0)})\) and the diagonal Fisher weights \(W^{(0)}_{ii} = (\partial \mu / \partial \eta)|_{\mu^{(0)}_i}\).
- Form the working response \(z^{(t)}_i = \eta^{(t)}_i + (y_i - \mu^{(t)}_i) / W^{(t)}_{ii}\).
- Solve the linear MME on \(z^{(t)}\) with residual variance matrix \(\mathbf{R} = \mathrm{diag}(1 / W^{(t)})\), re-estimating \((\sigma_u^2, \sigma_e^2)\) by REML.
- Update \(\eta^{(t+1)}\) from \(\hat{\mathbf{b}}, \hat{\mathbf{u}}\) and recompute weights. Iterate to convergence.
The converged \(\hat{\mathbf{u}}\) is the posterior mode of \(\mathbf{u}\) on the liability scale. Probit matches the liability-scale derivation exactly; logit is slightly more numerically stable when the design is near-collinear.
masbayes — Albert-Chib data augmentation
run_bayesa(..., response_type = "binary") and run_bayesr(..., response_type = "binary") use the Albert & Chib (1993) augmentation strategy:
- Augment the model with the latent vector \(\boldsymbol{\ell}\).
- Conditional on \(\boldsymbol{\ell}\), the model is exactly Gaussian (unit residual variance, probit link) — sample \(\boldsymbol{\beta}\) and other hyperparameters from their Gaussian full conditionals.
- Conditional on \(\boldsymbol{\beta}\), sample each \(\ell_i\) from a truncated normal: \(\ell_i \sim N(\mathbf{x}_i^\top\boldsymbol{\alpha} + \mathbf{w}_i^\top\boldsymbol{\beta}, 1)\), truncated to \((0, \infty)\) if \(y_i = 1\) and \((-\infty, 0]\) if \(y_i = 0\).
- Iterate. The posterior mean of \(\Phi(\mathbf{x}_i^\top\boldsymbol{\alpha} + \mathbf{w}_i^\top\boldsymbol{\beta})\) is the predicted probability \(\hat{p}_i\), returned in the fit object as
prob_train/prob_test.
The two routes give nearly identical GEBVs in the common-prevalence regime; Laplace is faster, Albert-Chib is more robust for rare outcomes and gives proper posterior intervals on \(\hat{p}_i\).
3. Variance component & h²
For continuous traits, \(h^2 = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2)\) is the heritability and masreml’s summary() reports it directly.
For binary traits, heritability is reported on two scales.
Liability scale
\[ h_\ell^2 = \frac{\sigma_u^2}{\sigma_u^2 + 1} \quad \text{(probit)}, \qquad h_\ell^2 = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3} \quad \text{(logit)}. \tag{3}\]
Observed (0/1) scale
Dempster & Lerner (1950) showed that the observed-scale heritability depends on population prevalence \(K\):
\[ h_{\text{obs}}^2 = h_\ell^2 \cdot \frac{z(K)^2}{K(1-K)}, \tag{4}\]
with \(z(K) = \phi(\Phi^{-1}(K))\) where \(\phi\) and \(\Phi\) are the standard-normal density and CDF. Use Equation 4 when reporting heritability for case-control samples whose prevalence differs from the population.
Calibration metrics for binary predictions
Predictive accuracy on a binary trait is not the Pearson correlation \(r(\hat{u}, y)\) — that statistic conflates discrimination with prevalence. Report instead:
| Metric | What it measures |
|---|---|
| Liability-scale \(r(\hat{\ell}, \ell_{\text{true}})\) | Discrimination on the latent scale (simulation only) |
| AUC | Discrimination on observed 0/1 |
| Brier score | Calibration + discrimination jointly |
| Calibration slope of \(\hat{p}_i\) against \(y_i\) | Pure calibration; should be 1 if perfectly calibrated |
masreml::evaluate_prediction(gebv, y, fitted_prob = phat) reports AUC and the Brier score; the calibration slope is a one-line glm(y ~ phat, family = binomial).
4. When to use
| Outcome type | Use |
|---|---|
| Continuous | Standard mixed model — masreml(y, markers, trait = "continuous") or run_bayes*(..., response_type = "gaussian")
|
| Binary, common (5–50% prevalence) | masreml Laplace — masreml(y, markers, trait = "binary", link = "logit") — fast and accurate |
| Binary, rare (< 1% prevalence) | masbayes Albert-Chib — full MCMC is more robust at the tail than Laplace; or sample only case-enriched subset with explicit ascertainment correction |
| Binary, need posterior intervals on per-marker effects | masbayes (run_bayesr(response_type = "binary") reports PIPs on the liability scale) |
| Need observed-scale h² for a heterogeneous case-control sample | Compute liability-scale \(h_\ell^2\) then transform via Equation 4 — Falconer-Lush ascertainment correction is the user’s responsibility |
5. See it in code
6. References
Dempster, E. R. & Lerner, I. M. (1950). Heritability of threshold characters. Genetics, 35(2), 212–236. doi:10.1093/genetics/35.2.212.
Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. doi:10.1080/01621459.1993.10476321.
Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9–25. doi:10.1080/01621459.1993.10594284.
Falconer, D. S. (1965). The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics, 29(1), 51–76. (Origin of the liability-scale formulation used here.)