Continuous and Binary trait

1. Traits in quantitative genetics

A trait (or phenotype) is any measurable characteristic of an individual that may have a genetic basis. In quantitative genetics we distinguish traits by the scale of measurement, because this determines the entire downstream statistical machinery.

Continuous traits

Continuous traits take real-valued measurements on an interval scale. They are usually approximately Gaussian after suitable transformation (log, square-root) and respond to many small genetic effects (the infinitesimal limit).

Examples in animal and plant breeding:

  • Body weight at harvest (kg)
  • Growth rate (g / day)
  • Milk yield (L / lactation)
  • Egg production (eggs / hen / year)
  • Grain yield (kg / ha)
  • Days to flowering (days)

For continuous traits, the response is modelled directly with a Gaussian likelihood — see Mixed Model & BLUP Family for the frequentist machinery and Bayesian Alphabet for the Bayesian counterpart.

Binary traits

Binary traits take only two values (0 / 1, presence / absence, diseased / healthy). The genetic basis is still typically polygenic, but the observed response cannot be Gaussian by construction. We therefore introduce a latent liability — an unobserved continuous underlying variable — that takes the role \(\mathbf{y}\) would play in the continuous case; the observed binary outcome is whether the liability crosses a fixed threshold.

Examples in animal and plant breeding:

  • Disease presence (resistant / susceptible)
  • Survival to weaning (alive / dead)
  • Calving difficulty (easy / hard)
  • Reproductive success (pregnant / open)
  • Flowering occurrence by a fixed date (yes / no)

In the demo dataset (d$pheno), y_bin_qtl_snp and y_bin_qtl_mh are obtained by thresholding the continuous counterparts y_cont_qtl_snp / y_cont_qtl_mh at the population median, so prevalence is 0.5 by construction. See Input Data → Phenotype for previews and downloads.

2. How are traits calculated?

The two packages compute breeding values differently depending on the trait type.

Continuous traits — direct linear solve

The model is the standard mixed model:

\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{K}\sigma_u^2), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2). \tag{1}\]

Both masreml and masbayes solve this directly on the observed scale: masreml via Henderson’s MME with REML variance components; masbayes via Gibbs sampling of \(\boldsymbol{\beta}\), \(\sigma_g^2\), \(\sigma_e^2\) in the marker-effects form \(\mathbf{y} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{W}\boldsymbol{\beta} + \boldsymbol{\epsilon}\). GEBV = \(\hat{\mathbf{u}}\) (BLUP) or the posterior mean of \(\mathbf{W}\boldsymbol{\beta}\) (Bayesian).

Binary traits — liability scale + threshold

The observed binary \(y_i\) is taken to be the indicator of an unobserved continuous liability crossing zero:

\[ \ell_i = \mathbf{x}_i^\top \mathbf{b} + \mathbf{z}_i^\top \mathbf{u} + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \qquad y_i = \mathbb{1}\bigl[\ell_i > 0\bigr]. \tag{2}\]

The unit residual variance is a probit convention; under the logit link it becomes \(\pi^2 / 3\). The two packages handle the latent \(\boldsymbol{\ell}\) differently.

masreml — single-step Laplace approximation

masreml(y, ..., trait = "binary", link = c("logit", "probit")) implements the Laplace approximation (see masreml/R/binary.R), equivalent to penalized quasi-likelihood (Breslow & Clayton 1993):

  1. From the observed prevalence, initialise \(\eta^{(0)} = \mathrm{link}(\mu^{(0)})\) and the diagonal Fisher weights \(W^{(0)}_{ii} = (\partial \mu / \partial \eta)|_{\mu^{(0)}_i}\).
  2. Form the working response \(z^{(t)}_i = \eta^{(t)}_i + (y_i - \mu^{(t)}_i) / W^{(t)}_{ii}\).
  3. Solve the linear MME on \(z^{(t)}\) with residual variance matrix \(\mathbf{R} = \mathrm{diag}(1 / W^{(t)})\), re-estimating \((\sigma_u^2, \sigma_e^2)\) by REML.
  4. Update \(\eta^{(t+1)}\) from \(\hat{\mathbf{b}}, \hat{\mathbf{u}}\) and recompute weights. Iterate to convergence.

The converged \(\hat{\mathbf{u}}\) is the posterior mode of \(\mathbf{u}\) on the liability scale. Probit matches the liability-scale derivation exactly; logit is slightly more numerically stable when the design is near-collinear.

masbayes — Albert-Chib data augmentation

run_bayesa(..., response_type = "binary") and run_bayesr(..., response_type = "binary") use the Albert & Chib (1993) augmentation strategy:

  1. Augment the model with the latent vector \(\boldsymbol{\ell}\).
  2. Conditional on \(\boldsymbol{\ell}\), the model is exactly Gaussian (unit residual variance, probit link) — sample \(\boldsymbol{\beta}\) and other hyperparameters from their Gaussian full conditionals.
  3. Conditional on \(\boldsymbol{\beta}\), sample each \(\ell_i\) from a truncated normal: \(\ell_i \sim N(\mathbf{x}_i^\top\boldsymbol{\alpha} + \mathbf{w}_i^\top\boldsymbol{\beta}, 1)\), truncated to \((0, \infty)\) if \(y_i = 1\) and \((-\infty, 0]\) if \(y_i = 0\).
  4. Iterate. The posterior mean of \(\Phi(\mathbf{x}_i^\top\boldsymbol{\alpha} + \mathbf{w}_i^\top\boldsymbol{\beta})\) is the predicted probability \(\hat{p}_i\), returned in the fit object as prob_train / prob_test.

The two routes give nearly identical GEBVs in the common-prevalence regime; Laplace is faster, Albert-Chib is more robust for rare outcomes and gives proper posterior intervals on \(\hat{p}_i\).

3. Variance component & h²

For continuous traits, \(h^2 = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2)\) is the heritability and masreml’s summary() reports it directly.

For binary traits, heritability is reported on two scales.

Liability scale

\[ h_\ell^2 = \frac{\sigma_u^2}{\sigma_u^2 + 1} \quad \text{(probit)}, \qquad h_\ell^2 = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3} \quad \text{(logit)}. \tag{3}\]

Observed (0/1) scale

Dempster & Lerner (1950) showed that the observed-scale heritability depends on population prevalence \(K\):

\[ h_{\text{obs}}^2 = h_\ell^2 \cdot \frac{z(K)^2}{K(1-K)}, \tag{4}\]

with \(z(K) = \phi(\Phi^{-1}(K))\) where \(\phi\) and \(\Phi\) are the standard-normal density and CDF. Use Equation 4 when reporting heritability for case-control samples whose prevalence differs from the population.

Calibration metrics for binary predictions

Predictive accuracy on a binary trait is not the Pearson correlation \(r(\hat{u}, y)\) — that statistic conflates discrimination with prevalence. Report instead:

Metric What it measures
Liability-scale \(r(\hat{\ell}, \ell_{\text{true}})\) Discrimination on the latent scale (simulation only)
AUC Discrimination on observed 0/1
Brier score Calibration + discrimination jointly
Calibration slope of \(\hat{p}_i\) against \(y_i\) Pure calibration; should be 1 if perfectly calibrated

masreml::evaluate_prediction(gebv, y, fitted_prob = phat) reports AUC and the Brier score; the calibration slope is a one-line glm(y ~ phat, family = binomial).

4. When to use

Outcome type Use
Continuous Standard mixed model — masreml(y, markers, trait = "continuous") or run_bayes*(..., response_type = "gaussian")
Binary, common (5–50% prevalence) masreml Laplace — masreml(y, markers, trait = "binary", link = "logit") — fast and accurate
Binary, rare (< 1% prevalence) masbayes Albert-Chib — full MCMC is more robust at the tail than Laplace; or sample only case-enriched subset with explicit ascertainment correction
Binary, need posterior intervals on per-marker effects masbayes (run_bayesr(response_type = "binary") reports PIPs on the liability scale)
Need observed-scale h² for a heterogeneous case-control sample Compute liability-scale \(h_\ell^2\) then transform via Equation 4 — Falconer-Lush ascertainment correction is the user’s responsibility

5. See it in code

library(masreml)
library(masbayes)

d <- masreml::load_data("large")

# --- Continuous: masreml GBLUP -----------------------------------
y_cont <- d$pheno$y_cont_qtl_snp
names(y_cont) <- d$pheno$id
X_fixed <- stats::model.matrix(~ sex, data = d$pheno)

fit_cont <- masreml::masreml(
  y       = y_cont,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "continuous"
)

# --- Binary: masreml Laplace (logit) -----------------------------
y_bin <- d$pheno$y_bin_qtl_snp
names(y_bin) <- d$pheno$id

fit_bin <- masreml::masreml(
  y       = y_bin,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "binary",
  link    = "logit"
)
# fit_bin$h2 is reported on the liability scale.

# --- Binary: masbayes Albert-Chib (BayesR) -----------------------
W <- masbayes::construct_snp_matrix(d$snp, encoding = "vanRaden")
fit_bin_bayesr <- masbayes::run_bayesr(
  w             = W,
  y             = y_bin,
  wtw_diag      = colSums(W ^ 2),
  X             = X_fixed,
  sigma2_e_init = 1.0,             # fixed at 1.0 on the liability scale
  method        = "mcmc",
  response_type = "binary"
)
# fit_bin_bayesr$prob_train holds posterior-mean P(y=1).

# --- Evaluation: observed-scale metrics --------------------------
gebv  <- predict(fit_bin)$gebv
phat  <- predict(fit_bin)$prob
metrics <- masreml::evaluate_prediction(
  gebv        = gebv,
  y           = y_bin,
  fitted_prob = phat
)
metrics$AUC; metrics$Brier

6. References

  • Dempster, E. R. & Lerner, I. M. (1950). Heritability of threshold characters. Genetics, 35(2), 212–236. doi:10.1093/genetics/35.2.212.

  • Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. doi:10.1080/01621459.1993.10476321.

  • Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9–25. doi:10.1080/01621459.1993.10594284.

  • Falconer, D. S. (1965). The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics, 29(1), 51–76. (Origin of the liability-scale formulation used here.)