Continuous and Binary trait

1. Traits in quantitative genetics

A trait (or phenotype) is any measurable characteristic of an individual that may have a genetic basis. In quantitative genetics we distinguish traits by the scale of measurement, because this determines the entire downstream statistical machinery.

Continuous traits

Continuous traits take real-valued measurements on an interval scale. They are usually approximately Gaussian after suitable transformation (log, square-root) and respond to many small genetic effects (the infinitesimal limit).

Examples in animal and plant breeding:

Body weight at harvest (kg)
Growth rate (g / day)
Milk yield (L / lactation)
Egg production (eggs / hen / year)
Grain yield (kg / ha)
Days to flowering (days)

For continuous traits, the response is modelled directly with a Gaussian likelihood — see Mixed Model & BLUP Family for the frequentist machinery and Bayesian Alphabet for the Bayesian counterpart.

Binary traits

Binary traits take only two values (0 / 1, presence / absence, diseased / healthy). The genetic basis is still typically polygenic, but the observed response cannot be Gaussian by construction. We therefore introduce a latent liability — an unobserved continuous underlying variable — that takes the role $\mathbf{y}$ would play in the continuous case; the observed binary outcome is whether the liability crosses a fixed threshold.

Examples in animal and plant breeding:

Disease presence (resistant / susceptible)
Survival to weaning (alive / dead)
Calving difficulty (easy / hard)
Reproductive success (pregnant / open)
Flowering occurrence by a fixed date (yes / no)

In the demo dataset (d$pheno), y_bin_qtl_snp and y_bin_qtl_mh are obtained by thresholding the continuous counterparts y_cont_qtl_snp / y_cont_qtl_mh at the population median, so prevalence is 0.5 by construction. See Input Data → Phenotype for previews and downloads.

2. How are traits calculated?

The two packages compute breeding values differently depending on the trait type.

Continuous traits — direct linear solve

The model is the standard mixed model:

\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{K}\sigma_u^2), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2). \tag{1}\]

Both masreml and masbayes solve this directly on the observed scale: masreml via Henderson’s MME with REML variance components; masbayes via Gibbs sampling of $\boldsymbol{\beta}$, $\sigma_g^2$, $\sigma_e^2$ in the marker-effects form $\mathbf{y} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{W}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. GEBV = $\hat{\mathbf{u}}$ (BLUP) or the posterior mean of $\mathbf{W}\boldsymbol{\beta}$ (Bayesian).

Binary traits — liability scale + threshold

The observed binary $y_i$ is taken to be the indicator of an unobserved continuous liability crossing zero:

\[ \ell_i = \mathbf{x}_i^\top \mathbf{b} + \mathbf{z}_i^\top \mathbf{u} + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \qquad y_i = \mathbb{1}\bigl[\ell_i > 0\bigr]. \tag{2}\]

The unit residual variance is a probit convention; under the logit link it becomes $\pi^2 / 3$. The two packages handle the latent $\boldsymbol{\ell}$ differently.

`masreml` — single-step Laplace approximation

masreml(y, ..., trait = "binary", link = c("logit", "probit")) implements the Laplace approximation (see masreml/R/binary.R), equivalent to penalized quasi-likelihood (Breslow & Clayton 1993):

From the observed prevalence, initialise $\eta^{(0)} = \mathrm{link}(\mu^{(0)})$ and the diagonal Fisher weights $W^{(0)}_{ii} = (\partial \mu / \partial \eta)|_{\mu^{(0)}_i}$.
Form the working response $z^{(t)}_i = \eta^{(t)}_i + (y_i - \mu^{(t)}_i) / W^{(t)}_{ii}$.
Solve the linear MME on $z^{(t)}$ with residual variance matrix $\mathbf{R} = \mathrm{diag}(1 / W^{(t)})$, re-estimating $(\sigma_u^2, \sigma_e^2)$ by REML.
Update $\eta^{(t+1)}$ from $\hat{\mathbf{b}}, \hat{\mathbf{u}}$ and recompute weights. Iterate to convergence.

The converged $\hat{\mathbf{u}}$ is the posterior mode of $\mathbf{u}$ on the liability scale. Probit matches the liability-scale derivation exactly; logit is slightly more numerically stable when the design is near-collinear.

`masbayes` — Albert-Chib data augmentation

run_bayesa(..., response_type = "binary") and run_bayesr(..., response_type = "binary") use the Albert & Chib (1993) augmentation strategy:

Augment the model with the latent vector $\boldsymbol{\ell}$.
Conditional on $\boldsymbol{\ell}$, the model is exactly Gaussian (unit residual variance, probit link) — sample $\boldsymbol{\beta}$ and other hyperparameters from their Gaussian full conditionals.
Conditional on $\boldsymbol{\beta}$, sample each $\ell_i$ from a truncated normal: $\ell_i \sim N(\mathbf{x}_i^\top\boldsymbol{\alpha} + \mathbf{w}_i^\top\boldsymbol{\beta}, 1)$, truncated to $(0, \infty)$ if $y_i = 1$ and $(-\infty, 0]$ if $y_i = 0$.
Iterate. The posterior mean of $\Phi(\mathbf{x}_i^\top\boldsymbol{\alpha} + \mathbf{w}_i^\top\boldsymbol{\beta})$ is the predicted probability $\hat{p}_i$, returned in the fit object as prob_train / prob_test.

The two routes give nearly identical GEBVs in the common-prevalence regime; Laplace is faster, Albert-Chib is more robust for rare outcomes and gives proper posterior intervals on $\hat{p}_i$.

3. Variance component & h²

For continuous traits, $h^2 = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2)$ is the heritability and masreml’s summary() reports it directly.

For binary traits, heritability is reported on two scales.

Liability scale

\[ h_\ell^2 = \frac{\sigma_u^2}{\sigma_u^2 + 1} \quad \text{(probit)}, \qquad h_\ell^2 = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3} \quad \text{(logit)}. \tag{3}\]

Observed (0/1) scale

Dempster & Lerner (1950) showed that the observed-scale heritability depends on population prevalence $K$:

\[ h_{\text{obs}}^2 = h_\ell^2 \cdot \frac{z(K)^2}{K(1-K)}, \tag{4}\]

with $z(K) = \phi(\Phi^{-1}(K))$ where $\phi$ and $\Phi$ are the standard-normal density and CDF. Use Equation 4 when reporting heritability for case-control samples whose prevalence differs from the population.

Calibration metrics for binary predictions

Predictive accuracy on a binary trait is not the Pearson correlation $r(\hat{u}, y)$ — that statistic conflates discrimination with prevalence. Report instead:

Metric	What it measures
Liability-scale $r(\hat{\ell}, \ell_{\text{true}})$	Discrimination on the latent scale (simulation only)
AUC	Discrimination on observed 0/1
Brier score	Calibration + discrimination jointly
Calibration slope of $\hat{p}_i$ against $y_i$	Pure calibration; should be 1 if perfectly calibrated

masreml::evaluate_prediction(gebv, y, fitted_prob = phat) reports AUC and the Brier score; the calibration slope is a one-line glm(y ~ phat, family = binomial).

4. When to use

Outcome type	Use
Continuous	Standard mixed model — `masreml(y, markers, trait = "continuous")` or `run_bayes*(..., response_type = "gaussian")`
Binary, common (5–50% prevalence)	masreml Laplace — `masreml(y, markers, trait = "binary", link = "logit")` — fast and accurate
Binary, rare (< 1% prevalence)	masbayes Albert-Chib — full MCMC is more robust at the tail than Laplace; or sample only case-enriched subset with explicit ascertainment correction
Binary, need posterior intervals on per-marker effects	masbayes (`run_bayesr(response_type = "binary")` reports PIPs on the liability scale)
Need observed-scale h² for a heterogeneous case-control sample	Compute liability-scale $h_\ell^2$ then transform via Equation 4 — Falconer-Lush ascertainment correction is the user’s responsibility

5. See it in code

library(masreml)
library(masbayes)

d <- masreml::load_data("large")

# --- Continuous: masreml GBLUP -----------------------------------
y_cont <- d$pheno$y_cont_qtl_snp
names(y_cont) <- d$pheno$id
X_fixed <- stats::model.matrix(~ sex, data = d$pheno)

fit_cont <- masreml::masreml(
  y       = y_cont,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "continuous"
)

# --- Binary: masreml Laplace (logit) -----------------------------
y_bin <- d$pheno$y_bin_qtl_snp
names(y_bin) <- d$pheno$id

fit_bin <- masreml::masreml(
  y       = y_bin,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "binary",
  link    = "logit"
)
# fit_bin$h2 is reported on the liability scale.

# --- Binary: masbayes Albert-Chib (BayesR) -----------------------
W <- masbayes::construct_snp_matrix(d$snp, encoding = "vanRaden")
fit_bin_bayesr <- masbayes::run_bayesr(
  w             = W,
  y             = y_bin,
  wtw_diag      = colSums(W ^ 2),
  X             = X_fixed,
  sigma2_e_init = 1.0,             # fixed at 1.0 on the liability scale
  method        = "mcmc",
  response_type = "binary"
)
# fit_bin_bayesr$prob_train holds posterior-mean P(y=1).

# --- Evaluation: observed-scale metrics --------------------------
gebv  <- predict(fit_bin)$gebv
phat  <- predict(fit_bin)$prob
metrics <- masreml::evaluate_prediction(
  gebv        = gebv,
  y           = y_bin,
  fitted_prob = phat
)
metrics$AUC; metrics$Brier

6. References

Dempster, E. R. & Lerner, I. M. (1950). Heritability of threshold characters. Genetics, 35(2), 212–236. doi:10.1093/genetics/35.2.212.
Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. doi:10.1080/01621459.1993.10476321.
Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9–25. doi:10.1080/01621459.1993.10594284.
Falconer, D. S. (1965). The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics, 29(1), 51–76. (Origin of the liability-scale formulation used here.)