Phenotype

Schema

Both packages consume phenotypes as a data frame with one row per individual and at least three column kinds:

Column kind Type Required by Description
Identifier character both Joins to rownames(snp) / rownames(mh)
Fixed-effect covariates numeric or factor both e.g. sex, age, batch
Response(s) numeric or integer 0/1 both One column per trait — continuous OR binary
(Optional) tbv_* numeric none True breeding values; only meaningful for simulation studies

The demo dataset bundles both continuous and binary versions of two traits (SNP-architecture and MH-architecture). Standalone CSVs split them out:

File Columns
pheno_continuous_G1.csv id, sex, y_cont_qtl_snp, y_cont_qtl_mh, tbv_qtl_snp, tbv_qtl_mh
pheno_binary_G1.csv id, sex, y_bin_qtl_snp, y_bin_qtl_mh, tbv_qtl_snp, tbv_qtl_mh

d$pheno carries all eight columns in one frame.

Example (from the demo dataset)

Continuous trait (pheno_continuous_G1.csv)

Columns: id, sex, y_cont_qtl_snp, y_cont_qtl_mh, tbv_qtl_snp, tbv_qtl_mh.

Download pheno_continuous_G1.csv (large)   Download pheno_continuous_G1.csv (small)

Binary trait (pheno_binary_G1.csv)

Same schema but with the median-thresholded y_bin_* columns in place of the continuous ones. Prevalence is 0.5 by construction.

Download pheno_binary_G1.csv (large)   Download pheno_binary_G1.csv (small)

Case/control balance

QTL ground truth

Ten QTL under each architecture (SNP and MH), with unit-length-normalised effects (\(\sum \beta_i^2 = 1\)). Use these for QTL recovery analysis only, not as model input. Four standalone .rds files mirror d$qtl with human-readable identifiers — SNP names and MH block ids — so downstream tools can join against d$map_snp / d$map_mh directly.

Causal SNPs

Causal MH blocks

d$qtl$mh_idx indexes columns of \(W_{ah}\) rather than block columns directly, because the causal “allele” within a multi-allelic block is itself a free parameter. The derived qtl_mh_blocks.rds reduces each \(W_{ah}\) column to its parent block id; if two QTL fall in the same block, the block id appears twice.

Download qtl_snp_ids.rds Download qtl_mh_blocks.rds Download effects_snp.rds Download effects_mh.rds

Loading

library(masreml)
d <- masreml::load_data("large")

# Continuous trait, SNP architecture
y_cont <- d$pheno$y_cont_qtl_snp

# Binary trait, MH architecture
y_bin  <- d$pheno$y_bin_qtl_mh

# Or load the standalone CSVs directly:
pheno_c <- read.csv("demo-data/main/pheno_continuous_G1.csv")
pheno_b <- read.csv("demo-data/main/pheno_binary_G1.csv")

Passing phenotypes to each package

masreml::masreml() takes the response as a named numeric vector y, the fixed-effects design matrix X explicitly (build it with stats::model.matrix()), and markers as a named list (snp_add, snp_dom, mh_add). There is no formula argument and no direct pedigree argument — for pedigree-based prediction, precompute \(\mathbf{A}\) via build_A_ped() and pass it through the G slot.

# Continuous trait, SNP-additive GBLUP with sex as fixed effect.
y <- d$pheno$y_cont_qtl_snp
names(y) <- d$pheno$id

X_fixed <- stats::model.matrix(~ sex, data = d$pheno)

fit <- masreml::masreml(
  y       = y,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "continuous"
)

# masbayes: BayesA on the same trait. The wrapper takes the
# pre-standardised design matrix `w` plus colSums(w^2) as inputs;
# sigma2_g and sigma2_e_init are required hyperparameters.
W   <- masbayes::construct_snp_matrix(d$snp, encoding = "vanRaden")
fit <- masbayes::run_bayesa(
  w             = W,
  y             = y,
  wtw_diag      = colSums(W^2),
  X             = X_fixed,
  marker_type   = "snp",
  nu            = 4.5,
  sigma2_g      = var(y) * 0.5,
  sigma2_e_init = var(y) * 0.5,
  method        = "mcmc"
)

# Binary trait via threshold model in masreml (Laplace approximation).
y_bin <- d$pheno$y_bin_qtl_snp
names(y_bin) <- d$pheno$id
fit_bin <- masreml::masreml(
  y       = y_bin,
  X       = X_fixed,
  markers = list(snp_add = d$snp),
  trait   = "binary",
  link    = "logit"
)

Common pitfalls

Warning

Row order must match the genotype matrix. masreml::masreml() aligns on names(y) against rownames(markers$snp_add) (and rownames of X). This is the most common source of silent bugs when ID conventions differ. Always verify before fitting:

stopifnot(identical(d$pheno$id, rownames(d$snp)))
Warning

Factor levels must be consistent across train and test. If your fixed-effect covariate is a factor (e.g. sex), make sure both train and test splits see all levels before calling stats::model.matrix() — otherwise the design-matrix column count diverges and prediction breaks.

Warning

Binary phenotypes go through a threshold model, not regression on 0/1. Pass trait = "binary" (with link = "logit" or "probit") in masreml::masreml() so the liability-scale variance is identified. Naive Gaussian fitting of binary data (trait = "continuous") produces unbiased EBVs in some regimes but distorts heritability estimates.

Tip

tbv_* columns are ground truth, not input. They are only useful for simulation: comparing \(\widehat{u}\) against tbv is how the Cross-validation tutorial computes accuracy. Never include tbv_* as a covariate.

📐 Related theory: Mixed Model & BLUP Family · Continuous and Binary Trait

🛠 Used in: Genomic Prediction · Binary trait analysis · Cross-validation